As a consequence:
• Bump minimum supported Python version to 3.7.
• Move Vagrant environment to Debian 10, which has Python 3.7.
• Move CI frontend tests to Debian 10.
• Move production build test to Debian 10.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This catches nine functional indexes that the previous query didn’t:
upper_preregistration_email_idx
upper_stream_name_idx
upper_subject_idx
upper_userprofile_email_idx
zerver_message_recipient_upper_subject
zerver_mutedtopic_stream_topic
zerver_stream_realm_id_name_uniq
zerver_userprofile_realm_id_delivery_email_uniq
zerver_userprofile_realm_id_email_uniq
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Restarting the uwsgi processes by way of supervisor opens a window
during which nginx 502's all responses. uwsgi has a configuration
called "chain reloading" which allows for rolling restart of the uwsgi
processes, such that only one process at once in unavailable; see
uwsgi documentation ([1]).
The tradeoff is that this requires that the uwsgi processes load the
libraries after forking, rather than before ("lazy apps"); in theory
this can lead to larger memory footprints, since they are not shared.
In practice, as Django defers much of the loading, this is not as much
of an issue. In a very basic test of memory consumption (measured by
total memory - free - caches - buffers; 6 uwsgi workers), both
immediately after restarting Django, and after requesting `/` 60 times
with 6 concurrent requests:
| Non-lazy | Lazy app | Difference
------------------+------------+------------+-------------
Fresh | 2,827,216 | 2,870,480 | +43,264
After 60 requests | 3,332,284 | 3,409,608 | +77,324
..................|............|............|.............
Difference | +505,068 | +539,128 | +34,060
That is, "lazy app" loading increased the footprint pre-requests by
43MB, and after 60 requests grew the memory footprint by 539MB, as
opposed to non-lazy loading, which grew it by 505MB. Using wsgi "lazy
app" loading does increase the memory footprint, but not by a large
percentage.
The other effect is that processes may be served by either old or new
code during the restart window. This may cause transient failures
when new frontend code talks to old backend code.
Enable chain-reloading during graceful, puppetless restarts, but only
if enabled via a zulip.conf configuration flag.
Fixes#2559.
[1]: https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#chain-reloading-lazy-apps
On a system where ‘apt-get update’ has never been run, ‘apt-cache
policy’ may show no repositories at all. Try to correct this with
‘apt-get update’ before giving up.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
If nginx was already installed, and we're using the webroot method of
initializing certbot, nginx needs to be reloaded. Hooks in
`/etc/letsencrypt/renewal-hooks/deploy/` do not run during initial
`certbot certonly`, so an explicit reload is required.
The certbot package installs its own systemd timer (and cron job,
which disabled itself if systemd is enabled) which updates
certificates. This process races with the cron job which Zulip
installs -- the only difference being that Zulip respects the
`certbot.auto_renew` setting, and that it passes the deploy hook.
This means that occasionally nginx would not be reloaded, when the
systemd timer caught the expiration first.
Remove the custom cron job and `certbot-maybe-renew` script, and
reconfigure certbot to always reload nginx after deploying, using
certbot directory hooks.
Since `certbot.auto_renew` can't have an effect, remove the setting.
In turn, this removes the need for `--no-zulip-conf` to
`setup-certbot`. `--deploy-hook` is similarly removed, as running
deploy hooks to restart nginx is now the default; pass
`--no-directory-hooks` in standalone mode to not attempt to reload
nginx. The other property of `--deploy-hook`, of skipping symlinking
into place, is given its own flog.
We've had a number of unhappy reports of upgrades failing due to
webpack requiring too much memory. While the previous commit will
likely fix this issue for everyone, it's worth improving the error
message for failures here.
We avoid doing the stop+retry ourselves, because that could cause an
outage in a production system if webpack fails for another reason.
Fixes#20105.
Since the upgrade to Webpack 5, we've been seeing occasional reports
that servers with roughly 4GiB of RAM were getting OOM kills while
running webpack.
Since we can't readily optimize the memory requirements for webpack
itself, we should raise the RAM requirements for doing the
lower-downtime upgrade strategy.
Fixes#20231.
Stopping both `zulip-tornado` and `zulip-tornado:*` causes errors on
deploys with tornado sharding, as the plain `zulip-tornado` service
does not exist.
Pass `zulip-tornado:*`, which matches both plain `zulip-tornado`, as
well as the sharded `zulip-tornado:zulip-tornado-port-9800` cases.
The `analyze_new_cluster.sh` script output by `pg_upgrade` just runs
`vacuumdb --all --analyze-in-stages`, which runs three passes over the
database, getting better stats each time. Each of these passes is
independent; the third pass does not require the first two.
`--analyze-in-stages` is only provided to get "something" into the
database, on the theory that it could then be started and used. Since
we wait for all three passes to complete before starting the database,
the first two passes add no value.
Additionally, PosttgreSQL 14 and up stop writing the
`analyze_new_cluster.sh` script as part of `pg_upgrade`, suggesting
the equivalent `vacuumdb --all --analyze-in-stages` call instead.
Switch to explicitly call `vacuumdb --all --analyze-only`, since we do
not gain any benefit from `--analyze-in-stages`. We also enable
parallelism, with `--jobs 10`, in order to analyze up to 10 tables in
parallel. This may increase load, but will accelerate the upgrade
process.
scripts.lib.node_cache expects Yarn to be in /srv/zulip-yarn, so if
it’s installed somewhere else, even if it’s the right version, we need
to reinstall it.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
It recently started failing on Debian 10 (buster). We immediately
follow this by replacing these packages with our own versions from
pip.txt, anyway.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The support for bullseye was added in #17951
but it was not documented as bullseye was
frozen and did not have proper configuration
files, hence wasn't documented.
Since now bullseye is released as a stable
version, it's support can be documented.
The usual output from this command looks like
Notice: Compiled catalog for localhost in environment production in 2.33 seconds
Notice: /Stage[main]/Zulip::Apt_repository/Exec[setup_apt_repo]/returns: current_value 'notrun', should be ['0'] (noop)
Notice: Class[Zulip::Apt_repository]: Would have triggered 'refresh' from 1 event
Notice: Stage[main]: Would have triggered 'refresh' from 1 event
Notice: Applied catalog in 1.20 seconds
which doesn’t seem abnormally alarming, and hiding it makes failures
harder to diagnose.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
We previously used `zulip-puppet-apply` with a custom config file,
with an updated PostgreSQL version but more limited set of
`puppet_classes`, to pre-create the basic settings for the new cluster
before running `pg_upgradecluster`.
Unfortunately, the supervisor config uses `purge => true` to remove
all SUPERVISOR configuration files that are not included in the puppet
configuration; this leads to it removing all other supervisor
processes during the upgrade, only to add them back and start them
during the second `zulip-puppet-apply`.
It also leads to `process-fts-updates` not being started after the
upgrade completes; this is the one supervisor config file which was
not removed and re-added, and thus the one that is not re-started due
to having been re-added. This was not detected in CI because CI added
a `start-server` command which was not in the upgrade documentation.
Set a custom facter fact that prevents the `purge` behaviour of the
supervisor configuration. We want to preserve that behaviour in
general, and using `zulip-puppet-apply` continues to be the best way
to pre-set-up the PostgreSQL configuration -- but we wish to avoid
that behaviour when we know we are applying a subset of the puppet
classes.
Since supervisor configs are no longer removed and re-added, this
requires an explicit start-server step in the instructions after the
upgrades complete. This brings the documentation into alignment with
what CI is testing.
These changes are all independent of each other; I just didn’t feel
like making dozens of commits for them.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
For our marketing emails, we want a width that's more appropriate for
newsletter context, vs. the narrow emails we use for transactional
content.
I haven't figured out a cleaner way to do this than duplicating most
of email_base_default.source.html. But it's not a big deal to
duplicate, since we've been changing that base template only about
once a year.
The script is added to upgrade steps for 20.04 and Buster because
those are the upgrades that cross glibc 2.28, which is most
problematic. It will also be called out in the upgrade notes, to
catch those that have already done that upgrade.
Running `supervisorctl stop` or `supervisorctl restart` on a process
name which is not known is an error:
```
$ supervisorctl stop nonexistent-process
nonexistent-process: ERROR (no such process)
$ echo $?
1
```
ef6d0ec5ca moved
zulip_deliver_scheduled_* out of the `workers:` group. Since upgrades
run `stop-server` before applying puppet, the list of processes at
that time is from the previous version of Zulip, so may not have the
new `zulip_deliver_scheduled_*` names -- and the `stop-server` will
hence fail.
If the upgrade is not applying puppet, it will `restart-server`. At
that point, the old names will still be in the configuration, so
relying on the current `superisorctl status` is the best gauge of what
exists to restart.
In short, only ever stop/start/restart the `zulip_deliver_scheduled_*`
processes if `supervisorctl status` knows about them already.
Nonexistent processes and groups passed to `supervisortctl status` are
printed to STDOUT as follows:
```
$ supervisorctl status zulip-django nonexistent-process nonexistent-group:*
nonexistent-process: ERROR (no such process)
nonexistent-group: ERROR (no such group)
zulip-django RUNNING pid 16043, uptime 17:31:31
```
On supervisor 4 and above, this exits with an exit code of 4;
previously, it returned exit code 0. Ubuntu 18.04 has version 3.3.1,
and Ubuntu 20.04 has version 4.1.0.
Skip any lines with `ERROR (no such ...)`, and accept exit code 4 from
`supervisorctl status`.
Mypy can’t follow absolute imports based on directories other than the
root. This was hiding some type errors due to ignore_missing_imports.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
With two space-separated classes in `puppet_classes`, the second one
is silently ignored. With three of more, puppet generates the
following very opaque error message:
```
Error: Could not parse for environment production: This
Name has no effect. A value was produced and then forgotten (one or
more preceding expressions may have the wrong form)
```
Catch when this has happened, and give an error message to the user.
Fixes#18992.
This command is part of a statsd infrastructure that we stopped
supporting years ago. Its only purpose for some time has been to
provide sample code for how the restart script might trigger a
notification to a graphing system, which doesn't justify maintaining
it.
Fixes part of #18898.
The current `upgrade-zulip` and `upgrade-zulip-from-git`
bash scripts exit with a zero status even if the
upgrade commands exit with a non-zero status.
Hence add `set -e` command which exits the script with
the same status as the non-zero command.
For pipe commands however, the net status of a command
is the status of the last command, hence if the other parts
fail, the net status is only determined by the last command.
This is the case with our main /lib/upgrade-zulip* command
in the scripts whose status is determined by the `tee` command
instead. Hence add a small condition to get the status of the
actual upgrade command and exit the script if it fails with
a non-zero command.
We also check whether the script is being run as root, matching the
install script logic.
This parameter is somewhat useful, and adding this also fixes a
regression where purge-old-deployments would crash since the changes
around c5580607a7 because of
inconsistent supported args lists.
Fixes#16659.
If the server is behind a reverse proxy with http_only=True, the
requests made by email-mirror-postfix need to use http, as https
doesn't work.
Staging and other hosts that are `zulip::app_frontend_base` but not
`zulip::app_frontend_once` do not have a
/etc/supervisor/conf.d/zulip/zulip-once.conf and as such do not have
`zulip_deliver_scheduled_emails` or `zulip_deliver_scheduled_messages`
and thus supervisor will fail to reload.
Making the contents of `zulip-workers` contingent on if the server is
_also_ a `-once` server is complicated, and would involve using Concat
fragments, which severely limit readability.
Instead, expel those two from `zulip-workers`; this is somewhat
reasonable, since they are use an entirely different codepath from
zulip_events_*, using the database rather than RabbitMQ for their
queuing.
We currently run the `clean_unused_caches.py` as a
script to clean the unused caches.
This commit replaces that with
`clean_unused_caches.main` function as it would be
faster.
This commit will allow us to pass the arguments in the
'clean...' functions when calling the `main` function (in
`provision`). It also changes args parsing
function location to `if __name__ == "__main__"` block as
we wouldn't need it to parse args when we call the
function.
We convert the `clean-unused-caches` script to a
python file so we can run it in provision by importing it
instead of running the script, hence saving some time.
Appending data back-to-back without serializing it loses the
information about where the breaks between them lie, which can lead to
different inputs having the same hash.
Using puppet modules from the puppet forge judiciously will allow us
to simplify the configuration somewhat; this specifically pulls in the
stdlib module, which we were already using parts of.
This moves the `.asc` files into subdirectories, and writes out the
according `.list` files into them. It moves from templates to
written-out `.list` files for clarity and ease of
implementation (Debian and Ubuntu need different templates for
`zulip`), and as a way of making explicit which releases are supported
for each list. For the special-case of the PGroonga signing key, we
source an additional file within the directory.
This simplifies the process for adding another class of `.list` file.
Add support for custom database names and database users, which can be
set with the `--postgresql-database-name` and
`--postgresql-database-user` install script options. If these
parameters aren't provided, then the defaults remain "zulip".
Fixes#17662.
Co-authored-by: Alex Vandiver <alexmv@zulip.com>
This adds basic support for `postgresql.database_user` and
`postgresql.database_name` settings in `zulip.conf`; the defaults if
unspecified are left as `zulip`.
Co-authored-by: Adam Birds <adam.birds@adbwebdesigns.co.uk>
Add a helper `run_psql_as_postgres` function in
`scripts/lib/zulip_tools.py`. This is preparatory refactoring for the
work to add custom database and user names.
Fixes this error when running the installer from a directory that
isn’t world-readable:
+ su zulip -c 'git config --global user.email anders@zulip.com'
fatal: cannot come back to cwd: Permission denied
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This reverts part of commit 476524c0c1
(#18215), to fix this error when running the installer from a
directory that isn’t world-readable:
+ '[' -e /var/run/supervisor.sock ']'
+++ dirname /root/zulip-server-4.1/scripts/setup/postgresql-init-db
++ dirname /root/zulip-server-4.1/scripts/setup
+ su zulip -c /root/zulip-server-4.1/scripts/stop-server
bash: /root/zulip-server-4.1/scripts/stop-server: Permission denied
Zulip installation failed (exit code 126)!
Signed-off-by: Anders Kaseorg <anders@zulip.com>
When upgrading from a pre-4.0 release, scripts/stop-server logic would
check whether supervisord configuration files were present to
determine what it needed to restart, but only considered paths to
those files that are introduced in Zulip 4.0.
Fixed#18493.
This ensures that the `git describe` queries that we run for caching
Zulip's Git version are guaranteed to include recent releases.
This change ensures that we have accurate output even if we're pointed
at a fork of Zulip that never updates its tags.
Additionally, it will make it possible to record the
`git merge-base upstream/master` in future commits.
Note that because we run this code before unpacking the new version,
the pre-upgrade version of this code runs.
As a result, we cannot assume that the upstream repository exists.
This removes a possible window where an installer error could leave
`nvm` in a state where it had prepended the full path to the
newly-installed `npm` to `$PATH`; we would like to avoid `nvm`
fiddling with path whenever possible (ref ebe930ab2c).
During the upgrade process of a postgresql-only Zulip installation,
(`puppet_classes = zulip::profile::postgresql` in
`/etc/zulip/zulip.conf`) either `scripts/start-server` or
`scripts/stop-server` fail because they try to handle supervisor
services that are not available (e.g. Tornado) since only
`/etc/supervisor/conf.d/zulip/zulip_db.conf` is present and not
`/etc/supervisor/conf.d/zulip/zulip.conf`.
While this wasn't previously supported, it's a pretty reasonable thing
to do, and can be readily supported by just adding a few conditionals.
Matching the full process name (-x without -f) or full command
line (-xf) is less prone to mistakes like matching a random substring
of some other command line or pgrep matching itself.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Thumbor and tc-aws have been dragging their feet on Python 3 support
for years, and even the alphas and unofficial forks we’ve been running
don’t seem to be maintained anymore. Depending on these projects is
no longer viable for us.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The `en_US.UTF-8` locale may not be configured or generated on all
installs; it also requires that the `locales` package be installed.
If users generate the `en_US.UTF-8` locale without adding it to the
permanent set of system locales, the generated `en_US.UTF-8` stops
working when the `locales` package is updated.
Switch to using `C.UTF-8` in all cases, which is guaranteed to be
installed.
Fixes#15819.
In some cases, puppet can end up restarting supervisord services - which
will use code from the old deployment, because when puppet runs,
/home/zulip/deployments/current still points there. Thus restart-server
needs to be used in favor of start-server, unless we know that puppet
has been skipped.
Previous versions of zulip used `nvm alias default ...` to have `nvm`
prepend the full path to the latest `node` install to the `PATH` in
root's shell. Unfortunately, this means that `update-prod-static`,
when called from `upgrade-zulip-stage-2` after an upgrade of node in
`install-node`, would still have the full path to the _old_ `node` at
the start of its PATH, because the PATH of `upgrade-zulip-stage-2`
would still be unchanged.
Bootstrap out of this by setting a known-reasonable PATH during
upgrade, and remove the problematic `nvm alias default` behaviour.
Fixes#18258.
In Debian, becoming root as `su` does not alter the `$PATH`; this can
lead to the root user not having `/usr/sbin` in its path, and thus
the `useradd zulip` step of the installer fails.
Fixes#17441.
Instead of taking the "onion" approach, where all services are
stopped, and then started back up again, default to a rolling restart
across all processes. This draws out how long the overall "restart"
takes, but minimizes the time that any of the services are down. This
minimizes user-visible impact and queue buildup.
In cases where speed is more important than minimal impact (for
example, there is already a current outage), a --less-graceful flag is
provided, which brings the services down more suddenly, and back up in
a still-correct order.
This hits the unauthenticated Github API to get the list of tags,
which is rate-limited to 60 requests per hour. This means that the
tool can only be run 60 times per hour before it starts to exit with
errors, but that seems like a reasonable limit for the moment.
This commit removes redundant yarn cache by removing the old
version directories, i.e. All the directory under `~/.cache/yarn`
except `~/.cache/yarn/v6` (current version directory).
Fixes#15964.
In general, `./scripts/restart-server` will already work in any
circumstance where the server is already stopped and needs to be
started. However, it will output a couple minor warnings, and it is
not readily obvious that it *will* work correctly.
Add an alias for `restart-server` named `start-server`, for
parallelism with `stop-server`, which omits the steps of
`restart-server` which would stop the server first.
Using `supervisorctl stop all` to stop the server is not terribly
discoverable, and may stop services which are not part of Zulip
proper.
Add an explicit tool which only stops the relevant services. It also
more carefully controls the order in which services are stopped to
minimize lost requests, and maximally quiesce the server.
Locations which may be stopping _older_ versions of Zulip (without
this script) are left with using `supervisorctl stop all`.
Fixes#14959.
The path which contains all of the Zulip supervisor files changed in
3ab9b31d2f to make it easier to purge
now-unwanted supervisor configuration files. However, the paths that
the zulip upgrade process, and restart-server, look at were not
adjusted.
Fix the supervisor configuration file paths.
3314fefaec started needing `python3-yaml`, but incorrectly claimed
that it was always an indirect dependency; it is a dependency of
`ubuntu-minimal` on 20.04, but not required on 18.04 or Debian. We
cannot install it in puppet because then is definitionally too late;
it is needed at load time by `zulip-puppet-apply`.
Install `python3-yaml`, but guarded by a simple check so as to not
further slow most installs.
Fixes#18179.
The stacktraces here are seldom useful -- for the calls to
upgrade-stage-2, we know precisely what was run. For the `run`
wrapper, the output contains the command that failed, which is
sufficient to identify where in the upgrade process it was. Showing
more stacktrace below the actual error merely confuses users and
scrolls the real error off of the screen.
For installs which use the `upgrade-zulip-from-git` process, the
deployment directory is a git checkout. This means that an
administrator can, as an emergency tool, run `git revert` and similar
commands -- assuming there is a `~/.gitconfig` set up for the zulip
user.
Add commands to `scripts/lib/install` to create a `~/.gitconfig` file
at installation time. The `user.name` and `user.email` fields are set
to the hostname and passed-in `--email` value, respectively.
Fixes#18039.
0663b23d54 changed zulip-puppet-apply to
use the venv, because it began using `yaml` to parse the output of
puppet to determine if changes would happen.
However, not every install ends with a venv; notably, non-frontend
servers do not have one. Attempting to run zulip-puppet-apply on them
hence now fails.
Remove this dependency on the venv, by installing a system
python3-yaml package -- though in reality, this package is already an
indirect dependency of the system. Especially since pyyaml is quite
stable, we're not using it in any interesting way, and it does not
actually add to the dependencies, it is preferable to parsing the YAML
by hand in this instance.
When exception is raised inside an exception handler, Python 3
helpfully prints both tracebacks separated by “During handling of the
above exception, another exception occurred:”. But when we’re using
an exception handler to retry the same operation, multiple tracebacks
are just noise. Suppress the earlier one using PEP 409 syntax.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This means that in steady-state, `zulip-puppet-apply` is expected to
produce no changes or commands to execute. The verification step of
`setup-apt-repo` is quite fast, so this cleans up the output for very
little cost.
The class names need to be renamed even if we are not about to run
puppet ourselves; otherwise, deployments which rely on running puppet
themselves will still have the wrong class names.
These are respected by `urllib`, and thus also `requests`. We set
`HTTP_proxy`, not `HTTP_PROXY`, because the latter is ignored in
situations which might be running under CGI -- in such cases it may be
coming from the `Proxy:` header in the request.