Attempting to "upgrade" from `main` to 4.x should abort; Django does
not prevent running old code against the new database (though it
likely errors at runtime), and `./manage.py migrate` from the old
version during the "upgrade" does not downgrade the database, since
the migrations are entirely missing in that directory, so don't get
reversed.
Compare the list of applied migrations to the list of on-disk
migrations, and abort if there are applied migrations which are not
found on disk.
Fixes: #19284.
For many uses, shelling out to `supervisorctl` is going to produce
better error messages. However, for instances where we wish to parse
the output of `supervisorctl`, using the API directly is less brittle.
The RabbitMQ docs state ([1]):
RabbitMQ nodes and CLI tools (e.g. rabbitmqctl) use a cookie to
determine whether they are allowed to communicate with each
other. [...] The cookie is just a string of alphanumeric
characters up to 255 characters in size. It is usually stored in a
local file.
...and goes on to state (emphasis ours):
If the file does not exist, Erlang VM will try to create one with
a randomly generated value when the RabbitMQ server starts
up. Using such generated cookie files are **appropriate in
development environments only.**
The auto-generated cookie does not use cryptographic sources of
randomness, and generates 20 characters of `[A-Z]`. Because of a
semi-predictable seed, the entropy of this password is thus less than
the idealized 26^20 = 94 bits of entropy; in actuality, it is 36 bits
of entropy, or potentially as low as 20 if the performance of the
server is known.
These sizes are well within the scope of remote brute-force attacks.
On provision, install, and upgrade, replace the default insecure
20-character Erlang cookie with a cryptographically secure
255-character string (the max length allowed).
[1] https://www.rabbitmq.com/clustering.html#erlang-cookie
5c450afd2d, in ancient history, switched from `check_call` to
`check_output` and throwing away its result.
Use check_call, so that we show the steps to (re)starting the server.
This is required in order to lock down the RabbitMQ port to only
listen on localhost. If the nodename is `rabbit@hostname`, in most
circumstances the hostname will resolve to an external IP, which the
rabbitmq port will not be bound to.
Installs which used `rabbit@hostname`, due to RabbitMQ having been
installed before Zulip, would not have functioned if the host or
RabbitMQ service was restarted, as the localhost restrictions in the
RabbitMQ configuration would have made rabbitmqctl (and Zulip cron
jobs that call it) unable to find the rabbitmq server.
The previous commit ensures that configure-rabbitmq is re-run after
the nodename has changed. However, rabbitmq needs to be stopped
before `rabbitmq-env.conf` is changed; we use an `onlyif` on an `exec`
to print the warning about the node change, and let the subsequent
config change and notify of the service and configure-rabbitmq to
complete the re-configuration.
`/etc/rabbitmq/rabbitmq-env.conf` sets the nodename; anytime the
nodename changes, the backing database changes, and this requires
re-creating the rabbitmq users and permissions.
Trigger this in puppet by running configure-rabbitmq after the file
changes.
This addresses the problems mentioned in the previous commit, but for
existing installations which have `authenticator = standalone` in
their configurations.
This reconfigures all hostnames in certbot to use the webroot
authenticator, and attempts to force-renew their certificates.
Force-renewal is necessary because certbot contains no way to merely
update the configuration. Let's Encrypt allows for multiple extra
renewals per week, so this is a reasonable cost.
Because the certbot configuration is `configobj`, and not
`configparser`, we have no way to easily parse to determine if webroot
is in use; additionally, `certbot certificates` does not provide this
information. We use `grep`, on the assumption that this will catch
nearly all cases.
It is possible that this will find `authenticator = standalone`
certificates which are managed by Certbot, but not Zulip certificates.
These certificates would also fail to renew while Zulip is running, so
switching them to use the Zulip webroot would still be an improvement.
Fixes#20593.
Installing certbot with --method=standalone means that the
configuration file will be written to assume that the standalone
method will be used going forward. Since nginx will be running,
attempts to renew the certificate will fail.
Install a temporary self-signed certificate, just to allow nginx to
start, and then follow up (after applying puppet to start nginx) with
the call to setup-certbot, which will use the webroot authenticator.
The `setup-certbot --method=standalone` option is left intact, for use
in development environments.
Fixes part of #20593; it does not address installs which were
previously improperly configured with `authenticator = standalone`.
As a consequence:
• Bump minimum supported Python version to 3.7.
• Move Vagrant environment to Debian 10, which has Python 3.7.
• Move CI frontend tests to Debian 10.
• Move production build test to Debian 10.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
On a system where ‘apt-get update’ has never been run, ‘apt-cache
policy’ may show no repositories at all. Try to correct this with
‘apt-get update’ before giving up.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The certbot package installs its own systemd timer (and cron job,
which disabled itself if systemd is enabled) which updates
certificates. This process races with the cron job which Zulip
installs -- the only difference being that Zulip respects the
`certbot.auto_renew` setting, and that it passes the deploy hook.
This means that occasionally nginx would not be reloaded, when the
systemd timer caught the expiration first.
Remove the custom cron job and `certbot-maybe-renew` script, and
reconfigure certbot to always reload nginx after deploying, using
certbot directory hooks.
Since `certbot.auto_renew` can't have an effect, remove the setting.
In turn, this removes the need for `--no-zulip-conf` to
`setup-certbot`. `--deploy-hook` is similarly removed, as running
deploy hooks to restart nginx is now the default; pass
`--no-directory-hooks` in standalone mode to not attempt to reload
nginx. The other property of `--deploy-hook`, of skipping symlinking
into place, is given its own flog.
We've had a number of unhappy reports of upgrades failing due to
webpack requiring too much memory. While the previous commit will
likely fix this issue for everyone, it's worth improving the error
message for failures here.
We avoid doing the stop+retry ourselves, because that could cause an
outage in a production system if webpack fails for another reason.
Fixes#20105.
Since the upgrade to Webpack 5, we've been seeing occasional reports
that servers with roughly 4GiB of RAM were getting OOM kills while
running webpack.
Since we can't readily optimize the memory requirements for webpack
itself, we should raise the RAM requirements for doing the
lower-downtime upgrade strategy.
Fixes#20231.
scripts.lib.node_cache expects Yarn to be in /srv/zulip-yarn, so if
it’s installed somewhere else, even if it’s the right version, we need
to reinstall it.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
It recently started failing on Debian 10 (buster). We immediately
follow this by replacing these packages with our own versions from
pip.txt, anyway.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The support for bullseye was added in #17951
but it was not documented as bullseye was
frozen and did not have proper configuration
files, hence wasn't documented.
Since now bullseye is released as a stable
version, it's support can be documented.
The usual output from this command looks like
Notice: Compiled catalog for localhost in environment production in 2.33 seconds
Notice: /Stage[main]/Zulip::Apt_repository/Exec[setup_apt_repo]/returns: current_value 'notrun', should be ['0'] (noop)
Notice: Class[Zulip::Apt_repository]: Would have triggered 'refresh' from 1 event
Notice: Stage[main]: Would have triggered 'refresh' from 1 event
Notice: Applied catalog in 1.20 seconds
which doesn’t seem abnormally alarming, and hiding it makes failures
harder to diagnose.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
These changes are all independent of each other; I just didn’t feel
like making dozens of commits for them.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Nonexistent processes and groups passed to `supervisortctl status` are
printed to STDOUT as follows:
```
$ supervisorctl status zulip-django nonexistent-process nonexistent-group:*
nonexistent-process: ERROR (no such process)
nonexistent-group: ERROR (no such group)
zulip-django RUNNING pid 16043, uptime 17:31:31
```
On supervisor 4 and above, this exits with an exit code of 4;
previously, it returned exit code 0. Ubuntu 18.04 has version 3.3.1,
and Ubuntu 20.04 has version 4.1.0.
Skip any lines with `ERROR (no such ...)`, and accept exit code 4 from
`supervisorctl status`.
This parameter is somewhat useful, and adding this also fixes a
regression where purge-old-deployments would crash since the changes
around c5580607a7 because of
inconsistent supported args lists.
Fixes#16659.
If the server is behind a reverse proxy with http_only=True, the
requests made by email-mirror-postfix need to use http, as https
doesn't work.
Staging and other hosts that are `zulip::app_frontend_base` but not
`zulip::app_frontend_once` do not have a
/etc/supervisor/conf.d/zulip/zulip-once.conf and as such do not have
`zulip_deliver_scheduled_emails` or `zulip_deliver_scheduled_messages`
and thus supervisor will fail to reload.
Making the contents of `zulip-workers` contingent on if the server is
_also_ a `-once` server is complicated, and would involve using Concat
fragments, which severely limit readability.
Instead, expel those two from `zulip-workers`; this is somewhat
reasonable, since they are use an entirely different codepath from
zulip_events_*, using the database rather than RabbitMQ for their
queuing.
This commit will allow us to pass the arguments in the
'clean...' functions when calling the `main` function (in
`provision`). It also changes args parsing
function location to `if __name__ == "__main__"` block as
we wouldn't need it to parse args when we call the
function.
We convert the `clean-unused-caches` script to a
python file so we can run it in provision by importing it
instead of running the script, hence saving some time.
Appending data back-to-back without serializing it loses the
information about where the breaks between them lie, which can lead to
different inputs having the same hash.
Using puppet modules from the puppet forge judiciously will allow us
to simplify the configuration somewhat; this specifically pulls in the
stdlib module, which we were already using parts of.
This moves the `.asc` files into subdirectories, and writes out the
according `.list` files into them. It moves from templates to
written-out `.list` files for clarity and ease of
implementation (Debian and Ubuntu need different templates for
`zulip`), and as a way of making explicit which releases are supported
for each list. For the special-case of the PGroonga signing key, we
source an additional file within the directory.
This simplifies the process for adding another class of `.list` file.
Add support for custom database names and database users, which can be
set with the `--postgresql-database-name` and
`--postgresql-database-user` install script options. If these
parameters aren't provided, then the defaults remain "zulip".
Fixes#17662.
Co-authored-by: Alex Vandiver <alexmv@zulip.com>
Add a helper `run_psql_as_postgres` function in
`scripts/lib/zulip_tools.py`. This is preparatory refactoring for the
work to add custom database and user names.
Fixes this error when running the installer from a directory that
isn’t world-readable:
+ su zulip -c 'git config --global user.email anders@zulip.com'
fatal: cannot come back to cwd: Permission denied
Signed-off-by: Anders Kaseorg <anders@zulip.com>
When upgrading from a pre-4.0 release, scripts/stop-server logic would
check whether supervisord configuration files were present to
determine what it needed to restart, but only considered paths to
those files that are introduced in Zulip 4.0.
Fixed#18493.
This ensures that the `git describe` queries that we run for caching
Zulip's Git version are guaranteed to include recent releases.
This change ensures that we have accurate output even if we're pointed
at a fork of Zulip that never updates its tags.
Additionally, it will make it possible to record the
`git merge-base upstream/master` in future commits.
Note that because we run this code before unpacking the new version,
the pre-upgrade version of this code runs.
As a result, we cannot assume that the upstream repository exists.
This removes a possible window where an installer error could leave
`nvm` in a state where it had prepended the full path to the
newly-installed `npm` to `$PATH`; we would like to avoid `nvm`
fiddling with path whenever possible (ref ebe930ab2c).
During the upgrade process of a postgresql-only Zulip installation,
(`puppet_classes = zulip::profile::postgresql` in
`/etc/zulip/zulip.conf`) either `scripts/start-server` or
`scripts/stop-server` fail because they try to handle supervisor
services that are not available (e.g. Tornado) since only
`/etc/supervisor/conf.d/zulip/zulip_db.conf` is present and not
`/etc/supervisor/conf.d/zulip/zulip.conf`.
While this wasn't previously supported, it's a pretty reasonable thing
to do, and can be readily supported by just adding a few conditionals.
Thumbor and tc-aws have been dragging their feet on Python 3 support
for years, and even the alphas and unofficial forks we’ve been running
don’t seem to be maintained anymore. Depending on these projects is
no longer viable for us.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The `en_US.UTF-8` locale may not be configured or generated on all
installs; it also requires that the `locales` package be installed.
If users generate the `en_US.UTF-8` locale without adding it to the
permanent set of system locales, the generated `en_US.UTF-8` stops
working when the `locales` package is updated.
Switch to using `C.UTF-8` in all cases, which is guaranteed to be
installed.
Fixes#15819.
In some cases, puppet can end up restarting supervisord services - which
will use code from the old deployment, because when puppet runs,
/home/zulip/deployments/current still points there. Thus restart-server
needs to be used in favor of start-server, unless we know that puppet
has been skipped.
Previous versions of zulip used `nvm alias default ...` to have `nvm`
prepend the full path to the latest `node` install to the `PATH` in
root's shell. Unfortunately, this means that `update-prod-static`,
when called from `upgrade-zulip-stage-2` after an upgrade of node in
`install-node`, would still have the full path to the _old_ `node` at
the start of its PATH, because the PATH of `upgrade-zulip-stage-2`
would still be unchanged.
Bootstrap out of this by setting a known-reasonable PATH during
upgrade, and remove the problematic `nvm alias default` behaviour.
Fixes#18258.
In Debian, becoming root as `su` does not alter the `$PATH`; this can
lead to the root user not having `/usr/sbin` in its path, and thus
the `useradd zulip` step of the installer fails.
Fixes#17441.
This commit removes redundant yarn cache by removing the old
version directories, i.e. All the directory under `~/.cache/yarn`
except `~/.cache/yarn/v6` (current version directory).
Fixes#15964.
The path which contains all of the Zulip supervisor files changed in
3ab9b31d2f to make it easier to purge
now-unwanted supervisor configuration files. However, the paths that
the zulip upgrade process, and restart-server, look at were not
adjusted.
Fix the supervisor configuration file paths.
3314fefaec started needing `python3-yaml`, but incorrectly claimed
that it was always an indirect dependency; it is a dependency of
`ubuntu-minimal` on 20.04, but not required on 18.04 or Debian. We
cannot install it in puppet because then is definitionally too late;
it is needed at load time by `zulip-puppet-apply`.
Install `python3-yaml`, but guarded by a simple check so as to not
further slow most installs.
Fixes#18179.
The stacktraces here are seldom useful -- for the calls to
upgrade-stage-2, we know precisely what was run. For the `run`
wrapper, the output contains the command that failed, which is
sufficient to identify where in the upgrade process it was. Showing
more stacktrace below the actual error merely confuses users and
scrolls the real error off of the screen.
For installs which use the `upgrade-zulip-from-git` process, the
deployment directory is a git checkout. This means that an
administrator can, as an emergency tool, run `git revert` and similar
commands -- assuming there is a `~/.gitconfig` set up for the zulip
user.
Add commands to `scripts/lib/install` to create a `~/.gitconfig` file
at installation time. The `user.name` and `user.email` fields are set
to the hostname and passed-in `--email` value, respectively.
Fixes#18039.
0663b23d54 changed zulip-puppet-apply to
use the venv, because it began using `yaml` to parse the output of
puppet to determine if changes would happen.
However, not every install ends with a venv; notably, non-frontend
servers do not have one. Attempting to run zulip-puppet-apply on them
hence now fails.
Remove this dependency on the venv, by installing a system
python3-yaml package -- though in reality, this package is already an
indirect dependency of the system. Especially since pyyaml is quite
stable, we're not using it in any interesting way, and it does not
actually add to the dependencies, it is preferable to parsing the YAML
by hand in this instance.
When exception is raised inside an exception handler, Python 3
helpfully prints both tracebacks separated by “During handling of the
above exception, another exception occurred:”. But when we’re using
an exception handler to retry the same operation, multiple tracebacks
are just noise. Suppress the earlier one using PEP 409 syntax.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This means that in steady-state, `zulip-puppet-apply` is expected to
produce no changes or commands to execute. The verification step of
`setup-apt-repo` is quite fast, so this cleans up the output for very
little cost.
The class names need to be renamed even if we are not about to run
puppet ourselves; otherwise, deployments which rely on running puppet
themselves will still have the wrong class names.
These are respected by `urllib`, and thus also `requests`. We set
`HTTP_proxy`, not `HTTP_PROXY`, because the latter is ignored in
situations which might be running under CGI -- in such cases it may be
coming from the `Proxy:` header in the request.
Using `config_file.write()` only writes out what python stored of the
file; as such, it strips all comments and whitespace.
Use `crudini --set`, which only modifies the line whose contents are
changed.
There is only one PostgreSQL database; the "appdb" is irrelevant.
Also use "postgresql," as it is the name of the software, whereas
"postgres" the name of the binary and colloquial name. This is minor
cleanup, but enabled by the other renames in the previous commit.
The "voyager" name is non-intuitive and not significant.
`zulip::voyager` and `zulip::dockervoyager` stubs are kept for
back-compatibility with existing `zulip.conf` files.
This moves the puppet configuration closer to the "roles and profiles
method"[1] which is suggested for organizing puppet classes. Notably,
here it makes clear which classes are meant to be able to stand alone
as deployments.
Shims are left behind at the previous names, for compatibility with
existing `zulip.conf` files when upgrading.
[1] https://puppet.com/docs/pe/2019.8/the_roles_and_profiles_method