This adds a `--skip-restart` flag, which puts `deployments/next` into a
state where it can be restarted into, but holds off on conducting that
restart.
This requires many of the same guarantees as `--skip-tornado`, in
terms of there being no Puppet or database schema changes between the
versions. Enforce those with `--skip-restart`, and also broaden both
flags to prevent other, less common changes which could nonetheless
affect the other deploy.
Because Tornado and Django use memcached as a shared cache for
checking session information, they must agree on the prefix used to
store those values.
Subsequent commits will work to ensure that it is always _safe_ to
share that cache.
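
As a minimal sketch of why the prefix must match, assuming a plain dict
as a stand-in for memcached and hypothetical helper names (none of these
are the real cache functions):

```python
from typing import Dict, Optional

# Stand-in for the shared memcached instance; a plain dict for illustration.
shared_cache: Dict[str, str] = {}


def cache_key(prefix: str, key: str) -> str:
    # Hypothetical key layout: every key is namespaced by the process's prefix.
    return f"{prefix}:{key}"


def store_session(prefix: str, session_id: str, user: str) -> None:
    shared_cache[cache_key(prefix, session_id)] = user


def load_session(prefix: str, session_id: str) -> Optional[str]:
    # A reader using a different prefix looks up a different key and misses.
    return shared_cache.get(cache_key(prefix, session_id))
```

A session stored by one process under one prefix is invisible to a
process reading with another prefix, which is the failure mode the
shared prefix avoids.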
These checks are expensive, and moving them to one explicit call early
saves considerable time in the critical period:
```
$ hyperfine './manage.py fill_memcached_caches' './manage.py fill_memcached_caches --skip-checks'
Benchmark #1: ./manage.py fill_memcached_caches
Time (mean ± σ): 5.264 s ± 0.146 s [User: 4.885 s, System: 0.344 s]
Range (min … max): 5.119 s … 5.569 s 10 runs
Benchmark #2: ./manage.py fill_memcached_caches --skip-checks
Time (mean ± σ): 3.090 s ± 0.089 s [User: 2.853 s, System: 0.214 s]
Range (min … max): 2.950 s … 3.204 s 10 runs
Summary
'./manage.py fill_memcached_caches --skip-checks' ran
1.70 ± 0.07 times faster than './manage.py fill_memcached_caches'
```
Treating the restart as a start is important in reducing the critical
period during upgrades -- we call restart even when we suspect the
services are stopped, because puppet has a small possibility of
placing them in an indeterminate state. However, restart orders the
workers first, then tornado/django, which prolongs the outage.
Recognize when no services are currently started, and switch to acting
like a start, not a restart, which places tornado/django first.
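
The decision described above can be sketched roughly as follows. The
grouping of services into `FRONTENDS` and `WORKERS` and the function
name are assumptions for illustration, not the actual deploy code:

```python
from typing import Dict, List, Tuple

# Hypothetical grouping; the real service list lives in the deploy scripts.
FRONTENDS = ["zulip-tornado", "zulip-django"]
WORKERS = ["process-fts-updates", "zulip_deliver_scheduled_emails"]


def plan_restart(running: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Pick the action and the service order from the current state."""
    all_services = FRONTENDS + WORKERS
    if not any(running.get(name, False) for name in all_services):
        # Nothing is up: act like a start, bringing tornado/django up first.
        return "start", FRONTENDS + WORKERS
    # Something is up: a restart handles the workers first.
    return "restart", WORKERS + FRONTENDS
```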
This hides ugly output if the services were already stopped:
```
2022-03-25 23:26:04,165 upgrade-zulip-stage-2: Stopping Zulip...
process-fts-updates: ERROR (not running)
zulip-django: ERROR (not running)
zulip_deliver_scheduled_emails: ERROR (not running)
zulip_deliver_scheduled_messages: ERROR (not running)
Zulip stopped successfully!
```
Being able to skip shelling out to `supervisorctl` when all services
are already stopped is also a significant performance improvement.
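
A rough sketch of that short-circuit, assuming the caller already knows
which services are running. `stop_services` and the injected `run`
callable are hypothetical names used for illustration:

```python
from typing import Callable, Dict, List


def stop_services(
    statuses: Dict[str, bool], run: Callable[[List[str]], None]
) -> bool:
    """Stop only running services; return False if nothing needed stopping."""
    running = sorted(name for name, up in statuses.items() if up)
    if not running:
        # Everything is already stopped: no subprocess, no error output.
        return False
    run(["supervisorctl", "stop", *running])
    return True
```

In real use, `run` could be a thin wrapper around
`subprocess.check_call`; injecting it keeps the sketch testable without
a running supervisor.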
These have more accurate timestamps, and have user information --
but are harder to parse, and will not show requests when Django or
Tornado is stopped.
This is a script to search nginx log files by server hostname or
client IP address, and output matching lines, all while skipping
common and less-interesting request lines.
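
The filtering idea can be sketched as below. This is not the script
itself: the log layout (client IP as the first field, hostname as a
separate whitespace-delimited field) and the `BORING_PREFIXES` list are
assumptions made for the example:

```python
import re
from typing import Iterable, Iterator, Optional

# Matches the quoted request, e.g. "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*"')

# Hypothetical set of request paths considered uninteresting.
BORING_PREFIXES = ("/static/", "/json/events")


def search_lines(
    lines: Iterable[str],
    *,
    client_ip: Optional[str] = None,
    hostname: Optional[str] = None,
) -> Iterator[str]:
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # Assumed layout: the client IP is the first field of each line.
        if client_ip is not None and fields[0] != client_ip:
            continue
        # Assumed layout: the hostname appears as its own field.
        if hostname is not None and hostname not in fields:
            continue
        match = REQUEST_RE.search(line)
        if match and match.group("path").startswith(BORING_PREFIXES):
            continue
        yield line
```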
As a consequence:
• Bump minimum supported Python version to 3.8.
• Move Vagrant environment to Ubuntu 20.04, which has Python 3.8.
• Move CI frontend tests to Ubuntu 20.04.
• Move production build test to Ubuntu 20.04.
• Move 3.4 upgrade test to Ubuntu 20.04.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
We previously used restart-server if puppet was run, as a nod to the
fact that `supervisorctl reread && supervisorctl update` will _start_
service groups that were modified, even if they were previously
stopped; this is because they are marked as `autostart=true`, which is
honored on service change.
However, upgrades want to run while there are no services running. If
puppet is run, explicitly set the server as potentially being "up", so
that a `shutdown_server()` before migrations, if they exist, will stop
services.
7c4293a7d3 switched to checking if the service was already running,
and using `supervisorctl start` if it was not.
Unfortunately, `list_supervisor_processes("zulip-tornado:*")` did not
include `zulip-tornado`, and as such a non-sharded process was always
considered to _not_ be running, and was thus started, not restarted.
Starting an already-started service is a no-op, and thus non-sharded
tornado processes were never restarted.
The observed behaviour is that requests to the tornado process attempt
to load the user from the cache, with a different prefix from Django,
and immediately invalidate the session and eject the user back to the
login page.
Fix the `list_supervisor_processes` logic to match without the
trailing `:*`.
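
The intended matching can be sketched like this. It is an illustration
of the rule rather than the actual `list_supervisor_processes` code, and
`match_supervisor_names` is a hypothetical name:

```python
from typing import List


def match_supervisor_names(all_names: List[str], pattern: str) -> List[str]:
    """Return the process names selected by a supervisor-style pattern."""
    if pattern.endswith(":*"):
        base = pattern[: -len(":*")]
        # "group:*" selects sharded "group:member" processes, and must also
        # select a non-sharded process that is named just "group".
        return [
            name
            for name in all_names
            if name == base or name.startswith(base + ":")
        ]
    return [name for name in all_names if name == pattern]
```

With this rule, `zulip-tornado:*` selects a lone `zulip-tornado` process
as well as sharded `zulip-tornado:...` members, so a running non-sharded
process is seen as running and gets restarted.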