zulip

Commit Graph

Author	SHA1	Message	Date
Anders Kaseorg	4eda29bd86	ruff: Fix RUF005 Consider spread instead of concatenation. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-01-26 10:16:30 -08:00
Alex Vandiver	f5f6a3789b	restart-server: Default to running config and database checks. If there is a syntax error in `settings.py`, `restart-server` should provide a reasonable message about this. It did so prior to `af08bcdb3f`, because any invocation of `./manage.py` without `--skip-checks` will verify `settings.py`, among several other checks. After `af08bcdb3f`, there are no `./manage.py` calls in most restarts, which `fa77be6e6c` took further. Add an explicit `./manage.py check` in the default case. upgrade-zulip-stage-2 overrides this by passing `--skip-checks`, for performance. This also means that `upgrade-zulip-from-git` itself picks up the same `--skip-checks` flag, since it inherits the same flag parsing, though that is perhaps of dubious utility.	2022-10-14 13:10:46 -07:00
Alex Vandiver	fa77be6e6c	upgrade: Only run Django system checks once, explicitly. These are expensive, and moving them to one explicit call early has considerable time savings in the critical period: ``` $ hyperfine './manage.py fill_memcached_caches' './manage.py fill_memcached_caches --skip-checks' Benchmark #1: ./manage.py fill_memcached_caches Time (mean ± σ): 5.264 s ± 0.146 s [User: 4.885 s, System: 0.344 s] Range (min … max): 5.119 s … 5.569 s 10 runs Benchmark #2: ./manage.py fill_memcached_caches --skip-checks Time (mean ± σ): 3.090 s ± 0.089 s [User: 2.853 s, System: 0.214 s] Range (min … max): 2.950 s … 3.204 s 10 runs Summary './manage.py fill_memcached_caches --skip-checks' ran 1.70 ± 0.07 times faster than './manage.py fill_memcached_caches' ```	2022-05-22 14:52:38 -07:00
Alex Vandiver	3928606886	restart-server: Treat as a start if nothing is running. Treating the restart as a start is important in reducing the critical period during upgrades -- we call restart even when we suspect the services are stopped, because puppet has a small possibility of placing them in indeterminate state. However, restart orders the workers first, then tornado/django, which prolongs the outage. Recognize when no services are currently started, and switch to acting like a start, not a restart, which places tornado/django first.	2022-05-22 14:52:38 -07:00
Alex Vandiver	7c4293a7d3	restart-server: Check if service is running before restart, vs start. In some instances (e.g. during upgrades) we run `restart-server` and not `start-server`, even though we expect the server to most likely already be stopped. `supervisorctl restart servicename` if the service is stopped produces the perhaps-alarming message: ``` restart-server: Restarting servicename servicename: ERROR (not running) servicename: started ``` This may cause operators to worry that something is broken, when it is not. Check if the service is already running, and switch from "restart" to "start" in cases where it is not. The race condition here is safe -- if the service transitions from stopped to started between the check and the `start` call, it will merely output: ``` servicename: ERROR (already started) ``` ...and continue, as that has exit status 0. If the service transitions from started to stopped between the check and the `restart` call, we are merely back in the current case, where it outputs: ``` servicename: ERROR (not running) servicename: started ``` In none of these cases does a call to "restart" fail to result in the service being stopped and then started.	2022-03-09 14:42:15 -08:00
Alex Vandiver	2066860ab6	start-server: Start auxiliary services, if they exist. Services like go-camo and smokescreen are not stopped in stop-server, since they are upgraded and restarted by puppet application. As such, they also do not appear in start-server, despite the server relying on them to be running to function properly. Ensure those services are started, by starting them in start-server, if they are configured in supervisor on the host.	2022-01-26 12:39:54 -08:00
Alex Vandiver	88c3f560ae	supervisor: Add a filter for only(-not)-running.	2022-01-26 12:39:54 -08:00
Alex Vandiver	7243c3c73d	scripts: Re-implement list_supervisor_processes using API.	2022-01-26 12:39:54 -08:00
Alex Vandiver	6218ed91c2	puppet: Use lazy-apps and uwsgi control sockets for rolling reloads. Restarting the uwsgi processes by way of supervisor opens a window during which nginx 502's all responses. uwsgi has a configuration called "chain reloading" which allows for rolling restart of the uwsgi processes, such that only one process at a time is unavailable; see uwsgi documentation ([1]). The tradeoff is that this requires that the uwsgi processes load the libraries after forking, rather than before ("lazy apps"); in theory this can lead to larger memory footprints, since they are not shared. In practice, as Django defers much of the loading, this is not as much of an issue. In a very basic test of memory consumption (measured by total memory - free - caches - buffers; 6 uwsgi workers), both immediately after restarting Django, and after requesting `/` 60 times with 6 concurrent requests: \| Non-lazy \| Lazy app \| Difference ------------------+------------+------------+------------- Fresh \| 2,827,216 \| 2,870,480 \| +43,264 After 60 requests \| 3,332,284 \| 3,409,608 \| +77,324 ..................\|............\|............\|............. Difference \| +505,068 \| +539,128 \| +34,060 That is, "lazy app" loading increased the footprint pre-requests by 43MB, and after 60 requests grew the memory footprint by 539MB, as opposed to non-lazy loading, which grew it by 505MB. Using wsgi "lazy app" loading does increase the memory footprint, but not by a large percentage. The other effect is that processes may be served by either old or new code during the restart window. This may cause transient failures when new frontend code talks to old backend code. Enable chain-reloading during graceful, puppetless restarts, but only if enabled via a zulip.conf configuration flag. Fixes #2559. [1]: https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#chain-reloading-lazy-apps	2022-01-05 14:48:52 -08:00
Alex Vandiver	fb3368b482	restart-server: Factor out argparser, to allow reuse.	2021-12-31 11:17:14 -08:00
Alex Vandiver	939d2e2705	scripts: Only stop/start existing tornado processes. Stopping both `zulip-tornado` and `zulip-tornado:` causes errors on deploys with tornado sharding, as the plain `zulip-tornado` service does not exist. Pass `zulip-tornado:`, which matches both plain `zulip-tornado`, as well as the sharded `zulip-tornado:zulip-tornado-port-9800` cases.	2021-12-08 14:06:06 -08:00
Alex Vandiver	c9bb2c16cc	restart-server: Add a --skip-tornado. Tornado restarts are the most user-visible; provide a means to restart everything but them, for changes which are known to not affect Tornado.	2021-08-04 10:57:53 -07:00
Alex Vandiver	16691110a6	scripts: Only stop/restart zulip_deliver_scheduled_* processes if known. Running `supervisorctl stop` or `supervisorctl restart` on a process name which is not known is an error: ``` $ supervisorctl stop nonexistent-process nonexistent-process: ERROR (no such process) $ echo $? 1 ``` `ef6d0ec5ca` moved zulip_deliver_scheduled_* out of the `workers:` group. Since upgrades run `stop-server` before applying puppet, the list of processes at that time is from the previous version of Zulip, so may not have the new `zulip_deliver_scheduled_` names -- and the `stop-server` will hence fail. If the upgrade is not applying puppet, it will `restart-server`. At that point, the old names will still be in the configuration, so relying on the current `supervisorctl status` is the best gauge of what exists to restart. In short, only ever stop/start/restart the `zulip_deliver_scheduled_` processes if `supervisorctl status` knows about them already.	2021-07-09 10:04:53 -07:00
Alex Vandiver	85a9c0982a	zulip_tools: Extract out `list_supervisor_processes`.	2021-07-09 10:04:53 -07:00
Gaurav Pandey	af08bcdb3f	management: Delete send_stats command. This command is part of a statsd infrastructure that we stopped supporting years ago. Its only purpose for some time has been to provide sample code for how the restart script might trigger a notification to a graphing system, which doesn't justify maintaining it. Fixes part of #18898.	2021-06-25 09:13:48 -07:00
Alex Vandiver	d51272cc3d	puppet: Remove zulip_deliver_scheduled_* from zulip-workers:. Staging and other hosts that are `zulip::app_frontend_base` but not `zulip::app_frontend_once` do not have a /etc/supervisor/conf.d/zulip/zulip-once.conf and as such do not have `zulip_deliver_scheduled_emails` or `zulip_deliver_scheduled_messages` and thus supervisor will fail to reload. Making the contents of `zulip-workers` contingent on if the server is _also_ a `-once` server is complicated, and would involve using Concat fragments, which severely limit readability. Instead, expel those two from `zulip-workers`; this is somewhat reasonable, since they use an entirely different codepath from zulip_events_, using the database rather than RabbitMQ for their queuing.	2021-06-14 17:12:59 -07:00
Tim Abbott	de47feab43	scripts: Fix check for services running when upgrading. When upgrading from a pre-4.0 release, scripts/stop-server logic would check whether supervisord configuration files were present to determine what it needed to restart, but only considered paths to those files that are introduced in Zulip 4.0. Fixed #18493.	2021-05-13 18:57:19 -07:00
Robert Imschweiler	534d78232c	scripts: Add {start,stop,restart}-server support for postgresql role. During the upgrade process of a postgresql-only Zulip installation, (`puppet_classes = zulip::profile::postgresql` in `/etc/zulip/zulip.conf`) either `scripts/start-server` or `scripts/stop-server` fail because they try to handle supervisor services that are not available (e.g. Tornado) since only `/etc/supervisor/conf.d/zulip/zulip_db.conf` is present and not `/etc/supervisor/conf.d/zulip/zulip.conf`. While this wasn't previously supported, it's a pretty reasonable thing to do, and can be readily supported by just adding a few conditionals.	2021-05-07 09:41:05 -07:00
Anders Kaseorg	9d57fa9759	puppet: Use pgrep -x to avoid accidental matches. Matching the full process name (-x without -f) or full command line (-xf) is less prone to mistakes like matching a random substring of some other command line or pgrep matching itself. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-05-07 08:54:41 -07:00
Anders Kaseorg	405bc8dabf	requirements: Remove Thumbor. Thumbor and tc-aws have been dragging their feet on Python 3 support for years, and even the alphas and unofficial forks we’ve been running don’t seem to be maintained anymore. Depending on these projects is no longer viable for us. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-05-06 20:07:32 -07:00
Alex Vandiver	daabc52a78	restart-server: Reorder supervisorctl calls for less downtime. Instead of taking the "onion" approach, where all services are stopped, and then started back up again, default to a rolling restart across all processes. This draws out how long the overall "restart" takes, but minimizes the time that any of the services are down. This minimizes user-visible impact and queue buildup. In cases where speed is more important than minimal impact (for example, there is already a current outage), a --less-graceful flag is provided, which brings the services down more suddenly, and back up in a still-correct order.	2021-04-30 16:47:15 -07:00
Alex Vandiver	ec12a6128a	scripts: Add a start-server as well. In general, `./scripts/restart-server` will already work in any circumstance where the server is already stopped and needs to be started. However, it will output a couple minor warnings, and it is not readily obvious that it will work correctly. Add an alias for `restart-server` named `start-server`, for parallelism with `stop-server`, which omits the steps of `restart-server` which would stop the server first.	2021-04-21 10:24:08 -07:00
Alex Vandiver	31169526ec	scripts: Say "Zulip" rather than "Application".	2021-04-21 10:24:08 -07:00
Alex Vandiver	0de8357820	scripts: Fix path to additional Zulip supervisor files. The path which contains all of the Zulip supervisor files changed in `3ab9b31d2f` to make it easier to purge now-unwanted supervisor configuration files. However, the paths that the zulip upgrade process, and restart-server, look at were not adjusted. Fix the supervisor configuration file paths.	2021-04-21 10:24:08 -07:00
Anders Kaseorg	6e4c3e41dc	python: Normalize quotes with Black. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-02-12 13:11:19 -08:00
Anders Kaseorg	11741543da	python: Reformat with Black, except quotes. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-02-12 13:11:19 -08:00
Alex Vandiver	2a12fedcf1	tornado: Remove explicit tornado_processes setting; compute it. We can compute the intended number of processes from the sharding configuration. In doing so, also validate that all of the ports are contiguous. This removes a discrepancy between `scripts/lib/sharding.py` and other parts of the codebase about if merely having a `[tornado_sharding]` section is sufficient to enable sharding. Having behaviour which changes merely based on if an empty section exists is surprising. This does require that a (presumably empty) `9800` configuration line exist, but making that default explicit is useful. After this commit, configuring sharding can be done by adding to `zulip.conf`: ``` [tornado_sharding] 9800 = # default 9801 = other_realm ``` Followed by running `./scripts/refresh-sharding-and-restart`.	2020-09-18 15:13:40 -07:00
Alex Vandiver	efdaa58c24	supervisor: Use more specific process_name than "port-9800". Making this include "zulip-tornado" makes it clearer in supervisor logs. Without this, one only sees: ``` 2020-09-14 03:43:13,788 INFO waiting for port-9807 to stop 2020-09-14 03:43:14,466 INFO stopped: port-9807 (exit status 1) 2020-09-14 03:43:14,469 INFO spawned: 'port-9807' with pid 24289 2020-09-14 03:43:15,470 INFO success: port-9807 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) ```	2020-09-14 22:17:51 -07:00
Alex Vandiver	dc58dec231	restart-server: Start services in opposite order from stop. `supervisorctl` starts and stops its arguments sequentially, in the order they are passed[1]. Start them in the opposite order from the order in which they were stopped -- this puts the dependencies first, and the most core services (`zulip-django`) last. While the only "dependency" here is currently thumbor, this sets us up in case others are added later. [1] https://github.com/Supervisor/supervisor/blob/master/supervisor/supervisorctl.py#L782	2020-09-14 16:27:15 -07:00
Anders Kaseorg	b4597a8ca8	python: Elide default for store_{true,false} argparse arguments. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-09-03 16:17:14 -07:00
Anders Kaseorg	1ded51aa9d	python: Replace list literal concatenation with * unpacking. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-09-02 11:15:41 -07:00
Anders Kaseorg	a5dbab8fb0	python: Remove redundant dest for argparse arguments. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-09-02 11:04:10 -07:00
Anders Kaseorg	5dc9b55c43	python: Manually convert more percent-formatting to f-strings. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-06-14 23:27:22 -07:00
Anders Kaseorg	365fe0b3d5	python: Sort imports with isort. Fixes #2665. Regenerated by tabbott with `lint --fix` after a rebase and change in parameters. Note from tabbott: In a few cases, this converts technical debt in the form of unsorted imports into different technical debt in the form of our largest files having very long, ugly import sequences at the start. I expect this change will increase pressure for us to split those files, which isn't a bad thing. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-06-11 16:45:32 -07:00
Anders Kaseorg	67e7a3631d	python: Convert percent formatting to Python 3.6 f-strings. Generated by pyupgrade --py36-plus. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-06-10 15:02:09 -07:00
Anders Kaseorg	333f7d16c9	logging: Pass more format arguments to logging. Commit `bdc365d0fe` (#14852) missed this because of https://github.com/returntocorp/semgrep/issues/831. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-05-26 11:42:23 -07:00
Tim Abbott	0f1bdcc46f	restart-server: Restart Tornado processes individually. After some testing, I've confirmed that this seems to behave significantly better in terms of the number of failed requests due to Tornado being in the process of restarting compared with the previous version, as each individual process is only down for a short time, rather than all of them being down at once.	2020-03-27 06:23:34 -07:00
Anders Kaseorg	ea6934c26d	dependencies: Remove WebSockets system for sending messages. Zulip has had a small use of WebSockets (specifically, for the code path of sending messages, via the webapp only) since ~2013. We originally added this use of WebSockets in the hope that the latency benefits of doing so would allow us to avoid implementing a markdown local echo; they were not. Further, HTTP/2 may have eliminated the latency difference we hoped to exploit by using WebSockets in any case. While we’d originally imagined using WebSockets for other endpoints, there was never a good justification for moving more components to the WebSockets system. This WebSockets code path had a lot of downsides/complexity, including: * The messy hack involving constructing an emulated request object to hook into doing Django requests. * The `message_senders` queue processor system, which increases RAM needs and must be provisioned independently from the rest of the server. * A duplicate check_send_receive_time Nagios test specific to WebSockets. * The requirement for users to have their firewalls/NATs allow WebSocket connections, and a setting to disable them for networks where WebSockets don’t work. * Dependencies on the SockJS family of libraries, which has at times been poorly maintained, and periodically throws random JavaScript exceptions in our production environments without a deep enough traceback to effectively investigate. * A total of about 1600 lines of our code related to the feature. * Increased load on the Tornado system, especially around a Zulip server restart, and especially for large installations like zulipchat.com, resulting in extra delay before messages can be sent again.
As detailed in https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it appears that removing WebSockets moderately increases the time it takes for the `send_message` API query to return from the server, but does not significantly change the time between when a message is sent and when it is received by clients. We don’t understand the reason for that change (suggesting the possibility of a measurement error), and even if it is a real change, we consider that potential small latency regression to be acceptable. If we later want WebSockets, we’ll likely want to just use Django Channels. Signed-off-by: Anders Kaseorg <anders@zulipchat.com>	2020-01-14 22:34:00 -08:00
Anders Kaseorg	8d91bebf95	restart-server: Warn if the shell’s PWD goes through an updated symlink. Signed-off-by: Anders Kaseorg <anders@zulipchat.com>	2019-09-21 12:02:15 -07:00
Harshit Bansal	50ef91bb08	scripts: Add argparse option to `restart-server` for `--fill-cache`. Now, unless you specify `--fill-cache`, memcached caches will not be pre-filled after a server restart. This is helpful when someone is in a hurry (e.g. if the server is down right now, or if they are testing a configuration change on a newly set up server), since it's best to just restart without pre-filling the cache. Fixes: #10900.	2019-01-14 15:20:01 -08:00
Anders Kaseorg	a694c3cafd	scripts/restart-server: Avoid shelling out for ln. Signed-off-by: Anders Kaseorg <andersk@mit.edu>	2018-11-28 17:26:54 -08:00
Tim Abbott	5a56925495	restart-server: Fix restarting server with multiple tornado processes. Previously, we unconditionally tried to restart the Tornado process name corresponding to the historically always-true case of a single Tornado process. This resulted in Tornado not being automatically restarted on a production deployment on servers with more than one Tornado process configured.	2018-11-27 17:20:05 -08:00
Tim Abbott	1a0e9fe2f9	restart-server: Restart tornado early. This dramatically reduces the Tornado downtime when restarting a Zulip server, which is generally the most significant source of user-facing bad experiences.	2018-10-16 15:04:07 -07:00
Abhilash Verma	0e2322a322	logging: Show timestamp in UTC in non-django production scripts. Done in pair programming with @aero31aero. Fixes #9678.	2018-08-20 12:52:40 -07:00
Tim Abbott	a8e5551395	restart-server: Ensure we restart process-fts-updates. This is mostly important in that if you're running this as part of a follow-up to a failed upgrade, and you don't do this, process-fts-updates will be left not running, resulting in full-text search not updating.	2018-07-30 16:27:53 -07:00
Joshua Schmidlkofer	b1a57d144f	thumbor: Add production installer/puppet support. This commit adds the necessary puppet configuration and installer/upgrade code for installing and managing the thumbor service in production. This configuration is gated by the 'thumbor.pp' manifest being enabled (which is not yet the default), and so this commit should have no effect in a default Zulip production environment (or in the long term, in any Zulip production server that isn't using thumbor). Credit for this effort is shared by @TigorC (who initiated the work on this project), @joshland (who did a great deal of work on this and got it working during PyCon 2017) and @adnrs96, who completed the work.	2018-07-12 20:37:34 +05:30
rht	71188d7b0a	scripts: Remove import print_function.	2017-09-29 15:43:30 -07:00
Greg Price	a099e698e2	py3: Switch almost all shebang lines to use `python3`. This causes `upgrade-zulip-from-git`, as well as a no-option run of `tools/build-release-tarball`, to produce a Zulip install running Python 3, rather than Python 2. In particular this means that the virtualenv we create, in which all application code runs, is Python 3. One shebang line, on `zulip-ec2-configure-interfaces`, explicitly keeps Python 2, and at least one external ops script, `wal-e`, also still runs on Python 2. See discussion on the respective previous commits that made those explicit. There may also be some other third-party scripts we use, outside of this source tree and running outside our virtualenv, that still run on Python 2.	2017-08-16 17:54:43 -07:00
Umair Khan	336a041ac0	Django 1.10: Use uWSGI. Fixes: #1121 With some tweaks by tabbott to make the number of processes configurable.	2016-12-13 21:40:43 -08:00
Anders Kaseorg	207cf6302b	Always start python via shebang lines. This is preparation for supporting using Python 3 in production. Signed-off-by: Anders Kaseorg <andersk@mit.edu>	2016-11-26 14:46:37 -08:00
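
The start-versus-restart choice described in commit `7c4293a7d3` can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code from `restart-server`: the function name `plan_supervisor_actions` and the `running` mapping are hypothetical, and the sketch builds a plan rather than calling `supervisorctl`.

```python
from typing import Dict, List, Tuple


def plan_supervisor_actions(
    services: List[str], running: Dict[str, bool]
) -> List[Tuple[str, str]]:
    """Pick "restart" for services that are running and "start" otherwise.

    `running` is a hypothetical mapping from service name to whether the
    service is currently up; a service missing from it is treated as stopped.
    """
    plan = []
    for name in services:
        action = "restart" if running.get(name, False) else "start"
        plan.append((action, name))
    return plan
```

Services are kept in the order given, so the caller still controls sequencing. As the commit message notes, the state can change between the check and the call; in this sketch that only means a planned action may be stale by the time it is carried out.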

68 Commits
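
Commit `2a12fedcf1` describes computing the number of Tornado processes from the `[tornado_sharding]` section and validating that the ports are contiguous. A minimal sketch of that idea follows. It is not the code in `scripts/lib/sharding.py`: the function name `contiguous_tornado_ports` is hypothetical, the keys are assumed to be plain port numbers as in the commit's example, and falling back to a single port when the section is empty is an assumption of this sketch.

```python
from typing import Dict, List


def contiguous_tornado_ports(
    sharding: Dict[str, str], base_port: int = 9800
) -> List[int]:
    """Return the sorted Tornado ports, checking that they are contiguous.

    `sharding` stands in for the key/value pairs of a [tornado_sharding]
    section, e.g. {"9800": "", "9801": "other_realm"}.
    """
    ports = sorted(int(key) for key in sharding)
    if not ports:
        # Assumption of this sketch: no sharding means one process on base_port.
        return [base_port]
    expected = list(range(base_port, base_port + len(ports)))
    if ports != expected:
        raise ValueError(f"Ports must be contiguous from {base_port}: {ports}")
    return ports
```

The number of processes is then `len(contiguous_tornado_ports(section))`, so no separate process-count setting is needed.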