zulip

Commit Graph

Author	SHA1	Message	Date
Alex Vandiver	27d53ecbe1	restart-server: Remove --skip-tornado flag. This flag was generally used not because we wanted to avoid restarting Tornado, but because we wanted to avoid increasing load on the server when all of the clients were told to reload. Since we have laid the groundwork for separately telling Tornado to tell clients to restart, we remove the --skip-tornado flag; the next commit will add the ability to skip client restarts.	2024-02-15 15:42:50 -08:00
Alex Vandiver	0115fa9c60	start-server/restart-server: Drop privileges if necessary. Rather than tell the user to re-run the command as `zulip` instead of `root`, do the privilege-dropping ourselves.	2024-02-07 12:33:00 -08:00
Anders Kaseorg	d257002ad8	scripts: Use setup_path in restart-server, stop-server. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-10-12 12:28:41 -07:00
Anders Kaseorg	c43629a222	ruff: Fix PLW1510 `subprocess.run` without explicit `check` argument. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-08-17 17:05:34 -07:00
Alex Vandiver	f4683de742	puppet: Switch the `rolling_restart` setting to use the bool values. `2c5fc1827c` standardized which values are "true"; use them.	2023-05-11 15:54:15 -07:00
Alex Vandiver	dd5dbcabcb	start-server: More gracefully handle only starting part of the server. While the previous commit handles the common case of all of the server being started already, it still produces ERROR output lines from supervisorctl when most of the server is already running. Take the case where one worker is stopped: ``` $ supervisorctl stop zulip-workers:zulip_events_deferred_work zulip-workers:zulip_events_deferred_work: stopped $ ./scripts/start-server 2023-04-04 15:50:28,505 start-server: Running syntax and database checks System check identified no issues (15 silenced). 2023-04-04 15:50:31,977 start-server: Starting Tornado process on port 9800 zulip-tornado:zulip-tornado-port-9800: ERROR (already started) 2023-04-04 15:50:32,283 start-server: Starting Tornado process on port 9801 zulip-tornado:zulip-tornado-port-9801: ERROR (already started) 2023-04-04 15:50:32,592 start-server: Starting django server zulip-django: ERROR (already started) 2023-04-04 15:50:33,340 start-server: Starting workers zulip-workers:zulip_events_deferred_work: started zulip_deliver_scheduled_emails: ERROR (already started) zulip_deliver_scheduled_messages: ERROR (already started) process-fts-updates: ERROR (already started) 2023-04-04 15:50:34,659 start-server: Done! Zulip started successfully! ``` More gracefully handle these cases: ``` $ ./scripts/start-server 2023-04-04 15:52:39,815 start-server: Running syntax and database checks System check identified no issues (15 silenced). 2023-04-04 15:52:43,270 start-server: Starting Tornado process on port 9800 2023-04-04 15:52:43,287 start-server: zulip-tornado:zulip-tornado-port-9800 already started! 2023-04-04 15:52:43,287 start-server: Starting Tornado process on port 9801 2023-04-04 15:52:43,300 start-server: zulip-tornado:zulip-tornado-port-9801 already started! 2023-04-04 15:52:43,300 start-server: Starting django server 2023-04-04 15:52:43,316 start-server: zulip-django already started! 
2023-04-04 15:52:43,793 start-server: Starting workers zulip-workers:zulip_events_deferred_work: started 2023-04-04 15:52:45,111 start-server: Done! Zulip started successfully! ```	2023-04-04 10:58:56 -07:00
Alex Vandiver	cb097760b9	start-server: Make start-server a clean explicit no-op if already running. Currently, the output from `start-server` if the server is already running is potentially confusing, since it says ERROR several times: ``` $ ./scripts/start-server 2023-04-04 15:35:12,737 start-server: Running syntax and database checks System check identified no issues (15 silenced). 2023-04-04 15:35:16,211 start-server: Starting Tornado process on port 9800 zulip-tornado:zulip-tornado-port-9800: ERROR (already started) 2023-04-04 15:35:16,528 start-server: Starting Tornado process on port 9801 zulip-tornado:zulip-tornado-port-9801: ERROR (already started) 2023-04-04 15:35:16,844 start-server: Starting django server zulip-django: ERROR (already started) 2023-04-04 15:35:17,605 start-server: Starting workers zulip_deliver_scheduled_emails: ERROR (already started) zulip_deliver_scheduled_messages: ERROR (already started) process-fts-updates: ERROR (already started) 2023-04-04 15:35:18,923 start-server: Done! ``` Catch the simple common case where all of the services are already running, and output a clearer success message: ``` $ ./scripts/start-server 2023-04-04 15:39:52,367 start-server: Running syntax and database checks System check identified no issues (15 silenced). 2023-04-04 15:39:55,857 start-server: Zulip is already started; nothing to do! ```	2023-04-04 10:58:56 -07:00
Anders Kaseorg	4eda29bd86	ruff: Fix RUF005 Consider spread instead of concatenation. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-01-26 10:16:30 -08:00
Alex Vandiver	f5f6a3789b	restart-server: Default to running config and database checks. If there is a syntax error in `settings.py`, `restart-server` should provide a reasonable message about this. It did so prior to `af08bcdb3f`, because any invocation of `./manage.py` without `--skip-checks` will verify `settings.py`, among several other checks. After `af08bcdb3f`, there are no `./manage.py` calls in most restarts, which `fa77be6e6c` took further. Add an explicit `./manage.py check` in the default case. upgrade-zulip-stage-2 overrides this by passing `--skip-checks`, for performance. This also means that `upgrade-zulip-from-git` itself picks up the same `--skip-checks` flag, since it inherits the same flag parsing, though that is perhaps of dubious utility.	2022-10-14 13:10:46 -07:00
Alex Vandiver	fa77be6e6c	upgrade: Only run Django system checks once, explicitly. These are expensive, and moving them to one explicit call early has considerable time savings in the critical period: ``` $ hyperfine './manage.py fill_memcached_caches' './manage.py fill_memcached_caches --skip-checks' Benchmark #1: ./manage.py fill_memcached_caches Time (mean ± σ): 5.264 s ± 0.146 s [User: 4.885 s, System: 0.344 s] Range (min … max): 5.119 s … 5.569 s 10 runs Benchmark #2: ./manage.py fill_memcached_caches --skip-checks Time (mean ± σ): 3.090 s ± 0.089 s [User: 2.853 s, System: 0.214 s] Range (min … max): 2.950 s … 3.204 s 10 runs Summary './manage.py fill_memcached_caches --skip-checks' ran 1.70 ± 0.07 times faster than './manage.py fill_memcached_caches' ```	2022-05-22 14:52:38 -07:00
Alex Vandiver	3928606886	restart-server: Treat as a start if nothing is running. Treating the restart as a start is important in reducing the critical period during upgrades -- we call restart even when we suspect the services are stopped, because puppet has a small possibility of placing them in indeterminate state. However, restart orders the workers first, then tornado/django, which prolongs the outage. Recognize when no services are currently started, and switch to acting like a start, not a restart, which places tornado/django first.	2022-05-22 14:52:38 -07:00
Alex Vandiver	7c4293a7d3	restart-server: Check if service is running before restart, vs start. In some instances (e.g. during upgrades) we run `restart-server` and not `start-server`, even though we expect the server to most likely already be stopped. `supervisorctl restart servicename` if the service is stopped produces the perhaps-alarming message: ``` restart-server: Restarting servicename servicename: ERROR (not running) servicename: started ``` This may cause operators to worry that something is broken, when it is not. Check if the service is already running, and switch from "restart" to "start" in cases where it is not. The race condition here is safe -- if the service transitions from stopped to started between the check and the `start` call, it will merely output: ``` servicename: ERROR (already started) ``` ...and continue, as that has exit status 0. If the service transitions from started to stopped between the check and the `restart` call, we are merely back in the current case, where it outputs: ``` servicename: ERROR (not running) servicename: started ``` In none of these cases does a call to "restart" fail to result in the service being stopped and then started.	2022-03-09 14:42:15 -08:00
Alex Vandiver	2066860ab6	start-server: Start auxiliary services, if they exist. Services like go-camo and smokescreen are not stopped in stop-server, since they are upgraded and restarted by puppet application. As such, they also do not appear in start-server, despite the server relying on them to be running to function properly. Ensure those services are started, by starting them in start-server, if they are configured in supervisor on the host.	2022-01-26 12:39:54 -08:00
Alex Vandiver	88c3f560ae	supervisor: Add a filter for only(-not)-running.	2022-01-26 12:39:54 -08:00
Alex Vandiver	7243c3c73d	scripts: Re-implement list_supervisor_processes using API.	2022-01-26 12:39:54 -08:00
Alex Vandiver	6218ed91c2	puppet: Use lazy-apps and uwsgi control sockets for rolling reloads. Restarting the uwsgi processes by way of supervisor opens a window during which nginx 502's all responses. uwsgi has a configuration called "chain reloading" which allows for rolling restart of the uwsgi processes, such that only one process at a time is unavailable; see uwsgi documentation ([1]). The tradeoff is that this requires that the uwsgi processes load the libraries after forking, rather than before ("lazy apps"); in theory this can lead to larger memory footprints, since they are not shared. In practice, as Django defers much of the loading, this is not as much of an issue. In a very basic test of memory consumption (measured by total memory - free - caches - buffers; 6 uwsgi workers), both immediately after restarting Django, and after requesting `/` 60 times with 6 concurrent requests: \| Non-lazy \| Lazy app \| Difference ------------------+------------+------------+------------- Fresh \| 2,827,216 \| 2,870,480 \| +43,264 After 60 requests \| 3,332,284 \| 3,409,608 \| +77,324 ..................\|............\|............\|............. Difference \| +505,068 \| +539,128 \| +34,060 That is, "lazy app" loading increased the footprint pre-requests by 43MB, and after 60 requests grew the memory footprint by 539MB, as opposed to non-lazy loading, which grew it by 505MB. Using wsgi "lazy app" loading does increase the memory footprint, but not by a large percentage. The other effect is that processes may be served by either old or new code during the restart window. This may cause transient failures when new frontend code talks to old backend code. Enable chain-reloading during graceful, puppetless restarts, but only if enabled via a zulip.conf configuration flag. Fixes #2559. [1]: https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#chain-reloading-lazy-apps	2022-01-05 14:48:52 -08:00
Alex Vandiver	fb3368b482	restart-server: Factor out argparser, to allow reuse.	2021-12-31 11:17:14 -08:00
Alex Vandiver	939d2e2705	scripts: Only stop/start existing tornado processes. Stopping both `zulip-tornado` and `zulip-tornado:` causes errors on deploys with tornado sharding, as the plain `zulip-tornado` service does not exist. Pass `zulip-tornado:`, which matches both plain `zulip-tornado`, as well as the sharded `zulip-tornado:zulip-tornado-port-9800` cases.	2021-12-08 14:06:06 -08:00
Alex Vandiver	c9bb2c16cc	restart-server: Add a --skip-tornado. Tornado restarts are the most user-visible; provide a means to restart everything but them, for changes which are known to not affect Tornado.	2021-08-04 10:57:53 -07:00
Alex Vandiver	16691110a6	scripts: Only stop/restart zulip_deliver_scheduled_* processes if known. Running `supervisorctl stop` or `supervisorctl restart` on a process name which is not known is an error: ``` $ supervisorctl stop nonexistent-process nonexistent-process: ERROR (no such process) $ echo $? 1 ``` `ef6d0ec5ca` moved zulip_deliver_scheduled_* out of the `workers:` group. Since upgrades run `stop-server` before applying puppet, the list of processes at that time is from the previous version of Zulip, so may not have the new `zulip_deliver_scheduled_` names -- and the `stop-server` will hence fail. If the upgrade is not applying puppet, it will `restart-server`. At that point, the old names will still be in the configuration, so relying on the current `supervisorctl status` is the best gauge of what exists to restart. In short, only ever stop/start/restart the `zulip_deliver_scheduled_` processes if `supervisorctl status` knows about them already.	2021-07-09 10:04:53 -07:00
Alex Vandiver	85a9c0982a	zulip_tools: Extract out `list_supervisor_processes`.	2021-07-09 10:04:53 -07:00
Gaurav Pandey	af08bcdb3f	management: Delete send_stats command. This command is part of a statsd infrastructure that we stopped supporting years ago. Its only purpose for some time has been to provide sample code for how the restart script might trigger a notification to a graphing system, which doesn't justify maintaining it. Fixes part of #18898.	2021-06-25 09:13:48 -07:00
Alex Vandiver	d51272cc3d	puppet: Remove zulip_deliver_scheduled_* from zulip-workers:. Staging and other hosts that are `zulip::app_frontend_base` but not `zulip::app_frontend_once` do not have a /etc/supervisor/conf.d/zulip/zulip-once.conf and as such do not have `zulip_deliver_scheduled_emails` or `zulip_deliver_scheduled_messages` and thus supervisor will fail to reload. Making the contents of `zulip-workers` contingent on if the server is _also_ a `-once` server is complicated, and would involve using Concat fragments, which severely limit readability. Instead, expel those two from `zulip-workers`; this is somewhat reasonable, since they use an entirely different codepath from zulip_events_, using the database rather than RabbitMQ for their queuing.	2021-06-14 17:12:59 -07:00
Tim Abbott	de47feab43	scripts: Fix check for services running when upgrading. When upgrading from a pre-4.0 release, scripts/stop-server logic would check whether supervisord configuration files were present to determine what it needed to restart, but only considered paths to those files that are introduced in Zulip 4.0. Fixed #18493.	2021-05-13 18:57:19 -07:00
Robert Imschweiler	534d78232c	scripts: Add {start,stop,restart}-server support for postgresql role. During the upgrade process of a postgresql-only Zulip installation, (`puppet_classes = zulip::profile::postgresql` in `/etc/zulip/zulip.conf`) either `scripts/start-server` or `scripts/stop-server` fail because they try to handle supervisor services that are not available (e.g. Tornado) since only `/etc/supervisor/conf.d/zulip/zulip_db.conf` is present and not `/etc/supervisor/conf.d/zulip/zulip.conf`. While this wasn't previously supported, it's a pretty reasonable thing to do, and can be readily supported by just adding a few conditionals.	2021-05-07 09:41:05 -07:00
Anders Kaseorg	9d57fa9759	puppet: Use pgrep -x to avoid accidental matches. Matching the full process name (-x without -f) or full command line (-xf) is less prone to mistakes like matching a random substring of some other command line or pgrep matching itself. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-05-07 08:54:41 -07:00
Anders Kaseorg	405bc8dabf	requirements: Remove Thumbor. Thumbor and tc-aws have been dragging their feet on Python 3 support for years, and even the alphas and unofficial forks we’ve been running don’t seem to be maintained anymore. Depending on these projects is no longer viable for us. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-05-06 20:07:32 -07:00
Alex Vandiver	daabc52a78	restart-server: Reorder supervisorctl calls for less downtime. Instead of taking the "onion" approach, where all services are stopped, and then started back up again, default to a rolling restart across all processes. This draws out how long the overall "restart" takes, but minimizes the time that any of the services are down. This minimizes user-visible impact and queue buildup. In cases where speed is more important than minimal impact (for example, there is already a current outage), a --less-graceful flag is provided, which brings the services down more suddenly, and back up in a still-correct order.	2021-04-30 16:47:15 -07:00
Alex Vandiver	ec12a6128a	scripts: Add a start-server as well. In general, `./scripts/restart-server` will already work in any circumstance where the server is already stopped and needs to be started. However, it will output a couple minor warnings, and it is not readily obvious that it will work correctly. Add an alias for `restart-server` named `start-server`, for parallelism with `stop-server`, which omits the steps of `restart-server` which would stop the server first.	2021-04-21 10:24:08 -07:00
Alex Vandiver	31169526ec	scripts: Say "Zulip" rather than "Application".	2021-04-21 10:24:08 -07:00
Alex Vandiver	0de8357820	scripts: Fix path to additional Zulip supervisor files. The path which contains all of the Zulip supervisor files changed in `3ab9b31d2f` to make it easier to purge now-unwanted supervisor configuration files. However, the paths that the zulip upgrade process, and restart-server, look at were not adjusted. Fix the supervisor configuration file paths.	2021-04-21 10:24:08 -07:00
Anders Kaseorg	6e4c3e41dc	python: Normalize quotes with Black. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-02-12 13:11:19 -08:00
Anders Kaseorg	11741543da	python: Reformat with Black, except quotes. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-02-12 13:11:19 -08:00
Alex Vandiver	2a12fedcf1	tornado: Remove explicit tornado_processes setting; compute it. We can compute the intended number of processes from the sharding configuration. In doing so, also validate that all of the ports are contiguous. This removes a discrepancy between `scripts/lib/sharding.py` and other parts of the codebase about if merely having a `[tornado_sharding]` section is sufficient to enable sharding. Having behaviour which changes merely based on if an empty section exists is surprising. This does require that a (presumably empty) `9800` configuration line exist, but making that default explicit is useful. After this commit, configuring sharding can be done by adding to `zulip.conf`: ``` [tornado_sharding] 9800 = # default 9801 = other_realm ``` Followed by running `./scripts/refresh-sharding-and-restart`.	2020-09-18 15:13:40 -07:00
Alex Vandiver	efdaa58c24	supervisor: Use more specific process_name than "port-9800". Making this include "zulip-tornado" makes it clearer in supervisor logs. Without this, one only sees: ``` 2020-09-14 03:43:13,788 INFO waiting for port-9807 to stop 2020-09-14 03:43:14,466 INFO stopped: port-9807 (exit status 1) 2020-09-14 03:43:14,469 INFO spawned: 'port-9807' with pid 24289 2020-09-14 03:43:15,470 INFO success: port-9807 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) ```	2020-09-14 22:17:51 -07:00
Alex Vandiver	dc58dec231	restart-server: Start services in opposite order from stop. `supervisorctl` starts and stops its arguments sequentially, in the order they are passed[1]. Start them in the opposite order from the order in which they were stopped -- this puts the dependencies first, and the most core services (`zulip-django`) last. While the only "dependency" here is currently thumbor, this sets us up in case others are added later. [1] https://github.com/Supervisor/supervisor/blob/master/supervisor/supervisorctl.py#L782	2020-09-14 16:27:15 -07:00
Anders Kaseorg	b4597a8ca8	python: Elide default for store_{true,false} argparse arguments. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-09-03 16:17:14 -07:00
Anders Kaseorg	1ded51aa9d	python: Replace list literal concatenation with * unpacking. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-09-02 11:15:41 -07:00
Anders Kaseorg	a5dbab8fb0	python: Remove redundant dest for argparse arguments. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-09-02 11:04:10 -07:00
Anders Kaseorg	5dc9b55c43	python: Manually convert more percent-formatting to f-strings. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-06-14 23:27:22 -07:00
Anders Kaseorg	365fe0b3d5	python: Sort imports with isort. Fixes #2665. Regenerated by tabbott with `lint --fix` after a rebase and change in parameters. Note from tabbott: In a few cases, this converts technical debt in the form of unsorted imports into different technical debt in the form of our largest files having very long, ugly import sequences at the start. I expect this change will increase pressure for us to split those files, which isn't a bad thing. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-06-11 16:45:32 -07:00
Anders Kaseorg	67e7a3631d	python: Convert percent formatting to Python 3.6 f-strings. Generated by pyupgrade --py36-plus. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-06-10 15:02:09 -07:00
Anders Kaseorg	333f7d16c9	logging: Pass more format arguments to logging. Commit `bdc365d0fe` (#14852) missed this because of https://github.com/returntocorp/semgrep/issues/831. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-05-26 11:42:23 -07:00
Tim Abbott	0f1bdcc46f	restart-server: Restart Tornado processes individually. After some testing, I've confirmed that this seems to behave significantly better in terms of the number of failed requests due to Tornado being in the process of restarting compared with the previous version, as each individual process is only down for a short time, rather than all of them being down at once.	2020-03-27 06:23:34 -07:00
Anders Kaseorg	ea6934c26d	dependencies: Remove WebSockets system for sending messages. Zulip has had a small use of WebSockets (specifically, for the code path of sending messages, via the webapp only) since ~2013. We originally added this use of WebSockets in the hope that the latency benefits of doing so would allow us to avoid implementing a markdown local echo; they were not. Further, HTTP/2 may have eliminated the latency difference we hoped to exploit by using WebSockets in any case. While we’d originally imagined using WebSockets for other endpoints, there was never a good justification for moving more components to the WebSockets system. This WebSockets code path had a lot of downsides/complexity, including: * The messy hack involving constructing an emulated request object to hook into doing Django requests. * The `message_senders` queue processor system, which increases RAM needs and must be provisioned independently from the rest of the server). * A duplicate check_send_receive_time Nagios test specific to WebSockets. * The requirement for users to have their firewalls/NATs allow WebSocket connections, and a setting to disable them for networks where WebSockets don’t work. * Dependencies on the SockJS family of libraries, which has at times been poorly maintained, and periodically throws random JavaScript exceptions in our production environments without a deep enough traceback to effectively investigate. * A total of about 1600 lines of our code related to the feature. * Increased load on the Tornado system, especially around a Zulip server restart, and especially for large installations like zulipchat.com, resulting in extra delay before messages can be sent again. 
As detailed in https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it appears that removing WebSockets moderately increases the time it takes for the `send_message` API query to return from the server, but does not significantly change the time between when a message is sent and when it is received by clients. We don’t understand the reason for that change (suggesting the possibility of a measurement error), and even if it is a real change, we consider that potential small latency regression to be acceptable. If we later want WebSockets, we’ll likely want to just use Django Channels. Signed-off-by: Anders Kaseorg <anders@zulipchat.com>	2020-01-14 22:34:00 -08:00
Anders Kaseorg	8d91bebf95	restart-server: Warn if the shell’s PWD goes through an updated symlink. Signed-off-by: Anders Kaseorg <anders@zulipchat.com>	2019-09-21 12:02:15 -07:00
Harshit Bansal	50ef91bb08	scripts: Add argparse option to `restart-server` for `--fill-cache`. Now, unless you specify `--fill-cache`, memcached caches will not be pre-filled after a server restart. This is helpful when someone is in a hurry (e.g. if the server is down right now, or if they are testing a configuration change on a newly set up server) and it is best to just restart without pre-filling the cache. Fixes: #10900.	2019-01-14 15:20:01 -08:00
Anders Kaseorg	a694c3cafd	scripts/restart-server: Avoid shelling out for ln. Signed-off-by: Anders Kaseorg <andersk@mit.edu>	2018-11-28 17:26:54 -08:00
Tim Abbott	5a56925495	restart-server: Fix restarting server with multiple tornado processes. Previously, we unconditionally tried to restart the Tornado process name corresponding to the historically always-true case of a single Tornado process. This resulted in Tornado not being automatically restarted on a production deployment on servers with more than one Tornado process configured.	2018-11-27 17:20:05 -08:00
Tim Abbott	1a0e9fe2f9	restart-server: Restart tornado early. This dramatically reduces the Tornado downtime when restarting a Zulip server, which is generally the most significant source of user-facing bad experiences.	2018-10-16 15:04:07 -07:00

75 Commits