zulip

Commit Graph

Author	SHA1	Message	Date
Anders Kaseorg	f3f5dfb5aa	ruff: Fix RUF004 exit() is only available in the interpreter. ‘exit’ is pulled in for the interactive interpreter as a side effect of the site module; this can be disabled with python -S and shouldn’t be relied on. Also, use the NoReturn type where appropriate. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-12-04 22:11:24 -08:00
Alex Vandiver	ea9988cc9e	grafana: Upgrade to 9.3.0.	2022-11-30 12:41:18 -05:00
Alex Vandiver	7069e2c8c2	puppet: Align more sections of $versions.	2022-11-30 12:13:47 -05:00
Alex Vandiver	89f20140c0	wal-g: Use pre-built aarch64 binary, rather than building from source. Starting with wal-g 2.0.1, they provide `aarch64` assets[^1]. Effectively revert `d7b59c86ce`, and use the pre-built binary for `aarch64` rather than spend a bunch of space and time having to build it from source. [^1]: https://github.com/wal-g/wal-g/releases/tag/v2.0.1	2022-11-30 12:13:47 -05:00
Anders Kaseorg	e5c26eeb86	tornado: Support sharding by user ID. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-11-15 17:27:01 -08:00
Alex Vandiver	03f0cb07ff	puppet: Upgrade puppetlabs libraries.	2022-11-08 13:26:32 -08:00
Alex Vandiver	6517e4b239	puppet: Update third-party package versions.	2022-11-08 13:26:32 -08:00
Alex Vandiver	521ec5885b	puppet: Rename autossh tunnel, as it is no longer for just munin.	2022-11-01 22:24:40 -07:00
Alex Vandiver	42f84a8cc7	puppet: Use existing autossh tunnels as OpenSSH "master" sockets. A number of autossh connections are already left open for port-forwarding Munin ports; autossh starts the connections and ensures that they are automatically restarted if they are severed. However, this represents a missed opportunity. Nagios's monitoring uses a large number of SSH connections to the remote hosts to run commands on them; each of these connections requires doing a complete SSH handshake and authentication, which can have non-trivial network latency, particularly for hosts which may be located far away, in a network topology sense (up to 1s for a no-op command!). Use OpenSSH's ability to multiplex multiple connections over a single socket, to reuse the already-established connection. We leave an explicit `ControlMaster no` in the general configuration, and not `auto`, as we do not wish any of the short-lived Nagios connections to get promoted to being a control socket if the autossh is not running for some reason. We enable protocol-level keepalives, to give a better chance of the socket being kept open.	2022-11-01 22:24:40 -07:00
Alex Vandiver	e05a0dcf98	puppet: Support FQDNs in puppet zulip.conf names.	2022-11-01 22:24:40 -07:00
Alex Vandiver	df201bd132	puppet: Monitor "hosts_fullstack" hosts (e.g. CZO). These hosts were excluded from `zulipconf_nagios_hosts` in `8cff27f67d`, because it was replicating the previously hard-coded behaviour exactly. That behaviour was an accident of history, in that `4fbe201187` and before had simply not monitored hosts of this class. There is no reason to not add SSH tunnels and munin monitoring for these hosts; stop skipping them.	2022-11-01 22:24:40 -07:00
Alex Vandiver	951dc68f3a	autossh: Drop unnecessary -2 option. The -2 option is a no-op.	2022-11-01 22:24:40 -07:00
Alex Vandiver	01f38c4516	puppet: Bump Grafana version.	2022-10-12 22:00:27 -07:00
Alex Vandiver	ed19361838	puppet: Upgrade puppetlabs libraries.	2022-10-10 08:46:29 -07:00
Alex Vandiver	798ab420db	puppet: Update third-party package versions.	2022-10-10 08:46:29 -07:00
Anders Kaseorg	11a86ec328	install: Remove PostgreSQL 10 support. PostgreSQL 10 reaches its upstream end of life in November, and is not supported by Django 4.1. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-10-06 15:59:07 -07:00
Anders Kaseorg	ce9ceb7f9f	tornado: Fix Tornado CSRF check with X-Forwarded-Proto. Since Django factors request.is_secure() into its CSRF check, we need this to tell it to consider requests forwarded from nginx to Tornado as secure. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-09-23 16:01:12 -07:00
Anders Kaseorg	987ab741f9	sharding: Support Tornado sharding by regexes. One should now be able to configure a regex by appending _regex to the port number: [tornado_sharding] 9802_regex = ^[l-p].*\.zulipchat\.com$ Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-09-15 16:07:50 -07:00
Anders Kaseorg	7666ff603d	sharding: Configure Tornado sharding with nginx map. https://nginx.org/en/docs/http/ngx_http_map_module.html Since Puppet doesn’t manage the contents of nginx_sharding.conf after its initial creation, it needs to be renamed so we can give it different default contents. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-09-15 16:07:50 -07:00
Anders Kaseorg	0da0ee3c92	puppet: Remove nginx configuration for zulip.org. This is unused since commit `1806e0f45e` (#19625). Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-09-01 10:03:18 -07:00
Anders Kaseorg	5d77d50423	scripts: Help mypy resolve the psycopg2.connect overload. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-08-30 17:36:21 -07:00
Matt Keller	91e5ae84ac	uwsgi: Increase timeout before harakiri. Some legitimate requests in Zulip can take more than 20s to be processed, and we don't have a current problem where having a 20s limit here is preventing a problem.	2022-08-23 15:28:10 -07:00
Alex Vandiver	a9183d2208	grafana: Enable auto-sign-up. This avoids the need to explicitly create new users in Grafana, by simply trusting Teleport.	2022-07-19 17:52:17 -07:00
Alex Vandiver	9bd88a93e2	puppet: Tell needrestart to not default to restarting core services. The `needrestart` tool added in 22.04 is useful in terms of listing which services may need to be restarted to pick up updated libraries. However, it prompts about the current state of services needing restart for every subsequent `apt-get upgrade`, and defaulting core services to restarting requires carefully manually excluding them every time, at risk of causing an unscheduled outage. Build a list of default-off services based on the list in unattended-upgrades.	2022-07-19 17:51:18 -07:00
Alex Vandiver	7ae3708c02	teleport: Add explicit WebAuthn config, not just U2F. WebAuthn is the default, replacing U2F, in Teleport 10 and above[1]. While Teleport can derive a WebAuthn configuration from a U2F configuration[2], it's useful to be explicit. [1]: https://goteleport.com/docs/access-controls/guides/webauthn/ [2]: https://goteleport.com/docs/access-controls/guides/webauthn/#u2f	2022-07-18 11:41:00 -07:00
Alex Vandiver	9d29c46078	puppet: Upgrade Grafana, Prometheus and redis_exporter.	2022-07-15 09:18:58 -07:00
Alex Vandiver	42dc5d003e	puppet: Upgrade Smokescreen and golang.	2022-07-15 09:18:58 -07:00
Alex Vandiver	120de1dca9	zephyr: Write out unix timestamp in check, as check_cron_file expects. A follow-up fix to `8bc26aab08`.	2022-06-30 11:12:26 -07:00
Alex Vandiver	4fd51cb5ad	uwsgi: Increase request buffer size to 64k, from 8k default. The default value in uwsgi is 4k; receiving more than this amount from nginx leads to a 502 response (though, happily, the backend uwsgi does not terminate). `ab18dbfde5` originally increased it from the unstated uwsgi default of 4096, to 8192; `b1da797955` made it configurable, in order to allow requests from clients with many cookies, without causing 502's[1]. nginx defaults to a limitation of 1k, with 4 additional 8k header lines allowed[2]; any request larger than that returns a response of `400 Request Header Or Cookie Too Large`. The largest header size theoretically possible from nginx, by default, is thus 33k, though that would require packing four separate headers to exactly 8k each. Remove the gap between nginx's limit and uwsgi's, which could trigger 502s, by removing the uwsgi configurability, and setting a 64k size in uwsgi (the max allowable), which is larger than nginx's default limit. uWSGI's documentation of `buffer-size` ([3], [4]) also notes that "It is a security measure too, so adapt to your app needs instead of maxing it out." Python has no security issues with buffers of 64k, and there is no appreciable memory footprint difference to having a larger buffer available in uwsgi. [1]: https://chat.zulip.org/#narrow/stream/31-production-help/topic/works.20in.20Edge.20not.20Chrome/near/719523 [2]: https://nginx.org/en/docs/http/ngx_http_core_module.html#client_header_buffer_size [3]: https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html [4]: https://uwsgi-docs.readthedocs.io/en/latest/Options.html#buffer-size	2022-06-28 16:14:24 -07:00
Anders Kaseorg	ef3510fa6d	nginx: Remove legacy X-XSS-Protection header. Support for this header was removed in Chrome 78, Safari 15.4, and Edge 17. It was never supported in Firefox. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-06-27 17:38:18 -07:00
Alex Vandiver	8577adcf2e	cron: Remove unused STATE_FILE environment variable.	2022-06-22 12:07:38 -07:00
Alex Vandiver	8bc26aab08	nagios: Switch check_user_zephyr_mirror_liveness to run via cron. This check loads Django, and as such must be run as the zulip user. Repeat the same pattern used elsewhere in nagios, of writing a state file, which is read by `check_cron_file`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	41deef40cf	nagios: Switch to generic check_cron_file for queues and consumers. These share a common root; `91da4bd59b` duplicated the code, but didn't move the existing uses to the new utility.	2022-06-22 12:07:38 -07:00
Alex Vandiver	b2d0bad9af	check_cron_file: Remove unnecessary quotes.	2022-06-22 12:07:38 -07:00
Alex Vandiver	41b7ae4e44	check_cron_file: Don't crash on missing cron file. This is `5050fb19f6`, but for `check_cron_file`, which was introduced in `91da4bd59b`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	8fbde9b8c5	nagios: Only run check_fts_update_log on one PostgreSQL host. The data is the same in the table in all replicas -- there is no need to alert on all of them.	2022-06-22 12:07:38 -07:00
Alex Vandiver	499284d2fd	nagios: Split postgresql into primary and replica. Replication checks should only run on primary and replicas, not standalone hosts; while `autovac_freeze` currently only runs on primary hosts, it functions identically on replicas, and is fine to run there. Make `autovac_freeze` run on all `postgresql` hosts, and make standalone hosts no longer `postgres_primary`, so they do not fail the replication tests.	2022-06-22 12:07:38 -07:00
Alex Vandiver	38e435347b	nagios: Add missing queue consumer checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	e01a4242aa	nagios: Sort queue consumer checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	2c90c7a010	nagios: Switch `check_remote_arg_string` queue checks to consumer checks. This style of check just looks for matching process names using `check_remote_arg_string`, which dates to `8edbd64bb8`. These were added because the original two (`missedmessage_emails` and `slow_queries`) did not create consumers, instead polling for events. Switch these to checking the queue consumer counts that the `check-rabbitmq-consumers` check is already writing out. Since `missedmessage_emails` was _already_ checked via the consumer check, a duplicate is not added.	2022-06-22 12:07:38 -07:00
Alex Vandiver	f48d543d9b	nagios: Make and use a "rabbitmq-consumer-service" template service.	2022-06-22 12:07:38 -07:00
Alex Vandiver	775a084d0f	nagios: Add a catchall "other" set.	2022-06-22 12:07:38 -07:00
Alex Vandiver	83c82c8e15	nagios: Adjust load alerting by hostgroup. Even the `pageable_servers` group did not page for high load -- in part because what was "high" depends on the servers. Set slightly better limits based on server role.	2022-06-22 12:07:38 -07:00
Alex Vandiver	2a14aa5180	nagios: Add a `fullstack` hostgroup. This will be used to apply checks only to czo.	2022-06-22 12:07:38 -07:00
Alex Vandiver	b5ecfc327f	nagios: Remove unnecessary `web` hostgroup. This had identical membership to `frontends`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	4be9025212	nagios: Remove redundant `postgresql` hostgroup. This is implied by `postgresql_primary`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	d9d0014fb4	nagios: Rename `zmirror_main` into `zmirror` hostgroup. `zmirror` itself was `zmirror_main` + `zmirrorp` but was unused; we consistently just use the term `zmirror` for the non-personals server, so use it as the hostgroup name.	2022-06-22 12:07:38 -07:00
Alex Vandiver	70c36985b4	nagios: Remove frontends from redis group. The Redis nagios checks themselves are done against `redis` + `frontends` groups, so there is no need to misleadingly place `frontends` in the `redis` hostgroup.	2022-06-22 12:07:38 -07:00
Alex Vandiver	08127086bc	nagios: Remove misleading "staging_frontends" from standalone. No services are tested for the `staging_frontends` hostgroup, so this does not alter the checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	d804de871d	nagios: Move staging and prod hostgroups adjacent.	2022-06-22 12:07:38 -07:00
Alex Vandiver	4c17f2bccc	nagios: The frontends hostgroup now includes prod and staging frontends. This lets the config file remove some repetition.	2022-06-22 12:07:38 -07:00
Alex Vandiver	1e81775fa0	nagios: Drop unhelpful hostgroup comment.	2022-06-22 12:07:38 -07:00
Alex Vandiver	7b584401ac	nagios: Reformat hostgroups.	2022-06-22 12:07:38 -07:00
Alex Vandiver	93bcb86345	nagios: Reorder service checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	eaaa2fbff8	nagios: Use canonical "hostgroup_name" consistently.	2022-06-22 12:07:38 -07:00
Alex Vandiver	e8996b53a5	nagios: Remove unused has_swap hostgroup.	2022-06-22 12:07:38 -07:00
Alex Vandiver	33472ee9ff	nagios: Remove unused stats host set.	2022-06-22 12:07:38 -07:00
Alex Vandiver	bc4f4b4862	nagios: Make the pageable/not/flaky tri-state clearer.	2022-06-22 12:07:38 -07:00
Alex Vandiver	c74f195fba	nagios: Split AWS and non-AWS hosts, for ntp checks. The non-AWS hosts cannot use the AWS ntp server for their check.	2022-06-22 12:07:38 -07:00
Alex Vandiver	872efdee58	nagios: Fold single- and multitornado_frontends back into frontends. `5abf4dee92` made this distinction, then multitornado_frontends was never used; the singletornado_frontends alerting worked even for the multiple-Tornado instances. Remove the useless and misleading distinction.	2022-06-22 12:07:38 -07:00
Anders Kaseorg	dc6af98e52	nginx: Add Cache-Control headers for Django-hashed static files. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-06-21 17:26:23 -07:00
Alex Vandiver	0645656fd8	process_fts_updates: Nagios may lack permissions to load Django config. Even if Django and PostgreSQL are on the same host, the `nagios` user may lack permissions to read accessory configuration files needed to load the Django configuration (e.g. authentication keys). Catch those failures, and switch to loading the required settings from `/etc/zulip/zulip.conf`.	2022-06-21 12:50:13 -07:00
Anders Kaseorg	a7f9c4f958	logging: Pass more format arguments to logging. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-06-03 12:27:23 -07:00
Alex Vandiver	aa46d8d2a8	puppet: Enable strict typo checking in uwsgi.	2022-06-02 13:20:48 -07:00
Alex Vandiver	18ec3b6215	puppet: Enable background worker threads in uwsgi. Without this, uwsgi does not release the GIL before going back into `epoll_wait` to wait for the next request. This results in any background threads languishing, unserviced.[1] Practically, this results in Sentry background reporter threads timing out when attempting to post results -- but only in situations with low traffic, as in those significant time is spent in `epoll_wait`. This is seen in logs as: WARN [urllib3.connectionpool] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)'))': /api/123456789/envelope/ Or: WARN [urllib3.connectionpool] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response'))': /api/123456789/envelope/ Sentry attempts to detect this and warn, but due to startup ordering, the warning is not printed without lazy-loading. Enable threads, at a minuscule performance cost, in order to support background workers like Sentry[2]. [1] https://github.com/unbit/uwsgi/issues/1141#issuecomment-169042767 [2] https://docs.sentry.io/clients/python/advanced/#a-note-on-uwsgi	2022-06-02 13:20:48 -07:00
Alex Vandiver	919c904091	puppet: Give the uwsgi processes a shorter process name. Previously, the complete command line, which is quite long, is shown: 3963143 ? SN 0:00 /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini 3963144 ? SN 0:03 \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini 3963145 ? SN 0:03 \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini 3963146 ? SN 0:03 \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini 3963147 ? SN 0:03 \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini 3963148 ? SN 0:03 \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini 3963149 ? SN 0:03 \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini Configure uwsgi to rename and number the processes. This results in: 3907613 ? SN 0:00 zulip-django uWSGI master 3907614 ? SN 0:05 \_ zulip-django uWSGI worker 1 3907615 ? SN 0:03 \_ zulip-django uWSGI worker 2 3907616 ? SN 0:05 \_ zulip-django uWSGI worker 3 3907617 ? SN 0:05 \_ zulip-django uWSGI worker 4 3907618 ? SN 0:05 \_ zulip-django uWSGI worker 5 3907619 ? SN 0:05 \_ zulip-django uWSGI worker 6	2022-06-02 13:20:48 -07:00
Alex Vandiver	a522ad1d9a	puppet: Always create a uwsgi master control socket. This is potentially useful even with rolling restarts disabled.	2022-06-02 13:20:48 -07:00
Alex Vandiver	721a101f12	puppet: Reorganize and comment uwsgi.ini file. As the uwsgi documentation is somewhat obtuse, more comments are added here than might usually be.	2022-06-02 13:20:48 -07:00
Alex Vandiver	3741c1c034	puppet: Switch to checking time against the AWS timeserver. Since this is what chrony is sync'ing to, it lessens the chance of spurious firings of this alert. See https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/	2022-05-31 22:57:32 -07:00
Alex Vandiver	a201e3b25b	puppet: Upgrade wal-g to 2.0.0.	2022-05-22 14:51:18 -07:00
Alex Vandiver	c8ee53619d	puppet: Upgrade go and smokescreen.	2022-05-22 14:51:18 -07:00
Alex Vandiver	4a5e530743	puppet: Upgrade Grafana to 8.5.3, for CVE-2022-29170.	2022-05-22 14:51:18 -07:00
Alex Vandiver	baed1214f2	puppet: Only fix certbot certificates if https is enabled. This is a reprise of `c97162e485`, but for the case where certbot certs are no longer in use by way of enabling `http_only` and letting another server handle TLS termination. Fixes: #22034.	2022-05-17 15:03:44 -07:00
Alex Vandiver	62f234328d	puppet: Include the OS-enabled nginx module configurations. This allows system-level configuration to be done by `apt-get install` of nginx modules, which place their load statements in this directory. The initial import in `ed0cb0a5f8` of the stock nginx config omitted this include -- one potential explanation was in an effort to reduce the memory footprint of the server. The default nginx install enables: 50-mod-http-auth-pam.conf 50-mod-http-dav-ext.conf 50-mod-http-echo.conf 50-mod-http-geoip2.conf 50-mod-http-geoip.conf 50-mod-http-image-filter.conf 50-mod-http-subs-filter.conf 50-mod-http-upstream-fair.conf 50-mod-http-xslt-filter.conf 50-mod-mail.conf 50-mod-stream.conf While Zulip doesn't actively use any of these, they likely don't do any harm to simply be loaded -- they are loaded into every nginx by default. Having the `modules-enabled` include allows easier extension of the server, as neither of the existing wildcard includes (`/etc/nginx/conf.d/*.conf` and `/etc/nginx/zulip-include/app.d/*.conf`) are in the top context, and thus able to load modules.	2022-05-17 15:03:07 -07:00
Alex Vandiver	814841c9ec	puppet: Remove typo'd cron job. `54b6a83412` fixed the typo introduced in `49ad188449`, but that does not clean up existing installs which had the file with the wrong name already. Remove the file with the typo'd name, so two jobs do not race, and fix the typo in the comment.	2022-05-16 14:57:21 -07:00
Alex Vandiver	20b7a2d450	puppet: Each worker should chdir after forking. The top-level `chdir` setting only does the chdir once, at initial `uwsgi` startup time. Rolling restarts, however, require that `uwsgi` pick up the _new_ value of the `current` directory, and start new workers in that directory -- as currently implemented, rolling restarts cannot restart into newer versions of the code, only the same one in which they were started. Use [configurable hooks][1] to execute the `chdir` after every fork. This causes the following behaviour: ``` Thu May 12 18:56:55 2022 - chain reload starting... Thu May 12 18:56:55 2022 - chain next victim is worker 1 Gracefully killing worker 1 (pid: 1757689)... worker 1 killed successfully (pid: 1757689) Respawned uWSGI worker 1 (new pid: 1757969) Thu May 12 18:56:56 2022 - chain is still waiting for worker 1... running "chdir:/home/zulip/deployments/current" (post-fork)... Thu May 12 18:56:57 2022 - chain is still waiting for worker 1... Thu May 12 18:56:58 2022 - chain is still waiting for worker 1... Thu May 12 18:56:59 2022 - chain is still waiting for worker 1... WSGI app 0 (mountpoint='') ready in 3 seconds on interpreter 0x55dfca409170 pid: 1757969 (default app) Thu May 12 18:57:00 2022 - chain next victim is worker 2 [...] ``` ..and so forth down the line of processes. Each process is correctly started in the _current_ value of `current`, and thus picks up the correct code. [1]: https://uwsgi-docs.readthedocs.io/en/latest/Hooks.html	2022-05-12 21:54:02 -07:00
Alex Vandiver	7f6a77da31	puppet: Add a redis exporter.	2022-05-03 17:13:44 -07:00
Anders Kaseorg	e9ba9b0e0d	zulip-ec2-configure-interfaces: Remove. Our current EC2 systems don’t have an interface named ‘eth0’, and if they did, this script would do nothing but crash with ImportError because we have never installed boto.utils for Python 3. (The message of commit `2a4d851a7c` made an effort to document for future researchers why this script should not have been blindly converted to Python 3. However, commit `2dc6d09c2a` (#14278) was evidently unresearched and untested.) Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-05-03 02:25:59 -07:00
Alex Vandiver	d891b9590a	puppet: Fix non-replicated PostgreSQL 10 and 11 configuration. `6f5ae8d13d` removed the `$replication` variable from the configurations of PostgreSQL 12 and higher, but left it in the templates for PostgreSQL 10 and 11. Because `undef != ''`, deployments on PostgreSQL 10 and 11 started trying to push to S3 backups, regardless of if they were configured, leaving frequent log messages like: ``` 2022-04-30 12:45:47.805 UTC [626d24ec.1f8db0]: [107-1] LOG: archiver process (PID 2086106) exited with exit code 1 2022-04-30 12:45:49.680 UTC [626d24ee.1f8dc3]: [18-1] LOG: checkpoint complete: wrote 19 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=1.910 s, sync=0.022 s, total=1.950 s; sync files=16, longest=0.018 s, average=0.002 s; distance=49 kB, estimate=373 kB /usr/bin/timeout: failed to run command "/usr/local/bin/env-wal-g": No such file or directory 2022-04-30 12:46:17.852 UTC [626d2f99.1fd4e9]: [1-1] FATAL: archive command failed with exit code 127 2022-04-30 12:46:17.852 UTC [626d2f99.1fd4e9]: [2-1] DETAIL: The failed archive command was: /usr/bin/timeout 10m /usr/local/bin/env-wal-g wal-push pg_wal/000000010000000300000080 ``` Switch the PostgreSQL 10 and 11 configuration to check `s3_backups_bucket`, like the other versions.	2022-05-02 16:46:10 -07:00
Anders Kaseorg	646a4d19a3	puppet: Remove quotes for enumerable values. https://puppet.com/docs/puppet/7/style_guide.html#style_guide_module_design-quoting “If a string is a value from an enumerable set of options, such as present and absent, it SHOULD NOT be enclosed in quotes at all.” Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-04-29 22:06:46 -07:00
Alex Vandiver	c97162e485	puppet: Check that certbot certs are in use before fixing them. It is possible to have previously installed certbot, but switched back to using self-signed certificates -- in which case renewing them using certbot may fail. Verify that the certificate is a symlink into certbot's output directory before running `fix-standalone-certbot`.	2022-04-27 16:01:15 -07:00
Anders Kaseorg	098a514599	python: Use Python 3.8 shlex.join function. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-04-27 12:57:49 -07:00
Alex Vandiver	35db1ee435	puppet: Only include "app_service" section if there are apps. This works around gravitational/teleport#12256, but also produces config files that are slightly cleaner.	2022-04-26 16:36:13 -07:00
Anders Kaseorg	a7e6cb7705	puppet: ‘supervisorctl stop all’ before restarting Supervisor. This fixes a failure of the 3.4 upgrade test running on Ubuntu 20.04 with Supervisor 4. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-04-26 16:32:02 -07:00
Alex Vandiver	e5548ecba0	puppet: Upgrade external dependencies.	2022-04-21 13:54:14 -07:00
Alex Vandiver	1151118cc8	puppet: Upgrade Grafana to 8.4.6.	2022-04-12 16:41:45 -07:00
Alex Vandiver	572443edc6	puppet: Remove memcached SASL workaround. https://bugs.launchpad.net/ubuntu/+source/memcached/+bug/1878721 was fixed and released in Focal in 2020-06-24. We don't bother with an `ensure => absent` because leaving this in-place for existing installs does no harm.	2022-04-08 14:59:45 -07:00
Anders Kaseorg	935cb605a5	puppet: Do not ensure Chrony is running. Commit `f6d27562fa` (#21564) tried to ensure Chrony is running, which fails in containers where Chrony doesn’t have permission to update the host clock. The Debian package should still attempt to start it, and Puppet should still restart it when chrony.conf is modified. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-03-30 11:37:54 -07:00
Alex Vandiver	f6d27562fa	puppet: Configure chrony to use AWS-local NTP sources. This prevents hosts from spewing traffic to random hosts across the Internet.	2022-03-25 17:07:53 -07:00
Alex Vandiver	5e128e7cad	puppet: Extract the wal-g configuration from the backups. This will allow it to be used for monitoring, to check the state in S3 rather than just trusting the backups when they said they ran.	2022-03-25 17:05:30 -07:00
Alex Vandiver	d7b59c86ce	puppet: Build wal-g from source for aarch64. Since wal-g does not provide binaries for aarch64, build them from source. While building them from source for arm64 would better ensure that build process is tested, the build process takes 7min and 700M of temp files, which is an unacceptable cost; we thus only build on aarch64. Since the wal-g build process uses submodules, which are not in the Github export, we clone the full wal-g repository. Because the repository is relatively small, we clone it anew on each new version, rather than attempt to manage the remotes. Fixes #21070.	2022-03-22 15:02:35 -07:00
Alex Vandiver	4d4c320a07	puppet: Switch from ntp to chrony. Chrony is the recommended time server for Ubuntu since 18.04[1], and is the default on Redhat; it is more accurate, and has lower memory usage, than ntp, which is only getting best-effort security maintenance. See: - https://wiki.ubuntu.com/BionicBeaver/ReleaseNotes#Chrony - https://chrony.tuxfamily.org/comparison.html - https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/	2022-03-22 13:07:27 -07:00
Alex Vandiver	a2c8be9cd5	puppet: Increase download timeout from 5m to 10m. The default timeout for `exec` commands in Puppet is 5 minutes[1]. On slow connections, this may not be sufficient to fetch larger downloads, such as the ~135MB golang tarball. Increase the timeout to 10 minutes; this is a minimum download speed of ~225kB/s. Fixes #21449. [1]: https://puppet.com/docs/puppet/5.5/types/exec.html#exec-attribute-timeout	2022-03-21 15:47:04 -07:00
Alex Vandiver	9e850b08f3	puppet: Fix the PostgreSQL paths to recovery.conf / standby.conf.	2022-03-20 16:16:04 -07:00
Alex Vandiver	1bd5723cd2	puppet: Add a prometheus monitor for tornado processes.	2022-03-20 16:12:11 -07:00
Alex Vandiver	6b91652d9a	puppet: Open the grok_exporter port. The complete grok_exporter configuration is not ready to be committed, but this at least prepares the way for it.	2022-03-20 16:12:11 -07:00
Alex Vandiver	6558655fc6	puppet: Add rabbitmq prometheus plugin, and open the firewall.	2022-03-20 16:12:11 -07:00
Alex Vandiver	bdd2f35d05	puppet: Switch czo to using zulip_ops::app_frontend_monitoring. This was clearly intended in `f61ac4a28d` but never executed.	2022-03-20 16:12:11 -07:00
Alex Vandiver	17699bea44	puppet: postgresql_backups is auto-included if s3_backups_bucket is set. Since `6496d43148`.	2022-03-20 16:12:11 -07:00
Alex Vandiver	bedc7c2986	puppet: Smokescreen is now auto-included in standalone. Since `c33562f0a8`.	2022-03-20 16:12:11 -07:00
Alex Vandiver	6489c832a3	puppet: Upgrade third-party package versions.	2022-03-17 11:44:05 -07:00
Alex Vandiver	d17006da55	puppet: Support setting an `ssl_mode` verification level.	2022-03-15 12:43:50 -07:00
Alex Vandiver	253bef27f5	puppet: Support password-based PostgreSQL replication.	2022-03-15 12:43:50 -07:00
Sahil Batra	f0606b34ad	user_groups: Add cron job for adding users to full members system group. This commit adds a cron job which runs every hour to add the users to full members system group if user is promoted to a full member. This should ensure that full member status is available no more than an hour after configuration suggests it should be.	2022-03-14 18:53:47 -07:00
Alex Vandiver	6f5ae8d13d	puppet: wal-g backups are required for replication. Previously, it was possible to configure `wal-g` backups without replication enabled; this resulted in only daily backups, not streaming backups. It was also possible to enable replication without configuring the `wal-g` backups bucket; this simply failed to work. Make `wal-g` backups always streaming, and warn loudly if replication is enabled but `wal-g` is not configured.	2022-03-11 10:09:35 -08:00
Alex Vandiver	6496d43148	puppet: Only s3_backups_bucket is required for backups. `s3_backups_key` / `s3_backups_secret_key` are optional, as the permissions could come from the EC2 instance's role.	2022-03-11 10:09:35 -08:00
Alex Vandiver	19beed2709	puppet: Default s3_region to the current ec2 region.	2022-03-11 10:09:35 -08:00
Anders Kaseorg	b3260bd610	docs: Use Debian and Ubuntu version numbers over development codenames. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-02-23 12:04:24 -08:00
Anders Kaseorg	1629d6bfb3	python: Reformat with Black 22 (stable). Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-02-18 18:03:13 -08:00
Alex Vandiver	c656d933fa	puppet: Switch from $::memorysize_mb to non-legacy $::memory.	2022-02-15 12:04:37 -08:00
Alex Vandiver	f2f4462e71	puppet: Switch from $::fqdn to non-legacy $::networking.	2022-02-15 12:04:37 -08:00
Alex Vandiver	bb4c0799cc	puppet: Switch to the canonical case for $::os['family']. The == operator in Puppet is case-insensitive for ASCII characters[1], which is potentially surprising. Switch to the canonical case that `$::os['family']` returns. [1] https://puppet.com/docs/puppet/5.5/lang_expressions.html#string-encoding-and-comparisons	2022-02-15 12:04:37 -08:00
Alex Vandiver	d4eefbbeea	puppet: Switch from $::osfamily to non-legacy $::os.	2022-02-15 12:04:37 -08:00
Alex Vandiver	a787ebe0e2	puppet: Switch from $::architecture to non-legacy $::os.	2022-02-15 12:04:37 -08:00
Alex Vandiver	d7e8733705	puppet: Use goarch for wal-g. wal-g does not currently provide pre-built binaries for arm64/aarch64 (see #21070) but if they begin to, it will likely be with the goarch names.	2022-02-15 12:04:37 -08:00
Alex Vandiver	abdbe4ca83	puppet: Use goarch for go-camo.	2022-02-15 12:04:37 -08:00
Alex Vandiver	be2f2a5bde	puppet: Use goarch for golang. Fixes: #21051.	2022-02-15 12:04:37 -08:00
Alex Vandiver	788daa953b	puppet: Factor out $::architecture case statement for golang.	2022-02-15 12:04:37 -08:00
Anders Kaseorg	f6a701090c	setup-apt-repos: Don’t install lsb_release. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-02-14 16:38:53 -08:00
Anders Kaseorg	45f4db9702	puppet: Remove unused $release_name. It would confuse a future Debian 15.10 release with Ubuntu 15.10, it relies on the legacy fact $::operatingsystemrelease, the modern fact $::os provides this information without extra logic, and it’s unused as of commit `03bffd3938`. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-02-14 16:38:53 -08:00
Alex Vandiver	291c5e87b6	puppet: Upgrade prometheus to 2.33.1.	2022-02-09 20:32:24 -08:00
Alex Vandiver	2d538c2356	puppet: Upgrade grafana to 8.3.6.	2022-02-09 20:32:24 -08:00
Alex Vandiver	f2e66c0b20	puppet: Upgrade go-camo to 2.4.0.	2022-02-09 20:32:24 -08:00
Alex Vandiver	51a516384d	puppet: Upgrade golang to 1.17.6.	2022-02-09 20:32:24 -08:00
Alex Vandiver	48263a01dd	puppet: Upgrade puppet libraries.	2022-02-09 20:32:24 -08:00
Alex Vandiver	e032b38661	puppet: Fix typo in uwsgi exporter dependency.	2022-02-08 15:17:17 -08:00
Alex Vandiver	b3900bec7e	puppet: Upgrade Grafana to 8.3.5. https://grafana.com/docs/grafana/latest/release-notes/release-notes-8-3-5/	2022-02-08 11:13:40 -08:00
Alex Vandiver	a46f6df91e	CVE-2021-43799: Write rabbitmq configuration before starting. Zulip writes a `rabbitmq.config` configuration file which locks down RabbitMQ to listen only on localhost:5672, as well as the RabbitMQ distribution port, on localhost:25672. The "distribution port" is part of Erlang's clustering configuration; while it is documented that the protocol is fundamentally insecure ([1], [2]) and can result in remote arbitrary execution of code, by default the RabbitMQ configuration on Debian and Ubuntu leaves it publicly accessible, with weak credentials. The configuration file that Zulip writes, while effective, is only written _after_ the package has been installed and the service started, which leaves the port exposed until RabbitMQ or system restart. Ensure that rabbitmq's `/etc/rabbitmq/rabbitmq.config` is written before rabbitmq is installed or starts, and that changes to that file trigger a restart of the service, such that the ports are only ever bound to localhost. This does not mitigate existing installs, since it does not force a rabbitmq restart. [1] https://www.erlang.org/doc/apps/erts/erl_dist_protocol.html [2] https://www.erlang.org/doc/reference_manual/distributed.html#distributed-erlang-system	2022-01-25 01:48:05 +00:00
Alex Vandiver	43d63bd5a1	puppet: Always set the RabbitMQ nodename to zulip@localhost. This is required in order to lock down the RabbitMQ port to only listen on localhost. If the nodename is `rabbit@hostname`, in most circumstances the hostname will resolve to an external IP, which the rabbitmq port will not be bound to. Installs which used `rabbit@hostname`, due to RabbitMQ having been installed before Zulip, would not have functioned if the host or RabbitMQ service was restarted, as the localhost restrictions in the RabbitMQ configuration would have made rabbitmqctl (and Zulip cron jobs that call it) unable to find the rabbitmq server. The previous commit ensures that configure-rabbitmq is re-run after the nodename has changed. However, rabbitmq needs to be stopped before `rabbitmq-env.conf` is changed; we use an `onlyif` on an `exec` to print the warning about the node change, and let the subsequent config change and notify of the service and configure-rabbitmq to complete the re-configuration.	2022-01-25 01:48:02 +00:00
Alex Vandiver	3bfcfeac24	puppet: Run configure-rabbitmq on nodename change. `/etc/rabbitmq/rabbitmq-env.conf` sets the nodename; anytime the nodename changes, the backing database changes, and this requires re-creating the rabbitmq users and permissions. Trigger this in puppet by running configure-rabbitmq after the file changes.	2022-01-25 01:46:51 +00:00
Alex Vandiver	694c4dfe8f	puppet: Admit we leave epmd port 4369 open on all interfaces. The Erlang `epmd` daemon listens on port 4369, and provides information (without authentication) about which Erlang processes are listening on what ports. This information is not itself a vulnerability, but may provide information for remote attackers about what local Erlang services (such as `rabbitmq-server`) are running, and where. `epmd` supports an `ERL_EPMD_ADDRESS` environment variable to limit which interfaces it binds on. While this environment variable is set in `/etc/default/rabbitmq-server`, Zulip unfortunately attempts to start `epmd` using an explicit `exec` block, which ignores those settings. Regardless, this lack of `ERL_EPMD_ADDRESS` variable only controls `epmd`'s startup upon first installation. Upon reboot, there are two ways in which `epmd` might be started, neither of which respects `ERL_EPMD_ADDRESS`: - On Focal, an `epmd` service exists and is activated, which uses systemd's configuration to choose which interfaces to bind on, and thus `ERL_EPMD_ADDRESS` is irrelevant. - On Bionic (and Focal, due to a broken dependency from `rabbitmq-server` to `epmd@` instead of `epmd`, which may lead to the explicit `epmd` service losing a race), `epmd` is started by `rabbitmq-server` when it does not detect a running instance. Unfortunately, only `/etc/init.d/rabbitmq-server` respects `/etc/default/rabbitmq-server` -- and it defers the actual startup to using systemd, which does not pass the environment variable down. Thus, `ERL_EPMD_ADDRESS` is also irrelevant here. We unfortunately cannot limit `epmd` to only listening on localhost, due to a number of overlapping bugs and limitations: - Manually starting `epmd` with `-address 127.0.0.1` silently fails to start on hosts with IPv6 disabled, due to an Erlang bug ([1], [2]). - The dependencies of the systemd `rabbitmq-server` service can be fixed to include the `epmd` service, and systemd can be made to bind to `127.0.0.1:4369` and pass that socket to `epmd`, bypassing the above bug. However, the startup of this service is not guaranteed, because it races with other sources of `epmd` (see below). - Any process that runs `rabbitmqctl` results in `epmd` being started if one is not currently running; these instances do not respect any environment variables as to which addresses to bind on. This is also triggered by `service rabbitmq-server status`, as well as various Zulip cron jobs which inspect the rabbitmq queues. As such, it is difficult-to-impossible to ensure that some other `epmd` process will not win the race and open the port on all interfaces. Since the only known exposure from leaving port 4369 open is information that rabbitmq is running on the host, and the complexity of adjusting this to only bind on localhost is high, we remove the setting which does not address the problem, and document that the port is left open, and should be protected via system-level or network-level firewalls. [1]: https://bugs.launchpad.net/ubuntu/+source/erlang/+bug/1374109 [2]: https://github.com/erlang/otp/issues/4820	2022-01-25 01:46:51 +00:00
Alex Vandiver	2713e90eaf	puppet: Remove rabbitmq_mochiweb configuration. mochiweb was renamed to web_dispatch in RabbitMQ 3.8.0, and the plugin is not enabled. Nor does this control the management interface, which would listen on port 15672.	2022-01-25 01:46:51 +00:00
Alex Vandiver	a3adaf4aa3	puppet: Fix standalone certbot configurations. This addresses the problems mentioned in the previous commit, but for existing installations which have `authenticator = standalone` in their configurations. This reconfigures all hostnames in certbot to use the webroot authenticator, and attempts to force-renew their certificates. Force-renewal is necessary because certbot contains no way to merely update the configuration. Let's Encrypt allows for multiple extra renewals per week, so this is a reasonable cost. Because the certbot configuration is `configobj`, and not `configparser`, we have no way to easily parse to determine if webroot is in use; additionally, `certbot certificates` does not provide this information. We use `grep`, on the assumption that this will catch nearly all cases. It is possible that this will find `authenticator = standalone` certificates which are managed by Certbot, but not Zulip certificates. These certificates would also fail to renew while Zulip is running, so switching them to use the Zulip webroot would still be an improvement. Fixes #20593.	2022-01-24 12:13:44 -08:00
Anders Kaseorg	97e4e9886c	python: Replace universal_newlines with text. This is supported in Python ≥ 3.7. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-01-23 22:16:01 -08:00
Anders Kaseorg	a58a71ef43	Remove Ubuntu 18.04 support. As a consequence: • Bump minimum supported Python version to 3.7. • Move Vagrant environment to Debian 10, which has Python 3.7. • Move CI frontend tests to Debian 10. • Move production build test to Debian 10. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-01-21 17:26:14 -08:00
Alex Vandiver	3bbe5c1110	puppet: Put comments on iptables lines. In addition to documenting the rules.v4 and rules.v6 files slightly, these comments show up in `iptables -L`: ``` root@hostname:~# iptables -L INPUT Chain INPUT (policy ACCEPT) target prot opt source destination ACCEPT all -- anywhere anywhere LOGDROP all -- anywhere localhost/8 ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED ACCEPT tcp -- anywhere anywhere tcp dpt:ssh /* ssh */ ACCEPT tcp -- anywhere anywhere tcp dpt:3000 /* grafana */ ACCEPT tcp -- anywhere anywhere tcp dpt:9100 /* node_exporter */ LOGDROP all -- anywhere anywhere ```	2022-01-21 16:46:14 -08:00
Alex Vandiver	6bc5849ea8	puppet: Remove now-unused debathena apt repository.	2022-01-18 14:13:28 -08:00
Alex Vandiver	b3f07cc98d	puppet: Replace debathena zephyr package with equivalent puppet file.	2022-01-18 14:13:28 -08:00
Alex Vandiver	a6d7539571	puppet: Replace debathena krb5 package with equivalent puppet file.	2022-01-18 14:13:28 -08:00
Alex Vandiver	75224ea5de	puppet: python-dev is now purely virtual; install python2.7-dev.	2022-01-18 14:13:28 -08:00
Alex Vandiver	fc1adef28a	puppet: Fix server_name of internal staging server.	2022-01-18 12:36:56 -08:00
Alex Vandiver	7e630b81f8	puppet: Switch to using snakeoil certs for staging. This parallels `ba3b88c81b`, but for the staging host.	2022-01-18 12:36:56 -08:00
Alex Vandiver	fb4d9764fa	puppet: Bump Grafana version, for 8.3.4 security release.	2022-01-18 12:33:02 -08:00
Alex Vandiver	434bda01c7	puppet: Enable camo prometheus metrics. Doing so requires protecting /metrics from direct access when proxied through nginx. If camo is placed on a separate host, the equivalent /metrics URL may need to be protected. See https://github.com/cactus/go-camo#metrics for details on the statistics so reported. Note that 5xx responses are _expected_ from go-camo's statistics, as it returns 502 status code when the remote server responds with 500/502/503/504, or 504 when the remote host times out.	2022-01-13 14:19:18 -08:00
Alex Vandiver	0b8a6a51b8	puppet: Remove all parts of AWS kernels. Otherwise, we just uninstall the meta-package, and still restart into the installed AWS kernel.	2022-01-12 15:52:19 -08:00
Alex Vandiver	4d7e6b26df	puppet: Provide more attributes to teleport on ssh nodes.	2022-01-12 14:15:45 -08:00
Alex Vandiver	339e70671c	puppet: Switch Grafana to Grafana 8 Unified Alerting.	2022-01-11 14:27:11 -08:00
Alex Vandiver	6a7eecee9a	puppet: Increase load paging thresholds.	2022-01-11 09:38:31 -08:00
Alex Vandiver	1e80b844f4	puppet: Disable apparmor profile for msmtp. As the nagios user, we want to read the msmtp configuration from ~nagios, which apparmor's profile does not allow msmtp to do.	2022-01-11 09:38:31 -08:00
Alex Vandiver	3c95ad82c6	puppet: Upgrade to nagios4. This updates the puppeted nagios configuration file for the Nagios4 defaults.	2022-01-11 09:38:31 -08:00
Alex Vandiver	d328d3dd4d	puppet: Allow routing camo requests through an outgoing proxy. Because Camo includes logic to deny access to private subnets, routing its requests through Smokescreen is generally not necessary. However, it may be necessary if Zulip has configured a non-Smokescreen exit proxy. Default Camo to using the proxy only if it is not Smokescreen, with a new `proxy.enable_for_camo` setting to override this behaviour if need be. Note that that setting is in `zulip.conf` on the host with Camo installed -- not the Zulip frontend host, if they are different. Fixes: #20550.	2022-01-07 12:08:10 -08:00
Alex Vandiver	2c5fc1827c	puppet: Standardize what values are bools, and what true is. For `no_serve_uploads`, `http_only`, which previously specified "non-empty" to enable, this tightens what values are true. For `pgroonga` and `queue_workers_multiprocess`, this broadens the possible values from `enabled`, and `true` respectively.	2022-01-07 12:08:10 -08:00
Alex Vandiver	1e672e4d82	puppet: Remove unused $no_serve_uploads in app_frontend.	2022-01-07 12:08:10 -08:00
Alex Vandiver	6218ed91c2	puppet: Use lazy-apps and uwsgi control sockets for rolling reloads. Restarting the uwsgi processes by way of supervisor opens a window during which nginx 502's all responses. uwsgi has a configuration called "chain reloading" which allows for rolling restart of the uwsgi processes, such that only one process at a time is unavailable; see uwsgi documentation ([1]). The tradeoff is that this requires that the uwsgi processes load the libraries after forking, rather than before ("lazy apps"); in theory this can lead to larger memory footprints, since they are not shared. In practice, as Django defers much of the loading, this is not as much of an issue. In a very basic test of memory consumption (measured by total memory - free - caches - buffers; 6 uwsgi workers), both immediately after restarting Django, and after requesting `/` 60 times with 6 concurrent requests: | Non-lazy | Lazy app | Difference ------------------+------------+------------+------------- Fresh | 2,827,216 | 2,870,480 | +43,264 After 60 requests | 3,332,284 | 3,409,608 | +77,324 ..................|............|............|............. Difference | +505,068 | +539,128 | +34,060 That is, "lazy app" loading increased the footprint pre-requests by 43MB, and after 60 requests grew the memory footprint by 539MB, as opposed to non-lazy loading, which grew it by 505MB. Using wsgi "lazy app" loading does increase the memory footprint, but not by a large percentage. The other effect is that processes may be served by either old or new code during the restart window. This may cause transient failures when new frontend code talks to old backend code. Enable chain-reloading during graceful, puppetless restarts, but only if enabled via a zulip.conf configuration flag. Fixes #2559. [1]: https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#chain-reloading-lazy-apps	2022-01-05 14:48:52 -08:00
Alex Vandiver	4a95967a33	puppet: Gather uwsgi stats from chat.zulip.org.	2022-01-03 21:26:57 -08:00
Alex Vandiver	8a5be972d2	puppet: Add a uwsgi exporter for monitoring. This allows investigation of how many workers are busy, and to track "harakiri" terminations.	2022-01-03 15:25:58 -08:00
Alex Vandiver	d6c40d24d4	puppet: Manage current smokescreen binary so it is not tidied. Fix another tidy error caused by 1e4e6a09af23; as also noted in `f9a39b6703`, these resources are necessary such that tidy does not clean up smokescreen, and then force a recompilation of it again.	2022-01-03 15:24:42 -08:00
Alex Vandiver	f9a39b6703	puppet: Manage extracted resources again. `1e4e6a09af` removed the resources for the unpacked directory, on the argument that they were unnecessary. However, the directory (or file, see below) that is unpacked must be managed, or it will be tidied on the next puppet apply. Add back the resource for `$dir`, but mark it `ensure => present`, to support tarballs which only unpack to a single file (e.g. wal-g).	2022-01-02 12:11:53 -08:00
Alex Vandiver	54b6a83412	puppet: Fix typo in cron job name.	2021-12-31 17:39:53 -08:00
Alex Vandiver	941800cf12	puppet: Upgrade external dependencies.	2021-12-31 11:14:40 -08:00
Alex Vandiver	6f693d10d9	puppet: Fix version of node_exporter. This was a copy/paste bug introduced in `f166f9f7d6`.	2021-12-30 23:33:34 +00:00
Anders Kaseorg	82748d45d8	install-yarn: Use test -ef in case /srv is a symlink. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-12-30 13:42:07 -08:00
Alex Vandiver	c094867a74	puppet: Add aarch64 build hashes to external dependencies. wal-g does not ship aarch64 binaries, currently; the compilation process([1]) is somewhat complicated, so we defer the decision about how to support wal-g for aarch64 until a later date. [1]: https://github.com/wal-g/wal-g/blob/master/docs/PostgreSQL.md#installing	2021-12-29 16:35:15 -08:00
Alex Vandiver	f166f9f7d6	puppet: Centralize versions and sha256 hashes of external dependencies. This will make it easier to update versions of these dependencies.	2021-12-29 16:35:15 -08:00
Alex Vandiver	57662689a9	puppet: Provide a constant homedir for grafana user. The homedir of a user cannot be changed if any processes are running as them, so having it change over time as upgrades happen will break puppet application, as the old grafana process under supervisor will effectively lock changes to the user's homedir. Unfortunately, that means that this change will thus fail to puppet-apply unless `supervisorctl stop grafana` is run first, but there's no way around that.	2021-12-29 16:35:15 -08:00
Alex Vandiver	6e55e52694	puppet: Pull out grafana $data_dir.	2021-12-29 16:35:15 -08:00
Alex Vandiver	51d3862c7e	puppet: Move wal-g to external_dep, in /srv/zulip-wal-g-*.	2021-12-29 16:35:15 -08:00
Alex Vandiver	1e4e6a09af	puppet: Stop making resources for external binaries and directories. In the event that extracting doesn't produce the binary we expected it to, all this will do is create an _empty_ file where we expect the binary to be. This will likely muddle debugging. Since the only reason the resource was made in the first place was to make dependencies clear, switch to depending on the External_Dep itself, when such a dependency is needed.	2021-12-29 16:35:15 -08:00
Alex Vandiver	3c163a7d5e	puppet: Move slash out of $dir by convention.	2021-12-29 16:35:15 -08:00
Alex Vandiver	bb5a2c8138	puppet: Move prometheus to external_dep.	2021-12-29 16:35:15 -08:00
Alex Vandiver	2d6c096904	puppet: Move node_exporter to external_dep.	2021-12-29 16:35:15 -08:00
Alex Vandiver	d2a78bac7e	puppet: Adjust wal-g release version and SHA256. wal-g apparently removed the 1.1.1 release; replace it with the equivalent rc.	2021-12-29 16:35:15 -08:00
Alex Vandiver	7a9074ecfd	puppet: Use shorter local variable for supervisor conf.d dir.	2021-12-28 09:24:01 -08:00
Alex Vandiver	670fad0cc4	puppet: Drop now-unnecessary supervisor file removals.	2021-12-28 09:24:01 -08:00
Alex Vandiver	20eab264cf	puppet: Remove dependency on scripts.lib.zulip_tools. `ab130ceb35` added a dependency on scripts.lib.zulip_tools; however, check_postgresql_replication_lag is run on hosts which do not have a zulip tree installed. Inline the simple functions that were imported.	2021-12-14 14:48:53 -08:00
Alex Vandiver	71b56f7c1c	puppet: process_fts_updates connects as nagios (or provided username). It should not use the configured zulip username, but should instead pull from the login user (likely `nagios`), or an explicit alternate provided PostgreSQL username. Failure to do so results in Nagios failures because the `nagios` login does not have permissions to authenticate as the `zulip` PostgreSQL user. This requires CI changes, as the install tests install as the `zulip` login username, which allowed Nagios tests to pass previously; with the custom database and username, however, they must be passed to process_fts_updates explicitly when validating the install.	2021-12-14 14:48:53 -08:00
Alex Vandiver	9d67e37166	puppet: Nagios connects as itself, in check_postgresql_replication_lag.	2021-12-14 14:48:53 -08:00
Alex Vandiver	850bc4cc81	puppet: Create directory for redis PID file. The Redis configuration, and the systemd file for it, assumes there will be a pid file written to `/var/run/redis/redis.pid`, but `/var/run/redis` is not created during installation. Create `/run/redis`; as `/var/run` is a symlink to `/run` on systemd systems, this is equivalent to `/var/run/redis`.	2021-12-13 12:42:15 -08:00
Alex Vandiver	a6c2079502	puppet: Create memcached PID file that systemd config file specifies. The systemd config file installed by the `memcached` package assumes there will be a PID written to `/run/memcached/memcached.pid`. Since we override `memcached.conf`, we have omitted the line that writes out the PID to this file. Systemd is smart enough to not _need_ the PID file to start up the service correctly, but match the configuration. We create the directory since the package does not do so. It is created as `/run/memcached` and not `/var/run/memcached` because `/var/run` is a symlink to `/run`.	2021-12-13 12:42:15 -08:00
Alex Vandiver	e4b23daad7	puppet: Upgrade to Grafana 8.3.2, for CVE-2021-43813.	2021-12-10 14:00:11 -08:00
Alex Vandiver	01e8f752a8	puppet: Use certbot package timer, not our own cron job. The certbot package installs its own systemd timer (and cron job, which disables itself if systemd is enabled) which updates certificates. This process races with the cron job which Zulip installs -- the only difference being that Zulip respects the `certbot.auto_renew` setting, and that it passes the deploy hook. This means that occasionally nginx would not be reloaded, when the systemd timer caught the expiration first. Remove the custom cron job and `certbot-maybe-renew` script, and reconfigure certbot to always reload nginx after deploying, using certbot directory hooks. Since `certbot.auto_renew` can't have an effect, remove the setting. In turn, this removes the need for `--no-zulip-conf` to `setup-certbot`. `--deploy-hook` is similarly removed, as running deploy hooks to restart nginx is now the default; pass `--no-directory-hooks` in standalone mode to not attempt to reload nginx. The other property of `--deploy-hook`, of skipping symlinking into place, is given its own flag.	2021-12-09 13:47:33 -08:00
Alex Vandiver	053682964e	puppet: Only fetch from running hosts in Grafana ec2 discovery.	2021-12-09 08:12:03 -08:00
Alex Vandiver	291f688678	puppet: Use zulip::external_dep for grafana, template config. Templating the config ensures that the service is restarted when it is upgraded.	2021-12-08 20:58:10 -08:00
Alex Vandiver	3eae429ab4	puppet: Upgrade Grafana to 8.3.1, for CVE-2021-43798.	2021-12-08 20:58:10 -08:00
Alex Vandiver	7db146d0a9	puppet: Do not assume amd64 architecture.	2021-12-06 11:08:50 -08:00
Alex Vandiver	fb2d05f9e3	puppet: Remove unused 'builder' files. These are leftover detritus from the "builder" host, which was removed in `4c9a283542`.	2021-12-06 10:21:50 -08:00
Alex Vandiver	cb2d0ff32b	postgresql: Support replication on PostgreSQL >= 11, document. PostgreSQL 11 and below used a configuration file named `recovery.conf` to manage replicas and standbys; support for this was removed in PostgreSQL 12[1], and the configuration parameters were moved into the main `postgresql.conf`. Add `zulip.conf` settings for the primary server hostname and replication username, so that the complete `postgresql.conf` configuration on PostgreSQL 14 can continue to be managed, even when replication is enabled. For consistency, also begin writing out the `recovery.conf` for PostgreSQL 11 and below. In PostgreSQL 12 configuration and later, the `wal_level = hot_standby` setting is removed, as `hot_standby` is equivalent to `replica`, which is the default value[2]. Similarly, the `hot_standby = on` setting is also the default[3]. Documentation is added for these features, and the commentary on the "Export and Import" page referencing files under `puppet/zulip_ops/` is removed, as those files no longer have any replication-specific configuration. [1]: https://www.postgresql.org/docs/current/recovery-config.html [2]: https://www.postgresql.org/docs/12/runtime-config-wal.html#GUC-WAL-LEVEL [3]: https://www.postgresql.org/docs/12/runtime-config-replication.html#GUC-HOT-STANDBY	2021-12-03 16:32:41 -08:00
Alex Vandiver	7d3399a970	puppet: Drop configuration files for unsupported PostgreSQL versions. These are both unsupported by PostgreSQL itself, as well as by Zulip; the removal of Ubuntu Xenial and Debian Stretch support in Zulip 3.0 removed the requirement for PostgreSQL 9.6, and the previous versions date back yet farther.	2021-12-03 16:32:41 -08:00
Alex Vandiver	6436c4087d	puppet: Tidy old wal-g binaries.	2021-12-03 16:17:50 -08:00
Alex Vandiver	53cc9538f7	puppet: Factor out wal-g binary path.	2021-12-03 16:17:50 -08:00
Alex Vandiver	338483792b	puppet: Upgrade wal-g release to 1.1.1.	2021-12-03 16:17:50 -08:00
Anders Kaseorg	325b4bac7e	env-wal-g: Quote $s3_backups_bucket. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-12-03 14:33:53 -08:00
Alex Vandiver	f31bf3f06c	puppet: Install camo on Docker. Now that go-camo runs within supervisor, it can be run in Docker simply. Fixes #20101. Fixes zulip/docker-zulip#179.	2021-12-02 09:25:00 -08:00
Alex Vandiver	358a7fb0c6	puppet: Read camo secret at startup time, not at puppet-apply time. Writing the secret to the supervisor configuration file means that changes to the secret require a zulip-puppet-apply to take hold. The Docker image is constructed to avoid having to run zulip-puppet-apply on startup, and indeed cannot run zulip-puppet-apply after having configured secrets, as it has replaced the zulip.conf file with a symlink, for example. This means that camo gets the static secret that was built into the image, and not the one regenerated on first startup. Read the camo secret at process startup time. Because this pattern is likely common with "12-factor" applications which can read from environment variables, write a generic tool to map secrets to environment variables before exec'ing a binary, and use that for Camo.	2021-12-02 09:25:00 -08:00
Alex Vandiver	86cf3be39f	puppet: Fix pgroonga init for custom database names and users.	2021-11-20 07:13:50 -08:00
Alex Vandiver	c514feaa22	puppet: Default go-camo to listening on localhost for standalone deploys. The default in the previous commit, inherited from camo, was to bind to 0.0.0.0:9292. In standalone deployments, camo is deployed on the same host as the nginx reverse proxy, and as such there is no need to open it up to other IPs. Make `zulip::camo` take an optional parameter, which allows overriding it in puppet, but skips a `zulip.conf` setting for it, since it is unlikely to be adjusted by most users.	2021-11-19 15:58:26 -08:00
Alex Vandiver	b982222e03	camo: Replace with go-camo implementation. The upstream of the `camo` repository[1] has been unmaintained for several years, and is now archived by the owner. Additionally, it has a number of limitations: - It is installed as a sysinit service, which does not run under Docker - It does not prevent access to internal IPs, like 127.0.0.1 - It does not respect standard `HTTP_proxy` environment variables, making it unable to use Smokescreen to prevent the prior flaw - It occasionally just crashes, and thus must have a cron job to restart it. Swap camo out for the drop-in replacement go-camo[2], which has the same external API, requiring no changes to Django code, but is more maintained. Additionally, it resolves all of the above complaints. go-camo is not configured to use Smokescreen as a proxy, because its own private-IP filtering prevents using a proxy which lies within that IP space. It is also unclear if the addition of Smokescreen would provide any additional protection over the existing IP address restrictions in go-camo. go-camo has a subset of the security headers that our nginx reverse proxy sets, and which camo set; provide the missing headers with `-H` to ensure that go-camo, if exposed from behind some other non-nginx load-balancer, still provides the necessary security headers. Fixes #18351 by moving to supervisor. Fixes zulip/docker-zulip#298 also by moving to supervisor. [1] https://github.com/atmos/camo [2] https://github.com/cactus/go-camo	2021-11-19 15:58:26 -08:00
Alex Vandiver	c33562f0a8	puppet: Default to installing smokescreen on application frontends. This is an additional security hardening step, to make Zulip default to preventing SSRF attacks. The overhead of running Smokescreen is minimal, and there is no reason to force deployments to take additional steps in order to secure themselves against SSRF attacks. Deployments which already have a different external proxy configured will not gain a local Smokescreen installation, and running without Smokescreen is supported by explicitly unsetting the `host` or `port` values in `/etc/zulip/zulip.conf`.	2021-11-19 15:29:28 -08:00
Alex Vandiver	44f1ea6bae	puppet: Split smokescreen into a non-profile version. In a subsequent commit, we intend to include it from `zulip::app_frontend_base`, which is a layering violation if it only exists in the form of a profile.	2021-11-19 15:29:28 -08:00
Alex Vandiver	c2ed3c22b5	puppet: Remove unused smokescreen symlink.	2021-11-19 15:29:28 -08:00
Alex Vandiver	47e16a5d41	puppet: Tidy old smokescreen binaries.	2021-11-19 15:29:28 -08:00
Alex Vandiver	239ac8413e	puppet: Embed golang version into binary path, to rebuild on new golang. This will cause the output binary path to be sensitive to golang version, causing it to be rebuilt on new golang, and an updated supervisor config file written out, and thus supervisor also restarted.	2021-11-19 15:29:28 -08:00
Alex Vandiver	216eeba2dd	puppet: Factor out smokescreen binary path.	2021-11-19 15:29:28 -08:00
Alex Vandiver	3a7cef6582	puppet: Switch smokescreen to using zulip::external_dep, so it tidies.	2021-11-19 15:29:28 -08:00
Alex Vandiver	ea08111d60	puppet: Move /srv/smokescreen-src to /srv/zulip-smokescreen-src. As with the previous commit for `/srv/golang`, we have the custom of namespacing things under `/srv` with `zulip-` to help ensure that we play nice with anything else that happens to be on the host.	2021-11-19 15:29:28 -08:00
Anders Kaseorg	c64e1adb19	puppet: Upgrade Smokescreen v0.0.2-59-gbfca45c to v0.0.2-63-gdc40301. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-11-19 15:29:28 -08:00
Alex Vandiver	bb9d2df1ae	puppet: Extract an external-tarball-dependency manifest.	2021-11-19 15:29:28 -08:00
Alex Vandiver	3c8d7e2598	puppet: Tidy old golang directories. This relies on behavior which is only in Puppet 5.5.1 and above, which means it must be skipped on Ubuntu 18.04.	2021-11-19 15:29:28 -08:00
Alex Vandiver	2fc4acdf81	puppet: Move /srv/golang to /srv/zulip-golang. We have the custom of namespacing things under `/srv` with `zulip-` to help ensure that we play nice with anything else that happens to be on the host.	2021-11-19 15:29:28 -08:00
Alex Vandiver	00a4abb642	puppet: Switch dependency to the golang binary we need.	2021-11-19 15:29:28 -08:00
Alex Vandiver	2d5f813094	puppet: Stop making a /srv/golang symlink. Nothing needs this extra directory.	2021-11-19 15:29:28 -08:00
Alex Vandiver	93af6c7f06	puppet: Factor out golang variables.	2021-11-19 15:29:28 -08:00
Alex Vandiver	21be36f15f	puppet: Shorten golang version variable name.	2021-11-19 15:29:28 -08:00
Alex Vandiver	6b9e74adee	puppet: Upgrade golang from 1.16.4 to 1.17.3.	2021-11-19 15:29:28 -08:00
Alex Vandiver	514801c509	puppet: Split out golang toolchain into its own manifest.	2021-11-19 15:29:28 -08:00
Alex Vandiver	610a0b2d59	nagios: `pg_is_in_recovery()` is better to know replica/primary status. It is possible to be in recovery, and downloading WAL logs from archives, and not yet be replicating. If one only checks the streaming log status, it reports as "no replicas" which is technically accurate but not a useful summation of the state of the replica.	2021-11-17 13:38:26 -08:00
Alex Vandiver	83091cbc96	puppet: Swap the one use of the `cron` resource for an /etc/cron.d file. The `cron` resource places its contents in the user's crontab, which makes it unlike every other cron job that Zulip installs. Switch to using `/etc/cron.d` files, like all other cron jobs.	2021-11-16 16:17:32 -08:00
Alex Vandiver	90e1a0400e	puppet: Add a few more inter-resource dependencies. None of these are important; they just express semantic dependencies.	2021-11-16 16:17:32 -08:00
Alex Vandiver	49ad188449	rate_limit: Add a flag to lump all TOR exit node IPs together. TOR users are legitimate users of the system; however, that system can also be used for abuse -- specifically, by evading IP-based rate-limiting. For the purposes of IP-based rate-limiting, add a RATE_LIMIT_TOR_TOGETHER flag, defaulting to false, which lumps all requests from TOR exit nodes into the same bucket. This may allow a TOR user to deny other TOR users access to the find-my-account and new-realm endpoints, but this is a low cost for cutting off a significant potential abuse vector. If enabled, the list of TOR exit nodes is fetched from their public endpoint once per hour, via a cron job, and cached on disk. Django processes load this data from disk, and cache it in memcached. Requests are spared from the burden of checking disk on failure via a circuitbreaker, which trips if there are two failures in a row, and only begins trying again after 10 minutes.	2021-11-16 11:42:00 -08:00
Alex Vandiver	01c007ceaf	puppet: Remove an out-of-date comment. Comment was missed in `9d57fa9759`.	2021-11-09 21:52:17 -08:00
Alex Vandiver	7af2fa2e92	puppet: Use sysv status command, not supervisorctl status. Since Supervisor 4, which is installed on Ubuntu 20.04 and Debian 11, `supervisorctl status` returns exit code 3 if any of the supervisor-controlled processes are not running. Using `supervisorctl status` as the Puppet `status` command for Supervisor leads to unnecessarily trying to "start" a Supervisor process which is already started, but happens to have one or more of its managed processes stopped. This is an unnecessary no-op in production environments, but in docker-init environments, such as in CI, attempting to start the process a second time is an error. Switch to checking if supervisor is running by way of sysv init. This fixes the potential error in CI, as well as eliminates unnecessary "starts" of supervisor when it was already running -- a situation which made zulip-puppet-apply not idempotent: ``` root@alexmv-prod:~# supervisorctl status process-fts-updates STOPPED Nov 10 12:33 AM smokescreen RUNNING pid 1287280, uptime 0:35:32 zulip-django STOPPED Nov 10 12:33 AM zulip-tornado STOPPED Nov 10 12:33 AM [...] root@alexmv-prod:~# ~zulip/deployments/current/scripts/zulip-puppet-apply --force Notice: Compiled catalog for alexmv-prod.zulipdev.org in environment production in 2.32 seconds Notice: /Stage[main]/Zulip::Supervisor/Service[supervisor]/ensure: ensure changed 'stopped' to 'running' Notice: Applied catalog in 0.91 seconds root@alexmv-prod:~# ~zulip/deployments/current/scripts/zulip-puppet-apply --force Notice: Compiled catalog for alexmv-prod.zulipdev.org in environment production in 2.35 seconds Notice: /Stage[main]/Zulip::Supervisor/Service[supervisor]/ensure: ensure changed 'stopped' to 'running' Notice: Applied catalog in 0.92 seconds ```	2021-11-09 21:52:17 -08:00
Alex Vandiver	8a1bb43b23	puppet: Adjust for templated paths and settings, set C.UTF-8 locale.	2021-11-08 18:21:46 -08:00
Alex Vandiver	d3e9a71d42	puppet: Check in upstream PostgreSQL 14 configuration file. Note that one `<%u%%d>` has to be escaped as `<%%u%%d>`.	2021-11-08 18:21:46 -08:00
Adam Benesh	c881430f4c	puppet: Add WSGIApplicationGroup config to Apache SSO example. Zulip apparently is now affected by a bad interaction between Apache's WSGI using Python subinterpreters and C extension modules like `re2` that are not designed for it. The solution is apparently to set WSGIApplicationGroup to %{GLOBAL}, which disables Apache's use of Python subinterpreters. See https://serverfault.com/questions/514242/non-responsive-apache-mod-wsgi-after-installing-scipy/514251#514251 for background. Fixes #19924.	2021-10-08 15:07:23 -07:00
Tim Abbott	33b5fa633a	process_fts_updates: Fix docker-zulip support. In the series of migrations to this tool's configuration to support specifying an arbitrary database name (e.g. `c17f502bb0`), we broke support for running process_fts_updates on the application server, connected to a remote database server. That workflow is used by docker-zulip and presumably other settings like Amazon RDS. The fix is to import the Zulip virtualenv (if available) when running on an application server. This is better than just supporting this case, since both docker-zulip and an Amazon RDS database are settings where it would be inconvenient to run process-fts-updates directly on the database server. (In the former case, because we want to avoid having a strong version dependency on the postgres container). Details are available in this conversation: https://chat.zulip.org/#narrow/stream/49-development-help/topic/Logic.20in.20process_fts_updates.20seems.20to.20be.20broken/near/1251894 Thanks to Erik Tews for reporting and help in debugging this issue.	2021-09-27 18:17:33 -05:00
Alex Vandiver	1806e0f45e	puppet: Remove zulip.org configuration.	2021-08-26 17:21:31 -07:00
Alex Vandiver	27881babab	puppet: Increase prometheus storage, from the default 15d.	2021-08-24 23:40:43 -07:00
Alex Vandiver	faf71eea41	upgrade-postgresql: Do not remove other supervisor configs. We previously used `zulip-puppet-apply` with a custom config file, with an updated PostgreSQL version but more limited set of `puppet_classes`, to pre-create the basic settings for the new cluster before running `pg_upgradecluster`. Unfortunately, the supervisor config uses `purge => true` to remove all SUPERVISOR configuration files that are not included in the puppet configuration; this leads to it removing all other supervisor processes during the upgrade, only to add them back and start them during the second `zulip-puppet-apply`. It also leads to `process-fts-updates` not being started after the upgrade completes; this is the one supervisor config file which was not removed and re-added, and thus the one that is not re-started due to having been re-added. This was not detected in CI because CI added a `start-server` command which was not in the upgrade documentation. Set a custom facter fact that prevents the `purge` behaviour of the supervisor configuration. We want to preserve that behaviour in general, and using `zulip-puppet-apply` continues to be the best way to pre-set-up the PostgreSQL configuration -- but we wish to avoid that behaviour when we know we are applying a subset of the puppet classes. Since supervisor configs are no longer removed and re-added, this requires an explicit start-server step in the instructions after the upgrades complete. This brings the documentation into alignment with what CI is testing.	2021-08-24 19:00:58 -07:00
Alex Vandiver	e46e862f2b	puppet: Add a bare-bones zulipbot profile. This sets up the firewalls appropriate for zulipbot, but does not automate any of the configuration of zulipbot itself.	2021-08-24 16:05:58 -07:00
Alex Vandiver	5857dcd9b4	puppet: Configure ip6tables in parallel to ipv4. Previously, IPv6 firewalls were left at the default all-open. Configure IPv6 equivalently to IPv4.	2021-08-24 16:05:46 -07:00
Alex Vandiver	845509a9ec	puppet: Be explicit that existing iptables are only ipv4.	2021-08-24 16:05:46 -07:00
Anders Kaseorg	09564e95ac	mypy: Add types-psycopg2. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-08-09 20:32:19 -07:00
Alex Vandiver	4dd289cb9d	puppet: Enable prometheus monitoring of supervisord. To be able to read the UNIX socket, this requires running node_exporter as zulip, not as prometheus.	2021-08-03 21:47:02 -07:00
Alex Vandiver	aa940bce72	puppet: Disable hwmon collector, which does nothing on cloud hosts.	2021-08-03 21:47:02 -07:00
Alex Vandiver	23a355df0f	puppet: Move backup time earlier, from 10am to 7pm America/Los_Angeles. This is less likely to overlap with common evening deploy times.	2021-08-03 18:32:45 -05:00
Alex Vandiver	e94b6afb00	nagios: Remove broken check_email_deliverer_* checks and related code. These checks suffer from a couple notable problems: - They are only enabled on staging hosts -- where they should never be run. Since `ef6d0ec5ca`, these supervisor processes are only run on one host, and never on the staging host. - They run as the `nagios` user, which does not have appropriate permissions, and thus the checks always fail. Specifically, `nagios` does not have permissions to run `supervisorctl`, since the socket is owned by the `zulip` user, and mode 0700; and the `nagios` user does not have permission to access Zulip secrets to run `./manage.py print_email_delivery_backlog`. Rather than rewrite these checks to run on a cron as zulip, and check those file contents as the nagios user, drop these checks -- they can be rewritten at a later point, or replaced with Prometheus alerting, and currently serve only to cause always-failing Nagios checks, which normalizes alert failures. Leave the files installed if they currently exist, rather than cluttering puppet with `ensure => absent`; they do no harm if they are left installed.	2021-08-03 16:07:13 -07:00
Mateusz Mandera	57f14b247e	bots: Specify realm for nagios bots messages in check_send_receive_time.	2021-07-26 15:33:13 -07:00
Alex Vandiver	befe204be4	puppet: Run the supervisor-restart step only after it is started. In an initial install, the following is a potential rule ordering: ``` Notice: /Stage[main]/Zulip::Supervisor/File[/etc/supervisor/conf.d/zulip]/ensure: created Notice: /Stage[main]/Zulip::Supervisor/File[/etc/supervisor/supervisord.conf]/content: content changed '{md5}99dc7e8a1178ede9ae9794aaecbca436' to '{md5}7ef9771d2c476c246a3ebd95fab784cb' Notice: /Stage[main]/Zulip::Supervisor/Exec[supervisor-restart]: Triggered 'refresh' from 1 event [...] Notice: /Stage[main]/Zulip::App_frontend_base/File[/etc/supervisor/conf.d/zulip/zulip.conf]/ensure: defined content as '{md5}d98ac8a974d44efb1d1bb2ef8b9c3dee' [...] Notice: /Stage[main]/Zulip::App_frontend_once/File[/etc/supervisor/conf.d/zulip/zulip-once.conf]/ensure: defined content as '{md5}53f56ae4b95413bfd7a117e3113082dc' [...] Notice: /Stage[main]/Zulip::Process_fts_updates/File[/etc/supervisor/conf.d/zulip/zulip_db.conf]/ensure: defined content as '{md5}96092d7f27d76f48178a53b51f80b0f0' Notice: /Stage[main]/Zulip::Supervisor/Service[supervisor]/ensure: ensure changed 'stopped' to 'running' ``` The last line is misleading -- supervisor was already started by the `supervisor-restart` process on the third line. As can be shown with `zulip-puppet-apply --debug`, the last line just installs supervisor to run on startup, using `systemctl`: ``` Debug: Executing: 'supervisorctl status' Debug: Executing: '/usr/bin/systemctl unmask supervisor' Debug: Executing: '/usr/bin/systemctl start supervisor' ``` This means the list of processes started by supervisor depends entirely on which configuration files were successfully written out by puppet before the initial `supervisor-restart` ran. Since `zulip_db.conf` is written later than the rest, the initial install often fails to start the `process-fts-updates` process. 
In this state, an explicit `supervisorctl restart` or `supervisorctl reread && supervisorctl update` is required for the service to be found and started. Reorder the `supervisor-restart` exec to only run after the service is started. Because all supervisor configuration files have a `notify` of the service, this forces the ordering of: ``` (package) -> (config files) -> (service) -> (optional restart) ``` On first startup, this will start and then immediately restart supervisor, which is unfortunate but unavoidable -- and not terribly relevant, since the database will not have been created yet, and thus most processes will be in a restart loop for failing to connect to it.	2021-07-22 14:09:01 -07:00
Alex Vandiver	ee7c849f8a	puppet: Work around sysvinit supervisor init bug. The sysvinit script for supervisor has a long-standing bug where `/etc/init.d/supervisor restart` stops but does not then start the supervisor process. Work around this by making restart then try to start, and return if it is currently running.	2021-07-22 14:09:01 -07:00
Alex Vandiver	7e65421b1f	puppet: Ensure psycopg2 is installed before running process_fts_updates. Not having the package installed will cause startup failures in `process_fts_updates`; ensure that we've installed the package before we potentially start the service.	2021-07-14 17:24:52 -07:00
Alex Vandiver	528e5adaab	smokescreen: Default to only listening on 127.0.0.1. This prevents Smokescreen from acting as an open proxy. Fixes #19214.	2021-07-14 15:40:26 -07:00
Alex Vandiver	e6bae4f1dd	puppet: Remove zulip::nagios class. `93f62b999e` removed the last file in puppet/zulip/files/nagios_plugins/zulip_nagios_server, which means the singular rule in zulip::nagios no longer applies cleanly. Remove the `zulip::nagios` class, as it is no longer needed.	2021-07-09 17:29:41 -07:00
Anders Kaseorg	93f62b999e	nagios: Replace check_website_response with standard check_http plugin. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-07-09 16:47:03 -07:00
Vishnu KS	e0f5fadb79	billing: Downgrade small realms that are behind on payments. An organization with at most 5 users that is behind on payments isn't worth spending time on investigating the situation. For larger organizations, we likely want somewhat different logic that at least does not void invoices.	2021-07-02 13:19:12 -07:00
Anders Kaseorg	91bfebca7d	install: Replace wget with curl. curl uses Happy Eyeballs to avoid long timeouts on systems with broken IPv6. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-06-25 09:05:07 -07:00
Anders Kaseorg	3b60b25446	ci: Remove bullseye hack. base-files 11.1 marked bullseye as Debian 11 in /etc/os-release. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-06-24 14:35:51 -07:00
Alex Vandiver	d51272cc3d	puppet: Remove zulip_deliver_scheduled_* from zulip-workers:. Staging and other hosts that are `zulip::app_frontend_base` but not `zulip::app_frontend_once` do not have a /etc/supervisor/conf.d/zulip/zulip-once.conf and as such do not have `zulip_deliver_scheduled_emails` or `zulip_deliver_scheduled_messages` and thus supervisor will fail to reload. Making the contents of `zulip-workers` contingent on if the server is _also_ a `-once` server is complicated, and would involve using Concat fragments, which severely limit readability. Instead, expel those two from `zulip-workers`; this is somewhat reasonable, since they use an entirely different codepath from zulip_events_, using the database rather than RabbitMQ for their queuing.	2021-06-14 17:12:59 -07:00
Alex Vandiver	6c72698df2	puppet: Move zulip_ops supervisor config into /etc/supervisor/conf.d/zulip/. This is similar cleanup to `3ab9b31d2f`, but only affects zulip_ops services; it serves to ensure that any of these services which are no longer enabled are automatically removed from supervisor. Note that this will cause a supervisor restart on all affected hosts, which will restart all supervisor services.	2021-06-14 17:12:59 -07:00
Alex Vandiver	df09607202	puppet: Switch to $zulip::common::supervisor_conf_dir variable.	2021-06-14 17:12:59 -07:00
Alex Vandiver	391f78a9c1	puppet: Move supervisor-not-in-/etc/supervisor/conf.d/ to common place.	2021-06-14 17:12:59 -07:00
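
The circuit breaker described in `49ad188449` (trip after two failures in a row, begin trying again only after 10 minutes) can be sketched roughly as below. This is an illustrative sketch, not the code from that commit: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, `reset_seconds`, and `clock` are hypothetical, and resetting the failure count once the wait has elapsed is an assumption about the retry behavior.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the wrapped call is skipped."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then
    refuses calls until a fixed wait has elapsed."""

    def __init__(
        self,
        max_failures: int = 2,          # "two failures in a row"
        reset_seconds: float = 600.0,   # "after 10 minutes"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                # Still inside the wait window: do not touch the resource.
                raise CircuitOpenError("breaker is open; call skipped")
            # Wait elapsed: close the breaker and start counting afresh
            # (an assumption; a stricter design would re-trip on one failure).
            self.opened_at = None
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success breaks the run of consecutive failures.
        self.failures = 0
        return result
```

A caller would wrap the expensive read, for example `breaker.call(load_exit_nodes)` with a hypothetical `load_exit_nodes` function, and treat `CircuitOpenError` as "no data available right now" instead of retrying the disk read on every request.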


1646 Commits