zulip

Commit Graph

Author	SHA1	Message	Date
Tim Abbott	5e7d61464d	puppet: Include trusted-proto definition in zulip_ops configurations. This should have been part of `0935d388f0`.	2023-05-29 15:13:45 -07:00
Alex Vandiver	8d8b5935ac	puppet: Prevent unattended upgrades of erlang-base. When upgraded, the `erlang-base` package automatically stops all services which depend on the Erlang runtime; for Zulip, this is the `rabbitmq-server` service. This results in an unexpected outage of Zulip. Block unattended upgrades of the `erlang-base` package.	2023-05-16 14:02:06 -07:00
Alex Vandiver	3aba2789d3	prometheus: Add an exporter for wal-g backup properties. Since backups may now be taken on arbitrary hosts, we need a blackbox monitor that _some_ backup was produced. Add a Prometheus exporter which calls `wal-g backup-list` and reports statistics about the backups. This could be extended to include `wal-g wal-verify`, but that requires a connection to the PostgreSQL server.	2023-04-26 15:41:39 -07:00
Alex Vandiver	cace8858f9	puppet: Move logrotate config into app_frontend_base. `7c023042cf` moved the logrotate configuration to being a templated file, from a static file, but missed that the static file was still referenced from `zulip_ops::app_frontend`; it only updated `zulip::profile::app_frontend`. This caused errors in applying puppet on any `zulip_ops::app_frontend` host. Prior to `7c023042cf`, the Puppet role was identical between those two classes; deduplicate the rule by moving the updated template definition into `zulip::app_frontend_base` which is common to those two classes and not used in any other classes.	2023-04-19 09:34:37 -07:00
Alex Vandiver	d0fc3f1c2e	puppet: Add prod hooks to push zulip-cloud-current and notify CZO.	2023-04-12 11:36:33 -07:00
Tim Abbott	561daee2a1	puppet: Update declared zmirror dependencies. Following zulip/python-zulip-api/pull/758/, we're no longer using python-zephyr, and don't need to build it from source. Additionally, we no longer need to build a forked Zephyr package, since ZLoadSession and ZDumpSession were merged in `e6a545e759`.	2023-04-06 09:45:06 -07:00
Alex Vandiver	6975417acf	puppet: Create zmirror supervisor subdirectory. To not change the `supervisor.conf` file, which requires a restart of supervisor (and thus all services running under it, which is extremely disruptive) we carefully leave the contents unchanged for most installs, and append a new piece to the file, only for the zmirror configuration, using `concat`.	2023-04-06 09:45:06 -07:00
Alex Vandiver	8a771c7ac0	hooks: Add a hook to send a Zulip before/after the deploy.	2023-04-05 18:51:55 -04:00
Alex Vandiver	89e366771a	prometheus: Add a postgres exporter.	2023-03-30 16:16:18 -07:00
Alex Vandiver	c2beb64a79	prometheus: Consistently import the base class and supervisor, if needed.	2023-03-30 16:16:18 -07:00
Alex Vandiver	3feb536df3	nagios: Remove swap check. Swap usage is not a high signal thing to alert on, and is likely to flap.	2023-03-27 15:10:50 -07:00
Alex Vandiver	f2a20b56bc	puppet: Enable sentry hooks for production and staging.	2023-03-17 08:10:31 -07:00
Alex Vandiver	1a65315566	puppet: Switch teleport to running under systemd, not supervisord. There is no reason that the base node access method should be run under supervisor, which exists primarily to give access to the `zulip` user to restart its managed services. This access is unnecessary for Teleport, and also causes unwanted restarts of Teleport services when the `supervisor` base configuration changes. Additionally, supervisor does not support the in-place upgrade process that Teleport uses, as it replaces its core process with a new one. Switch to installing a systemd configuration file (as generated by `teleport install systemd`) for each part of Teleport, customized to pass a `--config` path. As such, we explicitly disable the `teleport` service provided by the package. The supervisor process is shut down by dint of no longer installing the file, which purges it from the managed directory, and reloads Supervisor to pick up the removed service.	2023-03-15 17:23:42 -04:00
Alex Vandiver	044ccdb334	chat.zulip.org: Enable Sentry hook.	2023-02-14 17:20:35 -05:00
Alex Vandiver	e8123dfeea	puppet: Match the `x` bits on directories to what puppet actually does. Puppet _always_ sets the `+x` bit on directories if they have the `r` bit set for that slot[1]: "When specifying numeric permissions for directories, Puppet sets the search permission wherever the read permission is set." As such, for instance, `0640` is actually applied as `0750`. Fix what we "want" to match what puppet is applying, by adding the `x` bit. In none of these cases did we actually intend the directory to not be executable. [1] https://www.puppet.com/docs/puppet/5.5/types/file.html#file-attribute-mode	2023-01-26 15:06:01 -08:00
Alex Vandiver	372bba4a8e	puppet: Stop creating a /home/zulip/logs. This was last really used in `d7a3570c7e`, in 2013, when it was `/home/humbug/logs`. Repoint the one obscure piece of tooling that writes there, and remove the places that created it.	2023-01-26 15:06:01 -08:00
Alex Vandiver	d0de66b273	puppet: Remove "ensure => absent" rules which have all been applied.	2023-01-24 13:05:24 -08:00
Alex Vandiver	04cf68b45e	uploads: Serve S3 uploads directly from nginx. When file uploads are stored in S3, this means that Zulip serves as a 302 to S3. Because browsers do not cache redirects, this means that no image contents can be cached -- and upon every page load or reload, every recently-posted image must be re-fetched. This incurs extra load on the Zulip server, as well as potentially excessive bandwidth usage from S3, and on the client's connection. Switch to fetching the content from S3 in nginx, and serving the content from nginx. These have `Cache-control: private, immutable` headers set on the response, allowing browsers to cache them locally. Because nginx fetching from S3 can be slow, and requests for uploads will generally be bunched around when a message containing them is first posted, we instruct nginx to cache the contents locally. This is safe because uploaded file contents are immutable; access control is still mediated by Django. The nginx cache key is the URL without query parameters, as those parameters include a time-limited signed authentication parameter which lets nginx fetch the non-public file. This adds a number of nginx-level configuration parameters to control the caching which nginx performs, including the amount of in-memory index for the cache, the maximum storage of the cache on disk, and how long data is retained in the cache. The currently-chosen figures are reasonable for small to medium deployments. The most notable effect of this change is in allowing browsers to cache uploaded image content; however, while there will be many fewer requests, it also has an improvement on request latency. The following tests were done with a non-AWS client in SFO, a server and S3 storage in us-east-1, and with 100 requests after 10 requests of warm-up (to fill the nginx cache). The mean and standard deviation are shown.
|                   | Redirect to S3      | Caching proxy, hot  | Caching proxy, cold |
| ----------------- | ------------------- | ------------------- | ------------------- |
| Time in Django    | 263.0 ms ± 28.3 ms  | 258.0 ms ± 12.3 ms  | 258.0 ms ± 12.3 ms  |
| Small file (842b) | 586.1 ms ± 21.1 ms  | 266.1 ms ± 67.4 ms  | 288.6 ms ± 17.7 ms  |
| Large file (660k) | 959.6 ms ± 137.9 ms | 609.5 ms ± 13.0 ms  | 648.1 ms ± 43.2 ms  |
The hot-cache performance is faster for both large and small files, since it saves the client the time of having to make a second request to a separate host. This performance improvement remains at least 100ms even if the client is on the same coast as the server. Cold nginx caches are only slightly slower than hot caches, because VPC access to S3 endpoints is extremely fast (assuming it is in the same region as the host), and nginx can pool connections to S3 and reuse them. However, all of the 648ms taken to serve a cold-cache large file is occupied in nginx, as opposed to the only 263ms which was spent in nginx when using redirects to S3. This means that to overall spend less time responding to uploaded-file requests in nginx, clients will need to find files in their local cache, and skip making an uploaded-file request, at least 60% of the time. Modeling shows a reduction in the number of client requests by about 70% - 80%. The `Content-Disposition` header logic can now also be entirely shared with the local-file codepath, as can the `url_only` path used by mobile clients. While we could provide the direct-to-S3 temporary signed URL to mobile clients, we choose to provide the served-from-Zulip signed URL, to better control caching headers on it, and greater consistency. In doing so, we adjust the salt used for the URL; since these URLs are only valid for 60s, the effect of this salt change is minimal.	2023-01-09 18:23:58 -05:00
Anders Kaseorg	f3f5dfb5aa	ruff: Fix RUF004 exit() is only available in the interpreter. ‘exit’ is pulled in for the interactive interpreter as a side effect of the site module; this can be disabled with python -S and shouldn’t be relied on. Also, use the NoReturn type where appropriate. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-12-04 22:11:24 -08:00
Alex Vandiver	521ec5885b	puppet: Rename autossh tunnel, as it is no longer for just munin.	2022-11-01 22:24:40 -07:00
Alex Vandiver	42f84a8cc7	puppet: Use existing autossh tunnels as OpenSSH "master" sockets. A number of autossh connections are already left open for port-forwarding Munin ports; autossh starts the connections and ensures that they are automatically restarted if they are severed. However, this represents a missed opportunity. Nagios's monitoring uses a large number of SSH connections to the remote hosts to run commands on them; each of these connections requires doing a complete SSH handshake and authentication, which can have non-trivial network latency, particularly for hosts which may be located far away, in a network topology sense (up to 1s for a no-op command!). Use OpenSSH's ability to multiplex multiple connections over a single socket, to reuse the already-established connection. We leave an explicit `ControlMaster no` in the general configuration, and not `auto`, as we do not wish any of the short-lived Nagios connections to get promoted to being a control socket if the autossh is not running for some reason. We enable protocol-level keepalives, to give a better chance of the socket being kept open.	2022-11-01 22:24:40 -07:00
Alex Vandiver	e05a0dcf98	puppet: Support FQDNs in puppet zulip.conf names.	2022-11-01 22:24:40 -07:00
Alex Vandiver	951dc68f3a	autossh: Drop unnecessary -2 option. The -2 option is a no-op.	2022-11-01 22:24:40 -07:00
Anders Kaseorg	7666ff603d	sharding: Configure Tornado sharding with nginx map. https://nginx.org/en/docs/http/ngx_http_map_module.html Since Puppet doesn’t manage the contents of nginx_sharding.conf after its initial creation, it needs to be renamed so we can give it different default contents. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-09-15 16:07:50 -07:00
Anders Kaseorg	0da0ee3c92	puppet: Remove nginx configuration for zulip.org. This is unused since commit `1806e0f45e` (#19625). Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-09-01 10:03:18 -07:00
Alex Vandiver	a9183d2208	grafana: Enable auto-sign-up. This avoids the need to explicitly create new users in Grafana, by simply trusting Teleport.	2022-07-19 17:52:17 -07:00
Alex Vandiver	9bd88a93e2	puppet: Tell needrestart to not default to restarting core services. The `needrestart` tool added in 22.04 is useful in terms of listing which services may need to be restarted to pick up updated libraries. However, it prompts about the current state of services needing restart for every subsequent `apt-get upgrade`, and defaulting core services to restarting requires carefully manually excluding them every time, at risk of causing an unscheduled outage. Build a list of default-off services based on the list in unattended-upgrades.	2022-07-19 17:51:18 -07:00
Alex Vandiver	7ae3708c02	teleport: Add explicit WebAuthn config, not just U2F. WebAuthn is the default, replacing U2F, in Teleport 10 and above[1]. While Teleport can derive a WebAuthn configuration from a U2F configuration[2], it's useful to be explicit. [1]: https://goteleport.com/docs/access-controls/guides/webauthn/ [2]: https://goteleport.com/docs/access-controls/guides/webauthn/#u2f	2022-07-18 11:41:00 -07:00
Alex Vandiver	120de1dca9	zephyr: Write out unix timestamp in check, as check_cron_file expects. A follow-up fix to `8bc26aab08`.	2022-06-30 11:12:26 -07:00
Alex Vandiver	8577adcf2e	cron: Remove unused STATE_FILE environment variable.	2022-06-22 12:07:38 -07:00
Alex Vandiver	8bc26aab08	nagios: Switch check_user_zephyr_mirror_liveness to run via cron. This check loads Django, and as such must be run as the zulip user. Repeat the same pattern used elsewhere in nagios, of writing a state file, which is read by `check_cron_file`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	41deef40cf	nagios: Switch to generic check_cron_file for queues and consumers. These share a common root; `91da4bd59b` duplicated the code, but didn't move the existing uses to the new utility.	2022-06-22 12:07:38 -07:00
Alex Vandiver	8fbde9b8c5	nagios: Only run check_fts_update_log on one PostgreSQL host. The data is the same in the table in all replicas -- there is no need to alert on all of them.	2022-06-22 12:07:38 -07:00
Alex Vandiver	499284d2fd	nagios: Split postgresql into primary and replica. Replication checks should only run on primary and replicas, not standalone hosts; while `autovac_freeze` currently only runs on primary hosts, it functions identically on replicas, and is fine to run there. Make `autovac_freeze` run on all `postgresql` hosts, and make standalone hosts no longer `postgres_primary`, so they do not fail the replication tests.	2022-06-22 12:07:38 -07:00
Alex Vandiver	38e435347b	nagios: Add missing queue consumer checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	e01a4242aa	nagios: Sort queue consumer checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	2c90c7a010	nagios: Switch `check_remote_arg_string` queue checks to consumer checks. Checks of this style just look for matching process names using `check_remote_arg_string`, which dates to `8edbd64bb8`. These were added because the original two (`missedmessage_emails` and `slow_queries`) did not create consumers, instead polling for events. Switch these to checking the queue consumer counts that the `check-rabbitmq-consumers` check is already writing out. Since the `missedmessage_emails` was _already_ checked via the consumer check, a duplicate is not added.	2022-06-22 12:07:38 -07:00
Alex Vandiver	f48d543d9b	nagios: Make and use a "rabbitmq-consumer-service" template service.	2022-06-22 12:07:38 -07:00
Alex Vandiver	775a084d0f	nagios: Add a catchall "other" set.	2022-06-22 12:07:38 -07:00
Alex Vandiver	83c82c8e15	nagios: Adjust load alerting by hostgroup. Even the `pageable_servers` group did not page for high load -- in part because what was "high" depends on the servers. Set slightly better limits based on server role.	2022-06-22 12:07:38 -07:00
Alex Vandiver	2a14aa5180	nagios: Add a `fullstack` hostgroup. This will be used to apply checks only to czo.	2022-06-22 12:07:38 -07:00
Alex Vandiver	b5ecfc327f	nagios: Remove unnecessary `web` hostgroup. This had identical membership to `frontends`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	4be9025212	nagios: Remove redundant `postgresql` hostgroup. This is implied by `postgresql_primary`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	d9d0014fb4	nagios: Rename `zmirror_main` into `zmirror` hostgroup. `zmirror` itself was `zmirror_main` + `zmirrorp` but was unused; we consistently just use the term `zmirror` for the non-personals server, so use it as the hostgroup name.	2022-06-22 12:07:38 -07:00
Alex Vandiver	70c36985b4	nagios: Remove frontends from redis group. The Redis nagios checks themselves are done against `redis` + `frontends` groups, so there is no need to misleadingly place `frontends` in the `redis` hostgroup.	2022-06-22 12:07:38 -07:00
Alex Vandiver	08127086bc	nagios: Remove misleading "staging_frontends" from standalone. No services are tested for the `staging_frontends` hostgroup, so this does not alter the checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	d804de871d	nagios: Move staging and prod hostgroups adjacent.	2022-06-22 12:07:38 -07:00
Alex Vandiver	4c17f2bccc	nagios: The frontends hostgroup now includes prod and staging frontends. This lets the config file remove some repetition.	2022-06-22 12:07:38 -07:00
Alex Vandiver	1e81775fa0	nagios: Drop unhelpful hostgroup comment.	2022-06-22 12:07:38 -07:00
Alex Vandiver	7b584401ac	nagios: Reformat hostgroups.	2022-06-22 12:07:38 -07:00
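
The break-even figure quoted in commit `04cf68b45e` (clients must find files in their local cache at least 60% of the time) can be reproduced with simple arithmetic. The following is a minimal sketch, not code from the Zulip repository: the function name `break_even_hit_rate` is hypothetical, and it assumes the worst case in which every request that still reaches the server costs the full cold-cache time in nginx.

```python
def break_even_hit_rate(redirect_ms: float, proxy_ms: float) -> float:
    """Return the fraction of requests that browsers must answer from their
    own cache so that total server-side time with the caching proxy does not
    exceed the total time when redirecting to S3.

    With N requests, redirects cost N * redirect_ms. With the proxy, only the
    (1 - h) * N requests that miss the browser cache reach the server, each
    costing proxy_ms. Setting the two equal gives h = 1 - redirect_ms / proxy_ms.
    """
    if proxy_ms <= 0:
        raise ValueError("proxy_ms must be positive")
    return 1.0 - redirect_ms / proxy_ms


# Figures as quoted in the commit message: 263 ms per request with redirects,
# 648 ms for a cold-cache large file through the proxy (assumed worst case).
rate = break_even_hit_rate(263.0, 648.0)
print(f"{rate:.1%}")  # prints "59.4%"
```

This lands just under the "at least 60%" stated in the commit message, and well below the 70% to 80% reduction in client requests that the message reports from modeling.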


476 Commits