zulip

Commit Graph

Author	SHA1	Message	Date
Alex Vandiver	9b1bdfefcd	nagios: Use a better index on UserActivity for zephyr alerting. Limiting only by client_name and query leads to a very poorly-indexed lookup on `query` which throws out nearly all of its rows: ``` Nested Loop (cost=50885.64..60522.96 rows=821 width=8) -> Index Scan using zerver_client_name_key on zerver_client (cost=0.28..2.49 rows=1 width=4) Index Cond: ((name)::text = 'zephyr_mirror'::text) -> Bitmap Heap Scan on zerver_useractivity (cost=50885.37..60429.95 rows=9052 width=12) Recheck Cond: ((client_id = zerver_client.id) AND ((query)::text = ANY ('{get_events,/api/v1/events}'::text[]))) -> BitmapAnd (cost=50885.37..50885.37 rows=9052 width=0) -> Bitmap Index Scan on zerver_useractivity_2bfe9d72 (cost=0.00..16631.82 rows=..large.. width=0) Index Cond: (client_id = zerver_client.id) -> Bitmap Index Scan on zerver_useractivity_1b1cc7f0 (cost=0.00..34103.95 rows=..large.. width=0) Index Cond: ((query)::text = ANY ('{get_events,/api/v1/events}'::text[])) ``` A partial index on the client and query list is extremely effective here in reducing PostgreSQL's workload; however, we cannot easily write it as a migration, since it depends on the value of the ID of the `zephyr_mirror` client. Since this is only relevant for Zulip Cloud, we manually create the index: ```sql CREATE INDEX CONCURRENTLY zerver_useractivity_zehpyr_liveness ON zerver_useractivity(last_visit) WHERE client_id = 1005 AND query IN ('get_events', '/api/v1/events'); ``` We rewrite the query to do the time limit, distinct, and count in SQL, instead of Python, and make use of this index. This turns a 20-second query into two 10ms queries.	2023-11-30 16:01:55 -08:00
Alex Vandiver	c47ee4a296	zulip_ops: Configure stats to be pushed to status.zulip.com.	2023-11-16 16:21:12 -05:00
Alex Vandiver	5e49804004	puppet_ops: Include Akamai log parser on prometheus server.	2023-11-13 14:35:39 -05:00
Alex Vandiver	5591d6f65c	zulip_ops: Add configuration for Vector Akamai stats. Akamai writes access logs to S3; we use an SQS events queue, combined with Vector, to transform those into Prometheus statistics.	2023-11-13 09:53:20 -08:00
Anders Kaseorg	835ee69c80	docs: Fix grammar errors found by mwic. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-10-09 13:24:09 -07:00
Alex Vandiver	528d0ebcf0	puppet: Serve /etc/zulip/well-known/ in nginx as /.well-known/.	2023-10-04 15:56:42 -07:00
Alex Vandiver	5308fbdeac	puppet: Add postgresql-client dependencies to monitoring. The `unless` step errors out if /usr/bin/psql does not exist at first evaluation time -- protect that with a `test -f` check, and protect the actual `createuser` with a dependency on `postgresql-client`. To work around `Zulip::Safepackage` not actually being safe to instantiate more than once, we move the instantiation of `Package[postgresql-client]` into a class which can be safely included one or more times.	2023-09-22 11:45:00 -07:00
Alex Vandiver	f95c8b894a	nagios: Remove load monitoring. Load monitoring alerts are extremely noisy, and do not reliably indicate an issue which is affecting users.	2023-09-14 09:29:29 -07:00
Alex Vandiver	ccbd834a86	postgres_exporter: Rebase the per-index stats branch. The branch from the PR is somewhat stale, and is missing important bugfixes.	2023-09-11 17:59:54 -07:00
Alex Vandiver	0c88cfca63	postgres_exporter: Build from source for per-index stats. This builds prometheus-community/postgres_exporter#843 to track per-index statistics.	2023-09-11 11:59:39 -07:00
Alex Vandiver	fdd811bec1	postgres_exporter: Explicitly specify the zulip database. Some of the collectors (e.g. `pg_stat_user_tables`) don't appear to work with `--auto-discover-databases`, which is deprecated since version 0.13.0[^1]. Explicitly set the database name. [^1]: https://github.com/prometheus-community/postgres_exporter/releases/tag/v0.13.0	2023-09-06 09:20:57 -07:00
Alex Vandiver	f8636e7d2b	iptables: Stop logging on dropped packets. We never examine these logs, and it fills dmesg. We have flow logging at the AWS stack layer.	2023-08-30 15:29:01 -07:00
Alex Vandiver	c5cace3600	puppet: Fix includes for new name of zulip_ops::prometheus::tornado. This fixes the `include` name for the file renamed in `740a494ba4`.	2023-08-09 02:32:28 +00:00
Alex Vandiver	740a494ba4	puppet: Rename and generalize Tornado process exporter. Exporting stats about all of the various Zulip processes is useful for tracking memory leaks, etc.	2023-08-06 13:41:10 -07:00
Anders Kaseorg	b285813beb	error_notify: Remove custom email error reporting handler. Restore the default django.utils.log.AdminEmailHandler when ERROR_REPORTING is enabled. Those with more sophisticated needs can turn it off and use Sentry or a Sentry-compatible system. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-07-20 11:00:09 -07:00
Alex Vandiver	8743602648	puppet: Allow access to smokescreen metrics on CZO.	2023-07-19 16:20:39 -07:00
Alex Vandiver	fcf096c52e	puppet: Remove unused zulip notification contact.	2023-07-17 10:52:36 -07:00
Alex Vandiver	9799a03d79	puppet: Expose Smokescreen prometheus metrics on :9810.	2023-07-13 11:47:34 -07:00
Tim Abbott	5e7d61464d	puppet: Include trusted-proto definition in zulip_ops configurations. This should have been part of `0935d388f0`.	2023-05-29 15:13:45 -07:00
Alex Vandiver	8d8b5935ac	puppet: Prevent unattended upgrades of erlang-base. When upgraded, the `erlang-base` package automatically stops all services which depend on the Erlang runtime; for Zulip, this is the `rabbitmq-server` service. This results in an unexpected outage of Zulip. Block unattended upgrades of the `erlang-base` package.	2023-05-16 14:02:06 -07:00
Alex Vandiver	3aba2789d3	prometheus: Add an exporter for wal-g backup properties. Since backups may now be taken on arbitrary hosts, we need a blackbox monitor that _some_ backup was produced. Add a Prometheus exporter which calls `wal-g backup-list` and reports statistics about the backups. This could be extended to include `wal-g wal-verify`, but that requires a connection to the PostgreSQL server.	2023-04-26 15:41:39 -07:00
Alex Vandiver	cace8858f9	puppet: Move logrotate config into app_frontend_base. `7c023042cf` moved the logrotate configuration to being a templated file, from a static file, but missed that the static file was still referenced from `zulip_ops::app_frontend`; it only updated `zulip::profile::app_frontend`. This caused errors in applying puppet on any `zulip_ops::app_frontend` host. Prior to `7c023042cf`, the Puppet role was identical between those two classes; deduplicate the rule by moving the updated template definition into `zulip::app_frontend_base` which is common to those two classes and not used in any other classes.	2023-04-19 09:34:37 -07:00
Alex Vandiver	d0fc3f1c2e	puppet: Add prod hooks to push zulip-cloud-current and notify CZO.	2023-04-12 11:36:33 -07:00
Tim Abbott	561daee2a1	puppet: Update declared zmirror dependencies. Following zulip/python-zulip-api/pull/758/, we're no longer using python-zephyr, and don't need to build it from source. Additionally, we no longer need to build a forked Zephyr package, since ZLoadSession and ZDumpSession were merged in `e6a545e759`.	2023-04-06 09:45:06 -07:00
Alex Vandiver	6975417acf	puppet: Create zmirror supervisor subdirectory. To not change the `supervisor.conf` file, which requires a restart of supervisor (and thus all services running under it, which is extremely disruptive) we carefully leave the contents unchanged for most installs, and append a new piece to the file, only for the zmirror configuration, using `concat`.	2023-04-06 09:45:06 -07:00
Alex Vandiver	8a771c7ac0	hooks: Add a hook to send a Zulip before/after the deploy.	2023-04-05 18:51:55 -04:00
Alex Vandiver	89e366771a	prometheus: Add a postgres exporter.	2023-03-30 16:16:18 -07:00
Alex Vandiver	c2beb64a79	prometheus: Consistently import the base class and supervisor, if needed.	2023-03-30 16:16:18 -07:00
Alex Vandiver	3feb536df3	nagios: Remove swap check. Swap usage is not a high signal thing to alert on, and is likely to flap.	2023-03-27 15:10:50 -07:00
Alex Vandiver	f2a20b56bc	puppet: Enable sentry hooks for production and staging.	2023-03-17 08:10:31 -07:00
Alex Vandiver	1a65315566	puppet: Switch teleport to running under systemd, not supervisord. There is no reason that the base node access method should be run under supervisor, which exists primarily to give access to the `zulip` user to restart its managed services. This access is unnecessary for Teleport, and also causes unwanted restarts of Teleport services when the `supervisor` base configuration changes. Additionally, supervisor does not support the in-place upgrade process that Teleport uses, as it replaces its core process with a new one. Switch to installing a systemd configuration file (as generated by `teleport install systemd`) for each part of Teleport, customized to pass a `--config` path. As such, we explicitly disable the `teleport` service provided by the package. The supervisor process is shut down by dint of no longer installing the file, which purges it from the managed directory, and reloads Supervisor to pick up the removed service.	2023-03-15 17:23:42 -04:00
Alex Vandiver	044ccdb334	chat.zulip.org: Enable Sentry hook.	2023-02-14 17:20:35 -05:00
Alex Vandiver	e8123dfeea	puppet: Match the `x` bits on directories to what puppet actually does. Puppet _always_ sets the `+x` bit on directories if they have the `r` bit set for that slot[^1]: > When specifying numeric permissions for directories, Puppet sets the > search permission wherever the read permission is set. As such, for instance, `0640` is actually applied as `0750`. Fix what we "want" to match what puppet is applying, by adding the `x` bit. In none of these cases did we actually intend the directory to not be executable. [^1]: https://www.puppet.com/docs/puppet/5.5/types/file.html#file-attribute-mode	2023-01-26 15:06:01 -08:00
Alex Vandiver	372bba4a8e	puppet: Stop creating a /home/zulip/logs. This was last really used in `d7a3570c7e`, in 2013, when it was `/home/humbug/logs`. Repoint the one obscure piece of tooling that writes there, and remove the places that created it.	2023-01-26 15:06:01 -08:00
Alex Vandiver	d0de66b273	puppet: Remove "ensure => absent" rules which have all been applied.	2023-01-24 13:05:24 -08:00
Alex Vandiver	04cf68b45e	uploads: Serve S3 uploads directly from nginx. When file uploads are stored in S3, this means that Zulip serves as a 302 to S3. Because browsers do not cache redirects, this means that no image contents can be cached -- and upon every page load or reload, every recently-posted image must be re-fetched. This incurs extra load on the Zulip server, as well as potentially excessive bandwidth usage from S3, and on the client's connection. Switch to fetching the content from S3 in nginx, and serving the content from nginx. These have `Cache-control: private, immutable` headers set on the response, allowing browsers to cache them locally. Because nginx fetching from S3 can be slow, and requests for uploads will generally be bunched around when a message containing them is first posted, we instruct nginx to cache the contents locally. This is safe because uploaded file contents are immutable; access control is still mediated by Django. The nginx cache key is the URL without query parameters, as those parameters include a time-limited signed authentication parameter which lets nginx fetch the non-public file. This adds a number of nginx-level configuration parameters to control the caching which nginx performs, including the amount of in-memory index for the cache, the maximum storage of the cache on disk, and how long data is retained in the cache. The currently-chosen figures are reasonable for small to medium deployments. The most notable effect of this change is in allowing browsers to cache uploaded image content; however, while there will be many fewer requests, it also has an improvement on request latency. The following tests were done with a non-AWS client in SFO, a server and S3 storage in us-east-1, and with 100 requests after 10 requests of warm-up (to fill the nginx cache). The mean and standard deviation are shown.
|                   | Redirect to S3      | Caching proxy, hot  | Caching proxy, cold |
| ----------------- | ------------------- | ------------------- | ------------------- |
| Time in Django    | 263.0 ms ± 28.3 ms  | 258.0 ms ± 12.3 ms  | 258.0 ms ± 12.3 ms  |
| Small file (842b) | 586.1 ms ± 21.1 ms  | 266.1 ms ± 67.4 ms  | 288.6 ms ± 17.7 ms  |
| Large file (660k) | 959.6 ms ± 137.9 ms | 609.5 ms ± 13.0 ms  | 648.1 ms ± 43.2 ms  |
The hot-cache performance is faster for both large and small files, since it saves the client the time of having to make a second request to a separate host. This performance improvement remains at least 100ms even if the client is on the same coast as the server. Cold nginx caches are only slightly slower than hot caches, because VPC access to S3 endpoints is extremely fast (assuming it is in the same region as the host), and nginx can pool connections to S3 and reuse them. However, all of the 648ms taken to serve a cold-cache large file is occupied in nginx, as opposed to the only 263ms which was spent in nginx when using redirects to S3. This means that to overall spend less time responding to uploaded-file requests in nginx, clients will need to find files in their local cache, and skip making an uploaded-file request, at least 60% of the time. Modeling shows a reduction in the number of client requests by about 70% - 80%. The `Content-Disposition` header logic can now also be entirely shared with the local-file codepath, as can the `url_only` path used by mobile clients. While we could provide the direct-to-S3 temporary signed URL to mobile clients, we choose to provide the served-from-Zulip signed URL, to better control caching headers on it, and for greater consistency. In doing so, we adjust the salt used for the URL; since these URLs are only valid for 60s, the effect of this salt change is minimal.	2023-01-09 18:23:58 -05:00
Anders Kaseorg	f3f5dfb5aa	ruff: Fix RUF004 exit() is only available in the interpreter. ‘exit’ is pulled in for the interactive interpreter as a side effect of the site module; this can be disabled with python -S and shouldn’t be relied on. Also, use the NoReturn type where appropriate. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-12-04 22:11:24 -08:00
Alex Vandiver	521ec5885b	puppet: Rename autossh tunnel, as it is no longer for just munin.	2022-11-01 22:24:40 -07:00
Alex Vandiver	42f84a8cc7	puppet: Use existing autossh tunnels as OpenSSH "master" sockets. A number of autossh connections are already left open for port-forwarding Munin ports; autossh starts the connections and ensures that they are automatically restarted if they are severed. However, this represents a missed opportunity. Nagios's monitoring uses a large number of SSH connections to the remote hosts to run commands on them; each of these connections requires doing a complete SSH handshake and authentication, which can have non-trivial network latency, particularly for hosts which may be located far away, in a network topology sense (up to 1s for a no-op command!). Use OpenSSH's ability to multiplex multiple connections over a single socket, to reuse the already-established connection. We leave an explicit `ControlMaster no` in the general configuration, and not `auto`, as we do not wish any of the short-lived Nagios connections to get promoted to being a control socket if the autossh is not running for some reason. We enable protocol-level keepalives, to give a better chance of the socket being kept open.	2022-11-01 22:24:40 -07:00
Alex Vandiver	e05a0dcf98	puppet: Support FQDNs in puppet zulip.conf names.	2022-11-01 22:24:40 -07:00
Alex Vandiver	951dc68f3a	autossh: Drop unnecessary -2 option. The -2 option is a no-op.	2022-11-01 22:24:40 -07:00
Anders Kaseorg	7666ff603d	sharding: Configure Tornado sharding with nginx map. https://nginx.org/en/docs/http/ngx_http_map_module.html Since Puppet doesn’t manage the contents of nginx_sharding.conf after its initial creation, it needs to be renamed so we can give it different default contents. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-09-15 16:07:50 -07:00
Anders Kaseorg	0da0ee3c92	puppet: Remove nginx configuration for zulip.org. This is unused since commit `1806e0f45e` (#19625). Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-09-01 10:03:18 -07:00
Alex Vandiver	a9183d2208	grafana: Enable auto-sign-up. This avoids the need to explicitly create new users in Grafana, by simply trusting Teleport.	2022-07-19 17:52:17 -07:00
Alex Vandiver	9bd88a93e2	puppet: Tell needrestart to not default to restarting core services. The `needrestart` tool added in 22.04 is useful in terms of listing which services may need to be restarted to pick up updated libraries. However, it prompts about the current state of services needing restart for every subsequent `apt-get upgrade`, and defaulting core services to restarting requires carefully manually excluding them every time, at risk of causing an unscheduled outage. Build a list of default-off services based on the list in unattended-upgrades.	2022-07-19 17:51:18 -07:00
Alex Vandiver	7ae3708c02	teleport: Add explicit WebAuthn config, not just U2F. WebAuthn is the default, replacing U2F, in Teleport 10 and above[1]. While Teleport can derive a WebAuthn configuration from a U2F configuration[2], it's useful to be explicit. [1]: https://goteleport.com/docs/access-controls/guides/webauthn/ [2]: https://goteleport.com/docs/access-controls/guides/webauthn/#u2f	2022-07-18 11:41:00 -07:00
Alex Vandiver	120de1dca9	zephyr: Write out unix timestamp in check, as check_cron_file expects. A follow-up fix to `8bc26aab08`.	2022-06-30 11:12:26 -07:00
Alex Vandiver	8577adcf2e	cron: Remove unused STATE_FILE environment variable.	2022-06-22 12:07:38 -07:00
Alex Vandiver	8bc26aab08	nagios: Switch check_user_zephyr_mirror_liveness to run via cron. This check loads Django, and as such must be run as the zulip user. Repeat the same pattern used elsewhere in nagios, of writing a state file, which is read by `check_cron_file`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	41deef40cf	nagios: Switch to generic check_cron_file for queues and consumers. These share a common root; `91da4bd59b` duplicated the code, but didn't move the existing uses to the new utility.	2022-06-22 12:07:38 -07:00
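
Note on commit `04cf68b45e`: the "at least 60% of the time" break-even figure in that message can be checked with a little arithmetic. The sketch below is not from the commit itself; the function name and the simple cost model (every request that reaches the server costs a fixed time, and a browser-cache hit costs the server nothing) are assumptions made for illustration. Only the 263 ms and 648.1 ms figures come from the commit message.

```python
def break_even_hit_rate(redirect_ms: float, proxy_ms: float) -> float:
    """Return the browser-cache hit rate at which proxying costs the server
    the same total time as redirecting.

    Assumed model (hypothetical, for illustration only):
      - with redirects, every request costs `redirect_ms` on the server;
      - with the caching proxy, a fraction `h` of requests never reach the
        server, and the rest cost `proxy_ms` each.
    Break-even is where (1 - h) * proxy_ms == redirect_ms.
    """
    if proxy_ms <= 0:
        raise ValueError("proxy_ms must be positive")
    return max(0.0, 1.0 - redirect_ms / proxy_ms)


# Figures quoted in the commit message: 263 ms per redirect response,
# 648.1 ms for a cold-cache large file served through nginx.
rate = break_even_hit_rate(263.0, 648.1)
print(f"break-even browser-cache hit rate: {rate:.1%}")
```

Under this model the break-even rate is a little under 60%, which is consistent with the commit's rounded figure, and the 70% - 80% reduction in client requests it reports from modeling would sit above that threshold.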


494 Commits