
1591 Commits

Author SHA1 Message Date
Tim Abbott c4dfeb9c37 puppet: Increase minimum memory for multiprocess queue workers.
This should give some more room for systems that are still below 4GB
of RAM to use the lower-memory multithreaded mode, which is less
likely to have OOM kills (a very bad experience).

There should be little cost, as few systems are likely allocated with
memory in this range.

(cherry picked from commit a22f418827)
2024-02-16 12:28:16 -08:00
Tim Abbott 8ea5e2156a puppet: Update rules for number of uwsgi processes.
The defaults for how many uwsgi processes to run no longer depend on
the queue processor mode, but instead the total memory on the system.

(cherry picked from commit 62dbe2298e)
2024-02-16 12:28:16 -08:00
Alex Vandiver 495312b86a logrotate: smokescreen has its own config file.
149bea8309 added a separate config file
for smokescreen (which is necessary because it can be installed
separately) but failed to notice that `zulip.template.erb` already had
a config line for it.  This leads to failures starting the logrotate
service:

```
logrotate[4158688]: error: zulip:1 duplicate log entry for /var/log/zulip/smokescreen.log
logrotate[4158688]: error: found error in file zulip, skipping
```

Remove the duplicate line.

(cherry picked from commit 725affcb5a)
2024-01-15 12:02:53 -08:00
Alex Vandiver e311b372cb install: Support PostgreSQL 16.
(cherry picked from commit 1ba2f39854)
2024-01-05 10:32:54 -05:00
Anders Kaseorg 26811c5049 models: Extract zerver.models.clients.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-01-05 10:32:54 -05:00
Anders Kaseorg 086df4a81e models: Extract zerver.models.realms.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-01-05 10:32:54 -05:00
Anders Kaseorg ee85ac5433 models: Extract zerver.models.users.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-01-05 10:32:54 -05:00
Alex Vandiver 4989221b9e nginx: Limit the methods that we proxy to Tornado.
While the Tornado server supports POST requests, those are only used
by internal endpoints.  We only support OPTIONS, GET, and DELETE
methods from clients, so filter everything else out at the nginx
level.

We set the `Accepts` header on both `OPTIONS` requests and 405 responses,
and the CORS headers on `OPTIONS` requests.
2023-12-08 09:23:30 -08:00
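The method filtering described in 4989221b9e can be pictured roughly as follows. This is only a Python sketch of the decision, not the nginx configuration from the commit; the helper name `filter_tornado_method` and the convention of returning a bare status code are assumptions made for illustration.

```python
from typing import Optional

# Methods the commit message says clients may use against Tornado.
CLIENT_METHODS = frozenset({"OPTIONS", "GET", "DELETE"})


def filter_tornado_method(method: str) -> Optional[int]:
    """Return None to proxy the request, or an HTTP status to reject it.

    Hypothetical helper: the real filtering lives in the nginx config.
    """
    if method in CLIENT_METHODS:
        return None
    # 405 Method Not Allowed, e.g. for POST, which is internal-only.
    return 405
```

Rejecting at the proxy layer means a disallowed request never reaches Tornado at all.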
Alex Vandiver ca57d360e6 puppet: Update dependencies. 2023-12-07 18:45:10 -08:00
Anders Kaseorg 3853fa875a python: Consistently use from…import for urllib.parse.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-05 13:03:07 -08:00
Anders Kaseorg 8a7916f21a python: Consistently use from…import for datetime.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-05 12:01:18 -08:00
Alex Vandiver 9b1bdfefcd nagios: Use a better index on UserActivity for zephyr alerting.
Limiting only by client_name and query leads to a very poorly-indexed
lookup on `query` which throws out nearly all of its rows:

```
Nested Loop  (cost=50885.64..60522.96 rows=821 width=8)
  ->  Index Scan using zerver_client_name_key on zerver_client  (cost=0.28..2.49 rows=1 width=4)
        Index Cond: ((name)::text = 'zephyr_mirror'::text)
  ->  Bitmap Heap Scan on zerver_useractivity  (cost=50885.37..60429.95 rows=9052 width=12)
        Recheck Cond: ((client_id = zerver_client.id) AND ((query)::text = ANY ('{get_events,/api/v1/events}'::text[])))
        ->  BitmapAnd  (cost=50885.37..50885.37 rows=9052 width=0)
              ->  Bitmap Index Scan on zerver_useractivity_2bfe9d72  (cost=0.00..16631.82 rows=..large.. width=0)
                    Index Cond: (client_id = zerver_client.id)
              ->  Bitmap Index Scan on zerver_useractivity_1b1cc7f0  (cost=0.00..34103.95 rows=..large.. width=0)
                    Index Cond: ((query)::text = ANY ('{get_events,/api/v1/events}'::text[]))
```

A partial index on the client and query list is extremely effective
here in reducing PostgreSQL's workload; however, we cannot easily
write it as a migration, since it depends on the value of the ID of
the `zephyr_mirror` client.

Since this is only relevant for Zulip Cloud, we manually create the
index:

```sql
CREATE INDEX CONCURRENTLY zerver_useractivity_zehpyr_liveness
    ON zerver_useractivity(last_visit)
 WHERE client_id = 1005
   AND query IN ('get_events', '/api/v1/events');
```

We rewrite the query to do the time limit, distinct, and count in SQL,
instead of Python, and make use of this index.  This turns a 20-second
query into two 10ms queries.
2023-11-30 16:01:55 -08:00
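As a rough illustration of the rewrite in 9b1bdfefcd, the sketch below pushes the time limit, the distinct, and the count into a single SQL statement. It uses an in-memory SQLite table as a stand-in; the table name `useractivity`, the column `user_profile_id`, the sample rows, and the 30-minute window are all assumptions, and the actual Nagios check is not shown in the commit message.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# In-memory stand-in for zerver_useractivity; the columns are a guess at the
# minimum needed, not the real Zulip schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE useractivity ("
    " user_profile_id INTEGER, client_id INTEGER, query TEXT, last_visit TEXT)"
)

now = datetime(2023, 11, 30, 12, 0, tzinfo=timezone.utc)
rows = [
    (1, 1005, "get_events", now - timedelta(minutes=5)),
    (1, 1005, "/api/v1/events", now - timedelta(minutes=2)),  # same user twice
    (2, 1005, "get_events", now - timedelta(days=2)),  # outside a short window
    (3, 7, "get_events", now - timedelta(minutes=1)),  # different client
]
conn.executemany(
    "INSERT INTO useractivity VALUES (?, ?, ?, ?)",
    [(u, c, q, t.isoformat()) for u, c, q, t in rows],
)


def count_recent_users(cutoff: datetime) -> int:
    """Apply the time limit, DISTINCT, and COUNT in one SQL statement."""
    (count,) = conn.execute(
        "SELECT COUNT(DISTINCT user_profile_id) FROM useractivity"
        " WHERE client_id = 1005"
        " AND query IN ('get_events', '/api/v1/events')"
        " AND last_visit > ?",
        (cutoff.isoformat(),),
    ).fetchone()
    return count


recent = count_recent_users(now - timedelta(minutes=30))
```

The `WHERE` clause repeats the predicate of the partial index so that, in PostgreSQL, the planner can match the query to it; SQLite here only demonstrates the shape of the query.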
Alex Vandiver c4b619af15 puppet: Change /etc/rabbitmq to be owned by rabbitmq.
The Ubuntu and Debian package installation scripts for
`rabbitmq-server` install `/etc/rabbitmq` (and its contents) owned by
the `rabbitmq` user -- not `root` as Puppet does.  This means that
Puppet and `rabbitmq-server` unnecessarily fight over the ownership.

Create the `rabbitmq` user and group, to the same specifications that
the Debian package install scripts do, so that we can properly declare
the ownership of `/etc/rabbitmq`.
2023-11-29 21:45:35 -08:00
Alex Vandiver c47ee4a296 zulip_ops: Configure stats to be pushed to status.zulip.com. 2023-11-16 16:21:12 -05:00
Alex Vandiver 5e49804004 puppet_ops: Include Akamai log parser on prometheus server. 2023-11-13 14:35:39 -05:00
Alex Vandiver 5591d6f65c zulip_ops: Add configuration for Vector Akamai stats.
Akamai writes access logs to S3; we use an SQS events queue, combined
with Vector, to transform those into Prometheus statistics.
2023-11-13 09:53:20 -08:00
Tim Abbott b59e90d100 puppet: Fix buggy media-src Content-Security-Policy.
The colon is invalid syntax. Verified the updated policy using an
online CSP checker.
2023-11-06 14:45:05 -05:00
Alex Vandiver 803b7b4b93 puppet: Fix SHA256sum of sentry-cli binary. 2023-10-31 10:24:49 -07:00
Alex Vandiver 37b261ef0f puppet: Update dependencies. 2023-10-30 16:10:25 -07:00
Aman Agrawal f3ab45a152 uploads-internal: Mark `self` as a valid source of loading media.
Without this, the browser refuses to play the video. To reproduce, press
`open` on an uploaded video on CZO. Chrome gives the following error
in the console:

Refused to load media from '<source>' because it violates the
following Content Security Policy directive: "default-src 'none'".
Note that 'media-src' was not explicitly set, so 'default-src' is
used as a fallback.
2023-10-12 09:57:21 -07:00
Anders Kaseorg 835ee69c80 docs: Fix grammar errors found by mwic.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-10-09 13:24:09 -07:00
Anders Kaseorg 4cb2eded68 typos: Fix typos caught by typos.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-10-09 11:55:16 -07:00
Alex Vandiver 528d0ebcf0 puppet: Serve /etc/zulip/well-known/ in nginx as /.well-known/. 2023-10-04 15:56:42 -07:00
Aman Agrawal 8ef52d55d3 markdown: Add support for inline video thumbnails. 2023-10-02 22:39:02 -07:00
Alex Vandiver 5308fbdeac puppet: Add postgresql-client dependencies to monitoring.
The `unless` step errors out if /usr/bin/psql does not exist at
first evaluation time -- protect that with a `test -f` check, and
protect the actual `createuser` with a dependency on `postgresql-client`.
To work around `Zulip::Safepackage` not actually being safe to
instantiate more than once, we move the instantiation of
`Package[postgresql-client]` into a class which can be safely
included one or more times.
2023-09-22 11:45:00 -07:00
Alex Vandiver 5ee4b642ad views: Add a /health healthcheck endpoint.
This endpoint verifies that the services that Zulip needs to function
are running, and Django can talk to them.  It is designed to be used
as a readiness probe[^1] for Zulip, either by Kubernetes, or some other
reverse-proxy load-balancer in front of Zulip.  Because of this, it
limits access to only localhost and the IP addresses of configured
reverse proxies.

Tests are limited because we cannot stop running services (which would
impact other concurrent tests) and there would be extremely limited
utility in mocking the very specific methods we're calling so that they
raise the exceptions that we're looking for.

[^1]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
2023-09-20 09:53:59 -07:00
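The access restriction described in 5ee4b642ad could be sketched as below. This is an assumption-laden illustration, not the view's actual code: the function name `may_access_health` and the plain list of proxy addresses are hypothetical.

```python
from ipaddress import ip_address
from typing import Iterable


def may_access_health(remote_addr: str, reverse_proxy_ips: Iterable[str]) -> bool:
    """Allow loopback addresses and configured reverse proxies only."""
    addr = ip_address(remote_addr)
    if addr.is_loopback:
        return True
    # Compare parsed addresses so that equivalent spellings still match.
    return any(addr == ip_address(proxy) for proxy in reverse_proxy_ips)
```

For example, `may_access_health("127.0.0.1", [])` is allowed, while a request from an arbitrary external address is refused unless that address is listed as a reverse proxy.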
Alex Vandiver f778316b5a uwsgi: Ensure that the master process cannot load the application.
The rolling restart configuration of uwsgi attempted to re-chdir the
CWD to the new `/home/zulip/deployments/current` before `lazy-apps`
loaded the application in the forked child.  It successfully did so --
however, the "main" process was still running in the original
`/home/zulip/deployments/current`, which somehow (?) tainted the
search path of the child processes.

Set the parent uwsgi process to start in `/`, so that the old deploy
directory cannot taint the load order of later child processes.
2023-09-18 13:13:34 -07:00
Alex Vandiver a6d5d7740e uwsgi: Always enable lazy-apps.
Enabling `lazy-apps` defers loading of the uwsgi application until
after the fork, instead of happening prior to forking workers[^1].  The
nominal reason to not enable this is that it increases the memory
footprint of the server (since no memory is shared across processes),
and may slow down worker initialization, since each worker has to load
the files from disk.

However, Django defers loading the majority of the code until the
first request[^2].  As such, our current non-`lazy-apps` gains nothing
over `lazy-apps`.  For consistency, switch to using `lazy-apps` for
all deployments, rolling restart or no.

[^1]: https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#preforking-vs-lazy-apps-vs-lazy
[^2]: https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#preforking-vs-lazy-apps-vs-lazy
2023-09-18 13:13:34 -07:00
Alex Vandiver f95c8b894a nagios: Remove load monitoring.
Load monitoring alerts are extremely noisy, and do not reliably
indicate an issue which is affecting users.
2023-09-14 09:29:29 -07:00
Anders Kaseorg 2665a3ce2b python: Elide unnecessary list wrappers.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-09-13 12:41:23 -07:00
Alex Vandiver 135acfea93 nginx: Suppress proxy warnings when the proxy itself sent the request.
This is common in cases where the reverse proxy itself is making
health-check requests to the Zulip server; these requests have no
X-Forwarded-* headers, so would normally hit the error case of
"request through the proxy, but no X-Forwarded-Proto header".

Add an additional special-case for when the request's originating IP
address is resolved to be the reverse proxy itself; in these cases,
HTTP requests with no X-Forwarded-Proto are acceptable.
2023-09-12 10:10:58 -07:00
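The special case added in 135acfea93 amounts to one extra condition on the existing check. The Python sketch below is only a model of that logic under stated assumptions: the real check is implemented elsewhere, and the function and parameter names here are hypothetical.

```python
from typing import Iterable, Optional


def should_warn_missing_proto(
    resolved_client_ip: str,
    x_forwarded_proto: Optional[str],
    reverse_proxy_ips: Iterable[str],
) -> bool:
    """Decide whether a proxied request lacking X-Forwarded-Proto is an error.

    Assumes the caller already knows the request arrived via a configured proxy.
    """
    if x_forwarded_proto is not None:
        return False
    # Health checks sent by the proxy itself carry no X-Forwarded-* headers,
    # so the resolved client address is the proxy's own address.
    return resolved_client_ip not in set(reverse_proxy_ips)
```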
Alex Vandiver ccbd834a86 postgres_exporter: Rebase the per-index stats branch.
The branch from the PR is somewhat stale, and is missing important bugfixes.
2023-09-11 17:59:54 -07:00
Alex Vandiver 0c88cfca63 postgres_exporter: Build from source for per-index stats.
This builds prometheus-community/postgres_exporter#843 to track
per-index statistics.
2023-09-11 11:59:39 -07:00
Alex Vandiver fdd811bec1 postgres_exporter: Explicitly specify the zulip database.
Some of the collectors (e.g. `pg_stat_user_tables`) don't appear to
work with `--auto-discover-databases`, which is deprecated since
version 0.13.0[^1].

Explicitly set the database name.

[^1]: https://github.com/prometheus-community/postgres_exporter/releases/tag/v0.13.0
2023-09-06 09:20:57 -07:00
Alex Vandiver 5d3ce8b2d4 puppet: Update dependencies. 2023-09-06 09:20:06 -07:00
Alex Vandiver f8636e7d2b iptables: Stop logging on dropped packets.
We never examine these logs, and they fill dmesg.  We have flow logging at the AWS stack layer.
2023-08-30 15:29:01 -07:00
Alex Vandiver e8c8544028 nginx: Do not forward X-amz-cf-id header to S3.
All `X-amz-*` headers must be included in the signed request to S3;
since Django did not take those headers into account (it constructed a
request from scratch, while nginx's request inherits them from the
end-user's request), the proxied request fails to be signed correctly.

Strip off the `X-amz-cf-id` header added by CloudFront.  While we
would ideally strip off all `X-amz-*` headers, this requires a
third-party module[^1].

[^1]: https://github.com/openresty/headers-more-nginx-module#more_clear_input_headers
2023-08-28 12:30:14 -07:00
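To make the fix in e8c8544028 concrete, the sketch below drops the one CloudFront header before a request is forwarded. It is a Python model of the idea, not the nginx directive used in the commit; `strip_cloudfront_id` and the sample header values are made up for illustration.

```python
from typing import Dict


def strip_cloudfront_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of headers without X-amz-cf-id (matched case-insensitively)."""
    return {k: v for k, v in headers.items() if k.lower() != "x-amz-cf-id"}


incoming = {
    "Host": "example.invalid",
    "X-Amz-Cf-Id": "made-up-value",  # added by CloudFront, not part of the signature
    "Range": "bytes=0-1023",
}
forwarded = strip_cloudfront_id(incoming)
```

Only the named header is removed, which mirrors the commit's note that stripping every `X-amz-*` header would need a third-party module.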
Anders Kaseorg c43629a222 ruff: Fix PLW1510 `subprocess.run` without explicit `check` argument.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-08-17 17:05:34 -07:00
Alex Vandiver c5cace3600 puppet: Fix includes for new name of zulip_ops::prometheus::tornado.
This fixes the `include` name for the file renamed in 740a494ba4.
2023-08-09 02:32:28 +00:00
Anders Kaseorg 0b95d83f09 ruff: Fix PERF402 Use `list` or `list.copy` to create a copy of a list.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-08-07 17:23:55 -07:00
Alex Vandiver 740a494ba4 puppet: Rename and generalize Tornado process exporter.
Exporting stats about all of the various Zulip processes is useful for
tracking memory leaks, etc.
2023-08-06 13:41:10 -07:00
Anders Kaseorg 211934a9d9 nginx: Remove gzip_disable "msie6".
We don’t support IE 6.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-07-20 13:09:53 -07:00
Anders Kaseorg b285813beb error_notify: Remove custom email error reporting handler.
Restore the default django.utils.log.AdminEmailHandler when
ERROR_REPORTING is enabled.  Those with more sophisticated needs can
turn it off and use Sentry or a Sentry-compatible system.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-07-20 11:00:09 -07:00
Alex Vandiver 8743602648 puppet: Allow access to smokescreen metrics on CZO. 2023-07-19 16:20:39 -07:00
Alex Vandiver 60ce5e1955 wal-g: Use "start_time" field, not "time" which is S3 modified-at.
The `time` field is based on the file metadata in S3, which means that
touching the file contents in S3 can move backups around in the list.

Switch to using `start_time` as the sort key, which is based on the
contents of the JSON file stored as part of the backup, so is not
affected by changes in S3 metadata.
2023-07-19 14:57:51 -07:00
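The effect of the sort-key change in 60ce5e1955 can be seen with a small example. The entries below are invented sample data; only the `time` and `start_time` field names come from the commit message, and `backup_name` and `latest_backup` are hypothetical names used for the sketch.

```python
from datetime import datetime

# Made-up entries shaped like a backup listing. base_A is the older backup,
# but its file in S3 was touched later, which bumped its `time` value.
backups = [
    {
        "backup_name": "base_A",
        "time": "2023-07-18T09:00:00+00:00",
        "start_time": "2023-07-10T02:00:00+00:00",
    },
    {
        "backup_name": "base_B",
        "time": "2023-07-12T02:05:00+00:00",
        "start_time": "2023-07-12T02:00:00+00:00",
    },
]


def latest_backup(entries, key):
    """Return the name of the entry with the greatest timestamp under `key`."""
    newest = max(entries, key=lambda entry: datetime.fromisoformat(entry[key]))
    return newest["backup_name"]


by_time = latest_backup(backups, "time")
by_start_time = latest_backup(backups, "start_time")
```

Sorting on `time` picks `base_A` because of the later S3 modification, while sorting on `start_time` picks `base_B`, the backup that was actually taken most recently.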
Alex Vandiver 5a26237b54 wal-g: Support alternate S3 storage classes. 2023-07-19 10:55:18 -07:00
Alex Vandiver 52eacd30c5 wal-g: Set WALG_S3_PREFIX, instead of WALE_S3_PREFIX.
The `WALE_` prefix was only used for backwards compatibility.  Switch
to the canonical variable name.
2023-07-19 10:55:18 -07:00
Alex Vandiver fcf096c52e puppet: Remove unused zulip notification contact. 2023-07-17 10:52:36 -07:00
Alex Vandiver 9799a03d79 puppet: Expose Smokescreen prometheus metrics on :9810. 2023-07-13 11:47:34 -07:00
Alex Vandiver 149bea8309 puppet: Configure smokescreen for 14 days of logs, via logrotate.
supervisord's log rotation is only "every x bytes" which is not a good
enough policy for tracking auditing logs.  The default is also 10 logs
of 50MB, which is very much not enough for active instances.

Switch to tracking 14 days of daily logs.
2023-07-13 11:47:34 -07:00