Commit Graph

1516 Commits

Author SHA1 Message Date
Alex Vandiver 2f4775ba68 wal-g: Write out a logfile.
Otherwise, this output goes into `/var/spool/mail/postgres`, which is
not terribly helpful.  We do not write to `/var/log/zulip` because the
backup runs as the `postgres` user, and `/var/log/zulip` is owned by
zulip and chmod 750.
2023-04-27 12:19:43 -07:00
Alex Vandiver 3aba2789d3 prometheus: Add an exporter for wal-g backup properties.
Since backups may now taken on arbitrary hosts, we need a blackbox
monitor that _some_ backup was produced.

Add a Prometheus exporter which calls `wal-g backup-list` and reports
statistics about the backups.

This could be extended to include `wal-g wal-verify`, but that
requires a connection to the PostgreSQL server.
2023-04-26 15:41:39 -07:00
Alex Vandiver b8a6de95d2 pg_backup_and_purge: Allow adjusting the backup concurrency.
SSDs are good at parallel random reads.
2023-04-26 10:54:51 -07:00
Alex Vandiver 19a11c9556 pg_backup_and_purge: Take backups on replicas, if present.
Taking backups on the database primary adds additional disk load,
which can impact the performance of the application.

Switch to taking backups on replicas, if they exist.  Some deployments
may have multiple replicas, and taking backups on all of them is
wasteful and potentially confusing; add a flag to inhibit taking
nightly snapshots on the host.

If the deployment is a single instance of PostgreSQL, with no
replicas, it takes backups as before, modulo the extra flag to allow
skipping taking them.
2023-04-26 10:54:51 -07:00
Alex Vandiver 4b35211ca1 pg_backup_and_purge: Remove unnecessary explicit types. 2023-04-26 10:54:51 -07:00
Alex Vandiver e72e83793d pg_backup_and_purge: Just use subprocess directly.
dry_run was never passed into run(); switch to using subprocess directly.
2023-04-26 10:54:51 -07:00
Alex Vandiver cace8858f9 puppet: Move logrotate config into app_frontend_base.
7c023042cf moved the logrotate configuration to being a templated
file, from a static file, but missed that the static file was still
referenced from `zulip_ops::app_frontend`; it only updated
`zulip::profile::app_frontend`.  This caused errors in applying puppet
on any `zulip_ops::app_frontend` host.

Prior to 7c023042cf, the Puppet role was identical between those two
classes; deduplicate the rule by moving the updated template
definition into `zulip::app_frontend_base` which is common to those
two classes and not used in any other classes.
2023-04-19 09:34:37 -07:00
Alex Vandiver 775c7ca4ea hooks: Give a bit better Zulip deploy message. 2023-04-19 09:32:39 -07:00
Alex Vandiver d0fc3f1c2e puppet: Add prod hooks to push zulip-cloud-current and notify CZO. 2023-04-12 11:36:33 -07:00
Alex Vandiver 7c023042cf puppet: Rotate access log files every day, not at 500M.
Since logrotate runs in a daily cron, this practically means "daily,
but only if it's larger than 500M."  For large installs with large
traffic, this is effectively daily for 10 days; for small installs, it
is an unknown amount of time.

Switch to daily logfiles, defaulting to 14 days to match nginx; this
can be overridden using a zulip.conf setting.  This makes it easier to
ensure that access logs are only kept for a bounded period of time.
2023-04-06 14:31:16 -04:00
Tim Abbott 561daee2a1 puppet: Update declared zmirror dependencies.
Following zulip/python-zulip-api/pull/758/, we're no longer using
python-zephyr, and don't need to build it from source. Additionally,
we no longer need to build a forked Zephyr package, since ZLoadSession
and ZDumpSession were merged in
e6a545e759.
2023-04-06 09:45:06 -07:00
Alex Vandiver 6975417acf puppet: Create zmirror supervisor subdirectory.
To not change the `supervisor.conf` file, which requires a restart of
supervisor (and thus all services running under it, which is extremely
disruptive) we carefully leave the contents unchanged for most
installs, and append a new piece to the file, only for the zmirror
configuration, using `concat`.
2023-04-06 09:45:06 -07:00
Alex Vandiver c519ba40fd hooks: Add a push_git_ref post-deploy hook. 2023-04-05 18:51:55 -04:00
Alex Vandiver 8a771c7ac0 hooks: Add a hook to send a Zulip before/after the deploy. 2023-04-05 18:51:55 -04:00
Alex Vandiver 377f2d6d03 hooks: Add a common/ directory and factor out common Sentry code. 2023-04-05 18:51:55 -04:00
Alex Vandiver f4d70a2e37 hooks: Resolve version strings to commit SHAs, and pass in via the env. 2023-04-05 18:51:55 -04:00
Alex Vandiver ecfb12404a hooks: Switch to passing values through the environment. 2023-04-05 18:51:55 -04:00
Alex Vandiver 160a917ad3 hooks: Add a helper to install a single static file. 2023-04-05 18:51:55 -04:00
Alex Vandiver 0c13bacb89 sentry: Switch shell variables to lower-case. 2023-04-05 18:51:55 -04:00
Alex Vandiver 7202a98438 cron: Move fetch-tor-exit-nodes to not on the hour.
We see connection timeouts and other access issues when run exactly on
the hour, either due to load on their servers from similar cron jobs,
or from operational processes of theirs.

Move to on the :17s to avoid these access issues.
2023-04-05 12:20:30 -07:00
Alex Vandiver db0ae85d97 sentry: Remove an unnecessary sudo.
790e4854dd made the hooks run as the `zulip` user, making this sudo
unnecessary.
2023-04-03 15:04:56 -07:00
Alex Vandiver 89e366771a prometheus: Add a postgres exporter. 2023-03-30 16:16:18 -07:00
Alex Vandiver c2beb64a79 prometheus: Consistently import the base class and supervisor, if needed. 2023-03-30 16:16:18 -07:00
Alex Vandiver 3feb536df3 nagios: Remove swap check.
Swap usage is not a high signal thing to alert on, and is likely to flap.
2023-03-27 15:10:50 -07:00
Alex Vandiver 262b19346e puppet: Decrease default nginx worker_connections.
Increasing worker_connections has a memory cost, unlike the rest of
the changes in 1c76036c61d8; setting it to 1 million caused nginx to
consume several GB of memory.

Reduce the default down to 10k, and allow deploys to configure it up
if necessary.  `worker_rlimit_nofile` is left at 1M, since it has no
impact on memory consumption.
2023-03-23 15:59:23 -07:00
Alex Vandiver 0c46bbdf9f puppet: Update dependencies. 2023-03-23 09:50:30 -07:00
Anders Kaseorg 3a27b12a7d dependencies: Switch to pnpm.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-03-20 15:48:29 -07:00
Alex Vandiver f2a20b56bc puppet: Enable sentry hooks for production and staging. 2023-03-17 08:10:31 -07:00
Alex Vandiver 1a65315566 puppet: Switch teleport to running under systemd, not supervisord.
There is no reason that the base node access method should be run
under supervisor, which exists primarily to give access to the `zulip`
user to restart its managed services.  This access is unnecessary for
Teleport, and also causes unwanted restarts of Teleport services when
the `supervisor` base configuration changes.  Additionally,
supervisor does not support the in-place upgrade process that Teleport
uses, as it replaces its core process with a new one.

Switch to installing a systemd configuration file (as generated by
`teleport install systemd`) for each part of Teleport, customized to
pass a `--config` path.  As such, we explicitly disable the `teleport`
service provided by the package.

The supervisor process is shut down by dint of no longer installing
the file, which purges it from the managed directory, and reloads
Supervisor to pick up the removed service.
2023-03-15 17:23:42 -04:00
Alex Vandiver 8f8a9f6f04 sentry: Add frontend event monitoring.
Zulip already has integrations for server-side Sentry integration;
however, it has historically used the Zulip-specific `blueslip`
library for monitoring browser-side errors.  However, the latter sends
errors to email, as well optionally to an internal `#errors` stream.
While this is sufficient for low volumes of users, and useful in that
it does not rely on outside services, at higher volumes it is very
difficult to do any analysis or filtering of the errors.  Client-side
errors are exceptionally noisy, with many false positives due to
browser extensions or similar, so determining real real errors from a
stream of un-grouped emails or messages in a stream is quite
difficult.

Add a client-side Javascript sentry integration.  To provide useful
backtraces, this requires extending the pre-deploy hooks to upload the
source-maps to Sentry.  Additional keys are added to the non-public
API of `page_params` to control the DSN, realm identifier, and sample
rates.
2023-03-07 10:51:45 -08:00
Alex Vandiver fc40d74cda hooks: Remove --project from sentry when not necessary. 2023-03-07 10:51:45 -08:00
Alex Vandiver 08251ac53b hooks: Fix typo in sentry error message. 2023-03-07 10:51:45 -08:00
Alex Vandiver 26eb1d7371 puppet: Also set systemd limits. 2023-03-03 16:39:47 -08:00
Alex Vandiver 1c76036c61 puppet: Increase maximum file descriptors.
The current threshold of 40k descriptors was set in 2016, chosen to be
"at least 40x our current scale."  At present, that only provides a
50% safety margin.  Increase to 1 million to provide the same 40x
buffer as previously.

The highest value currently allowed by the kernels in
production (linux 5.3.0) is 1048576.  This is set as the hard limit.

The 1 million limit is likely far above what the system can handle for
other reasons (memory, cpu, etc).  While this removes a potential
safeguard on overload due to too many connections, due to the longpoll
architecture we would generally prefer to service more connections at
lower quality (due to CPU limitations) rather than randomly reject
additional connections.

Relevant prior commits:
 - 836f313e69
 - f2f97dd335
 - ec23996538
 - 8806ec698a
 - e4fce10f46
2023-03-03 16:39:47 -08:00
Alex Vandiver a20bb54cbb puppet: Move limits.conf to maintain more of the installation structure. 2023-03-03 16:39:47 -08:00
Tim Abbott 6b37f9a290 puppet: Run delete-old-unclaimed-attachments in archive cron file.
After reflecting a bit on the last commit, I think it's substantially
easier to understand what's happening for these two tasks to be
defined in the same file, because we want the timing to be different
to avoid potential races.
2023-03-01 11:21:42 -08:00
Mateusz Mandera 35344f7f6b puppet: Add cronjob to run delete_old_unclaimed_attachments daily. 2023-03-01 11:16:39 -08:00
Alex Vandiver e7fabb45f2 puppet: Pin with sha256sum verification. 2023-02-28 00:04:39 -05:00
Alex Vandiver 0d42abe1a8 puppet: wal-g is a tarball with a single file, not a directory.
5db55c38dc switched from `ensure => present` to the more specific
`ensure => directory` on the premise that tarballs would result in
more than one file being copied out of them.  However, we only extract
a single file from the wal-g tarball, and install it at the output
path.  The new rule attempts to replace it with an empty directory
after extraction.

Switch back to `ensure => present` for the tarball codepath.
2023-02-14 18:18:36 -05:00
Alex Vandiver 6f8ce2d00a hooks: Fix shebang line to use /usr/bin/env bash. 2023-02-14 17:28:58 -05:00
Alex Vandiver 044ccdb334 chat.zulip.org: Enable Sentry hook. 2023-02-14 17:20:35 -05:00
Alex Vandiver 3109d40b21 puppet: Add a sentry release class.
This installs the Sentry CLI, and uses it to send API events to Sentry
when a release is started and completed.
2023-02-10 15:53:10 -08:00
Alex Vandiver 5db55c38dc puppet: Add a sha256_file_to. 2023-02-10 15:53:10 -08:00
Alex Vandiver af0ba0b58f puppet: sha256_tarball_to is only ever called with one from/to. 2023-02-10 15:53:10 -08:00
Alex Vandiver 840884ec89 upgrade-zulip: Provide directories to run hooks before/after upgrade.
These hooks are run immediately around the critical section of the
upgrade.  If the upgrade fails for preparatory reasons, the pre-deploy
hook may not be run; if it fails during the upgrade, the post-deploy
hook will not be run.  Hooks are called from the CWD of the new
deploy, with arguments of the old version and the new version.  If
they exit with non-0 exit code, the deploy aborts.
2023-02-10 15:53:10 -08:00
Alex Vandiver 7ab4fdf250 memcached: Allow overriding the max-item-size.
This is necessary for organizations with extremely large numbers of
members (20k+).
2023-02-09 12:04:29 -08:00
Alex Vandiver 23894fc9a3 uploads: Set Content-Type and -Disposition from Django for local files.
Similar to the previous commit, Django was responsible for setting the
Content-Disposition based on the filename, whereas the Content-Type
was set by nginx based on the filename.  This difference is not
exploitable, as even if they somehow disagreed with Django's expected
Content-Type, nginx will only ever respond with Content-Types found in
`uploads.types` -- none of which are unsafe for user-supplied content.

However, for consistency, have Django provide both Content-Type and
Content-Disposition headers.
2023-02-07 17:12:02 +00:00
Alex Vandiver 2f6c5a883e CVE-2023-22735: Provide the Content-Disposition header from S3.
The Content-Type of user-provided uploads was provided by the browser
at initial upload time, and stored in S3; however, 04cf68b45e
switched to determining the Content-Disposition merely from the
filename.  This makes uploads vulnerable to a stored XSS, wherein a
file uploaded with a content-type of `text/html` and an extension of
`.png` would be served to browsers as `Content-Disposition: inline`,
which is unsafe.

The `Content-Security-Policy` headers in the previous commit mitigate
this, but only for browsers which support them.

Revert parts of 04cf68b45e, specifically by allowing S3 to provide
the Content-Disposition header, and using the
`ResponseContentDisposition` argument when necessary to override it to
`attachment`.  Because we expect S3 responses to vary based on this
argument, we include it in the cache key; since the query parameter
has dashes in it, we can't use use the helper `$arg_` variables, and
must parse it from the query parameters manually.

Adding the disposition may decrease the cache hit rate somewhat, but
downloads are infrequent enough that it is unlikely to have a
noticeable effect.  We take care to not adjust the cache key for
requests which do not specify the disposition.
2023-02-07 17:09:52 +00:00
Alex Vandiver 36e97f8121 CVE-2023-22735: Set a Content-Security-Policy header on proxied S3 data.
This was missed in 04cf68b45ebb5c03247a0d6453e35ffc175d55da; as this
content is fundamentally untrusted, it must be served with
`Content-Security-Policy` headers in order to be safe.  These headers
were not provided previously for S3 content because it was served from
the S3 domain.

This mitigates content served from Zulip which could be a stored XSS,
but only in browsers which support Content-Security-Policy headers;
see subsequent commit for the complete solution.
2023-02-07 17:09:52 +00:00
Alex Vandiver d41a00b83b uploads: Extra-escape internal S3 paths.
In nginx, `location` blocks operate on the _decoded_ URI[^1]:

> The matching is performed against a normalized URI, after decoding
> the text encoded in the “%XX” form

This means that if a user-uploaded file contains characters that are
not URI-safe, the browser encodes them in UTF-8 and then URI-encodes
them -- and nginx decodes them and reassembles the original character
before running the `location ~ ^/...` match.  This means that the `$2`
_is not URI-encoded_ and _may contain non-ASCII characters.

When `proxy_pass` is passed a value containing one or more variables,
it does no encoding on that expanded value, assuming that the bytes
are exactly as they should be passed to the upstream.  This means that
directly calling `proxy_pass https://$1/$2` would result in sending
high-bit characters to the S3 upstream, which would rightly balk.

However, a longstanding bug in nginx's `set` directive[^2] means that
the following line:

```nginx
set $download_url https://$1/$2;
```

...results in nginx accidentally URI-encoding $1 and $2 when they are
inserted, resulting in a `$download_url` which is suitable to pass to
`proxy_pass`.  This bug is only present with numeric capture
variables, not named captures; this is particularly relevant because
numeric captures are easily overridden by additional regexes
elsewhere, as subsequent commits will add.

Fixing this is complicated; nginx does not supply any way to escape
values[^3], besides a third-party module[^4] which is an undue
complication to begin using.  The only variable which nginx exposes
which is _not_ un-escaped already is `$request_uri`, which contains
the very original URL sent by the browser -- and thus can't respect
any work done in Django to generate the `X-Accel-Redirect` (e.g., for
`/user_uploads/temporary/` URLs).  We also cannot pass these URLs to
nginx via query-parameters, since `$arg_foo` values are not
URI-decoded by nginx, there is no function to do so[^3], and the
values must be URI-encoded because they themselves are URLs with query
parameters.

Extra-URI-encode the path that we pass to the `X-Accel-Redirect`
location, for S3 redirects.  We rely on the `location` block
un-escaping that layer, leaving `$s3_hostname` and `$s3_path` as they
were intended in Django.

This works around the nginx bug, with no behaviour change.

[^1]: http://nginx.org/en/docs/http/ngx_http_core_module.html#location
[^2]: https://trac.nginx.org/nginx/ticket/348
[^3]: https://trac.nginx.org/nginx/ticket/52
[^4]: https://github.com/openresty/set-misc-nginx-module#set_escape_uri
2023-02-07 17:09:52 +00:00