Commit Graph

1603 Commits

Author SHA1 Message Date
Alex Vandiver c5cace3600 puppet: Fix includes for new name of zulip_ops::prometheus::tornado.
This fixes the `include` name for the file renamed in 740a494ba4.
2023-08-09 02:32:28 +00:00
Anders Kaseorg 0b95d83f09 ruff: Fix PERF402 Use `list` or `list.copy` to create a copy of a list.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-08-07 17:23:55 -07:00
Alex Vandiver 740a494ba4 puppet: Rename and generalize Tornado process exporter.
Exporting stats about all of the various Zulip processes is useful for
tracking memory leaks, etc.
2023-08-06 13:41:10 -07:00
Anders Kaseorg 211934a9d9 nginx: Remove gzip_disable "msie6".
We don’t support IE 6.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-07-20 13:09:53 -07:00
Anders Kaseorg b285813beb error_notify: Remove custom email error reporting handler.
Restore the default django.utils.log.AdminEmailHandler when
ERROR_REPORTING is enabled.  Those with more sophisticated needs can
turn it off and use Sentry or a Sentry-compatible system.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-07-20 11:00:09 -07:00
Alex Vandiver 8743602648 puppet: Allow access to smokescreen metrics on CZO. 2023-07-19 16:20:39 -07:00
Alex Vandiver 60ce5e1955 wal-g: Use "start_time" field, not "time" which is S3 modified-at.
The `time` field is based on the file metadata in S3, which means that
touching the file contents in S3 can move backups around in the list.

Switch to using `start_time` as the sort key, which is based on the
contents of the JSON file stored as part of the backup, so is not
affected by changes in S3 metadata.
2023-07-19 14:57:51 -07:00
Alex Vandiver 5a26237b54 wal-g: Support alternate S3 storage classes. 2023-07-19 10:55:18 -07:00
Alex Vandiver 52eacd30c5 wal-g: Set WALG_S3_PREFIX, instead of WALE_S3_PREFIX.
The `WALE_` prefix was only used for backwards compatibility.  Switch
to the canonical variable name.
2023-07-19 10:55:18 -07:00
Alex Vandiver fcf096c52e puppet: Remove unused zulip notification contact. 2023-07-17 10:52:36 -07:00
Alex Vandiver 9799a03d79 puppet: Expose Smokescreen prometheus metrics on :9810. 2023-07-13 11:47:34 -07:00
Alex Vandiver 149bea8309 puppet: Configure smokescreen for 14 days of logs, via logrotate.
supervisord's log rotation is only "every x bytes" which is not a good
enough policy for tracking auditing logs.  The default is also 10 logs
of 50MB, which is very much not enough for active instances.

Switch to tracking 14 days of daily logs.
2023-07-13 11:47:34 -07:00
Alex Vandiver 0c44db5325 puppet: Update dependencies. 2023-07-13 08:08:11 -07:00
Alex Vandiver 8a77cca341 middleware: Detect reverse proxy misconfigurations.
Combine nginx and Django middlware to stop putting misleading warnings
about `CSRF_TRUSTED_ORIGINS` when the issue is untrusted proxies.
This attempts to, in the error logs, diagnose and suggest next steps
to fix common proxy misconfigurations.

See also #24599 and zulip/docker-zulip#403.
2023-07-02 16:20:21 -07:00
Alex Vandiver 671b708c4b puppet: Remove loadbalancer configurations when they are unset. 2023-07-02 16:20:21 -07:00
Alex Vandiver c8ec3dfcf6 pgroonga: Run upgrade SQL when pgroonga package is updated.
Updating the pgroonga package is not sufficient to upgrade the
extension in PostgreSQL -- an `ALTER EXTENSION pgroonga UPDATE` must
explicitly be run[^1].  Failure to do so can lead to unexpected behavior,
including crashes of PostgreSQL.

Expand on the existing `pgroonga_setup.sql.applied` file, to track
which version of the PostgreSQL extension has been configured.  If the
file exists but is empty, we run `ALTER EXTENSION pgroonga UPDATE`
regardless -- if it is a no-op, it still succeeds with a `NOTICE`:

```
zulip=# ALTER EXTENSION pgroonga UPDATE;
NOTICE:  version "3.0.8" of extension "pgroonga" is already installed
ALTER EXTENSION
```

The simple `ALTER EXTENSION` is sufficient for the
backwards-compatible case[^1] -- which, for our usage, is every
upgrade since 0.9 -> 1.0.  Since version 1.0 was released in 2015,
before pgroonga support was added to Zulip in 2016, we can assume for
the moment that all pgroonga upgrades are backwards-compatible, and
not bother regenerating indexes.

Fixes: #25989.

[^1]: https://pgroonga.github.io/upgrade/
2023-06-23 14:40:27 -07:00
Alex Vandiver dc2726c814 pgroonga: Remove now-unnecessary 'GRANT USAGE' statement.
This was only necessary for PGroonga 1.x, and the `pgroonga` schema
will most likely be removed at some point inthe future, which will
make this statement error out.

Drop the unnecessary statement.
2023-06-23 14:40:27 -07:00
Alex Vandiver 7ef05316d5 puppet: Support IPv6 nameservers.
The syntax in `/etc/resolv.conf` does not include any brackets:
```
nameserver 2001:db8::a3
```

However, the format of the nginx `resolver` directive[^1] requires that
IPv6 addresses be enclosed in brackets.

Adjust the `resolver_ip` puppet function to surround any IPv6
addresses extracted from `/etc/resolv.conf` with square brackets, and
any addresses from `application_server.resolver` to gain brackets if
necessary.

Fixes: #26013.

[^1]: http://nginx.org/en/docs/http/ngx_http_core_module.html#resolver
2023-06-23 11:32:17 -07:00
Alex Vandiver edfc911649 hooks: Tell Sentry the explicit commit range.
This is necessary if one has different deployments (with different
commit ranges) using the same projects.

See https://docs.sentry.io/product/releases/associate-commits/#using-the-cli
for the API of the `sentry-cli` tool.
2023-06-19 13:43:56 -07:00
Alex Vandiver bd217ad31b puppet: Read resolver from /etc/resolv.conf.
04cf68b45e make nginx responsible for downloading (and caching)
files from S3.  As noted in that commit, nginx implements its own
non-blocking DNS resolver, since the base syscall is blocking, so
requires an explicit nameserver configuration.  That commit used
127.0.0.53, which is provided by systemd-resolved, as the resolver.

However, that service may not always be enabled and running, and may
in fact not even be installed (e.g. on Docker).  Switch to parsing
`/etc/resolv.conf` and using the first-provided nameserver.  In many
deployments, this will still be `127.0.0.53`, but for others it will
provide a working DNS server which is external to the host.

In the event that a server is misconfigured and has no resolvers in
`/etc/resolv.conf`, it will error out:
```console
Error: Evaluation Error: Error while evaluating a Function Call, No nameservers found in /etc/resolv.conf!  Configure one by setting application_server.nameserver in /etc/zulip/zulip.conf (file: /home/zulip/deployments/current/puppet/zulip/manifests/app_frontend_base.pp, line: 76, column: 70) on node example.zulipdev.org
```
2023-06-12 20:18:28 +00:00
Alex Vandiver 575f51ed08 puppet: Remove quotes from enumerable values.
See 646a4d19a3.
2023-06-09 14:39:38 -04:00
Tim Abbott 5e7d61464d puppet: Include trusted-proto definition in zulip_ops configurations.
This should have been part of
0935d388f0.
2023-05-29 15:13:45 -07:00
Anders Kaseorg 9797de52a0 ruff: Fix RUF010 Use conversion in f-string.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-05-26 22:09:18 -07:00
Alex Vandiver 0935d388f0 nginx: Set X-Forwarded-Proto based on trust from requesting source.
Django has a `SECURE_PROXY_SSL_HEADER` setting[^1] which controls if
it examines a header, usually provided by upstream proxies, to allow
it to treat requests as "secure" even if the proximal HTTP connection
was not encrypted.  This header is usually the `X-Forwarded-Proto`
header, and the Django configuration has large warnings about ensuring
that this setting is not enabled unless `X-Forwarded-Proto` is
explicitly controlled by the proxy, and cannot be supplied by the
end-user.

In the absence of this setting, Django checks the `wsgi.url_scheme`
property of the WSGI environment[^2].

Zulip did not control the value of the `X-Forwarded-Proto` header,
because it did not set the `SECURE_PROXY_SSL_HEADER` setting (though
see below).  However, uwsgi has undocumented code which silently
overrides the `wsgi.url_scheme` property based on the
`HTTP_X_FORWARDED_PROTO` property[^3] (and hence the
`X-Forwarded-Proto` header), thus doing the same as enabling the
Django `SECURE_PROXY_SSL_HEADER` setting, but in a way that cannot be
disabled.  It also sets `wsgi.url_scheme` to `https` if the
`X-Forwarded-SSL` header is set to `on` or `1`[^4], providing an
alternate route to deceive to Django.

These combine to make Zulip always trust `X-Forwarded-Proto` or
``X-Forwarded-SSL` headers from external sources, and thus able to
trick Django into thinking a request is "secure" when it is not.
However, Zulip is not accessible via unencrypted channels, since it
redirects all `http` requests to `https` at the nginx level; this
mitigates the vulnerability.

Regardless, we harden Zulip against this vulnerability provided by the
undocumented uwsgi feature, by stripping off `X-Forwarded-SSL` headers
before they reach uwsgi, and setting `X-Forwarded-Proto` only if the
request was received directly from a trusted proxy.

Tornado, because it does not use uwsgi, is an entirely separate
codepath.  It uses the `proxy_set_header` values from
`puppet/zulip/files/nginx/zulip-include-common/proxy`, which set
`X-Forwarded-Proto` to the scheme that nginx received the request
over.  As such, `SECURE_PROXY_SSL_HEADER` was set in Tornado, and only
Tornado; since the header was always set in nginx, this was safe.
However, it was also _incorrect_ in cases where nginx did not do SSL
termination, but an upstream proxy did -- it would mark those requests
as insecure when they were actually secure.  We adjust the
`proxy_set_header X-Forwarded-Proto` used to talk to Tornado to
respect the proxy if it is trusted, or the local scheme if not.

[^1]: https://docs.djangoproject.com/en/4.2/ref/settings/#secure-proxy-ssl-header
[^2]: https://wsgi.readthedocs.io/en/latest/definitions.html#envvar-wsgi.url_scheme
[^3]: 73efb013e9/core/protocol.c (L558-L561)
[^4]: 73efb013e9/core/protocol.c (L531-L534)
2023-05-22 16:50:29 -07:00
Alex Vandiver a95b796a91 supervisor: Drop minfds back down from 1000000 to 40000.
1c76036c61 raised the number of `minfds` in Supervisor from 40k to
1M.  If Supervisor cannot guarantee that number of available file
descriptors, it will fail to start; `/etc/security/limits.conf` was
hence adjusted upwards as well.  However, on some virtualized
environments, including Proxmox LXC, setting
`/etc/security/limits.conf` may not be enough to raise the
system-level limits.  This causes `supervisord` with the larger
`minfds` to fail to start.

The limit of 1000000 was chosen to be arbitrarily high, assuming it
came without cost; it is not expected to ever be reached on any
deployment.  262b19346e already lowered one aspect of that
changeset, upon determining it did come with a cost.  Potentially
breaking virtualized deployments during upgrade is another cost of
that change.

Lower the `minfds` it back down to 40k, partially reverting
1c76036c61, but allow adjusting it upwards for extremely large
deployments.  We do not expect any except the largest deployments to
ever hit the 40k limit, and a frictionless deployment for the
vanishingly small number of huge deployments is not worth the
potential upgrade hiccups for the much more frequent smaller
deployments.
2023-05-18 13:04:33 -07:00
Alex Vandiver 8d8b5935ac puppet: Prevent unattended upgrades of erlang-base.
When upgraded, the `erlang-base` package automatically stops all
services which depend on the Erlang runtime; for Zulip, this is the
`rabbitmq-server` service.  This results in an unexpected outage of
Zulip.

Block unattended upgrades of the `erlang-base` package.
2023-05-16 14:02:06 -07:00
Anders Kaseorg 16aa7c0923 puppet: Migrate Ruby functions from legacy Puppet 3.x API.
https://www.puppet.com/docs/puppet/7/functions_refactor_legacy.html

This removes a bug in the 3.x API that was converting nil to the empty
string, so some templates need to be adjusted.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-05-12 18:17:53 -07:00
Anders Kaseorg cf8ae46291 puppet: Fix shell escaping in Ruby functions.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-05-12 18:17:53 -07:00
Anders Kaseorg 614ab533dc puppet: Reformat Ruby functions with rufo.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-05-12 18:17:53 -07:00
Alex Vandiver f4683de742 puppet: Switch the `rolling_restart` setting to use the bool values.
2c5fc1827c standardized which values are "true"; use them.
2023-05-11 15:54:15 -07:00
Alex Vandiver 530980cf31 zulip_tools: Add a get_config_bool to match Puppet logic.
Unfortunately, the existing use of this logic in `process_fts_updates`
cannot switch to using this code, as that code cannot import
zulip_tools.
2023-05-11 15:54:15 -07:00
Alex Vandiver 4d02ac6fb1 puppet: Bring back uwsgi_rolling_restart config.
a522ad1d9a mistakenly deleted this variable assignment, which made
the `zulip.conf` configuration setting not work -- uwsgi's `lazy_apps`
were not enabled, which are required for rolling restart.
2023-05-11 15:54:15 -07:00
Alex Vandiver da2c1ad839 puppet: Fix checksum of sentry-cli binary. 2023-05-11 13:39:54 -07:00
Alex Vandiver 1019a74c6f puppet: Update dependencies. 2023-05-11 10:51:37 -07:00
Alex Vandiver f11350f789 puppet: Add PostgreSQL 15 support.
Instead of copying over a mostly-unchanged `postgresql.conf`, we
transition to deploying a `conf.d/zulip.conf` which contains the
only material changes we made to the file, which were previously
appended to the end.

While shipping separate while `postgresql.conf` files for each
supported version is useful if there is large variety in supported
options between versions, there is not no such variation at current,
and the burden of overriding the entire default configuration is that
it must be keep up to date wit the package's version.
2023-05-10 14:06:02 -07:00
Alex Vandiver a9f51a0c02 static: Add Timing-Allow-Origin: * to allow sentry data timing.
This is required for the browser to provide detailed timing
information about resource fetches from other domains[^1].

[^1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Timing-Allow-Origin
2023-05-09 13:16:28 -07:00
Alex Vandiver e5ae55637e install: Remove PostgreSQL 11 support.
Django 4.2 removes this support, so Zulip has not installed with
PostgreSQL 11 since 2c20028aa4.
2023-05-05 13:35:32 -07:00
Alex Vandiver 2f4775ba68 wal-g: Write out a logfile.
Otherwise, this output goes into `/var/spool/mail/postgres`, which is
not terribly helpful.  We do not write to `/var/log/zulip` because the
backup runs as the `postgres` user, and `/var/log/zulip` is owned by
zulip and chmod 750.
2023-04-27 12:19:43 -07:00
Alex Vandiver 3aba2789d3 prometheus: Add an exporter for wal-g backup properties.
Since backups may now taken on arbitrary hosts, we need a blackbox
monitor that _some_ backup was produced.

Add a Prometheus exporter which calls `wal-g backup-list` and reports
statistics about the backups.

This could be extended to include `wal-g wal-verify`, but that
requires a connection to the PostgreSQL server.
2023-04-26 15:41:39 -07:00
Alex Vandiver b8a6de95d2 pg_backup_and_purge: Allow adjusting the backup concurrency.
SSDs are good at parallel random reads.
2023-04-26 10:54:51 -07:00
Alex Vandiver 19a11c9556 pg_backup_and_purge: Take backups on replicas, if present.
Taking backups on the database primary adds additional disk load,
which can impact the performance of the application.

Switch to taking backups on replicas, if they exist.  Some deployments
may have multiple replicas, and taking backups on all of them is
wasteful and potentially confusing; add a flag to inhibit taking
nightly snapshots on the host.

If the deployment is a single instance of PostgreSQL, with no
replicas, it takes backups as before, modulo the extra flag to allow
skipping taking them.
2023-04-26 10:54:51 -07:00
Alex Vandiver 4b35211ca1 pg_backup_and_purge: Remove unnecessary explicit types. 2023-04-26 10:54:51 -07:00
Alex Vandiver e72e83793d pg_backup_and_purge: Just use subprocess directly.
dry_run was never passed into run(); switch to using subprocess directly.
2023-04-26 10:54:51 -07:00
Alex Vandiver cace8858f9 puppet: Move logrotate config into app_frontend_base.
7c023042cf moved the logrotate configuration to being a templated
file, from a static file, but missed that the static file was still
referenced from `zulip_ops::app_frontend`; it only updated
`zulip::profile::app_frontend`.  This caused errors in applying puppet
on any `zulip_ops::app_frontend` host.

Prior to 7c023042cf, the Puppet role was identical between those two
classes; deduplicate the rule by moving the updated template
definition into `zulip::app_frontend_base` which is common to those
two classes and not used in any other classes.
2023-04-19 09:34:37 -07:00
Alex Vandiver 775c7ca4ea hooks: Give a bit better Zulip deploy message. 2023-04-19 09:32:39 -07:00
Alex Vandiver d0fc3f1c2e puppet: Add prod hooks to push zulip-cloud-current and notify CZO. 2023-04-12 11:36:33 -07:00
Alex Vandiver 7c023042cf puppet: Rotate access log files every day, not at 500M.
Since logrotate runs in a daily cron, this practically means "daily,
but only if it's larger than 500M."  For large installs with large
traffic, this is effectively daily for 10 days; for small installs, it
is an unknown amount of time.

Switch to daily logfiles, defaulting to 14 days to match nginx; this
can be overridden using a zulip.conf setting.  This makes it easier to
ensure that access logs are only kept for a bounded period of time.
2023-04-06 14:31:16 -04:00
Tim Abbott 561daee2a1 puppet: Update declared zmirror dependencies.
Following zulip/python-zulip-api/pull/758/, we're no longer using
python-zephyr, and don't need to build it from source. Additionally,
we no longer need to build a forked Zephyr package, since ZLoadSession
and ZDumpSession were merged in
e6a545e759.
2023-04-06 09:45:06 -07:00
Alex Vandiver 6975417acf puppet: Create zmirror supervisor subdirectory.
To not change the `supervisor.conf` file, which requires a restart of
supervisor (and thus all services running under it, which is extremely
disruptive) we carefully leave the contents unchanged for most
installs, and append a new piece to the file, only for the zmirror
configuration, using `concat`.
2023-04-06 09:45:06 -07:00
Alex Vandiver c519ba40fd hooks: Add a push_git_ref post-deploy hook. 2023-04-05 18:51:55 -04:00
Alex Vandiver 8a771c7ac0 hooks: Add a hook to send a Zulip before/after the deploy. 2023-04-05 18:51:55 -04:00
Alex Vandiver 377f2d6d03 hooks: Add a common/ directory and factor out common Sentry code. 2023-04-05 18:51:55 -04:00
Alex Vandiver f4d70a2e37 hooks: Resolve version strings to commit SHAs, and pass in via the env. 2023-04-05 18:51:55 -04:00
Alex Vandiver ecfb12404a hooks: Switch to passing values through the environment. 2023-04-05 18:51:55 -04:00
Alex Vandiver 160a917ad3 hooks: Add a helper to install a single static file. 2023-04-05 18:51:55 -04:00
Alex Vandiver 0c13bacb89 sentry: Switch shell variables to lower-case. 2023-04-05 18:51:55 -04:00
Alex Vandiver 7202a98438 cron: Move fetch-tor-exit-nodes to not on the hour.
We see connection timeouts and other access issues when run exactly on
the hour, either due to load on their servers from similar cron jobs,
or from operational processes of theirs.

Move to on the :17s to avoid these access issues.
2023-04-05 12:20:30 -07:00
Alex Vandiver db0ae85d97 sentry: Remove an unnecessary sudo.
790e4854dd made the hooks run as the `zulip` user, making this sudo
unnecessary.
2023-04-03 15:04:56 -07:00
Alex Vandiver 89e366771a prometheus: Add a postgres exporter. 2023-03-30 16:16:18 -07:00
Alex Vandiver c2beb64a79 prometheus: Consistently import the base class and supervisor, if needed. 2023-03-30 16:16:18 -07:00
Alex Vandiver 3feb536df3 nagios: Remove swap check.
Swap usage is not a high signal thing to alert on, and is likely to flap.
2023-03-27 15:10:50 -07:00
Alex Vandiver 262b19346e puppet: Decrease default nginx worker_connections.
Increasing worker_connections has a memory cost, unlike the rest of
the changes in 1c76036c61d8; setting it to 1 million caused nginx to
consume several GB of memory.

Reduce the default down to 10k, and allow deploys to configure it up
if necessary.  `worker_rlimit_nofile` is left at 1M, since it has no
impact on memory consumption.
2023-03-23 15:59:23 -07:00
Alex Vandiver 0c46bbdf9f puppet: Update dependencies. 2023-03-23 09:50:30 -07:00
Anders Kaseorg 3a27b12a7d dependencies: Switch to pnpm.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-03-20 15:48:29 -07:00
Alex Vandiver f2a20b56bc puppet: Enable sentry hooks for production and staging. 2023-03-17 08:10:31 -07:00
Alex Vandiver 1a65315566 puppet: Switch teleport to running under systemd, not supervisord.
There is no reason that the base node access method should be run
under supervisor, which exists primarily to give access to the `zulip`
user to restart its managed services.  This access is unnecessary for
Teleport, and also causes unwanted restarts of Teleport services when
the `supervisor` base configuration changes.  Additionally,
supervisor does not support the in-place upgrade process that Teleport
uses, as it replaces its core process with a new one.

Switch to installing a systemd configuration file (as generated by
`teleport install systemd`) for each part of Teleport, customized to
pass a `--config` path.  As such, we explicitly disable the `teleport`
service provided by the package.

The supervisor process is shut down by dint of no longer installing
the file, which purges it from the managed directory, and reloads
Supervisor to pick up the removed service.
2023-03-15 17:23:42 -04:00
Alex Vandiver 8f8a9f6f04 sentry: Add frontend event monitoring.
Zulip already has integrations for server-side Sentry integration;
however, it has historically used the Zulip-specific `blueslip`
library for monitoring browser-side errors.  However, the latter sends
errors to email, as well optionally to an internal `#errors` stream.
While this is sufficient for low volumes of users, and useful in that
it does not rely on outside services, at higher volumes it is very
difficult to do any analysis or filtering of the errors.  Client-side
errors are exceptionally noisy, with many false positives due to
browser extensions or similar, so determining real real errors from a
stream of un-grouped emails or messages in a stream is quite
difficult.

Add a client-side Javascript sentry integration.  To provide useful
backtraces, this requires extending the pre-deploy hooks to upload the
source-maps to Sentry.  Additional keys are added to the non-public
API of `page_params` to control the DSN, realm identifier, and sample
rates.
2023-03-07 10:51:45 -08:00
Alex Vandiver fc40d74cda hooks: Remove --project from sentry when not necessary. 2023-03-07 10:51:45 -08:00
Alex Vandiver 08251ac53b hooks: Fix typo in sentry error message. 2023-03-07 10:51:45 -08:00
Alex Vandiver 26eb1d7371 puppet: Also set systemd limits. 2023-03-03 16:39:47 -08:00
Alex Vandiver 1c76036c61 puppet: Increase maximum file descriptors.
The current threshold of 40k descriptors was set in 2016, chosen to be
"at least 40x our current scale."  At present, that only provides a
50% safety margin.  Increase to 1 million to provide the same 40x
buffer as previously.

The highest value currently allowed by the kernels in
production (linux 5.3.0) is 1048576.  This is set as the hard limit.

The 1 million limit is likely far above what the system can handle for
other reasons (memory, cpu, etc).  While this removes a potential
safeguard on overload due to too many connections, due to the longpoll
architecture we would generally prefer to service more connections at
lower quality (due to CPU limitations) rather than randomly reject
additional connections.

Relevant prior commits:
 - 836f313e69
 - f2f97dd335
 - ec23996538
 - 8806ec698a
 - e4fce10f46
2023-03-03 16:39:47 -08:00
Alex Vandiver a20bb54cbb puppet: Move limits.conf to maintain more of the installation structure. 2023-03-03 16:39:47 -08:00
Tim Abbott 6b37f9a290 puppet: Run delete-old-unclaimed-attachments in archive cron file.
After reflecting a bit on the last commit, I think it's substantially
easier to understand what's happening for these two tasks to be
defined in the same file, because we want the timing to be different
to avoid potential races.
2023-03-01 11:21:42 -08:00
Mateusz Mandera 35344f7f6b puppet: Add cronjob to run delete_old_unclaimed_attachments daily. 2023-03-01 11:16:39 -08:00
Alex Vandiver e7fabb45f2 puppet: Pin with sha256sum verification. 2023-02-28 00:04:39 -05:00
Alex Vandiver 0d42abe1a8 puppet: wal-g is a tarball with a single file, not a directory.
5db55c38dc switched from `ensure => present` to the more specific
`ensure => directory` on the premise that tarballs would result in
more than one file being copied out of them.  However, we only extract
a single file from the wal-g tarball, and install it at the output
path.  The new rule attempts to replace it with an empty directory
after extraction.

Switch back to `ensure => present` for the tarball codepath.
2023-02-14 18:18:36 -05:00
Alex Vandiver 6f8ce2d00a hooks: Fix shebang line to use /usr/bin/env bash. 2023-02-14 17:28:58 -05:00
Alex Vandiver 044ccdb334 chat.zulip.org: Enable Sentry hook. 2023-02-14 17:20:35 -05:00
Alex Vandiver 3109d40b21 puppet: Add a sentry release class.
This installs the Sentry CLI, and uses it to send API events to Sentry
when a release is started and completed.
2023-02-10 15:53:10 -08:00
Alex Vandiver 5db55c38dc puppet: Add a sha256_file_to. 2023-02-10 15:53:10 -08:00
Alex Vandiver af0ba0b58f puppet: sha256_tarball_to is only ever called with one from/to. 2023-02-10 15:53:10 -08:00
Alex Vandiver 840884ec89 upgrade-zulip: Provide directories to run hooks before/after upgrade.
These hooks are run immediately around the critical section of the
upgrade.  If the upgrade fails for preparatory reasons, the pre-deploy
hook may not be run; if it fails during the upgrade, the post-deploy
hook will not be run.  Hooks are called from the CWD of the new
deploy, with arguments of the old version and the new version.  If
they exit with non-0 exit code, the deploy aborts.
2023-02-10 15:53:10 -08:00
Alex Vandiver 7ab4fdf250 memcached: Allow overriding the max-item-size.
This is necessary for organizations with extremely large numbers of
members (20k+).
2023-02-09 12:04:29 -08:00
Alex Vandiver 23894fc9a3 uploads: Set Content-Type and -Disposition from Django for local files.
Similar to the previous commit, Django was responsible for setting the
Content-Disposition based on the filename, whereas the Content-Type
was set by nginx based on the filename.  This difference is not
exploitable, as even if they somehow disagreed with Django's expected
Content-Type, nginx will only ever respond with Content-Types found in
`uploads.types` -- none of which are unsafe for user-supplied content.

However, for consistency, have Django provide both Content-Type and
Content-Disposition headers.
2023-02-07 17:12:02 +00:00
Alex Vandiver 2f6c5a883e CVE-2023-22735: Provide the Content-Disposition header from S3.
The Content-Type of user-provided uploads was provided by the browser
at initial upload time, and stored in S3; however, 04cf68b45e
switched to determining the Content-Disposition merely from the
filename.  This makes uploads vulnerable to a stored XSS, wherein a
file uploaded with a content-type of `text/html` and an extension of
`.png` would be served to browsers as `Content-Disposition: inline`,
which is unsafe.

The `Content-Security-Policy` headers in the previous commit mitigate
this, but only for browsers which support them.

Revert parts of 04cf68b45e, specifically by allowing S3 to provide
the Content-Disposition header, and using the
`ResponseContentDisposition` argument when necessary to override it to
`attachment`.  Because we expect S3 responses to vary based on this
argument, we include it in the cache key; since the query parameter
has dashes in it, we can't use use the helper `$arg_` variables, and
must parse it from the query parameters manually.

Adding the disposition may decrease the cache hit rate somewhat, but
downloads are infrequent enough that it is unlikely to have a
noticeable effect.  We take care to not adjust the cache key for
requests which do not specify the disposition.
2023-02-07 17:09:52 +00:00
Alex Vandiver 36e97f8121 CVE-2023-22735: Set a Content-Security-Policy header on proxied S3 data.
This was missed in 04cf68b45ebb5c03247a0d6453e35ffc175d55da; as this
content is fundamentally untrusted, it must be served with
`Content-Security-Policy` headers in order to be safe.  These headers
were not provided previously for S3 content because it was served from
the S3 domain.

This mitigates content served from Zulip which could be a stored XSS,
but only in browsers which support Content-Security-Policy headers;
see subsequent commit for the complete solution.
2023-02-07 17:09:52 +00:00
Alex Vandiver d41a00b83b uploads: Extra-escape internal S3 paths.
In nginx, `location` blocks operate on the _decoded_ URI[^1]:

> The matching is performed against a normalized URI, after decoding
> the text encoded in the “%XX” form

This means that if a user-uploaded file contains characters that are
not URI-safe, the browser encodes them in UTF-8 and then URI-encodes
them -- and nginx decodes them and reassembles the original character
before running the `location ~ ^/...` match.  This means that the `$2`
_is not URI-encoded_ and _may contain non-ASCII characters.

When `proxy_pass` is passed a value containing one or more variables,
it does no encoding on that expanded value, assuming that the bytes
are exactly as they should be passed to the upstream.  This means that
directly calling `proxy_pass https://$1/$2` would result in sending
high-bit characters to the S3 upstream, which would rightly balk.

However, a longstanding bug in nginx's `set` directive[^2] means that
the following line:

```nginx
set $download_url https://$1/$2;
```

...results in nginx accidentally URI-encoding $1 and $2 when they are
inserted, resulting in a `$download_url` which is suitable to pass to
`proxy_pass`.  This bug is only present with numeric capture
variables, not named captures; this is particularly relevant because
numeric captures are easily overridden by additional regexes
elsewhere, as subsequent commits will add.

Fixing this is complicated; nginx does not supply any way to escape
values[^3], besides a third-party module[^4] which is an undue
complication to begin using.  The only variable which nginx exposes
which is _not_ un-escaped already is `$request_uri`, which contains
the very original URL sent by the browser -- and thus can't respect
any work done in Django to generate the `X-Accel-Redirect` (e.g., for
`/user_uploads/temporary/` URLs).  We also cannot pass these URLs to
nginx via query-parameters, since `$arg_foo` values are not
URI-decoded by nginx, there is no function to do so[^3], and the
values must be URI-encoded because they themselves are URLs with query
parameters.

Extra-URI-encode the path that we pass to the `X-Accel-Redirect`
location, for S3 redirects.  We rely on the `location` block
un-escaping that layer, leaving `$s3_hostname` and `$s3_path` as they
were intended in Django.

This works around the nginx bug, with no behaviour change.

[^1]: http://nginx.org/en/docs/http/ngx_http_core_module.html#location
[^2]: https://trac.nginx.org/nginx/ticket/348
[^3]: https://trac.nginx.org/nginx/ticket/52
[^4]: https://github.com/openresty/set-misc-nginx-module#set_escape_uri
2023-02-07 17:09:52 +00:00
Alex Vandiver a955f52904 uploads: Stop putting API headers on local-file upload responses.
These only need the usual response headers, not the
Access-Control-Origin headers that API endpoints need.
2023-02-07 17:09:52 +00:00
Anders Kaseorg df001db1a9 black: Reformat with Black 23.
Black 23 enforces some slightly more specific rules about empty line
counts and redundant parenthesis removal, but the result is still
compatible with Black 22.

(This does not actually upgrade our Python environment to Black 23
yet.)

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-02-02 10:40:13 -08:00
Alex Vandiver 68f4071873 puppet: Allow choice of timesync tool. 2023-01-31 14:20:41 -08:00
Tran Sang 3bea65b39c puppet: Set /etc/mailname based on postfix.mailname configuration.
The `postfix.mailname` setting in `/etc/zulip.conf` was previously
only used for incoming mail, to identify in Postfix configuration
which messages were "local."

Also set `/etc/mailname`, which is used by Postfix to set how it
identifies to other hosts when sending outgoing email.

Co-authored-by: Alex Vandiver <alexmv@zulip.com>
2023-01-27 15:08:22 -05:00
Alex Vandiver e8123dfeea puppet: Match the `x` bits on directories to what puppet actually does.
Puppet _always_ sets the `+x` bit on directories if they have the `r`
bit set for that slot[^1]:

> When specifying numeric permissions for directories, Puppet sets the
> search permission wherever the read permission is set.

As such, for instance, `0640` is actually applied as `0750`.

Fix what we "want" to match what puppet is applying, by adding the `x`
bit.  In none of these cases did we actually intend the directory to
not be executable.

[1] https://www.puppet.com/docs/puppet/5.5/types/file.html#file-attribute-mode
2023-01-26 15:06:01 -08:00
Alex Vandiver 372bba4a8e puppet: Stop creating a /home/zulip/logs.
This was last really used in d7a3570c7e, in 2013, when it was
`/home/humbug/logs`.

Repoint the one obscure piece of tooling that writes there, and remove
the places that created it.
2023-01-26 15:06:01 -08:00
Alex Vandiver 7f2514b316 puppet: Collapse identical blocks. 2023-01-26 15:06:01 -08:00
Alex Vandiver 09bb0e6fd0 puppet: Upgrade Grafana. 2023-01-26 10:24:24 -08:00
Alex Vandiver d0de66b273 puppet: Remove "ensure => absent" rules which have all been applied. 2023-01-24 13:05:24 -08:00
Alex Vandiver 50e9df448d puppet: Do not start the "puppet" service.
Zulip runs puppet manually, using the command-line tool; it does not
make use of the `puppet` service which, by default, attempts to
contact a host named `puppet` every two minutes to get a manifest to
apply.  These attempts can generate log spam and user confusion.

Disable and stop the `puppet` service via puppet.
2023-01-23 13:02:09 -08:00
Anders Kaseorg 7a7513f6e0 ruff: Fix SIM201 Use `… != …` instead of `not … == …`.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-23 11:18:36 -08:00
Anders Kaseorg b0e569f07c ruff: Fix SIM102 nested `if` statements.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-23 11:18:36 -08:00
Alex Vandiver 04cf68b45e uploads: Serve S3 uploads directly from nginx.
When file uploads are stored in S3, this means that Zulip serves as a
302 to S3.  Because browsers do not cache redirects, this means that
no image contents can be cached -- and upon every page load or reload,
every recently-posted image must be re-fetched.  This incurs extra
load on the Zulip server, as well as potentially excessive bandwidth
usage from S3, and on the client's connection.

Switch to fetching the content from S3 in nginx, and serving the
content from nginx.  These have `Cache-control: private, immutable`
headers set on the response, allowing browsers to cache them locally.

Because nginx fetching from S3 can be slow, and requests for uploads
will generally be bunched around when a message containing them are
first posted, we instruct nginx to cache the contents locally.  This
is safe because uploaded file contents are immutable; access control
is still mediated by Django.  The nginx cache key is the URL without
query parameters, as those parameters include a time-limited signed
authentication parameter which lets nginx fetch the non-public file.

This adds a number of nginx-level configuration parameters to control
the caching which nginx performs, including the amount of in-memory
index for he cache, the maximum storage of the cache on disk, and how
long data is retained in the cache.  The currently-chosen figures are
reasonable for small to medium deployments.

The most notable effect of this change is in allowing browsers to
cache uploaded image content; however, while there will be many fewer
requests, it also has an improvement on request latency.  The
following tests were done with a non-AWS client in SFO, a server and
S3 storage in us-east-1, and with 100 requests after 10 requests of
warm-up (to fill the nginx cache).  The mean and standard deviation
are shown.

|                   | Redirect to S3      | Caching proxy, hot  | Caching proxy, cold |
| ----------------- | ------------------- | ------------------- | ------------------- |
| Time in Django    | 263.0 ms ±  28.3 ms | 258.0 ms ±  12.3 ms | 258.0 ms ±  12.3 ms |
| Small file (842b) | 586.1 ms ±  21.1 ms | 266.1 ms ±  67.4 ms | 288.6 ms ±  17.7 ms |
| Large file (660k) | 959.6 ms ± 137.9 ms | 609.5 ms ±  13.0 ms | 648.1 ms ±  43.2 ms |

The hot-cache performance is faster for both large and small files,
since it saves the client the time having to make a second request to
a separate host.  This performance improvement remains at least 100ms
even if the client is on the same coast as the server.

Cold nginx caches are only slightly slower than hot caches, because
VPC access to S3 endpoints is extremely fast (assuming it is in the
same region as the host), and nginx can pool connections to S3 and
reuse them.

However, all of the 648ms taken to serve a cold-cache large file is
occupied in nginx, as opposed to the only 263ms which was spent in
nginx when using redirects to S3.  This means that to overall spend
less time responding to uploaded-file requests in nginx, clients will
need to find files in their local cache, and skip making an
uploaded-file request, at least 60% of the time.  Modeling shows a
reduction in the number of client requests by about 70% - 80%.

The `Content-Disposition` header logic can now also be entirely shared
with the local-file codepath, as can the `url_only` path used by
mobile clients.  While we could provide the direct-to-S3 temporary
signed URL to mobile clients, we choose to provide the
served-from-Zulip signed URL, to better control caching headers on it,
and greater consistency.  In doing so, we adjust the salt used for the
URL; since these URLs are only valid for 60s, the effect of this salt
change is minimal.
2023-01-09 18:23:58 -05:00