zulip

Commit Graph

Author	SHA1	Message	Date
Alex Vandiver	3aba2789d3	prometheus: Add an exporter for wal-g backup properties. Since backups may now be taken on arbitrary hosts, we need a blackbox monitor that _some_ backup was produced. Add a Prometheus exporter which calls `wal-g backup-list` and reports statistics about the backups. This could be extended to include `wal-g wal-verify`, but that requires a connection to the PostgreSQL server.	2023-04-26 15:41:39 -07:00
Alex Vandiver	b8a6de95d2	pg_backup_and_purge: Allow adjusting the backup concurrency. SSDs are good at parallel random reads.	2023-04-26 10:54:51 -07:00
Alex Vandiver	19a11c9556	pg_backup_and_purge: Take backups on replicas, if present. Taking backups on the database primary adds additional disk load, which can impact the performance of the application. Switch to taking backups on replicas, if they exist. Some deployments may have multiple replicas, and taking backups on all of them is wasteful and potentially confusing; add a flag to inhibit taking nightly snapshots on the host. If the deployment is a single instance of PostgreSQL, with no replicas, it takes backups as before, modulo the extra flag to allow skipping taking them.	2023-04-26 10:54:51 -07:00
Alex Vandiver	4b35211ca1	pg_backup_and_purge: Remove unnecessary explicit types.	2023-04-26 10:54:51 -07:00
Alex Vandiver	e72e83793d	pg_backup_and_purge: Just use subprocess directly. dry_run was never passed into run(); switch to using subprocess directly.	2023-04-26 10:54:51 -07:00
Alex Vandiver	cace8858f9	puppet: Move logrotate config into app_frontend_base. `7c023042cf` moved the logrotate configuration to being a templated file, from a static file, but missed that the static file was still referenced from `zulip_ops::app_frontend`; it only updated `zulip::profile::app_frontend`. This caused errors in applying puppet on any `zulip_ops::app_frontend` host. Prior to `7c023042cf`, the Puppet role was identical between those two classes; deduplicate the rule by moving the updated template definition into `zulip::app_frontend_base` which is common to those two classes and not used in any other classes.	2023-04-19 09:34:37 -07:00
Alex Vandiver	775c7ca4ea	hooks: Give a bit better Zulip deploy message.	2023-04-19 09:32:39 -07:00
Alex Vandiver	d0fc3f1c2e	puppet: Add prod hooks to push zulip-cloud-current and notify CZO.	2023-04-12 11:36:33 -07:00
Alex Vandiver	7c023042cf	puppet: Rotate access log files every day, not at 500M. Since logrotate runs in a daily cron, this practically means "daily, but only if it's larger than 500M." For large installs with large traffic, this is effectively daily for 10 days; for small installs, it is an unknown amount of time. Switch to daily logfiles, defaulting to 14 days to match nginx; this can be overridden using a zulip.conf setting. This makes it easier to ensure that access logs are only kept for a bounded period of time.	2023-04-06 14:31:16 -04:00
Tim Abbott	561daee2a1	puppet: Update declared zmirror dependencies. Following zulip/python-zulip-api/pull/758/, we're no longer using python-zephyr, and don't need to build it from source. Additionally, we no longer need to build a forked Zephyr package, since ZLoadSession and ZDumpSession were merged in `e6a545e759`.	2023-04-06 09:45:06 -07:00
Alex Vandiver	6975417acf	puppet: Create zmirror supervisor subdirectory. To not change the `supervisor.conf` file, which requires a restart of supervisor (and thus all services running under it, which is extremely disruptive) we carefully leave the contents unchanged for most installs, and append a new piece to the file, only for the zmirror configuration, using `concat`.	2023-04-06 09:45:06 -07:00
Alex Vandiver	c519ba40fd	hooks: Add a push_git_ref post-deploy hook.	2023-04-05 18:51:55 -04:00
Alex Vandiver	8a771c7ac0	hooks: Add a hook to send a Zulip before/after the deploy.	2023-04-05 18:51:55 -04:00
Alex Vandiver	377f2d6d03	hooks: Add a common/ directory and factor out common Sentry code.	2023-04-05 18:51:55 -04:00
Alex Vandiver	f4d70a2e37	hooks: Resolve version strings to commit SHAs, and pass in via the env.	2023-04-05 18:51:55 -04:00
Alex Vandiver	ecfb12404a	hooks: Switch to passing values through the environment.	2023-04-05 18:51:55 -04:00
Alex Vandiver	160a917ad3	hooks: Add a helper to install a single static file.	2023-04-05 18:51:55 -04:00
Alex Vandiver	0c13bacb89	sentry: Switch shell variables to lower-case.	2023-04-05 18:51:55 -04:00
Alex Vandiver	7202a98438	cron: Move fetch-tor-exit-nodes to not on the hour. We see connection timeouts and other access issues when run exactly on the hour, either due to load on their servers from similar cron jobs, or from operational processes of theirs. Move to on the :17s to avoid these access issues.	2023-04-05 12:20:30 -07:00
Alex Vandiver	db0ae85d97	sentry: Remove an unnecessary sudo. `790e4854dd` made the hooks run as the `zulip` user, making this sudo unnecessary.	2023-04-03 15:04:56 -07:00
Alex Vandiver	89e366771a	prometheus: Add a postgres exporter.	2023-03-30 16:16:18 -07:00
Alex Vandiver	c2beb64a79	prometheus: Consistently import the base class and supervisor, if needed.	2023-03-30 16:16:18 -07:00
Alex Vandiver	3feb536df3	nagios: Remove swap check. Swap usage is not a high signal thing to alert on, and is likely to flap.	2023-03-27 15:10:50 -07:00
Alex Vandiver	262b19346e	puppet: Decrease default nginx worker_connections. Increasing worker_connections has a memory cost, unlike the rest of the changes in 1c76036c61d8; setting it to 1 million caused nginx to consume several GB of memory. Reduce the default down to 10k, and allow deploys to configure it up if necessary. `worker_rlimit_nofile` is left at 1M, since it has no impact on memory consumption.	2023-03-23 15:59:23 -07:00
Alex Vandiver	0c46bbdf9f	puppet: Update dependencies.	2023-03-23 09:50:30 -07:00
Anders Kaseorg	3a27b12a7d	dependencies: Switch to pnpm. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-03-20 15:48:29 -07:00
Alex Vandiver	f2a20b56bc	puppet: Enable sentry hooks for production and staging.	2023-03-17 08:10:31 -07:00
Alex Vandiver	1a65315566	puppet: Switch teleport to running under systemd, not supervisord. There is no reason that the base node access method should be run under supervisor, which exists primarily to give access to the `zulip` user to restart its managed services. This access is unnecessary for Teleport, and also causes unwanted restarts of Teleport services when the `supervisor` base configuration changes. Additionally, supervisor does not support the in-place upgrade process that Teleport uses, as it replaces its core process with a new one. Switch to installing a systemd configuration file (as generated by `teleport install systemd`) for each part of Teleport, customized to pass a `--config` path. As such, we explicitly disable the `teleport` service provided by the package. The supervisor process is shut down by dint of no longer installing the file, which purges it from the managed directory, and reloads Supervisor to pick up the removed service.	2023-03-15 17:23:42 -04:00
Alex Vandiver	8f8a9f6f04	sentry: Add frontend event monitoring. Zulip already has integrations for server-side Sentry integration; however, it has historically used the Zulip-specific `blueslip` library for monitoring browser-side errors. However, the latter sends errors to email, as well as optionally to an internal `#errors` stream. While this is sufficient for low volumes of users, and useful in that it does not rely on outside services, at higher volumes it is very difficult to do any analysis or filtering of the errors. Client-side errors are exceptionally noisy, with many false positives due to browser extensions or similar, so determining real errors from a stream of un-grouped emails or messages in a stream is quite difficult. Add a client-side JavaScript Sentry integration. To provide useful backtraces, this requires extending the pre-deploy hooks to upload the source-maps to Sentry. Additional keys are added to the non-public API of `page_params` to control the DSN, realm identifier, and sample rates.	2023-03-07 10:51:45 -08:00
Alex Vandiver	fc40d74cda	hooks: Remove --project from sentry when not necessary.	2023-03-07 10:51:45 -08:00
Alex Vandiver	08251ac53b	hooks: Fix typo in sentry error message.	2023-03-07 10:51:45 -08:00
Alex Vandiver	26eb1d7371	puppet: Also set systemd limits.	2023-03-03 16:39:47 -08:00
Alex Vandiver	1c76036c61	puppet: Increase maximum file descriptors. The current threshold of 40k descriptors was set in 2016, chosen to be "at least 40x our current scale." At present, that only provides a 50% safety margin. Increase to 1 million to provide the same 40x buffer as previously. The highest value currently allowed by the kernels in production (linux 5.3.0) is 1048576. This is set as the hard limit. The 1 million limit is likely far above what the system can handle for other reasons (memory, cpu, etc). While this removes a potential safeguard on overload due to too many connections, due to the longpoll architecture we would generally prefer to service more connections at lower quality (due to CPU limitations) rather than randomly reject additional connections. Relevant prior commits: - `836f313e69` - `f2f97dd335` - `ec23996538` - `8806ec698a` - `e4fce10f46`	2023-03-03 16:39:47 -08:00
Alex Vandiver	a20bb54cbb	puppet: Move limits.conf to maintain more of the installation structure.	2023-03-03 16:39:47 -08:00
Tim Abbott	6b37f9a290	puppet: Run delete-old-unclaimed-attachments in archive cron file. After reflecting a bit on the last commit, I think it's substantially easier to understand what's happening for these two tasks to be defined in the same file, because we want the timing to be different to avoid potential races.	2023-03-01 11:21:42 -08:00
Mateusz Mandera	35344f7f6b	puppet: Add cronjob to run delete_old_unclaimed_attachments daily.	2023-03-01 11:16:39 -08:00
Alex Vandiver	e7fabb45f2	puppet: Pin with sha256sum verification.	2023-02-28 00:04:39 -05:00
Alex Vandiver	0d42abe1a8	puppet: wal-g is a tarball with a single file, not a directory. `5db55c38dc` switched from `ensure => present` to the more specific `ensure => directory` on the premise that tarballs would result in more than one file being copied out of them. However, we only extract a single file from the wal-g tarball, and install it at the output path. The new rule attempts to replace it with an empty directory after extraction. Switch back to `ensure => present` for the tarball codepath.	2023-02-14 18:18:36 -05:00
Alex Vandiver	6f8ce2d00a	hooks: Fix shebang line to use /usr/bin/env bash.	2023-02-14 17:28:58 -05:00
Alex Vandiver	044ccdb334	chat.zulip.org: Enable Sentry hook.	2023-02-14 17:20:35 -05:00
Alex Vandiver	3109d40b21	puppet: Add a sentry release class. This installs the Sentry CLI, and uses it to send API events to Sentry when a release is started and completed.	2023-02-10 15:53:10 -08:00
Alex Vandiver	5db55c38dc	puppet: Add a sha256_file_to.	2023-02-10 15:53:10 -08:00
Alex Vandiver	af0ba0b58f	puppet: sha256_tarball_to is only ever called with one from/to.	2023-02-10 15:53:10 -08:00
Alex Vandiver	840884ec89	upgrade-zulip: Provide directories to run hooks before/after upgrade. These hooks are run immediately around the critical section of the upgrade. If the upgrade fails for preparatory reasons, the pre-deploy hook may not be run; if it fails during the upgrade, the post-deploy hook will not be run. Hooks are called from the CWD of the new deploy, with arguments of the old version and the new version. If they exit with non-0 exit code, the deploy aborts.	2023-02-10 15:53:10 -08:00
Alex Vandiver	7ab4fdf250	memcached: Allow overriding the max-item-size. This is necessary for organizations with extremely large numbers of members (20k+).	2023-02-09 12:04:29 -08:00
Alex Vandiver	23894fc9a3	uploads: Set Content-Type and -Disposition from Django for local files. Similar to the previous commit, Django was responsible for setting the Content-Disposition based on the filename, whereas the Content-Type was set by nginx based on the filename. This difference is not exploitable, as even if they somehow disagreed with Django's expected Content-Type, nginx will only ever respond with Content-Types found in `uploads.types` -- none of which are unsafe for user-supplied content. However, for consistency, have Django provide both Content-Type and Content-Disposition headers.	2023-02-07 17:12:02 +00:00
Alex Vandiver	2f6c5a883e	CVE-2023-22735: Provide the Content-Disposition header from S3. The Content-Type of user-provided uploads was provided by the browser at initial upload time, and stored in S3; however, `04cf68b45e` switched to determining the Content-Disposition merely from the filename. This makes uploads vulnerable to a stored XSS, wherein a file uploaded with a content-type of `text/html` and an extension of `.png` would be served to browsers as `Content-Disposition: inline`, which is unsafe. The `Content-Security-Policy` headers in the previous commit mitigate this, but only for browsers which support them. Revert parts of `04cf68b45e`, specifically by allowing S3 to provide the Content-Disposition header, and using the `ResponseContentDisposition` argument when necessary to override it to `attachment`. Because we expect S3 responses to vary based on this argument, we include it in the cache key; since the query parameter has dashes in it, we can't use the helper `$arg_` variables, and must parse it from the query parameters manually. Adding the disposition may decrease the cache hit rate somewhat, but downloads are infrequent enough that it is unlikely to have a noticeable effect. We take care to not adjust the cache key for requests which do not specify the disposition.	2023-02-07 17:09:52 +00:00
Alex Vandiver	36e97f8121	CVE-2023-22735: Set a Content-Security-Policy header on proxied S3 data. This was missed in 04cf68b45ebb5c03247a0d6453e35ffc175d55da; as this content is fundamentally untrusted, it must be served with `Content-Security-Policy` headers in order to be safe. These headers were not provided previously for S3 content because it was served from the S3 domain. This mitigates content served from Zulip which could be a stored XSS, but only in browsers which support Content-Security-Policy headers; see subsequent commit for the complete solution.	2023-02-07 17:09:52 +00:00
Alex Vandiver	d41a00b83b	uploads: Extra-escape internal S3 paths. In nginx, `location` blocks operate on the _decoded_ URI[^1]: > The matching is performed against a normalized URI, after decoding > the text encoded in the “%XX” form This means that if a user-uploaded file contains characters that are not URI-safe, the browser encodes them in UTF-8 and then URI-encodes them -- and nginx decodes them and reassembles the original character before running the `location ~ ^/...` match. This means that the `$2` _is not URI-encoded_ and _may contain non-ASCII characters_. When `proxy_pass` is passed a value containing one or more variables, it does no encoding on that expanded value, assuming that the bytes are exactly as they should be passed to the upstream. This means that directly calling `proxy_pass https://$1/$2` would result in sending high-bit characters to the S3 upstream, which would rightly balk. However, a longstanding bug in nginx's `set` directive[^2] means that the following line: ```nginx set $download_url https://$1/$2; ``` ...results in nginx accidentally URI-encoding $1 and $2 when they are inserted, resulting in a `$download_url` which is suitable to pass to `proxy_pass`. This bug is only present with numeric capture variables, not named captures; this is particularly relevant because numeric captures are easily overridden by additional regexes elsewhere, as subsequent commits will add. Fixing this is complicated; nginx does not supply any way to escape values[^3], besides a third-party module[^4] which is an undue complication to begin using. The only variable which nginx exposes which is _not_ un-escaped already is `$request_uri`, which contains the very original URL sent by the browser -- and thus can't respect any work done in Django to generate the `X-Accel-Redirect` (e.g., for `/user_uploads/temporary/` URLs). We also cannot pass these URLs to nginx via query-parameters, since `$arg_foo` values are not URI-decoded by nginx, there is no function to do so[^3], and the values must be URI-encoded because they themselves are URLs with query parameters. Extra-URI-encode the path that we pass to the `X-Accel-Redirect` location, for S3 redirects. We rely on the `location` block un-escaping that layer, leaving `$s3_hostname` and `$s3_path` as they were intended in Django. This works around the nginx bug, with no behaviour change. [^1]: http://nginx.org/en/docs/http/ngx_http_core_module.html#location [^2]: https://trac.nginx.org/nginx/ticket/348 [^3]: https://trac.nginx.org/nginx/ticket/52 [^4]: https://github.com/openresty/set-misc-nginx-module#set_escape_uri	2023-02-07 17:09:52 +00:00
Alex Vandiver	a955f52904	uploads: Stop putting API headers on local-file upload responses. These only need the usual response headers, not the Access-Control-Origin headers that API endpoints need.	2023-02-07 17:09:52 +00:00
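
Note on d41a00b83b: the extra layer of URI-encoding described in that commit message can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the code from the commit; the function name `extra_escaped_redirect` and the `/internal/s3` prefix are hypothetical and chosen only for the example.

```python
from urllib.parse import quote, unquote, urlsplit


def extra_escaped_redirect(signed_url: str, internal_prefix: str = "/internal/s3") -> str:
    """Build an X-Accel-Redirect value with one extra layer of URI-encoding.

    Hypothetical helper: nginx decodes the path once before matching the
    `location` block, so encoding the already-encoded path a second time
    means the captured path is still URI-encoded after that decode.
    """
    parts = urlsplit(signed_url)
    # quote() leaves "/" alone but turns "%" into "%25", which is the
    # extra layer that nginx's location matching will strip off again.
    escaped_path = quote(parts.path)
    redirect = f"{internal_prefix}/{parts.netloc}{escaped_path}"
    if parts.query:
        # The query string is passed through untouched.
        redirect += "?" + parts.query
    return redirect


if __name__ == "__main__":
    url = "https://bucket.example.com/uploads/caf%C3%A9.png?sig=abc"
    redirect = extra_escaped_redirect(url)
    print(redirect)
    # Simulate the single decode that nginx applies to the path.
    print(unquote(redirect.split("?", 1)[0]))
```

After the single decode, the path segment is back to the singly-encoded form that the upstream expects, which is the property the commit message relies on.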


1515 Commits