zulip

Commit Graph

Author	SHA1	Message	Date
Alex Vandiver	3feb536df3	nagios: Remove swap check. Swap usage is not a high signal thing to alert on, and is likely to flap.	2023-03-27 15:10:50 -07:00
Alex Vandiver	262b19346e	puppet: Decrease default nginx worker_connections. Increasing worker_connections has a memory cost, unlike the rest of the changes in 1c76036c61d8; setting it to 1 million caused nginx to consume several GB of memory. Reduce the default down to 10k, and allow deploys to configure it up if necessary. `worker_rlimit_nofile` is left at 1M, since it has no impact on memory consumption.	2023-03-23 15:59:23 -07:00
Alex Vandiver	0c46bbdf9f	puppet: Update dependencies.	2023-03-23 09:50:30 -07:00
Anders Kaseorg	3a27b12a7d	dependencies: Switch to pnpm. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-03-20 15:48:29 -07:00
Alex Vandiver	f2a20b56bc	puppet: Enable sentry hooks for production and staging.	2023-03-17 08:10:31 -07:00
Alex Vandiver	1a65315566	puppet: Switch teleport to running under systemd, not supervisord. There is no reason that the base node access method should be run under supervisor, which exists primarily to give access to the `zulip` user to restart its managed services. This access is unnecessary for Teleport, and also causes unwanted restarts of Teleport services when the `supervisor` base configuration changes. Additionally, supervisor does not support the in-place upgrade process that Teleport uses, as it replaces its core process with a new one. Switch to installing a systemd configuration file (as generated by `teleport install systemd`) for each part of Teleport, customized to pass a `--config` path. As such, we explicitly disable the `teleport` service provided by the package. The supervisor process is shut down by dint of no longer installing the file, which purges it from the managed directory, and reloads Supervisor to pick up the removed service.	2023-03-15 17:23:42 -04:00
Alex Vandiver	8f8a9f6f04	sentry: Add frontend event monitoring. Zulip already has server-side Sentry integration; however, it has historically used the Zulip-specific `blueslip` library for monitoring browser-side errors. However, the latter sends errors to email, as well as optionally to an internal `#errors` stream. While this is sufficient for low volumes of users, and useful in that it does not rely on outside services, at higher volumes it is very difficult to do any analysis or filtering of the errors. Client-side errors are exceptionally noisy, with many false positives due to browser extensions or similar, so determining real errors from a stream of un-grouped emails or messages in a stream is quite difficult. Add a client-side JavaScript Sentry integration. To provide useful backtraces, this requires extending the pre-deploy hooks to upload the source-maps to Sentry. Additional keys are added to the non-public API of `page_params` to control the DSN, realm identifier, and sample rates.	2023-03-07 10:51:45 -08:00
Alex Vandiver	fc40d74cda	hooks: Remove --project from sentry when not necessary.	2023-03-07 10:51:45 -08:00
Alex Vandiver	08251ac53b	hooks: Fix typo in sentry error message.	2023-03-07 10:51:45 -08:00
Alex Vandiver	26eb1d7371	puppet: Also set systemd limits.	2023-03-03 16:39:47 -08:00
Alex Vandiver	1c76036c61	puppet: Increase maximum file descriptors. The current threshold of 40k descriptors was set in 2016, chosen to be "at least 40x our current scale." At present, that only provides a 50% safety margin. Increase to 1 million to provide the same 40x buffer as previously. The highest value currently allowed by the kernels in production (linux 5.3.0) is 1048576. This is set as the hard limit. The 1 million limit is likely far above what the system can handle for other reasons (memory, cpu, etc). While this removes a potential safeguard on overload due to too many connections, due to the longpoll architecture we would generally prefer to service more connections at lower quality (due to CPU limitations) rather than randomly reject additional connections. Relevant prior commits: - `836f313e69` - `f2f97dd335` - `ec23996538` - `8806ec698a` - `e4fce10f46`	2023-03-03 16:39:47 -08:00
Alex Vandiver	a20bb54cbb	puppet: Move limits.conf to maintain more of the installation structure.	2023-03-03 16:39:47 -08:00
Tim Abbott	6b37f9a290	puppet: Run delete-old-unclaimed-attachments in archive cron file. After reflecting a bit on the last commit, I think it's substantially easier to understand what's happening for these two tasks to be defined in the same file, because we want the timing to be different to avoid potential races.	2023-03-01 11:21:42 -08:00
Mateusz Mandera	35344f7f6b	puppet: Add cronjob to run delete_old_unclaimed_attachments daily.	2023-03-01 11:16:39 -08:00
Alex Vandiver	e7fabb45f2	puppet: Pin with sha256sum verification.	2023-02-28 00:04:39 -05:00
Alex Vandiver	0d42abe1a8	puppet: wal-g is a tarball with a single file, not a directory. `5db55c38dc` switched from `ensure => present` to the more specific `ensure => directory` on the premise that tarballs would result in more than one file being copied out of them. However, we only extract a single file from the wal-g tarball, and install it at the output path. The new rule attempts to replace it with an empty directory after extraction. Switch back to `ensure => present` for the tarball codepath.	2023-02-14 18:18:36 -05:00
Alex Vandiver	6f8ce2d00a	hooks: Fix shebang line to use /usr/bin/env bash.	2023-02-14 17:28:58 -05:00
Alex Vandiver	044ccdb334	chat.zulip.org: Enable Sentry hook.	2023-02-14 17:20:35 -05:00
Alex Vandiver	3109d40b21	puppet: Add a sentry release class. This installs the Sentry CLI, and uses it to send API events to Sentry when a release is started and completed.	2023-02-10 15:53:10 -08:00
Alex Vandiver	5db55c38dc	puppet: Add a sha256_file_to.	2023-02-10 15:53:10 -08:00
Alex Vandiver	af0ba0b58f	puppet: sha256_tarball_to is only ever called with one from/to.	2023-02-10 15:53:10 -08:00
Alex Vandiver	840884ec89	upgrade-zulip: Provide directories to run hooks before/after upgrade. These hooks are run immediately around the critical section of the upgrade. If the upgrade fails for preparatory reasons, the pre-deploy hook may not be run; if it fails during the upgrade, the post-deploy hook will not be run. Hooks are called from the CWD of the new deploy, with arguments of the old version and the new version. If they exit with non-0 exit code, the deploy aborts.	2023-02-10 15:53:10 -08:00
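The hook contract described in the commit above (run from the new deploy's directory, passed the old and new versions, abort on a non-zero exit) can be sketched roughly as follows. This is a minimal illustration, not the code from `upgrade-zulip`: the `run_hooks` name, the directory layout, and running hooks in sorted order are all assumptions made for the example.

```python
import os
import subprocess
from pathlib import Path


def run_hooks(hooks_dir: Path, deploy_dir: Path, old_version: str, new_version: str) -> None:
    """Hypothetical helper: run every executable file found in hooks_dir."""
    if not hooks_dir.is_dir():
        return
    # Sorted order is an assumption of this sketch, chosen to be deterministic.
    for hook in sorted(hooks_dir.iterdir()):
        if not hook.is_file() or not os.access(hook, os.X_OK):
            continue
        # check=True raises CalledProcessError on a non-zero exit code,
        # which is how this sketch models "the deploy aborts".
        subprocess.run(
            [str(hook.resolve()), old_version, new_version],
            cwd=deploy_dir,
            check=True,
        )
```

A caller would let the `subprocess.CalledProcessError` propagate, so that a failing hook stops the upgrade before the next step runs.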
Alex Vandiver	7ab4fdf250	memcached: Allow overriding the max-item-size. This is necessary for organizations with extremely large numbers of members (20k+).	2023-02-09 12:04:29 -08:00
Alex Vandiver	23894fc9a3	uploads: Set Content-Type and -Disposition from Django for local files. Similar to the previous commit, Django was responsible for setting the Content-Disposition based on the filename, whereas the Content-Type was set by nginx based on the filename. This difference is not exploitable, as even if they somehow disagreed with Django's expected Content-Type, nginx will only ever respond with Content-Types found in `uploads.types` -- none of which are unsafe for user-supplied content. However, for consistency, have Django provide both Content-Type and Content-Disposition headers.	2023-02-07 17:12:02 +00:00
Alex Vandiver	2f6c5a883e	CVE-2023-22735: Provide the Content-Disposition header from S3. The Content-Type of user-provided uploads was provided by the browser at initial upload time, and stored in S3; however, `04cf68b45e` switched to determining the Content-Disposition merely from the filename. This makes uploads vulnerable to a stored XSS, wherein a file uploaded with a content-type of `text/html` and an extension of `.png` would be served to browsers as `Content-Disposition: inline`, which is unsafe. The `Content-Security-Policy` headers in the previous commit mitigate this, but only for browsers which support them. Revert parts of `04cf68b45e`, specifically by allowing S3 to provide the Content-Disposition header, and using the `ResponseContentDisposition` argument when necessary to override it to `attachment`. Because we expect S3 responses to vary based on this argument, we include it in the cache key; since the query parameter has dashes in it, we can't use the helper `$arg_` variables, and must parse it from the query parameters manually. Adding the disposition may decrease the cache hit rate somewhat, but downloads are infrequent enough that it is unlikely to have a noticeable effect. We take care to not adjust the cache key for requests which do not specify the disposition.	2023-02-07 17:09:52 +00:00
Alex Vandiver	36e97f8121	CVE-2023-22735: Set a Content-Security-Policy header on proxied S3 data. This was missed in 04cf68b45ebb5c03247a0d6453e35ffc175d55da; as this content is fundamentally untrusted, it must be served with `Content-Security-Policy` headers in order to be safe. These headers were not provided previously for S3 content because it was served from the S3 domain. This mitigates content served from Zulip which could be a stored XSS, but only in browsers which support Content-Security-Policy headers; see subsequent commit for the complete solution.	2023-02-07 17:09:52 +00:00
Alex Vandiver	d41a00b83b	uploads: Extra-escape internal S3 paths. In nginx, `location` blocks operate on the _decoded_ URI[^1]: > The matching is performed against a normalized URI, after decoding > the text encoded in the “%XX” form This means that if a user-uploaded file contains characters that are not URI-safe, the browser encodes them in UTF-8 and then URI-encodes them -- and nginx decodes them and reassembles the original character before running the `location ~ ^/...` match. This means that the `$2` _is not URI-encoded_ and _may contain non-ASCII characters_. When `proxy_pass` is passed a value containing one or more variables, it does no encoding on that expanded value, assuming that the bytes are exactly as they should be passed to the upstream. This means that directly calling `proxy_pass https://$1/$2` would result in sending high-bit characters to the S3 upstream, which would rightly balk. However, a longstanding bug in nginx's `set` directive[^2] means that the following line: ```nginx set $download_url https://$1/$2; ``` ...results in nginx accidentally URI-encoding $1 and $2 when they are inserted, resulting in a `$download_url` which is suitable to pass to `proxy_pass`. This bug is only present with numeric capture variables, not named captures; this is particularly relevant because numeric captures are easily overridden by additional regexes elsewhere, as subsequent commits will add. Fixing this is complicated; nginx does not supply any way to escape values[^3], besides a third-party module[^4] which is an undue complication to begin using. The only variable which nginx exposes which is _not_ un-escaped already is `$request_uri`, which contains the very original URL sent by the browser -- and thus can't respect any work done in Django to generate the `X-Accel-Redirect` (e.g., for `/user_uploads/temporary/` URLs). 
We also cannot pass these URLs to nginx via query-parameters, since `$arg_foo` values are not URI-decoded by nginx, there is no function to do so[^3], and the values must be URI-encoded because they themselves are URLs with query parameters. Extra-URI-encode the path that we pass to the `X-Accel-Redirect` location, for S3 redirects. We rely on the `location` block un-escaping that layer, leaving `$s3_hostname` and `$s3_path` as they were intended in Django. This works around the nginx bug, with no behaviour change. [^1]: http://nginx.org/en/docs/http/ngx_http_core_module.html#location [^2]: https://trac.nginx.org/nginx/ticket/348 [^3]: https://trac.nginx.org/nginx/ticket/52 [^4]: https://github.com/openresty/set-misc-nginx-module#set_escape_uri	2023-02-07 17:09:52 +00:00
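The extra layer of URI-encoding described in the commit above can be illustrated in Python. This is a minimal sketch, not the code from the commit: `double_quoted_path` is a hypothetical helper, the example path is made up, and nginx's decode step is simulated with `urllib.parse.unquote`.

```python
from urllib.parse import quote, unquote


def double_quoted_path(path: str) -> str:
    """Hypothetical helper: URI-encode a path twice."""
    # First layer: the form the upstream (S3) should ultimately receive.
    once = quote(path, safe="/")
    # Second layer: consumed when nginx decodes the URI for `location` matching.
    return quote(once, safe="/")


original = "/bucket/héllo wörld.png"  # made-up path with non-ASCII characters
sent_to_nginx = double_quoted_path(original)

# Simulate the single decode nginx performs before matching `location`.
seen_in_location = unquote(sent_to_nginx)

print(sent_to_nginx)
print(seen_in_location)
```

After the simulated decode, the captured value is still URI-encoded once and contains only ASCII, which is the property the commit relies on when handing the value to `proxy_pass`.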
Alex Vandiver	a955f52904	uploads: Stop putting API headers on local-file upload responses. These only need the usual response headers, not the Access-Control-Origin headers that API endpoints need.	2023-02-07 17:09:52 +00:00
Anders Kaseorg	df001db1a9	black: Reformat with Black 23. Black 23 enforces some slightly more specific rules about empty line counts and redundant parenthesis removal, but the result is still compatible with Black 22. (This does not actually upgrade our Python environment to Black 23 yet.) Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-02-02 10:40:13 -08:00
Alex Vandiver	68f4071873	puppet: Allow choice of timesync tool.	2023-01-31 14:20:41 -08:00
Tran Sang	3bea65b39c	puppet: Set /etc/mailname based on postfix.mailname configuration. The `postfix.mailname` setting in `/etc/zulip.conf` was previously only used for incoming mail, to identify in Postfix configuration which messages were "local." Also set `/etc/mailname`, which is used by Postfix to set how it identifies to other hosts when sending outgoing email. Co-authored-by: Alex Vandiver <alexmv@zulip.com>	2023-01-27 15:08:22 -05:00
Alex Vandiver	e8123dfeea	puppet: Match the `x` bits on directories to what puppet actually does. Puppet _always_ sets the `+x` bit on directories if they have the `r` bit set for that slot[^1]: > When specifying numeric permissions for directories, Puppet sets the > search permission wherever the read permission is set. As such, for instance, `0640` is actually applied as `0750`. Fix what we "want" to match what puppet is applying, by adding the `x` bit. In none of these cases did we actually intend the directory to not be executable. [^1]: https://www.puppet.com/docs/puppet/5.5/types/file.html#file-attribute-mode	2023-01-26 15:06:01 -08:00
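The rule quoted in the commit above (search permission is set wherever read permission is set) can be modeled in a few lines of Python. This is only a sketch of the rule as the commit message describes it, not Puppet's own implementation; `effective_directory_mode` is a hypothetical name.

```python
def effective_directory_mode(mode: int) -> int:
    """Hypothetical helper: add the x bit to each slot that has the r bit."""
    # Slots are user (shift 6), group (shift 3), and other (shift 0).
    for shift in (6, 3, 0):
        if mode & (0o4 << shift):
            mode |= 0o1 << shift
    return mode


# The example from the commit message: 0640 is applied as 0750.
print(oct(effective_directory_mode(0o640)))
```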
Alex Vandiver	372bba4a8e	puppet: Stop creating a /home/zulip/logs. This was last really used in `d7a3570c7e`, in 2013, when it was `/home/humbug/logs`. Repoint the one obscure piece of tooling that writes there, and remove the places that created it.	2023-01-26 15:06:01 -08:00
Alex Vandiver	7f2514b316	puppet: Collapse identical blocks.	2023-01-26 15:06:01 -08:00
Alex Vandiver	09bb0e6fd0	puppet: Upgrade Grafana.	2023-01-26 10:24:24 -08:00
Alex Vandiver	d0de66b273	puppet: Remove "ensure => absent" rules which have all been applied.	2023-01-24 13:05:24 -08:00
Alex Vandiver	50e9df448d	puppet: Do not start the "puppet" service. Zulip runs puppet manually, using the command-line tool; it does not make use of the `puppet` service which, by default, attempts to contact a host named `puppet` every two minutes to get a manifest to apply. These attempts can generate log spam and user confusion. Disable and stop the `puppet` service via puppet.	2023-01-23 13:02:09 -08:00
Anders Kaseorg	7a7513f6e0	ruff: Fix SIM201 Use `… != …` instead of `not … == …`. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-01-23 11:18:36 -08:00
Anders Kaseorg	b0e569f07c	ruff: Fix SIM102 nested `if` statements. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-01-23 11:18:36 -08:00
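The two ruff rules named in the commits above can be shown side by side on a small example. The function names and arguments here are hypothetical and are not taken from the Zulip codebase.

```python
def needs_update_before(installed: str, wanted: str, enabled: bool) -> bool:
    # Flagged style: a nested `if` (SIM102) and `not ... == ...` (SIM201).
    if enabled:
        if not installed == wanted:
            return True
    return False


def needs_update_after(installed: str, wanted: str, enabled: bool) -> bool:
    # Rewritten style: one combined condition, using `!=`.
    if enabled and installed != wanted:
        return True
    return False
```

Both versions return the same result for every input; the rewrite only changes how the condition is spelled.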
Alex Vandiver	04cf68b45e	uploads: Serve S3 uploads directly from nginx. When file uploads are stored in S3, this means that Zulip serves as a 302 to S3. Because browsers do not cache redirects, this means that no image contents can be cached -- and upon every page load or reload, every recently-posted image must be re-fetched. This incurs extra load on the Zulip server, as well as potentially excessive bandwidth usage from S3, and on the client's connection. Switch to fetching the content from S3 in nginx, and serving the content from nginx. These have `Cache-control: private, immutable` headers set on the response, allowing browsers to cache them locally. Because nginx fetching from S3 can be slow, and requests for uploads will generally be bunched around when a message containing them is first posted, we instruct nginx to cache the contents locally. This is safe because uploaded file contents are immutable; access control is still mediated by Django. The nginx cache key is the URL without query parameters, as those parameters include a time-limited signed authentication parameter which lets nginx fetch the non-public file. This adds a number of nginx-level configuration parameters to control the caching which nginx performs, including the amount of in-memory index for the cache, the maximum storage of the cache on disk, and how long data is retained in the cache. The currently-chosen figures are reasonable for small to medium deployments. The most notable effect of this change is in allowing browsers to cache uploaded image content; however, while there will be many fewer requests, it also has an improvement on request latency. The following tests were done with a non-AWS client in SFO, a server and S3 storage in us-east-1, and with 100 requests after 10 requests of warm-up (to fill the nginx cache). The mean and standard deviation are shown. 
| | Redirect to S3 | Caching proxy, hot | Caching proxy, cold |
| ----------------- | ------------------- | ------------------- | ------------------- |
| Time in Django | 263.0 ms ± 28.3 ms | 258.0 ms ± 12.3 ms | 258.0 ms ± 12.3 ms |
| Small file (842b) | 586.1 ms ± 21.1 ms | 266.1 ms ± 67.4 ms | 288.6 ms ± 17.7 ms |
| Large file (660k) | 959.6 ms ± 137.9 ms | 609.5 ms ± 13.0 ms | 648.1 ms ± 43.2 ms |
The hot-cache performance is faster for both large and small files, since it saves the client the time of having to make a second request to a separate host. This performance improvement remains at least 100ms even if the client is on the same coast as the server. Cold nginx caches are only slightly slower than hot caches, because VPC access to S3 endpoints is extremely fast (assuming it is in the same region as the host), and nginx can pool connections to S3 and reuse them. However, all of the 648ms taken to serve a cold-cache large file is occupied in nginx, as opposed to the only 263ms which was spent in nginx when using redirects to S3. This means that to overall spend less time responding to uploaded-file requests in nginx, clients will need to find files in their local cache, and skip making an uploaded-file request, at least 60% of the time. Modeling shows a reduction in the number of client requests by about 70% - 80%. The `Content-Disposition` header logic can now also be entirely shared with the local-file codepath, as can the `url_only` path used by mobile clients. While we could provide the direct-to-S3 temporary signed URL to mobile clients, we choose to provide the served-from-Zulip signed URL, to better control caching headers on it, and greater consistency. In doing so, we adjust the salt used for the URL; since these URLs are only valid for 60s, the effect of this salt change is minimal.	2023-01-09 18:23:58 -05:00
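The "at least 60% of the time" break-even figure in the commit above follows from the two timings it quotes. A quick check of the arithmetic, taking the 263 ms and 648 ms figures at face value and assuming (as a worst case) that every request still made is a cold-cache large-file fetch:

```python
redirect_ms = 263.0    # per-request time when redirecting to S3
cold_proxy_ms = 648.0  # per-request time for a cold-cache large file

# Total time is lower only if the fraction of requests still made, f,
# satisfies f * cold_proxy_ms < redirect_ms.
max_fraction_made = redirect_ms / cold_proxy_ms
min_fraction_skipped = 1 - max_fraction_made

print(f"{min_fraction_skipped:.1%}")  # prints 59.4%
```

Under those assumptions the threshold comes out just under 60%, which is below the 70% to 80% reduction in client requests that the commit message reports from modeling.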
Alex Vandiver	ed6d62a9e7	avatars: Serve /user_avatars/ through Django, which offloads to nginx. Moving `/user_avatars/` to being served partially through Django removes the need for the `no_serve_uploads` nginx reconfiguring when switching between S3 and local backends. This is important because a subsequent commit will move S3 attachments to being served through nginx, which would make `no_serve_uploads` entirely nonsensical of a name. Serve the files through Django, with an offload for the actual image response to an internal nginx route. In development, serve the files directly in Django. We do _not_ mark the contents as immutable for caching purposes, since the path for avatar images is hashed only by their user-id and a salt, and as such are reused when a user's avatar is updated.	2023-01-09 18:23:58 -05:00
Alex Vandiver	24f95a3788	uploads: Move internal upload serving path to under /internal/.	2023-01-09 18:23:58 -05:00
Alex Vandiver	b20ecabf8f	tornado: Move internal tornado redirect to under /internal/.	2023-01-09 18:23:58 -05:00
Alex Vandiver	cc9b028312	uploads: Set X-Accel-Redirect manually, without using django-sendfile2. The `django-sendfile2` module unfortunately only supports a single `SENDFILE` root path -- an invariant which subsequent commits need to break. Especially as Zulip only runs with a single webserver, and thus sendfile backend, the functionality is simple to inline. It is worth noting that the following headers from the initial Django response are _preserved_, if present, and sent unmodified to the client; all other headers are overridden by those supplied by the internal redirect[^1]: - Content-Type - Content-Disposition - Accept-Ranges - Set-Cookie - Cache-Control - Expires As such, we explicitly unset the Content-type header to allow nginx to set it from the static file, but set Content-Disposition and Cache-Control as we want them to be. [^1]: https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/	2023-01-09 18:23:58 -05:00
Alex Vandiver	497abc2e48	nginx: Move uploads handling into app_frontend_base. As uploads are a feature of the application, not of a generic nginx deployment, move them into the `zulip::app_frontend_base` class. This is purely for organizational clarity -- we do not support deployments which have `zulip::nginx` but not `zulip::app_frontend_base`.	2023-01-09 18:23:58 -05:00
Anders Kaseorg	f7e97b1180	ruff: Fix PLW0602 Using global but no assignment is done. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-01-04 16:25:07 -08:00
Alex Vandiver	8ba51f90e6	puppet: Go's website is officially go.dev, not golang.org.	2023-01-04 14:33:37 -08:00
Anders Kaseorg	f3f5dfb5aa	ruff: Fix RUF004 exit() is only available in the interpreter. ‘exit’ is pulled in for the interactive interpreter as a side effect of the site module; this can be disabled with python -S and shouldn’t be relied on. Also, use the NoReturn type where appropriate. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-12-04 22:11:24 -08:00
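The change described in the commit above can be illustrated with a small example: call `sys.exit()` rather than the interpreter-only `exit()`, and annotate functions that never return with `NoReturn`. The `fail` helper is a hypothetical name, not one from the Zulip codebase.

```python
import sys
from typing import NoReturn


def fail(message: str) -> NoReturn:
    """Hypothetical helper: report an error and terminate."""
    print(message, file=sys.stderr)
    # sys.exit() is always available; the bare exit() builtin is injected
    # by the site module and is missing under `python -S`.
    sys.exit(1)
```

Because `sys.exit()` raises `SystemExit`, the `NoReturn` annotation lets a type checker know that code after a call to `fail()` is unreachable.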
Alex Vandiver	ea9988cc9e	grafana: Upgrade to 9.3.0.	2022-11-30 12:41:18 -05:00
Alex Vandiver	7069e2c8c2	puppet: Align more sections of $versions.	2022-11-30 12:13:47 -05:00

1493 Commits