Commit Graph

1527 Commits

Author SHA1 Message Date
Anders Kaseorg 16aa7c0923 puppet: Migrate Ruby functions from legacy Puppet 3.x API.
https://www.puppet.com/docs/puppet/7/functions_refactor_legacy.html

This removes a bug in the 3.x API that was converting nil to the empty
string, so some templates need to be adjusted.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-05-12 18:17:53 -07:00
Anders Kaseorg cf8ae46291 puppet: Fix shell escaping in Ruby functions.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-05-12 18:17:53 -07:00
Anders Kaseorg 614ab533dc puppet: Reformat Ruby functions with rufo.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-05-12 18:17:53 -07:00
Alex Vandiver f4683de742 puppet: Switch the `rolling_restart` setting to use the bool values.
2c5fc1827c standardized which values are "true"; use them.
2023-05-11 15:54:15 -07:00
Alex Vandiver 530980cf31 zulip_tools: Add a get_config_bool to match Puppet logic.
Unfortunately, the existing use of this logic in `process_fts_updates`
cannot switch to using this code, as that code cannot import
zulip_tools.
2023-05-11 15:54:15 -07:00
Alex Vandiver 4d02ac6fb1 puppet: Bring back uwsgi_rolling_restart config.
a522ad1d9a mistakenly deleted this variable assignment, which made
the `zulip.conf` configuration setting not work -- uwsgi's `lazy_apps`
were not enabled, which are required for rolling restart.
2023-05-11 15:54:15 -07:00
Alex Vandiver da2c1ad839 puppet: Fix checksum of sentry-cli binary. 2023-05-11 13:39:54 -07:00
Alex Vandiver 1019a74c6f puppet: Update dependencies. 2023-05-11 10:51:37 -07:00
Alex Vandiver f11350f789 puppet: Add PostgreSQL 15 support.
Instead of copying over a mostly-unchanged `postgresql.conf`, we
transition to deploying a `conf.d/zulip.conf` which contains the
only material changes we made to the file, which were previously
appended to the end.

While shipping separate while `postgresql.conf` files for each
supported version is useful if there is large variety in supported
options between versions, there is not no such variation at current,
and the burden of overriding the entire default configuration is that
it must be keep up to date wit the package's version.
2023-05-10 14:06:02 -07:00
Alex Vandiver a9f51a0c02 static: Add Timing-Allow-Origin: * to allow sentry data timing.
This is required for the browser to provide detailed timing
information about resource fetches from other domains[^1].

[^1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Timing-Allow-Origin
2023-05-09 13:16:28 -07:00
Alex Vandiver e5ae55637e install: Remove PostgreSQL 11 support.
Django 4.2 removes this support, so Zulip has not installed with
PostgreSQL 11 since 2c20028aa4.
2023-05-05 13:35:32 -07:00
Alex Vandiver 2f4775ba68 wal-g: Write out a logfile.
Otherwise, this output goes into `/var/spool/mail/postgres`, which is
not terribly helpful.  We do not write to `/var/log/zulip` because the
backup runs as the `postgres` user, and `/var/log/zulip` is owned by
zulip and chmod 750.
2023-04-27 12:19:43 -07:00
Alex Vandiver 3aba2789d3 prometheus: Add an exporter for wal-g backup properties.
Since backups may now taken on arbitrary hosts, we need a blackbox
monitor that _some_ backup was produced.

Add a Prometheus exporter which calls `wal-g backup-list` and reports
statistics about the backups.

This could be extended to include `wal-g wal-verify`, but that
requires a connection to the PostgreSQL server.
2023-04-26 15:41:39 -07:00
Alex Vandiver b8a6de95d2 pg_backup_and_purge: Allow adjusting the backup concurrency.
SSDs are good at parallel random reads.
2023-04-26 10:54:51 -07:00
Alex Vandiver 19a11c9556 pg_backup_and_purge: Take backups on replicas, if present.
Taking backups on the database primary adds additional disk load,
which can impact the performance of the application.

Switch to taking backups on replicas, if they exist.  Some deployments
may have multiple replicas, and taking backups on all of them is
wasteful and potentially confusing; add a flag to inhibit taking
nightly snapshots on the host.

If the deployment is a single instance of PostgreSQL, with no
replicas, it takes backups as before, modulo the extra flag to allow
skipping taking them.
2023-04-26 10:54:51 -07:00
Alex Vandiver 4b35211ca1 pg_backup_and_purge: Remove unnecessary explicit types. 2023-04-26 10:54:51 -07:00
Alex Vandiver e72e83793d pg_backup_and_purge: Just use subprocess directly.
dry_run was never passed into run(); switch to using subprocess directly.
2023-04-26 10:54:51 -07:00
Alex Vandiver cace8858f9 puppet: Move logrotate config into app_frontend_base.
7c023042cf moved the logrotate configuration to being a templated
file, from a static file, but missed that the static file was still
referenced from `zulip_ops::app_frontend`; it only updated
`zulip::profile::app_frontend`.  This caused errors in applying puppet
on any `zulip_ops::app_frontend` host.

Prior to 7c023042cf, the Puppet role was identical between those two
classes; deduplicate the rule by moving the updated template
definition into `zulip::app_frontend_base` which is common to those
two classes and not used in any other classes.
2023-04-19 09:34:37 -07:00
Alex Vandiver 775c7ca4ea hooks: Give a bit better Zulip deploy message. 2023-04-19 09:32:39 -07:00
Alex Vandiver d0fc3f1c2e puppet: Add prod hooks to push zulip-cloud-current and notify CZO. 2023-04-12 11:36:33 -07:00
Alex Vandiver 7c023042cf puppet: Rotate access log files every day, not at 500M.
Since logrotate runs in a daily cron, this practically means "daily,
but only if it's larger than 500M."  For large installs with large
traffic, this is effectively daily for 10 days; for small installs, it
is an unknown amount of time.

Switch to daily logfiles, defaulting to 14 days to match nginx; this
can be overridden using a zulip.conf setting.  This makes it easier to
ensure that access logs are only kept for a bounded period of time.
2023-04-06 14:31:16 -04:00
Tim Abbott 561daee2a1 puppet: Update declared zmirror dependencies.
Following zulip/python-zulip-api/pull/758/, we're no longer using
python-zephyr, and don't need to build it from source. Additionally,
we no longer need to build a forked Zephyr package, since ZLoadSession
and ZDumpSession were merged in
e6a545e759.
2023-04-06 09:45:06 -07:00
Alex Vandiver 6975417acf puppet: Create zmirror supervisor subdirectory.
To not change the `supervisor.conf` file, which requires a restart of
supervisor (and thus all services running under it, which is extremely
disruptive) we carefully leave the contents unchanged for most
installs, and append a new piece to the file, only for the zmirror
configuration, using `concat`.
2023-04-06 09:45:06 -07:00
Alex Vandiver c519ba40fd hooks: Add a push_git_ref post-deploy hook. 2023-04-05 18:51:55 -04:00
Alex Vandiver 8a771c7ac0 hooks: Add a hook to send a Zulip before/after the deploy. 2023-04-05 18:51:55 -04:00
Alex Vandiver 377f2d6d03 hooks: Add a common/ directory and factor out common Sentry code. 2023-04-05 18:51:55 -04:00
Alex Vandiver f4d70a2e37 hooks: Resolve version strings to commit SHAs, and pass in via the env. 2023-04-05 18:51:55 -04:00
Alex Vandiver ecfb12404a hooks: Switch to passing values through the environment. 2023-04-05 18:51:55 -04:00
Alex Vandiver 160a917ad3 hooks: Add a helper to install a single static file. 2023-04-05 18:51:55 -04:00
Alex Vandiver 0c13bacb89 sentry: Switch shell variables to lower-case. 2023-04-05 18:51:55 -04:00
Alex Vandiver 7202a98438 cron: Move fetch-tor-exit-nodes to not on the hour.
We see connection timeouts and other access issues when run exactly on
the hour, either due to load on their servers from similar cron jobs,
or from operational processes of theirs.

Move to on the :17s to avoid these access issues.
2023-04-05 12:20:30 -07:00
Alex Vandiver db0ae85d97 sentry: Remove an unnecessary sudo.
790e4854dd made the hooks run as the `zulip` user, making this sudo
unnecessary.
2023-04-03 15:04:56 -07:00
Alex Vandiver 89e366771a prometheus: Add a postgres exporter. 2023-03-30 16:16:18 -07:00
Alex Vandiver c2beb64a79 prometheus: Consistently import the base class and supervisor, if needed. 2023-03-30 16:16:18 -07:00
Alex Vandiver 3feb536df3 nagios: Remove swap check.
Swap usage is not a high signal thing to alert on, and is likely to flap.
2023-03-27 15:10:50 -07:00
Alex Vandiver 262b19346e puppet: Decrease default nginx worker_connections.
Increasing worker_connections has a memory cost, unlike the rest of
the changes in 1c76036c61d8; setting it to 1 million caused nginx to
consume several GB of memory.

Reduce the default down to 10k, and allow deploys to configure it up
if necessary.  `worker_rlimit_nofile` is left at 1M, since it has no
impact on memory consumption.
2023-03-23 15:59:23 -07:00
Alex Vandiver 0c46bbdf9f puppet: Update dependencies. 2023-03-23 09:50:30 -07:00
Anders Kaseorg 3a27b12a7d dependencies: Switch to pnpm.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-03-20 15:48:29 -07:00
Alex Vandiver f2a20b56bc puppet: Enable sentry hooks for production and staging. 2023-03-17 08:10:31 -07:00
Alex Vandiver 1a65315566 puppet: Switch teleport to running under systemd, not supervisord.
There is no reason that the base node access method should be run
under supervisor, which exists primarily to give access to the `zulip`
user to restart its managed services.  This access is unnecessary for
Teleport, and also causes unwanted restarts of Teleport services when
the `supervisor` base configuration changes.  Additionally,
supervisor does not support the in-place upgrade process that Teleport
uses, as it replaces its core process with a new one.

Switch to installing a systemd configuration file (as generated by
`teleport install systemd`) for each part of Teleport, customized to
pass a `--config` path.  As such, we explicitly disable the `teleport`
service provided by the package.

The supervisor process is shut down by dint of no longer installing
the file, which purges it from the managed directory, and reloads
Supervisor to pick up the removed service.
2023-03-15 17:23:42 -04:00
Alex Vandiver 8f8a9f6f04 sentry: Add frontend event monitoring.
Zulip already has integrations for server-side Sentry integration;
however, it has historically used the Zulip-specific `blueslip`
library for monitoring browser-side errors.  However, the latter sends
errors to email, as well optionally to an internal `#errors` stream.
While this is sufficient for low volumes of users, and useful in that
it does not rely on outside services, at higher volumes it is very
difficult to do any analysis or filtering of the errors.  Client-side
errors are exceptionally noisy, with many false positives due to
browser extensions or similar, so determining real real errors from a
stream of un-grouped emails or messages in a stream is quite
difficult.

Add a client-side Javascript sentry integration.  To provide useful
backtraces, this requires extending the pre-deploy hooks to upload the
source-maps to Sentry.  Additional keys are added to the non-public
API of `page_params` to control the DSN, realm identifier, and sample
rates.
2023-03-07 10:51:45 -08:00
Alex Vandiver fc40d74cda hooks: Remove --project from sentry when not necessary. 2023-03-07 10:51:45 -08:00
Alex Vandiver 08251ac53b hooks: Fix typo in sentry error message. 2023-03-07 10:51:45 -08:00
Alex Vandiver 26eb1d7371 puppet: Also set systemd limits. 2023-03-03 16:39:47 -08:00
Alex Vandiver 1c76036c61 puppet: Increase maximum file descriptors.
The current threshold of 40k descriptors was set in 2016, chosen to be
"at least 40x our current scale."  At present, that only provides a
50% safety margin.  Increase to 1 million to provide the same 40x
buffer as previously.

The highest value currently allowed by the kernels in
production (linux 5.3.0) is 1048576.  This is set as the hard limit.

The 1 million limit is likely far above what the system can handle for
other reasons (memory, cpu, etc).  While this removes a potential
safeguard on overload due to too many connections, due to the longpoll
architecture we would generally prefer to service more connections at
lower quality (due to CPU limitations) rather than randomly reject
additional connections.

Relevant prior commits:
 - 836f313e69
 - f2f97dd335
 - ec23996538
 - 8806ec698a
 - e4fce10f46
2023-03-03 16:39:47 -08:00
Alex Vandiver a20bb54cbb puppet: Move limits.conf to maintain more of the installation structure. 2023-03-03 16:39:47 -08:00
Tim Abbott 6b37f9a290 puppet: Run delete-old-unclaimed-attachments in archive cron file.
After reflecting a bit on the last commit, I think it's substantially
easier to understand what's happening for these two tasks to be
defined in the same file, because we want the timing to be different
to avoid potential races.
2023-03-01 11:21:42 -08:00
Mateusz Mandera 35344f7f6b puppet: Add cronjob to run delete_old_unclaimed_attachments daily. 2023-03-01 11:16:39 -08:00
Alex Vandiver e7fabb45f2 puppet: Pin with sha256sum verification. 2023-02-28 00:04:39 -05:00
Alex Vandiver 0d42abe1a8 puppet: wal-g is a tarball with a single file, not a directory.
5db55c38dc switched from `ensure => present` to the more specific
`ensure => directory` on the premise that tarballs would result in
more than one file being copied out of them.  However, we only extract
a single file from the wal-g tarball, and install it at the output
path.  The new rule attempts to replace it with an empty directory
after extraction.

Switch back to `ensure => present` for the tarball codepath.
2023-02-14 18:18:36 -05:00