Commit Graph

1465 Commits

Author SHA1 Message Date
Alex Vandiver 8bc26aab08 nagios: Switch check_user_zephyr_mirror_liveness to run via cron.
This check loads Django, and as such must be run as the zulip user.
Repeat the same pattern used elsewhere in nagios, of writing a state
file, which is read by `check_cron_file`.
2022-06-22 12:07:38 -07:00
Alex Vandiver 41deef40cf nagios: Switch to generic check_cron_file for queues and consumers.
These share a common root; 91da4bd59b duplicated the code, but
didn't move the existing uses to the new utility.
2022-06-22 12:07:38 -07:00
Alex Vandiver b2d0bad9af check_cron_file: Remove unnecessary quotes. 2022-06-22 12:07:38 -07:00
Alex Vandiver 41b7ae4e44 check_cron_file: Don't crash on missing cron file.
This is 5050fb19f6, but for `check_cron_file`, which was introduced
in 91da4bd59b.
2022-06-22 12:07:38 -07:00
Alex Vandiver 8fbde9b8c5 nagios: Only run check_fts_update_log on one PostgreSQL host.
The data is the same in the table in all replicas -- there is no need
to alert on all of them.
2022-06-22 12:07:38 -07:00
Alex Vandiver 499284d2fd nagios: Split postgresql into primary and replica.
Replication checks should only run on primary and replicas, not
standalone hosts; while `autovac_freeze` currently only runs on
primary hosts, it functions identically on replicas, and is fine to
run there.

Make `autovac_freeze` run on all `postgresql` hosts, and make
standalone hosts no longer `postgres_primary`, so they do not fail the
replication tests.
2022-06-22 12:07:38 -07:00
Alex Vandiver 38e435347b nagios: Add missing queue consumer checks. 2022-06-22 12:07:38 -07:00
Alex Vandiver e01a4242aa nagios: Sort queue consumer checks. 2022-06-22 12:07:38 -07:00
Alex Vandiver 2c90c7a010 nagios: Switch `check_remote_arg_string` queue checks to consumer checks.
These style of checks just look for matching process names using
`check_remote_arg_string`, which dates to 8edbd64bb8.  These were
added because the original two (`missedmessage_emails` and
`slow_queries`) did not create consumers, instead polling for events.

Switch these to checking the queue consumer counts that the
`check-rabbitmq-consumers` check is already writing out.  Since the
`missedmessage_emails` was _already_ checked via the consumer check, a
duplicate is not added.
2022-06-22 12:07:38 -07:00
Alex Vandiver f48d543d9b nagios: Make and use a "rabbitmq-consumer-service" template service. 2022-06-22 12:07:38 -07:00
Alex Vandiver 775a084d0f nagios: Add a catchall "other" set. 2022-06-22 12:07:38 -07:00
Alex Vandiver 83c82c8e15 nagios: Adjust load alerting by hostgroup.
Even the `pageable_servers` group did not page for high load -- in
part because what was "high" depends on the servers.  Set slightly
better limits based on server role.
2022-06-22 12:07:38 -07:00
Alex Vandiver 2a14aa5180 nagios: Add a `fullstack` hostgroup.
This will be used to apply checks only to czo.
2022-06-22 12:07:38 -07:00
Alex Vandiver b5ecfc327f nagios: Remove unnecessary `web` hostgroup.
This had identical membership to `frontends`.
2022-06-22 12:07:38 -07:00
Alex Vandiver 4be9025212 nagios: Remove redundant `postgresql` hostgroup.
This is implied by `postgresql_primary`.
2022-06-22 12:07:38 -07:00
Alex Vandiver d9d0014fb4 nagios: Rename `zmirror_main` into `zmirror` hostgroup.
`zmirror` itself was `zmirror_main` + `zmirrorp` but was unused; we
consistently just use the term `zmirror` for the non-personals server,
so use it as the hostgroup name.
2022-06-22 12:07:38 -07:00
Alex Vandiver 70c36985b4 nagios: Remove frontends from redis group.
The Redis nagios checks themselves are done against `redis` +
`frontends` groups, so there is no need to misleadingly place
`frontends` in the `redis` hostgroup.
2022-06-22 12:07:38 -07:00
Alex Vandiver 08127086bc nagios: Remove misleading "staging_frontends" from standalone.
No services are tested for the `staging_frontends` hostgroup, so this
does not alter the checks.
2022-06-22 12:07:38 -07:00
Alex Vandiver d804de871d nagios: Move staging and prod hostgroups adjacent. 2022-06-22 12:07:38 -07:00
Alex Vandiver 4c17f2bccc nagios: The frontends hostgroup now includes prod and staging frontends.
This lets the config file remove some repetition.
2022-06-22 12:07:38 -07:00
Alex Vandiver 1e81775fa0 nagios: Drop unhelpful hostgroup comment. 2022-06-22 12:07:38 -07:00
Alex Vandiver 7b584401ac nagios: Reformat hostgroups. 2022-06-22 12:07:38 -07:00
Alex Vandiver 93bcb86345 nagios: Reorder service checks. 2022-06-22 12:07:38 -07:00
Alex Vandiver eaaa2fbff8 nagios: Use canonical "hostgroup_name" consistently. 2022-06-22 12:07:38 -07:00
Alex Vandiver e8996b53a5 nagios: Remove unused has_swap hostgroup. 2022-06-22 12:07:38 -07:00
Alex Vandiver 33472ee9ff nagios: Remove unused stats host set. 2022-06-22 12:07:38 -07:00
Alex Vandiver bc4f4b4862 nagios: Make the pageable/not/flaky tri-state clearer. 2022-06-22 12:07:38 -07:00
Alex Vandiver c74f195fba nagios: Split AWS and non-AWS hosts, for ntp checks.
The non-AWS hosts cannot use the AWS ntp server for their check.
2022-06-22 12:07:38 -07:00
Alex Vandiver 872efdee58 nagios: Fold single- and multitornado_frontends back into frontends.
5abf4dee92 made this distinction, then multitornado_frontends was
never used; the singletornado_frontends alerting worked even for the
multiple-Tornado instances.

Remove the useless and misleading distinction.
2022-06-22 12:07:38 -07:00
Anders Kaseorg dc6af98e52 nginx: Add Cache-Control headers for Django-hashed static files.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-06-21 17:26:23 -07:00
Alex Vandiver 0645656fd8 process_fts_updates: Nagios may lack permissions to load Django config.
Even if Django and PostgreSQL are on the same host, the `nagios` user
may lack permissions to read accessory configuration files needed to
load the Django configuration (e.g. authentication keys).

Catch those failures, and switch to loading the required settings from
`/etc/zulip/zulip.conf`.
2022-06-21 12:50:13 -07:00
Anders Kaseorg a7f9c4f958 logging: Pass more format arguments to logging.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-06-03 12:27:23 -07:00
Alex Vandiver aa46d8d2a8 puppet: Enable strict typo checking in uwsgi. 2022-06-02 13:20:48 -07:00
Alex Vandiver 18ec3b6215 puppet: Enable background worker threads in uwsgi.
Without this, uwsgi does not release the GIL before going back into
`epoll_wait` to wait for the next request.  This results in any
background threads languishing, unserviced.[1]

Practically, this results in Sentry background reporter threads timing
out when attempting to post results -- but only in situations with low
traffic, as in those significant time is spent in `epoll_wait`.  This
is seen in logs as:

    WARN [urllib3.connectionpool] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)'))': /api/123456789/envelope/

Or:

    WARN [urllib3.connectionpool] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response'))': /api/123456789/envelope/

Sentry attempts to detect this and warn, but due to startup ordering,
the warning is not printed without lazy-loading.

Enable threads, at a miniscule performance cost, in order to support
background workers like Sentry[2].

[1] https://github.com/unbit/uwsgi/issues/1141#issuecomment-169042767
[2] https://docs.sentry.io/clients/python/advanced/#a-note-on-uwsgi
2022-06-02 13:20:48 -07:00
Alex Vandiver 919c904091 puppet: Give the uwsgi processes a shorter process name.
Previously, the complete command line, which is quite long, is shown:

    3963143 ?        SN     0:00 /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963144 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963145 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963146 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963147 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963148 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963149 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini

Configure uwsgi to rename and number the processes.  This results in:

    3907613 ?        SN     0:00 zulip-django uWSGI master
    3907614 ?        SN     0:05  \_ zulip-django uWSGI worker 1
    3907615 ?        SN     0:03  \_ zulip-django uWSGI worker 2
    3907616 ?        SN     0:05  \_ zulip-django uWSGI worker 3
    3907617 ?        SN     0:05  \_ zulip-django uWSGI worker 4
    3907618 ?        SN     0:05  \_ zulip-django uWSGI worker 5
    3907619 ?        SN     0:05  \_ zulip-django uWSGI worker 6
2022-06-02 13:20:48 -07:00
Alex Vandiver a522ad1d9a puppet: Always create a uwsgi master control socket.
This is potentially useful even with rolling restarts disabled.
2022-06-02 13:20:48 -07:00
Alex Vandiver 721a101f12 puppet: Reorganize and comment uwsgi.ini file.
As the uwsgi documentation is somewhat obtuse, more comments are added
here than might usually be.
2022-06-02 13:20:48 -07:00
Alex Vandiver 3741c1c034 puppet: Switch to checking time against the AWS timeserver.
Since this is what chrony is sync'ing to, it lessens the chance of
spurious firings of this alert.

See https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/
2022-05-31 22:57:32 -07:00
Alex Vandiver a201e3b25b puppet: Upgrade wal-g to 2.0.0. 2022-05-22 14:51:18 -07:00
Alex Vandiver c8ee53619d puppet: Upgrade go and smokescreen. 2022-05-22 14:51:18 -07:00
Alex Vandiver 4a5e530743 puppet: Upgrade Grafana to 8.5.3, for CVE-2022-29170. 2022-05-22 14:51:18 -07:00
Alex Vandiver baed1214f2 puppet: Only fix certbot certificates if https is enabled.
This is a reprise of c97162e485, but for the case where certbot
certs are no longer in use by way of enabling `http_only` and letting
another server handle TLS termination.

Fixes: #22034.
2022-05-17 15:03:44 -07:00
Alex Vandiver 62f234328d puppet: Include the OS-enabled nginx module configurations.
This allows system-level configuration to be done by `apt-get install`
of nginx modules, which place their load statements in this directory.

The initial import in ed0cb0a5f8 of the stock nginx config omitted
this include -- one potential explanation was in an effort to reduce
the memory footprint of the server.

The default nginx install enables:

    50-mod-http-auth-pam.conf
    50-mod-http-dav-ext.conf
    50-mod-http-echo.conf
    50-mod-http-geoip2.conf
    50-mod-http-geoip.conf
    50-mod-http-image-filter.conf
    50-mod-http-subs-filter.conf
    50-mod-http-upstream-fair.conf
    50-mod-http-xslt-filter.conf
    50-mod-mail.conf
    50-mod-stream.conf

While Zulip doesn't actively use any of these, they likely don't do
any harm to simply be loaded -- they are loaded into every nginx by
default.

Having the `modules-enabled` include allows easier extension of the
server, as neither of the existing wildcard
includes (`/etc/nginx/conf.d/*.conf` and
`/etc/nginx/zulip-include/app.d/*.conf`) are in the top context, and
thus able to load modules.
2022-05-17 15:03:07 -07:00
Alex Vandiver 814841c9ec puppet: Remove typo'd cron job.
54b6a83412 fixed the typo introduced in 49ad188449, but that does
not clean up existing installs which had the file with the wrong name
already.

Remove the file with the typo'd name, so two jobs do not race, and fix
the typo in the comment.
2022-05-16 14:57:21 -07:00
Alex Vandiver 20b7a2d450 puppet: Each worker should chdir after forking.
The top-level `chdir` setting only does the chdir once, at initial
`uwsgi` startup time.  Rolling restarts, however, however, require
that `uwsgi` pick up the _new_ value of the `current` directory, and
start new workers in that directory -- as currently implemented,
rolling restarts cannot restart into newer versions of the code, only
the same one in which they were started.

Use [configurable hooks][1] to execute the `chdir` after every fork.
This causes the following behaviour:

```
Thu May 12 18:56:55 2022 - chain reload starting...
Thu May 12 18:56:55 2022 - chain next victim is worker 1
Gracefully killing worker 1 (pid: 1757689)...
worker 1 killed successfully (pid: 1757689)
Respawned uWSGI worker 1 (new pid: 1757969)
Thu May 12 18:56:56 2022 - chain is still waiting for worker 1...
running "chdir:/home/zulip/deployments/current" (post-fork)...
Thu May 12 18:56:57 2022 - chain is still waiting for worker 1...
Thu May 12 18:56:58 2022 - chain is still waiting for worker 1...
Thu May 12 18:56:59 2022 - chain is still waiting for worker 1...
WSGI app 0 (mountpoint='') ready in 3 seconds on interpreter 0x55dfca409170 pid: 1757969 (default app)
Thu May 12 18:57:00 2022 - chain next victim is worker 2
[...]
```

..and so forth down the line of processes.  Each process is correctly
started in the _current_ value of `current`, and thus picks up the
correct code.

[1]: https://uwsgi-docs.readthedocs.io/en/latest/Hooks.html
2022-05-12 21:54:02 -07:00
Alex Vandiver 7f6a77da31 puppet: Add a redis exporter. 2022-05-03 17:13:44 -07:00
Anders Kaseorg e9ba9b0e0d zulip-ec2-configure-interfaces: Remove.
Our current EC2 systems don’t have an interface named ‘eth0’, and if
they did, this script would do nothing but crash with ImportError
because we have never installed boto.utils for Python 3.

(The message of commit 2a4d851a7c made
an effort to document for future researchers why this script should
not have been blindly converted to Python 3.  However, commit
2dc6d09c2a (#14278) was evidently
unresearched and untested.)

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-05-03 02:25:59 -07:00
Alex Vandiver d891b9590a puppet: Fix non-replicated PostgreSQL 10 and 11 configuration.
6f5ae8d13d removed the `$replication` variable from the
configurations of PostgreSQL 12 and higher, but left it in the
templates for PostgreSQL 10 and 11.  Because `undef != ''`,
deployments on PostgreSQL 10 and 11 started trying to push to S3
backups, regardless of if they were configured, leaving frequent log
messages like:

```
2022-04-30 12:45:47.805 UTC [626d24ec.1f8db0]: [107-1] LOG: archiver process (PID 2086106) exited with exit code 1
2022-04-30 12:45:49.680 UTC [626d24ee.1f8dc3]: [18-1] LOG: checkpoint complete: wrote 19 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=1.910 s, sync=0.022 s, total=1.950 s; sync files=16, longest=0.018 s, average=0.002 s; distance=49 kB, estimate=373 kB
/usr/bin/timeout: failed to run command "/usr/local/bin/env-wal-g": No such file or directory
2022-04-30 12:46:17.852 UTC [626d2f99.1fd4e9]: [1-1] FATAL: archive command failed with exit code 127
2022-04-30 12:46:17.852 UTC [626d2f99.1fd4e9]: [2-1] DETAIL: The failed archive command was: /usr/bin/timeout 10m /usr/local/bin/env-wal-g wal-push pg_wal/000000010000000300000080
```

Switch the PostgreSQL 10 and 11 configuration to check
`s3_backups_bucket`, like the other versions.
2022-05-02 16:46:10 -07:00
Anders Kaseorg 646a4d19a3 puppet: Remove quotes for enumerable values.
https://puppet.com/docs/puppet/7/style_guide.html#style_guide_module_design-quoting
“If a string is a value from an enumerable set of options, such as
present and absent, it SHOULD NOT be enclosed in quotes at all.”

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-29 22:06:46 -07:00
Alex Vandiver c97162e485 puppet: Check that certbot certs are in use before fixing them.
It is possible to have previously installed certbot, but switched back
to using self-signed certificates -- in which case renewing them using
certbot may fail.

Verify that the certificate is a symlink into certbot's output
directory before running `fix-standalone-certbot`.
2022-04-27 16:01:15 -07:00
Anders Kaseorg 098a514599 python: Use Python 3.8 shlex.join function.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-27 12:57:49 -07:00
Alex Vandiver 35db1ee435 puppet: Only include "app_service" section if there are apps.
This works around gravitational/teleport#12256, but also produces config
files that are slightly cleaner.
2022-04-26 16:36:13 -07:00
Anders Kaseorg a7e6cb7705 puppet: ‘supervisorctl stop all’ before restarting Supervisor.
This fixes a failure of the 3.4 upgrade test running on Ubuntu 20.04
with Supervisor 4.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-26 16:32:02 -07:00
Alex Vandiver e5548ecba0 puppet: Upgrade external dependencies. 2022-04-21 13:54:14 -07:00
Alex Vandiver 1151118cc8 puppet: Upgrade Grafana to 8.4.6. 2022-04-12 16:41:45 -07:00
Alex Vandiver 572443edc6 puppet: Remove memcached SASL workaround.
https://bugs.launchpad.net/ubuntu/+source/memcached/+bug/1878721 was
fixed and released in Focal in 2020-06-24.

We don't bother with an `ensure => absent` because leaving this
in-place for existing installs does no harm.
2022-04-08 14:59:45 -07:00
Anders Kaseorg 935cb605a5 puppet: Do not ensure Chrony is running.
Commit f6d27562fa (#21564) tried to
ensure Chrony is running, which fails in containers where Chrony
doesn’t have permission to update the host clock.

The Debian package should still attempt to start it, and Puppet should
still restart it when chrony.conf is modified.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-03-30 11:37:54 -07:00
Alex Vandiver f6d27562fa puppet: Configure chrony to use AWS-local NTP sources.
This prevents hosts from spewing traffic to random hosts across the
Internet.
2022-03-25 17:07:53 -07:00
Alex Vandiver 5e128e7cad puppet: Extract the wal-g configuration from the backups.
This will allow it to be used for monitoring, to check the state in S3
rather than just trusting the backups when they said they ran.
2022-03-25 17:05:30 -07:00
Alex Vandiver d7b59c86ce puppet: Build wal-g from source for aarch64.
Since wal-g does not provide binaries for aarch64, build them from
source.  While building them from source for arm64 would better ensure
that build process is tested, the build process takes 7min and 700M of
temp files, which is an unacceptable cost; we thus only build on
aarch64.

Since the wal-g build process uses submodules, which are not in the
Github export, we clone the full wal-g repository.  Because the
repository is relatively small, we clone it anew on each new version,
rather than attempt to manage the remotes.

Fixes #21070.
2022-03-22 15:02:35 -07:00
Alex Vandiver 4d4c320a07 puppet: Switch from ntp to chrony.
Chrony is the recommended time server for Ubuntu since 18.04[1], and
is the default on Redhat; it is more accurate, and has lower-memory
usage, than ntp, which is only getting best-effort security
maintenance.

See:
- https://wiki.ubuntu.com/BionicBeaver/ReleaseNotes#Chrony
- https://chrony.tuxfamily.org/comparison.html
- https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/
2022-03-22 13:07:27 -07:00
Alex Vandiver a2c8be9cd5 puppet: Increase download timeout from 5m to 10m.
The default timeout for `exec` commands in Puppet is 5 minutes[1].  On
slow connections, this may not be sufficient to download larger
downloads, such as the ~135MB golang tarball.

Increase the timeout to 10 minutes; this is a minimum download speed
of is ~225kB/s.

Fixes #21449.

[1]: https://puppet.com/docs/puppet/5.5/types/exec.html#exec-attribute-timeout
2022-03-21 15:47:04 -07:00
Alex Vandiver 9e850b08f3 puppet: Fix the PostgreSQL paths to recovery.conf / standby.conf. 2022-03-20 16:16:04 -07:00
Alex Vandiver 1bd5723cd2 puppet: Add a prometheus monitor for tornado processes. 2022-03-20 16:12:11 -07:00
Alex Vandiver 6b91652d9a puppet: Open the grok_exporter port.
The complete grok_exporter configuration is not ready to be committed,
but this at least prepares the way for it.
2022-03-20 16:12:11 -07:00
Alex Vandiver 6558655fc6 puppet: Add rabbitmq prometheus plugin, and open the firewall. 2022-03-20 16:12:11 -07:00
Alex Vandiver bdd2f35d05 puppet: Switch czo to using zulip_ops::app_frontend_monitoring.
This was clearly intended in f61ac4a28d
but never executed.
2022-03-20 16:12:11 -07:00
Alex Vandiver 17699bea44 puppet: postgresql_backups is auto-included if s3_backups_bucket is set.
Since 6496d43148.
2022-03-20 16:12:11 -07:00
Alex Vandiver bedc7c2986 puppet: Smokescreen is now auto-included in standalone.
Since c33562f0a8.
2022-03-20 16:12:11 -07:00
Alex Vandiver 6489c832a3 puppet: Upgrade third-party package versions. 2022-03-17 11:44:05 -07:00
Alex Vandiver d17006da55 puppet: Support setting an `ssl_mode` verification level. 2022-03-15 12:43:50 -07:00
Alex Vandiver 253bef27f5 puppet: Support password-based PostgreSQL replication. 2022-03-15 12:43:50 -07:00
Sahil Batra f0606b34ad user_groups: Add cron job for adding users to full members system group.
This commit adds a cron job which runs every hour to add the users to
full members system group if user is promoted to a full member.

This should ensure that full member status is available no more than
an hour after configuration suggests it should be.
2022-03-14 18:53:47 -07:00
Alex Vandiver 6f5ae8d13d puppet: wal-g backups are required for replication.
Previously, it was possible to configure `wal-g` backups without
replication enabled; this resulted in only daily backups, not
streaming backups.  It was also possible to enable replication without
configuring the `wal-g` backups bucket; this simply failed to work.

Make `wal-g` backups always streaming, and warn loudly if replication
is enabled but `wal-g` is not configured.
2022-03-11 10:09:35 -08:00
Alex Vandiver 6496d43148 puppet: Only s3_backups_bucket is required for backups.
`s3_backups_key` / `s3_backups_secret_key` are optional, as the
permissions could come from the EC2 instance's role.
2022-03-11 10:09:35 -08:00
Alex Vandiver 19beed2709 puppet: Default s3_region to the current ec2 region. 2022-03-11 10:09:35 -08:00
Anders Kaseorg b3260bd610 docs: Use Debian and Ubuntu version numbers over development codenames.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-23 12:04:24 -08:00
Anders Kaseorg 1629d6bfb3 python: Reformat with Black 22 (stable).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-18 18:03:13 -08:00
Alex Vandiver c656d933fa puppet: Switch from $::memorysize_mb to non-legacy $::memory. 2022-02-15 12:04:37 -08:00
Alex Vandiver f2f4462e71 puppet: Switch from $::fqdn to non-legacy $::networking. 2022-02-15 12:04:37 -08:00
Alex Vandiver bb4c0799cc puppet: Switch to the canonical case for $::os['family'].
The == operator in Puppet is case-insensitive for ASCII
characters[1], which is potentially surprising.  Switch to the
canonical case that `$::os['family']` returns.

[1] https://puppet.com/docs/puppet/5.5/lang_expressions.html#string-encoding-and-comparisons
2022-02-15 12:04:37 -08:00
Alex Vandiver d4eefbbeea puppet: Switch from $::osfamily to non-legacy $::os. 2022-02-15 12:04:37 -08:00
Alex Vandiver a787ebe0e2 puppet: Switch from $::architecture to non-legacy $::os. 2022-02-15 12:04:37 -08:00
Alex Vandiver d7e8733705 puppet: Use goarch for wal-g.
wal-g does not currently provide pre-built binaries for
arm64/aarch64 (see #21070) but if they begin to, it will likely be
with the goarch names.
2022-02-15 12:04:37 -08:00
Alex Vandiver abdbe4ca83 puppet: Use goarch for go-camo. 2022-02-15 12:04:37 -08:00
Alex Vandiver be2f2a5bde puppet: Use goarch for golang.
Fixes: #21051.
2022-02-15 12:04:37 -08:00
Alex Vandiver 788daa953b puppet: Factor out $::architecture case statement for golang. 2022-02-15 12:04:37 -08:00
Anders Kaseorg f6a701090c setup-apt-repos: Don’t install lsb_release.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-14 16:38:53 -08:00
Anders Kaseorg 45f4db9702 puppet: Remove unused $release_name.
It would confuse a future Debian 15.10 release with Ubuntu 15.10, it
relies on the legacy fact $::operatingsystemrelease, the modern fact
$::os provides this information without extra logic, and it’s unused
as of commit 03bffd3938.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-14 16:38:53 -08:00
Alex Vandiver 291c5e87b6 puppet: Upgrade prometheus to 2.33.1. 2022-02-09 20:32:24 -08:00
Alex Vandiver 2d538c2356 puppet: Upgrade grafana to 8.3.6. 2022-02-09 20:32:24 -08:00
Alex Vandiver f2e66c0b20 puppet: Upgrade go-camo to 2.4.0. 2022-02-09 20:32:24 -08:00
Alex Vandiver 51a516384d puppet: Upgrade golang to 1.17.6. 2022-02-09 20:32:24 -08:00
Alex Vandiver 48263a01dd puppet: Upgrade puppet libraries. 2022-02-09 20:32:24 -08:00
Alex Vandiver e032b38661 puppet: Fix typo in uwsgi exporter dependency. 2022-02-08 15:17:17 -08:00
Alex Vandiver b3900bec7e puppet: Upgrade Grafana to 8.3.5.
https://grafana.com/docs/grafana/latest/release-notes/release-notes-8-3-5/
2022-02-08 11:13:40 -08:00
Alex Vandiver a46f6df91e CVE-2021-43799: Write rabbitmq configuration before starting.
Zulip writes a `rabbitmq.config` configuration file which locks down
RabbitMQ to listen only on localhost:5672, as well as the RabbitMQ
distribution port, on localhost:25672.

The "distribution port" is part of Erlang's clustering configuration;
while it is documented that the protocol is fundamentally
insecure ([1], [2]) and can result in remote arbitrary execution of
code, by default the RabbitMQ configuration on Debian and Ubuntu
leaves it publicly accessible, with weak credentials.

The configuration file that Zulip writes, while effective, is only
written _after_ the package has been installed and the service
started, which leaves the port exposed until RabbitMQ or system
restart.

Ensure that rabbitmq's `/etc/rabbitmq/rabbitmq.config` is written
before rabbitmq is installed or starts, and that changes to that file
trigger a restart of the service, such that the ports are only ever
bound to localhost.  This does not mitigate existing installs, since
it does not force a rabbitmq restart.

[1] https://www.erlang.org/doc/apps/erts/erl_dist_protocol.html
[2] https://www.erlang.org/doc/reference_manual/distributed.html#distributed-erlang-system
2022-01-25 01:48:05 +00:00
Alex Vandiver 43d63bd5a1 puppet: Always set the RabbitMQ nodename to zulip@localhost.
This is required in order to lock down the RabbitMQ port to only
listen on localhost.  If the nodename is `rabbit@hostname`, in most
circumstances the hostname will resolve to an external IP, which the
rabbitmq port will not be bound to.

Installs which used `rabbit@hostname`, due to RabbitMQ having been
installed before Zulip, would not have functioned if the host or
RabbitMQ service was restarted, as the localhost restrictions in the
RabbitMQ configuration would have made rabbitmqctl (and Zulip cron
jobs that call it) unable to find the rabbitmq server.

The previous commit ensures that configure-rabbitmq is re-run after
the nodename has changed.  However, rabbitmq needs to be stopped
before `rabbitmq-env.conf` is changed; we use an `onlyif` on an `exec`
to print the warning about the node change, and let the subsequent
config change and notify of the service and configure-rabbitmq to
complete the re-configuration.
2022-01-25 01:48:02 +00:00
Alex Vandiver 3bfcfeac24 puppet: Run configure-rabbitmq on nodename change.
`/etc/rabbitmq/rabbitmq-env.conf` sets the nodename; anytime the
nodename changes, the backing database changes, and this requires
re-creating the rabbitmq users and permissions.

Trigger this in puppet by running configure-rabbitmq after the file
changes.
2022-01-25 01:46:51 +00:00
Alex Vandiver 694c4dfe8f puppet: Admit we leave epmd port 4369 open on all interfaces.
The Erlang `epmd` daemon listens on port 4369, and provides
information (without authentication) about which Erlang processes are
listening on what ports.  This information is not itself a
vulnerability, but may provide information for remote attackers about
what local Erlang services (such as `rabbitmq-server`) are running,
and where.

`epmd` supports an `ERL_EPMD_ADDRESS` environment variable to limit
which interfaces it binds on.  While this environment variable is set
in `/etc/default/rabbitmq-server`, Zulip unfortunately attempts to
start `epmd` using an explicit `exec` block, which ignores those
settings.

Regardless, this lack of `ERL_EPMD_ADDRESS` variable only controls
`epmd`'s startup upon first installation.  Upon reboot, there are two
ways in which `epmd` might be started, neither of which respect
`ERL_EPMD_ADDRESS`:

 - On Focal, an `epmd` service exists and is activated, which uses
   systemd's configuration to choose which interfaces to bind on, and
   thus `ERL_EPMD_ADDRESS` is irrelevant.

 - On Bionic (and Focal, due to a broken dependency from
   `rabbitmq-server` to `epmd@` instead of `epmd`, which may lead to
   the explicit `epmd` service losing a race), `epmd` is started by
   `rabbitmq-server` when it does not detect a running instance.
   Unfortunately, only `/etc/init.d/rabbitmq-server` would respects
   `/etc/default/rabbitmq-server` -- and it defers the actual startup
   to using systemd, which does not pass the environment variable
   down.  Thus, `ERL_EPMD_ADDRESS` is also irrelevant here.

We unfortunately cannot limit `epmd` to only listening on localhost,
due to a number of overlapping bugs and limitations:

 - Manually starting `epmd` with `-address 127.0.0.1` silently fails
   to start on hosts with IPv6 disabled, due to an Erlang bug ([1],
   [2]).

 - The dependencies of the systemd `rabbitmq-server` service can be
   fixed to include the `epmd` service, and systemd can be made to
   bind to `127.0.0.1:4369` and pass that socket to `epmd`, bypassing
   the above bug.  However, the startup of this service is not
   guaranteed, because it races with other sources of `epmd` (see
   below).

 - Any process that runs `rabbitmqctl` results in `epmd` being started
   if one is not currently running; these instances do not respect any
   environment variables as to which addresses to bind on.  This is
   also triggered by `service rabbitmq-server status`, as well as
   various Zulip cron jobs which inspect the rabbitmq queues.  As
   such, it is difficult-to-impossible to ensure that some other
   `epmd` process will not win the race and open the port on all
   interfaces.

Since the only known exposure from leaving port 4369 open is
information that rabbitmq is running on the host, and the complexity
of adjusting this to only bind on localhost is high, we remove the
setting which does not address the problem, and document that the port
is left open, and should be protected via system-level or
network-level firewalls.

[1]: https://bugs.launchpad.net/ubuntu/+source/erlang/+bug/1374109
[2]: https://github.com/erlang/otp/issues/4820
2022-01-25 01:46:51 +00:00