Commit Graph

1345 Commits

Author SHA1 Message Date
Alex Vandiver 2c5fc1827c puppet: Standardize what values are bools, and what true is.
For `no_serve_uploads`, `http_only`, which previously specified
"non-empty" to enable, this tightens what values are true.  For
`pgroonga` and `queue_workers_multiprocess`, this broadens the
possible values from `enabled`, and `true` respectively.
2022-01-07 12:08:10 -08:00
Alex Vandiver 1e672e4d82 puppet: Remove unused $no_serve_uploads in app_frontend. 2022-01-07 12:08:10 -08:00
Alex Vandiver 6218ed91c2 puppet: Use lazy-apps and uwsgi control sockets for rolling reloads.
Restarting the uwsgi processes by way of supervisor opens a window
during which nginx 502's all responses.  uwsgi has a configuration
called "chain reloading" which allows for rolling restart of the uwsgi
processes, such that only one process at once in unavailable; see
uwsgi documentation ([1]).

The tradeoff is that this requires that the uwsgi processes load the
libraries after forking, rather than before ("lazy apps"); in theory
this can lead to larger memory footprints, since they are not shared.
In practice, as Django defers much of the loading, this is not as much
of an issue.  In a very basic test of memory consumption (measured by
total memory - free - caches - buffers; 6 uwsgi workers), both
immediately after restarting Django, and after requesting `/` 60 times
with 6 concurrent requests:

                      |  Non-lazy  |  Lazy app  | Difference
    ------------------+------------+------------+-------------
    Fresh             |  2,827,216 |  2,870,480 |   +43,264
    After 60 requests |  3,332,284 |  3,409,608 |   +77,324
    ..................|............|............|.............
    Difference        |   +505,068 |   +539,128 |   +34,060

That is, "lazy app" loading increased the footprint pre-requests by
43MB, and after 60 requests grew the memory footprint by 539MB, as
opposed to non-lazy loading, which grew it by 505MB.  Using wsgi "lazy
app" loading does increase the memory footprint, but not by a large
percentage.

The other effect is that processes may be served by either old or new
code during the restart window.  This may cause transient failures
when new frontend code talks to old backend code.

Enable chain-reloading during graceful, puppetless restarts, but only
if enabled via a zulip.conf configuration flag.

Fixes #2559.

[1]: https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#chain-reloading-lazy-apps
2022-01-05 14:48:52 -08:00
Alex Vandiver 4a95967a33 puppet: Gather uwsgi stats from chat.zulip.org. 2022-01-03 21:26:57 -08:00
Alex Vandiver 8a5be972d2 puppet: Add a uwsgi exporter for monitoring.
This allows investigation of how many workers are busy, and to track
"harikari" terminations.
2022-01-03 15:25:58 -08:00
Alex Vandiver d6c40d24d4 puppet: Manage current smokescreen binary so it is not tidied.
Fix another tidy error caused by 1e4e6a09af23; as also noted in
f9a39b6703, these resources are necessary such that tidy does not
cleanup of smokescreen, and then force a recompilation of it again.
2022-01-03 15:24:42 -08:00
Alex Vandiver f9a39b6703 puppet: Manage extracted resources again.
1e4e6a09af removed the resources for the unpacked directory, on the
argument that they were unnecessary.  However, the directory (or file,
see below) that is unpacked must be managed, or it will be tidied on
the next puppet apply.

Add back the resource for `$dir`, but mark it `ensure => present`, to
support tarballs which only unpack to a single file (e.g. wal-g).
2022-01-02 12:11:53 -08:00
Alex Vandiver 54b6a83412 puppet: Fix typo in cron job name. 2021-12-31 17:39:53 -08:00
Alex Vandiver 941800cf12 puppet: Upgrade external dependencies. 2021-12-31 11:14:40 -08:00
Alex Vandiver 6f693d10d9 puppet: Fix version of node_exporter.
This was a copy/paste but introduced in f166f9f7d6.
2021-12-30 23:33:34 +00:00
Anders Kaseorg 82748d45d8 install-yarn: Use test -ef in case /srv is a symlink.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-12-30 13:42:07 -08:00
Alex Vandiver c094867a74 puppet: Add aarch64 build hashes to external dependencies.
wal-g does not ship aarch64 binaries, currently; the compilation
process([1]) is somewhat complicated, so we defer the decision about
how to support wal-g for aarch64 until a later date.

[1]: https://github.com/wal-g/wal-g/blob/master/docs/PostgreSQL.md#installing
2021-12-29 16:35:15 -08:00
Alex Vandiver f166f9f7d6 puppet: Centralize versions and sha256 hashes of external dependencies.
This will make it easier to update versions of these dependencies.
2021-12-29 16:35:15 -08:00
Alex Vandiver 57662689a9 puppet: Provide a constant homedir for grafana user.
The homedir of a user cannot be changed if any processes are running
as them, so having it change over time as upgrades happen will break
puppet application, as the old grafana process under supervisor will
effectively lock changes to the user's homedir.

Unfortunately, that means that this change will thus fail to
puppet-apply unless `supervisorctl stop grafana` is run first, but
there's no way around that.
2021-12-29 16:35:15 -08:00
Alex Vandiver 6e55e52694 puppet: Pull out grafana $data_dir. 2021-12-29 16:35:15 -08:00
Alex Vandiver 51d3862c7e puppet: Move wal-g to external_dep, in /srv/zulip-wal-g-*. 2021-12-29 16:35:15 -08:00
Alex Vandiver 1e4e6a09af puppet: Stop making resources for external binaries and directories.
In the event that extracting doesn't produce the binary we expected it
to, all this will do is create an _empty_ file where we expect the
binary to be.  This will likely muddle debugging.

Since the only reason the resourfce was made in the first place was to
make dependencies clear, switch to depending on the External_Dep
itself, when such a dependency is needed.
2021-12-29 16:35:15 -08:00
Alex Vandiver 3c163a7d5e puppet: Move slash out of $dir by convention. 2021-12-29 16:35:15 -08:00
Alex Vandiver bb5a2c8138 puppet: Move prometheus to external_dep. 2021-12-29 16:35:15 -08:00
Alex Vandiver 2d6c096904 puppet: Move node_exporter to external_dep. 2021-12-29 16:35:15 -08:00
Alex Vandiver d2a78bac7e puppet: Adjust wal-g release version and SHA256.
wal-g apparently removed the 1.1.1 release; replace it with the
equivalent rc.
2021-12-29 16:35:15 -08:00
Alex Vandiver 7a9074ecfd puppet: Use shorter local variable for supervisor conf.d dir. 2021-12-28 09:24:01 -08:00
Alex Vandiver 670fad0cc4 puppet: Drop now-unnecessary supervisor file removals. 2021-12-28 09:24:01 -08:00
Alex Vandiver 20eab264cf puppet: Remove dependency on scripts.lib.zulip_tools.
ab130ceb35 added a dependency on scripts.lib.zulip_tools; however,
check_postgresql_replication_lag is run on hosts which do not have a
zulip tree installed.

Inline the simple functions that were imported.
2021-12-14 14:48:53 -08:00
Alex Vandiver 71b56f7c1c puppet: process_fts_updates connects as nagios (or provided username).
It should not use the configured zulip username, but should instead
pull from the login user (likely `nagios`), or an explicit alternate
provided PostgreSQL username.  Failure to do so results in Nagios
failures because the `nagios` login does not have permissions to
authenticated the `zulip` PostgreSQL user.

This requires CI changes, as the install tests install as the `zulip`
login username, which allowed Nagios tests to pass previously; with
the custom database and username, however, they must be passed to
process_fts_updates explicitly when validating the install.
2021-12-14 14:48:53 -08:00
Alex Vandiver 9d67e37166 puppet: Nagios connects as itself, in check_postgresql_replication_lag. 2021-12-14 14:48:53 -08:00
Alex Vandiver 850bc4cc81 puppet: Create directory for redis PID file.
The Redis configuration, and the systemd file for it, assumes there
will be a pid file written to `/var/run/redis/redis.pid`, but
`/var/run/redis` is not created during installation.

Create `/run/redis`; as `/var/run` is a symlink to `/run` on systemd
systems, this is equivalent to `/var/run/redis`.
2021-12-13 12:42:15 -08:00
Alex Vandiver a6c2079502 puppet: Create memcached PID file that systemd config file specifies.
The systemd config file installed by the `memcached` package assumes
there will be a PID written to `/run/memcached/memcached.pid`.  Since we
override `memcached.conf`, we have omitted the line that writes out the
PID to this file.

Systemd is smart enough to not _need_ the PID file to start up the
service correctly, but match the configuration.  We create the
directory since the package does not do so.  It is created as
`/run/memcached` and not `/var/run/memcached` because `/var/run` is a
symlink to `/run`.
2021-12-13 12:42:15 -08:00
Alex Vandiver e4b23daad7 puppet: Upgrade to Grafana 8.3.2, for CVE-2021-43813. 2021-12-10 14:00:11 -08:00
Alex Vandiver 01e8f752a8 puppet: Use certbot package timer, not our own cron job.
The certbot package installs its own systemd timer (and cron job,
which disabled itself if systemd is enabled) which updates
certificates.  This process races with the cron job which Zulip
installs -- the only difference being that Zulip respects the
`certbot.auto_renew` setting, and that it passes the deploy hook.
This means that occasionally nginx would not be reloaded, when the
systemd timer caught the expiration first.

Remove the custom cron job and `certbot-maybe-renew` script, and
reconfigure certbot to always reload nginx after deploying, using
certbot directory hooks.

Since `certbot.auto_renew` can't have an effect, remove the setting.
In turn, this removes the need for `--no-zulip-conf` to
`setup-certbot`.  `--deploy-hook` is similarly removed, as running
deploy hooks to restart nginx is now the default; pass
`--no-directory-hooks` in standalone mode to not attempt to reload
nginx.  The other property of `--deploy-hook`, of skipping symlinking
into place, is given its own flog.
2021-12-09 13:47:33 -08:00
Alex Vandiver 053682964e puppet: Only fetch from running hosts in Grafana ec2 discovery. 2021-12-09 08:12:03 -08:00
Alex Vandiver 291f688678 puppet: Use zulip::external_dep for grafana, template config.
Templating the config ensures that the service is restarted when it is
upgraded.
2021-12-08 20:58:10 -08:00
Alex Vandiver 3eae429ab4 puppet: Upgrade Grafana to 8.3.1, for CVE-2021-43798. 2021-12-08 20:58:10 -08:00
Alex Vandiver 7db146d0a9 puppet: Do not assume amd64 architecture. 2021-12-06 11:08:50 -08:00
Alex Vandiver fb2d05f9e3 puppet: Remove unused 'builder' files.
These are leftover detritus from the "builder" host, which was removed
in 4c9a283542.
2021-12-06 10:21:50 -08:00
Alex Vandiver cb2d0ff32b postgresql: Support replication on PostgreSQL >= 11, document.
PostgreSQL 11 and below used a configuration file names
`recovery.conf` to manage replicas and standbys; support for this was
removed in PostgreSQL 12[1], and the configuration parameters were
moved into the main `postgresql.conf`.

Add `zulip.conf` settings for the primary server hostname and
replication username, so that the complete `postgresql.conf`
configuration on PostgreSQL 14 can continue to be managed, even when
replication is enabled.  For consistency, also begin writing out the
`recovery.conf` for PostgreSQL 11 and below.

In PostgreSQL 12 configuration and later, the `wal_level =
hot_standby` setting is removed, as `hot_standby` is equivalent to
`replica`, which is the default value[2].  Similarly, the
`hot_standby = on` setting is also the default[3].

Documentation is added for these features, and the commentary on the
"Export and Import" page referencing files under `puppet/zulip_ops/`
is removed, as those files no longer have any replication-specific
configuration.

[1]: https://www.postgresql.org/docs/current/recovery-config.html
[2]: https://www.postgresql.org/docs/12/runtime-config-wal.html#GUC-WAL-LEVEL
[3]: https://www.postgresql.org/docs/12/runtime-config-replication.html#GUC-HOT-STANDBY
2021-12-03 16:32:41 -08:00
Alex Vandiver 7d3399a970 puppet: Drop configuration files for unsupported PostgreSQL versions.
These are both unsupported by PostgreSQL itself, as well as by Zulip;
the removal of Ubuntu Xenial and Debian Stretch support in Zulip 3.0
removed the requirement for PostgreSQL 9.6, and the previous versions
date back yet farther.
2021-12-03 16:32:41 -08:00
Alex Vandiver 6436c4087d puppet: Tidy old wal-g binaries. 2021-12-03 16:17:50 -08:00
Alex Vandiver 53cc9538f7 puppet: Factor out wal-g binary path. 2021-12-03 16:17:50 -08:00
Alex Vandiver 338483792b puppet: Upgrade wal-g release to 1.1.1. 2021-12-03 16:17:50 -08:00
Anders Kaseorg 325b4bac7e env-wal-g: Quote $s3_backups_bucket.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-12-03 14:33:53 -08:00
Alex Vandiver f31bf3f06c puppet: Install camo on Docker.
Now that go-camo runs within supervisor, it can be run in Docker
simply.

Fixes #20101.
Fixes zulip/docker-zulip#179.
2021-12-02 09:25:00 -08:00
Alex Vandiver 358a7fb0c6 puppet: Read camo secret at startup time, not at puppet-apply time.
Writing the secret to the supervisor configuration file makes changes
to the secret requires a zulip-puppet-apply to take hold.  The Docker
image is constructed to avoid having to run zulip-puppet-apply on
startup, and indeed cannot run zulip-puppet-apply after having
configured secrets, as it has replaced the zulip.conf file with a
symlink, for example.  This means that camo gets the static secret
that was built into the image, and not the one regenerated on first
startup.

Read the camo secret at process startup time.  Because this pattern is
likely common with "12-factor" applications which can read from
environment variables, write a generic tool to map secrets to
environment variables before exec'ing a binary, and use that for Camo.
2021-12-02 09:25:00 -08:00
Alex Vandiver 86cf3be39f puppet: Fix pgroonga init for custom database names and users. 2021-11-20 07:13:50 -08:00
Alex Vandiver c514feaa22 puppet: Default go-camo to listening on localhost for standalone deploys.
The default in the previous commit, inherited from camo, was to bind
to 0.0.0.0:9292.  In standalone deployments, camo is deployed on the
same host as the nginx reverse proxy, and as such there is no need to
open it up to other IPs.

Make `zulip::camo` take an optional parameter, which allows overriding
it in puppet, but skips a `zulip.conf` setting for it, since it is
unlikely to be adjust by most users.
2021-11-19 15:58:26 -08:00
Alex Vandiver b982222e03 camo: Replace with go-camo implementation.
The upstream of the `camo` repository[1] has been unmaintained for
several years, and is now archived by the owner.  Additionally, it has
a number of limitations:
 - It is installed as a sysinit service, which does not run under
   Docker
 - It does not prevent access to internal IPs, like 127.0.0.1
 - It does not respect standard `HTTP_proxy` environment variables,
   making it unable to use Smokescreen to prevent the prior flaw
 - It occasionally just crashes, and thus must have a cron job to
   restart it.

Swap camo out for the drop-in replacement go-camo[2], which has the
same external API, requiring not changes to Django code, but is more
maintained.  Additionally, it resolves all of the above complaints.

go-camo is not configured to use Smokescreen as a proxy, because its
own private-IP filtering prevents using a proxy which lies within that
IP space.  It is also unclear if the addition of Smokescreen would
provide any additional protection over the existing IP address
restrictions in go-camo.

go-camo has a subset of the security headers that our nginx reverse
proxy sets, and which camo set; provide the missing headers with `-H`
to ensure that go-camo, if exposed from behind some other non-nginx
load-balancer, still provides the necessary security headers.

Fixes #18351 by moving to supervisor.
Fixes zulip/docker-zulip#298 also by moving to supervisor.

[1] https://github.com/atmos/camo
[2] https://github.com/cactus/go-camo
2021-11-19 15:58:26 -08:00
Alex Vandiver c33562f0a8 puppet: Default to installing smokescreen on application frontends.
This is an additional security hardening step, to make Zulip default
to preventing SSRF attacks.  The overhead of running Smokescreen is
minimal, and there is no reason to force deployments to take
additional steps in order to secure themselves against SSRF attacks.

Deployments which already have a different external proxy configured
will not gain a local Smokescreen installation, and running without
Smokescreen is supported by explicitly unsetting the `host` or `port`
values in `/etc/zulip/zulip.conf`.
2021-11-19 15:29:28 -08:00
Alex Vandiver 44f1ea6bae puppet: Split smokescreen into a non-profile version.
In a subsequent commit, we intend to include it from
`zulip::app_frontend_base`, which is a layering violation if it only
exists in the form of a profile.
2021-11-19 15:29:28 -08:00
Alex Vandiver c2ed3c22b5 puppet: Remove unused smokescreen symlink. 2021-11-19 15:29:28 -08:00
Alex Vandiver 47e16a5d41 puppet: Tidy old smokescreen binaries. 2021-11-19 15:29:28 -08:00
Alex Vandiver 239ac8413e puppet: Embed golang version into binary path, to rebuild on new golang.
This will cause the output binary path to be sensitive to golang
version, causing it to be rebuilt on new golang, and an updated
supervisor config file written out, and thus supervisor also
restarted.
2021-11-19 15:29:28 -08:00
Alex Vandiver 216eeba2dd puppet: Factor out smokescreen binary path. 2021-11-19 15:29:28 -08:00
Alex Vandiver 3a7cef6582 puppet: Switch smokescreen to using zulip::external_dep, so it tidies. 2021-11-19 15:29:28 -08:00
Alex Vandiver ea08111d60 puppet: Move /srv/smokescreen-src to /srv/zulip-smokescreen-src.
As with the previous commit for `/srv/golang`, we have the custom of
namespacing things under `/srv` with `zulip-` to help ensure that we
play nice with anything else that happens to be on the host.
2021-11-19 15:29:28 -08:00
Anders Kaseorg c64e1adb19 puppet: Upgrade Smokescreen v0.0.2-59-gbfca45c to v0.0.2-63-gdc40301.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-11-19 15:29:28 -08:00
Alex Vandiver bb9d2df1ae puppet: Extract an external-tarball-dependency manifest. 2021-11-19 15:29:28 -08:00
Alex Vandiver 3c8d7e2598 puppet: Tidy old golang directories.
This relies on behavior which is only in Puppet 5.5.1 and above, which
means it must be skipped on Ubuntu 18.04.
2021-11-19 15:29:28 -08:00
Alex Vandiver 2fc4acdf81 puppet: Move /srv/golang to /srv/zulip-golang.
We have the custom of namespacing things under `/srv` with `zulip-`
to help ensure that we play nice with anything else that happens
to be on the host.
2021-11-19 15:29:28 -08:00
Alex Vandiver 00a4abb642 puppet: Switch dependency to the golang binary we need. 2021-11-19 15:29:28 -08:00
Alex Vandiver 2d5f813094 puppet: Stop making a /srv/golang symlink.
Nothing needs this extra directory.
2021-11-19 15:29:28 -08:00
Alex Vandiver 93af6c7f06 puppet: Factor out golang variables. 2021-11-19 15:29:28 -08:00
Alex Vandiver 21be36f15f puppet: Shorten golang version variable name. 2021-11-19 15:29:28 -08:00
Alex Vandiver 6b9e74adee puppet: Upgrade golang from 1.16.4 to 1.17.3. 2021-11-19 15:29:28 -08:00
Alex Vandiver 514801c509 puppet: Split out golang toolchain into its own manifest. 2021-11-19 15:29:28 -08:00
Alex Vandiver 610a0b2d59 nagios: `pg_is_in_recovery()` is better to know replica/primary status.
It is possible to be in recovery, and downloading WAL logs from
archives, and not yet be replicating.  If one only checks the
streaming log status, it reports as "no replicas" which is technically
accurate but not a useful summation of the state of the replica.
2021-11-17 13:38:26 -08:00
Alex Vandiver 83091cbc96 puppet: Swap the one use of the `cron` resource for an /etc/cron.d file.
The `cron` resource places its contents in the user's crontab, which
makes it unlike every other cron job that Zulip installs.

Switch to using `/etc/cron.d` files, like all other cron jobs.
2021-11-16 16:17:32 -08:00
Alex Vandiver 90e1a0400e puppet: Add a few more inter-resource dependencies.
None of these are important; they just express semantic dependencies.
2021-11-16 16:17:32 -08:00
Alex Vandiver 49ad188449 rate_limit: Add a flag to lump all TOR exit node IPs together.
TOR users are legitimate users of the system; however, that system can
also be used for abuse -- specifically, by evading IP-based
rate-limiting.

For the purposes of IP-based rate-limiting, add a
RATE_LIMIT_TOR_TOGETHER flag, defaulting to false, which lumps all
requests from TOR exit nodes into the same bucket.  This may allow a
TOR user to deny other TOR users access to the find-my-account and
new-realm endpoints, but this is a low cost for cutting off a
significant potential abuse vector.

If enabled, the list of TOR exit nodes is fetched from their public
endpoint once per hour, via a cron job, and cached on disk.  Django
processes load this data from disk, and cache it in memcached.
Requests are spared from the burden of checking disk on failure via a
circuitbreaker, which trips of there are two failures in a row, and
only begins trying again after 10 minutes.
2021-11-16 11:42:00 -08:00
Alex Vandiver 01c007ceaf puppet: Remove an out-of-date comment.
Comment was missed in 9d57fa9759.
2021-11-09 21:52:17 -08:00
Alex Vandiver 7af2fa2e92 puppet: Use sysv status command, not supervisorctl status.
Since Supervisor 4, which is installed on Ubuntu 20.04 and Debian 11,
`supervisorctl status` returns exit code 3 if any of the
supervisor-controlled processes are not running.

Using `supervisorctl status` as the Puppet `status` command for
Supervisor leads to unnecessarily trying to "start" a Supervisor
process which is already started, but happens to have one or more of
its managed processes stopped.  This is an unnecessary no-op in
production environments, but in docker-init enviroments, such as in
CI, attempting to start the process a second time is an error.

Switch to checking if supervisor is running by way of sysv init.  This
fixes the potential error in CI, as well as eliminates unnecessary
"starts" of supervisor when it was already running -- a situation
which made zulip-puppet-apply not idempotent:

```
root@alexmv-prod:~# supervisorctl status
process-fts-updates                                             STOPPED   Nov 10 12:33 AM
smokescreen                                                     RUNNING   pid 1287280, uptime 0:35:32
zulip-django                                                    STOPPED   Nov 10 12:33 AM
zulip-tornado                                                   STOPPED   Nov 10 12:33 AM
[...]

root@alexmv-prod:~# ~zulip/deployments/current/scripts/zulip-puppet-apply --force
Notice: Compiled catalog for alexmv-prod.zulipdev.org in environment production in 2.32 seconds
Notice: /Stage[main]/Zulip::Supervisor/Service[supervisor]/ensure: ensure changed 'stopped' to 'running'
Notice: Applied catalog in 0.91 seconds

root@alexmv-prod:~# ~zulip/deployments/current/scripts/zulip-puppet-apply --force
Notice: Compiled catalog for alexmv-prod.zulipdev.org in environment production in 2.35 seconds
Notice: /Stage[main]/Zulip::Supervisor/Service[supervisor]/ensure: ensure changed 'stopped' to 'running'
Notice: Applied catalog in 0.92 seconds
```
2021-11-09 21:52:17 -08:00
Alex Vandiver 8a1bb43b23 puppet: Adjust for templated paths and settings, set C.UTF-8 locale. 2021-11-08 18:21:46 -08:00
Alex Vandiver d3e9a71d42 puppet: Check in upstream PostgreSQL 14 configuration file.
Note that one `<%u%%d>` has to be escaped as `<%%u%%d>`.
2021-11-08 18:21:46 -08:00
Adam Benesh c881430f4c puppet: Add WSGIApplicationGroup config to Apache SSO example.
Zulip apparently is now affected by a bad interaction between Apache's
WSGI using Python subinterpreters and C extension modules like `re2`
that are not designed for it.

The solution is apparently to set WSGIApplicationGroup to %{GLOBAL},
which disables Apache's use of Python subinterpreters.

See https://serverfault.com/questions/514242/non-responsive-apache-mod-wsgi-after-installing-scipy/514251#514251 for background.

Fixes #19924.
2021-10-08 15:07:23 -07:00
Tim Abbott 33b5fa633a process_fts_updates: Fix docker-zulip support.
In the series of migrations to this tool's configuration to support
specifying an arbitrary database name
(e.g. c17f502bb0), we broke support for
running process_fts_updates on the application server, connected to a
remote database server. That workflow is used by docker-zulip and
presumably other settings like Amazon RDS.

The fix is to import the Zulip virtualenv (if available) when running
on an application server.  This is better than just supporting this
case, since both docker-zulip and an Amazon RDS database are setting
where it would be inconvenient to run process-fts-updates directly on
the database server. (In the former case, because we want to avoid
having a strong version dependency on the postgres container).

Details are available in this conversation:
https://chat.zulip.org/#narrow/stream/49-development-help/topic/Logic.20in.20process_fts_updates.20seems.20to.20be.20broken/near/1251894

Thanks to Erik Tews for reporting and help in debugging this issue.
2021-09-27 18:17:33 -05:00
Alex Vandiver 1806e0f45e puppet: Remove zulip.org configuration. 2021-08-26 17:21:31 -07:00
Alex Vandiver 27881babab puppet: Increase prometheus storage, from the default 15d. 2021-08-24 23:40:43 -07:00
Alex Vandiver faf71eea41 upgrade-postgresql: Do not remove other supervisor configs.
We previously used `zulip-puppet-apply` with a custom config file,
with an updated PostgreSQL version but more limited set of
`puppet_classes`, to pre-create the basic settings for the new cluster
before running `pg_upgradecluster`.

Unfortunately, the supervisor config uses `purge => true` to remove
all SUPERVISOR configuration files that are not included in the puppet
configuration; this leads to it removing all other supervisor
processes during the upgrade, only to add them back and start them
during the second `zulip-puppet-apply`.

It also leads to `process-fts-updates` not being started after the
upgrade completes; this is the one supervisor config file which was
not removed and re-added, and thus the one that is not re-started due
to having been re-added.  This was not detected in CI because CI added
a `start-server` command which was not in the upgrade documentation.

Set a custom facter fact that prevents the `purge` behaviour of the
supervisor configuration.  We want to preserve that behaviour in
general, and using `zulip-puppet-apply` continues to be the best way
to pre-set-up the PostgreSQL configuration -- but we wish to avoid
that behaviour when we know we are applying a subset of the puppet
classes.

Since supervisor configs are no longer removed and re-added, this
requires an explicit start-server step in the instructions after the
upgrades complete.  This brings the documentation into alignment with
what CI is testing.
2021-08-24 19:00:58 -07:00
Alex Vandiver e46e862f2b puppet: Add a bare-bones zulipbot profile.
This sets up the firewalls appropriate for zulipbot, but does not
automate any of the configuration of zulipbot itself.
2021-08-24 16:05:58 -07:00
Alex Vandiver 5857dcd9b4 puppet: Configure ip6tables in parallel to ipv4.
Previously, IPv6 firewalls were left at the default all-open.

Configure IPv6 equivalently to IPv4.
2021-08-24 16:05:46 -07:00
Alex Vandiver 845509a9ec puppet: Be explicit that existing iptables are only ipv4. 2021-08-24 16:05:46 -07:00
Anders Kaseorg 09564e95ac mypy: Add types-psycopg2.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-09 20:32:19 -07:00
Alex Vandiver 4dd289cb9d puppet: Enable prometheus monitoring of supervisord.
To be able to read the UNIX socket, this requires running
node_exporter as zulip, not as prometheus.
2021-08-03 21:47:02 -07:00
Alex Vandiver aa940bce72 puppet: Disable hwmon collector, which does nothing on cloud hosts. 2021-08-03 21:47:02 -07:00
Alex Vandiver 23a355df0f puppet: Move backup time earlier, from 10am to 7pm America/Los_Angeles.
This is less likely to overlap with common evening deploy times.
2021-08-03 18:32:45 -05:00
Alex Vandiver e94b6afb00 nagios: Remove broken check_email_deliverer_* checks and related code.
These checks suffer from a couple notable problems:
 - They are only enabled on staging hosts -- where they should never
   be run.  Since ef6d0ec5ca, these supervisor processes are only
   run on one host, and never on the staging host.
 - They run as the `nagios` user, which does not have appropriate
   permissions, and thus the checks always fail.  Specifically,
   `nagios` does not have permissions to run `supervisorctl`, since
   the socket is owned by the `zulip` user, and mode 0700; and the
   `nagios` user does not have permission to access Zulip secrets to
   run `./manage.py print_email_delivery_backlog`.

Rather than rewrite these checks to run on a cron as zulip, and check
those file contents as the nagios user, drop these checks -- they can
be rewritten at a later point, or replaced with Prometheus alerting,
and currently serve only to cause always-failing Nagios checks, which
normalizes alert failures.

Leave the files installed if they currently exist, rather than
cluttering puppet with `ensure => absent`; they do no harm if they are
left installed.
2021-08-03 16:07:13 -07:00
Mateusz Mandera 57f14b247e bots: Specify realm for nagios bots messages in check_send_receive_time. 2021-07-26 15:33:13 -07:00
Alex Vandiver befe204be4 puppet: Run the supervisor-restart step only after it is started.
In an initial install, the following is a potential rule ordering:
```
Notice: /Stage[main]/Zulip::Supervisor/File[/etc/supervisor/conf.d/zulip]/ensure: created
Notice: /Stage[main]/Zulip::Supervisor/File[/etc/supervisor/supervisord.conf]/content: content changed '{md5}99dc7e8a1178ede9ae9794aaecbca436' to '{md5}7ef9771d2c476c246a3ebd95fab784cb'
Notice: /Stage[main]/Zulip::Supervisor/Exec[supervisor-restart]: Triggered 'refresh' from 1 event
[...]
Notice: /Stage[main]/Zulip::App_frontend_base/File[/etc/supervisor/conf.d/zulip/zulip.conf]/ensure: defined content as '{md5}d98ac8a974d44efb1d1bb2ef8b9c3dee'
[...]
Notice: /Stage[main]/Zulip::App_frontend_once/File[/etc/supervisor/conf.d/zulip/zulip-once.conf]/ensure: defined content as '{md5}53f56ae4b95413bfd7a117e3113082dc'
[...]
Notice: /Stage[main]/Zulip::Process_fts_updates/File[/etc/supervisor/conf.d/zulip/zulip_db.conf]/ensure: defined content as '{md5}96092d7f27d76f48178a53b51f80b0f0'
Notice: /Stage[main]/Zulip::Supervisor/Service[supervisor]/ensure: ensure changed 'stopped' to 'running'
```

The last line is misleading -- supervisor was already started by the
`supervisor-restart` process on the third line.  As can be shown with
`zulip-puppet-apply --debug`, the last line just installs supervisor
to run on startup, using `systemctl`:
```
Debug: Executing: 'supervisorctl status'
Debug: Executing: '/usr/bin/systemctl unmask supervisor'
Debug: Executing: '/usr/bin/systemctl start supervisor'
```

This means the list of processes started by supervisor depends
entirely on which configuration files were successfully written out by
puppet before the initial `supervisor-restart` ran.  Since
`zulip_db.conf` is written later than the rest, the initial install
often fails to start the `process-fts-updates` process.  In this
state, an explicit `supervisorctl restart` or `supervisorctl reread &&
supervisorctl update` is required for the service to be found and
started.

Reorder the `supervisor-restart` exec to only run after the service is
started.  Because all supervisor configuration files have a `notify`
of the service, this forces the ordering of:

```
(package) -> (config files) -> (service) -> (optional restart)
```

On first startup, this will start and them immediately restart
supervisor, which is unfortunate but unavoidable -- and not terribly
relevant, since the database will not have been created yet, and thus
most processes will be in a restart loop for failing to connect to it.
2021-07-22 14:09:01 -07:00
Alex Vandiver ee7c849f8a puppet: Work around sysvinit supervisor init bug.
The sysvinit script for supervisor has a long-standing bug where
`/etc/init.d/supervisor restart` stops but does not then start the
supervisor process.

Work around this by making restart then try to start, and return if it
is currently running.
2021-07-22 14:09:01 -07:00
Alex Vandiver 7e65421b1f puppet: Ensure psycopg2 is installed before running process_fts_updates.
Not having the package installed will cause startup failures in
`process_fts_updates`; ensure that we've installed the package before
we potentially start the service.
2021-07-14 17:24:52 -07:00
Alex Vandiver 528e5adaab smokescreen: Default to only listening on 127.0.0.1.
This prevents Smokescreen from acting as an open proxy.

Fixes #19214.
2021-07-14 15:40:26 -07:00
Alex Vandiver e6bae4f1dd puppet: Remove zulip::nagios class.
93f62b999e removed the last file in
puppet/zulip/files/nagios_plugins/zulip_nagios_server, which means the
singular rule in zulip::nagios no longer applies cleanly.

Remove the `zulip::nagios` class, as it is no longer needed.
2021-07-09 17:29:41 -07:00
Anders Kaseorg 93f62b999e nagios: Replace check_website_response with standard check_http plugin.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-07-09 16:47:03 -07:00
Vishnu KS e0f5fadb79 billing: Downgrade small realms that are behind on payments.
An organization with at most 5 users that is behind on payments isn't
worth spending time on investigating the situation.

For larger organizations, we likely want somewhat different logic that
at least does not void invoices.
2021-07-02 13:19:12 -07:00
Anders Kaseorg 91bfebca7d install: Replace wget with curl.
curl uses Happy Eyeballs to avoid long timeouts on systems with broken
IPv6.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-06-25 09:05:07 -07:00
Anders Kaseorg 3b60b25446 ci: Remove bullseye hack.
base-files 11.1 marked bullseye as Debian 11 in /etc/os-release.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-06-24 14:35:51 -07:00
Alex Vandiver d51272cc3d puppet: Remove zulip_deliver_scheduled_* from zulip-workers:*.
Staging and other hosts that are `zulip::app_frontend_base` but not
`zulip::app_frontend_once` do not have a
/etc/supervisor/conf.d/zulip/zulip-once.conf and as such do not have
`zulip_deliver_scheduled_emails` or `zulip_deliver_scheduled_messages`
and thus supervisor will fail to reload.

Making the contents of `zulip-workers` contingent on if the server is
_also_ a `-once` server is complicated, and would involve using Concat
fragments, which severely limit readability.

Instead, expel those two from `zulip-workers`; this is somewhat
reasonable, since they are use an entirely different codepath from
zulip_events_*, using the database rather than RabbitMQ for their
queuing.
2021-06-14 17:12:59 -07:00
Alex Vandiver 6c72698df2 puppet: Move zulip_ops supervisor config into /etc/supervisor/conf.d/zulip/.
This is similar cleanup to 3ab9b31d2f, but only affects zulip_ops
services; it serves to ensure that any of these services which are no
longer enabled are automatically removed from supervisor.

Note that this will cause a supervisor restart on all affected hosts,
which will restart all supervisor services.
2021-06-14 17:12:59 -07:00
Alex Vandiver df09607202 puppet: Switch to $zulip::common::supervisor_conf_dir variable. 2021-06-14 17:12:59 -07:00
Alex Vandiver 391f78a9c1 puppet: Move supervisor-not-in-/etc/supervisor/conf.d/ to common place. 2021-06-14 17:12:59 -07:00
Alex Vandiver dd90083ed7 puppet: Provide FQDN of self as URI, so the certificate validates.
Failure to do this results in:
```
psql: error: failed to connect to `host=localhost user=zulip database=zulip`: failed to write startup message (x509: certificate is valid for [redacted], not localhost)
```
2021-06-14 00:14:48 -07:00