wal-g was used in `puppet/zulip` by env-wal-g, but only installed in
`puppet/zulip_ops`.
Merge all of the dependencies of doing backups using wal-g (wal-g
installation, the pg_backup_and_purge job, the nagios plugin that
verifies it happens) into a common base class in `puppet/zulip`, since
it is generally useful.
No plugins are installed inside the /usr/local/munin/lib this creates
in munin-node, nor are they symlinked into /etc/munin/plugins, so
non-default plugins are added by this.
The one complexity is that hosts_fullstack are treated differently, as
they are not currently found in the manual `hosts` list, and as such
do not get munin monitoring.
In Bionic, nagios-plugins-basic is a transitional package which
depends on monitoring-plugins-basic. In Focal, it is a virtual
package, which means that every time puppet runs, it tries to
re-install the nagios-plugins-basic package.
Switch all instances to referring to `$zulip::common::nagios_plugins`,
and repoint that to monitoring-plugins-basic.
Support for Xenial and Stretch was removed (5154ddafca, 0f4b1076ad,
8944e0ad53, 79acd5ae40, 1219a2e854), but not all codepaths were
updated to remove their conditionals on it.
Remove all code predicated on Xenial or Stretch. debathena support
was migrated to Bionic, since that appears to be the current state of
existing debathena servers.
Since 9e8f1aacb3, zulip_ops machines
might have two Package declarations for `certbot`, which doesn't work
in puppet.
The fix is, as usual, to use our `zulip::safepackage` wrapper instead.
The style guide for Zulip is to always use "primary" and "replica"
when describing database replication. Adjust a few comments under
`puppet/` that do not adhere to this.
Unfortunately, some references still remain to the insensitive and
inaccurate "master" / "slave" terminology. However, these are only in
files which we are attempting to preserve as close to the upstream
versions they are derived from (e.g. postgresql.conf,
postfix/master.cf).
All differences between the primary and replica roles having been
merged, fold the `postgres_common`, `postgres_master`, and
`postgres_slave` roles into just `postgres_appdb`.
These values differed between the primary and secondary database
hosts, for unclear reasons. The differences date back to their
introduction in 387f63deaa. As the comment in the replica
confguration notes, settings of `vm.dirty_ratio = 10` and
`vm.dirty_background_ratio = 5` matched the kernel defaults for
"newer" kernels; however, kernel 2.6.30 bumped those to 20 and 10,
respectively[1], as a fix for underlying logic now being more correct.
Remove these overrides; they should at very least be consistent across
roles, and the previous values look to be an attempt to tune for a
very much older version of the Linux kernel, which was using an
different, buggier, algorithm under the hood.
[1] 1b5e62b42b
This file controls streaming replication, and recovery using wal-g on
the secondary. The `primary_conninfo` data needs to change on short
notice when database failover happens, in a way that is not suitable
for being controlled by puppet.
PostgreSQL 12, in fact, removes the use of the `recovery.conf` file[1];
the `primary_conninfo` and `restore_command` information goes into the
main `postgresql.conf` file, and the standby status is controlled by
the presence of absence of an empty `standby.signal` file.
Remove the puppet control of the `recovery.conf` file.
[1] https://pgstef.github.io/2018/11/26/postgresql12_preview_recovery_conf_disappears.html
Since the nagios authentication is stored _in the database_, it is
unnecessary to run if the database is simply a replica of the
production database. The only case in which this statement would have
an effect is if the postgres node contains a _different_ (or empty)
database, which `setup_disks` now effectively prevents.
Remove the unnecessary step.
481613a344 updated the `setup_disks` script to no longer reference
`mdadm`, since we no longer set up RAID on servers.
Update the puppet that would call it to remove the `mdadm` dependency,
and run only if the state is not what it produces -- namely, a symlink
for `/var/lib/postgresql`, which must point to an existent
`/srv/postgresql` directory.
As the previous commit, this is currently only used in tuning, but is
a property of the whole postgres configuration; move it there, as just
the directory, not the file.
Use this directory consistently in the erb templates. Since we
produce a `pg_hba.conf`, it makes sense that we point to the path that we
know that we explicitly wrote to, for instance.
1f565a9f41 removed the `package` lines which install
`python-dateutil`, but not the line in `puppet_ops` that reference it;
as such, Puppet manifests in puppet_ops fail to compile.
Remove the stale reference to `python-dateutil`, which is unnecessary
since the code is python3, not python2.
This is vestigial.
It requires manually altering the `htdigest` file (not stored in this
repo) to change the digest realm from `wiki` to `monitoring`, and will
re-prompt users for their passwords if the browsers currently store
them.
Using the `host` virtual package confused Puppet into reporting it was
doing work every time one did a puppet run, resulting in unnecessarily
spammy output.
The construction `su postgres -c -- bash -c 'psql …'` didn’t behave the
way it reads, and only worked by accident:
1. `-c --` sets the command to `--`.
2. `bash` sets the first argument to `bash`.
3. `-c 'psql …'` replaces the command with `psql …`.
Thus, `su` ended up executing `<shell> -c 'psql …' bash`, where
`<shell>` is the `postgres` user’s login shell, usually also `bash`,
which then executed 'psql …' and ignored the extra `bash`.
Unconfuse this construction.
Note from tabbott: The old code didn't even work by accident, it was
just broken. The right fix is to move the quoting around properly.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
It hasn't been working for years, but more importantly, it spams up
root's mail queue so that one can't find important things in there
(e.g. the fact that the long-term-idle cron job was failing).
Apparently, `puppet-lint` on Ubuntu trusty throws warnings for certain
quoting patterns that are OK in modern `puppet-lint`. I believe the
old Zulip code was actually correct (i.e. the old `puppet-lint`
implementation was the problem), but it seems worth changing anyway to
suppress the warnings.
We also exclude more of puppet-apt from linting, since it's
third-party code.
We fix these by adding ignore statements in a bunch of files
where this error popped up. We target only specific lines using
the ignore statements and not the entire files.
We can't really do this in the zulip manifests (since it's sorta a
sysadmin policy decision), but these scripts can cause significant
load when Nagios logs into a server (because many of them take 50ms or
more of work to run). So we just get rid of them.