Since backups may now taken on arbitrary hosts, we need a blackbox
monitor that _some_ backup was produced.
Add a Prometheus exporter which calls `wal-g backup-list` and reports
statistics about the backups.
This could be extended to include `wal-g wal-verify`, but that
requires a connection to the PostgreSQL server.
7c023042cf moved the logrotate configuration to being a templated
file, from a static file, but missed that the static file was still
referenced from `zulip_ops::app_frontend`; it only updated
`zulip::profile::app_frontend`. This caused errors in applying puppet
on any `zulip_ops::app_frontend` host.
Prior to 7c023042cf, the Puppet role was identical between those two
classes; deduplicate the rule by moving the updated template
definition into `zulip::app_frontend_base` which is common to those
two classes and not used in any other classes.
Following zulip/python-zulip-api/pull/758/, we're no longer using
python-zephyr, and don't need to build it from source. Additionally,
we no longer need to build a forked Zephyr package, since ZLoadSession
and ZDumpSession were merged in
e6a545e759.
To not change the `supervisor.conf` file, which requires a restart of
supervisor (and thus all services running under it, which is extremely
disruptive) we carefully leave the contents unchanged for most
installs, and append a new piece to the file, only for the zmirror
configuration, using `concat`.
There is no reason that the base node access method should be run
under supervisor, which exists primarily to give access to the `zulip`
user to restart its managed services. This access is unnecessary for
Teleport, and also causes unwanted restarts of Teleport services when
the `supervisor` base configuration changes. Additionally,
supervisor does not support the in-place upgrade process that Teleport
uses, as it replaces its core process with a new one.
Switch to installing a systemd configuration file (as generated by
`teleport install systemd`) for each part of Teleport, customized to
pass a `--config` path. As such, we explicitly disable the `teleport`
service provided by the package.
The supervisor process is shut down by dint of no longer installing
the file, which purges it from the managed directory, and reloads
Supervisor to pick up the removed service.
Puppet _always_ sets the `+x` bit on directories if they have the `r`
bit set for that slot[^1]:
> When specifying numeric permissions for directories, Puppet sets the
> search permission wherever the read permission is set.
As such, for instance, `0640` is actually applied as `0750`.
Fix what we "want" to match what puppet is applying, by adding the `x`
bit. In none of these cases did we actually intend the directory to
not be executable.
[1] https://www.puppet.com/docs/puppet/5.5/types/file.html#file-attribute-mode
A number of autossh connections are already left open for
port-forwarding Munin ports; autossh starts the connections and
ensures that they are automatically restarted if they are severed.
However, this represents a missed opportunity. Nagios's monitoring
uses a large number of SSH connections to the remote hosts to run
commands on them; each of these connections requires doing a complete
SSH handshake and authentication, which can have non-trivial network
latency, particularly for hosts which may be located far away, in a
network topology sense (up to 1s for a no-op command!).
Use OpenSSH's ability to multiplex multiple connections over a single
socket, to reuse the already-established connection. We leave an
explicit `ControlMaster no` in the general configuration, and not
`auto`, as we do not wish any of the short-lived Nagios connections to
get promoted to being a control socket if the autossh is not running
for some reason.
We enable protocol-level keepalives, to give a better chance of the
socket being kept open.
The `needrestart` tool added in 22.04 is useful in terms of listing
which services may need to be restarted to pick up updated libraries.
However, it prompts about the current state of services needing
restart for *every* subsequent `apt-get upgrade`, and defaulting core
services to restarting requires carefully manually excluding them
every time, at risk of causing an unscheduled outage.
Build a list of default-off services based on the list in
unattended-upgrades.
This check loads Django, and as such must be run as the zulip user.
Repeat the same pattern used elsewhere in nagios, of writing a state
file, which is read by `check_cron_file`.
Our current EC2 systems don’t have an interface named ‘eth0’, and if
they did, this script would do nothing but crash with ImportError
because we have never installed boto.utils for Python 3.
(The message of commit 2a4d851a7c made
an effort to document for future researchers why this script should
not have been blindly converted to Python 3. However, commit
2dc6d09c2a (#14278) was evidently
unresearched and untested.)
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The homedir of a user cannot be changed if any processes are running
as them, so having it change over time as upgrades happen will break
puppet application, as the old grafana process under supervisor will
effectively lock changes to the user's homedir.
Unfortunately, that means that this change will thus fail to
puppet-apply unless `supervisorctl stop grafana` is run first, but
there's no way around that.
In the event that extracting doesn't produce the binary we expected it
to, all this will do is create an _empty_ file where we expect the
binary to be. This will likely muddle debugging.
Since the only reason the resourfce was made in the first place was to
make dependencies clear, switch to depending on the External_Dep
itself, when such a dependency is needed.