Commit Graph

362 Commits

Author SHA1 Message Date
Anders Kaseorg 93f62b999e nagios: Replace check_website_response with standard check_http plugin.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-07-09 16:47:03 -07:00
Vishnu KS e0f5fadb79 billing: Downgrade small realms that are behind on payments.
An organization with at most 5 users that is behind on payments isn't
worth the time it would take to investigate the situation.

For larger organizations, we likely want somewhat different logic that
at least does not void invoices.
2021-07-02 13:19:12 -07:00
Alex Vandiver 6c72698df2 puppet: Move zulip_ops supervisor config into /etc/supervisor/conf.d/zulip/.
This is similar cleanup to 3ab9b31d2f, but only affects zulip_ops
services; it serves to ensure that any of these services which are no
longer enabled are automatically removed from supervisor.

Note that this will cause a supervisor restart on all affected hosts,
which will restart all supervisor services.
2021-06-14 17:12:59 -07:00
Alex Vandiver dd90083ed7 puppet: Provide FQDN of self as URI, so the certificate validates.
Failure to do this results in:
```
psql: error: failed to connect to `host=localhost user=zulip database=zulip`: failed to write startup message (x509: certificate is valid for [redacted], not localhost)
```
2021-06-14 00:14:48 -07:00
Alex Vandiver c90ff80084 puppet: Bump grafana version to 8.0.1.
Most notably, this fixes an annoying bug with CloudWatch metrics being
repeated in graphs.
2021-06-10 15:49:08 -07:00
Alex Vandiver d905eb6131 puppet: Add a database teleport server.
Host-based md5 auth for 127.0.0.1 must be removed from `pg_hba.conf`,
otherwise password authentication is preferred over certificate-based
authentication for localhost.
2021-06-08 22:21:21 -07:00
Alex Vandiver 100a899d5d puppet: Add grafana server. 2021-06-08 22:21:00 -07:00
Alex Vandiver 459f37f041 puppet: Add prometheus server. 2021-06-08 22:21:00 -07:00
Alex Vandiver 19fb58e845 puppet: Add prometheus node exporter. 2021-06-08 22:21:00 -07:00
Alex Vandiver a2b1009ed5 puppet: Turn on "authentication" which defaults to user with all rights.
Nagios refuses to allow any modifications with use_authentication off;
re-enable "authentication" but set a default user, which (by way of
the `*` permissions in 359f37389a) is allowed to take all actions.
2021-06-08 15:19:28 -07:00
Alex Vandiver 61b6fc865c puppet: Add a label to teleport applications, to allow RBAC.
Roles can only grant or deny access based on labels; set one based on
the application name.
2021-06-08 15:19:04 -07:00
Alex Vandiver 4aff5b1d22 puppet: Allow access to `/` in nagios.
This was a regression in 51b985b40d.
2021-06-07 22:40:58 -07:00
Alex Vandiver 54768c2210 puppet: Remove now-unused basic auth support files.
51b985b40d made these unnecessary.
2021-06-07 16:17:45 -07:00
Alex Vandiver 359f37389a puppet: Remove in-nagios auth restrictions.
51b985b40d made nagios only accessible from localhost, or as proxied
via teleport.  Remove the HTTP-level auth requirements.
2021-06-07 16:17:45 -07:00
Alex Vandiver 2352fac6b5 puppet: Fix indentation. 2021-06-02 18:38:38 -07:00
Alex Vandiver 51b985b40d puppet: Move nagios to behind teleport.
This makes the server only accessible via localhost, by way of the
Teleport application service.
2021-06-02 18:38:38 -07:00
Alex Vandiver 4f51d32676 puppet: Add a teleport application server.
This requires switching to a reverse tunnel for the auth connection,
with the side effect that the `zulip_ops::teleport::node` manifest can
be applied on servers anywhere on the Internet; they do not need to
have any publicly-available open ports.
2021-06-02 18:38:38 -07:00
Alex Vandiver c59421682f puppet: Add a teleport node on every host.
Teleport nodes[1] are the equivalent of SSH servers.  In addition to
this config, joining the teleport cluster will require presenting a
one-time "join token" from the proxy server[2], which may either be
short-lived or static.

[1] https://goteleport.com/docs/architecture/nodes/
[2] https://goteleport.com/docs/admin-guide/#adding-nodes-to-the-cluster
2021-06-02 18:38:38 -07:00
Alex Vandiver 1cdf14d195 puppet: Add a teleport server.
See https://goteleport.com/docs/architecture/overview/ for the general
architecture of a Teleport cluster.  This commit adds a Teleport auth[1]
and proxy[2] server.  The auth server serves as a CA for granting
time-bounded access to users and authenticating nodes on the cluster;
the proxy provides access and a management UI.

[1] https://goteleport.com/docs/architecture/authentication/
[2] https://goteleport.com/docs/architecture/proxy/
2021-06-02 18:38:38 -07:00
Alex Vandiver 3ebd627c50 puppet: Fix "import" -> "include" in chat_zulip_org. 2021-06-02 11:02:34 -07:00
Alex Vandiver 2130fc0645 puppet: Add an explicit class for czo. 2021-06-01 22:18:50 -07:00
Alex Vandiver c9141785fd puppet: Use concat fragments to place port allows next to services.
This means that services will only open their ports if they are
actually run, without having to clutter rules.v4 with a lot of `if`
statements.

This does not go as far as using `puppetlabs/firewall`[1] because that
would represent an additional DSL to learn; raw IPtables sections can
easily be inserted into the generated iptables file via
`concat::fragment` (either inline, or as a separate file), but config
can be centralized next to the appropriate service.

[1] https://forge.puppet.com/modules/puppetlabs/firewall
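
As a rough illustration of the fragment idea, here is a simplified Python
model; it is not the Puppet code from this commit. The `Fragment` class,
the `order` strings, the service names, and the iptables rules are all
hypothetical, and sorting by `order` only approximates how
`concat::fragment` pieces are combined into one file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One piece of the generated file (hypothetical model of a fragment)."""
    order: str    # fragments are emitted sorted by this key
    content: str  # raw text contributed to the final file


def assemble(fragments):
    """Join fragment contents after sorting them by their order key."""
    ordered = sorted(fragments, key=lambda f: f.order)
    return "".join(f.content for f in ordered)


def firewall_fragments(enabled_services):
    """Build header/footer fragments plus one rule per enabled service."""
    # Hypothetical per-service rules; each would live next to its service.
    service_rules = {
        "smokescreen": "-A INPUT -p tcp --dport 4750 -j ACCEPT\n",
        "camo": "-A INPUT -p tcp --dport 9292 -j ACCEPT\n",
    }
    fragments = [
        Fragment("01", "*filter\n:INPUT DROP [0:0]\n"),
        Fragment("99", "COMMIT\n"),
    ]
    for name in enabled_services:
        fragments.append(Fragment("20", service_rules[name]))
    return fragments


rules = assemble(firewall_fragments(["smokescreen"]))
print(rules)
```

Only the services that are actually enabled contribute a rule, so no
conditional logic is needed in the shared header or footer.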
2021-05-27 21:14:48 -07:00
Alex Vandiver 4f79b53825 puppet: Factor out firewall config. 2021-05-27 21:14:48 -07:00
Alex Vandiver f3eea72c2a setup: Merge multiple setup-apt-repo scripts into one.
This moves the `.asc` files into subdirectories, and writes out the
corresponding `.list` files into them.  It moves from templates to
written-out `.list` files for clarity and ease of
implementation (Debian and Ubuntu need different templates for
`zulip`), and as a way of making explicit which releases are supported
for each list.  For the special case of the PGroonga signing key, we
source an additional file within the directory.

This simplifies the process for adding another class of `.list` file.
2021-05-26 14:42:29 -07:00
Alex Vandiver 4f017614c5 nagios: Replace check_fts_update_log with a process_fts_updates flag.
This avoids having to duplicate the connection logic from
process_fts_updates.

Co-authored-by: Adam Birds <adam.birds@adbwebdesigns.co.uk>
2021-05-25 13:56:05 -07:00
Alex Vandiver 116e41f1da puppet: Move files out and back when mounting /srv.
Specifically, this affects /srv/zulip-aws-tools.
2021-05-23 13:29:23 -07:00
Alex Vandiver ea98549e88 puppet: Always install linux-image-virtual, for ksplice support. 2021-05-23 13:29:23 -07:00
Alex Vandiver 0b1dd27841 puppet: AWS mounts its extra disks with inconsistent names.
It is now /dev/nvme1n1, not /dev/nvme0n1; but it always has a
consistent major/minor node.  Source the file that defines these.
2021-05-23 13:29:23 -07:00
Alex Vandiver 033a96aa5d puppet: Fix check_ssl_certificate check to check named host, not self. 2021-05-17 18:38:30 -07:00
Alex Vandiver feb7870db7 puppet: Adjust thresholds on autovac_freeze.
These thresholds are relative to
`autovacuum_freeze_max_age`, *not* the XID wraparound, which happens
at 2^31-1.  As such, it is *perfectly normal* that they hit 100%, and
then autovacuum kicks in and brings it back down.  The unusual
condition is that PostgreSQL pushes past the point where an autovacuum
would be triggered -- therein lies the XID wraparound danger.

With the `autovacuum_freeze_max_age` set to 2000000000 in
`postgresql.conf`, XID wraparound happens at 107.3%.  Set the warning
and error thresholds to below this, but above 100% so this does not
trigger constantly.
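
The arithmetic behind that percentage can be checked with a short Python
sketch. The function names are hypothetical and the sample threshold
values in the usage line are placeholders, not the values chosen in this
commit; only the two constants come from the message above.

```python
# XID wraparound point quoted in the message: 2^31 - 1.
XID_WRAPAROUND = 2**31 - 1

# autovacuum_freeze_max_age as set in postgresql.conf, per the message.
AUTOVACUUM_FREEZE_MAX_AGE = 2_000_000_000


def percent_of_freeze_max_age(xid_age):
    """Express a transaction ID age as a percentage of the freeze max age."""
    return 100.0 * xid_age / AUTOVACUUM_FREEZE_MAX_AGE


def thresholds_are_sane(warning, critical):
    """Check that both thresholds sit above 100% but below wraparound."""
    limit = percent_of_freeze_max_age(XID_WRAPAROUND)
    return 100.0 < warning < critical < limit


wraparound_percent = percent_of_freeze_max_age(XID_WRAPAROUND)
print(f"wraparound at {wraparound_percent:.2f}% of autovacuum_freeze_max_age")
print(thresholds_are_sane(101.0, 105.0))  # placeholder thresholds
```

The computed value is roughly 107.37%, consistent with the 107.3% figure
above when truncated to one decimal place.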
2021-05-11 17:11:47 -07:00
Anders Kaseorg 544bbd5398 docs: Fix capitalization mistakes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-05-10 09:57:26 -07:00
Anders Kaseorg 9d57fa9759 puppet: Use pgrep -x to avoid accidental matches.
Matching the full process name (-x without -f) or full command
line (-xf) is less prone to mistakes like matching a random substring
of some other command line or pgrep matching itself.
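
The difference can be modelled in Python as a substring search versus a
full match. This is only an analogy: `pgrep` uses its own regular
expression dialect, and the process names and command line below are
hypothetical examples.

```python
import re


def matches_anywhere(pattern, text):
    """Model of an unanchored match: the pattern may hit any substring."""
    return re.search(pattern, text) is not None


def matches_exactly(pattern, text):
    """Model of an exact match: the pattern must cover the whole text."""
    return re.fullmatch(pattern, text) is not None


# Hypothetical process names: one intended target, one lookalike.
names = ["cron", "cron-wrapper"]
print([n for n in names if matches_anywhere("cron", n)])
print([n for n in names if matches_exactly("cron", n)])

# Hypothetical command line of a shell that happens to mention the pattern.
wrapper_cmdline = "sh -c 'pgrep -f my_worker.py'"
print(matches_anywhere("my_worker.py", wrapper_cmdline))
print(matches_exactly("my_worker.py", wrapper_cmdline))
```

In this model the unanchored search accepts both the lookalike name and
the wrapper command line, while the exact match accepts neither.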

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-05-07 08:54:41 -07:00
Alex Vandiver 6ee74b3433 puppet: Check health of APT repository. 2021-03-23 19:27:42 -07:00
Alex Vandiver c01345d20c puppet: Add nagios check for long-lived certs that do not auto-renew. 2021-03-23 19:27:27 -07:00
Alex Vandiver 9ea86c861b puppet: Add a nagios alert configuration for smokescreen.
This verifies that the proxy is working by accessing a
highly-available website through it.  Since failure of this equates to
failures of Sentry notifications and Android mobile push
notifications, this is a paging service.
2021-03-18 10:11:15 -07:00
Anders Kaseorg 129ea6dd11 nginx: Consistently listen on IPv6 and with HTTP/2.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-03-17 17:46:32 -07:00
Alex Vandiver 06c07109e4 puppet: Add missing semicolons left off in ba3b88c81b. 2021-03-12 15:48:53 -08:00
Alex Vandiver ba3b88c81b puppet: Explicitly use the snakeoil certificates for nginx.
In production, the `wildcard-zulipchat.com.combined-chain.crt` file is
just a symlink to the snakeoil certificates; but we do not puppet that
symlink, which makes new hosts fail to start cleanly.  Instead, point
explicitly to the snakeoil certificate, and explain why.
2021-03-12 13:31:54 -08:00
Alex Vandiver 306bf930f5 puppet: Add a warning if ksplice is enabled but has no key set. 2021-03-10 17:57:20 -08:00
Alex Vandiver a215c83c2d puppet: Switch to more explicit variable rather than reuse a nagios one.
Redis is not nagios, and this only leads to confusion as to why there
is a nagios domain setting on frontend servers; it also leaves the
`redis0` part of the name buried in the template.

Switch to an explicit variable for the redis hostname.
2021-03-10 11:44:54 -08:00
Alex Vandiver a5b29398fc puppet: Only install ksplice uptrack if there is an access key. 2021-03-10 11:44:11 -08:00
Alex Vandiver d938dd9d4a puppet: Document smokescreen installation, and move to puppet/zulip/.
This is more broadly useful than for just Kandra; provide
documentation and means to install Smokescreen for stand-alone
servers, and motivate its use somewhat more.
2021-03-02 17:16:38 -08:00
Alex Vandiver 2f5eae5c68 puppet: Minor formatting. 2021-02-28 17:03:29 -08:00
Alex Vandiver a759d26a32 puppet: Make ksplice config not world-readable, use 'adm' group.
This matches the configuration that ksplice itself creates the file
and directory with.
2021-02-28 17:03:29 -08:00
Tim Abbott 957c16aa77 nagios: Tweak prod load monitoring parameters.
Ultimately this monitoring isn't that helpful, but we're mainly
interested in when it spikes to very high numbers.
2021-02-26 08:39:52 -08:00
Alex Vandiver 32149c6a1c puppet: Add ksplice uptrack for kernel hotpatches. 2021-02-25 18:05:47 -08:00
Alex Vandiver 173d2dec3d puppet: Check in defensive restart-camo cron job.
This was found on lb1; add it to the camo install on smokescreen.
2021-02-24 16:42:21 -08:00
Alex Vandiver 0b736ef4cf puppet: Remove puppet_ops configuration for separate loadbalancer host. 2021-02-22 16:05:13 -08:00
Alex Vandiver e30b524896 iptables: Limit smokescreen port 4750, add camo port.
Limit incoming connections to port 4750 to only the smokescreen host,
and also allow access to the Camo server on that host, on port 9292.
2021-02-17 13:52:38 -08:00
Alex Vandiver a88af1b5a2 camo: Install on smokescreen host. 2021-02-16 08:12:31 -08:00