zulip

Commit Graph

Author	SHA1	Message	Date
Alex Vandiver	3aba2789d3	prometheus: Add an exporter for wal-g backup properties. Since backups may now be taken on arbitrary hosts, we need a blackbox monitor that _some_ backup was produced. Add a Prometheus exporter which calls `wal-g backup-list` and reports statistics about the backups. This could be extended to include `wal-g wal-verify`, but that requires a connection to the PostgreSQL server.	2023-04-26 15:41:39 -07:00
Alex Vandiver	89e366771a	prometheus: Add a postgres exporter.	2023-03-30 16:16:18 -07:00
Alex Vandiver	1a65315566	puppet: Switch teleport to running under systemd, not supervisord. There is no reason that the base node access method should be run under supervisor, which exists primarily to give access to the `zulip` user to restart its managed services. This access is unnecessary for Teleport, and also causes unwanted restarts of Teleport services when the `supervisor` base configuration changes. Additionally, supervisor does not support the in-place upgrade process that Teleport uses, as it replaces its core process with a new one. Switch to installing a systemd configuration file (as generated by `teleport install systemd`) for each part of Teleport, customized to pass a `--config` path. As such, we explicitly disable the `teleport` service provided by the package. The supervisor process is shut down by dint of no longer installing the file, which purges it from the managed directory, and reloads Supervisor to pick up the removed service.	2023-03-15 17:23:42 -04:00
Alex Vandiver	521ec5885b	puppet: Rename autossh tunnel, as it is no longer for just munin.	2022-11-01 22:24:40 -07:00
Alex Vandiver	42f84a8cc7	puppet: Use existing autossh tunnels as OpenSSH "master" sockets. A number of autossh connections are already left open for port-forwarding Munin ports; autossh starts the connections and ensures that they are automatically restarted if they are severed. However, this represents a missed opportunity. Nagios's monitoring uses a large number of SSH connections to the remote hosts to run commands on them; each of these connections requires doing a complete SSH handshake and authentication, which can have non-trivial network latency, particularly for hosts which may be located far away, in a network topology sense (up to 1s for a no-op command!). Use OpenSSH's ability to multiplex multiple connections over a single socket, to reuse the already-established connection. We leave an explicit `ControlMaster no` in the general configuration, and not `auto`, as we do not wish any of the short-lived Nagios connections to get promoted to being a control socket if the autossh is not running for some reason. We enable protocol-level keepalives, to give a better chance of the socket being kept open.	2022-11-01 22:24:40 -07:00
Alex Vandiver	e05a0dcf98	puppet: Support FQDNs in puppet zulip.conf names.	2022-11-01 22:24:40 -07:00
Alex Vandiver	951dc68f3a	autossh: Drop unnecessary -2 option. The -2 option is a no-op.	2022-11-01 22:24:40 -07:00
Alex Vandiver	499284d2fd	nagios: Split postgresql into primary and replica. Replication checks should only run on primary and replicas, not standalone hosts; while `autovac_freeze` currently only runs on primary hosts, it functions identically on replicas, and is fine to run there. Make `autovac_freeze` run on all `postgresql` hosts, and make standalone hosts no longer `postgres_primary`, so they do not fail the replication tests.	2022-06-22 12:07:38 -07:00
Alex Vandiver	775a084d0f	nagios: Add a catchall "other" set.	2022-06-22 12:07:38 -07:00
Alex Vandiver	2a14aa5180	nagios: Add a `fullstack` hostgroup. This will be used to apply checks only to czo.	2022-06-22 12:07:38 -07:00
Alex Vandiver	b5ecfc327f	nagios: Remove unnecessary `web` hostgroup. This had identical membership to `frontends`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	4be9025212	nagios: Remove redundant `postgresql` hostgroup. This is implied by `postgresql_primary`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	d9d0014fb4	nagios: Rename `zmirror_main` into `zmirror` hostgroup. `zmirror` itself was `zmirror_main` + `zmirrorp` but was unused; we consistently just use the term `zmirror` for the non-personals server, so use it as the hostgroup name.	2022-06-22 12:07:38 -07:00
Alex Vandiver	08127086bc	nagios: Remove misleading "staging_frontends" from standalone. No services are tested for the `staging_frontends` hostgroup, so this does not alter the checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	d804de871d	nagios: Move staging and prod hostgroups adjacent.	2022-06-22 12:07:38 -07:00
Alex Vandiver	4c17f2bccc	nagios: The frontends hostgroup now includes prod and staging frontends. This lets the config file remove some repetition.	2022-06-22 12:07:38 -07:00
Alex Vandiver	93bcb86345	nagios: Reorder service checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	33472ee9ff	nagios: Remove unused stats host set.	2022-06-22 12:07:38 -07:00
Alex Vandiver	bc4f4b4862	nagios: Make the pageable/not/flaky tri-state clearer.	2022-06-22 12:07:38 -07:00
Alex Vandiver	c74f195fba	nagios: Split AWS and non-AWS hosts, for ntp checks. The non-AWS hosts cannot use the AWS ntp server for their check.	2022-06-22 12:07:38 -07:00
Alex Vandiver	872efdee58	nagios: Fold single- and multitornado_frontends back into frontends. `5abf4dee92` made this distinction, then multitornado_frontends was never used; the singletornado_frontends alerting worked even for the multiple-Tornado instances. Remove the useless and misleading distinction.	2022-06-22 12:07:38 -07:00
Alex Vandiver	7f6a77da31	puppet: Add a redis exporter.	2022-05-03 17:13:44 -07:00
Alex Vandiver	1bd5723cd2	puppet: Add a prometheus monitor for tornado processes.	2022-03-20 16:12:11 -07:00
Anders Kaseorg	b3260bd610	docs: Use Debian and Ubuntu version numbers over development codenames. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-02-23 12:04:24 -08:00
Alex Vandiver	3c95ad82c6	puppet: Upgrade to nagios4. This updates the puppeted nagios configuration file for the Nagios4 defaults.	2022-01-11 09:38:31 -08:00
Alex Vandiver	8a5be972d2	puppet: Add a uwsgi exporter for monitoring. This allows investigation of how many workers are busy, and to track "harakiri" terminations.	2022-01-03 15:25:58 -08:00
Alex Vandiver	bb5a2c8138	puppet: Move prometheus to external_dep.	2021-12-29 16:35:15 -08:00
Alex Vandiver	2d6c096904	puppet: Move node_exporter to external_dep.	2021-12-29 16:35:15 -08:00
Alex Vandiver	291f688678	puppet: Use zulip::external_dep for grafana, template config. Templating the config ensures that the service is restarted when it is upgraded.	2021-12-08 20:58:10 -08:00
Anders Kaseorg	93f62b999e	nagios: Replace check_website_response with standard check_http plugin. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-07-09 16:47:03 -07:00
Alex Vandiver	dd90083ed7	puppet: Provide FQDN of self as URI, so the certificate validates. Failure to do this results in: ``` psql: error: failed to connect to `host=localhost user=zulip database=zulip`: failed to write startup message (x509: certificate is valid for [redacted], not localhost) ```	2021-06-14 00:14:48 -07:00
Alex Vandiver	d905eb6131	puppet: Add a database teleport server. Host-based md5 auth for 127.0.0.1 must be removed from `pg_hba.conf`, otherwise password authentication is preferred over certificate-based authentication for localhost.	2021-06-08 22:21:21 -07:00
Alex Vandiver	a2b1009ed5	puppet: Turn on "authentication" which defaults to user with all rights. Nagios refuses to allow any modifications with use_authentication off; re-enabled "authentication" but set a default user, which (by way of the `*` permissions in `359f37389a`) is allowed to take all actions.	2021-06-08 15:19:28 -07:00
Alex Vandiver	61b6fc865c	puppet: Add a label to teleport applications, to allow RBAC. Roles can only grant or deny access based on labels; set one based on the application name.	2021-06-08 15:19:04 -07:00
Alex Vandiver	4aff5b1d22	puppet: Allow access to `/` in nagios. This was a regression in `51b985b40d`.	2021-06-07 22:40:58 -07:00
Alex Vandiver	359f37389a	puppet: Remove in-nagios auth restrictions. `51b985b40d` made nagios only accessible from localhost, or as proxied via teleport. Remove the HTTP-level auth requirements.	2021-06-07 16:17:45 -07:00
Alex Vandiver	51b985b40d	puppet: Move nagios to behind teleport. This makes the server only accessible via localhost, by way of the Teleport application service.	2021-06-02 18:38:38 -07:00
Alex Vandiver	c9141785fd	puppet: Use concat fragments to place port allows next to services. This means that services will only open their ports if they are actually run, without having to clutter rules.v4 with a lot of `if` statements. This does not go as far as using `puppetlabs/firewall`[1] because that would represent an additional DSL to learn; raw IPtables sections can easily be inserted into the generated iptables file via `concat::fragment` (either inline, or as a separate file), but config can be centralized next to the appropriate service. [1] https://forge.puppet.com/modules/puppetlabs/firewall	2021-05-27 21:14:48 -07:00
Alex Vandiver	9ea86c861b	puppet: Add a nagios alert configuration for smokescreen. This verifies that the proxy is working by accessing a highly-available website through it. Since failure of this equates to failures of Sentry notifications and Android mobile push notifications, this is a paging service.	2021-03-18 10:11:15 -07:00
Alex Vandiver	a215c83c2d	puppet: Switch to more explicit variable rather than reuse a nagios one. Redis is not nagios, and this only leads to confusion as to why there is a nagios domain setting on frontend servers; it also leaves the `redis0` part of the name buried in the template. Switch to an explicit variable for the redis hostname.	2021-03-10 11:44:54 -08:00
Alex Vandiver	d938dd9d4a	puppet: Document smokescreen installation, and move to puppet/zulip/. This is more broadly useful than for just Kandra; provide documentation and means to install Smokescreen for stand-alone servers, and motivate its use somewhat more.	2021-03-02 17:16:38 -08:00
Alex Vandiver	32149c6a1c	puppet: Add ksplice uptrack for kernel hotpatches.	2021-02-25 18:05:47 -08:00
Alex Vandiver	0b736ef4cf	puppet: Remove puppet_ops configuration for separate loadbalancer host.	2021-02-22 16:05:13 -08:00
Alex Vandiver	e30b524896	iptables: Limit smokescreen port 4750, add camo port. Limit incoming connections to port 4750 to only the smokescreen host, and also allow access to the Camo server on that host, on port 9292.	2021-02-17 13:52:38 -08:00
Alex Vandiver	29f60bad20	smokescreen: Put the version into the supervisorctl command. This makes it reload correctly if the version is changed.	2021-02-16 08:12:31 -08:00
Alex Vandiver	45f6c79c4a	puppet: Rename postgres_ variables to postgresql_.	2020-10-28 11:51:52 -07:00
Alex Vandiver	e124324050	puppet: Rename postgres_appdb in nagios to postgresql.	2020-10-28 11:51:52 -07:00
Alex Vandiver	78b92a51cc	puppet: Allow access to smokescreen port via iptables.	2020-10-15 15:18:35 -07:00
Alex Vandiver	0d5356969e	puppet: Reformat ipv4 iptables rules comments.	2020-10-15 15:18:35 -07:00
Alex Vandiver	24383a5082	puppet: Rename hosts_domain so hosts_prefix can be grepped for.	2020-07-10 00:14:09 -07:00


77 Commits