zulip

Commit Graph

Author	SHA1	Message	Date
Alex Vandiver	f6d27562fa	puppet: Configure chrony to use AWS-local NTP sources. This prevents hosts from spewing traffic to random hosts across the Internet.	2022-03-25 17:07:53 -07:00
Alex Vandiver	1bd5723cd2	puppet: Add a prometheus monitor for tornado processes.	2022-03-20 16:12:11 -07:00
Anders Kaseorg	f6a701090c	setup-apt-repos: Don’t install lsb_release. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-02-14 16:38:53 -08:00
Alex Vandiver	b3f07cc98d	puppet: Replace debathena zephyr package with equivalent puppet file.	2022-01-18 14:13:28 -08:00
Alex Vandiver	a6d7539571	puppet: Replace debathena krb5 package with equivalent puppet file.	2022-01-18 14:13:28 -08:00
Alex Vandiver	fc1adef28a	puppet: Fix server_name of internal staging server.	2022-01-18 12:36:56 -08:00
Alex Vandiver	7e630b81f8	puppet: Switch to using snakeoil certs for staging. This parallels `ba3b88c81b`, but for the staging host.	2022-01-18 12:36:56 -08:00
Alex Vandiver	4d7e6b26df	puppet: Provide more attributes to teleport on ssh nodes.	2022-01-12 14:15:45 -08:00
Alex Vandiver	339e70671c	puppet: Switch Grafana to Grafana 8 Unified Alerting.	2022-01-11 14:27:11 -08:00
Alex Vandiver	6a7eecee9a	puppet: Increase load paging thresholds.	2022-01-11 09:38:31 -08:00
Alex Vandiver	3c95ad82c6	puppet: Upgrade to nagios4. This updates the puppeted nagios configuration file for the Nagios4 defaults.	2022-01-11 09:38:31 -08:00
Alex Vandiver	bb5a2c8138	puppet: Move prometheus to external_dep.	2021-12-29 16:35:15 -08:00
Alex Vandiver	2d6c096904	puppet: Move node_exporter to external_dep.	2021-12-29 16:35:15 -08:00
Alex Vandiver	053682964e	puppet: Only fetch from running hosts in Grafana ec2 discovery.	2021-12-09 08:12:03 -08:00
Alex Vandiver	291f688678	puppet: Use zulip::external_dep for grafana, template config. Templating the config ensures that the service is restarted when it is upgraded.	2021-12-08 20:58:10 -08:00
Alex Vandiver	fb2d05f9e3	puppet: Remove unused 'builder' files. These are leftover detritus from the "builder" host, which was removed in `4c9a283542`.	2021-12-06 10:21:50 -08:00
Alex Vandiver	b982222e03	camo: Replace with go-camo implementation. The upstream of the `camo` repository[1] has been unmaintained for several years, and is now archived by the owner. Additionally, it has a number of limitations: - It is installed as a sysinit service, which does not run under Docker - It does not prevent access to internal IPs, like 127.0.0.1 - It does not respect standard `HTTP_proxy` environment variables, making it unable to use Smokescreen to prevent the prior flaw - It occasionally just crashes, and thus must have a cron job to restart it. Swap camo out for the drop-in replacement go-camo[2], which has the same external API, requiring no changes to Django code, but is more maintained. Additionally, it resolves all of the above complaints. go-camo is not configured to use Smokescreen as a proxy, because its own private-IP filtering prevents using a proxy which lies within that IP space. It is also unclear if the addition of Smokescreen would provide any additional protection over the existing IP address restrictions in go-camo. go-camo has a subset of the security headers that our nginx reverse proxy sets, and which camo set; provide the missing headers with `-H` to ensure that go-camo, if exposed from behind some other non-nginx load-balancer, still provides the necessary security headers. Fixes #18351 by moving to supervisor. Fixes zulip/docker-zulip#298 also by moving to supervisor. [1] https://github.com/atmos/camo [2] https://github.com/cactus/go-camo	2021-11-19 15:58:26 -08:00
Alex Vandiver	27881babab	puppet: Increase prometheus storage, from the default 15d.	2021-08-24 23:40:43 -07:00
Alex Vandiver	5857dcd9b4	puppet: Configure ip6tables in parallel to ipv4. Previously, IPv6 firewalls were left at the default all-open. Configure IPv6 equivalently to IPv4.	2021-08-24 16:05:46 -07:00
Alex Vandiver	845509a9ec	puppet: Be explicit that existing iptables are only ipv4.	2021-08-24 16:05:46 -07:00
Alex Vandiver	4dd289cb9d	puppet: Enable prometheus monitoring of supervisord. To be able to read the UNIX socket, this requires running node_exporter as zulip, not as prometheus.	2021-08-03 21:47:02 -07:00
Alex Vandiver	aa940bce72	puppet: Disable hwmon collector, which does nothing on cloud hosts.	2021-08-03 21:47:02 -07:00
Alex Vandiver	e94b6afb00	nagios: Remove broken check_email_deliverer_* checks and related code. These checks suffer from a couple notable problems: - They are only enabled on staging hosts -- where they should never be run. Since `ef6d0ec5ca`, these supervisor processes are only run on one host, and never on the staging host. - They run as the `nagios` user, which does not have appropriate permissions, and thus the checks always fail. Specifically, `nagios` does not have permissions to run `supervisorctl`, since the socket is owned by the `zulip` user, and mode 0700; and the `nagios` user does not have permission to access Zulip secrets to run `./manage.py print_email_delivery_backlog`. Rather than rewrite these checks to run on a cron as zulip, and check those file contents as the nagios user, drop these checks -- they can be rewritten at a later point, or replaced with Prometheus alerting, and currently serve only to cause always-failing Nagios checks, which normalizes alert failures. Leave the files installed if they currently exist, rather than cluttering puppet with `ensure => absent`; they do no harm if they are left installed.	2021-08-03 16:07:13 -07:00
Anders Kaseorg	93f62b999e	nagios: Replace check_website_response with standard check_http plugin. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-07-09 16:47:03 -07:00
Alex Vandiver	d905eb6131	puppet: Add a database teleport server. Host-based md5 auth for 127.0.0.1 must be removed from `pg_hba.conf`, otherwise password authentication is preferred over certificate-based authentication for localhost.	2021-06-08 22:21:21 -07:00
Alex Vandiver	100a899d5d	puppet: Add grafana server.	2021-06-08 22:21:00 -07:00
Alex Vandiver	459f37f041	puppet: Add prometheus server.	2021-06-08 22:21:00 -07:00
Alex Vandiver	19fb58e845	puppet: Add prometheus node exporter.	2021-06-08 22:21:00 -07:00
Alex Vandiver	51b985b40d	puppet: Move nagios to behind teleport. This makes the server only accessible via localhost, by way of the Teleport application service.	2021-06-02 18:38:38 -07:00
Alex Vandiver	4f51d32676	puppet: Add a teleport application server. This requires switching to a reverse tunnel for the auth connection, with the side effect that the `zulip_ops::teleport::node` manifest can be applied on servers anywhere in the Internet; they do not need to have any publicly-available open ports.	2021-06-02 18:38:38 -07:00
Alex Vandiver	c59421682f	puppet: Add a teleport node on every host. Teleport nodes[1] are the equivalent to SSH servers. In addition to this config, joining the teleport cluster will require presenting a one-time "join token" from the proxy server[2], which may either be short-lived or static. [1] https://goteleport.com/docs/architecture/nodes/ [2] https://goteleport.com/docs/admin-guide/#adding-nodes-to-the-cluster	2021-06-02 18:38:38 -07:00
Alex Vandiver	1cdf14d195	puppet: Add a teleport server. See https://goteleport.com/docs/architecture/overview/ for the general architecture of a Teleport cluster. This commit adds a Teleport auth[1] and proxy[2] server. The auth server serves as a CA for granting time-bounded access to users and authenticating nodes on the cluster; the proxy provides access and a management UI. [1] https://goteleport.com/docs/architecture/authentication/ [2] https://goteleport.com/docs/architecture/proxy/	2021-06-02 18:38:38 -07:00
Alex Vandiver	c9141785fd	puppet: Use concat fragments to place port allows next to services. This means that services will only open their ports if they are actually run, without having to clutter rules.v4 with a lot of `if` statements. This does not go as far as using `puppetlabs/firewall`[1] because that would represent an additional DSL to learn; raw IPtables sections can easily be inserted into the generated iptables file via `concat::fragment` (either inline, or as a separate file), but config can be centralized next to the appropriate service. [1] https://forge.puppet.com/modules/puppetlabs/firewall	2021-05-27 21:14:48 -07:00
Alex Vandiver	4f017614c5	nagios: Replace check_fts_update_log with a process_fts_updates flag. This avoids having to duplicate the connection logic from process_fts_updates. Co-authored-by: Adam Birds <adam.birds@adbwebdesigns.co.uk>	2021-05-25 13:56:05 -07:00
Alex Vandiver	116e41f1da	puppet: Move files out and back when mounting /srv. Specifically, this affects /srv/zulip-aws-tools.	2021-05-23 13:29:23 -07:00
Alex Vandiver	0b1dd27841	puppet: AWS mounts its extra disks with inconsistent names. It is now /dev/nvme1n1, not /dev/nvme0n1; but it always has a consistent major/minor node. Source the file that defines these.	2021-05-23 13:29:23 -07:00
Alex Vandiver	033a96aa5d	puppet: Fix check_ssl_certificate check to check named host, not self.	2021-05-17 18:38:30 -07:00
Alex Vandiver	feb7870db7	puppet: Adjust thresholds on autovac_freeze. These thresholds are in relationship to the `autovacuum_freeze_max_age`, not the XID wraparound, which happens at 2^31-1. As such, it is perfectly normal that they hit 100%, and then autovacuum kicks in and brings it back down. The unusual condition is that PostgreSQL pushes past the point where an autovacuum would be triggered -- therein lies the XID wraparound danger. With the `autovacuum_freeze_max_age` set to 2000000000 in `postgresql.conf`, XID wraparound happens at 107.3%. Set the warning and error thresholds to below this, but above 100% so this does not trigger constantly.	2021-05-11 17:11:47 -07:00
Anders Kaseorg	9d57fa9759	puppet: Use pgrep -x to avoid accidental matches. Matching the full process name (-x without -f) or full command line (-xf) is less prone to mistakes like matching a random substring of some other command line or pgrep matching itself. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-05-07 08:54:41 -07:00
Alex Vandiver	6ee74b3433	puppet: Check health of APT repository.	2021-03-23 19:27:42 -07:00
Alex Vandiver	c01345d20c	puppet: Add nagios check for long-lived certs that do not auto-renew.	2021-03-23 19:27:27 -07:00
Alex Vandiver	9ea86c861b	puppet: Add a nagios alert configuration for smokescreen. This verifies that the proxy is working by accessing a highly-available website through it. Since failure of this equates to failures of Sentry notifications and Android mobile push notifications, this is a paging service.	2021-03-18 10:11:15 -07:00
Anders Kaseorg	129ea6dd11	nginx: Consistently listen on IPv6 and with HTTP/2. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-03-17 17:46:32 -07:00
Alex Vandiver	06c07109e4	puppet: Add missing semicolons left off in `ba3b88c81b`.	2021-03-12 15:48:53 -08:00
Alex Vandiver	ba3b88c81b	puppet: Explicitly use the snakeoil certificates for nginx. In production, the `wildcard-zulipchat.com.combined-chain.crt` file is just a symlink to the snakeoil certificates; but we do not puppet that symlink, which makes new hosts fail to start cleanly. Instead, point explicitly to the snakeoil certificate, and explain why.	2021-03-12 13:31:54 -08:00
Tim Abbott	957c16aa77	nagios: Tweak prod load monitoring parameters. Ultimately this monitoring isn't that helpful, but we're mainly interested in when it spikes to very high numbers.	2021-02-26 08:39:52 -08:00
Alex Vandiver	173d2dec3d	puppet: Check in defensive restart-camo cron job. This was found on lb1; add it to the camo install on smokescreen.	2021-02-24 16:42:21 -08:00
Alex Vandiver	0b736ef4cf	puppet: Remove puppet_ops configuration for separate loadbalancer host.	2021-02-22 16:05:13 -08:00
Alex Vandiver	29f60bad20	smokescreen: Put the version into the supervisorctl command. This makes it reload correctly if the version is changed.	2021-02-16 08:12:31 -08:00
Anders Kaseorg	6e4c3e41dc	python: Normalize quotes with Black. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-02-12 13:11:19 -08:00
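
The threshold arithmetic described in commit feb7870db7 can be checked with a short calculation. This is only a sketch based on the two numbers quoted in that commit message (XID wraparound at 2^31-1 and `autovacuum_freeze_max_age` set to 2000000000); the function and constant names are hypothetical and do not come from the Zulip repository or its Nagios checks.

```python
# XID wraparound point as quoted in the commit message: 2^31 - 1.
XID_WRAPAROUND = 2**31 - 1


def wraparound_percent(freeze_max_age: int) -> float:
    """Return the wraparound point as a percentage of autovacuum_freeze_max_age.

    Hypothetical helper for illustration only.
    """
    return XID_WRAPAROUND / freeze_max_age * 100


if __name__ == "__main__":
    # Value of autovacuum_freeze_max_age quoted in the commit message.
    percent = wraparound_percent(2_000_000_000)
    print(f"{percent:.2f}%")
```

With these inputs the result is about 107.37%, which the commit message gives as 107.3%. Any warning or error threshold therefore has to sit above 100% and below that value.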

255 Commits