zulip

Commit Graph

Author	SHA1	Message	Date
Alex Vandiver	2c90c7a010	nagios: Switch `check_remote_arg_string` queue checks to consumer checks. This style of check just looks for matching process names using `check_remote_arg_string`, which dates to `8edbd64bb8`. These were added because the original two (`missedmessage_emails` and `slow_queries`) did not create consumers, instead polling for events. Switch these to checking the queue consumer counts that the `check-rabbitmq-consumers` check is already writing out. Since `missedmessage_emails` was _already_ checked via the consumer check, a duplicate is not added.	2022-06-22 12:07:38 -07:00
Alex Vandiver	f48d543d9b	nagios: Make and use a "rabbitmq-consumer-service" template service.	2022-06-22 12:07:38 -07:00
Alex Vandiver	775a084d0f	nagios: Add a catchall "other" set.	2022-06-22 12:07:38 -07:00
Alex Vandiver	83c82c8e15	nagios: Adjust load alerting by hostgroup. Even the `pageable_servers` group did not page for high load -- in part because what was "high" depends on the servers. Set slightly better limits based on server role.	2022-06-22 12:07:38 -07:00
Alex Vandiver	2a14aa5180	nagios: Add a `fullstack` hostgroup. This will be used to apply checks only to czo.	2022-06-22 12:07:38 -07:00
Alex Vandiver	b5ecfc327f	nagios: Remove unnecessary `web` hostgroup. This had identical membership to `frontends`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	4be9025212	nagios: Remove redundant `postgresql` hostgroup. This is implied by `postgresql_primary`.	2022-06-22 12:07:38 -07:00
Alex Vandiver	d9d0014fb4	nagios: Rename `zmirror_main` into `zmirror` hostgroup. `zmirror` itself was `zmirror_main` + `zmirrorp` but was unused; we consistently just use the term `zmirror` for the non-personals server, so use it as the hostgroup name.	2022-06-22 12:07:38 -07:00
Alex Vandiver	70c36985b4	nagios: Remove frontends from redis group. The Redis nagios checks themselves are done against `redis` + `frontends` groups, so there is no need to misleadingly place `frontends` in the `redis` hostgroup.	2022-06-22 12:07:38 -07:00
Alex Vandiver	08127086bc	nagios: Remove misleading "staging_frontends" from standalone. No services are tested for the `staging_frontends` hostgroup, so this does not alter the checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	d804de871d	nagios: Move staging and prod hostgroups adjacent.	2022-06-22 12:07:38 -07:00
Alex Vandiver	4c17f2bccc	nagios: The frontends hostgroup now includes prod and staging frontends. This lets the config file remove some repetition.	2022-06-22 12:07:38 -07:00
Alex Vandiver	1e81775fa0	nagios: Drop unhelpful hostgroup comment.	2022-06-22 12:07:38 -07:00
Alex Vandiver	7b584401ac	nagios: Reformat hostgroups.	2022-06-22 12:07:38 -07:00
Alex Vandiver	93bcb86345	nagios: Reorder service checks.	2022-06-22 12:07:38 -07:00
Alex Vandiver	eaaa2fbff8	nagios: Use canonical "hostgroup_name" consistently.	2022-06-22 12:07:38 -07:00
Alex Vandiver	e8996b53a5	nagios: Remove unused has_swap hostgroup.	2022-06-22 12:07:38 -07:00
Alex Vandiver	33472ee9ff	nagios: Remove unused stats host set.	2022-06-22 12:07:38 -07:00
Alex Vandiver	bc4f4b4862	nagios: Make the pageable/not/flaky tri-state clearer.	2022-06-22 12:07:38 -07:00
Alex Vandiver	c74f195fba	nagios: Split AWS and non-AWS hosts, for ntp checks. The non-AWS hosts cannot use the AWS ntp server for their check.	2022-06-22 12:07:38 -07:00
Alex Vandiver	872efdee58	nagios: Fold single- and multitornado_frontends back into frontends. `5abf4dee92` made this distinction, then multitornado_frontends was never used; the singletornado_frontends alerting worked even for the multiple-Tornado instances. Remove the useless and misleading distinction.	2022-06-22 12:07:38 -07:00
Alex Vandiver	3741c1c034	puppet: Switch to checking time against the AWS timeserver. Since this is what chrony is sync'ing to, it lessens the chance of spurious firings of this alert. See https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/	2022-05-31 22:57:32 -07:00
Alex Vandiver	7f6a77da31	puppet: Add a redis exporter.	2022-05-03 17:13:44 -07:00
Anders Kaseorg	e9ba9b0e0d	zulip-ec2-configure-interfaces: Remove. Our current EC2 systems don’t have an interface named ‘eth0’, and if they did, this script would do nothing but crash with ImportError because we have never installed boto.utils for Python 3. (The message of commit `2a4d851a7c` made an effort to document for future researchers why this script should not have been blindly converted to Python 3. However, commit `2dc6d09c2a` (#14278) was evidently unresearched and untested.) Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-05-03 02:25:59 -07:00
Anders Kaseorg	646a4d19a3	puppet: Remove quotes for enumerable values. https://puppet.com/docs/puppet/7/style_guide.html#style_guide_module_design-quoting “If a string is a value from an enumerable set of options, such as present and absent, it SHOULD NOT be enclosed in quotes at all.” Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-04-29 22:06:46 -07:00
Alex Vandiver	35db1ee435	puppet: Only include "app_service" section if there are apps. This works around gravitational/teleport#12256, but also produces config files that are slightly cleaner.	2022-04-26 16:36:13 -07:00
Alex Vandiver	f6d27562fa	puppet: Configure chrony to use AWS-local NTP sources. This prevents hosts from spewing traffic to random hosts across the Internet.	2022-03-25 17:07:53 -07:00
Alex Vandiver	1bd5723cd2	puppet: Add a prometheus monitor for tornado processes.	2022-03-20 16:12:11 -07:00
Alex Vandiver	6b91652d9a	puppet: Open the grok_exporter port. The complete grok_exporter configuration is not ready to be committed, but this at least prepares the way for it.	2022-03-20 16:12:11 -07:00
Alex Vandiver	6558655fc6	puppet: Add rabbitmq prometheus plugin, and open the firewall.	2022-03-20 16:12:11 -07:00
Alex Vandiver	bdd2f35d05	puppet: Switch czo to using zulip_ops::app_frontend_monitoring. This was clearly intended in `f61ac4a28d` but never executed.	2022-03-20 16:12:11 -07:00
Alex Vandiver	17699bea44	puppet: postgresql_backups is auto-included if s3_backups_bucket is set. Since `6496d43148`.	2022-03-20 16:12:11 -07:00
Alex Vandiver	bedc7c2986	puppet: Smokescreen is now auto-included in standalone. Since `c33562f0a8`.	2022-03-20 16:12:11 -07:00
Anders Kaseorg	b3260bd610	docs: Use Debian and Ubuntu version numbers over development codenames. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-02-23 12:04:24 -08:00
Alex Vandiver	788daa953b	puppet: Factor out $::architecture case statement for golang.	2022-02-15 12:04:37 -08:00
Anders Kaseorg	f6a701090c	setup-apt-repos: Don’t install lsb_release. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-02-14 16:38:53 -08:00
Alex Vandiver	e032b38661	puppet: Fix typo in uwsgi exporter dependency.	2022-02-08 15:17:17 -08:00
Alex Vandiver	3bbe5c1110	puppet: Put comments on iptables lines. In addition to documenting the rules.v4 and rules.v6 files slightly, these comments show up in `iptables -L`: ``` root@hostname:~# iptables -L INPUT Chain INPUT (policy ACCEPT) target prot opt source destination ACCEPT all -- anywhere anywhere LOGDROP all -- anywhere localhost/8 ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED ACCEPT tcp -- anywhere anywhere tcp dpt:ssh /* ssh */ ACCEPT tcp -- anywhere anywhere tcp dpt:3000 /* grafana */ ACCEPT tcp -- anywhere anywhere tcp dpt:9100 /* node_exporter */ LOGDROP all -- anywhere anywhere ```	2022-01-21 16:46:14 -08:00
Alex Vandiver	6bc5849ea8	puppet: Remove now-unused debathena apt repository.	2022-01-18 14:13:28 -08:00
Alex Vandiver	b3f07cc98d	puppet: Replace debathena zephyr package with equivalent puppet file.	2022-01-18 14:13:28 -08:00
Alex Vandiver	a6d7539571	puppet: Replace debathena krb5 package with equivalent puppet file.	2022-01-18 14:13:28 -08:00
Alex Vandiver	75224ea5de	puppet: python-dev is now purely virtual; install python2.7-dev.	2022-01-18 14:13:28 -08:00
Alex Vandiver	fc1adef28a	puppet: Fix server_name of internal staging server.	2022-01-18 12:36:56 -08:00
Alex Vandiver	7e630b81f8	puppet: Switch to using snakeoil certs for staging. This parallels `ba3b88c81b`, but for the staging host.	2022-01-18 12:36:56 -08:00
Alex Vandiver	0b8a6a51b8	puppet: Remove all parts of AWS kernels. Otherwise, we just uninstall the meta-package, and still restart into the installed AWS kernel.	2022-01-12 15:52:19 -08:00
Alex Vandiver	4d7e6b26df	puppet: Provide more attributes to teleport on ssh nodes.	2022-01-12 14:15:45 -08:00
Alex Vandiver	339e70671c	puppet: Switch Grafana to Grafana 8 Unified Alerting.	2022-01-11 14:27:11 -08:00
Alex Vandiver	6a7eecee9a	puppet: Increase load paging thresholds.	2022-01-11 09:38:31 -08:00
Alex Vandiver	1e80b844f4	puppet: Disable apparmor profile for msmtp. As the nagios user, we want to read the msmtp configuration from ~nagios, which apparmor's profile does not allow msmtp to do.	2022-01-11 09:38:31 -08:00
Alex Vandiver	3c95ad82c6	puppet: Upgrade to nagios4. This updates the puppeted nagios configuration file for the Nagios4 defaults.	2022-01-11 09:38:31 -08:00
Alex Vandiver	4a95967a33	puppet: Gather uwsgi stats from chat.zulip.org.	2022-01-03 21:26:57 -08:00
Alex Vandiver	8a5be972d2	puppet: Add a uwsgi exporter for monitoring. This allows investigation of how many workers are busy, and to track "harakiri" terminations.	2022-01-03 15:25:58 -08:00
Anders Kaseorg	82748d45d8	install-yarn: Use test -ef in case /srv is a symlink. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-12-30 13:42:07 -08:00
Alex Vandiver	c094867a74	puppet: Add aarch64 build hashes to external dependencies. wal-g does not ship aarch64 binaries, currently; the compilation process([1]) is somewhat complicated, so we defer the decision about how to support wal-g for aarch64 until a later date. [1]: https://github.com/wal-g/wal-g/blob/master/docs/PostgreSQL.md#installing	2021-12-29 16:35:15 -08:00
Alex Vandiver	f166f9f7d6	puppet: Centralize versions and sha256 hashes of external dependencies. This will make it easier to update versions of these dependencies.	2021-12-29 16:35:15 -08:00
Alex Vandiver	57662689a9	puppet: Provide a constant homedir for grafana user. The homedir of a user cannot be changed if any processes are running as them, so having it change over time as upgrades happen will break puppet application, as the old grafana process under supervisor will effectively lock changes to the user's homedir. Unfortunately, that means that this change will thus fail to puppet-apply unless `supervisorctl stop grafana` is run first, but there's no way around that.	2021-12-29 16:35:15 -08:00
Alex Vandiver	6e55e52694	puppet: Pull out grafana $data_dir.	2021-12-29 16:35:15 -08:00
Alex Vandiver	1e4e6a09af	puppet: Stop making resources for external binaries and directories. In the event that extracting doesn't produce the binary we expected it to, all this will do is create an _empty_ file where we expect the binary to be. This will likely muddle debugging. Since the only reason the resource was made in the first place was to make dependencies clear, switch to depending on the External_Dep itself, when such a dependency is needed.	2021-12-29 16:35:15 -08:00
Alex Vandiver	3c163a7d5e	puppet: Move slash out of $dir by convention.	2021-12-29 16:35:15 -08:00
Alex Vandiver	bb5a2c8138	puppet: Move prometheus to external_dep.	2021-12-29 16:35:15 -08:00
Alex Vandiver	2d6c096904	puppet: Move node_exporter to external_dep.	2021-12-29 16:35:15 -08:00
Alex Vandiver	e4b23daad7	puppet: Upgrade to Grafana 8.3.2, for CVE-2021-43813.	2021-12-10 14:00:11 -08:00
Alex Vandiver	053682964e	puppet: Only fetch from running hosts in Grafana ec2 discovery.	2021-12-09 08:12:03 -08:00
Alex Vandiver	291f688678	puppet: Use zulip::external_dep for grafana, template config. Templating the config ensures that the service is restarted when it is upgraded.	2021-12-08 20:58:10 -08:00
Alex Vandiver	3eae429ab4	puppet: Upgrade Grafana to 8.3.1, for CVE-2021-43798.	2021-12-08 20:58:10 -08:00
Alex Vandiver	7db146d0a9	puppet: Do not assume amd64 architecture.	2021-12-06 11:08:50 -08:00
Alex Vandiver	fb2d05f9e3	puppet: Remove unused 'builder' files. These are leftover detritus from the "builder" host, which was removed in `4c9a283542`.	2021-12-06 10:21:50 -08:00
Alex Vandiver	c514feaa22	puppet: Default go-camo to listening on localhost for standalone deploys. The default in the previous commit, inherited from camo, was to bind to 0.0.0.0:9292. In standalone deployments, camo is deployed on the same host as the nginx reverse proxy, and as such there is no need to open it up to other IPs. Make `zulip::camo` take an optional parameter, which allows overriding it in puppet, but skips a `zulip.conf` setting for it, since it is unlikely to be adjusted by most users.	2021-11-19 15:58:26 -08:00
Alex Vandiver	b982222e03	camo: Replace with go-camo implementation. The upstream of the `camo` repository[1] has been unmaintained for several years, and is now archived by the owner. Additionally, it has a number of limitations: - It is installed as a sysinit service, which does not run under Docker - It does not prevent access to internal IPs, like 127.0.0.1 - It does not respect standard `HTTP_proxy` environment variables, making it unable to use Smokescreen to prevent the prior flaw - It occasionally just crashes, and thus must have a cron job to restart it. Swap camo out for the drop-in replacement go-camo[2], which has the same external API, requiring no changes to Django code, but is more maintained. Additionally, it resolves all of the above complaints. go-camo is not configured to use Smokescreen as a proxy, because its own private-IP filtering prevents using a proxy which lies within that IP space. It is also unclear if the addition of Smokescreen would provide any additional protection over the existing IP address restrictions in go-camo. go-camo has a subset of the security headers that our nginx reverse proxy sets, and which camo set; provide the missing headers with `-H` to ensure that go-camo, if exposed from behind some other non-nginx load-balancer, still provides the necessary security headers. Fixes #18351 by moving to supervisor. Fixes zulip/docker-zulip#298 also by moving to supervisor. [1] https://github.com/atmos/camo [2] https://github.com/cactus/go-camo	2021-11-19 15:58:26 -08:00
Alex Vandiver	1806e0f45e	puppet: Remove zulip.org configuration.	2021-08-26 17:21:31 -07:00
Alex Vandiver	27881babab	puppet: Increase prometheus storage, from the default 15d.	2021-08-24 23:40:43 -07:00
Alex Vandiver	e46e862f2b	puppet: Add a bare-bones zulipbot profile. This sets up the firewalls appropriate for zulipbot, but does not automate any of the configuration of zulipbot itself.	2021-08-24 16:05:58 -07:00
Alex Vandiver	5857dcd9b4	puppet: Configure ip6tables in parallel to ipv4. Previously, IPv6 firewalls were left at the default all-open. Configure IPv6 equivalently to IPv4.	2021-08-24 16:05:46 -07:00
Alex Vandiver	845509a9ec	puppet: Be explicit that existing iptables are only ipv4.	2021-08-24 16:05:46 -07:00
Alex Vandiver	4dd289cb9d	puppet: Enable prometheus monitoring of supervisord. To be able to read the UNIX socket, this requires running node_exporter as zulip, not as prometheus.	2021-08-03 21:47:02 -07:00
Alex Vandiver	aa940bce72	puppet: Disable hwmon collector, which does nothing on cloud hosts.	2021-08-03 21:47:02 -07:00
Alex Vandiver	e94b6afb00	nagios: Remove broken check_email_deliverer_* checks and related code. These checks suffer from a couple notable problems: - They are only enabled on staging hosts -- where they should never be run. Since `ef6d0ec5ca`, these supervisor processes are only run on one host, and never on the staging host. - They run as the `nagios` user, which does not have appropriate permissions, and thus the checks always fail. Specifically, `nagios` does not have permissions to run `supervisorctl`, since the socket is owned by the `zulip` user, and mode 0700; and the `nagios` user does not have permission to access Zulip secrets to run `./manage.py print_email_delivery_backlog`. Rather than rewrite these checks to run on a cron as zulip, and check those file contents as the nagios user, drop these checks -- they can be rewritten at a later point, or replaced with Prometheus alerting, and currently serve only to cause always-failing Nagios checks, which normalizes alert failures. Leave the files installed if they currently exist, rather than cluttering puppet with `ensure => absent`; they do no harm if they are left installed.	2021-08-03 16:07:13 -07:00
Alex Vandiver	e6bae4f1dd	puppet: Remove zulip::nagios class. `93f62b999e` removed the last file in puppet/zulip/files/nagios_plugins/zulip_nagios_server, which means the singular rule in zulip::nagios no longer applies cleanly. Remove the `zulip::nagios` class, as it is no longer needed.	2021-07-09 17:29:41 -07:00
Anders Kaseorg	93f62b999e	nagios: Replace check_website_response with standard check_http plugin. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-07-09 16:47:03 -07:00
Vishnu KS	e0f5fadb79	billing: Downgrade small realms that are behind on payments. An organization with at most 5 users that is behind on payments isn't worth spending time on investigating the situation. For larger organizations, we likely want somewhat different logic that at least does not void invoices.	2021-07-02 13:19:12 -07:00
Alex Vandiver	6c72698df2	puppet: Move zulip_ops supervisor config into /etc/supervisor/conf.d/zulip/. This is similar cleanup to `3ab9b31d2f`, but only affects zulip_ops services; it serves to ensure that any of these services which are no longer enabled are automatically removed from supervisor. Note that this will cause a supervisor restart on all affected hosts, which will restart all supervisor services.	2021-06-14 17:12:59 -07:00
Alex Vandiver	dd90083ed7	puppet: Provide FQDN of self as URI, so the certificate validates. Failure to do this results in: ``` psql: error: failed to connect to `host=localhost user=zulip database=zulip`: failed to write startup message (x509: certificate is valid for [redacted], not localhost) ```	2021-06-14 00:14:48 -07:00
Alex Vandiver	c90ff80084	puppet: Bump grafana version to 8.0.1. Most notably, this fixes an annoying bug with CloudWatch metrics being repeated in graphs.	2021-06-10 15:49:08 -07:00
Alex Vandiver	d905eb6131	puppet: Add a database teleport server. Host-based md5 auth for 127.0.0.1 must be removed from `pg_hba.conf`, otherwise password authentication is preferred over certificate-based authentication for localhost.	2021-06-08 22:21:21 -07:00
Alex Vandiver	100a899d5d	puppet: Add grafana server.	2021-06-08 22:21:00 -07:00
Alex Vandiver	459f37f041	puppet: Add prometheus server.	2021-06-08 22:21:00 -07:00
Alex Vandiver	19fb58e845	puppet: Add prometheus node exporter.	2021-06-08 22:21:00 -07:00
Alex Vandiver	a2b1009ed5	puppet: Turn on "authentication" which defaults to user with all rights. Nagios refuses to allow any modifications with use_authentication off; re-enabled "authentication" but set a default user, which (by way of the `*` permissions in `359f37389a`) is allowed to take all actions.	2021-06-08 15:19:28 -07:00
Alex Vandiver	61b6fc865c	puppet: Add a label to teleport applications, to allow RBAC. Roles can only grant or deny access based on labels; set one based on the application name.	2021-06-08 15:19:04 -07:00
Alex Vandiver	4aff5b1d22	puppet: Allow access to `/` in nagios. This was a regression in `51b985b40d`.	2021-06-07 22:40:58 -07:00
Alex Vandiver	54768c2210	puppet: Remove now-unused basic auth support files. `51b985b40d` made these unnecessary.	2021-06-07 16:17:45 -07:00
Alex Vandiver	359f37389a	puppet: Remove in-nagios auth restrictions. `51b985b40d` made nagios only accessible from localhost, or as proxied via teleport. Remove the HTTP-level auth requirements.	2021-06-07 16:17:45 -07:00
Alex Vandiver	2352fac6b5	puppet: Fix indentation.	2021-06-02 18:38:38 -07:00
Alex Vandiver	51b985b40d	puppet: Move nagios to behind teleport. This makes the server only accessible via localhost, by way of the Teleport application service.	2021-06-02 18:38:38 -07:00
Alex Vandiver	4f51d32676	puppet: Add a teleport application server. This requires switching to a reverse tunnel for the auth connection, with the side effect that the `zulip_ops::teleport::node` manifest can be applied on servers anywhere in the Internet; they do not need to have any publicly-available open ports.	2021-06-02 18:38:38 -07:00
Alex Vandiver	c59421682f	puppet: Add a teleport node on every host. Teleport nodes[1] are the equivalent to SSH servers. In addition to this config, joining the teleport cluster will require presenting a one-time "join token" from the proxy server[2], which may either be short-lived or static. [1] https://goteleport.com/docs/architecture/nodes/ [2] https://goteleport.com/docs/admin-guide/#adding-nodes-to-the-cluster	2021-06-02 18:38:38 -07:00
Alex Vandiver	1cdf14d195	puppet: Add a teleport server. See https://goteleport.com/docs/architecture/overview/ for the general architecture of a Teleport cluster. This commit adds a Teleport auth[1] and proxy[2] server. The auth server serves as a CA for granting time-bounded access to users and authenticating nodes on the cluster; the proxy provides access and a management UI. [1] https://goteleport.com/docs/architecture/authentication/ [2] https://goteleport.com/docs/architecture/proxy/	2021-06-02 18:38:38 -07:00
Alex Vandiver	3ebd627c50	puppet: Fix "import" -> "include" in chat_zulip_org.	2021-06-02 11:02:34 -07:00
Alex Vandiver	2130fc0645	puppet: Add an explicit class for czo.	2021-06-01 22:18:50 -07:00
Alex Vandiver	c9141785fd	puppet: Use concat fragments to place port allows next to services. This means that services will only open their ports if they are actually run, without having to clutter rules.v4 with a lot of `if` statements. This does not go as far as using `puppetlabs/firewall`[1] because that would represent an additional DSL to learn; raw IPtables sections can easily be inserted into the generated iptables file via `concat::fragment` (either inline, or as a separate file), but config can be centralized next to the appropriate service. [1] https://forge.puppet.com/modules/puppetlabs/firewall	2021-05-27 21:14:48 -07:00

490 Commits