zulip

Commit Graph

Author	SHA1	Message	Date
Alex Vandiver	6ee74b3433	puppet: Check health of APT repository.	2021-03-23 19:27:42 -07:00
Alex Vandiver	c01345d20c	puppet: Add nagios check for long-lived certs that do not auto-renew.	2021-03-23 19:27:27 -07:00
Alex Vandiver	9ea86c861b	puppet: Add a nagios alert configuration for smokescreen. This verifies that the proxy is working by accessing a highly-available website through it. Since failure of this equates to failures of Sentry notifications and Android mobile push notifications, this is a paging service.	2021-03-18 10:11:15 -07:00
Alex Vandiver	eaa99359b1	puppet: Rename to check_postgresql_replication_lag.	2020-10-28 11:51:52 -07:00
Alex Vandiver	53e59a0a13	puppet: Rename check_postgres_backup to check_postgresql_backup.	2020-10-28 11:51:52 -07:00
Alex Vandiver	e124324050	puppet: Rename postgres_appdb in nagios to postgresql.	2020-10-28 11:51:52 -07:00
Alex Vandiver	48e06c25ba	puppet: Switch nagios SSH checks to id_ed25519 key. The ssh-rsa algorithm was deprecated[1] in OpenSSH 8.2 (2020-02-14) and will be removed in a future release. [1] https://www.openssh.com/txt/release-8.4	2020-10-22 16:42:30 -07:00
Alex Vandiver	31d80a77d4	puppet: Update nagios check_postgres_replication_lag to be on DB hosts. `7d4a370a57` attempted to move the replication check onto the PostgreSQL hosts. While it updated the _check_ to assume it was running and talking to a local PostgreSQL instance, the configuration and installation for the check were not updated. As such, the check ran on the nagios host for each DB host, and produced no output. Start distributing the check to all appdb hosts, and configure nagios to use the SSH tunnel to get there.	2020-07-14 16:27:18 -07:00
Alex Vandiver	6c27f07c1d	puppet: Move PostgreSQL backups to their own class. wal-g was used in `puppet/zulip` by env-wal-g, but only installed in `puppet/zulip_ops`. Merge all of the dependencies of doing backups using wal-g (wal-g installation, the pg_backup_and_purge job, the nagios plugin that verifies it happens) into a common base class in `puppet/zulip`, since it is generally useful.	2020-07-14 00:40:25 -07:00
Alex Vandiver	a4e7c7a27e	nagios: Remove check_memcached. check_memcached does not support memcached authentication even in its latest release (it’s in a TODO item comment, and that’s it), and was never particularly useful.	2020-07-10 00:12:48 -07:00
Alex Vandiver	a21a086f5c	puppet: nagios-plugins-basic is replaced by monitoring-plugins-basic. In Bionic, nagios-plugins-basic is a transitional package which depends on monitoring-plugins-basic. In Focal, it is a virtual package, which means that every time puppet runs, it tries to re-install the nagios-plugins-basic package. Switch all instances to referring to `$zulip::common::nagios_plugins`, and repoint that to monitoring-plugins-basic.	2020-06-29 14:58:01 -07:00
Tim Abbott	e1ce53ac46	puppet: Update nagios checks for disk to exclude kernel filesystems. The fact that we have to explicitly list these is almost certainly a bug in check_disk, but at least this works.	2020-04-16 17:49:29 -07:00
Tim Abbott	cfbb617f5c	puppet: Update nagios configuration for checking local disk.	2020-04-16 17:48:36 -07:00
Anders Kaseorg	ea6934c26d	dependencies: Remove WebSockets system for sending messages. Zulip has had a small use of WebSockets (specifically, for the code path of sending messages, via the webapp only) since ~2013. We originally added this use of WebSockets in the hope that the latency benefits of doing so would allow us to avoid implementing a markdown local echo; they were not. Further, HTTP/2 may have eliminated the latency difference we hoped to exploit by using WebSockets in any case. While we’d originally imagined using WebSockets for other endpoints, there was never a good justification for moving more components to the WebSockets system. This WebSockets code path had a lot of downsides/complexity, including: * The messy hack involving constructing an emulated request object to hook into doing Django requests. * The `message_senders` queue processor system, which increases RAM needs and must be provisioned independently from the rest of the server. * A duplicate check_send_receive_time Nagios test specific to WebSockets. * The requirement for users to have their firewalls/NATs allow WebSocket connections, and a setting to disable them for networks where WebSockets don’t work. * Dependencies on the SockJS family of libraries, which has at times been poorly maintained, and periodically throws random JavaScript exceptions in our production environments without a deep enough traceback to effectively investigate. * A total of about 1600 lines of our code related to the feature. * Increased load on the Tornado system, especially around a Zulip server restart, and especially for large installations like zulipchat.com, resulting in extra delay before messages can be sent again.
As detailed in https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it appears that removing WebSockets moderately increases the time it takes for the `send_message` API query to return from the server, but does not significantly change the time between when a message is sent and when it is received by clients. We don’t understand the reason for that change (suggesting the possibility of a measurement error), and even if it is a real change, we consider that potential small latency regression to be acceptable. If we later want WebSockets, we’ll likely want to just use Django Channels. Signed-off-by: Anders Kaseorg <anders@zulipchat.com>	2020-01-14 22:34:00 -08:00
Tim Abbott	b41c2d93d1	puppet: Exclude squashfs filesystems from nagios disk checks. These generally aren't being written to.	2019-06-16 16:22:23 -07:00
Tim Abbott	24b6106c9c	puppet: Disable checking for evictions in memcached nagios. Zulip's caching model for message history is such that it is normal and healthy for there to eventually be a nontrivial volume of evictions.	2018-03-06 13:34:02 -08:00
Rishi Gupta	1d581a9c6e	nagios: Add nagios check for analytics state. This should help us detect issues where the analytics cron jobs aren't running properly. The cron/nagios part of the implementation done by tabbott.	2018-02-09 16:36:05 -08:00
Tim Abbott	96c3014da0	nagios: Automate configuration of outgoing email with msmtp. Now we no longer need to check in a bunch of hostnames in order to configure Nagios.	2017-10-05 20:29:47 -07:00
Tim Abbott	162eaf8917	nagios: Modify check for swap to allow no swap. If a machine is configured with no swap intentionally, that shouldn't be a Nagios problem. This alert is intended to flag machines which are swapping.	2017-10-05 20:07:44 -07:00
Tim Abbott	5193936bc3	nagios: Add Memcached and Redis monitoring. These are standard Nagios plugins that might be sometimes helpful.	2017-10-05 20:06:16 -07:00
Tim Abbott	5a80c029a2	nagios: Update path to sync_public_streams to match new config.	2017-10-05 13:34:27 -07:00
Reid Barton	ccb4c5c26f	bots: Move zephyr-related files to api/integrations/zephyr/.	2017-05-26 15:07:02 -07:00
Tim Abbott	fa8045a484	puppet: Add websockets Nagios test to configuration. Since browser clients send messages via websockets and not the API, this is an important element in making sure mission-critical Zulip functionality is working.	2017-02-08 11:13:19 -08:00
Tim Abbott	65774e1c4f	zulip_ops: use check_postgres package from apt.	2017-01-06 21:18:55 -08:00
Tim Abbott	73178e5e5a	puppet: Run check_send_receive_time via a cron job. This allows the actual nagios work involved with check_send_receive_time nagios checks to be done by an unprivileged "nagios" user rather than the "zulip" user.	2016-10-26 00:26:52 -07:00
Tim Abbott	4f58fef54b	zulip_ops: Use nagios user for all Nagios checks. There's no reason these Nagios checks need to run as the semi-privileged Zulip user.	2016-10-26 00:17:26 -07:00
Tim Abbott	080dd8c987	nagios: Ignore kthreads in check_procs tests. Modern Linux can have a lot of kernel threads not doing anything. Since this isn't interesting from a monitoring perspective, we ignore these.	2016-10-26 00:10:40 -07:00
Tim Abbott	36e336edc3	puppet: Rename zulip_internal to zulip_ops. The old "zulip_internal" name was from back when Zulip, Inc. had two distributions of Zulip, the enterprise distribution in puppet/zulip/ and the "internal" SAAS distribution in puppet/zulip_internal. I think the name is a bit confusing in the new fully open-source Zulip work, so we're replacing it with "zulip_ops". I don't think the new name is perfect, but it's better. In the following commits, we'll delete a bunch of pieces of Zulip, Inc.'s infrastructure that don't exist anymore and thus are no longer useful (e.g. the old Trac configuration), with the goal of cleaning the repository of as much unnecessary content as possible.	2016-10-16 19:23:27 -07:00

28 Commits