29 Commits

Author SHA1 Message Date
Tim Abbott 1f08f4e70f Rename nagios bot to zulip.com domain.
(imported from commit 9a2fba54295b4c473e030d3ff6ededbc3e2455af)
2013-07-25 17:16:53 -04:00
Tim Abbott 23beabb80c [manual] Rename manage.py subscribe_new_users to process_signups.
The old name was very confusing, and this fits the convention of "the
processor for the signups" queue a la "process_user_activity".

This requires doing a

supervisorctl stop humbug-workers:humbug-events-subscribe-new-users
puppet apply

to deploy the supervisord configuration changes and properly restart
the signups queue.

(imported from commit 0ee2dad837142afa64025446e22956709771a192)
2013-07-17 17:50:19 -04:00
Zev Benjamin 81c05e02c2 nagios: Check for the expected number of autossh processes on munin.humbughq.com
(imported from commit 77d35b2aaacf303f6118d7794f481e393868da59)
2013-07-17 14:34:00 -04:00
Zev Benjamin 14e58ff6e4 Monitor postgres1
The fact that we weren't already was an oversight on my part.

(imported from commit 2082ae79ac2884f26e98b430bcb08c15938a26c0)
2013-07-17 14:34:00 -04:00
Zev Benjamin b4a208445b Run check_postgres.pl against the correct database
We were previously running it against the 'postgres' database, which
meant we weren't actually checking the non-clusterwide statistics.

(imported from commit a6be529b16d5f1927463e49a7f7f4cf0b5299213)
2013-07-17 14:34:00 -04:00
Luke Faraone bb0a7c8fc3 [manual] Switch various configuration files to refer to .zulip.net.
We only want to change cases where we're talking about the hostname; HTTP
requests should still go to staging.humbughq.com for now.

Before this commit is deployed the hostname of staging.humbughq.com should
be changed to staging.zulip.net on the VM.

(the same for prod)

(imported from commit 7412530773f720ac227f40061c9ddb1a851e19bb)
2013-07-15 16:49:55 -04:00
Leo Franchi 113180b7b7 nagios: Don't page about load/disk levels on non-critical servers.
Add a pageable_servers and not_pageable_servers hostgroup, and only page for
app/postgres/zmirror.

(imported from commit 15c286324e942bd38e2a600a3b9091044f117e28)
2013-06-05 10:20:56 -04:00
Leo Franchi 25b915fa6a Enable rabbitmq consumer checks on app
(imported from commit e3df8bc849dc0e1ae2e7782c0c9be5c08d4818c2)
2013-05-20 23:29:54 -04:00
Leo Franchi 3d4e239247 Check rabbitmq consumers for all important queues
(imported from commit 1279d33e3e1c36ee8da01859875d24b54e14e2e6)
2013-05-17 01:02:35 -04:00
Tim Abbott d0540efa6a nagios check_disk: check inode disk usage too.
(imported from commit e920c4a11c2797904f0ca397ebdcd8b0a9fef8cf)
2013-05-09 10:35:47 -04:00
Leo Franchi 52f6c720d9 Add new stats server to logging
(imported from commit b3647ab039c902d09a92082c3e98b5b066e6a5c8)
2013-04-29 16:44:41 -04:00
Leo Franchi b3a3054f64 Slightly raise thresholds for load on nagios
(imported from commit 2dbc06c8ba204c10f6d6b590bc4858e07692540b)
2013-04-22 10:22:35 -04:00
Leo Franchi 350cf79ba0 Add a nagios check for a notify_tornado consumer
(imported from commit 050536bb4ac7384d5b98d5cf6cb7430b2b00dbd5)
2013-04-17 09:24:28 -04:00
Tim Abbott 5b1b2257bd nagios: Commit Luke's testing contact.
(imported from commit d88951f42ad7753777b8e0ab2d47b9bb61ff3f76)
2013-04-16 12:02:42 -04:00
Tim Abbott bb3b63206a nagios: Comment out the postgres time checks (they're too noisy).
(imported from commit c9569cdbd2909ea7fb8c8c14a681201ee033c62b)
2013-04-16 12:02:42 -04:00
Tim Abbott b73ac39a25 nagios: Run check_send_receive_time check against both staging and prod.
(imported from commit 749c5f04fba4832debe8a4e702914fa47d1fbeaa)
2013-04-16 12:02:42 -04:00
Tim Abbott 73886a95fd nagios: Update app.humbughq.com to use its primary hostname.
(imported from commit 39d291e06b0fa223ae4bb76022b26464b969a505)
2013-04-16 12:02:42 -04:00
Jessica McKellar c784457d36 nagios: update feedback bot check to reflect API directory reorg.
(imported from commit 01389b0f3f8bf68249cf91b4986e44763fb9a4a0)
2013-04-10 17:40:48 -04:00
Jessica McKellar fe7fedd252 nagios: add check for send_invite_emails process.
(imported from commit b30e55241249a02ee61fac2d3f7abecc4d8318bd)
2013-04-10 16:58:17 -04:00
Luke Faraone d89f5670bb Add nagios check to verify mailchimp is running on staging/app.
(imported from commit 2aa79cc6252aadaa0a212b5c60eff9c5c55b7781)
2013-04-05 14:44:18 -07:00
Leo Franchi 2a334a6328 Tighten rabbitmq thresholds and page_admins
(imported from commit 373014bf75346286b55b0ea7d370b21de49ffa33)
2013-03-22 15:55:49 -04:00
Tim Abbott 72d7adce93 nagios: Lower default check intervals and default counts.
The defaults are quite large for a small site like ours, where one
server down means an outage (e.g. only check every 5 minutes and then
require 4 failures before we alert the admins).

(imported from commit 3b2f436bbb716262f4ee939434749be535ffd6d3)
2013-02-20 16:47:55 -05:00
Tim Abbott f547bdce9e nagios: Add swap check.
(imported from commit 37ffdb8dfc117e728acc6c3fe4bae671c66ce4c9)
2013-02-20 11:10:45 -05:00
Tim Abbott be834815aa nagios: Rename paging_admins to page_admins.
I think the name is a little clearer.

(imported from commit cd707b76339cb85365f007701c6313aa6d65b4a3)
2013-02-19 15:40:18 -05:00
Tim Abbott 02ff5bc38d nagios: Change new services to paging mode.
(imported from commit 4406485179224287f4b7dfbaaa8ed4f97e6debbc)
2013-02-19 15:40:18 -05:00
Leo Franchi 9bb699f917 Add a nagios plugin for checking rabbitmq queue sizes
(imported from commit 32bd03bcfe4c4a4221ace17f83adb175f591c8ea)
2013-02-19 15:22:55 -05:00
Tim Abbott 63827c2301 Make the Nagios integration configurable, available, and documented.
(imported from commit 1208fc08ed366a892763c3b29b9aeafa90b29981)
2013-02-14 17:50:00 -05:00
Leo Franchi 0a0c4bb9a0 [manual] Use rabbitmq for asynchronous presence updating
Note: When deploying, restarting the process-user-activity-commandline script is needed

(imported from commit 63ee795c9c7a7db4a40170cff5636dc1dd0b46a8)
2013-02-11 18:05:57 -05:00
Zev Benjamin da95bb2988 puppet: Move all puppetized config files to the humbug module and reference them with puppet URLs
(imported from commit f0f325bbad381b87c12c6f7888f4dd5d6989f09f)
2013-02-08 16:06:34 -05:00