Tim Abbott
1f08f4e70f
Rename nagios bot to zulip.com domain.
...
(imported from commit 9a2fba54295b4c473e030d3ff6ededbc3e2455af)
2013-07-25 17:16:53 -04:00
Tim Abbott
23beabb80c
[manual] Rename manage.py subscribe_new_users to process_signups.
...
The old name was very confusing, and the new one fits the convention of
naming the processor for the signups queue, a la "process_user_activity".
This requires doing a
  supervisorctl stop humbug-workers:humbug-events-subscribe-new-users
  puppet apply
to deploy the supervisord configuration changes and properly restart
the signups queue.
(imported from commit 0ee2dad837142afa64025446e22956709771a192)
2013-07-17 17:50:19 -04:00
Zev Benjamin
81c05e02c2
nagios: Check for the expected number of autossh processes on munin.humbughq.com
...
(imported from commit 77d35b2aaacf303f6118d7794f481e393868da59)
2013-07-17 14:34:00 -04:00
Zev Benjamin
14e58ff6e4
Monitor postgres1
...
The fact that we weren't already was an oversight on my part.
(imported from commit 2082ae79ac2884f26e98b430bcb08c15938a26c0)
2013-07-17 14:34:00 -04:00
Zev Benjamin
b4a208445b
Run check_postgres.pl against the correct database
...
We were previously running it against the 'postgres' database, which
meant we weren't actually checking the non-clusterwide statistics.
(imported from commit a6be529b16d5f1927463e49a7f7f4cf0b5299213)
2013-07-17 14:34:00 -04:00
Luke Faraone
bb0a7c8fc3
[manual] Switch various configuration files to refer to .zulip.net.
...
We only want to change cases where we're talking about the hostname; HTTP
requests should still go to staging.humbughq.com for now.
Before this commit is deployed, the hostname of staging.humbughq.com should
be changed to staging.zulip.net on the VM.
(The same applies to prod.)
(imported from commit 7412530773f720ac227f40061c9ddb1a851e19bb)
2013-07-15 16:49:55 -04:00
Leo Franchi
113180b7b7
nagios: Don't page about load/disk levels on non-critical servers.
...
Add pageable_servers and not_pageable_servers hostgroups, and only page for
app/postgres/zmirror.
(imported from commit 15c286324e942bd38e2a600a3b9091044f117e28)
2013-06-05 10:20:56 -04:00
Leo Franchi
25b915fa6a
Enable rabbitmq consumer checks on app
...
(imported from commit e3df8bc849dc0e1ae2e7782c0c9be5c08d4818c2)
2013-05-20 23:29:54 -04:00
Leo Franchi
3d4e239247
Check rabbitmq consumers for all important queues
...
(imported from commit 1279d33e3e1c36ee8da01859875d24b54e14e2e6)
2013-05-17 01:02:35 -04:00
Tim Abbott
d0540efa6a
nagios check_disk: check inode disk usage too.
...
(imported from commit e920c4a11c2797904f0ca397ebdcd8b0a9fef8cf)
2013-05-09 10:35:47 -04:00
Leo Franchi
52f6c720d9
Add new stats server to logging
...
(imported from commit b3647ab039c902d09a92082c3e98b5b066e6a5c8)
2013-04-29 16:44:41 -04:00
Leo Franchi
b3a3054f64
Slightly raise thresholds for load on nagios
...
(imported from commit 2dbc06c8ba204c10f6d6b590bc4858e07692540b)
2013-04-22 10:22:35 -04:00
Leo Franchi
350cf79ba0
Add a nagios check for a notify_tornado consumer
...
(imported from commit 050536bb4ac7384d5b98d5cf6cb7430b2b00dbd5)
2013-04-17 09:24:28 -04:00
Tim Abbott
5b1b2257bd
nagios: Commit Luke's testing contact.
...
(imported from commit d88951f42ad7753777b8e0ab2d47b9bb61ff3f76)
2013-04-16 12:02:42 -04:00
Tim Abbott
bb3b63206a
nagios: Comment out the postgres time checks (they're too noisy).
...
(imported from commit c9569cdbd2909ea7fb8c8c14a681201ee033c62b)
2013-04-16 12:02:42 -04:00
Tim Abbott
b73ac39a25
nagios: Run check_send_receive_time check against both staging and prod.
...
(imported from commit 749c5f04fba4832debe8a4e702914fa47d1fbeaa)
2013-04-16 12:02:42 -04:00
Tim Abbott
73886a95fd
nagios: Update app.humbughq.com to use its primary hostname.
...
(imported from commit 39d291e06b0fa223ae4bb76022b26464b969a505)
2013-04-16 12:02:42 -04:00
Jessica McKellar
c784457d36
nagios: update feedback bot check to reflect API directory reorg.
...
(imported from commit 01389b0f3f8bf68249cf91b4986e44763fb9a4a0)
2013-04-10 17:40:48 -04:00
Jessica McKellar
fe7fedd252
nagios: add check for send_invite_emails process.
...
(imported from commit b30e55241249a02ee61fac2d3f7abecc4d8318bd)
2013-04-10 16:58:17 -04:00
Luke Faraone
d89f5670bb
Add nagios check to verify mailchimp is running on staging/app.
...
(imported from commit 2aa79cc6252aadaa0a212b5c60eff9c5c55b7781)
2013-04-05 14:44:18 -07:00
Leo Franchi
2a334a6328
Tighten rabbitmq thresholds and page_admins
...
(imported from commit 373014bf75346286b55b0ea7d370b21de49ffa33)
2013-03-22 15:55:49 -04:00
Tim Abbott
72d7adce93
nagios: Lower default check intervals and default counts.
...
The defaults are quite large for a small site like ours where one
server down means an outage (e.g. only check every 5 minutes and then
require 4 failures before we alert the admins).
(imported from commit 3b2f436bbb716262f4ee939434749be535ffd6d3)
2013-02-20 16:47:55 -05:00
Tim Abbott
f547bdce9e
nagios: Add swap check.
...
(imported from commit 37ffdb8dfc117e728acc6c3fe4bae671c66ce4c9)
2013-02-20 11:10:45 -05:00
Tim Abbott
be834815aa
nagios: Rename paging_admins to page_admins.
...
I think the name is a little clearer.
(imported from commit cd707b76339cb85365f007701c6313aa6d65b4a3)
2013-02-19 15:40:18 -05:00
Tim Abbott
02ff5bc38d
Nagios: Change new services to paging mode.
...
(imported from commit 4406485179224287f4b7dfbaaa8ed4f97e6debbc)
2013-02-19 15:40:18 -05:00
Leo Franchi
9bb699f917
Add a nagios plugin for checking rabbitmq queue sizes
...
(imported from commit 32bd03bcfe4c4a4221ace17f83adb175f591c8ea)
2013-02-19 15:22:55 -05:00
Tim Abbott
63827c2301
Make the Nagios integration configurable, available, and documented.
...
(imported from commit 1208fc08ed366a892763c3b29b9aeafa90b29981)
2013-02-14 17:50:00 -05:00
Leo Franchi
0a0c4bb9a0
[manual] Use rabbitmq for asynchronous presence updating
...
Note: When deploying, the process-user-activity-commandline script needs to be restarted.
(imported from commit 63ee795c9c7a7db4a40170cff5636dc1dd0b46a8)
2013-02-11 18:05:57 -05:00
Zev Benjamin
da95bb2988
puppet: Move all puppetized config files to the humbug module and reference them with puppet URLs
...
(imported from commit f0f325bbad381b87c12c6f7888f4dd5d6989f09f)
2013-02-08 16:06:34 -05:00