This was causing about 10% of the time, personals being forwarded by
tabbott/extra@ATHENA.MIT.EDU to get this error:
zwrite: Field too long for buffer while sending notice to
tabbott/extra@ATHENA.MIT.EDU
We don't fully understand the cause of the problem, but this fixes it.
(imported from commit 22c39a1061cde9ad6973ef07aee7227623a3d47d)
Sometimes the very first message we send isn't received by
python-zephyr; this could be because of our growing a 17-message queue
of zephyrs to service, so let's not do that.
(imported from commit 281bf1807442b6335b05c803b1a47e0a162bef4e)
This ensures that Humbug receives the verbatim text of the Nagios
body, rather than a slightly modified version. Tested by running
manually.
(imported from commit 7e0eea0b230fd8c5552860acfb7372099598f036)
The refactoring that we did a couple weeks ago to make the zephyr
mirror script restart itself automatically (by splitting it into
zephyr_mirror.py and zephyr_mirror_backend.py) had a poor interaction
with our code for killing old zephyr_mirror processes (to prevent
double-mirroring). If you manually ran two copies of the outer
mirroring script (zephyr_mirror.py), then the second one would on
startup kill the first one's zephyr_mirror_backend.py children (for
being duplicate zephyr_mirror_backend.py processes that would result
in double-mirroring). However, importantly, it did not kill the first
mirroring script's zephyr_mirror.py parent process, so the first
mirroring script would then proceed to startup up new children. The
process then repeats, with the two scripts' roles reversed.
This issue didn't affect the sharded mirroring case, where I had been
doing the testing of the kill code with that refactoring, because we
don't have a version of the outer zephyr_mirror.py loop for that
situation (a consequence of it being hard to restart the threads
properly with the run_parallel API that we're using to spawn all the
children).
(imported from commit d4886ac77312a6b0ebd0d612f6fb084970bf23a2)
This is probably a lot more annoying to use, in that e.g. there are
separate log files for the two directions of the mirror, but we
haven't used these logs for much, so whatever.
(imported from commit d3bc407d90099214d242526c01cd3d3cd9d9d9bd)
Previously all that we logged was the sender, which turns out to be a
constant for humbug=>zephyr mirroring.
(imported from commit 527a3ac4b9b815a2b4d6b63c3ad92d9d5ad99a6e)
feedback-bot and zephyr_mirror will need to be updated and restarted
when this is deployed to prod.
(imported from commit fe2b524424c174bcb1b717a851a5d3815fda3f69)
Features include:
* Not forking into two processes (shells out to zwrite to send
instead). This makes life easier since we're not doing concurrent
programming.
* Eliminated a lot of hard-to-read or unnecessary debugging output.
* Adding explanatory test suggesting the likely problem for some
common sets of received messages.
* Much less code duplication.
* Support for testing a sharded zephyr_mirror script (--sharded).
* Use of the logging module to print timestamps -- makes debugging
some issues a lot.
* Only one sleep, and for only 10 seconds, between sending the
outgoing messages and checking that they were received.
* Support for running two copies of this script at the same time, so that
running it manually doesn't screw up Nagios.
* Passed running 100 tests run in a row.
(imported from commit a3ec02ac1d1a04972e469ca30fec1790c4fb53bc)
We should be canonicalizing stream names to class names in
update_subscriptions_from_humbug, before we even decide which classes
to subscribe to; otherwise deduplication and tracking of which classes
we're already subscribed to won't work.
(imported from commit a751b6fca1022390a087516a0730ff77f13d7edf)
This causes e.g. call_on_each_message to switch to dont_block mode
after the first error.
(imported from commit b6a5a10970c987faf8017f0ddae4e0b64a513c6f)
Also fixes bugs where the retry code wouldn't work correctly if
verbose wasn't set, prints out which API query had the error, and is
more consistent about printing something to end the "..." sequences.
(imported from commit 15c1e0e4a14c5c5559b43bafe4ec268451ee04f5)
Previously, we were sending "Skipping message we got from Humbug!"
for messages we wouldn't have forwarded anyway.
(imported from commit 36df85a61336ac00e3d7913d5a417d6b42764350)
In this case, if we're configured to not forward personals, there's no
point in logging a decision not to forward one.
(imported from commit 62c37591c6a70afb6235de626b0c6a3502cbcb27)
It seems that check-mirroring was reporting a lot of spurious failures
due to it sometimes taking more than 3 seconds for the setup phase of
check-mirroring's receive path to run. So fix this:
(1) Wait a bit more than 3 seconds for the receiver to subscribe to
messages
(2) Subscribe to Humbug messages before forking (we can't do this with
zephyr.init() because python-zephyr gets totally messed up if you use
it from multiple processes with a shared initialization)
(3) Get rid of the old time.sleep(0.x) values that were intended to
make messages arrive in order -- since we're now checking that
messages correctly arrived using set(), they aren't needed.
(4) Use a single request to subscribe to both zephyr classes we need
to subscribe to (saves 1 RTT).
(imported from commit d96aef05405ce43e9a4a549de189da9a2e393875)
- The default 'default' is None
- The default 'dest' is the option name, with - replaced with _
- The default 'action' is 'store'
Not coincidentally, these defaults are correct for most of our existing code.
(imported from commit 9c6078bd778324e08e1ca214a97281a7bdce08c2)
At least the zephyr.humbughq.com case is one our users might find
useful. I imagine we'll end up hiding --site once we integrate the
zephyr and main site tornado servers.
(imported from commit a3a28492cf3f32b78eab6bbd48ba66be2f38a5f2)
Sometimes messages are delayed passed the allowed 10 seconds. We may
eventually decide that that amount of latency is unacceptable, but the
latency is not what this script is supposed to be checking.
(imported from commit d83a6a83d60e9eac13b3b87fb31de7f9881acabf)