The code comment explains this issue in some detail, but essentially
in Kubernetes and Docker Swarm systems, the container overlayer
network has a relatively short TCP idle lifetime (about 15 minutes),
which can lead to it killing the connection between Tornado and
RabbitMQ.
We fix this by setting a TCP keepalive on that connection shorter than
15 minutes.
Fixes#10776.
We were seeing errors when pubishing typical events in the form of
`Dict[str, Any]` as the expected type to be a `Union`. So we instead
change the only non-dictionary call, to pass a dict instead of `str`.
Details in comment. Together with a few previous commits, this should
completely eliminate sending error mail to admins when the RabbitMQ
server is simply restarted and comes back up normally.
Because the base class's __init__ calls `_connect`, when we set the
value after that call has already returned, our new value only takes
effect if the first connection fails and we have to reconnect.
Make it take effect from the beginning.
This parameter isn't used anywhere. A good thing, because if it were,
the code would immediately raise an exception -- `self._on_open_cbs`
hasn't been initialized yet when we first call `_connect`, from the
base class's `__init__`.
So, just cut it. If we later need something like this, it's easy to
add a working version then.
Empirically, the retry in `_on_connection_closed` didn't actually work
-- if a reconnect failed, that was it, and the exception handler
didn't get run. A traceback would get logged, but all its frames were
in Tornado or Pika, not our own code; presumably something magic and
async was happening to the exception.
Moreover, though we would make one attempt to reconnect if we had a
connection that got closed, we didn't have any form of retry if the
original attempt at connecting failed in the first place.
Happily, upstream offers a perfectly reasonable bit of API that avoids
both of these problems: the on-open-error callback. So use that.
This method was new in Tornado 4.0. It saves us from having to get
the time ourselves and do the arithmetic -- which not only makes the
code a bit shorter, but also easier to get right. Tornado docs (see
http://www.tornadoweb.org/en/stable/ioloop.html) say we should have
been getting the time from `ioloop.time()` rather than hardcoding
`time.time()`, because the loop could e.g. be running on the
`time.monotonic()` clock.
Adding it afterward is inherently racy, and upstream's API is quite
reasonable for avoiding that -- just like we can pass an on-open
callback up front, we can do the same with the on-close callback.
This is a more thorough version of 4adf2d5c2 from back in 2013-04.
The default value of this parameter is already False upstream.
(It was already False in pika version 0.9.6, which we were
supposedly using when we introduced this in 4baeaaa52; not sure
what the story was there.)
When the RabbitMQ server disappears, we log errors like these:
```
Traceback (most recent call last):
File "./zerver/lib/queue.py", line 114, in json_publish
self.publish(queue_name, ujson.dumps(body))
File "./zerver/lib/queue.py", line 108, in publish
self.ensure_queue(queue_name, do_publish)
File "./zerver/lib/queue.py", line 88, in ensure_queue
if not self.connection.is_open:
AttributeError: 'NoneType' object has no attribute 'is_open'
During handling of the above exception, another exception occurred:
[... traceback of connection failure inside the retried self.publish()]
```
That's a type error -- a programming error, not an exceptional
condition from outside the program. Fix the programming error.
Also move the retry out of the `except:` block, so that if it also
fails we don't get the exceptions stacked on each other. This is a
new feature of Python 3 which is sometimes indispensable for
debugging, and which surfaced this nit in the logs (on Python 2 we'd
never see the AttributeError part), but in some cases it can cause a
lot of spew if care isn't taken.
This makes tests of queue processors more realistic,
by adding a parameter to `queue_json_publish` that
calls a queue's consumer function if accessed in a test.
Fixes part of #6542.
Previously these were hardcoded in zproject/settings.py to be accessed
on localhost.
[Modified by Tim Abbott to adjust comments and fix configure-rabbitmq]
The register_json_consumer() function now expects its callback
function to accept a single argument, which is the payload, as
none of the callbacks cared about channel, method, and properties.
This change breaks down as follows:
* A couple test stubs and subclasses were simplified.
* All the consume() and consume_wrapper() functions in
queue_processors.py were simplified.
* Two callbacks via runtornado.py were simplified. One
of the callbacks was socket.respond_send_message, which
had an additional caller, i.e. not register_json_consumer()
calling back to it, and the caller was simplified not
to pass None for the three removed arguments.
(imported from commit 792316e20be619458dd5036745233f37e6ffcf43)