Commit Graph

87 Commits

Author SHA1 Message Date
Anders Kaseorg b0ce4f1bce docs: Fix many spelling mistakes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-07 18:51:06 -08:00
Alex Vandiver faeffa2466 queue_processors: Set a bounded prefetch size on rabbitmq queues.
RabbitMQ clients have a setting called prefetch[1], which controls how
many un-acknowledged events the server forwards to the local queue in
the client.  The default is 0; this means that when clients first
connect, the server must send them every message in the queue.

This itself may cause unbounded memory usage in the client, but also
has other detrimental effects.  While the client is attempting to
process the head of the queue, it may be unable to read from the TCP
socket at the rate that the server is sending to it -- filling the TCP
buffers, and causing the server's writes to block.  If the server
blocks for more than 30 seconds, it times out the send, and closes the
connection with:

```
closing AMQP connection <0.30902.126> (127.0.0.1:53870 -> 127.0.0.1:5672):
{writer,send_failed,{error,timeout}}
```

This is https://github.com/pika/pika/issues/753#issuecomment-318119222.

Set a prefetch limit of 100 messages, or the batch size, to better
handle queues which start with large numbers of outstanding events.

Setting prefetch=1 causes significant performance degradation in the
no-op queue worker, to 30% of the prefetch=0 performance.  Setting
prefetch=100 achieves 90% of the prefetch=0 performance, and higher
values offer only minor gains above that.  For batch workers, their
performance is not notably degraded by prefetch equal to their batch
size, and they cannot function on smaller prefetches than their batch
size.

We also set a 100-count prefetch on Tornado workers, as they are
potentially susceptible to the same effect.

[1] https://www.rabbitmq.com/confirms.html#channel-qos-prefetch
2021-11-16 11:48:50 -08:00
Alex Vandiver 7c3507feef queue: Allow passing down a prefetch count to pika. 2021-11-16 11:48:50 -08:00
PIG208 aa9d73c9f6 typing: Improve typing with assertions.
This fixes some mypy errors discovered with django-stubs.
2021-08-20 05:54:19 -07:00
Anders Kaseorg 04feadd917 mypy: Add pika-stubs.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-02 22:31:46 -07:00
Anders Kaseorg 9f8ba913fd queue: Fix _on_connection_open_error type to accept reason: str.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-02 22:31:46 -07:00
Anders Kaseorg f7e2426fc5 queue: Fix ensure_queue type to accept a callback returning any object.
channel.basic_consume actually returns str.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-02 22:31:46 -07:00
Anders Kaseorg 5e355abe2e queue: Add missing imports.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-02 22:31:46 -07:00
Anders Kaseorg 87799177b5 queue: Fix channel type for TornadoQueueClient.
The BlockingChannel annotations in TornadoQueueClient were flat-out
wrong.  BlockingChannel and Channel have no common base classes.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-02 22:31:46 -07:00
Anders Kaseorg 5751479932 queue: Switch TornadoQueueClient to the new base QueueClient.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-02 22:31:46 -07:00
Anders Kaseorg bd6a2b149c queue: Split common part of SimpleQueueClient into new base class.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-02 22:31:46 -07:00
Anders Kaseorg 6e4c3e41dc python: Normalize quotes with Black.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg 11741543da python: Reformat with Black, except quotes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg b7a94be152 python: Catch BaseException when we need to clean something up.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-10-11 16:16:16 -07:00
Alex Vandiver c2132a4f9c queue: Drop register_json_consumer / json_drain_queue interface.
Now that all callsites use the same interface, drop the now-unused
ones, and their tests.
2020-10-11 14:19:42 -07:00
Alex Vandiver 179c387409 tornado: Switch to start_json_consumer interface. 2020-10-11 14:19:42 -07:00
Alex Vandiver f9358d5330 queue: Switch batch interface to use the channel.consume iterator.
This low-level interface allows consuming from a queue with timeouts.
This can be used to either consume in batches (with an upper timeout),
or one-at-a-time.  This is notably more performant than calling
`.get()` repeatedly (what json_drain_queue does under the hood), which
is "*highly discouraged* as it is *very inefficient*"[1].

Before this change:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11158 / sec
Dequeue rate: 3075 / sec
```

After:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11511 / sec
Dequeue rate: 19938 / sec
```

[1] https://www.rabbitmq.com/consumers.html#fetching
2020-10-11 14:19:40 -07:00
Alex Vandiver 2547bdbf4a queue: Rename consume_wrapper to a better name. 2020-10-09 20:40:51 -07:00
Alex Vandiver d5a6b0f99a queue: Rename queue_size, and update for all local queues.
Despite its name, the `queue_size` method does not return the number
of items in the queue; it returns the number of items that the local
consumer has delivered but unprocessed.  These are often, but not
always, the same.

RabbitMQ's queues maintain the queue of unacknowledged messages; when
a consumer connects, it sends to the consumer some number of messages
to handle, known as the "prefetch."  This is a performance
optimization, to ensure the consumer code does not need to wait for a
network round-trip before having new data to consume.

The default prefetch is 0, which means that RabbitMQ immediately dumps
all outstanding messages to the consumer, which slowly processes and
acknowledges them.  If a second consumer were to connect to the same
queue, they would receive no messages to process, as the first
consumer has already been allocated them.  If the first consumer
disconnects or crashes, all prior events sent to it are then made
available for other consumers on the queue.

The consumer does not know the total size of the queue -- merely how
many messages it has been handed.

No change is made to the prefetch here; however, future changes may
wish to limit the prefetch, either for memory-saving, or to allow
multiple consumers to work the same queue.

Rename the method to make clear that it only contains information
about the local queue in the consumer, not the full RabbitMQ queue.
Also include the waiting message count, which is used by the
`consume()` iterator for similar purpose to the pending events list.
2020-10-09 20:40:39 -07:00
Alex Vandiver a1ce1aca3b queue: Update comment to be more accurate about import errors. 2020-10-09 20:40:32 -07:00
Alex Vandiver baf882a133 queue: Only ACK drain_queue once it has completed work on the list.
Currently, drain_queue and json_drain_queue ack every message as it is
pulled off of the queue, until the queue is empty.  This means that if
the consumer crashes between pulling a batch of messages off the
queue, and actually processing them, those messages will be
permanently lost.  Sending an ACK on every message also results in a
significant amount lot of traffic to rabbitmq, with notable
performance implications.

Send a singular ACK after the processing has completed, by making
`drain_queue` into a contextmanager.  Additionally, use the `multiple`
flag to ACK all of the messages at once -- or explicitly NACK the
messages if processing failed.  Sending a NACK will re-queue them at
the front of the queue.

Performance of a no-op dequeue before this change:
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10847 / sec
Dequeue rate: 2479 / sec
```
Performance of a no-op dequeue after this change (a 25% increase):
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10752 / sec
Dequeue rate: 3079 / sec
```
2020-10-06 17:26:14 -07:00
Alex Vandiver 2b6989a40f queue: Remove a no-longer-correct comment.
This comment stopped being true in 5686821150, and very much stopped
being relevant in dd40649e04 when the middleware entirely stopped
publishing to a queue.
2020-08-14 11:30:13 -07:00
Anders Kaseorg 61d0417e75 python: Replace ujson with orjson.
Fixes #6507.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-08-11 10:55:12 -07:00
Anders Kaseorg 23b815bb50 queue: Fix types to reflect that Pika channels receive bytes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-08-07 11:12:32 -07:00
Anders Kaseorg 489d73f63a queue: Fix strict_optional errors.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-07-06 11:25:48 -07:00
Anders Kaseorg 1ed2d9b4a0 logging: Use logging.exception and exc_info for unexpected exceptions.
logging.exception() and logging.debug(exc_info=True),
etc. automatically include a traceback.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-14 23:27:22 -07:00
Anders Kaseorg 4b6d2cf25f logging: Pass more format arguments to logging.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-14 23:27:22 -07:00
Anders Kaseorg 365fe0b3d5 python: Sort imports with isort.
Fixes #2665.

Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.

Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start.  I expect this change will increase pressure for us to split
those files, which isn't a bad thing.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-11 16:45:32 -07:00
Anders Kaseorg 67e7a3631d python: Convert percent formatting to Python 3.6 f-strings.
Generated by pyupgrade --py36-plus.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-10 15:02:09 -07:00
Anders Kaseorg 19cc22e5ab queue: Fix types to reflect that Pika channels transmit bytes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-07 11:09:24 -07:00
Anders Kaseorg bdc365d0fe logging: Pass format arguments to logging.
https://docs.python.org/3/howto/logging.html#optimization

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-05-02 10:18:02 -07:00
Anders Kaseorg fead14951c python: Convert assignment type annotations to Python 3.6 style.
This commit was split by tabbott; this piece covers the vast majority
of files in Zulip, but excludes scripts/, tools/, and puppet/ to help
ensure we at least show the right error messages for Xenial systems.

We can likely further refine the remaining pieces with some testing.

Generated by com2ann, with whitespace fixes and various manual fixes
for runtime issues:

-    invoiced_through: Optional[LicenseLedger] = models.ForeignKey(
+    invoiced_through: Optional["LicenseLedger"] = models.ForeignKey(

-_apns_client: Optional[APNsClient] = None
+_apns_client: Optional["APNsClient"] = None

-    notifications_stream: Optional[Stream] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE)
-    signup_notifications_stream: Optional[Stream] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE)
+    notifications_stream: Optional["Stream"] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE)
+    signup_notifications_stream: Optional["Stream"] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE)

-    author: Optional[UserProfile] = models.ForeignKey('UserProfile', blank=True, null=True, on_delete=CASCADE)
+    author: Optional["UserProfile"] = models.ForeignKey('UserProfile', blank=True, null=True, on_delete=CASCADE)

-    bot_owner: Optional[UserProfile] = models.ForeignKey('self', null=True, on_delete=models.SET_NULL)
+    bot_owner: Optional["UserProfile"] = models.ForeignKey('self', null=True, on_delete=models.SET_NULL)

-    default_sending_stream: Optional[Stream] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE)
-    default_events_register_stream: Optional[Stream] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE)
+    default_sending_stream: Optional["Stream"] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE)
+    default_events_register_stream: Optional["Stream"] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE)

-descriptors_by_handler_id: Dict[int, ClientDescriptor] = {}
+descriptors_by_handler_id: Dict[int, "ClientDescriptor"] = {}

-worker_classes: Dict[str, Type[QueueProcessingWorker]] = {}
-queues: Dict[str, Dict[str, Type[QueueProcessingWorker]]] = {}
+worker_classes: Dict[str, Type["QueueProcessingWorker"]] = {}
+queues: Dict[str, Dict[str, Type["QueueProcessingWorker"]]] = {}

-AUTH_LDAP_REVERSE_EMAIL_SEARCH: Optional[LDAPSearch] = None
+AUTH_LDAP_REVERSE_EMAIL_SEARCH: Optional["LDAPSearch"] = None

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-04-22 11:02:32 -07:00
Mateusz Mandera 5252b081bd queue_processors: Gather statistics on queue worker operations. 2020-04-01 16:44:06 -07:00
Anders Kaseorg a681ca6cf5 queue: Update error callback signatures for Pika 1.1.
The expected signatures for these callbacks seem to have changed
somewhere in https://github.com/pika/pika/pull/1002.

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2019-11-20 17:23:48 -08:00
Andrew Szeto b312001fd9 rabbitmq: Set a short TCP keepalive idle time on BlockingConnection.
The code comment explains this issue in some detail, but essentially
in Kubernetes and Docker Swarm systems, the container overlayer
network has a relatively short TCP idle lifetime (about 15 minutes),
which can lead to it killing the connection between Tornado and
RabbitMQ.

We fix this by setting a TCP keepalive on that connection shorter than
15 minutes.

Fixes #10776.
2019-10-30 16:15:44 -07:00
Rafid Aslam 447f74ae63 Upgrade pika to 1.1.*.
Upgrade pika to 1.1.* and make some changes accordingly
to comply with the new version.

Fixes #12899.
2019-10-29 17:01:12 -07:00
neiljp (Neil Pilgrim) ba7a0934e3 requirements: Upgrade mypy to 0.711.
This comes with it a big performance improvement; mypy is now only
barely our slowest linter even if it wasn't previously running.

Fixes: #12058
2019-07-22 17:12:50 -07:00
Wyatt Hoodes 5686821150 middleware: Change write_log_line to publish as a dict.
We were seeing errors when pubishing typical events in the form of
`Dict[str, Any]` as the expected type to be a `Union`.  So we instead
change the only non-dictionary call, to pass a dict instead of `str`.
2019-07-22 17:06:41 -07:00
Vishnu Ks 0d0007742f requirements: Upgrade pika from 0.12.0 to 0.13.0.
The important changes to pika for us are based on this PR of ours:
https://github.com/pika/pika/pull/1129

Fixes #11394.
2019-01-31 10:04:07 -08:00
Tim Abbott c94deff920 mypy: Remove some now-unnecessary type: ignores. 2018-03-28 10:39:05 -07:00
Greg Price 73559e5320 queue: Suppress error mail from brief rabbitmq downtimes.
Details in comment.  Together with a few previous commits, this should
completely eliminate sending error mail to admins when the RabbitMQ
server is simply restarted and comes back up normally.
2018-03-21 18:03:05 -07:00
Greg Price 3b3154527f queue: Don't blow up when a connection closes quickly. 2018-03-20 16:49:05 -07:00
Greg Price 9dcc436766 queue: Fix __init__ logic so heartbeat choice works fully.
Because the base class's __init__ calls `_connect`, when we set the
value after that call has already returned, our new value only takes
effect if the first connection fails and we have to reconnect.
Make it take effect from the beginning.
2018-03-20 16:49:05 -07:00
Greg Price 5edc26a0df queue: Cut disused, broken parameter to `_connect`.
This parameter isn't used anywhere.  A good thing, because if it were,
the code would immediately raise an exception -- `self._on_open_cbs`
hasn't been initialized yet when we first call `_connect`, from the
base class's `__init__`.

So, just cut it.  If we later need something like this, it's easy to
add a working version then.
2018-03-20 16:49:05 -07:00
Greg Price 4926228071 rabbitmq: Do a better job of retrying failed connections.
Empirically, the retry in `_on_connection_closed` didn't actually work
-- if a reconnect failed, that was it, and the exception handler
didn't get run.  A traceback would get logged, but all its frames were
in Tornado or Pika, not our own code; presumably something magic and
async was happening to the exception.

Moreover, though we would make one attempt to reconnect if we had a
connection that got closed, we didn't have any form of retry if the
original attempt at connecting failed in the first place.

Happily, upstream offers a perfectly reasonable bit of API that avoids
both of these problems: the on-open-error callback.  So use that.
2017-11-29 16:56:29 -08:00
Greg Price 7ac2b58584 rabbitmq: Reorder a bit to group our reconnect logic together. 2017-11-29 16:56:29 -08:00
Greg Price c32b16715d tornado: Use spiffy new `call_later` rather than `add_timeout`.
This method was new in Tornado 4.0.  It saves us from having to get
the time ourselves and do the arithmetic -- which not only makes the
code a bit shorter, but also easier to get right.  Tornado docs (see
http://www.tornadoweb.org/en/stable/ioloop.html) say we should have
been getting the time from `ioloop.time()` rather than hardcoding
`time.time()`, because the loop could e.g. be running on the
`time.monotonic()` clock.
2017-11-29 16:56:29 -08:00
Greg Price 73886f57d2 zerver/lib/queue: Clean up import order. 2017-11-29 16:56:29 -08:00
Greg Price 3c4e4c14c9 rabbitmq: Add on-close callback atomically in creating the connection.
Adding it afterward is inherently racy, and upstream's API is quite
reasonable for avoiding that -- just like we can pass an on-open
callback up front, we can do the same with the on-close callback.

This is a more thorough version of 4adf2d5c2 from back in 2013-04.
2017-11-29 16:56:29 -08:00
Greg Price e88c2a7ee4 rabbitmq: Cut redundant `stop_ioloop_on_close` parameter.
The default value of this parameter is already False upstream.
(It was already False in pika version 0.9.6, which we were
supposedly using when we introduced this in 4baeaaa52; not sure
what the story was there.)
2017-11-29 16:56:29 -08:00