Django's default SMTP implementation can raise various exceptions
when trying to send an email. In order to allow Zulip calling code
to catch fewer exceptions to handle any cause of "email not
sent", we translate most of them into EmailNotDeliveredException.
The non-translated exceptions concern the connection with the
SMTP server. They were not merged with the rest to keep some
details about the nature of these.
Tests are implemented in the test_send_email.py module.
This will stop dropping events in the case that the background
`maybe_send_batched_email` thread takes longer than 30s. However, see
also #15280 and the TODO comment about how we lose events upon
restart; this worker is still lossy.
Previously the outgoing emails were sent over several SMTP
connections through the EmailSendingWorker; establishing a new
connection each time adds notable overhead.
Redefine EmailSendingWorker worker to be a LoopQueueProcessingWorker,
which allows it to handle batches of events. At the same time, persist
the connection across email sending, if possible.
The connection is initialized in the constructor of the worker
in order to keep the same connection throughout the whole process.
The concrete implementation of the consume_batch function is simply
processing each email one at a time until they have all been sent.
In order to reuse the previously implemented decorator to retry
sending failures a new method that meets the decorator's required
arguments is declared inside the EmailSendingWorker class. This
allows to retry the sending process of a particular email inside
the batch if the caught exception leaves this process retriable.
A second retry mechanism is used inside the initialize_connection
function to redo the opening of the connection until it works or
until three attempts failed. For this purpose the backoff module
has been added to the dependencies and a test has been added to
ensure that this retry mechanism works well.
The connection is closed when the stop method is called.
Fixes: #17672.
This was introduced in 8321bd3f92 to serve as a sort of drop-in
replacement for zerver.lib.queue.queue_json_publish, but its use has
been subsequently cut out (e.g. `9fcdb6c83ac5`).
Remote its last callsite.
django.utils.translation.ugettext is a deprecated alias of
django.utils.translation.gettext as of Django 3.0, and will be removed
in Django 4.0.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The existing organization, of returning an opaque blob from
`build_bot_request`, which was later consumed by
`send_data_to_server`, is not particularly sensible; the steps become
oddly split between the OutgoingWebhookWorker, `do_rest_call`, and the
`OutgoingWebhookServiceInterface`.
Make the `OutgoingWebhookServiceInterface` in charge of building,
making, and returning the request in one method; another method
handles extracting content from a successful response. `do_rest_call`
is responsible for calling both halves of this, and doing common error
handling.
The `deployment` key was only set in `do_report_error`, which is now
only used in one codepath (the queue worker). The logging handlers on
staging call notify_server_error directly, which omits the
`deployment` key.
Remove the odd one-of key, and instead simply do dispatch in
`do_report_error`.
Not all of the workers are known to be safe to interrupt; they might
leave inconsistent state. As such, terminating them with timeouts
should currently only be a last-resort against stalled queues, not a
regular occurrence.
Since the exception can be triggered at arbitrary places in the stack
based on whenever the alarm happens to fire, they do not often group
together.
Explicitly group them together, grouped only by which queue the work
is in.
We already trust ids that are put on our queue
for deferred work. For example, see the code for
"mark_stream_messages_as_read_for_everyone"
We now pass stream_recipient_id when we queue
up work for do_mark_stream_messages_as_read.
This generally saves about 3 queries per
user when we unsubscribe them from a stream.
Since this was using repead individual get() calls previously, it
could not be monitored for having a consumer. Add it in, by marking
it of queue type "consumer" (the default), and adding Nagios lines for
it.
Also adjust missedmessage_emails to be monitored; it stopped using
LoopQueueProcessingWorker in 5cec566cb9, but was never added back
into the set of monitored consumers.
This low-level interface allows consuming from a queue with timeouts.
This can be used to either consume in batches (with an upper timeout),
or one-at-a-time. This is notably more performant than calling
`.get()` repeatedly (what json_drain_queue does under the hood), which
is "*highly discouraged* as it is *very inefficient*"[1].
Before this change:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11158 / sec
Dequeue rate: 3075 / sec
```
After:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11511 / sec
Dequeue rate: 19938 / sec
```
[1] https://www.rabbitmq.com/consumers.html#fetching
Despite its name, the `queue_size` method does not return the number
of items in the queue; it returns the number of items that the local
consumer has delivered but unprocessed. These are often, but not
always, the same.
RabbitMQ's queues maintain the queue of unacknowledged messages; when
a consumer connects, it sends to the consumer some number of messages
to handle, known as the "prefetch." This is a performance
optimization, to ensure the consumer code does not need to wait for a
network round-trip before having new data to consume.
The default prefetch is 0, which means that RabbitMQ immediately dumps
all outstanding messages to the consumer, which slowly processes and
acknowledges them. If a second consumer were to connect to the same
queue, they would receive no messages to process, as the first
consumer has already been allocated them. If the first consumer
disconnects or crashes, all prior events sent to it are then made
available for other consumers on the queue.
The consumer does not know the total size of the queue -- merely how
many messages it has been handed.
No change is made to the prefetch here; however, future changes may
wish to limit the prefetch, either for memory-saving, or to allow
multiple consumers to work the same queue.
Rename the method to make clear that it only contains information
about the local queue in the consumer, not the full RabbitMQ queue.
Also include the waiting message count, which is used by the
`consume()` iterator for similar purpose to the pending events list.
Otherwise, if consume_func raised an exception for any reason *other*
than the alarm being fired, the still-pending alarm would have fired
later at some arbitrary point in the calling code.
We need two try…finally blocks in case the signal arrives just before
signal.alarm(0).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
SIGALRM is the simplest way to set a specific maximum duration that
queue workers can take to handle a specific message. This only works
in non-threaded environments, however, as signal handlers are
per-process, not per-thread.
The MAX_CONSUME_SECONDS is set quite high, at 10s -- the longest
average worker consume time is embed_links, which hovers near 1s.
Since just knowing the recent mean does not give much information[1],
it is difficult to know how much variance is expected. As such, we
set the threshold to be such that only events which are significant
outliers will be timed out. This can be tuned downwards as more
statistics are gathered on the runtime of the workers.
The exception to this is DeferredWorker, which deals with quite-long
requests, and thus has no enforceable SLO.
[1] https://www.autodesk.com/research/publications/same-stats-different-graphs
Currently, drain_queue and json_drain_queue ack every message as it is
pulled off of the queue, until the queue is empty. This means that if
the consumer crashes between pulling a batch of messages off the
queue, and actually processing them, those messages will be
permanently lost. Sending an ACK on every message also results in a
significant amount lot of traffic to rabbitmq, with notable
performance implications.
Send a singular ACK after the processing has completed, by making
`drain_queue` into a contextmanager. Additionally, use the `multiple`
flag to ACK all of the messages at once -- or explicitly NACK the
messages if processing failed. Sending a NACK will re-queue them at
the front of the queue.
Performance of a no-op dequeue before this change:
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10847 / sec
Dequeue rate: 2479 / sec
```
Performance of a no-op dequeue after this change (a 25% increase):
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10752 / sec
Dequeue rate: 3079 / sec
```
This system can't update stats while the queue is idle, without using
threads for this, but at least we ensure to update the file after
consuming an event if more than MAX_SECONDS_BEFORE_UPDATE_STATS passed
since the last update, regardless of the number of iterations done so
far.
The race condition is described in the comment block removed by this
commit. This leaves room for another, remaining race condition
that should be virtually impossible, but nevertheless it seems
worthwhile to have it documented in the code, so we put a new comment
describing it.
As a final note, this is not a new race condition,
it was hypothetically possible with the old code as well.