maybe_send_batched_emails handles batches of emails from different
users at once; as it processes each user's batch, it enqueues messages
onto the `email_senders` queue. If `handle_missedmessage_emails`
raises an exception while processing a single user's email, no events
are marked as handled -- including those that were already processed
and enqueued onto `email_senders`. As a result, a growing number of
users are sent repeated emails about the same missed messages.
Catch and log any exceptions while handling an individual user's
events. This guarantees forward progress and ensures that
notifications are sent at most once, not at least once.
Only one query in this function changes database state, and we already
hold a lock on the process itself, so there is no need for a
transaction.
This was added in ebb4eab0f9.
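A minimal sketch of the fix; the surrounding helper names here are
illustrative, not the exact Zulip internals:

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)

def maybe_send_batched_emails(events_by_user: Dict[int, List[Dict[str, Any]]]) -> None:
    for user_profile_id, events in events_by_user.items():
        try:
            # handle_missedmessage_emails enqueues onto `email_senders`.
            handle_missedmessage_emails(user_profile_id, events)
        except Exception:
            # One user's failure must not block other users' batches,
            # nor cause already-enqueued notifications to be re-sent.
            logger.exception(
                "Error sending missedmessage emails to user %s",
                user_profile_id,
            )
        # Mark the events handled whether or not sending succeeded,
        # giving at-most-once delivery (delete_handled_events is a
        # hypothetical stand-in for the actual cleanup query).
        delete_handled_events(user_profile_id)
```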
Previously, we stored up to 2 minutes' worth of email events in memory
before processing them, so if the server went down, we would lose
those events. To fix this, we now store the events in the database.
This is a prep change for allowing users to set a custom grace period
for email notifications, since the bug noted above would be aggravated
by longer grace periods.
The `# nocoverage` was unnecessary except for the compatibility code,
so add a test for that code and remove the `# nocoverage`.
The `message_id` -> `message_ids` conversion was done in
9869153ae8.
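The compatibility code in question amounts to something like this (the
helper name is illustrative):

```python
from typing import Any, Dict, List

def event_message_ids(event: Dict[str, Any]) -> List[int]:
    if "message_ids" in event:
        return event["message_ids"]
    # Old-format event, enqueued before the message_id -> message_ids
    # conversion in 9869153ae8.
    return [event["message_id"]]
```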
This fixes a bug introduced in 95b46549e1
which made the worker simply log a warning about the timeout and then
continue running consume() on the event that should also have been
interrupted.
The idea here is to introduce an exception which can be used to
interrupt the consume() process without triggering the regular handling
of exceptions that happens in _handle_consume_exception.
Raising an exception is excessive in the case of this worker, as it is
expected to time out occasionally when URLs take too long to process.
With a test added by tabbott.
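An illustrative sketch of the mechanism; the names are patterned on
the commit text rather than copied from the Zulip source:

```python
from typing import Any, Callable, Dict, List

class InterruptConsumeException(Exception):
    """Raised inside consume() to abandon the current event without
    going through the regular exception handling."""

class QueueProcessingWorker:
    def do_consume(
        self,
        consume_func: Callable[[List[Dict[str, Any]]], None],
        events: List[Dict[str, Any]],
    ) -> None:
        try:
            consume_func(events)
        except InterruptConsumeException:
            # An expected interruption (e.g. a URL-preview timeout):
            # skip the event quietly and move on.
            pass
        except Exception:
            self._handle_consume_exception(events)

    def _handle_consume_exception(self, events: List[Dict[str, Any]]) -> None:
        # Regular failure path: log the error, save the events for
        # later inspection, etc. (details omitted).
        ...
```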
This allows specific queue workers to override the default behavior and
implement their own response to the timer expiring. We will want to use
this for the embed_links queue at least.
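Sketched with assumed signatures, reusing InterruptConsumeException
from the sketch above:

```python
from typing import Any, Dict, List

class WorkerTimeoutException(Exception):
    pass

class QueueProcessingWorker:
    def timer_expired(self, limit: int, events: List[Dict[str, Any]]) -> None:
        # Default behavior: escalate, so a stalled worker is noticed.
        raise WorkerTimeoutException(limit, len(events))

class FetchLinksEmbedData(QueueProcessingWorker):
    def timer_expired(self, limit: int, events: List[Dict[str, Any]]) -> None:
        # URL fetches are expected to exceed the limit occasionally;
        # interrupt consume() quietly instead of raising an error.
        raise InterruptConsumeException
```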
This adds a new class called MessageRenderingResult to contain the
additional properties we added to the Message object (like alert_words),
as well as the rendered content, to ensure type-safe references. No
behavioral change is made except changes in typing.
This is a preparatory change for adding django-stubs to the backend.
Related: #18777
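A sketch of the new container's shape, assuming dataclass-style
fields; the exact attribute list may differ:

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class MessageRenderingResult:
    rendered_content: str
    mentions_user_ids: Set[int] = field(default_factory=set)
    alert_words: Set[str] = field(default_factory=set)
    # ...plus the other properties that were previously tacked onto
    # the Message object dynamically.
```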
This is a prep commit for adding realm-level defaults for various
user settings. We add the language in which the invite email will
be sent to the dict added to the queue itself, to avoid making
queries in a loop when sending multiple emails from the queue.
We also handle the case of old events in the queue.
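Both halves of the change, sketched with assumed key and helper names:

```python
from typing import Any, Dict

def invite_email_language(event: Dict[str, Any]) -> str:
    if "email_language" in event:
        # New-style event: the language was computed once at enqueue
        # time, not per email in the sending loop.
        return event["email_language"]
    # Old event, enqueued before this change: fall back to a per-event
    # lookup (lookup_realm_language is a hypothetical stand-in).
    return lookup_realm_language(event["referrer_id"])
```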
Django's default SMTP implementation can raise various exceptions
when trying to send an email. To allow Zulip calling code to catch
fewer exceptions while handling any cause of "email not sent", we
translate most of them into EmailNotDeliveredException.
The exceptions left untranslated concern the connection with the
SMTP server; they were not merged with the rest, to keep some
detail about the nature of these failures.
Tests are implemented in the test_send_email.py module.
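The rough shape of the translation layer (the precise exception list
may differ):

```python
import smtplib

from django.core.mail import EmailMultiAlternatives

class EmailNotDeliveredException(Exception):
    pass

def send_email(msg: EmailMultiAlternatives) -> None:
    try:
        if msg.send() == 0:
            raise EmailNotDeliveredException("Message was not delivered")
    except (smtplib.SMTPDataError, smtplib.SMTPRecipientsRefused) as e:
        # Most SMTP failures are collapsed into a single exception
        # type that callers can catch uniformly...
        raise EmailNotDeliveredException from e
    except (smtplib.SMTPConnectError, smtplib.SMTPServerDisconnected):
        # ...but connection-level errors are re-raised unchanged, to
        # keep the details about the nature of the failure.
        raise
```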
This will stop dropping events in the case that the background
maybe_send_batched_emails thread takes longer than 30s. However, see
also #15280 and the TODO comment about how we lose events upon
restart; this worker is still lossy.
Previously, outgoing emails were sent over several SMTP
connections through the EmailSendingWorker; establishing a new
connection each time adds notable overhead.
Redefine the EmailSendingWorker to be a LoopQueueProcessingWorker,
which allows it to handle batches of events. At the same time, persist
the connection across email sending, when possible.
The connection is initialized in the constructor of the worker
in order to keep the same connection throughout the whole process.
The concrete implementation of consume_batch simply processes each
email one at a time until all have been sent.
To reuse the previously implemented decorator for retrying sending
failures, a new method matching the decorator's required arguments
is declared inside the EmailSendingWorker class. This allows
retrying the send of a particular email in the batch when the caught
exception leaves that send retriable.
A second retry mechanism is used inside the initialize_connection
function to reattempt opening the connection until it succeeds or
until three attempts have failed. For this purpose, the backoff
module has been added to the dependencies, and a test has been added
to ensure that this retry mechanism works correctly.
The connection is closed when the stop method is called.
Fixes: #17672.
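The connection-retry half, sketched with the backoff module (the
function shape is an approximation):

```python
import backoff
from django.core import mail

@backoff.on_exception(backoff.expo, OSError, max_tries=3)
def initialize_connection(connection=None):
    if connection is None:
        connection = mail.get_connection()
    # For the SMTP backend, open() raises on failure; smtplib errors
    # are OSError subclasses, so backoff retries up to three times
    # with exponential backoff before giving up.
    connection.open()
    return connection
```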
This was introduced in 8321bd3f92 to serve as a sort of drop-in
replacement for zerver.lib.queue.queue_json_publish, but its uses have
subsequently been cut out (e.g. `9fcdb6c83ac5`).
Remove its last callsite.
django.utils.translation.ugettext is a deprecated alias of
django.utils.translation.gettext as of Django 3.0, and will be removed
in Django 4.0.
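The replacement is a direct rename:

```python
# Before (removed in Django 4.0):
from django.utils.translation import ugettext as _
# After:
from django.utils.translation import gettext as _
```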
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The existing organization, of returning an opaque blob from
`build_bot_request`, which was later consumed by
`send_data_to_server`, is not particularly sensible; the steps become
oddly split between the OutgoingWebhookWorker, `do_rest_call`, and the
`OutgoingWebhookServiceInterface`.
Make the `OutgoingWebhookServiceInterface` in charge of building,
making, and returning the request in one method; another method
handles extracting content from a successful response. `do_rest_call`
is responsible for calling both halves of this, and doing common error
handling.
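A sketch of the reorganized interface; the method names approximate
the two halves described above:

```python
from typing import Any, Dict, Optional

import requests

class OutgoingWebhookServiceInterface:
    def make_request(
        self, base_url: str, event: Dict[str, Any]
    ) -> Optional[requests.Response]:
        # Build the payload and perform the HTTP request in one place.
        raise NotImplementedError

    def process_success(
        self, response_json: Dict[str, Any]
    ) -> Optional[Dict[str, Any]]:
        # Extract the reply content, if any, from a successful response.
        raise NotImplementedError
```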
The `deployment` key was only set in `do_report_error`, which is now
only used in one codepath (the queue worker). The logging handlers on
staging call notify_server_error directly, which omits the
`deployment` key.
Remove the odd one-off key, and instead simply do dispatch in
`do_report_error`.
Not all of the workers are known to be safe to interrupt; they might
leave inconsistent state. As such, terminating them with timeouts
should currently only be a last resort against stalled queues, not a
regular occurrence.
Since the exception can be triggered at arbitrary places in the stack,
based on wherever the alarm happens to fire, these errors do not often
group together.
Explicitly group them together, grouped only by which queue the work
is in.
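One way to get that grouping, assuming Sentry is the error reporter in
play: pin the event fingerprint to the queue name, so the arbitrary
stack position of the alarm no longer matters:

```python
import sentry_sdk

def report_worker_timeout(queue_name: str, exc: Exception) -> None:
    # Group all timeouts for the same queue into one Sentry issue,
    # regardless of where in the stack the SIGALRM fired.
    with sentry_sdk.push_scope() as scope:
        scope.fingerprint = ["worker-timeout", queue_name]
        sentry_sdk.capture_exception(exc)
```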
We already trust ids that are put on our queue for deferred work. For
example, see the code for "mark_stream_messages_as_read_for_everyone".
We now pass stream_recipient_id when we queue
up work for do_mark_stream_messages_as_read.
This generally saves about 3 queries per
user when we unsubscribe them from a stream.
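Sketch of the enriched event; the payload fields are assumptions based
on the commit text:

```python
from zerver.lib.queue import queue_json_publish

def queue_mark_stream_messages_as_read(
    user_profile_id: int, stream_recipient_id: int
) -> None:
    queue_json_publish(
        "deferred_work",
        {
            "type": "mark_stream_messages_as_read",
            "user_profile_id": user_profile_id,
            # Carrying the recipient id in the event saves the consumer
            # the queries needed to derive it again.
            "stream_recipient_id": stream_recipient_id,
        },
    )
```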
Since this was previously using repeated individual get() calls, it
could not be monitored for having a consumer. Add it in, by marking
it as queue type "consumer" (the default), and adding Nagios lines
for it.
Also adjust missedmessage_emails to be monitored; it stopped using
LoopQueueProcessingWorker in 5cec566cb9, but was never added back
into the set of monitored consumers.
This low-level interface allows consuming from a queue with timeouts.
This can be used either to consume in batches (with an upper timeout)
or one at a time. This is notably more performant than calling
`.get()` repeatedly (what json_drain_queue does under the hood), which
is "*highly discouraged* as it is *very inefficient*"[1].
Before this change:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11158 / sec
Dequeue rate: 3075 / sec
```
After:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11511 / sec
Dequeue rate: 19938 / sec
```
[1] https://www.rabbitmq.com/consumers.html#fetching
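Roughly what the new interface does with pika's BlockingChannel (names
and structure are approximations; acknowledgment handling is omitted):

```python
from typing import List

def consume_batch(
    channel, queue_name: str, batch_size: int, timeout: float
) -> List[bytes]:
    batch: List[bytes] = []
    for method, properties, body in channel.consume(
        queue_name, inactivity_timeout=timeout
    ):
        if method is None:
            # inactivity_timeout expired with nothing new to deliver;
            # flush whatever has been collected so far.
            break
        batch.append(body)
        if len(batch) >= batch_size:
            break
    channel.cancel()
    return batch
```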
Despite its name, the `queue_size` method does not return the number
of items in the queue; it returns the number of items that have been
delivered to the local consumer but not yet processed. These are
often, but not always, the same.
RabbitMQ's queues maintain the queue of unacknowledged messages; when
a consumer connects, the server sends it some number of messages to
handle, known as the "prefetch." This is a performance
optimization, to ensure the consumer code does not need to wait for a
network round-trip before having new data to consume.
The default prefetch is 0, which means that RabbitMQ immediately dumps
all outstanding messages to the consumer, which slowly processes and
acknowledges them. If a second consumer were to connect to the same
queue, they would receive no messages to process, as the first
consumer has already been allocated them. If the first consumer
disconnects or crashes, all prior events sent to it are then made
available for other consumers on the queue.
The consumer does not know the total size of the queue -- merely how
many messages it has been handed.
No change is made to the prefetch here; however, future changes may
wish to limit the prefetch, either for memory-saving, or to allow
multiple consumers to work the same queue.
Rename the method to make clear that it reports only information about
the local queue in the consumer, not the full RabbitMQ queue.
Also include the waiting message count, which is used by the
`consume()` iterator for a purpose similar to the pending events list.
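Sketched against pika's BlockingChannel (pending_events is an assumed
stand-in for the consumer's delivered-but-unprocessed list):

```python
def local_queue_size(channel, pending_events) -> int:
    # Delivered-but-unprocessed events, plus messages pika has already
    # buffered locally for the consume() generator. This measures the
    # local consumer's state, not the server-side queue depth.
    return len(pending_events) + channel.get_waiting_message_count()
```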
Otherwise, if consume_func raised an exception for any reason *other*
than the alarm being fired, the still-pending alarm would have fired
later at some arbitrary point in the calling code.
We need two try…finally blocks in case the signal arrives just before
signal.alarm(0).
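The pattern, sketched with assumed names; the inner try/finally clears
the alarm, and the outer one still restores the default handler if the
alarm fires in the narrow window before signal.alarm(0) runs:

```python
import signal
from typing import Callable

class TimeoutExpired(Exception):
    pass

def run_with_timeout(func: Callable[[], None], seconds: int) -> None:
    def handler(signum, frame):
        raise TimeoutExpired

    signal.signal(signal.SIGALRM, handler)
    try:
        try:
            signal.alarm(seconds)
            func()
        finally:
            signal.alarm(0)
    finally:
        signal.signal(signal.SIGALRM, signal.SIG_DFL)
```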
Signed-off-by: Anders Kaseorg <anders@zulip.com>