Since 3853285241, PushNotificationsWorker uses the aioapns library
to send Apple push notifications. This introduces an asyncio event
loop into this worker process, which, if unlucky, can respond poorly
when a SIGALRM is introduced to it:
```
[asyncio] Task exception was never retrieved
future: <Task finished coro=<send_apple_push_notification.<locals>.attempt_send() done, defined at /path/to/zerver/lib/push_notifications.py:166> exception=WorkerTimeoutException(30, 1)>
Traceback (most recent call last):
File "/path/to/zerver/lib/push_notifications.py", line 169, in attempt_send
result = await apns_context.apns.send_notification(request)
File "/path/to/zulip-py3-venv/lib/python3.6/site-packages/aioapns/client.py", line 57, in send_notification
response = await self.pool.send_notification(request)
File "/path/to/zulip-py3-venv/lib/python3.6/site-packages/aioapns/connection.py", line 407, in send_notification
response = await connection.send_notification(request)
File "/path/to/zulip-py3-venv/lib/python3.6/site-packages/aioapns/connection.py", line 189, in send_notification
data = json.dumps(request.message, ensure_ascii=False).encode()
File "/usr/lib/python3.6/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/usr/lib/python3.6/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python3.6/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/path/to/zerver/worker/queue_processors.py", line 353, in timer_expired
raise WorkerTimeoutException(limit, len(events))
zerver.worker.queue_processors.WorkerTimeoutException: Timed out after 30 seconds processing 1 events
```
...which subsequently leads to the worker failing to make any progress
on the queue.
Remove the timeout on the worker. This may result in failing to make
forward progress if Apple/Google take overly long handling requests,
but is likely preferable to failing to make forward progress if _one_
request takes too long and gets unlucky with when the signal comes
through.
This extends the invite api endpoints to handle an extra
argument, expiration duration, which states the number of
days before the invitation link expires.
For prereg users, expiration info is attached to event
object to pass it to invite queue processor in order to
create and send confirmation link.
In case of multiuse invites, confirmation links are
created directly inside do_create_multiuse_invite_link(),
For filtering valid user invites, expiration info stored in
Confirmation object is used, which is accessed by a prereg
user using reverse generic relations.
Fixes#16359.
This information can be gleaned from the stacktrace, but making it
explicit in the stringification makes it much easier to differentiate
types of errors at a glance, particularly in Sentry.
maybe_send_batched_emails handles batches of emails from different
users at once; as it processes each user's batch, it enqueues messages
onto the `email_senders` queue. If `handle_missedmessage_emails`
raises an exception when processing a single user's email, no events
are marked as handled -- including those that were already handled and
enqueued onto `email_senders`. This results in an increasing number
of users being sent repeated emails about the same missed messages.
Catch and log any exceptions when handling an individual user's
events. This guarantees forward progress, and that notifications are
sent at-most-once, not at-least-once.
We only have one query which will change database state in this function,
and we already have a lock on the process itself, so there's no need for
a transaction.
This was added in ebb4eab0f9.
Previously, we stored up to 2 minutes worth of email events in memory
before processing them. So, if the server were to go down we would lose
those events.
To fix this, we store the events in the database.
This is a prep change for allowing users to set custom grace period for
email notifications, since the bug noted above will aggravate with
longer grace periods.
The `# nocoverage` was unnecessary apart from for the compatibility code,
so add a test for that code and remove the `# nocoverage`.
The `message_id` -> `message_ids` conversion was done in
9869153ae8.
This fixes a bug introduced in 95b46549e1
which made the worker simply log a warning about the timeout and then
continue consume()ing the event that should have also been interrupted.
The idea here is to introduce an exception which can be used to
interrupt the consume() process without triggering the regular handling
of exceptions that happens in _handle_consume_exception.
Throwing an exception is excessive in case of this worker, as it's
expected for it to time out sometimes if the urls take too long to
process.
With a test added by tabbott.
This allows specific queue workers to override the defaut behavior and
implement their own response to the timer expiring. We will want to use
this for embed_links queue at least.
This adds a new class called MessageRenderingResult to contain the
additional properties we added to the Message object (like alert_words)
as well as the rendered content to ensure typesafe reference. No
behavioral change is made except changes in typing.
This is a preparatory change for adding django-stubs to the backend.
Related: #18777
This is a prep commit for adding realm-level default for various
user settings. We add the language, in which the invite email will
be sent, to the dict added to queue itself to avoid making queries
in a loop when sending multiple emails from queue.
We also handle the case for old events in the queue.
Django's default SMTP implementation can raise various exceptions
when trying to send an email. In order to allow Zulip calling code
to catch fewer exceptions to handle any cause of "email not
sent", we translate most of them into EmailNotDeliveredException.
The non-translated exceptions concern the connection with the
SMTP server. They were not merged with the rest to keep some
details about the nature of these.
Tests are implemented in the test_send_email.py module.
This will stop dropping events in the case that the background
`maybe_send_batched_email` thread takes longer than 30s. However, see
also #15280 and the TODO comment about how we lose events upon
restart; this worker is still lossy.
Previously the outgoing emails were sent over several SMTP
connections through the EmailSendingWorker; establishing a new
connection each time adds notable overhead.
Redefine EmailSendingWorker worker to be a LoopQueueProcessingWorker,
which allows it to handle batches of events. At the same time, persist
the connection across email sending, if possible.
The connection is initialized in the constructor of the worker
in order to keep the same connection throughout the whole process.
The concrete implementation of the consume_batch function is simply
processing each email one at a time until they have all been sent.
In order to reuse the previously implemented decorator to retry
sending failures a new method that meets the decorator's required
arguments is declared inside the EmailSendingWorker class. This
allows to retry the sending process of a particular email inside
the batch if the caught exception leaves this process retriable.
A second retry mechanism is used inside the initialize_connection
function to redo the opening of the connection until it works or
until three attempts failed. For this purpose the backoff module
has been added to the dependencies and a test has been added to
ensure that this retry mechanism works well.
The connection is closed when the stop method is called.
Fixes: #17672.
This was introduced in 8321bd3f92 to serve as a sort of drop-in
replacement for zerver.lib.queue.queue_json_publish, but its use has
been subsequently cut out (e.g. `9fcdb6c83ac5`).
Remote its last callsite.
django.utils.translation.ugettext is a deprecated alias of
django.utils.translation.gettext as of Django 3.0, and will be removed
in Django 4.0.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The existing organization, of returning an opaque blob from
`build_bot_request`, which was later consumed by
`send_data_to_server`, is not particularly sensible; the steps become
oddly split between the OutgoingWebhookWorker, `do_rest_call`, and the
`OutgoingWebhookServiceInterface`.
Make the `OutgoingWebhookServiceInterface` in charge of building,
making, and returning the request in one method; another method
handles extracting content from a successful response. `do_rest_call`
is responsible for calling both halves of this, and doing common error
handling.
The `deployment` key was only set in `do_report_error`, which is now
only used in one codepath (the queue worker). The logging handlers on
staging call notify_server_error directly, which omits the
`deployment` key.
Remove the odd one-of key, and instead simply do dispatch in
`do_report_error`.
Not all of the workers are known to be safe to interrupt; they might
leave inconsistent state. As such, terminating them with timeouts
should currently only be a last-resort against stalled queues, not a
regular occurrence.
Since the exception can be triggered at arbitrary places in the stack
based on whenever the alarm happens to fire, they do not often group
together.
Explicitly group them together, grouped only by which queue the work
is in.
We already trust ids that are put on our queue
for deferred work. For example, see the code for
"mark_stream_messages_as_read_for_everyone"
We now pass stream_recipient_id when we queue
up work for do_mark_stream_messages_as_read.
This generally saves about 3 queries per
user when we unsubscribe them from a stream.