Commit Graph

120 Commits

Author SHA1 Message Date
Alex Vandiver 3efc0c9af3 workers: Rewrite missedmessage_emails with a worker thread.
The previous implementation leaked database connections, as a new
thread (and thus a new thread-local database connection) was made for
each timer execution.  While these connections were relatively
lightweight in Python, they also incur memory overhead in the
PostgreSQL server itself.  The logic for managing the timer was also
unclear, and the unavoidable deadlock in the stopping logic was rather
unfortunate.

Rewrite with one explicit worker thread which handles the delayed
message sending.  The RabbitMQ consumer creates the database rows, and
notifies the worker to start its 5s timeout.  Because it is controlled
by a condition variable, it does not hold the lock while waiting, and
can be notified to exit.
2023-04-10 17:38:08 -07:00
Anders Kaseorg ff1971f5ad ruff: Fix SIM105 Use `contextlib.suppress` instead of try-except-pass.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-23 11:18:36 -08:00
Anders Kaseorg 73c4da7974 ruff: Fix N818 exception name should be named with an Error suffix.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-11-17 16:52:00 -08:00
Tim Abbott 8010d06f9e compatiblity: Delete obsolete compatibility code.
Both of these compatibility blocks can be deleted, since you can't
upgrade directly to any supported release from the versions where the
old event formats would be used.
2022-11-15 15:39:38 -08:00
Sahil Batra 61365fbe21 invites: Use expiration time in minutes instead of days.
This commit changes the invite API to accept invitation
expiration time in minutes since we are going to add a
custom option in further commits which would allow a user
to set expiration time in minutes, hours and weeks as well.
2022-04-20 13:31:37 -07:00
Anders Kaseorg ec6355389a actions: Split out zerver.actions.user_settings.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:34 -07:00
Anders Kaseorg b0ce4f1bce docs: Fix many spelling mistakes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-07 18:51:06 -08:00
Mateusz Mandera 8af7ffd9da rate_limit: Fix logging string when rate limiting email gateway.
realm.name is not the right "name" to log, we should use realm.subdomain
like everywhere else.
2021-11-22 10:28:56 -08:00
Alex Vandiver 7c3507feef queue: Allow passing down a prefetch count to pika. 2021-11-16 11:48:50 -08:00
Alex Vandiver 8d239f4984 tests: Move lambda in simulated_queue_client to help typing. 2021-11-16 11:48:50 -08:00
Alex Vandiver d1822b5630 tests: Move simulated_queue_client to the only test it is used in. 2021-11-16 11:48:50 -08:00
shanukun 8c1ea78d7d invite: Extend invite api for handling expiration duration.
This extends the invite api endpoints to handle an extra
argument, expiration duration, which states the number of
days before the invitation link expires.

For prereg users, expiration info is attached to event
object to pass it to invite queue processor in order to
create and send confirmation link.
In case of multiuse invites, confirmation links are
created directly inside do_create_multiuse_invite_link(),

For filtering valid user invites, expiration info stored in
Confirmation object is used, which is accessed by a prereg
user using reverse generic relations.

Fixes #16359.
2021-09-10 16:53:03 -07:00
Alex Vandiver 5b45f8a128 queue_processors: Include queue name in the timeout exception.
This information can be gleaned from the stacktrace, but making it
explicit in the stringification makes it much easier to differentiate
types of errors at a glance, particularly in Sentry.
2021-09-02 02:48:34 -07:00
Alex Vandiver 4d98b0552e missedmessage_emails: Ensure forward progress.
maybe_send_batched_emails handles batches of emails from different
users at once; as it processes each user's batch, it enqueues messages
onto the `email_senders` queue.  If `handle_missedmessage_emails`
raises an exception when processing a single user's email, no events
are marked as handled -- including those that were already handled and
enqueued onto `email_senders`.  This results in an increasing number
of users being sent repeated emails about the same missed messages.

Catch and log any exceptions when handling an individual user's
events.  This guarantees forward progress, and that notifications are
sent at-most-once, not at-least-once.
2021-08-20 07:21:39 -07:00
Anders Kaseorg 3665deb93a python: Remove unnecessary intermediate lists.
Generated automatically by pyupgrade.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-02 15:53:52 -07:00
Abhijeet Prasad Bodas dd5e12d112 MissedMessageWorker: Use custom batching periods from UserProfile. 2021-07-23 12:13:46 -07:00
Abhijeet Prasad Bodas 9fcb6e51ce MissedMessageWorker: Handle deleted messages.
The test for the try-except block is hacky. See the comment for
explaination.
2021-07-23 12:13:46 -07:00
Abhijeet Prasad Bodas ebb4eab0f9 worker: Rewrite MissedMessageWorker to not be lossy.
Previously, we stored up to 2 minutes worth of email events in memory
before processing them. So, if the server were to go down we would lose
those events.

To fix this, we store the events in the database.

This is a prep change for allowing users to set custom grace period for
email notifications, since the bug noted above will aggravate with
longer grace periods.
2021-07-13 17:21:38 -07:00
Mateusz Mandera 91b3ba6d66 test_queue_worker: Fix unnecessarily hard-coded user id. 2021-07-13 08:31:11 -07:00
Abhijeet Prasad Bodas e63e86dcb2 worker: Ensure complete coverage for PushNotificationsWorker.
The `# nocoverage` was unnecessary apart from for the compatibility code,
so add a test for that code and remove the `# nocoverage`.

The `message_id` -> `message_ids` conversion was done in
9869153ae8.
2021-07-13 08:30:31 -07:00
Tim Abbott ac568d303c test_queue_worker: Fix order-dependent assertions. 2021-07-06 14:37:16 -07:00
Mateusz Mandera 95b46549e1 embed_links: Only log warning if worker times out.
Throwing an exception is excessive in case of this worker, as it's
expected for it to time out sometimes if the urls take too long to
process.

With a test added by tabbott.
2021-07-06 14:17:24 -07:00
sahil839 37bf160298 queue_processor: Add langauge to the events added to invites queue.
This is a prep commit for adding realm-level default for various
user settings. We add the language, in which the invite email will
be sent, to the dict added to queue itself to avoid making queries
in a loop when sending multiple emails from queue.

We also handle the case for old events in the queue.
2021-06-22 16:55:32 -07:00
sahil839 6a191a7bf0 queue_processor: Remove email_body from the invite email events.
We removed the use of email_body field in 47fcb27e39, but was
still passed in events from do_resend_user_invite_email and
in tests. So this commit removes the email_body field from
these places.
2021-06-22 16:55:32 -07:00
Abhijeet Prasad Bodas 352634a851 tests: Consistently use assert_length helper.
This helper does some nice things like printing out
the data structure incase of failure.
2021-05-19 11:55:56 -07:00
Cyril Pletinckx e4ff372fc3 emails: Transform SMTPException into EmailNotDeliveredException.
Django's default SMTP implementation can raise various exceptions
when trying to send an email. In order to allow Zulip calling code
to catch fewer exceptions to handle any cause of "email not
sent", we translate most of them into EmailNotDeliveredException.
The non-translated exceptions concern the connection with the
SMTP server. They were not merged with the rest to keep some
details about the nature of these.

Tests are implemented in the test_send_email.py module.
2021-05-05 20:16:11 -07:00
Mateusz Mandera 977a2f7fa0 emails: Rename "missed message email" to "message notification email". 2021-04-21 10:10:54 -07:00
Mateusz Mandera b9c1fed18c invites: Delete old compat code in the invites queue worker.
1.7.* is old enough at this point that we can clean up this code.
2021-02-26 08:26:43 -08:00
Anders Kaseorg 6e4c3e41dc python: Normalize quotes with Black.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg 11741543da python: Reformat with Black, except quotes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
m-e-l-u-h-a-n ccf520ff13 logging: Migrate many backend tests to use assertLogs.
This commit  migrates some of the backend tests to use assertLogs(),
instead of mock.patch() as planned in #15331.

Tweaked by tabbott to avoid tautological assertions.
2021-02-03 17:55:49 -08:00
m-e-l-u-h-a-n 7417ac9165 logging: Remove unncessary logging patches in backend tests.
There were some tests that had mock patches for logging, although no
logging was actually happening there. This commit removes such patches
in `corporate/tests/test_stripe.py`, `zerver/tests/test_cache.py`,
`zerver/tests/test_queue_worker.py`,
and `zerver/tests/test_signup.py`.
2021-02-03 17:47:38 -08:00
Alex Vandiver c2526844e9 worker: Remove SignupWorker and friends.
ZULIP_FRIENDS_LIST_ID and MAILCHIMP_API_KEY are not currently used in
production.

This removes the unused 'signups' queue and worker.
2021-01-17 11:16:35 -08:00
Anders Kaseorg 2a8a59f548 test_queue_worker: Simplify worker_queue_names computation.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-11-10 15:46:04 -08:00
Alex Vandiver 7cf737988d queue: Be more explicit about test/real queue division. 2020-10-26 12:32:47 -07:00
Mateusz Mandera 716df658fa queue_processors: Don't run test queues with run-dev.py. 2020-10-18 14:07:31 -07:00
Alex Vandiver c2132a4f9c queue: Drop register_json_consumer / json_drain_queue interface.
Now that all callsites use the same interface, drop the now-unused
ones, and their tests.
2020-10-11 14:19:42 -07:00
Alex Vandiver f9358d5330 queue: Switch batch interface to use the channel.consume iterator.
This low-level interface allows consuming from a queue with timeouts.
This can be used to either consume in batches (with an upper timeout),
or one-at-a-time.  This is notably more performant than calling
`.get()` repeatedly (what json_drain_queue does under the hood), which
is "*highly discouraged* as it is *very inefficient*"[1].

Before this change:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11158 / sec
Dequeue rate: 3075 / sec
```

After:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11511 / sec
Dequeue rate: 19938 / sec
```

[1] https://www.rabbitmq.com/consumers.html#fetching
2020-10-11 14:19:40 -07:00
Alex Vandiver ac0ba21c2c tests: Stop reusing a variable name.
`loopworker_sleep_mock` is a file-level variable used to mock out the
sleep() call in LoopQueueProcessingWorker; don't reuse the variable
name for something else.
2020-10-09 20:42:20 -07:00
Alex Vandiver 754638f673 tests: Refactor test_queue_worker to separate queues. 2020-10-09 20:42:12 -07:00
Alex Vandiver d5a6b0f99a queue: Rename queue_size, and update for all local queues.
Despite its name, the `queue_size` method does not return the number
of items in the queue; it returns the number of items that the local
consumer has delivered but unprocessed.  These are often, but not
always, the same.

RabbitMQ's queues maintain the queue of unacknowledged messages; when
a consumer connects, it sends to the consumer some number of messages
to handle, known as the "prefetch."  This is a performance
optimization, to ensure the consumer code does not need to wait for a
network round-trip before having new data to consume.

The default prefetch is 0, which means that RabbitMQ immediately dumps
all outstanding messages to the consumer, which slowly processes and
acknowledges them.  If a second consumer were to connect to the same
queue, they would receive no messages to process, as the first
consumer has already been allocated them.  If the first consumer
disconnects or crashes, all prior events sent to it are then made
available for other consumers on the queue.

The consumer does not know the total size of the queue -- merely how
many messages it has been handed.

No change is made to the prefetch here; however, future changes may
wish to limit the prefetch, either for memory-saving, or to allow
multiple consumers to work the same queue.

Rename the method to make clear that it only contains information
about the local queue in the consumer, not the full RabbitMQ queue.
Also include the waiting message count, which is used by the
`consume()` iterator for similar purpose to the pending events list.
2020-10-09 20:40:39 -07:00
Alex Vandiver d47637fa40 queue: Set a max consume timeout with SIGALRM.
SIGALRM is the simplest way to set a specific maximum duration that
queue workers can take to handle a specific message.  This only works
in non-threaded environments, however, as signal handlers are
per-process, not per-thread.

The MAX_CONSUME_SECONDS is set quite high, at 10s -- the longest
average worker consume time is embed_links, which hovers near 1s.
Since just knowing the recent mean does not give much information[1],
it is difficult to know how much variance is expected.  As such, we
set the threshold to be such that only events which are significant
outliers will be timed out.  This can be tuned downwards as more
statistics are gathered on the runtime of the workers.

The exception to this is DeferredWorker, which deals with quite-long
requests, and thus has no enforceable SLO.

[1] https://www.autodesk.com/research/publications/same-stats-different-graphs
2020-10-06 17:26:14 -07:00
Alex Vandiver baf882a133 queue: Only ACK drain_queue once it has completed work on the list.
Currently, drain_queue and json_drain_queue ack every message as it is
pulled off of the queue, until the queue is empty.  This means that if
the consumer crashes between pulling a batch of messages off the
queue, and actually processing them, those messages will be
permanently lost.  Sending an ACK on every message also results in a
significant amount lot of traffic to rabbitmq, with notable
performance implications.

Send a singular ACK after the processing has completed, by making
`drain_queue` into a contextmanager.  Additionally, use the `multiple`
flag to ACK all of the messages at once -- or explicitly NACK the
messages if processing failed.  Sending a NACK will re-queue them at
the front of the queue.

Performance of a no-op dequeue before this change:
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10847 / sec
Dequeue rate: 2479 / sec
```
Performance of a no-op dequeue after this change (a 25% increase):
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10752 / sec
Dequeue rate: 3079 / sec
```
2020-10-06 17:26:14 -07:00
Alex Vandiver 8cf37a0d4b queue: Add a tool to profile no-op enqueue and dequeue actions. 2020-10-06 17:26:14 -07:00
Alex Vandiver b1cac67c31 tests: Check JSON serializability of test data with mock_queue_publish. 2020-09-03 17:34:31 -07:00
Mateusz Mandera 06151672ba
queue: Use locking to avoid race conditions in missedmessage_emails.
This queue had a race condition with creation of another Timer while
maybe_send_batched_emails is still doing its work, which may cause
two or more threads to be running maybe_send_batched_emails
at the same time, mutating the shared data simultaneously.

Another less likely potential race condition was that
maybe_send_batched_emails after sending out its email, can call
ensure_timer(). If the consume function is run simultaneously
in the main thread, it will call ensure_timer() too, which,
given unfortunate timings, might lead to both calls setting a new Timer.

We add locking to the queue to avoid such race conditions.

Tested manually, by print debugging with the following setup:
1. Making handle_missedmessage_emails sleep 2 seconds for each email,
   and changed BATCH_DURATION to 1s to make the queue start working
   right after launching.
2. Putting a bunch of events in the queue.
3. ./manage.py process_queue --queue_name missedmessage_emails
4. Once maybe_send_batched_emails is called and while it's processing
the events, I pushed more events to the queue. That triggers the
consume() function and ensure_timer().

Before implementing the locking mechanism, this causes two threads
to run maybe_send_batched_emails at the same time, mutating each other's
shared data, causing a traceback such as

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "/srv/zulip/zerver/worker/queue_processors.py", line 507, in maybe_send_batched_emails
    del self.events_by_recipient[user_profile_id]
KeyError: '5'

With the locking mechanism, things get handled as expected, and
ensure_timer() exits if it can't obtain the lock due to
maybe_send_batched_emails still working.

Co-authored-by: Tim Abbott <tabbott@zulip.com>
2020-08-26 12:40:59 -07:00
Anders Kaseorg 61d0417e75 python: Replace ujson with orjson.
Fixes #6507.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-08-11 10:55:12 -07:00
Alex Vandiver 2928bbc8bd logging: Report stack_info on logging.exception calls.
The exception trace only goes from where the exception was thrown up
to where the `logging.exception` call is; any context as to where
_that_ was called from is lost, unless `stack_info` is passed as well.
Having the stack is particularly useful for Sentry exceptions, which
gain the full stack trace.

Add `stack_info=True` on all `logging.exception` calls with a
non-trivial stack; we omit `wsgi.py`.  Adjusts tests to match.
2020-08-11 10:16:54 -07:00
Mohit Gupta 3eda7586f3 test_queue_worker: Use assertlogs to verify logs while testing workers.
This will avoid spam in output of test-backend.
2020-07-27 11:02:17 -07:00
Anders Kaseorg 1ed2d9b4a0 logging: Use logging.exception and exc_info for unexpected exceptions.
logging.exception() and logging.debug(exc_info=True),
etc. automatically include a traceback.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-14 23:27:22 -07:00