Commit Graph

449 Commits

Author SHA1 Message Date
Anders Kaseorg b45484573e python: Use format string for logging str(obj).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-10-10 08:32:29 -07:00
Anders Kaseorg fcd81a8473 python: Replace avoidable uses of __special__ attributes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-10-10 08:32:29 -07:00
Christopher Chong 28173cafc8 message_flags: Fix deadlocks when updating message flags.
Previously, an active production Zulip server would experience a class
of deadlocks caused by two or more concurrent bulk update operations
on the UserMessage table.

This is because UPDATE ... SET ... WHERE statements that execute in
parallel take row-level UPDATE locks as they get results; since the
query plans may result in getting rows in different orders between two
queries, this can result in deadlocks.

Some databases allow ORDER BY on their UPDATE ... WHERE statements;
PostgreSQL does not. In PostgreSQL, the answer is to do a sub-select
with an ORDER BY ... FOR UPDATE to ensure consistent ordering on row
locks.

We do this all code paths using bitand or bitor as part of bulk
editing message flags, which should ensure that these concurrent
operations obtain row level locks on the table in the same order.

Fixes #19054.
2022-09-06 16:06:58 -07:00
Zixuan James Li 3ba51ef1e2 queue_processor: Fix type annotation for connection.
Signed-off-by: Zixuan James Li <p359101898@gmail.com>
2022-07-26 18:00:24 -07:00
Zixuan James Li cd8510607a queue_processor: Remove unreachable code.
This change was added in
c93f1d4eda (diff-d88010b113b79080cab5885fdfbbb56ae2d380cb601d8f520621b3361ad8cebc).
`message.content` cannot be `None` by the model definition.

Signed-off-by: Zixuan James Li <p359101898@gmail.com>
2022-07-19 17:30:15 -07:00
Alex Vandiver cd9c69cd12 message_send: Remove unnecessary user_ids argument.
cfcbf58cd1 rightly removed the use of `user_ids` in
`render_markdown`, which in turn makes it unnecessary in
`render_incoming_message`.

Remove the unnecessary parameter from `render_incoming_message`.
2022-05-04 14:45:18 -07:00
Alex Vandiver 74e9b086f9 embed_links: Check that the message still exists before proceeding. 2022-05-04 14:45:18 -07:00
Alex Vandiver de63000db6 embed_links: Take a lock on the message object while editing.
We leave the fetching of links outside of the lock, as they could take
seconds, which is an unreasonable amount of time to hold a lock on the
message row.  This may result in unnecessary work, in the case that
the message was since edited, but the unnecessary work is preferable
to blocking other work on the message row for the duration.
2022-05-04 14:45:18 -07:00
Alex Vandiver 127108c7d1 workers: Log the exception if the export fails.
We previously just swallowed the exception entirely.
2022-04-28 11:52:47 -07:00
Zixuan James Li a8fd9eb701 email_notifications: Soft reactivate mentioned users.
Signed-off-by: Zixuan James Li <359101898@qq.com>
2022-04-27 16:43:54 -07:00
Sahil Batra 61365fbe21 invites: Use expiration time in minutes instead of days.
This commit changes the invite API to accept invitation
expiration time in minutes since we are going to add a
custom option in further commits which would allow a user
to set expiration time in minutes, hours and weeks as well.
2022-04-20 13:31:37 -07:00
Alex Vandiver 351bdfaf78 preview: Use cache only as a non-durable cache, not an IPC.
The `get_link_embed_data` / `link_embed_data_from_cache` pair as
introduced in c93f1d4eda uses the cache
as a temporary store inside of the `embed_links` worker; this means
that it must be durable storage, or the worker will stall and re-fetch
the same links to preview them.

Switch to plumbing through the fetched URL embed data as an parameter
to the Markdown evaluation which uses them, rather than using the
cache as an intermediary.  This frees up the cache to be merely a
non-durable cache.

As a side-effect, this removes get_cache_with_key, and
link_embed_data_from_cache which was its only callsite.
2022-04-15 14:48:12 -07:00
Anders Kaseorg eda000899b actions: Split out zerver.actions.message_edit.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:36 -07:00
Anders Kaseorg eb4e9fe1e7 actions: Split out zerver.actions.message_flags.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:36 -07:00
Anders Kaseorg 975066e3f0 actions: Split out zerver.actions.message_send.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:34 -07:00
Anders Kaseorg b7adfb02f6 actions: Split out zerver.actions.presence.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:32 -07:00
Anders Kaseorg 6168c0110a actions: Split out zerver.actions.user_activity.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:32 -07:00
Anders Kaseorg 8fc5922ebd actions: Split out zerver.actions.realm_export.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:31 -07:00
Anders Kaseorg ca8d374e21 actions: Split out zerver.actions.invites.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:31 -07:00
Sahil Batra 392b17da5f invite: Add backend support for "Never expires" option.
The database value for expiry_date is None for the invite
that will never expire and the clients send -1 as value
in the API similar to the message retention setting.

Also, when passing invite_expire_in_days as an argument
in various functions, invite_expire_in_days is passed as
-1 for "Never expires" option since invite_expire_in_days
is an optional argument in some functions and thus we cannot
pass "None" value.
2022-02-24 16:32:19 -08:00
Mateusz Mandera 30ac291eba emoji: Add migration to reupload all RealmEmoji and ensure .author.
Fixes #19732.
2022-02-10 17:45:31 -08:00
Lauryn Menard c532829c35 backend: Change `do_report_error` return value.
As a preparatory step to refactoring json_success to accept
request as a parameter, change `do_report_error`, which is
called from the events queue for "error_reports", to return
None instead of json_success.

Adds an assertion error to `ErrorReporter` queue processor
and removes `JsonableError` from `do_report_error`.

It is likely that `do_error_report` was moved from a view in a
previous refactor, but was not updated to no longer return an
HttpReponse.
2022-02-04 15:16:55 -08:00
Alex Vandiver 3efed5f1e6 queue_processors: Shut down background missedmessage_emails thread.
Python's behaviour on `sys.exit` is to wait for all non-daemon threads
to exit.  In the context of the missedmessage_emails worker, if any
work is pending, a non-daemon Timer thread exists, which is waiting
for 5 seconds.  As soon as that thread is serviced, it sets up another
5-second Timer, a process which repeats until all
ScheduledMessageNotificationEmail records have been handled.  This
likely takes two minutes, but may theoretically take up to a week
until the thread exits, and thus sys.exit can complete.

Supervisor only gives the process 30 seconds to shut down, so
something else must prevent this endless Timer.

When `stop` is called, take the lock so we can mutate the timer.
However, since `stop` may have been called from a signal handler, our
thread may _already_ have the lock.  As Python provides no way to know
if our thread is the one which has the lock, make the lock a
re-entrant one, allowing us to always try to take it.

With the lock in hand, cancel any outstanding timers.  A race exists
where the timer may not be able to be canceled because it has
finished, maybe_send_batched_emails has been called, and is itself
blocked on the lock.  Handle this case by timing out the thread join
in `stop()`, and signal the running thread to exit by unsetting the
timer event, which will be detected once it claims the lock.
2021-11-23 10:45:49 -08:00
Mateusz Mandera 8af7ffd9da rate_limit: Fix logging string when rate limiting email gateway.
realm.name is not the right "name" to log, we should use realm.subdomain
like everywhere else.
2021-11-22 10:28:56 -08:00
Alex Vandiver faeffa2466 queue_processors: Set a bounded prefetch size on rabbitmq queues.
RabbitMQ clients have a setting called prefetch[1], which controls how
many un-acknowledged events the server forwards to the local queue in
the client.  The default is 0; this means that when clients first
connect, the server must send them every message in the queue.

This itself may cause unbounded memory usage in the client, but also
has other detrimental effects.  While the client is attempting to
process the head of the queue, it may be unable to read from the TCP
socket at the rate that the server is sending to it -- filling the TCP
buffers, and causing the server's writes to block.  If the server
blocks for more than 30 seconds, it times out the send, and closes the
connection with:

```
closing AMQP connection <0.30902.126> (127.0.0.1:53870 -> 127.0.0.1:5672):
{writer,send_failed,{error,timeout}}
```

This is https://github.com/pika/pika/issues/753#issuecomment-318119222.

Set a prefetch limit of 100 messages, or the batch size, to better
handle queues which start with large numbers of outstanding events.

Setting prefetch=1 causes significant performance degradation in the
no-op queue worker, to 30% of the prefetch=0 performance.  Setting
prefetch=100 achieves 90% of the prefetch=0 performance, and higher
values offer only minor gains above that.  For batch workers, their
performance is not notably degraded by prefetch equal to their batch
size, and they cannot function on smaller prefetches than their batch
size.

We also set a 100-count prefetch on Tornado workers, as they are
potentially susceptible to the same effect.

[1] https://www.rabbitmq.com/confirms.html#channel-qos-prefetch
2021-11-16 11:48:50 -08:00
Alex Vandiver 64268f47e8 queue_processors: Drop unused current_queue_size, which was local size.
The `current_queue_size` key in the queue monitoring stats file was
the local queue size, not the global queue size -- d5a6b0f99a
renamed the function, but did not adjust the queue monitoring JSON,
despite the last use of it having been removed in cd9b194d88.

The function is still used to mark "we emptied our queue," and it
remains a reasonable metric for that.
2021-11-16 11:48:50 -08:00
Alex Vandiver 800e38016a queue_rate: Output to CSV, and run multiple prefetch values. 2021-11-16 11:48:50 -08:00
Shlok Patel 893c9bc896 export: Remove `--delete-after-upload` flag in realm export.
For export realm following changes have been made:
- `./manage.py export --upload` would delete `.tar.gz` and unpacked dir
- `./manage.py export` would only delete `unpacked dir`

Besides, we have removed `--delete-after-upload` as we have set it as
the default.

Fixes #20081
2021-11-03 11:14:02 -07:00
Alex Vandiver 75f1070881 queue_processors: Disable timeouts with PushNotificationsWorker.
Since 3853285241, PushNotificationsWorker uses the aioapns library
to send Apple push notifications.  This introduces an asyncio event
loop into this worker process, which, if unlucky, can respond poorly
when a SIGALRM is introduced to it:

```
[asyncio] Task exception was never retrieved
future: <Task finished coro=<send_apple_push_notification.<locals>.attempt_send() done, defined at /path/to/zerver/lib/push_notifications.py:166> exception=WorkerTimeoutException(30, 1)>
Traceback (most recent call last):
  File "/path/to/zerver/lib/push_notifications.py", line 169, in attempt_send
    result = await apns_context.apns.send_notification(request)
  File "/path/to/zulip-py3-venv/lib/python3.6/site-packages/aioapns/client.py", line 57, in send_notification
    response = await self.pool.send_notification(request)
  File "/path/to/zulip-py3-venv/lib/python3.6/site-packages/aioapns/connection.py", line 407, in send_notification
    response = await connection.send_notification(request)
  File "/path/to/zulip-py3-venv/lib/python3.6/site-packages/aioapns/connection.py", line 189, in send_notification
    data = json.dumps(request.message, ensure_ascii=False).encode()
  File "/usr/lib/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/path/to/zerver/worker/queue_processors.py", line 353, in timer_expired
    raise WorkerTimeoutException(limit, len(events))
zerver.worker.queue_processors.WorkerTimeoutException: Timed out after 30 seconds processing 1 events
```

...which subsequently leads to the worker failing to make any progress
on the queue.

Remove the timeout on the worker.  This may result in failing to make
forward progress if Apple/Google take overly long handling requests,
but is likely preferable to failing to make forward progress if _one_
request takes too long and gets unlucky with when the signal comes
through.
2021-10-21 08:59:56 -07:00
Alex Vandiver ab985c0066 queue_processors: Add a comment clarifying that timeouts only happen when single-threaded. 2021-10-21 08:59:56 -07:00
shanukun 8c1ea78d7d invite: Extend invite api for handling expiration duration.
This extends the invite api endpoints to handle an extra
argument, expiration duration, which states the number of
days before the invitation link expires.

For prereg users, expiration info is attached to event
object to pass it to invite queue processor in order to
create and send confirmation link.
In case of multiuse invites, confirmation links are
created directly inside do_create_multiuse_invite_link(),

For filtering valid user invites, expiration info stored in
Confirmation object is used, which is accessed by a prereg
user using reverse generic relations.

Fixes #16359.
2021-09-10 16:53:03 -07:00
Anders Kaseorg 646c04eff2 Rename default branch to ‘main’.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-09-06 12:56:35 -07:00
Alex Vandiver 5b45f8a128 queue_processors: Include queue name in the timeout exception.
This information can be gleaned from the stacktrace, but making it
explicit in the stringification makes it much easier to differentiate
types of errors at a glance, particularly in Sentry.
2021-09-02 02:48:34 -07:00
Alex Vandiver 4d98b0552e missedmessage_emails: Ensure forward progress.
maybe_send_batched_emails handles batches of emails from different
users at once; as it processes each user's batch, it enqueues messages
onto the `email_senders` queue.  If `handle_missedmessage_emails`
raises an exception when processing a single user's email, no events
are marked as handled -- including those that were already handled and
enqueued onto `email_senders`.  This results in an increasing number
of users being sent repeated emails about the same missed messages.

Catch and log any exceptions when handling an individual user's
events.  This guarantees forward progress, and that notifications are
sent at-most-once, not at-least-once.
2021-08-20 07:21:39 -07:00
Mateusz Mandera a01594e72b bots: Pass realm to get_system_bot call in DeferredWorker. 2021-07-26 15:33:13 -07:00
Abhijeet Prasad Bodas dd5e12d112 MissedMessageWorker: Use custom batching periods from UserProfile. 2021-07-23 12:13:46 -07:00
Abhijeet Prasad Bodas 9fcb6e51ce MissedMessageWorker: Handle deleted messages.
The test for the try-except block is hacky. See the comment for
explaination.
2021-07-23 12:13:46 -07:00
Abhijeet Prasad Bodas de78b015d9 MissedMessageWorker: Remove unnecessary transaction.atomic.
We only have one query which will change database state in this function,
and we already have a lock on the process itself, so there's no need for
a transaction.

This was added in ebb4eab0f9.
2021-07-23 12:13:46 -07:00
Abhijeet Prasad Bodas ebb4eab0f9 worker: Rewrite MissedMessageWorker to not be lossy.
Previously, we stored up to 2 minutes worth of email events in memory
before processing them. So, if the server were to go down we would lose
those events.

To fix this, we store the events in the database.

This is a prep change for allowing users to set custom grace period for
email notifications, since the bug noted above will aggravate with
longer grace periods.
2021-07-13 17:21:38 -07:00
Abhijeet Prasad Bodas e63e86dcb2 worker: Ensure complete coverage for PushNotificationsWorker.
The `# nocoverage` was unnecessary apart from for the compatibility code,
so add a test for that code and remove the `# nocoverage`.

The `message_id` -> `message_ids` conversion was done in
9869153ae8.
2021-07-13 08:30:31 -07:00
Mateusz Mandera 58d9975cca embed_links: Interrupt consume() function on worker timeout.
This fixes a bug introduced in 95b46549e1
which made the worker simply log a warning about the timeout and then
continue consume()ing the event that should have also been interrupted.

The idea here is to introduce an exception which can be used to
interrupt the consume() process without triggering the regular handling
of exceptions that happens in _handle_consume_exception.
2021-07-07 09:24:50 -07:00
Mateusz Mandera 95b46549e1 embed_links: Only log warning if worker times out.
Throwing an exception is excessive in case of this worker, as it's
expected for it to time out sometimes if the urls take too long to
process.

With a test added by tabbott.
2021-07-06 14:17:24 -07:00
Mateusz Mandera d9ab70bdde queue_processors: Make timer_expired receive list of events as argument.
This will give queue workers more flexibility when defining their own
override of the method.
2021-07-06 13:46:48 -07:00
Mateusz Mandera c101f3acd6 queue_processors: Make timer_expired() a method.
This allows specific queue workers to override the defaut behavior and
implement their own response to the timer expiring. We will want to use
this for embed_links queue at least.
2021-07-06 13:46:48 -07:00
PIG208 75cea329b4 markdown: Refactor out additional properties added to Message.
This adds a new class called MessageRenderingResult to contain the
additional properties we added to the Message object (like alert_words)
as well as the rendered content to ensure typesafe reference. No
behavioral change is made except changes in typing.

This is a preparatory change for adding django-stubs to the backend.

Related: #18777
2021-06-24 18:14:53 -07:00
sahil839 37bf160298 queue_processor: Add langauge to the events added to invites queue.
This is a prep commit for adding realm-level default for various
user settings. We add the language, in which the invite email will
be sent, to the dict added to queue itself to avoid making queries
in a loop when sending multiple emails from queue.

We also handle the case for old events in the queue.
2021-06-22 16:55:32 -07:00
Mateusz Mandera 496e744053 queue_processors: Log more detailed info when marking messages as read. 2021-05-26 11:17:21 -07:00
PIG208 7150fe5dc5 backend: Extract check_update_message from update_message_backend. 2021-05-09 20:44:04 -07:00
Cyril Pletinckx e4ff372fc3 emails: Transform SMTPException into EmailNotDeliveredException.
Django's default SMTP implementation can raise various exceptions
when trying to send an email. In order to allow Zulip calling code
to catch fewer exceptions to handle any cause of "email not
sent", we translate most of them into EmailNotDeliveredException.
The non-translated exceptions concern the connection with the
SMTP server. They were not merged with the rest to keep some
details about the nature of these.

Tests are implemented in the test_send_email.py module.
2021-05-05 20:16:11 -07:00
Alex Vandiver a9688ceb75 worker: Allow long MissedMessageWorker consumes.
This will stop dropping events in the case that the background
`maybe_send_batched_email` thread takes longer than 30s.  However, see
also #15280 and the TODO comment about how we lose events upon
restart; this worker is still lossy.
2021-05-04 08:45:48 -07:00
Cyril Pletinckx 9afde790c6 email: Open a single SMTP connection to send email batches.
Previously the outgoing emails were sent over several SMTP
connections through the EmailSendingWorker; establishing a new
connection each time adds notable overhead.

Redefine EmailSendingWorker worker to be a LoopQueueProcessingWorker,
which allows it to handle batches of events. At the same time, persist
the connection across email sending, if possible.

The connection is initialized in the constructor of the worker
in order to keep the same connection throughout the whole process.
The concrete implementation of the consume_batch function is simply
processing each email one at a time until they have all been sent.

In order to reuse the previously implemented decorator to retry
sending failures a new method that meets the decorator's required
arguments is declared inside the EmailSendingWorker class. This
allows to retry the sending process of a particular email inside
the batch if the caught exception leaves this process retriable.

A second retry mechanism is used inside the initialize_connection
function to redo the opening of the connection until it works or
until three attempts failed. For this purpose the backoff module
has been added to the dependencies and a test has been added to
ensure that this retry mechanism works well.

The connection is closed when the stop method is called.

Fixes: #17672.
2021-04-26 17:27:22 -07:00
Alex Vandiver 0ad17925eb send_email: Remove unnecessary send_email_from_dict.
This was introduced in 8321bd3f92 to serve as a sort of drop-in
replacement for zerver.lib.queue.queue_json_publish, but its use has
been subsequently cut out (e.g. `9fcdb6c83ac5`).

Remote its last callsite.
2021-04-26 17:27:22 -07:00
Anders Kaseorg 178736c8eb docs: Fix spelling errors caught by codespell.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-04-26 09:31:08 -07:00
Tim Abbott 260861426c queue_processors: Document when can remove compatibility code. 2021-04-16 09:55:14 -07:00
Anders Kaseorg e7ed907cf6 python: Convert deprecated Django ugettext alias to gettext.
django.utils.translation.ugettext is a deprecated alias of
django.utils.translation.gettext as of Django 3.0, and will be removed
in Django 4.0.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-04-15 18:01:34 -07:00
Anders Kaseorg 1fe29aad42 queue_processors: Simplify unnecessary use of Optional.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-04-13 08:54:26 -07:00
Alex Vandiver a280905a89 outgoing_webhook: Join build_bot_request and send_data_to_server.
The existing organization, of returning an opaque blob from
`build_bot_request`, which was later consumed by
`send_data_to_server`, is not particularly sensible; the steps become
oddly split between the OutgoingWebhookWorker, `do_rest_call`, and the
`OutgoingWebhookServiceInterface`.

Make the `OutgoingWebhookServiceInterface` in charge of building,
making, and returning the request in one method; another method
handles extracting content from a successful response.  `do_rest_call`
is responsible for calling both halves of this, and doing common error
handling.
2021-03-29 18:24:44 -07:00
Anders Kaseorg d55dc6f8f1 requirements: Upgrade python-zulip-api from Git.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-03-26 16:31:03 -07:00
Mateusz Mandera b9c1fed18c invites: Delete old compat code in the invites queue worker.
1.7.* is old enough at this point that we can clean up this code.
2021-02-26 08:26:43 -08:00
Mateusz Mandera 09fc79f911 actions: Remove realm argument to internal_send_private_message.
The argument is redundant.
2021-02-23 15:26:47 -08:00
Anders Kaseorg 6e4c3e41dc python: Normalize quotes with Black.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg 11741543da python: Reformat with Black, except quotes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg 5028c081cb python: Merge concatenated string literals that Black would uglify.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Alex Vandiver d0f0c2f2ed digest: Fix the structure that we enqueue across when digesting.
This rename was missed in bfa0bdf3d6.
Without this fix, digest messages fail to send.
2021-02-08 17:28:59 -08:00
Anders Kaseorg 454144c35f queue_processors: Fix retry_send_email_failures type.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-01-26 13:27:50 -08:00
Steve Howell bfa0bdf3d6 email digests: Process users in chunks of 30.
This should make the queue empty more quickly,
because we do bulk queries to prevent database
hops.
2021-01-17 11:28:30 -08:00
Alex Vandiver c2526844e9 worker: Remove SignupWorker and friends.
ZULIP_FRIENDS_LIST_ID and MAILCHIMP_API_KEY are not currently used in
production.

This removes the unused 'signups' queue and worker.
2021-01-17 11:16:35 -08:00
Alex Vandiver d688e18de2 errors: Remove references to "deployment", use "host".
The `deployment` key was only set in `do_report_error`, which is now
only used in one codepath (the queue worker).  The logging handlers on
staging call notify_server_error directly, which omits the
`deployment` key.

Remove the odd one-of key, and instead simply do dispatch in
`do_report_error`.
2021-01-17 11:08:12 -08:00
Anders Kaseorg cc55393671 python: Open text files as text to skip decode operations.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-10-30 11:36:38 -07:00
Abhijeet Prasad Bodas e98a8856c7 logging: Add logging in deferred_work queue processor.
Adds logging statements in deferred_work queue consume.
2020-10-29 10:34:53 -07:00
Alex Vandiver 142de0f670 queue: Increase default timeout to 30s, from 10s.
Not all of the workers are known to be safe to interrupt; they might
leave inconsistent state.  As such, terminating them with timeouts
should currently only be a last-resort against stalled queues, not a
regular occurrence.
2020-10-27 16:39:31 -07:00
Alex Vandiver c73dd194f0 sentry: Group all worker timeouts together, by queue.
Since the exception can be triggered at arbitrary places in the stack
based on whenever the alarm happens to fire, they do not often group
together.

Explicitly group them together, grouped only by which queue the work
is in.
2020-10-27 16:39:31 -07:00
Alex Vandiver 7cf737988d queue: Be more explicit about test/real queue division. 2020-10-26 12:32:47 -07:00
Mateusz Mandera 716df658fa queue_processors: Don't run test queues with run-dev.py. 2020-10-18 14:07:31 -07:00
Steve Howell 378062cc83 performance: Avoid call to access_stream_by_id.
We already trust ids that are put on our queue
for deferred work. For example, see the code for
"mark_stream_messages_as_read_for_everyone"

We now pass stream_recipient_id when we queue
up work for do_mark_stream_messages_as_read.

This generally saves about 3 queries per
user when we unsubscribe them from a stream.
2020-10-16 12:58:11 -07:00
Steve Howell 31eb97ddde performance: Fix do_mark_stream_messages_as_read.
This function no longer asks for data that it
doesn't need.
2020-10-16 12:58:11 -07:00
Anders Kaseorg 6564540d15 docs: Fix some spelling errors.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-10-13 15:47:13 -07:00
Alex Vandiver f0b23b0752 queue: Switch non-batch consumer to also use start_json_consumer.
This has no effect on consumption rate, but unifies the codepaths.
Before:
```
$ ./manage.py queue_rate --count 50000
Purging queue...
Enqueue rate: 11187 / sec
Dequeue rate: 4158 / sec
```

After:
```
$ ./manage.py queue_rate --count 50000
Purging queue...
Enqueue rate: 11010 / sec
Dequeue rate: 4113 / sec
```
2020-10-11 14:19:42 -07:00
Alex Vandiver 45c9c3cc30 queue: Monitor user_activity queue, now that it has a consumer.
Since this was using repead individual get() calls previously, it
could not be monitored for having a consumer.  Add it in, by marking
it of queue type "consumer" (the default), and adding Nagios lines for
it.

Also adjust missedmessage_emails to be monitored; it stopped using
LoopQueueProcessingWorker in 5cec566cb9, but was never added back
into the set of monitored consumers.
2020-10-11 14:19:42 -07:00
Alex Vandiver f9358d5330 queue: Switch batch interface to use the channel.consume iterator.
This low-level interface allows consuming from a queue with timeouts.
This can be used to either consume in batches (with an upper timeout),
or one-at-a-time.  This is notably more performant than calling
`.get()` repeatedly (what json_drain_queue does under the hood), which
is "*highly discouraged* as it is *very inefficient*"[1].

Before this change:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11158 / sec
Dequeue rate: 3075 / sec
```

After:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11511 / sec
Dequeue rate: 19938 / sec
```

[1] https://www.rabbitmq.com/consumers.html#fetching
2020-10-11 14:19:40 -07:00
Alex Vandiver 2547bdbf4a queue: Rename consume_wrapper to a better name. 2020-10-09 20:40:51 -07:00
Alex Vandiver d5a6b0f99a queue: Rename queue_size, and update for all local queues.
Despite its name, the `queue_size` method does not return the number
of items in the queue; it returns the number of items that the local
consumer has delivered but unprocessed.  These are often, but not
always, the same.

RabbitMQ's queues maintain the queue of unacknowledged messages; when
a consumer connects, it sends to the consumer some number of messages
to handle, known as the "prefetch."  This is a performance
optimization, to ensure the consumer code does not need to wait for a
network round-trip before having new data to consume.

The default prefetch is 0, which means that RabbitMQ immediately dumps
all outstanding messages to the consumer, which slowly processes and
acknowledges them.  If a second consumer were to connect to the same
queue, they would receive no messages to process, as the first
consumer has already been allocated them.  If the first consumer
disconnects or crashes, all prior events sent to it are then made
available for other consumers on the queue.

The consumer does not know the total size of the queue -- merely how
many messages it has been handed.

No change is made to the prefetch here; however, future changes may
wish to limit the prefetch, either for memory-saving, or to allow
multiple consumers to work the same queue.

Rename the method to make clear that it only contains information
about the local queue in the consumer, not the full RabbitMQ queue.
Also include the waiting message count, which is used by the
`consume()` iterator for similar purpose to the pending events list.
2020-10-09 20:40:39 -07:00
Anders Kaseorg 9bfbb29763 queue_processors: Use try…finally to prevent leaking an alarm.
Otherwise, if consume_func raised an exception for any reason *other*
than the alarm being fired, the still-pending alarm would have fired
later at some arbitrary point in the calling code.

We need two try…finally blocks in case the signal arrives just before
signal.alarm(0).

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-10-07 15:37:46 -07:00
Alex Vandiver d47637fa40 queue: Set a max consume timeout with SIGALRM.
SIGALRM is the simplest way to set a specific maximum duration that
queue workers can take to handle a specific message.  This only works
in non-threaded environments, however, as signal handlers are
per-process, not per-thread.

The MAX_CONSUME_SECONDS is set quite high, at 10s -- the longest
average worker consume time is embed_links, which hovers near 1s.
Since just knowing the recent mean does not give much information[1],
it is difficult to know how much variance is expected.  As such, we
set the threshold to be such that only events which are significant
outliers will be timed out.  This can be tuned downwards as more
statistics are gathered on the runtime of the workers.

The exception to this is DeferredWorker, which deals with quite-long
requests, and thus has no enforceable SLO.

[1] https://www.autodesk.com/research/publications/same-stats-different-graphs
2020-10-06 17:26:14 -07:00
Alex Vandiver baf882a133 queue: Only ACK drain_queue once it has completed work on the list.
Currently, drain_queue and json_drain_queue ack every message as it is
pulled off of the queue, until the queue is empty.  This means that if
the consumer crashes between pulling a batch of messages off the
queue, and actually processing them, those messages will be
permanently lost.  Sending an ACK on every message also results in a
significant amount lot of traffic to rabbitmq, with notable
performance implications.

Send a singular ACK after the processing has completed, by making
`drain_queue` into a contextmanager.  Additionally, use the `multiple`
flag to ACK all of the messages at once -- or explicitly NACK the
messages if processing failed.  Sending a NACK will re-queue them at
the front of the queue.

Performance of a no-op dequeue before this change:
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10847 / sec
Dequeue rate: 2479 / sec
```
Performance of a no-op dequeue after this change (a 25% increase):
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10752 / sec
Dequeue rate: 3079 / sec
```
2020-10-06 17:26:14 -07:00
Alex Vandiver df86a564dc queue: Let stop() work with LoopQueueProcessingWorker. 2020-10-06 17:26:14 -07:00
Alex Vandiver 8cf37a0d4b queue: Add a tool to profile no-op enqueue and dequeue actions. 2020-10-06 17:26:14 -07:00
Tim Abbott 7fa8bafe81 lint: Fix type of initial 0 in queue monitoring. 2020-09-21 15:47:30 -07:00
Mateusz Mandera 810514dd9d queue: Update stats file every 30 seconds.
This system can't update stats while the queue is idle, without using
threads for this, but at least we ensure to update the file after
consuming an event if more than MAX_SECONDS_BEFORE_UPDATE_STATS passed
since the last update, regardless of the number of iterations done so
far.
2020-09-21 15:24:02 -07:00
Mateusz Mandera 40c4511a9c queue: Fix misspelled consume_iteration_counter variable. 2020-09-21 15:22:58 -07:00
Mateusz Mandera 2365a53496 queue: Fix a race condition in monitoring after queue stops being idle.
The race condition is described in the comment block removed by this
commit. This leaves room for another, remaining race condition
that should be virtually impossible, but nevertheless it seems
worthwhile to have it documented in the code, so we put a new comment
describing it.
As a final note, this is not a new race condition,
it was hypothetically possible with the old code as well.
2020-09-21 15:22:56 -07:00
Alex Vandiver de1db2c838 sentry: Provide more metadata in queue processors.
This allows aggregation by queue, makes the event data more readily
accessible, and clears out the breadcrumbs upon every batch that is
serviced.
2020-09-18 15:13:08 -07:00
Mateusz Mandera bb4567f57e queue: Extract get_remaining_queue_size method. 2020-09-11 15:51:07 -07:00
Mateusz Mandera 1d466a4fc5 queue: Make embed_link updates stats on every iteration. 2020-09-11 15:51:07 -07:00
Anders Kaseorg bef46dab3c python: Prefer kwargs form of dict.update.
For less inflation by Black.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-09-03 17:51:09 -07:00
Anders Kaseorg ab120a03bc python: Replace unnecessary intermediate lists with generators.
Mostly suggested by the flake8-comprehension plugin.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-09-02 11:15:41 -07:00
Mateusz Mandera 9b50c49ea7 streams: Mark all messages as read when deactivating a stream.
The query to finds and marks all unread UserMessages in the stream as read
can be quite expensive, so we'll move that work to the deferred_work
queue and split it into batches.

Fixes #15770.
2020-09-01 11:24:27 -07:00
Mateusz Mandera 06151672ba
queue: Use locking to avoid race conditions in missedmessage_emails.
This queue had a race condition with creation of another Timer while
maybe_send_batched_emails is still doing its work, which may cause
two or more threads to be running maybe_send_batched_emails
at the same time, mutating the shared data simultaneously.

Another less likely potential race condition was that
maybe_send_batched_emails after sending out its email, can call
ensure_timer(). If the consume function is run simultaneously
in the main thread, it will call ensure_timer() too, which,
given unfortunate timings, might lead to both calls setting a new Timer.

We add locking to the queue to avoid such race conditions.

Tested manually, by print debugging with the following setup:
1. Making handle_missedmessage_emails sleep 2 seconds for each email,
   and changed BATCH_DURATION to 1s to make the queue start working
   right after launching.
2. Putting a bunch of events in the queue.
3. ./manage.py process_queue --queue_name missedmessage_emails
4. Once maybe_send_batched_emails is called and while it's processing
the events, I pushed more events to the queue. That triggers the
consume() function and ensure_timer().

Before implementing the locking mechanism, this causes two threads
to run maybe_send_batched_emails at the same time, mutating each other's
shared data, causing a traceback such as

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "/srv/zulip/zerver/worker/queue_processors.py", line 507, in maybe_send_batched_emails
    del self.events_by_recipient[user_profile_id]
KeyError: '5'

With the locking mechanism, things get handled as expected, and
ensure_timer() exits if it can't obtain the lock due to
maybe_send_batched_emails still working.

Co-authored-by: Tim Abbott <tabbott@zulip.com>
2020-08-26 12:40:59 -07:00
Anders Kaseorg 61d0417e75 python: Replace ujson with orjson.
Fixes #6507.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-08-11 10:55:12 -07:00
Anders Kaseorg 60a25b2721 docs: Fix spelling errors caught by codespell.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-08-11 10:23:06 -07:00