Commit Graph

334 Commits

Author SHA1 Message Date
Steve Howell bfa0bdf3d6 email digests: Process users in chunks of 30.
This should make the queue empty more quickly,
because we do bulk queries to prevent database
hops.
2021-01-17 11:28:30 -08:00
Alex Vandiver c2526844e9 worker: Remove SignupWorker and friends.
ZULIP_FRIENDS_LIST_ID and MAILCHIMP_API_KEY are not currently used in
production.

This removes the unused 'signups' queue and worker.
2021-01-17 11:16:35 -08:00
Alex Vandiver d688e18de2 errors: Remove references to "deployment", use "host".
The `deployment` key was only set in `do_report_error`, which is now
only used in one codepath (the queue worker).  The logging handlers on
staging call notify_server_error directly, which omits the
`deployment` key.

Remove the odd one-of key, and instead simply do dispatch in
`do_report_error`.
2021-01-17 11:08:12 -08:00
Anders Kaseorg cc55393671 python: Open text files as text to skip decode operations.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-10-30 11:36:38 -07:00
Abhijeet Prasad Bodas e98a8856c7 logging: Add logging in deferred_work queue processor.
Adds logging statements in deferred_work queue consume.
2020-10-29 10:34:53 -07:00
Alex Vandiver 142de0f670 queue: Increase default timeout to 30s, from 10s.
Not all of the workers are known to be safe to interrupt; they might
leave inconsistent state.  As such, terminating them with timeouts
should currently only be a last-resort against stalled queues, not a
regular occurrence.
2020-10-27 16:39:31 -07:00
Alex Vandiver c73dd194f0 sentry: Group all worker timeouts together, by queue.
Since the exception can be triggered at arbitrary places in the stack
based on whenever the alarm happens to fire, they do not often group
together.

Explicitly group them together, grouped only by which queue the work
is in.
2020-10-27 16:39:31 -07:00
Alex Vandiver 7cf737988d queue: Be more explicit about test/real queue division. 2020-10-26 12:32:47 -07:00
Mateusz Mandera 716df658fa queue_processors: Don't run test queues with run-dev.py. 2020-10-18 14:07:31 -07:00
Steve Howell 378062cc83 performance: Avoid call to access_stream_by_id.
We already trust ids that are put on our queue
for deferred work. For example, see the code for
"mark_stream_messages_as_read_for_everyone"

We now pass stream_recipient_id when we queue
up work for do_mark_stream_messages_as_read.

This generally saves about 3 queries per
user when we unsubscribe them from a stream.
2020-10-16 12:58:11 -07:00
Steve Howell 31eb97ddde performance: Fix do_mark_stream_messages_as_read.
This function no longer asks for data that it
doesn't need.
2020-10-16 12:58:11 -07:00
Anders Kaseorg 6564540d15 docs: Fix some spelling errors.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-10-13 15:47:13 -07:00
Alex Vandiver f0b23b0752 queue: Switch non-batch consumer to also use start_json_consumer.
This has no effect on consumption rate, but unifies the codepaths.
Before:
```
$ ./manage.py queue_rate --count 50000
Purging queue...
Enqueue rate: 11187 / sec
Dequeue rate: 4158 / sec
```

After:
```
$ ./manage.py queue_rate --count 50000
Purging queue...
Enqueue rate: 11010 / sec
Dequeue rate: 4113 / sec
```
2020-10-11 14:19:42 -07:00
Alex Vandiver 45c9c3cc30 queue: Monitor user_activity queue, now that it has a consumer.
Since this was using repead individual get() calls previously, it
could not be monitored for having a consumer.  Add it in, by marking
it of queue type "consumer" (the default), and adding Nagios lines for
it.

Also adjust missedmessage_emails to be monitored; it stopped using
LoopQueueProcessingWorker in 5cec566cb9, but was never added back
into the set of monitored consumers.
2020-10-11 14:19:42 -07:00
Alex Vandiver f9358d5330 queue: Switch batch interface to use the channel.consume iterator.
This low-level interface allows consuming from a queue with timeouts.
This can be used to either consume in batches (with an upper timeout),
or one-at-a-time.  This is notably more performant than calling
`.get()` repeatedly (what json_drain_queue does under the hood), which
is "*highly discouraged* as it is *very inefficient*"[1].

Before this change:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11158 / sec
Dequeue rate: 3075 / sec
```

After:
```
$ ./manage.py queue_rate --count 10000 --batch
Purging queue...
Enqueue rate: 11511 / sec
Dequeue rate: 19938 / sec
```

[1] https://www.rabbitmq.com/consumers.html#fetching
2020-10-11 14:19:40 -07:00
Alex Vandiver 2547bdbf4a queue: Rename consume_wrapper to a better name. 2020-10-09 20:40:51 -07:00
Alex Vandiver d5a6b0f99a queue: Rename queue_size, and update for all local queues.
Despite its name, the `queue_size` method does not return the number
of items in the queue; it returns the number of items that the local
consumer has delivered but unprocessed.  These are often, but not
always, the same.

RabbitMQ's queues maintain the queue of unacknowledged messages; when
a consumer connects, it sends to the consumer some number of messages
to handle, known as the "prefetch."  This is a performance
optimization, to ensure the consumer code does not need to wait for a
network round-trip before having new data to consume.

The default prefetch is 0, which means that RabbitMQ immediately dumps
all outstanding messages to the consumer, which slowly processes and
acknowledges them.  If a second consumer were to connect to the same
queue, they would receive no messages to process, as the first
consumer has already been allocated them.  If the first consumer
disconnects or crashes, all prior events sent to it are then made
available for other consumers on the queue.

The consumer does not know the total size of the queue -- merely how
many messages it has been handed.

No change is made to the prefetch here; however, future changes may
wish to limit the prefetch, either for memory-saving, or to allow
multiple consumers to work the same queue.

Rename the method to make clear that it only contains information
about the local queue in the consumer, not the full RabbitMQ queue.
Also include the waiting message count, which is used by the
`consume()` iterator for similar purpose to the pending events list.
2020-10-09 20:40:39 -07:00
Anders Kaseorg 9bfbb29763 queue_processors: Use try…finally to prevent leaking an alarm.
Otherwise, if consume_func raised an exception for any reason *other*
than the alarm being fired, the still-pending alarm would have fired
later at some arbitrary point in the calling code.

We need two try…finally blocks in case the signal arrives just before
signal.alarm(0).

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-10-07 15:37:46 -07:00
Alex Vandiver d47637fa40 queue: Set a max consume timeout with SIGALRM.
SIGALRM is the simplest way to set a specific maximum duration that
queue workers can take to handle a specific message.  This only works
in non-threaded environments, however, as signal handlers are
per-process, not per-thread.

The MAX_CONSUME_SECONDS is set quite high, at 10s -- the longest
average worker consume time is embed_links, which hovers near 1s.
Since just knowing the recent mean does not give much information[1],
it is difficult to know how much variance is expected.  As such, we
set the threshold to be such that only events which are significant
outliers will be timed out.  This can be tuned downwards as more
statistics are gathered on the runtime of the workers.

The exception to this is DeferredWorker, which deals with quite-long
requests, and thus has no enforceable SLO.

[1] https://www.autodesk.com/research/publications/same-stats-different-graphs
2020-10-06 17:26:14 -07:00
Alex Vandiver baf882a133 queue: Only ACK drain_queue once it has completed work on the list.
Currently, drain_queue and json_drain_queue ack every message as it is
pulled off of the queue, until the queue is empty.  This means that if
the consumer crashes between pulling a batch of messages off the
queue, and actually processing them, those messages will be
permanently lost.  Sending an ACK on every message also results in a
significant amount lot of traffic to rabbitmq, with notable
performance implications.

Send a singular ACK after the processing has completed, by making
`drain_queue` into a contextmanager.  Additionally, use the `multiple`
flag to ACK all of the messages at once -- or explicitly NACK the
messages if processing failed.  Sending a NACK will re-queue them at
the front of the queue.

Performance of a no-op dequeue before this change:
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10847 / sec
Dequeue rate: 2479 / sec
```
Performance of a no-op dequeue after this change (a 25% increase):
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10752 / sec
Dequeue rate: 3079 / sec
```
2020-10-06 17:26:14 -07:00
Alex Vandiver df86a564dc queue: Let stop() work with LoopQueueProcessingWorker. 2020-10-06 17:26:14 -07:00
Alex Vandiver 8cf37a0d4b queue: Add a tool to profile no-op enqueue and dequeue actions. 2020-10-06 17:26:14 -07:00
Tim Abbott 7fa8bafe81 lint: Fix type of initial 0 in queue monitoring. 2020-09-21 15:47:30 -07:00
Mateusz Mandera 810514dd9d queue: Update stats file every 30 seconds.
This system can't update stats while the queue is idle, without using
threads for this, but at least we ensure to update the file after
consuming an event if more than MAX_SECONDS_BEFORE_UPDATE_STATS passed
since the last update, regardless of the number of iterations done so
far.
2020-09-21 15:24:02 -07:00
Mateusz Mandera 40c4511a9c queue: Fix misspelled consume_iteration_counter variable. 2020-09-21 15:22:58 -07:00
Mateusz Mandera 2365a53496 queue: Fix a race condition in monitoring after queue stops being idle.
The race condition is described in the comment block removed by this
commit. This leaves room for another, remaining race condition
that should be virtually impossible, but nevertheless it seems
worthwhile to have it documented in the code, so we put a new comment
describing it.
As a final note, this is not a new race condition,
it was hypothetically possible with the old code as well.
2020-09-21 15:22:56 -07:00
Alex Vandiver de1db2c838 sentry: Provide more metadata in queue processors.
This allows aggregation by queue, makes the event data more readily
accessible, and clears out the breadcrumbs upon every batch that is
serviced.
2020-09-18 15:13:08 -07:00
Mateusz Mandera bb4567f57e queue: Extract get_remaining_queue_size method. 2020-09-11 15:51:07 -07:00
Mateusz Mandera 1d466a4fc5 queue: Make embed_link updates stats on every iteration. 2020-09-11 15:51:07 -07:00
Anders Kaseorg bef46dab3c python: Prefer kwargs form of dict.update.
For less inflation by Black.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-09-03 17:51:09 -07:00
Anders Kaseorg ab120a03bc python: Replace unnecessary intermediate lists with generators.
Mostly suggested by the flake8-comprehension plugin.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-09-02 11:15:41 -07:00
Mateusz Mandera 9b50c49ea7 streams: Mark all messages as read when deactivating a stream.
The query to finds and marks all unread UserMessages in the stream as read
can be quite expensive, so we'll move that work to the deferred_work
queue and split it into batches.

Fixes #15770.
2020-09-01 11:24:27 -07:00
Mateusz Mandera 06151672ba
queue: Use locking to avoid race conditions in missedmessage_emails.
This queue had a race condition with creation of another Timer while
maybe_send_batched_emails is still doing its work, which may cause
two or more threads to be running maybe_send_batched_emails
at the same time, mutating the shared data simultaneously.

Another less likely potential race condition was that
maybe_send_batched_emails after sending out its email, can call
ensure_timer(). If the consume function is run simultaneously
in the main thread, it will call ensure_timer() too, which,
given unfortunate timings, might lead to both calls setting a new Timer.

We add locking to the queue to avoid such race conditions.

Tested manually, by print debugging with the following setup:
1. Making handle_missedmessage_emails sleep 2 seconds for each email,
   and changed BATCH_DURATION to 1s to make the queue start working
   right after launching.
2. Putting a bunch of events in the queue.
3. ./manage.py process_queue --queue_name missedmessage_emails
4. Once maybe_send_batched_emails is called and while it's processing
the events, I pushed more events to the queue. That triggers the
consume() function and ensure_timer().

Before implementing the locking mechanism, this causes two threads
to run maybe_send_batched_emails at the same time, mutating each other's
shared data, causing a traceback such as

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "/srv/zulip/zerver/worker/queue_processors.py", line 507, in maybe_send_batched_emails
    del self.events_by_recipient[user_profile_id]
KeyError: '5'

With the locking mechanism, things get handled as expected, and
ensure_timer() exits if it can't obtain the lock due to
maybe_send_batched_emails still working.

Co-authored-by: Tim Abbott <tabbott@zulip.com>
2020-08-26 12:40:59 -07:00
Anders Kaseorg 61d0417e75 python: Replace ujson with orjson.
Fixes #6507.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-08-11 10:55:12 -07:00
Anders Kaseorg 60a25b2721 docs: Fix spelling errors caught by codespell.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-08-11 10:23:06 -07:00
Alex Vandiver 2928bbc8bd logging: Report stack_info on logging.exception calls.
The exception trace only goes from where the exception was thrown up
to where the `logging.exception` call is; any context as to where
_that_ was called from is lost, unless `stack_info` is passed as well.
Having the stack is particularly useful for Sentry exceptions, which
gain the full stack trace.

Add `stack_info=True` on all `logging.exception` calls with a
non-trivial stack; we omit `wsgi.py`.  Adjusts tests to match.
2020-08-11 10:16:54 -07:00
Mateusz Mandera a7039c815e queue_processors: Fix UnboundLocalError in QueueProcessingWorker.
consume_time_seconds wasn't properly defined at the beginning, so when
a BaseException that isn't a subclass of Exception is thrown, the
finally: block could be entered with it still undefined.
2020-08-11 10:09:42 -07:00
Anders Kaseorg 8e6a439529 queue_processors: Fix strict_optional errors.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-07-06 11:25:48 -07:00
Tim Abbott 52676c0670 lint: Work around a pyflakes bug.
Without this change, pyflakes reports this exception:

pyflakes  | zerver/worker/queue_processors.py:152:9 local variable 'e' is assigned to but never used
pyflakes  | zerver/worker/queue_processors.py:155:81 undefined name 'e'
2020-07-03 17:24:36 -07:00
Mateusz Mandera d51afcf485 emails: Improve handling of timeouts when sending.
We use the EMAIL_TIMEOUT django setting to timeout after 15s of trying
to send an email. This will nicely lead to retries in the email_senders
queue, due to the retry_send_email_failures decorator.

smtlib documentation suggests that socket.timeout can be raised as the
result of timing out, so in attempts I'm getting
smtplib.SMTPServerDisconnected. Either way, seems appropriate to add
socket.timeout to the exception that we catch.
2020-07-03 16:52:50 -07:00
Vishnu KS 0a36f04c20 i18n: Mark notification bot message in queue_processors for translation. 2020-06-26 14:57:18 -07:00
Mateusz Mandera 85d4536486 docs: Update some comments for the new release versioning scheme.
With the new scheme, the equivalent of 2.3 is 4.0.
2020-06-25 10:33:03 -07:00
Anders Kaseorg 579f05f3ed queue_processors: Avoid unchecked casts.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-22 17:18:19 -07:00
Tim Abbott 4d7550d705 views: Extract message_edit.py for message editing views.
This is a pretty clean extraction of files that lets us shrink one of
our largest files.
2020-06-22 15:08:34 -07:00
Mateusz Mandera 8d2d64c100 CVE-2020-14215: Fix validation in PreregistrationUser queries.
The most import change here is the one in maybe_send_to_registration
codepath, as the insufficient validation there could lead to fetching
an expired PreregistrationUser that was invited as an administrator
admin even years ago, leading to this registration ending up in the
new user being a realm administrator.

Combined with the buggy migration in
0198_preregistrationuser_invited_as.py, this led to users incorrectly
joining as organizations administrators by accident.  But even without
that bug, this issue could have allowed a user who was invited as an
administrator but then had that invitation expire and then joined via
social authentication incorrectly join as an organization administrator.

The second change is in ConfirmationEmailWorker, where this wasn't a
security problem, but if the server was stopped for long enough, with
some invites to send out email for in the queue, then after starting it
up again, the queue worker would send out emails for invites that
had already expired.
2020-06-16 23:35:39 -07:00
Anders Kaseorg 5dc9b55c43 python: Manually convert more percent-formatting to f-strings.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-14 23:27:22 -07:00
Anders Kaseorg 1ed2d9b4a0 logging: Use logging.exception and exc_info for unexpected exceptions.
logging.exception() and logging.debug(exc_info=True),
etc. automatically include a traceback.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-14 23:27:22 -07:00
Anders Kaseorg a803e68528 email-mirror-postfix: Handle 8-bit messages correctly.
Since JSON can’t represent bytes, we encode them with base64.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-14 20:24:06 -07:00
Anders Kaseorg bff3dcadc8 email: Migrate to new Python ≥ 3.3 email API.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-14 20:24:06 -07:00
Anders Kaseorg 365fe0b3d5 python: Sort imports with isort.
Fixes #2665.

Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.

Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start.  I expect this change will increase pressure for us to split
those files, which isn't a bad thing.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-11 16:45:32 -07:00