Despite its name, the `queue_size` method does not return the number
of items in the queue; it returns the number of items that the local
consumer has delivered but unprocessed. These are often, but not
always, the same.
RabbitMQ's queues maintain the queue of unacknowledged messages; when
a consumer connects, it sends to the consumer some number of messages
to handle, known as the "prefetch." This is a performance
optimization, to ensure the consumer code does not need to wait for a
network round-trip before having new data to consume.
The default prefetch is 0, which means that RabbitMQ immediately dumps
all outstanding messages to the consumer, which slowly processes and
acknowledges them. If a second consumer were to connect to the same
queue, they would receive no messages to process, as the first
consumer has already been allocated them. If the first consumer
disconnects or crashes, all prior events sent to it are then made
available for other consumers on the queue.
The consumer does not know the total size of the queue -- merely how
many messages it has been handed.
No change is made to the prefetch here; however, future changes may
wish to limit the prefetch, either for memory-saving, or to allow
multiple consumers to work the same queue.
Rename the method to make clear that it only contains information
about the local queue in the consumer, not the full RabbitMQ queue.
Also include the waiting message count, which is used by the
`consume()` iterator for similar purpose to the pending events list.
Otherwise, if consume_func raised an exception for any reason *other*
than the alarm being fired, the still-pending alarm would have fired
later at some arbitrary point in the calling code.
We need two try…finally blocks in case the signal arrives just before
signal.alarm(0).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
SIGALRM is the simplest way to set a specific maximum duration that
queue workers can take to handle a specific message. This only works
in non-threaded environments, however, as signal handlers are
per-process, not per-thread.
The MAX_CONSUME_SECONDS is set quite high, at 10s -- the longest
average worker consume time is embed_links, which hovers near 1s.
Since just knowing the recent mean does not give much information[1],
it is difficult to know how much variance is expected. As such, we
set the threshold to be such that only events which are significant
outliers will be timed out. This can be tuned downwards as more
statistics are gathered on the runtime of the workers.
The exception to this is DeferredWorker, which deals with quite-long
requests, and thus has no enforceable SLO.
[1] https://www.autodesk.com/research/publications/same-stats-different-graphs
Currently, drain_queue and json_drain_queue ack every message as it is
pulled off of the queue, until the queue is empty. This means that if
the consumer crashes between pulling a batch of messages off the
queue, and actually processing them, those messages will be
permanently lost. Sending an ACK on every message also results in a
significant amount lot of traffic to rabbitmq, with notable
performance implications.
Send a singular ACK after the processing has completed, by making
`drain_queue` into a contextmanager. Additionally, use the `multiple`
flag to ACK all of the messages at once -- or explicitly NACK the
messages if processing failed. Sending a NACK will re-queue them at
the front of the queue.
Performance of a no-op dequeue before this change:
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10847 / sec
Dequeue rate: 2479 / sec
```
Performance of a no-op dequeue after this change (a 25% increase):
```
$ ./manage.py queue_rate --count 50000 --batch
Purging queue...
Enqueue rate: 10752 / sec
Dequeue rate: 3079 / sec
```
This system can't update stats while the queue is idle, without using
threads for this, but at least we ensure to update the file after
consuming an event if more than MAX_SECONDS_BEFORE_UPDATE_STATS passed
since the last update, regardless of the number of iterations done so
far.
The race condition is described in the comment block removed by this
commit. This leaves room for another, remaining race condition
that should be virtually impossible, but nevertheless it seems
worthwhile to have it documented in the code, so we put a new comment
describing it.
As a final note, this is not a new race condition,
it was hypothetically possible with the old code as well.
The query to finds and marks all unread UserMessages in the stream as read
can be quite expensive, so we'll move that work to the deferred_work
queue and split it into batches.
Fixes#15770.
This queue had a race condition with creation of another Timer while
maybe_send_batched_emails is still doing its work, which may cause
two or more threads to be running maybe_send_batched_emails
at the same time, mutating the shared data simultaneously.
Another less likely potential race condition was that
maybe_send_batched_emails after sending out its email, can call
ensure_timer(). If the consume function is run simultaneously
in the main thread, it will call ensure_timer() too, which,
given unfortunate timings, might lead to both calls setting a new Timer.
We add locking to the queue to avoid such race conditions.
Tested manually, by print debugging with the following setup:
1. Making handle_missedmessage_emails sleep 2 seconds for each email,
and changed BATCH_DURATION to 1s to make the queue start working
right after launching.
2. Putting a bunch of events in the queue.
3. ./manage.py process_queue --queue_name missedmessage_emails
4. Once maybe_send_batched_emails is called and while it's processing
the events, I pushed more events to the queue. That triggers the
consume() function and ensure_timer().
Before implementing the locking mechanism, this causes two threads
to run maybe_send_batched_emails at the same time, mutating each other's
shared data, causing a traceback such as
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 1182, in run
self.function(*self.args, **self.kwargs)
File "/srv/zulip/zerver/worker/queue_processors.py", line 507, in maybe_send_batched_emails
del self.events_by_recipient[user_profile_id]
KeyError: '5'
With the locking mechanism, things get handled as expected, and
ensure_timer() exits if it can't obtain the lock due to
maybe_send_batched_emails still working.
Co-authored-by: Tim Abbott <tabbott@zulip.com>
The exception trace only goes from where the exception was thrown up
to where the `logging.exception` call is; any context as to where
_that_ was called from is lost, unless `stack_info` is passed as well.
Having the stack is particularly useful for Sentry exceptions, which
gain the full stack trace.
Add `stack_info=True` on all `logging.exception` calls with a
non-trivial stack; we omit `wsgi.py`. Adjusts tests to match.
consume_time_seconds wasn't properly defined at the beginning, so when
a BaseException that isn't a subclass of Exception is thrown, the
finally: block could be entered with it still undefined.
Without this change, pyflakes reports this exception:
pyflakes | zerver/worker/queue_processors.py:152:9 local variable 'e' is assigned to but never used
pyflakes | zerver/worker/queue_processors.py:155:81 undefined name 'e'
We use the EMAIL_TIMEOUT django setting to timeout after 15s of trying
to send an email. This will nicely lead to retries in the email_senders
queue, due to the retry_send_email_failures decorator.
smtlib documentation suggests that socket.timeout can be raised as the
result of timing out, so in attempts I'm getting
smtplib.SMTPServerDisconnected. Either way, seems appropriate to add
socket.timeout to the exception that we catch.
The most import change here is the one in maybe_send_to_registration
codepath, as the insufficient validation there could lead to fetching
an expired PreregistrationUser that was invited as an administrator
admin even years ago, leading to this registration ending up in the
new user being a realm administrator.
Combined with the buggy migration in
0198_preregistrationuser_invited_as.py, this led to users incorrectly
joining as organizations administrators by accident. But even without
that bug, this issue could have allowed a user who was invited as an
administrator but then had that invitation expire and then joined via
social authentication incorrectly join as an organization administrator.
The second change is in ConfirmationEmailWorker, where this wasn't a
security problem, but if the server was stopped for long enough, with
some invites to send out email for in the queue, then after starting it
up again, the queue worker would send out emails for invites that
had already expired.
Fixes#2665.
Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.
Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start. I expect this change will increase pressure for us to split
those files, which isn't a bad thing.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Automatically generated by the following script, based on the output
of lint with flake8-comma:
import re
import sys
last_filename = None
last_row = None
lines = []
for msg in sys.stdin:
m = re.match(
r"\x1b\[35mflake8 \|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg
)
if m:
filename, row_str, col_str, err = m.groups()
row, col = int(row_str), int(col_str)
if filename == last_filename:
assert last_row != row
else:
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
with open(filename) as f:
lines = f.readlines()
last_filename = filename
last_row = row
line = lines[row - 1]
if err in ["C812", "C815"]:
lines[row - 1] = line[: col - 1] + "," + line[col - 1 :]
elif err in ["C819"]:
assert line[col - 2] == ","
lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ")
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This commit adds three `.pysa` model files: `false_positives.pysa`
for ruling out false positive flows with `Sanitize` annotations,
`req_lib.pysa` for educating pysa about Zulip's `REQ()` pattern for
extracting user input, and `redirects.pysa` for capturing the risk
of open redirects within Zulip code. Additionally, this commit
introduces `mark_sanitized`, an identity function which can be used
to selectively clear taint in cases where `Sanitize` models will not
work. This commit also puts `mark_sanitized` to work removing known
false postive flows.
Generated by pyupgrade --py36-plus --keep-percent-format, but with the
NamedTuple changes reverted (see commit
ba7906a3c6, #15132).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This saves the completely unnecessary work of mapping the Client name
to its ID. Because we had in-process caching of the immutable Client
objects, this isn't a material performance win, but it will eventually
let us delete that caching logic and have a simpler system.
While this functionality to post slow queries to a Zulip stream was
very useful in the early days of Zulip, when there were only a few
hundred accounts, it's long since been useless since (1) the total
request volume on larger Zulip servers run by Zulip developers, and
(2) other server operators don't want real-time notifications of slow
backend queries. The right structure for this is just a log file.
We get rid of the queue and replace it with a "zulip.slow_queries"
logger, which will still log to /var/log/zulip/slow_queries.log for
ease of access to this information and propagate to the other logging
handlers. Reducing the amount of queues is good for lowering zulip's
memory footprint and restart performance, since we run at least one
dedicated queue worker process for each one in most configurations.
Prior to this change, there were reports of 500s in
production due to `export.extra_data` being a
Nonetype. This was reproducible using the s3
backend in development when a row was created in
the `RealmAuditLog` table, but the export failed in
the `DeferredWorker`. This left an entry lying
about that was never updated with an `extra_data`
field.
To fix this, we catch any exceptions in the
`DeferredWorker`, and then update `extra_data` to
encode the failure. We also fix the fact that we
never updated the export UI table with pending exports.
These changes also negated the use for the somewhat
hacky `clear_success_banner` logic.
We've had a bug for a while that if any ScheduledEmail objects get
created with the wrong email sender address, even after the sysadmin
corrects the problem, they'll still get errors because of the objects
stored with the wrong format.
We solve this by using FromAddress placeholders strings in
send_future_email function, so that ScheduledEmail objects end up
setting the final `from_address` value when mail is actually sent
using the setting in effect at that time.
Fixes#11008.
Several of our queues are capable of doing work that includes
rendering markdown (outgoing_webhook, embedded_bots, embed_links, and
email_mirror). As a result, it's essential that these don't cache
per-request data (specifically, realm filters) longer than they
should, making editing/deleting linkifiers potentially use old
settings until the relevant process was restarted.
Flushing these caches is extremely cheap (just clearing two
dictionaries) and thus is reasonable to do after every queue event,
rather than trying to do it only the ~1/3 of queues that specifically
do markdown processing. We do the same in our middleware for
reset_queries.
It's not worth writing a test for this because it's very difficult to
create the test setup situation for this bug with a single test worker
process; one needs to edit the linkifier configuration in a different
process than the one sending the message in order to see the bug.
This was a much larger visible bug on Zulip 2.1.x, where the presence
of the message_sender queue meant that this would apply to messages
sent via a browser.
Fixes#14095.