Commit Graph

422 Commits

Author SHA1 Message Date
Steve Howell 51db22c86c per-request caches: Add per_request_cache library.
We have historically cached two types of values
on a per-request basis inside of memory:

    * linkifiers
    * display recipients

Both of these caches were hand-written, and they
both actually cache values that are also in memcached,
so the per-request cache essentially only saves us
from a few memcached hits.

I think the linkifier per-request cache is a necessary
evil. It's an important part of message rendering, and
it's not super easy to structure the code to just get
a single value up front and pass it down the stack.

I'm not so sure we even need the display recipient
per-request cache any more, as we are generally pretty
smart now about hydrating recipient data in terms of
how the code is organized. But I haven't done thorough
research on that hypotheseis.

Fortunately, it's not rocket science to just write
a glorified memoize decorator and tie it into key
places in the code:

    * middleware
    * tests (e.g. asserting db counts)
    * queue processors

That's what I did in this commit.

This commit definitely reduces the amount of code
to maintain. I think it also gets us closer to
possibly phasing out this whole technique, but that
effort is beyond the scope of this PR. We could
add some instrumentation to the decorator to see
how often we get a non-trivial number of saved
round trips to memcached.

Note that when we flush linkifiers, we just use
a big hammer and flush the entire per-request
cache for linkifiers, since there is only ever
one realm in the cache.
2023-08-11 11:09:34 -07:00
Alex Vandiver c77c78f147 missed-message: Add a try-catch to prevent killing background thread.
An exception which escapes from this loop can kill the background
worker thread; this results in consuming the queue (leading to the
illusion of progress) but more and more rows silently piling up in the
ScheduledMessageNotificationEmail table.

Wrap the inside of the `while True` loop in a try/catch to make sure
that no exceptions escape and kill the background thread.  To prevent
even more indentation, the inner loop is extracted into its own
function.  It returns true/false to signal if the `self.stopping` was
set to tell the loop to stop; we cannot check it ourselves in the
outer loop because it needs to hold the lock to be examined.
2023-07-25 10:01:00 -07:00
Anders Kaseorg b285813beb error_notify: Remove custom email error reporting handler.
Restore the default django.utils.log.AdminEmailHandler when
ERROR_REPORTING is enabled.  Those with more sophisticated needs can
turn it off and use Sentry or a Sentry-compatible system.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-07-20 11:00:09 -07:00
Alex Vandiver be960f4142 missed-message: Lock ScheduledMessageNotificationEmail rows.
This prevents the rows from being deleted out from under the worker
while it is sending emails.
2023-07-13 11:50:42 -07:00
Alex Vandiver d87895a3ef missed-message: Merge before calling handle_missedmessage_emails.
The MissedMessage queue worker is the single callsite of
`handle_missedmessage_emails`, which immediately transforms the list
of events into a dict keyed by message-id.

Skip the intermediate list step, and use defaultdict and a dataclass
to simplify and make explicit the pieces.  This removes the unused
user_profile_id and message_id pieces of the data structure.
2023-07-13 11:50:42 -07:00
Alex Vandiver c7d9a4784e missed-message: Remove unnecessary select_related().
This was added in ebb4eab0f99d; neither the `user_profile` nor the
`message` attribute are read off of the object.
2023-07-13 11:50:42 -07:00
Zixuan James Li b6d1e56cac queue_processors: Avoid queue worker timeouts in tests.
For tests that use the dev server, like test-api, test-js-with-puppeteer,
we don't have the consumers for the queues. As they eventually timeout,
we get unnecessary error messages. This adds a new flag, disable_timeout,
to disable this behavior for the test cases.
2023-06-28 11:06:24 -07:00
Lauryn Menard d3f7cfccbc zerver: Update comments with "private message" or "PM".
Updates comments/doc-strings that use "private message" or "PM" in
files in the `/zerver` directory to instead use "direct message".
2023-06-23 11:24:13 -07:00
Alex Vandiver 7811e99548 realm_export: Handle hard head-of-queue failures.
Realm exports may OOM on deployments with low memory; to ensure
forward progress, log the start time in the RealmAuditLog entry, and
key off of the existence of that to prevent re-attempting an export
which was already tried once.
2023-05-16 14:05:01 -07:00
Alex Vandiver 4a43856ba7 realm_export: Do not assume null extra_data is special.
Fixes: #20197.
2023-05-16 14:05:01 -07:00
Alex Vandiver 362177b788 workers: Run realm export with one thread if in low-memory environment.
We previously hard-coded 6 threads for the realm export; in low-memory
environments, spawning 6 threads for an export can lean to an OOM,
which kills the process and leaves a partial export on disk -- which
is then tried again, since the export was never completed.  This leads
to excessive disk consumption and brief repeated outages of all other
workers, until the failing export job is manually de-queued somehow.

Lower the export to only use on thread if it is already running in a
multi-threaded environment.  Note that this does not guarantee forward
progress, it merely makes it more likely that exports will succeed in
low-memory deployments.
2023-05-16 14:05:01 -07:00
Alex Vandiver 9f231322c9 workers: Pass down if they are running multi-threaded.
This allows them to decide for themselves if they should enable
timeouts.
2023-05-16 14:05:01 -07:00
Alex Vandiver daba72c116 error_notify: Drop any remaining browser-side errors in RabbitMQ queue. 2023-04-13 14:59:58 -07:00
Alex Vandiver 3efc0c9af3 workers: Rewrite missedmessage_emails with a worker thread.
The previous implementation leaked database connections, as a new
thread (and thus a new thread-local database connection) was made for
each timer execution.  While these connections were relatively
lightweight in Python, they also incur memory overhead in the
PostgreSQL server itself.  The logic for managing the timer was also
unclear, and the unavoidable deadlock in the stopping logic was rather
unfortunate.

Rewrite with one explicit worker thread which handles the delayed
message sending.  The RabbitMQ consumer creates the database rows, and
notifies the worker to start its 5s timeout.  Because it is controlled
by a condition variable, it does not hold the lock while waiting, and
can be notified to exit.
2023-04-10 17:38:08 -07:00
Alex Vandiver 02a73af386 deferred_work: Log at start of the work.
This is helpful for debugging -- generally these tasks are in a worker
queue because they take a long time to run, so knowing what long task
is about to start before it does, rather than just after, is useful.
2023-02-09 12:06:38 -08:00
Anders Kaseorg 7e3a681f80 ruff: Fix S108 Probable insecure usage of temporary file.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-26 10:14:56 -08:00
Anders Kaseorg 25346bde98 ruff: Fix SIM118 Use `k in d` instead of `k in d.keys()`.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-23 11:18:36 -08:00
Anders Kaseorg 46cdcd3f33 ruff: Fix PIE790 Unnecessary `pass` statement.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-04 16:25:07 -08:00
Anders Kaseorg e1ed44907b ruff: Fix SIM118 Use `key in dict` instead of `key in dict.keys()`.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-04 16:25:07 -08:00
Anders Kaseorg 73c4da7974 ruff: Fix N818 exception name should be named with an Error suffix.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-11-17 16:52:00 -08:00
Anders Kaseorg 9a8a2bd345 ruff: Enable import sorting, replacing isort.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-11-16 09:29:11 -08:00
Anders Kaseorg 1735b8863e ruff: Fix B012 return inside finally blocks.
return inside finally blocks causes exceptions to be silenced.
Although these blocks follow blanket ‘except Exception’ handlers, they
do not seem to have a goal of silencing BaseException and exceptions
thrown by the exception handler, so rewrite them to avoid it.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-11-16 09:29:11 -08:00
Tim Abbott 8010d06f9e compatiblity: Delete obsolete compatibility code.
Both of these compatibility blocks can be deleted, since you can't
upgrade directly to any supported release from the versions where the
old event formats would be used.
2022-11-15 15:39:38 -08:00
Anders Kaseorg b45484573e python: Use format string for logging str(obj).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-10-10 08:32:29 -07:00
Anders Kaseorg fcd81a8473 python: Replace avoidable uses of __special__ attributes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-10-10 08:32:29 -07:00
Christopher Chong 28173cafc8 message_flags: Fix deadlocks when updating message flags.
Previously, an active production Zulip server would experience a class
of deadlocks caused by two or more concurrent bulk update operations
on the UserMessage table.

This is because UPDATE ... SET ... WHERE statements that execute in
parallel take row-level UPDATE locks as they get results; since the
query plans may result in getting rows in different orders between two
queries, this can result in deadlocks.

Some databases allow ORDER BY on their UPDATE ... WHERE statements;
PostgreSQL does not. In PostgreSQL, the answer is to do a sub-select
with an ORDER BY ... FOR UPDATE to ensure consistent ordering on row
locks.

We do this all code paths using bitand or bitor as part of bulk
editing message flags, which should ensure that these concurrent
operations obtain row level locks on the table in the same order.

Fixes #19054.
2022-09-06 16:06:58 -07:00
Zixuan James Li 3ba51ef1e2 queue_processor: Fix type annotation for connection.
Signed-off-by: Zixuan James Li <p359101898@gmail.com>
2022-07-26 18:00:24 -07:00
Zixuan James Li cd8510607a queue_processor: Remove unreachable code.
This change was added in
c93f1d4eda (diff-d88010b113b79080cab5885fdfbbb56ae2d380cb601d8f520621b3361ad8cebc).
`message.content` cannot be `None` by the model definition.

Signed-off-by: Zixuan James Li <p359101898@gmail.com>
2022-07-19 17:30:15 -07:00
Alex Vandiver cd9c69cd12 message_send: Remove unnecessary user_ids argument.
cfcbf58cd1 rightly removed the use of `user_ids` in
`render_markdown`, which in turn makes it unnecessary in
`render_incoming_message`.

Remove the unnecessary parameter from `render_incoming_message`.
2022-05-04 14:45:18 -07:00
Alex Vandiver 74e9b086f9 embed_links: Check that the message still exists before proceeding. 2022-05-04 14:45:18 -07:00
Alex Vandiver de63000db6 embed_links: Take a lock on the message object while editing.
We leave the fetching of links outside of the lock, as they could take
seconds, which is an unreasonable amount of time to hold a lock on the
message row.  This may result in unnecessary work, in the case that
the message was since edited, but the unnecessary work is preferable
to blocking other work on the message row for the duration.
2022-05-04 14:45:18 -07:00
Alex Vandiver 127108c7d1 workers: Log the exception if the export fails.
We previously just swallowed the exception entirely.
2022-04-28 11:52:47 -07:00
Zixuan James Li a8fd9eb701 email_notifications: Soft reactivate mentioned users.
Signed-off-by: Zixuan James Li <359101898@qq.com>
2022-04-27 16:43:54 -07:00
Sahil Batra 61365fbe21 invites: Use expiration time in minutes instead of days.
This commit changes the invite API to accept invitation
expiration time in minutes since we are going to add a
custom option in further commits which would allow a user
to set expiration time in minutes, hours and weeks as well.
2022-04-20 13:31:37 -07:00
Alex Vandiver 351bdfaf78 preview: Use cache only as a non-durable cache, not an IPC.
The `get_link_embed_data` / `link_embed_data_from_cache` pair as
introduced in c93f1d4eda uses the cache
as a temporary store inside of the `embed_links` worker; this means
that it must be durable storage, or the worker will stall and re-fetch
the same links to preview them.

Switch to plumbing through the fetched URL embed data as an parameter
to the Markdown evaluation which uses them, rather than using the
cache as an intermediary.  This frees up the cache to be merely a
non-durable cache.

As a side-effect, this removes get_cache_with_key, and
link_embed_data_from_cache which was its only callsite.
2022-04-15 14:48:12 -07:00
Anders Kaseorg eda000899b actions: Split out zerver.actions.message_edit.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:36 -07:00
Anders Kaseorg eb4e9fe1e7 actions: Split out zerver.actions.message_flags.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:36 -07:00
Anders Kaseorg 975066e3f0 actions: Split out zerver.actions.message_send.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:34 -07:00
Anders Kaseorg b7adfb02f6 actions: Split out zerver.actions.presence.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:32 -07:00
Anders Kaseorg 6168c0110a actions: Split out zerver.actions.user_activity.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:32 -07:00
Anders Kaseorg 8fc5922ebd actions: Split out zerver.actions.realm_export.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:31 -07:00
Anders Kaseorg ca8d374e21 actions: Split out zerver.actions.invites.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:31 -07:00
Sahil Batra 392b17da5f invite: Add backend support for "Never expires" option.
The database value for expiry_date is None for the invite
that will never expire and the clients send -1 as value
in the API similar to the message retention setting.

Also, when passing invite_expire_in_days as an argument
in various functions, invite_expire_in_days is passed as
-1 for "Never expires" option since invite_expire_in_days
is an optional argument in some functions and thus we cannot
pass "None" value.
2022-02-24 16:32:19 -08:00
Mateusz Mandera 30ac291eba emoji: Add migration to reupload all RealmEmoji and ensure .author.
Fixes #19732.
2022-02-10 17:45:31 -08:00
Lauryn Menard c532829c35 backend: Change `do_report_error` return value.
As a preparatory step to refactoring json_success to accept
request as a parameter, change `do_report_error`, which is
called from the events queue for "error_reports", to return
None instead of json_success.

Adds an assertion error to `ErrorReporter` queue processor
and removes `JsonableError` from `do_report_error`.

It is likely that `do_error_report` was moved from a view in a
previous refactor, but was not updated to no longer return an
HttpReponse.
2022-02-04 15:16:55 -08:00
Alex Vandiver 3efed5f1e6 queue_processors: Shut down background missedmessage_emails thread.
Python's behaviour on `sys.exit` is to wait for all non-daemon threads
to exit.  In the context of the missedmessage_emails worker, if any
work is pending, a non-daemon Timer thread exists, which is waiting
for 5 seconds.  As soon as that thread is serviced, it sets up another
5-second Timer, a process which repeats until all
ScheduledMessageNotificationEmail records have been handled.  This
likely takes two minutes, but may theoretically take up to a week
until the thread exits, and thus sys.exit can complete.

Supervisor only gives the process 30 seconds to shut down, so
something else must prevent this endless Timer.

When `stop` is called, take the lock so we can mutate the timer.
However, since `stop` may have been called from a signal handler, our
thread may _already_ have the lock.  As Python provides no way to know
if our thread is the one which has the lock, make the lock a
re-entrant one, allowing us to always try to take it.

With the lock in hand, cancel any outstanding timers.  A race exists
where the timer may not be able to be canceled because it has
finished, maybe_send_batched_emails has been called, and is itself
blocked on the lock.  Handle this case by timing out the thread join
in `stop()`, and signal the running thread to exit by unsetting the
timer event, which will be detected once it claims the lock.
2021-11-23 10:45:49 -08:00
Mateusz Mandera 8af7ffd9da rate_limit: Fix logging string when rate limiting email gateway.
realm.name is not the right "name" to log, we should use realm.subdomain
like everywhere else.
2021-11-22 10:28:56 -08:00
Alex Vandiver faeffa2466 queue_processors: Set a bounded prefetch size on rabbitmq queues.
RabbitMQ clients have a setting called prefetch[1], which controls how
many un-acknowledged events the server forwards to the local queue in
the client.  The default is 0; this means that when clients first
connect, the server must send them every message in the queue.

This itself may cause unbounded memory usage in the client, but also
has other detrimental effects.  While the client is attempting to
process the head of the queue, it may be unable to read from the TCP
socket at the rate that the server is sending to it -- filling the TCP
buffers, and causing the server's writes to block.  If the server
blocks for more than 30 seconds, it times out the send, and closes the
connection with:

```
closing AMQP connection <0.30902.126> (127.0.0.1:53870 -> 127.0.0.1:5672):
{writer,send_failed,{error,timeout}}
```

This is https://github.com/pika/pika/issues/753#issuecomment-318119222.

Set a prefetch limit of 100 messages, or the batch size, to better
handle queues which start with large numbers of outstanding events.

Setting prefetch=1 causes significant performance degradation in the
no-op queue worker, to 30% of the prefetch=0 performance.  Setting
prefetch=100 achieves 90% of the prefetch=0 performance, and higher
values offer only minor gains above that.  For batch workers, their
performance is not notably degraded by prefetch equal to their batch
size, and they cannot function on smaller prefetches than their batch
size.

We also set a 100-count prefetch on Tornado workers, as they are
potentially susceptible to the same effect.

[1] https://www.rabbitmq.com/confirms.html#channel-qos-prefetch
2021-11-16 11:48:50 -08:00
Alex Vandiver 64268f47e8 queue_processors: Drop unused current_queue_size, which was local size.
The `current_queue_size` key in the queue monitoring stats file was
the local queue size, not the global queue size -- d5a6b0f99a
renamed the function, but did not adjust the queue monitoring JSON,
despite the last use of it having been removed in cd9b194d88.

The function is still used to mark "we emptied our queue," and it
remains a reasonable metric for that.
2021-11-16 11:48:50 -08:00
Alex Vandiver 800e38016a queue_rate: Output to CSV, and run multiple prefetch values. 2021-11-16 11:48:50 -08:00