Commit Graph

451 Commits

Author SHA1 Message Date
Alex Vandiver 572fbfe114 queue_processors: Pass the worker_num down into the class. 2024-05-02 14:25:10 -07:00
Alex Vandiver 5654d051f7 worker: Split into separate files.
This makes each worker faster to start up.
2024-04-16 23:00:02 -07:00
Anders Kaseorg 7e2ef11f61 ruff: Fix UP041 Replace aliased errors with `TimeoutError`.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-04-01 18:32:52 -07:00
Alex Vandiver 9d8d2d138b missedmessage_emails: Add Sentry spans to worker thread. 2024-03-21 12:46:13 -07:00
Alex Vandiver 9451d08bb9 worker: Split out worker sampling rate, and add Sentry transactions. 2024-03-21 12:46:13 -07:00
Alex Vandiver 3cbce0c5c7 missedmessage_emails: Clear caches and db query tracking per-loop.
Otherwise, these accumulate and leak memory.
2024-03-21 12:46:13 -07:00
Alex Vandiver 6e91e326e9 deferred_work: Reduce batch size due to bad statistics.
PostgreSQL's estimate of the number of usermessage rows for a single
message can be wildly off, due to poor statistics generation.  This
causes this query, with 100-message batch sizes, to incorrectly
estimate millions of matched rows, causing it to perform a full-table
index scan, rather than piecemeal using the `message_id` index.

Reduce the batch size to 50, which is enough to tip in favor of a
rational query plan.
2024-03-11 09:24:59 -07:00
Anders Kaseorg d748ec8d52 ruff: Fix PLW0108 Lambda may be unnecessary.
This is a preview rule, not yet enabled by default.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-03-01 09:30:04 -08:00
Alex Vandiver a808c730bc deferred_work: Use an id high-water-mark instead of offsets.
This solves the problem listed in the now-removed comment.
2024-02-27 17:02:34 -08:00
Alex Vandiver 58f0669997 deferred_work: Re-queue remaining "mark all as read" work after 30s. 2024-02-27 10:21:04 -08:00
Alex Vandiver 75e9903be5 deferred_work: Move all queries into the transaction.
The presence of `len(messages)` outside the transaction caused the
full resultset to be fetched outside of the transaction.  This should
ideally be inside the transaction, and also only need be the count.

However, also note that the process of counting matching rows, and
then executing a second query which embeds the same query, is
susceptible to phantom reads, where a query with the same conditions
returns different resultsets, under PostgreSQL's default transaction
isolation of "read committed."  While this is possible to resolve by
pulling the returned IDs into a Python list, it would not address the
issue that concurrent updates which change the resultset would make
the overall algorithm still incorrect.

Add a comment clarifying the conditions under which the algorithm is
correct.  A more correct algorithm would walk the UserMessage rows
which are unread and in the stream, but this requires a
whole-UserMessage index which would be quite large for such an
infrequent use case.
2024-02-27 10:21:04 -08:00
Alex Vandiver 37fa181e5f queue_processors: Process user_activity in one query.
This leads to significant speedups.  In a test, with 100 random unique
event classes, the old code processed a batch of 100 rows (on average
66-ish unique in the batch) in 0.45 seconds.  Doing this in a single
query processes the same batch in 0.0076 seconds.
2024-01-22 16:25:13 -08:00
Alex Vandiver e6a0284275 queue_processors: Defer initial email connection creation.
We previously created the connection to the outgoing email server when
the EmailSendingWorker was first created.  Since creating the
connection can fail (e.g. because of firewalls or typos in the
hostname), this can cause the `QueueProcessingWorker` creation to
raise an exception.  In multi-threaded mode, exceptions in the worker
threads which are _not_ during the handling of a specific event
percolate out to `log_and_exit_if_exception` and trigger the
termination of the entire process -- stopping all worker threads from
making forward progress.

Contain the blast radius of misconfigured email servers by deferring
the opening of the connection until it is first needed.  This will not
cause any overall performance change, since it only affects the
latency of the very first email after startup.
2024-01-12 08:38:46 -08:00
Anders Kaseorg 1f1b2f9a68 models: Extract zerver.models.bots.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-16 22:08:44 -08:00
Anders Kaseorg bac027962f models: Extract zerver.models.clients.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-16 22:08:44 -08:00
Anders Kaseorg 927d7a9a60 models: Extract zerver.models.prereg_users.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-16 22:08:44 -08:00
Anders Kaseorg 45bb8d2580 models: Extract zerver.models.users.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-16 22:08:44 -08:00
Prakhar Pratyush c1daabd3c0 remote_server: Rename to 'send_server_data_to_push_bouncer'.
This commit renames 'send_analytics_to_push_bouncer'
to 'send_server_data_to_push_bouncer'.
2023-12-11 14:07:39 -08:00
Tim Abbott 5c1a5a816f remote_server: Rename register_realm_with_push_bouncer.
We plan to have this potentially happen more than once for a given
realm.
2023-12-11 14:07:39 -08:00
Prakhar Pratyush d763fae9d0 remote_server: Eliminate separate realms-only code path.
Given that most of the use cases for realms-only code path would
really like to upload audit logs too, and the others would likely
produce a better user experience if they upoaded audit logs, we
should just have a single main code path here i.e.
'send_analytics_to_push_bouncer'.

We still only upload usage statistics according to documented
option, and only from the analytics cron job.

The error handling takes place in 'send_analytics_to_push_bouncer'
itself.
2023-12-11 14:07:39 -08:00
Anders Kaseorg 223b626256 python: Use urlsplit instead of urlparse.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-05 13:03:07 -08:00
Anders Kaseorg 3853fa875a python: Consistently use from…import for urllib.parse.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-05 13:03:07 -08:00
Anders Kaseorg 8a7916f21a python: Consistently use from…import for datetime.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-05 12:01:18 -08:00
Mateusz Mandera a67dd6dc1f realms: Call send_realms_only_to_push_bouncer at realm creation/import. 2023-12-03 08:49:58 -08:00
Anders Kaseorg a50eb2e809 mypy: Enable new error explicit-override.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-10-12 12:28:41 -07:00
Esther Anierobi b2ea3125b2 exports: Improve notifications about completed data exports.
Change the url in the notification message to point to the settings
interface rather than linking to the export directly.

This is a much better user experience in the case that the export has
been deleted since the time the export was requested.

Fixes: #26923.
2023-10-11 17:42:32 -07:00
Anders Kaseorg cf4791264c python: Replace functools.partial with type-safe returns.curry.partial.
The type annotation for functools.partial uses unchecked Any for all
the function parameters (both early and late).  returns.curry.partial
uses a mypy plugin to check the parameters safely.

https://returns.readthedocs.io/en/latest/pages/curry.html

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-09-11 18:03:45 -07:00
Alex Vandiver b94402152d models: Always search Messages with a realm_id or id limit.
Unless there is a limit on `id`, always provide a `realm_id` limit as
well.  We also notate which index is expected to be used in each
query.
2023-09-11 15:00:37 -07:00
Zixuan James Li 30495cec58 migration: Rename extra_data_json to extra_data in audit log models.
This migration applies under the assumption that extra_data_json has
been populated for all existing and coming audit log entries.

- This removes the manual conversions back and forth for extra_data
throughout the codebase including the orjson.loads(), orjson.dumps(),
and str() calls.

- The custom handler used for converting Decimal is removed since
DjangoJSONEncoder handles that for extra_data.

- We remove None-checks for extra_data because it is now no longer
nullable.

- Meanwhile, we want the bouncer to support processing RealmAuditLog entries for
remote servers before and after the JSONField migration on extra_data.

- Since now extra_data should always be a dict for the newer remote
server, which is now migrated, the test cases are updated to create
RealmAuditLog objects by passing a dict for extra_data before
sending over the analytics data. Note that while JSONField allows for
non-dict values, a proper remote server always passes a dict for
extra_data.

- We still test out the legacy extra_data format because not all
remote servers have migrated to use JSONField extra_data.
This verifies that support for extra_data being a string or None has not
been dropped.

Co-authored-by: Siddharth Asthana <siddharthasthana31@gmail.com>
Signed-off-by: Zixuan James Li <p359101898@gmail.com>
2023-08-16 17:18:14 -07:00
Steve Howell 51db22c86c per-request caches: Add per_request_cache library.
We have historically cached two types of values
on a per-request basis inside of memory:

    * linkifiers
    * display recipients

Both of these caches were hand-written, and they
both actually cache values that are also in memcached,
so the per-request cache essentially only saves us
from a few memcached hits.

I think the linkifier per-request cache is a necessary
evil. It's an important part of message rendering, and
it's not super easy to structure the code to just get
a single value up front and pass it down the stack.

I'm not so sure we even need the display recipient
per-request cache any more, as we are generally pretty
smart now about hydrating recipient data in terms of
how the code is organized. But I haven't done thorough
research on that hypotheseis.

Fortunately, it's not rocket science to just write
a glorified memoize decorator and tie it into key
places in the code:

    * middleware
    * tests (e.g. asserting db counts)
    * queue processors

That's what I did in this commit.

This commit definitely reduces the amount of code
to maintain. I think it also gets us closer to
possibly phasing out this whole technique, but that
effort is beyond the scope of this PR. We could
add some instrumentation to the decorator to see
how often we get a non-trivial number of saved
round trips to memcached.

Note that when we flush linkifiers, we just use
a big hammer and flush the entire per-request
cache for linkifiers, since there is only ever
one realm in the cache.
2023-08-11 11:09:34 -07:00
Alex Vandiver c77c78f147 missed-message: Add a try-catch to prevent killing background thread.
An exception which escapes from this loop can kill the background
worker thread; this results in consuming the queue (leading to the
illusion of progress) but more and more rows silently piling up in the
ScheduledMessageNotificationEmail table.

Wrap the inside of the `while True` loop in a try/catch to make sure
that no exceptions escape and kill the background thread.  To prevent
even more indentation, the inner loop is extracted into its own
function.  It returns true/false to signal if the `self.stopping` was
set to tell the loop to stop; we cannot check it ourselves in the
outer loop because it needs to hold the lock to be examined.
2023-07-25 10:01:00 -07:00
Anders Kaseorg b285813beb error_notify: Remove custom email error reporting handler.
Restore the default django.utils.log.AdminEmailHandler when
ERROR_REPORTING is enabled.  Those with more sophisticated needs can
turn it off and use Sentry or a Sentry-compatible system.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-07-20 11:00:09 -07:00
Alex Vandiver be960f4142 missed-message: Lock ScheduledMessageNotificationEmail rows.
This prevents the rows from being deleted out from under the worker
while it is sending emails.
2023-07-13 11:50:42 -07:00
Alex Vandiver d87895a3ef missed-message: Merge before calling handle_missedmessage_emails.
The MissedMessage queue worker is the single callsite of
`handle_missedmessage_emails`, which immediately transforms the list
of events into a dict keyed by message-id.

Skip the intermediate list step, and use defaultdict and a dataclass
to simplify and make explicit the pieces.  This removes the unused
user_profile_id and message_id pieces of the data structure.
2023-07-13 11:50:42 -07:00
Alex Vandiver c7d9a4784e missed-message: Remove unnecessary select_related().
This was added in ebb4eab0f99d; neither the `user_profile` nor the
`message` attribute are read off of the object.
2023-07-13 11:50:42 -07:00
Zixuan James Li b6d1e56cac queue_processors: Avoid queue worker timeouts in tests.
For tests that use the dev server, like test-api, test-js-with-puppeteer,
we don't have the consumers for the queues. As they eventually timeout,
we get unnecessary error messages. This adds a new flag, disable_timeout,
to disable this behavior for the test cases.
2023-06-28 11:06:24 -07:00
Lauryn Menard d3f7cfccbc zerver: Update comments with "private message" or "PM".
Updates comments/doc-strings that use "private message" or "PM" in
files in the `/zerver` directory to instead use "direct message".
2023-06-23 11:24:13 -07:00
Alex Vandiver 7811e99548 realm_export: Handle hard head-of-queue failures.
Realm exports may OOM on deployments with low memory; to ensure
forward progress, log the start time in the RealmAuditLog entry, and
key off of the existence of that to prevent re-attempting an export
which was already tried once.
2023-05-16 14:05:01 -07:00
Alex Vandiver 4a43856ba7 realm_export: Do not assume null extra_data is special.
Fixes: #20197.
2023-05-16 14:05:01 -07:00
Alex Vandiver 362177b788 workers: Run realm export with one thread if in low-memory environment.
We previously hard-coded 6 threads for the realm export; in low-memory
environments, spawning 6 threads for an export can lean to an OOM,
which kills the process and leaves a partial export on disk -- which
is then tried again, since the export was never completed.  This leads
to excessive disk consumption and brief repeated outages of all other
workers, until the failing export job is manually de-queued somehow.

Lower the export to only use on thread if it is already running in a
multi-threaded environment.  Note that this does not guarantee forward
progress, it merely makes it more likely that exports will succeed in
low-memory deployments.
2023-05-16 14:05:01 -07:00
Alex Vandiver 9f231322c9 workers: Pass down if they are running multi-threaded.
This allows them to decide for themselves if they should enable
timeouts.
2023-05-16 14:05:01 -07:00
Alex Vandiver daba72c116 error_notify: Drop any remaining browser-side errors in RabbitMQ queue. 2023-04-13 14:59:58 -07:00
Alex Vandiver 3efc0c9af3 workers: Rewrite missedmessage_emails with a worker thread.
The previous implementation leaked database connections, as a new
thread (and thus a new thread-local database connection) was made for
each timer execution.  While these connections were relatively
lightweight in Python, they also incur memory overhead in the
PostgreSQL server itself.  The logic for managing the timer was also
unclear, and the unavoidable deadlock in the stopping logic was rather
unfortunate.

Rewrite with one explicit worker thread which handles the delayed
message sending.  The RabbitMQ consumer creates the database rows, and
notifies the worker to start its 5s timeout.  Because it is controlled
by a condition variable, it does not hold the lock while waiting, and
can be notified to exit.
2023-04-10 17:38:08 -07:00
Alex Vandiver 02a73af386 deferred_work: Log at start of the work.
This is helpful for debugging -- generally these tasks are in a worker
queue because they take a long time to run, so knowing what long task
is about to start before it does, rather than just after, is useful.
2023-02-09 12:06:38 -08:00
Anders Kaseorg 7e3a681f80 ruff: Fix S108 Probable insecure usage of temporary file.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-26 10:14:56 -08:00
Anders Kaseorg 25346bde98 ruff: Fix SIM118 Use `k in d` instead of `k in d.keys()`.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-23 11:18:36 -08:00
Anders Kaseorg 46cdcd3f33 ruff: Fix PIE790 Unnecessary `pass` statement.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-04 16:25:07 -08:00
Anders Kaseorg e1ed44907b ruff: Fix SIM118 Use `key in dict` instead of `key in dict.keys()`.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-04 16:25:07 -08:00
Anders Kaseorg 73c4da7974 ruff: Fix N818 exception name should be named with an Error suffix.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-11-17 16:52:00 -08:00
Anders Kaseorg 9a8a2bd345 ruff: Enable import sorting, replacing isort.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-11-16 09:29:11 -08:00