Commit Graph

177 Commits

Author SHA1 Message Date
Alex Vandiver e0d3176098 digest: Increase size of stream cache.
Since the cache is flushed when the cutoff or realm changes, the
maximum size of the cache should cap out at the number of streams in
the realm.  Raise the max cache size, now that this will not simply
lead to useless cache space for smaller servers.
2023-09-13 13:25:59 -07:00
Alex Vandiver a8a1f10f3c digest: Clear the cache once we move to a new realm / cutoff value. 2023-09-13 13:25:59 -07:00
Alex Vandiver 39358f77dd digest: Enqueue emails as we generate the contexts.
There is now no longer any reason to have the scheduled_email
enqueuing wait until all of the users' contexts have been generated.
Switch to returning the contexts as an iterator, and send them as we
compute them.
2023-09-13 13:25:59 -07:00
Alex Vandiver b9f72bdd68 digest: Switch loop to early-abort for clarity. 2023-09-13 13:25:59 -07:00
Alex Vandiver b555d3f553 digest: Cache per-stream recent topics, rather than batching.
The query plan for fetching recent messages from the arbitrary set of
streams formed by the intersection of 30 random users can be quite
bad, and can descend into a sequential scan on `zerver_recipient`.
Worse, this work of pulling recent messages out is redone if the
stream appears in the next batch of 30 users.

Instead, pull the recent messages for a stream on a one-by-one basis,
but cache them in an in-memory cache.  Since digests are enqueued in
30-user batches but still one-realm-at-a-time, work will be saved both
in terms of faster query plans whose results can also be reused across
batches.

This requires that we pull the stream-id to stream-name mapping for
_all_ streams in the realm at once, but that is well-indexed and
unlikely to cause performance issues -- in fact, it may be faster
than pulling a random subset of the streams in the realm.
2023-09-13 13:25:59 -07:00
Alex Vandiver f8a9779b54 digest: Rename get_slim_stream_map slightly and explain its name more. 2023-09-13 13:25:59 -07:00
Alex Vandiver bca9821c89 digest: Rename get_recent_streams for clarity. 2023-09-13 13:25:59 -07:00
Alex Vandiver 524d4913b3 digest: Filter out users who have joined recently in SQL. 2023-09-13 13:25:59 -07:00
Alex Vandiver d8668ab242 digest: Narrow the query by only fetching the sender full name. 2023-09-13 13:25:59 -07:00
Alex Vandiver 058a168bfe digest: Rewrite target-user algorithm as one query.
There is no reason to do this set manipulation in Python.
2023-09-13 13:25:59 -07:00
Alex Vandiver 584c202d36 digest: Remove unnecessary should_process_digest function. 2023-09-13 13:25:59 -07:00
Anders Kaseorg 2665a3ce2b python: Elide unnecessary list wrappers.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-09-13 12:41:23 -07:00
Alex Vandiver b94402152d models: Always search Messages with a realm_id or id limit.
Unless there is a limit on `id`, always provide a `realm_id` limit as
well.  We also notate which index is expected to be used in each
query.
2023-09-11 15:00:37 -07:00
Anders Kaseorg c2c96eb0cf python: Annotate type aliases with TypeAlias.
This is not strictly necessary but it’s clearer and improves mypy’s
error messages.

https://docs.python.org/3/library/typing.html#typing.TypeAlias
https://mypy.readthedocs.io/en/stable/kinds_of_types.html#type-aliases

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-08-07 10:02:49 -07:00
Anders Kaseorg 77c15547e6 ruff: Fix C414 Unnecessary `list` call within `sorted()`.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-11-03 12:10:15 -07:00
Zixuan James Li ab1bbdda65 typing: Broaden type annotations for QuerySet compatibility.
To explain the rationale of this change, for example, there is
`get_user_activity_summary` which accepts either a `Collection[UserActivity]`,
where `QuerySet[T]` is not strictly `Sequence[T]` because its slicing behavior
is different from the `Protocol`, making `Collection` necessary.

Similarily, we should have `Iterable[T]` instead of `List[T]` so that
`QuerySet[T]` will also be an acceptable subtype, or `Sequence[T]` when we
also expect it to be indexed.

Signed-off-by: Zixuan James Li <p359101898@gmail.com>
2022-07-07 11:27:42 -07:00
Mateusz Mandera fcf82bf047 digest: Don't send emails to deactivated users, even if queued. 2022-04-15 14:32:55 -07:00
Anders Kaseorg b0ce4f1bce docs: Fix many spelling mistakes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-07 18:51:06 -08:00
Anders Kaseorg 6e4c3e41dc python: Normalize quotes with Black.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg 11741543da python: Reformat with Black, except quotes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Vishnu KS 3f4f16f4f1 digest: Remove comments from get_hot_topics.
The code is self explanatory.
2021-02-09 10:35:47 -08:00
Vishnu KS e9587900e6 digest: Use heapq.nlargest instead of sorted.
nlargest is the natural fit for selecting n biggest items
from an unsorted list. It's more readable as well as more
efficent (even though we don't care much about the efficeny
in this particular case).
2021-02-09 10:35:47 -08:00
Vishnu KS 738d759e6f digest: Create MAX_HOT_TOPICS_TO_BE_INCLUDED_IN_DIGEST constant. 2021-02-09 10:35:47 -08:00
Vishnu KS c0bd05b52d digest: Check whether length of hot topics is 4.
The length of hot topics would not exceed 4.
2021-02-09 10:35:47 -08:00
Vishnu KS 5c026d67e3 digest: Sort topics in descending order in get_hot_topics.
We want topics with high diversity and large lengths.
So they should be sorted with reverse=True.

This bug seems to be introduced in 936171d258
2021-02-09 10:35:47 -08:00
Alex Vandiver d0f0c2f2ed digest: Fix the structure that we enqueue across when digesting.
This rename was missed in bfa0bdf3d6.
Without this fix, digest messages fail to send.
2021-02-08 17:28:59 -08:00
Tim Abbott 5a02b33f2e digest: Add a large block comment on correctness. 2021-01-17 11:37:59 -08:00
Steve Howell 1040fb7219 email digests: Remove handle_digest_email shim.
The previous commit made it so we only call the
shim in tests, so now we completely remove it.
2021-01-17 11:28:30 -08:00
Steve Howell bfa0bdf3d6 email digests: Process users in chunks of 30.
This should make the queue empty more quickly,
because we do bulk queries to prevent database
hops.
2021-01-17 11:28:30 -08:00
Steve Howell e0b451730a email digests: Extract get_new_streams.
This makes us more efficient when handling
multiple users.  We don't have to keep
sending the same two queries to the database.

Note that as part of this we eliminated
a failure mode for the obscure population
of users from whom both `user.is_guest` and
`user.can_access_public_streams()` returns
False.  We know this would have only affected
Zephyr users (by looking at the code), and
we know we don't actually process Zephyr
users for email digests (or else we would
have raised exceptions in the old code).
2021-01-17 11:28:30 -08:00
Steve Howell 23de94504f email digests: Query streams for messages up front.
This should save us many hops to the database when
we process users in bulk.
2021-01-17 11:28:30 -08:00
Steve Howell 3662bf2dcb minor: Rename stream_map -> user_stream_map. 2021-01-17 11:28:30 -08:00
Steve Howell f8bbb7fea9 email digests: Use select_related("realm").
We mostly need realm_id, but when we go to build
message lists, we need realm.uri.

We could probably be more aggresive about using
`only` here, but for now I am just trying to
reduce hops to the database.
2021-01-17 11:28:29 -08:00
Steve Howell bb56f0ec0e minor: Move get_stream_map to module level.
This is a pure code move.
2021-01-17 11:28:29 -08:00
Steve Howell 52e2d5a733 email digests: Avoid long_term_idle check.
We want to exclude users with recent subscription
activity from emails, regardless of whether
the long_term_idle flag is set.
2021-01-17 11:28:29 -08:00
Steve Howell 162b372b93 email digests: Do one query for recent streams.
This is another way to limit hops to the database
when we process users in bulk.
2021-01-17 11:28:29 -08:00
Alex Vandiver 438d2aa632 digests: Ensure that the teaser_data can be JSON-serialized.
Leaving this as a set means that it fails in zerver.lib.send_email
when serializing into a ScheduledEmail object.
2020-12-15 11:44:50 -08:00
Steve Howell e2e0f06b2a email digests: Call get_recent_topics once per batch.
Once we start processing digests in batch, this will
let us amortize the expense of the message query
over multiple users.
2020-11-16 08:59:29 -08:00
Steve Howell 428f0564a0 minor: Move context code down in the function.
This will make a subsequent diff a bit less noisy.
2020-11-16 08:59:29 -08:00
Steve Howell 1d1e45e9ec digests: Use UserActivityInterval for user activity.
Note that we are much more efficient about finding
active users here:

    - we do one query per realm (instead of per-user)
    - we pass the cutoff date to the database
    - we get back just a list of distinct ids
2020-11-16 08:59:29 -08:00
Steve Howell b52f56080e performance: Just get user_ids to queue digest emails. 2020-11-16 08:59:29 -08:00
Steve Howell e13e5d104d refactor: Only require user_id for inactive_since().
This function is going away completely soon.  It is
querying everybody's entire UserActivity history instead
of passing the cutoff date to the database!
2020-11-16 08:59:29 -08:00
Steve Howell d0260392f7 digests: Get user objects from the database.
The query counts increase here for somewhat
contrived reasons.  The tests before this
commit reflected a successful trip to the
UserProfile cache, but that's not actually
realistic in practice.
2020-11-16 08:59:29 -08:00
Steve Howell e49a482baf email digests: Make transactions atomic. 2020-11-16 08:59:28 -08:00
Steve Howell cf6bcfb84a digest emails: Exclude users who had recent digests.
This code protects us in case we ever need to re-run
email digests twice in the same day.
2020-11-16 08:59:28 -08:00
Steve Howell 4271442fba email digests: Write RealmAuditLog rows. 2020-11-16 08:59:28 -08:00
Steve Howell 5da4332620 minor: Add order-by-id to digest message query.
The order-by-id is now explicit, and I add
comments to explain the select_related tables.
2020-11-06 10:05:46 -08:00
Steve Howell 936171d258 refactor: Extract DigestTopic class.
This gets us away from a lot of dictionary soup.
2020-11-06 10:05:46 -08:00
Steve Howell e8b6c56322 refactor: Simplify get_hot_topics().
The code we deleted here was no longer
doing anything.

Maybe the code was always dead, or maybe it
was written during a time when topics_by_diversity
and topics_by_length actually had different keys.

But now it's clearly cruft.

If we have 4 or more topics, then the code above
it would already have populated the list with 4
elements, and the `if num_convos < 4` condition
would evaluate to False.

And if we had 3 or fewer topics, then we would
have already put all possible topics into our
result, and the `topics_by_diversity[num_convos:4]`
slice would be empty.

It's possible that we should just have a simple
heuristic for topic hotness like `10*num_senders
+ messages`, so we don't have to maintain this
fiddly function, and we can just do something like
`topics_by_score[:4]`.
2020-11-06 10:05:46 -08:00
Steve Howell c5dc9d386f refactor: Use sets of stream_ids for email digests.
I now use sets for stream_ids in more of the digest
code.

As part of this I replaced exclude_subscription_modified_streams
with streams_recently_modified_for_user.

It's easier for the caller to just ask for ids
to delete from its callee than it is to pass
in a set/list to mutate.

The simpler boundary between the functions makes
the tests easier to write--you can see the
`filtered_streams` logic goes away in this diff.

I also make the tests a bit more thorough by using
combinations of Cordelia/Othello and Verona/Denmark
to try to find multiple possible flaws.

And I make the time intervals longer than 1s to
avoid false negatives from slow CI boxes.
2020-11-05 17:42:43 -08:00