Since the cache is flushed when the cutoff or realm changes, the
maximum size of the cache should cap out at the number of streams in
the realm. Raise the max cache size, now that this will not simply
lead to useless cache space for smaller servers.
There is now no longer any reason to have the scheduled_email
enqueuing wait until all of the users' contexts have been generated.
Switch to returning the contexts as an iterator, and send them as we
compute them.
The query plan for fetching recent messages from the arbitrary set of
streams formed by the intersection of 30 random users can be quite
bad, and can descend into a sequential scan on `zerver_recipient`.
Worse, this work of pulling recent messages out is redone if the
stream appears in the next batch of 30 users.
Instead, pull the recent messages for a stream on a one-by-one basis,
but cache them in an in-memory cache. Since digests are enqueued in
30-user batches but still one-realm-at-a-time, work will be saved both
in terms of faster query plans whose results can also be reused across
batches.
This requires that we pull the stream-id to stream-name mapping for
_all_ streams in the realm at once, but that is well-indexed and
unlikely to cause performance issues -- in fact, it may be faster
than pulling a random subset of the streams in the realm.
To explain the rationale of this change, for example, there is
`get_user_activity_summary` which accepts either a `Collection[UserActivity]`,
where `QuerySet[T]` is not strictly `Sequence[T]` because its slicing behavior
is different from the `Protocol`, making `Collection` necessary.
Similarily, we should have `Iterable[T]` instead of `List[T]` so that
`QuerySet[T]` will also be an acceptable subtype, or `Sequence[T]` when we
also expect it to be indexed.
Signed-off-by: Zixuan James Li <p359101898@gmail.com>
nlargest is the natural fit for selecting n biggest items
from an unsorted list. It's more readable as well as more
efficent (even though we don't care much about the efficeny
in this particular case).
This makes us more efficient when handling
multiple users. We don't have to keep
sending the same two queries to the database.
Note that as part of this we eliminated
a failure mode for the obscure population
of users from whom both `user.is_guest` and
`user.can_access_public_streams()` returns
False. We know this would have only affected
Zephyr users (by looking at the code), and
we know we don't actually process Zephyr
users for email digests (or else we would
have raised exceptions in the old code).
We mostly need realm_id, but when we go to build
message lists, we need realm.uri.
We could probably be more aggresive about using
`only` here, but for now I am just trying to
reduce hops to the database.
Note that we are much more efficient about finding
active users here:
- we do one query per realm (instead of per-user)
- we pass the cutoff date to the database
- we get back just a list of distinct ids
This function is going away completely soon. It is
querying everybody's entire UserActivity history instead
of passing the cutoff date to the database!
The query counts increase here for somewhat
contrived reasons. The tests before this
commit reflected a successful trip to the
UserProfile cache, but that's not actually
realistic in practice.
The code we deleted here was no longer
doing anything.
Maybe the code was always dead, or maybe it
was written during a time when topics_by_diversity
and topics_by_length actually had different keys.
But now it's clearly cruft.
If we have 4 or more topics, then the code above
it would already have populated the list with 4
elements, and the `if num_convos < 4` condition
would evaluate to False.
And if we had 3 or fewer topics, then we would
have already put all possible topics into our
result, and the `topics_by_diversity[num_convos:4]`
slice would be empty.
It's possible that we should just have a simple
heuristic for topic hotness like `10*num_senders
+ messages`, so we don't have to maintain this
fiddly function, and we can just do something like
`topics_by_score[:4]`.
I now use sets for stream_ids in more of the digest
code.
As part of this I replaced exclude_subscription_modified_streams
with streams_recently_modified_for_user.
It's easier for the caller to just ask for ids
to delete from its callee than it is to pass
in a set/list to mutate.
The simpler boundary between the functions makes
the tests easier to write--you can see the
`filtered_streams` logic goes away in this diff.
I also make the tests a bit more thorough by using
combinations of Cordelia/Othello and Verona/Denmark
to try to find multiple possible flaws.
And I make the time intervals longer than 1s to
avoid false negatives from slow CI boxes.