88 Commits

Author SHA1 Message Date
Alex Vandiver a8a1f10f3c digest: Clear the cache once we move to a new realm / cutoff value. 2023-09-13 13:25:59 -07:00
Alex Vandiver b9f72bdd68 digest: Switch loop to early-abort for clarity. 2023-09-13 13:25:59 -07:00
Alex Vandiver b555d3f553 digest: Cache per-stream recent topics, rather than batching.
The query plan for fetching recent messages from the arbitrary set of
streams formed by the intersection of 30 random users can be quite
bad, and can descend into a sequential scan on `zerver_recipient`.
Worse, this work of pulling recent messages out is redone if the
stream appears in the next batch of 30 users.

Instead, pull the recent messages for a stream on a one-by-one basis,
but keep them in an in-memory cache.  Since digests are enqueued in
30-user batches but still one-realm-at-a-time, this saves work in two
ways: the query plans are faster, and their results can be reused
across batches.

This requires that we pull the stream-id to stream-name mapping for
_all_ streams in the realm at once, but that is well-indexed and
unlikely to cause performance issues -- in fact, it may be faster
than pulling a random subset of the streams in the realm.
2023-09-13 13:25:59 -07:00
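The per-stream caching idea described in b555d3f553 can be sketched roughly as follows. This is a minimal illustration, not the code from that commit: `PerStreamCache`, `fake_fetch`, and the string messages are hypothetical stand-ins, and the real database query is replaced by a plain function.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for "recent messages in one stream".
RecentMessages = List[str]


class PerStreamCache:
    """Fetch recent messages one stream at a time and remember the result."""

    def __init__(self, fetch_one: Callable[[int], RecentMessages]) -> None:
        self._fetch_one = fetch_one
        self._cache: Dict[int, RecentMessages] = {}
        self.fetch_count = 0  # how many times we actually hit the "database"

    def get(self, stream_id: int) -> RecentMessages:
        if stream_id not in self._cache:
            self.fetch_count += 1
            self._cache[stream_id] = self._fetch_one(stream_id)
        return self._cache[stream_id]

    def get_many(self, stream_ids: Set[int]) -> Dict[int, RecentMessages]:
        return {stream_id: self.get(stream_id) for stream_id in sorted(stream_ids)}

    def clear(self) -> None:
        # Assumed to be called when moving to a new realm or cutoff value.
        self._cache.clear()


def fake_fetch(stream_id: int) -> RecentMessages:
    # Stand-in for a per-stream query; no real database is involved.
    return [f"message in stream {stream_id}"]


cache = PerStreamCache(fake_fetch)
cache.get_many({1, 2, 3})  # first batch of users touches streams 1, 2, 3
cache.get_many({2, 3, 4})  # second batch overlaps on streams 2 and 3
print(cache.fetch_count)  # 4 fetches instead of 6
```

Under these assumptions, streams shared between batches are fetched only once, which is the reuse the commit message describes.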
Alex Vandiver bca9821c89 digest: Rename get_recent_streams for clarity. 2023-09-13 13:25:59 -07:00
Alex Vandiver 524d4913b3 digest: Filter out users who have joined recently in SQL. 2023-09-13 13:25:59 -07:00
Alex Vandiver 584c202d36 digest: Remove unnecessary should_process_digest function. 2023-09-13 13:25:59 -07:00
Steve Howell 751b8b5bb5 tests: Flush per-request caches automatically for query counts. 2023-08-11 11:09:34 -07:00
Steve Howell 549891266d tests: Add assert_memcached_count.
We use a specific name to distinguish from other caches
like per-request caches.
2023-08-11 11:09:34 -07:00
Anders Kaseorg df001db1a9 black: Reformat with Black 23.
Black 23 enforces some slightly more specific rules about empty line
counts and redundant parenthesis removal, but the result is still
compatible with Black 22.

(This does not actually upgrade our Python environment to Black 23
yet.)

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-02-02 10:40:13 -08:00
Zixuan James Li 46329a2710 test_classes: Create a dedicated helper for query count check.
This adds a helper based on the testing pattern of using the "queries_captured"
context manager with "assert_length" to check the number of queries
executed, in order to prevent performance regressions.

It explains the rationale of checking the query count through an
"AssertionError" and prints the queries captured as assert_length does,
but with a format optimized for displaying the queries in a more
readable manner.

Signed-off-by: Zixuan James Li <p359101898@gmail.com>
2022-10-17 11:32:52 -07:00
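A query-count check of the kind described in 46329a2710 might look roughly like the sketch below. It is not the helper from that commit: the name `assert_query_count` is hypothetical, and a plain list of SQL strings stands in for the captured queries so that the example runs without Django.

```python
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def assert_query_count(captured: List[str], expected: int) -> Iterator[None]:
    """Fail if the wrapped block appends a number of queries other than `expected`.

    `captured` is a hypothetical stand-in for a real query log.
    """
    start = len(captured)
    yield
    new_queries = captured[start:]
    if len(new_queries) != expected:
        # Print one query per line so a regression is easy to read.
        listing = "\n".join(
            f"  {index}. {sql}" for index, sql in enumerate(new_queries, 1)
        )
        raise AssertionError(
            f"Expected {expected} queries, got {len(new_queries)}:\n{listing}"
        )


query_log: List[str] = []
with assert_query_count(query_log, 2):
    query_log.append("SELECT * FROM zerver_stream")
    query_log.append("SELECT * FROM zerver_message")
```

The point of raising an `AssertionError` with the queries listed is that an unexpected extra query shows up in the test failure itself, rather than only as a changed number.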
Mateusz Mandera 00b3546c9f models: Add denormalized .realm column to Message.
This commit adds the OPTIONAL .realm attribute to Message
(and ArchivedMessage), with the server changes for making new Messages
have this set. Old Messages still have to be migrated to backfill this,
before it can be non-nullable.

Appropriate test changes to correctly set .realm for Messages the tests
manually create are included here as well.
2022-10-07 10:09:38 -07:00
Mateusz Mandera 5850c38f4e test_digest: Use proper stream.id in test_get_hot_topics.
Just using values 1 and 2 as stream ids is not good, because there is no
telling which realm these streams are in (or even whether they exist).
This can create weird Messages with sender being a user of "zulip" realm
and the stream being in another realm - which would be a corrupted
state.
2022-09-28 16:45:25 +02:00
Adam Sah ba5cf331a2 testing: 100% coverage for zerver/tests/test_digest.py. 2022-06-01 16:09:13 -07:00
Anders Kaseorg b572b18e70 test_digest: Modernize set literal syntax.
Generated by pyupgrade.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-27 12:57:49 -07:00
Mateusz Mandera fcf82bf047 digest: Don't send emails to deactivated users, even if queued. 2022-04-15 14:32:55 -07:00
Mateusz Mandera 7a13836d26 test_digest: Fix typo in a comment. 2022-04-15 14:32:55 -07:00
Anders Kaseorg cbad5739ab actions: Split out zerver.actions.create_user.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-14 17:14:35 -07:00
Anders Kaseorg b0ce4f1bce docs: Fix many spelling mistakes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-07 18:51:06 -08:00
Anders Kaseorg 90e202cd38 docs: Consistently hyphenate “web-public”.
In English, compound adjectives should essentially always be
hyphenated.  This makes them easier to parse, especially for users who
might not recognize that the words “web public” go together as a
phrase.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-01-28 17:45:45 -08:00
Steve Howell 2902f8b931 tests: Ensure stream senders get a UserMessage row.
We now complain if a test author sends a stream message
that does not result in the sender getting a
UserMessage row for the message.

This is basically 100% equivalent to complaining that
the author failed to subscribe the sender to the stream
as part of the test setup, as far as I can tell, so the
AssertionError instructs the author to subscribe the
sender to the stream.

We exempt bots from this check, although it is
plausible we should only exempt the system bots like
the notification bot.

I considered auto-subscribing the sender to the stream,
but that can be a little more expensive than the
current check, and we generally want test setup to be
explicit.

If there is some legitimate way that a subscribed human
sender can't get a UserMessage, then we probably want
an explicit test for that, or we may want to change the
backend to just write a UserMessage row in that
hypothetical situation.

For most tests, including almost all the ones fixed
here, the author just wants their test setup to
realistically reflect normal operation, and often devs
may not realize that Cordelia is not subscribed to
Denmark or not realize that Hamlet is not subscribed to
Scotland.

Some of us don't remember our Shakespeare from high
school, and our stream subscriptions don't even
necessarily reflect which countries the Bard placed his
characters in.

There may also be some legitimate use case where an
author wants to simulate sending a message to an
unsubscribed stream, but for those edge cases, they can
always set allow_unsubscribed_sender to True.
2021-12-10 09:40:04 -08:00
Anders Kaseorg 3665deb93a python: Remove unnecessary intermediate lists.
Generated automatically by pyupgrade.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-08-02 15:53:52 -07:00
akshatdalton e203112fd4 refactor: Use `assert_length` helper instead of `assertTrue/assertEqual`. 2021-07-13 13:03:38 -07:00
shanukun 4b67946605 refactor: Make acting_user a mandatory kwarg for do_create_user. 2021-02-25 17:58:00 -08:00
Anders Kaseorg 6e4c3e41dc python: Normalize quotes with Black.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg 11741543da python: Reformat with Black, except quotes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Vishnu KS 5c026d67e3 digest: Sort topics in descending order in get_hot_topics.
We want topics with high diversity and large lengths.
So they should be sorted with reverse=True.

This bug seems to have been introduced in 936171d258.
2021-02-09 10:35:47 -08:00
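The effect of `reverse=True` in 5c026d67e3 can be illustrated with a small sketch. The tuple layout and the `hottest_first` name are assumptions made for this example; the actual ranking key used by `get_hot_topics` is not shown in this log.

```python
from typing import List, Tuple

# Hypothetical shape: (topic name, distinct senders, message count).
Topic = Tuple[str, int, int]


def hottest_first(topics: List[Topic]) -> List[Topic]:
    # Rank by diversity first, then by length; reverse=True puts the
    # largest values first, which is what "hot" topics should mean.
    return sorted(topics, key=lambda topic: (topic[1], topic[2]), reverse=True)


topics = [("lunch", 1, 3), ("release", 4, 20), ("bug", 4, 7)]
print([name for name, _, _ in hottest_first(topics)])
# ['release', 'bug', 'lunch']
```

Without `reverse=True`, the same key would put "lunch" first, which is the ordering bug the commit describes.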
Alex Vandiver d0f0c2f2ed digest: Fix the structure that we enqueue across when digesting.
This rename was missed in bfa0bdf3d6.
Without this fix, digest messages fail to send.
2021-02-08 17:28:59 -08:00
Steve Howell 1040fb7219 email digests: Remove handle_digest_email shim.
The previous commit made it so we only call the
shim in tests, so now we completely remove it.
2021-01-17 11:28:30 -08:00
Steve Howell bfa0bdf3d6 email digests: Process users in chunks of 30.
This should make the queue empty more quickly,
because we do bulk queries to prevent database
hops.
2021-01-17 11:28:30 -08:00
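Processing users in chunks of 30, as bfa0bdf3d6 describes, amounts to something like the following sketch. The `chunks` helper is a hypothetical name for illustration, not necessarily what the commit uses.

```python
from typing import Iterator, List


def chunks(user_ids: List[int], size: int = 30) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` user ids."""
    for start in range(0, len(user_ids), size):
        yield user_ids[start : start + size]


batches = list(chunks(list(range(1, 66))))  # 65 hypothetical user ids
print([len(batch) for batch in batches])
# [30, 30, 5]
```

Each batch can then be handled with bulk queries, so 65 users cost three rounds of queries rather than 65.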
Steve Howell e0b451730a email digests: Extract get_new_streams.
This makes us more efficient when handling
multiple users.  We don't have to keep
sending the same two queries to the database.

Note that as part of this we eliminated
a failure mode for the obscure population
of users for whom both `user.is_guest` and
`user.can_access_public_streams()` return
False.  We know this would have only affected
Zephyr users (by looking at the code), and
we know we don't actually process Zephyr
users for email digests (or else we would
have raised exceptions in the old code).
2021-01-17 11:28:30 -08:00
Steve Howell 23de94504f email digests: Query streams for messages up front.
This should save us many hops to the database when
we process users in bulk.
2021-01-17 11:28:30 -08:00
Steve Howell f8bbb7fea9 email digests: Use select_related("realm").
We mostly need realm_id, but when we go to build
message lists, we need realm.uri.

We could probably be more aggressive about using
`only` here, but for now I am just trying to
reduce hops to the database.
2021-01-17 11:28:29 -08:00
Steve Howell 52e2d5a733 email digests: Avoid long_term_idle check.
We want to exclude users with recent subscription
activity from emails, regardless of whether
the long_term_idle flag is set.
2021-01-17 11:28:29 -08:00
Steve Howell 162b372b93 email digests: Do one query for recent streams.
This is another way to limit hops to the database
when we process users in bulk.
2021-01-17 11:28:29 -08:00
Steve Howell e2e0f06b2a email digests: Call get_recent_topics once per batch.
Once we start processing digests in batch, this will
let us amortize the expense of the message query
over multiple users.
2020-11-16 08:59:29 -08:00
Steve Howell 1d1e45e9ec digests: Use UserActivityInterval for user activity.
Note that we are much more efficient about finding
active users here:

    - we do one query per realm (instead of per-user)
    - we pass the cutoff date to the database
    - we get back just a list of distinct ids
2020-11-16 08:59:29 -08:00
Steve Howell b52f56080e performance: Just get user_ids to queue digest emails. 2020-11-16 08:59:29 -08:00
Steve Howell d0260392f7 digests: Get user objects from the database.
The query counts increase here for somewhat
contrived reasons.  The tests before this
commit reflected a successful trip to the
UserProfile cache, but that's not actually
realistic in practice.
2020-11-16 08:59:29 -08:00
Steve Howell 7737413cec digest tests: Improve gather_new_streams test.
We don't need to mock the dates here.  We also
explicitly clear out all streams first, and then
we explicitly test with both the stream being
current and the stream being old.
2020-11-16 08:59:28 -08:00
Steve Howell 9538edde06 digest tests: Simplify bots test.
We can use the _enqueue_emails_for_realm helper
to avoid all the Tuesday-related logic here.

We also don't bother to create UserActivity
records, since the bot gets excluded by virtue
of its being a bot.  (Also, the date ranges
here were sketchy due to the time mocking.)
2020-11-16 08:59:28 -08:00
Steve Howell 0624833af6 digest tests: Improve Tuesday tests.
If we're mocking time, we should do it consistently.
2020-11-16 08:59:28 -08:00
Steve Howell 2f4d7a6171 tests: Fix test_inactive_users_queued_for_digest.
We can avoid all the date mocking now for all
but a couple tests that exercise the is-it-Tuesday
logic.

And this test now correctly tests that we exclude
recently active users.

And this allows us to remove the other test.
2020-11-16 08:59:28 -08:00
Steve Howell cf6bcfb84a digest emails: Exclude users who had recent digests.
This code protects us in case we ever need to run
email digests twice in the same day.
2020-11-16 08:59:28 -08:00
Steve Howell fb3d4c1618 digest tests: Avoid warnings about naive time. 2020-11-16 08:59:28 -08:00
Steve Howell 4271442fba email digests: Write RealmAuditLog rows. 2020-11-16 08:59:28 -08:00
Steve Howell c5dc9d386f refactor: Use sets of stream_ids for email digests.
I now use sets for stream_ids in more of the digest
code.

As part of this I replaced exclude_subscription_modified_streams
with streams_recently_modified_for_user.

It's easier for the caller to just ask for ids
to delete from its callee than it is to pass
in a set/list to mutate.

The simpler boundary between the functions makes
the tests easier to write--you can see the
`filtered_streams` logic goes away in this diff.

I also make the tests a bit more thorough by using
combinations of Cordelia/Othello and Verona/Denmark
to try to find multiple possible flaws.

And I make the time intervals longer than 1s to
avoid false negatives from slow CI boxes.
2020-11-05 17:42:43 -08:00
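The interface change in c5dc9d386f (ask the callee for ids to remove instead of passing in a collection to mutate) can be sketched as below. The function name, its parameters, and the integer timestamps are hypothetical; this is not the signature of `streams_recently_modified_for_user`.

```python
from typing import Dict, Set


def recently_modified(
    stream_ids: Set[int], modified_at: Dict[int, int], cutoff: int
) -> Set[int]:
    """Return the subset of stream ids modified after `cutoff`.

    The input set is left untouched; the caller decides what to do
    with the result.
    """
    return {
        stream_id
        for stream_id in stream_ids
        if modified_at.get(stream_id, 0) > cutoff
    }


subscribed = {1, 2, 3}
modified_at = {2: 150, 3: 50}  # hypothetical timestamps
remaining = subscribed - recently_modified(subscribed, modified_at, cutoff=100)
print(sorted(remaining))
# [1, 3]
```

Because the helper returns a set and has no side effects, a test can compare its return value directly instead of inspecting a mutated argument.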
Steve Howell 88a57ed4ac bulk digest: Get stream subscriptions in bulk.
If we have multiple users, this reduces the number
of queries we need to do, because we get all
subscriptions for all users in a single query
to Subscription.

For the single-user case, we are introducing an
extra query hop, but the database is doing
roughly the same work, because we are just breaking
up this complex query into two hops:

    messages =
        select ...  from message
        where recipient__type_id in (
            select stream_id from subscription
            where ...
        )

Now it's more like:

    stream_ids =
        select stream_id from subscription
        where ...

    messages =
        select ... from message
        where recipient__type_id in stream_ids
2020-11-05 09:36:59 -08:00
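The one-hop versus two-hop shapes sketched in 88a57ed4ac can be tried out with an in-memory SQLite database. The table and column names here are simplified assumptions for illustration and do not match Zulip's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE subscription (user_id INTEGER, stream_id INTEGER);
    CREATE TABLE message (id INTEGER PRIMARY KEY, stream_id INTEGER);
    INSERT INTO subscription VALUES (1, 10), (1, 20), (2, 30);
    INSERT INTO message VALUES (1, 10), (2, 20), (3, 30), (4, 10);
    """
)

# One hop: the subscription lookup is a subquery of the message query.
one_hop = conn.execute(
    "SELECT id FROM message WHERE stream_id IN "
    "(SELECT stream_id FROM subscription WHERE user_id = ?) ORDER BY id",
    (1,),
).fetchall()

# Two hops: fetch the stream ids first, then pass them to the message query.
stream_ids = [
    row[0]
    for row in conn.execute(
        "SELECT stream_id FROM subscription WHERE user_id = ? ORDER BY stream_id",
        (1,),
    )
]
placeholders = ", ".join("?" for _ in stream_ids)
two_hop = conn.execute(
    f"SELECT id FROM message WHERE stream_id IN ({placeholders}) ORDER BY id",
    stream_ids,
).fetchall()

print(one_hop == two_hop)
# True
```

The two-hop form costs an extra round trip for a single user, but the first query can be widened to cover many users at once, which is where the bulk savings come from.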
Steve Howell c83db37161 email digests: Introduce bulk methods for digest.
Note that we are not changing anything semantically
or algorithmically yet.  The only overhead here
for the single-user case is boxing and unboxing
data into single-item dicts and lists.

The interfaces for callers in the view and the
queue processor remain the same for now.
2020-11-05 09:36:59 -08:00
Steve Howell 0e2d02b0a2 digest tests: Count cache tries. 2020-11-05 09:36:59 -08:00
Steve Howell 127f4e1291 digest tests: Add more users to bulk digest test. 2020-11-05 09:36:59 -08:00