Commit Graph

151 Commits

Author SHA1 Message Date
Tim Abbott 5a02b33f2e digest: Add a large block comment on correctness. 2021-01-17 11:37:59 -08:00
Steve Howell 1040fb7219 email digests: Remove handle_digest_email shim.
The previous commit made it so we only call the
shim in tests, so now we completely remove it.
2021-01-17 11:28:30 -08:00
Steve Howell bfa0bdf3d6 email digests: Process users in chunks of 30.
This should make the queue empty more quickly,
because we do bulk queries to prevent database
hops.
2021-01-17 11:28:30 -08:00
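A minimal sketch of that chunking pattern, assuming a bulk helper along the lines of the bulk method this series introduces; the names and dispatch details below are illustrative, not the worker's actual code:

    from typing import Iterator, List

    CHUNK_SIZE = 30  # the batch size mentioned in the commit above

    def chunks(user_ids: List[int], size: int = CHUNK_SIZE) -> Iterator[List[int]]:
        # Yield successive fixed-size slices of the user id list.
        for i in range(0, len(user_ids), size):
            yield user_ids[i : i + size]

    def drain_digest_queue(user_ids: List[int], cutoff: float) -> None:
        # Each slice is handled with bulk queries, so the queue empties in
        # roughly len(user_ids) / 30 passes rather than one pass per user.
        for chunk in chunks(user_ids):
            # Stand-in call for the bulk digest helper introduced in this series.
            bulk_handle_digest_emails(chunk, cutoff)
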
Steve Howell e0b451730a email digests: Extract get_new_streams.
This makes us more efficient when handling
multiple users.  We don't have to keep
sending the same two queries to the database.

Note that as part of this we eliminated
a failure mode for the obscure population
of users for whom both `user.is_guest` and
`user.can_access_public_streams()` return
False.  We know this would have only affected
Zephyr users (by looking at the code), and
we know we don't actually process Zephyr
users for email digests (or else we would
have raised exceptions in the old code).
2021-01-17 11:28:30 -08:00
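A loose sketch of the compute-once, reuse-per-user pattern this extraction enables; the helper signatures, filters, and context shape are simplified stand-ins rather than the real implementation:

    from datetime import datetime
    from typing import Dict, List

    from zerver.models import Realm, Stream, UserProfile

    def get_new_streams(realm: Realm, threshold: datetime) -> List[Stream]:
        # One realm-level query, shared by every user in the batch.
        return list(
            Stream.objects.filter(
                realm=realm,
                date_created__gt=threshold,
                invite_only=False,
            )
        )

    def build_digest_contexts(
        realm: Realm, users: List[UserProfile], threshold: datetime
    ) -> Dict[int, Dict[str, object]]:
        new_streams = get_new_streams(realm, threshold)
        # Reuse the realm-level result instead of re-querying for each user.
        return {user.id: {"new_streams": new_streams} for user in users}
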
Steve Howell 23de94504f email digests: Query streams for messages up front.
This should save us many hops to the database when
we process users in bulk.
2021-01-17 11:28:30 -08:00
Steve Howell 3662bf2dcb minor: Rename stream_map -> user_stream_map. 2021-01-17 11:28:30 -08:00
Steve Howell f8bbb7fea9 email digests: Use select_related("realm").
We mostly need realm_id, but when we go to build
message lists, we need realm.uri.

We could probably be more aggressive about using
`only` here, but for now I am just trying to
reduce hops to the database.
2021-01-17 11:28:29 -08:00
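Roughly, the queryset change amounts to something like the sketch below (an illustrative call site, not the commit's exact code): select_related joins the realm row into the same query, so later reads of `user.realm_id` and `user.realm.uri` cost no extra hops.

    from typing import List

    from zerver.models import UserProfile

    def users_for_digest(user_ids: List[int]) -> List[UserProfile]:
        # Fetch each user together with its realm in one joined query;
        # user.realm is then already populated on every returned object.
        return list(
            UserProfile.objects.select_related("realm").filter(id__in=user_ids)
        )
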
Steve Howell bb56f0ec0e minor: Move get_stream_map to module level.
This is a pure code move.
2021-01-17 11:28:29 -08:00
Steve Howell 52e2d5a733 email digests: Avoid long_term_idle check.
We want to exclude users with recent subscription
activity from emails, regardless of whether
the long_term_idle flag is set.
2021-01-17 11:28:29 -08:00
Steve Howell 162b372b93 email digests: Do one query for recent streams.
This is another way to limit hops to the database
when we process users in bulk.
2021-01-17 11:28:29 -08:00
Alex Vandiver 438d2aa632 digests: Ensure that the teaser_data can be JSON-serialized.
Leaving this as a set means that it fails in zerver.lib.send_email
when serializing into a ScheduledEmail object.
2020-12-15 11:44:50 -08:00
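A tiny sketch of the underlying issue (the "new_streams" key and its values are made up for illustration): Python sets are not JSON-serializable, so the fix is to store a list instead.

    import json

    teaser_data = {"new_streams": {"design", "engineering"}}

    # json.dumps(teaser_data) would raise:
    #   TypeError: Object of type set is not JSON serializable

    teaser_data["new_streams"] = sorted(teaser_data["new_streams"])
    serialized = json.dumps(teaser_data)  # now fine
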
Steve Howell e2e0f06b2a email digests: Call get_recent_topics once per batch.
Once we start processing digests in batch, this will
let us amortize the expense of the message query
over multiple users.
2020-11-16 08:59:29 -08:00
Steve Howell 428f0564a0 minor: Move context code down in the function.
This will make a subsequent diff a bit less noisy.
2020-11-16 08:59:29 -08:00
Steve Howell 1d1e45e9ec digests: Use UserActivityInterval for user activity.
Note that we are much more efficient about finding
active users here:

    - we do one query per realm (instead of per-user)
    - we pass the cutoff date to the database
    - we get back just a list of distinct ids
2020-11-16 08:59:29 -08:00
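A sketch of a query with those three properties in the usual Django ORM spelling; the helper name and exact filters are assumptions, not the commit's code:

    from datetime import datetime, timedelta, timezone
    from typing import Set

    from zerver.models import Realm, UserActivityInterval

    def recently_active_user_ids(realm: Realm, days: int = 1) -> Set[int]:
        cutoff = datetime.now(tz=timezone.utc) - timedelta(days=days)
        # One query for the whole realm; the cutoff comparison runs in the
        # database, and only distinct user ids come back over the wire.
        return set(
            UserActivityInterval.objects.filter(
                user_profile__realm=realm,
                end__gte=cutoff,
            )
            .values_list("user_profile_id", flat=True)
            .distinct()
        )
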
Steve Howell b52f56080e performance: Just get user_ids to queue digest emails. 2020-11-16 08:59:29 -08:00
Steve Howell e13e5d104d refactor: Only require user_id for inactive_since().
This function is going away completely soon.  It is
querying everybody's entire UserActivity history instead
of passing the cutoff date to the database!
2020-11-16 08:59:29 -08:00
Steve Howell d0260392f7 digests: Get user objects from the database.
The query counts increase here for somewhat
contrived reasons.  The tests before this
commit reflected a successful trip to the
UserProfile cache, but that's not actually
realistic in practice.
2020-11-16 08:59:29 -08:00
Steve Howell e49a482baf email digests: Make transactions atomic. 2020-11-16 08:59:28 -08:00
Steve Howell cf6bcfb84a digest emails: Exclude users who had recent digests.
This code protects us in case we ever need to re-run
email digests twice in the same day.
2020-11-16 08:59:28 -08:00
Steve Howell 4271442fba email digests: Write RealmAuditLog rows. 2020-11-16 08:59:28 -08:00
Steve Howell 5da4332620 minor: Add order-by-id to digest message query.
The order-by-id is now explicit, and I add
comments to explain the select_related tables.
2020-11-06 10:05:46 -08:00
Steve Howell 936171d258 refactor: Extract DigestTopic class.
This gets us away from a lot of dictionary soup.
2020-11-06 10:05:46 -08:00
Steve Howell e8b6c56322 refactor: Simplify get_hot_topics().
The code we deleted here was no longer
doing anything.

Maybe the code was always dead, or maybe it
was written during a time when topics_by_diversity
and topics_by_length actually had different keys.

But now it's clearly cruft.

If we have 4 or more topics, then the code above
it would already have populated the list with 4
elements, and the `if num_convos < 4` condition
would evaluate to False.

And if we had 3 or fewer topics, then we would
have already put all possible topics into our
result, and the `topics_by_diversity[num_convos:4]`
slice would be empty.

It's possible that we should just have a simple
heuristic for topic hotness like `10*num_senders
+ messages`, so we don't have to maintain this
fiddly function, and we can just do something like
`topics_by_score[:4]`.
2020-11-06 10:05:46 -08:00
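For what it's worth, the simpler heuristic floated at the end of that message would look roughly like this; the per-topic record is a stand-in for illustration, not the actual DigestTopic class:

    from dataclasses import dataclass, field
    from typing import List, Set

    @dataclass
    class TopicStats:
        # Illustrative stand-in for the per-topic counters DigestTopic tracks.
        num_messages: int
        senders: Set[str] = field(default_factory=set)

    def hottest_topics(topics: List[TopicStats], count: int = 4) -> List[TopicStats]:
        # One scoring formula instead of separate diversity/length rankings.
        topics_by_score = sorted(
            topics,
            key=lambda t: 10 * len(t.senders) + t.num_messages,
            reverse=True,
        )
        return topics_by_score[:count]
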
Steve Howell c5dc9d386f refactor: Use sets of stream_ids for email digests.
I now use sets for stream_ids in more of the digest
code.

As part of this I replaced exclude_subscription_modified_streams
with streams_recently_modified_for_user.

It's easier for the caller to just ask for ids
to delete from its callee than it is to pass
in a set/list to mutate.

The simpler boundary between the functions makes
the tests easier to write--you can see the
`filtered_streams` logic goes away in this diff.

I also make the tests a bit more thorough by using
combinations of Cordelia/Othello and Verona/Denmark
to try to find multiple possible flaws.

And I make the time intervals longer than 1s to
avoid false negatives from slow CI boxes.
2020-11-05 17:42:43 -08:00
Steve Howell 88a57ed4ac bulk digest: Get stream subscriptions in bulk.
If we have multiple users, this reduces the amount
of queries we need to do, because we get all
subscriptions for all users in a single query
to Subscription.

For the single-user case, we are introducing an
extra query hop, but the database is doing
roughly the same work, because we are just breaking
up this complex query into two hops:

    messages =
        select ...  from message
        where recipient__type_id in (
            select stream_id from subscription
            where ...
        )

Now it's more like:

    stream_ids =
        select stream_id from subscription
        where ...

    messages =
        select ... from message
        where recipient__type_id in stream_ids
2020-11-05 09:36:59 -08:00
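In Django ORM terms, the two-hop version sketched above might look roughly like the following; the field names and filters are illustrative, so treat it as a sketch rather than the commit's code:

    from datetime import datetime
    from typing import List

    from zerver.models import Message, Recipient, Subscription

    def recent_stream_messages(user_ids: List[int], cutoff: datetime):
        # Hop 1: one query to Subscription for every user's stream ids.
        stream_ids = set(
            Subscription.objects.filter(
                user_profile_id__in=user_ids,
                active=True,
                recipient__type=Recipient.STREAM,
            ).values_list("recipient__type_id", flat=True)
        )

        # Hop 2: recent stream traffic, restricted to those stream ids.
        return Message.objects.filter(
            recipient__type=Recipient.STREAM,
            recipient__type_id__in=stream_ids,
            date_sent__gt=cutoff,
        )
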
Steve Howell c83db37161 email digests: Introduce bulk methods for digest.
Note that we are not changing anything semantically
or algorithmically yet.  The only overhead here
for the single-user case is boxing and unboxing
data into single-item dicts and lists.

The interfaces for callers in the view and the
queue processor remain the same for now.
2020-11-05 09:36:59 -08:00
Steve Howell 7c89e46731 minor: Clean up some code formatting. 2020-11-05 09:36:59 -08:00
Steve Howell 4bd02eea19 minor: Use user, not user_profile, in some digest code. 2020-11-05 09:36:59 -08:00
Steve Howell e31326c823 refactor: Extract get_digest_context.
This eliminates the union type and boolean parameter,
and it makes it a bit easier to migrate to a
bulk-get approach.
2020-11-05 09:36:59 -08:00
Steve Howell 217967f743 refactor: Extract get_hot_topics.
This extraction will make a bit more sense when
we start doing bulk operations on a realm to
get digests, but even now, it encapsulates the
slightly complex way we cherry-pick the top 4
topics for a user.
2020-11-05 09:36:59 -08:00
Steve Howell 5a6d6f81ff refactor: Extract get_recent_topic_activity. 2020-11-05 09:36:59 -08:00
Steve Howell f987b014b3 refactor: Rename conversation to topic.
Not only is topic shorter, but the name makes
it clear that we're not dealing with abstract
conversations here--we are truly bucketing by
topic.
2020-11-05 09:36:59 -08:00
Steve Howell 6ac3cd3534 refactor: Use list of topics, not tuples. 2020-11-05 09:36:59 -08:00
Steve Howell 878e938a89 minor: Rename conversation_diversity to conversation_senders. 2020-11-05 09:36:59 -08:00
Steve Howell 6dc8250e9a mypy: Add TopicKey type for digests. 2020-11-05 09:36:59 -08:00
Steve Howell 96f6064b18 refactor: Move Messages query down the digest stack.
This prep step is mostly for diff hygiene; the next
commit will make the code a bit nicer.

The original code here had the nice property that
most (but not all) of the DB work happened up
front in `handle_digest_email`, and none of the
DB work was delegated to the callers.  But I
prefer the tradeoff of making the helpers a bit
more cohesive--let them get the data they need.
And we have query-count coverage in our tests,
so there's no real danger of having helpers
down in the stack insidiously doing a bunch of
extra DB hops.
2020-11-05 09:36:59 -08:00
Clara Dantas 8674287192 digest: Support digest of web public streams for guest users.
This change requires some basic plumbing for test code creating
web-public streams.
2020-09-25 16:11:04 -07:00
Anders Kaseorg bef46dab3c python: Prefer kwargs form of dict.update.
For less inflation by Black.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-09-03 17:51:09 -07:00
Anders Kaseorg ab120a03bc python: Replace unnecessary intermediate lists with generators.
Mostly suggested by the flake8-comprehension plugin.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-09-02 11:15:41 -07:00
Anders Kaseorg 022c4fbfc7 Revert "digest: Support digest of web public streams for guest users."
This reverts commit c3779338c6 (part
of #14638), which incorrectly depended on commits from the future,
with the effect of either halting the flow of entropic time in an
irresolvable temporal paradox, summoning extradimensional beings to
rain destruction on the galaxy, or failing CI.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-07-29 21:05:59 -07:00
Clara Dantas c3779338c6 digest: Support digest of web public streams for guest users. 2020-07-29 17:52:36 -07:00
Anders Kaseorg 365fe0b3d5 python: Sort imports with isort.
Fixes #2665.

Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.

Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start.  I expect this change will increase pressure for us to split
those files, which isn't a bad thing.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-11 16:45:32 -07:00
Anders Kaseorg 69730a78cc python: Use trailing commas consistently.
Automatically generated by the following script, based on the output
of lint with flake8-comma:

import re
import sys

last_filename = None
last_row = None
lines = []

for msg in sys.stdin:
    m = re.match(
        r"\x1b\[35mflake8    \|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg
    )
    if m:
        filename, row_str, col_str, err = m.groups()
        row, col = int(row_str), int(col_str)

        if filename == last_filename:
            assert last_row != row
        else:
            if last_filename is not None:
                with open(last_filename, "w") as f:
                    f.writelines(lines)

            with open(filename) as f:
                lines = f.readlines()
            last_filename = filename
        last_row = row

        line = lines[row - 1]
        if err in ["C812", "C815"]:
            lines[row - 1] = line[: col - 1] + "," + line[col - 1 :]
        elif err in ["C819"]:
            assert line[col - 2] == ","
            lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ")

if last_filename is not None:
    with open(last_filename, "w") as f:
        f.writelines(lines)

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-06-11 16:04:12 -07:00
Anders Kaseorg 67e7a3631d python: Convert percent formatting to Python 3.6 f-strings.
Generated by pyupgrade --py36-plus.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-10 15:02:09 -07:00
Anders Kaseorg 1f565a9f41 timezone: Use standard library datetime.timezone.utc consistently.
datetime.timezone is available in Python ≥ 3.2.  This also lets us
remove a pytz dependency from the PostgreSQL scripts.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-05 09:34:17 -07:00
Anders Kaseorg bdc365d0fe logging: Pass format arguments to logging.
https://docs.python.org/3/howto/logging.html#optimization

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-05-02 10:18:02 -07:00
Anders Kaseorg fead14951c python: Convert assignment type annotations to Python 3.6 style.
This commit was split by tabbott; this piece covers the vast majority
of files in Zulip, but excludes scripts/, tools/, and puppet/ to help
ensure we at least show the right error messages for Xenial systems.

We can likely further refine the remaining pieces with some testing.

Generated by com2ann, with whitespace fixes and various manual fixes
for runtime issues:

-    invoiced_through: Optional[LicenseLedger] = models.ForeignKey(
+    invoiced_through: Optional["LicenseLedger"] = models.ForeignKey(

-_apns_client: Optional[APNsClient] = None
+_apns_client: Optional["APNsClient"] = None

-    notifications_stream: Optional[Stream] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE)
-    signup_notifications_stream: Optional[Stream] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE)
+    notifications_stream: Optional["Stream"] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE)
+    signup_notifications_stream: Optional["Stream"] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE)

-    author: Optional[UserProfile] = models.ForeignKey('UserProfile', blank=True, null=True, on_delete=CASCADE)
+    author: Optional["UserProfile"] = models.ForeignKey('UserProfile', blank=True, null=True, on_delete=CASCADE)

-    bot_owner: Optional[UserProfile] = models.ForeignKey('self', null=True, on_delete=models.SET_NULL)
+    bot_owner: Optional["UserProfile"] = models.ForeignKey('self', null=True, on_delete=models.SET_NULL)

-    default_sending_stream: Optional[Stream] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE)
-    default_events_register_stream: Optional[Stream] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE)
+    default_sending_stream: Optional["Stream"] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE)
+    default_events_register_stream: Optional["Stream"] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE)

-descriptors_by_handler_id: Dict[int, ClientDescriptor] = {}
+descriptors_by_handler_id: Dict[int, "ClientDescriptor"] = {}

-worker_classes: Dict[str, Type[QueueProcessingWorker]] = {}
-queues: Dict[str, Dict[str, Type[QueueProcessingWorker]]] = {}
+worker_classes: Dict[str, Type["QueueProcessingWorker"]] = {}
+queues: Dict[str, Dict[str, Type["QueueProcessingWorker"]]] = {}

-AUTH_LDAP_REVERSE_EMAIL_SEARCH: Optional[LDAPSearch] = None
+AUTH_LDAP_REVERSE_EMAIL_SEARCH: Optional["LDAPSearch"] = None

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-04-22 11:02:32 -07:00
arpit551 8f7733cb20 emails: Added placeholder strings in FromAddress.
We've had a bug for a while that if any ScheduledEmail objects get
created with the wrong email sender address, even after the sysadmin
corrects the problem, they'll still get errors because of the objects
stored with the wrong format.

We solve this by using FromAddress placeholder strings in the
send_future_email function, so that ScheduledEmail objects end up
setting the final `from_address` value when mail is actually sent
using the setting in effect at that time.

Fixes #11008.
2020-03-27 16:41:02 -07:00
Tim Abbott 3bc7ba1767 digest: Switch from emails to user IDs for logging.
This is better practice.
2019-11-15 17:07:52 -08:00
Mateusz Mandera dbe508bb91 models: Migration of Message.pub_date to date_sent, part 2.
Fixes #1727.

With the server down, apply migrations 0245 and 0246. 0246 will remove
the pub_date column, so it's essential that the previous migrations
ran correctly to copy data before running this.
2019-10-05 19:01:34 -07:00