Part of our codepath for subscribing users involves
fetching the users' existing subscriptions to make sure
we can do things like properly report to the clients
that the users were already subscribed. This codepath
used to be coupled to code that helped users maintain
unique stream colors.
Suppose you are creating a new stream, and you are
importing users from an older stream with 15k
subscribers, and each of your users is subscribed to
about 20 streams.
The prior code, instead of filtering on recipient_id,
would literally look at every subscription for every
user, which was kind of crazy if you didn't understand
the pick-stream-color complications.
Before this commit, we would fetch 300k rows with 15
columns each (granted, all but one of the columns are
bool/int). That's a total of 4.5 million tiny objects
that we had to glom into Django ORM objects and slice
and dice.
After this commit, we would fetch exactly zero rows
for the are-they-already-subscribed logic.
Yes, ZERO.
If we were to mistakenly try to re-add the same 15k
subscribers to the new stream (under the new code), we
will now fetch 15k Sub rows instead of 300k.
It is worth looking at the prior commit. We go through
great pains to ensure that users get new stream colors
when we invite them to a stream, and we still fetch a
bunch of data for that. Instead of 4.5 million cells,
it's more like 600k cells (2 columns per row), and it's
less than that insofar as some users may only
have 24 distinct colors among their many streams.
It's a lot of work.
This commit sets us up for the next commit, which will
save us a very expensive query.
If you are adding 15k users to a stream, and each user
has about 20 existing streams, then we need to retrieve
300k rows from the database to figure out which stream
colors they already have. We don't need all the extra
fields from Subscription, so now we get just the two
values we need for making a color map.
In the next commit we'll eliminate the other use case
for the big query, and I will explain in greater
depth how splitting out the color-picking code can
be a huge win. It is possible that some product decisions
could make this codepath easier. We could also do some
engineering specific to stream colors, such as caching
which colors users have already used.
This does cost us an extra round trip to the database.
This completes the effort to make it possible to use
bulk_access_message in contexts where there are more than a handful of
messages without creating performance issues.
This new function optimizes how we fetch subscriptions
for streams. Basically, it excludes most long-term-idle
users from the query.
With 8k users, of which all but 400 are long term idle,
this speeds up get_recipient_info from about 150ms
to 50ms.
Overall this change appears to save a factor of 2-3 in the backend
processing time for sending or editing a message in large, public
streams in chat.zulip.org (at 18K users today).
model__id syntax implies needing a JOIN on the model table to fetch the
id. That's usually redundant, because the first table in the query
simply has a 'model_id' column, so the id can be fetched directly.
Django is actually smart enough to not do those redundant joins, but we
should still avoid this misguided syntax.
The exceptions are ManytoMany fields and queries doing a backward
relationship lookup. If "streams" is a many-to-many relationship, then
streams_id is invalid - streams__id syntax is needed. If "y" is a
foreign fields from X to Y:
class X:
y = models.ForeignKey(Y)
then object x of class X has the field x.y_id, but y of class Y doesn't
have y.x_id. Thus Y queries need to be done like
Y.objects.filter(x__id__in=some_list)
get_active_subscriptions_for_stream_id should allow specifying whether
subscriptions of deactivated users should be included in the result.
Active subs of deactivated users are a subtlety that's easy to miss
when writing relevant code, so we make include_deactivated_users a
mandatory kwarg - this will force callers to definitely give thought to
whether such subs should be included or not.
This commit is just a refactoring, we keep original behavior everywhere
- there are places where subs of deactivates users should probably be
excluded but aren't - we don't fix that here, it'll be addressed in
follow-up commits.
This saves us a query for edge cases like when
you try to unsubscribe from a public stream
that you have already unsubscribed from.
But this is mostly to prep for upcoming
optimizations.
We get two speedups:
* The query to get existing subscribers only
gets the two fields we need. We no longer
need all the overhead of user_profile
and recipient data being returned in the
query.
* We avoid Django making extra hops to the
database to get user info.
We replace get_peer_user_ids_for_stream_change
with two bulk functions to get peers and/or
subscribers.
Note that we have three codepaths that care about
peers:
subscribing existing users:
we need to tell peers about new subscribers
we need to tell subscribed user about old subscribers
unsubscribing existing users:
we only need to tell peers who unsubscribed
subscribing new user:
we only need to tell peers about the new user
(right now we generate send_event
calls to tell the new user about existing
subscribers, but this is a waste
of effort that we will fix soon)
The two bulk functions are this:
bulk_get_subscriber_peer_info
bulk_get_peers
They have some overlap in the implementation,
but there are some nuanced differences that are
described in the comments.
Looking up peers/subscribers in bulk leads to some
nice optimizations.
We will save some memchached traffic if you are
subscribing to multiple public streams.
We will save a query in the remove-subscriber
case if you are only dealing with private streams.
Fixes#2665.
Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.
Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start. I expect this change will increase pressure for us to split
those files, which isn't a bad thing.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Automatically generated by the following script, based on the output
of lint with flake8-comma:
import re
import sys
last_filename = None
last_row = None
lines = []
for msg in sys.stdin:
m = re.match(
r"\x1b\[35mflake8 \|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg
)
if m:
filename, row_str, col_str, err = m.groups()
row, col = int(row_str), int(col_str)
if filename == last_filename:
assert last_row != row
else:
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
with open(filename) as f:
lines = f.readlines()
last_filename = filename
last_row = row
line = lines[row - 1]
if err in ["C812", "C815"]:
lines[row - 1] = line[: col - 1] + "," + line[col - 1 :]
elif err in ["C819"]:
assert line[col - 2] == ","
lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ")
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This new method prevents us from getting fat
objects from the database.
Instead, now we just get ids from the database
to build our subqueries.
Note that we could also technically eliminate
the `set(...)` wrappers in this code to have
Django make a subquery and save a round trip.
I am postponing that for another commit (since
it's still somewhat coupled to some other
complexity in `do_get_streams` that I am trying
to cut through, plus it's not the main point
of this commit.)
BEFORE:
# old, still in use for other codepaths
def get_stream_subscriptions_for_user(user_profile: UserProfile) -> QuerySet:
# TODO: Change return type to QuerySet[Subscription]
return Subscription.objects.filter(
user_profile=user_profile,
recipient__type=Recipient.STREAM,
)
user_subs = get_stream_subscriptions_for_user(user_profile).filter(
active=True,
).select_related('recipient')
recipient_check = Q(id__in=[sub.recipient.type_id for sub in user_subs])
AFTER:
# newly added
def get_subscribed_stream_ids_for_user(user_profile: UserProfile) -> QuerySet:
return Subscription.objects.filter(
user_profile_id=user_profile,
recipient__type=Recipient.STREAM,
active=True,
).values_list('recipient__type_id', flat=True)
subscribed_stream_ids = get_subscribed_stream_ids_for_user(user_profile)
recipient_check = Q(id__in=set(subscribed_stream_ids))
This gives us access to typing_extensions.Deque, which was not added
to typing until 3.5.4.
(PROVISION_VERSION is not bumped because the transitive dependency set
in dev.txt hasn’t changed.)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Rename notification property `enable_stream_sounds` to
`enable_stream_audible_notifications` to match with other
notification property patterns.
Fixes part of #12304
This commit migrates the Subscription's notification fields from a
BooleanField to a NullBooleanField where a value of None means to
inherit the value from user's profile.
Also includes a migrations to set the corresponding settings to None
if they match the user profile's values. This migration helps us in
getting rid of the weird "Apply to all" widget that we offered on
subscription settings page.
The mobile apps can't handle None appearing as the stream-level
notification settings, so for backwards-compatibility we arrange to
only send True/False to the mobile apps by applying those defaults
server-side. We introduce a notification_settings_null value within a
client_capabilities structure that newer versions of the mobile apps
can use to request the new model.
This mobile compatibility code is pretty effectively tested by the
existing test_events tests for the subscriptions subsystem.
Namely, annotate as best as possible, and add notes to indicate preference,
if QuerySet develops generic typing.
Note that the return values of functions with annotations changed in this
commit are used elsewhere as QuerySets, so the Sequence[T] approach used
for some functions in models.py is not applicable.
We extract get_bulk_stream_subscriber_info() from this
function to remove some of the complexity. Also, in that
new function we avoid a hop to the database by querying
on stream ids instead of recipient ids. The query that
gets changed here does require a join to the recipient
table (to get the stream id), so it's a little bit of a
tradeoff.
The first method we extract to this library is
get_active_subscriptions_for_stream_id().
We also move num_subscribers_for_stream_id() to here, which
is slightly annoying (having the method on Stream was nice)
but avoids some circular dependency issues.