mirror of https://github.com/zulip/zulip.git
docs: Add detailed documentation for soft deactivation.
This commit is contained in:
parent
8ca4ca1400
commit
010c02af09
|
@ -83,7 +83,10 @@ number of purposes:
|
||||||
* Store one `UserMessage` row in the database for each user who is
|
* Store one `UserMessage` row in the database for each user who is
|
||||||
a recipient of the message (including the sender), with
|
a recipient of the message (including the sender), with
|
||||||
appropriate `flags` for whether the user was mentioned, an alert
|
appropriate `flags` for whether the user was mentioned, an alert
|
||||||
word appears, etc.
|
word appears, etc. See
|
||||||
|
[the section on soft deactivation](#soft-deactivation) for
|
||||||
|
a clever optimization we use here that is important for large
|
||||||
|
open organizations.
|
||||||
* Do all the database queries to fetch relevant data for and then
|
* Do all the database queries to fetch relevant data for and then
|
||||||
send a `message` event to the
|
send a `message` event to the
|
||||||
[events system](../subsystems/events-system.html) containing the
|
[events system](../subsystems/events-system.html) containing the
|
||||||
|
@ -278,3 +281,142 @@ updated message `rendered_content`.
|
||||||
* We reuse the `update_message` framework (used for
|
* We reuse the `update_message` framework (used for
|
||||||
Zulip's message editing feature) in order to avoid needing custom code
|
Zulip's message editing feature) in order to avoid needing custom code
|
||||||
to implement the notification-and-rerender part of this implementation.
|
to implement the notification-and-rerender part of this implementation.
|
||||||
|
|
||||||
|
## Soft deactivation
|
||||||
|
|
||||||
|
This section details a somewhat subtle issue: How Zulip uses a
|
||||||
|
user-invisible technique called "soft deactivation" to handle
|
||||||
|
scalability to communities with many thousands of inactive users.
|
||||||
|
|
||||||
|
For background, Zulip’s threading model requires tracking which
|
||||||
|
individual messages each user has received and read (in other chat
|
||||||
|
products, the system either doesn’t track what the user has read at
|
||||||
|
all, or just needs to store a pointer for “how far the user has read”
|
||||||
|
in each room, channel, or stream).
|
||||||
|
|
||||||
|
We track these data in the backend in the `UserMessage` table, storing
|
||||||
|
rows `(message_id, user_id, flags)`, where `flags` is 32 bits of space
|
||||||
|
for boolean data like whether the user has read or starred the
|
||||||
|
message. All the key queries needed for accessing message history,
|
||||||
|
full-text search, and other key features can be done efficiently with
|
||||||
|
the database indexes on this table (with joins to the `Message` table
|
||||||
|
containing the actual message content where required).
|
||||||
|
|
||||||
|
The downside of this design is that when a new message is sent to a
|
||||||
|
stream with `N` recipients, we need to write `N` rows to the
|
||||||
|
`UserMessage` table to record those users receiving those messages.
|
||||||
|
Each row is just 3 integers in size, but even with modern databases
|
||||||
|
and SSDs, writing thousands of rows to a database starts to take a few
|
||||||
|
seconds.
|
||||||
|
|
||||||
|
This isn’t a problem for most Zulip servers, but is a major problem
|
||||||
|
for communities like chat.zulip.org, where might be 10,000s of
|
||||||
|
inactive users who only stopped by briefly to check out the product or
|
||||||
|
ask a single question, but are subscribed to whatever the default
|
||||||
|
streams in the organization are.
|
||||||
|
|
||||||
|
The total amount of work being done here was acceptable (a few seconds
|
||||||
|
of total CPU work per message to large public streams), but the
|
||||||
|
latency was unacceptable: The server backend was introducing a latency
|
||||||
|
of about 1 second per 2000 users subscribed to receive the message.
|
||||||
|
While these delays may not be immediately obvious to users (Zulip,
|
||||||
|
like many other chat applications,
|
||||||
|
[local echoes](../subsystems/markdown.html) messages that a user sends
|
||||||
|
as soon as the user hits “send”), latency beyond a second or two
|
||||||
|
significantly impacts the feeling of interactivity in a chat
|
||||||
|
experience (i.e. it feels like everyone takes a long time to reply to
|
||||||
|
even simple questions).
|
||||||
|
|
||||||
|
A key insight for addressing this problem is that there isn’t much of
|
||||||
|
a use case for long chat discussions among 1000s of users who are all
|
||||||
|
continuously online and actively participating. Streams with a very
|
||||||
|
large number of active users are likely to only be used for occasional
|
||||||
|
announcements, where some latency before everyone sees the message is
|
||||||
|
fine. Even in giant organizations, almost all messages are sent to
|
||||||
|
smaller streams with dozens or hundreds of active users, representing
|
||||||
|
some organizational unit within the community or company.
|
||||||
|
|
||||||
|
However, large, active streams are common in open source projects,
|
||||||
|
standards bodies, professional development groups, and other large
|
||||||
|
communities with the rough structure of the Zulip development
|
||||||
|
community. These communities usually have thousands of user accounts
|
||||||
|
subscribed to all the default streams, even if they only have dozens
|
||||||
|
or hundreds of those users active in any given month. Many of the
|
||||||
|
other accounts may be from people who signed up just to check the
|
||||||
|
community out, or who signed up to ask a few questions and may never
|
||||||
|
be seen again.
|
||||||
|
|
||||||
|
The key technical insight is that if we can make the latency scale
|
||||||
|
with the number of users who actually participate in the community,
|
||||||
|
not the total size of the community, then our database write limited
|
||||||
|
send latency of 1 second per 2000 users is totally fine. But we need
|
||||||
|
to do this in a way that doesn’t create problems if any of the
|
||||||
|
thousands of “inactive” users come back (or one of the active users
|
||||||
|
sends a private message to one of the inactive users), since it’s
|
||||||
|
impossible for the software to know which users are eventually coming
|
||||||
|
back or will eventually be interacted with by an existing user.
|
||||||
|
|
||||||
|
We solved this problem with a solution we call “soft deactivation”;
|
||||||
|
users that are soft-deactivated consume less resources from Zulip in a
|
||||||
|
way that is designed to be invisible both to other users and to the
|
||||||
|
user themself. If a user hasn’t logged into a given Zulip
|
||||||
|
organization for a few weeks, they are tagged as soft-deactivated.
|
||||||
|
|
||||||
|
The way this works internally is:
|
||||||
|
|
||||||
|
* We (usually) skip creating UserMessage rows for soft-deactivated
|
||||||
|
users when a message is sent to a stream where they are subscribed.
|
||||||
|
|
||||||
|
* If/when the user ever returns to Zulip, we can at that time
|
||||||
|
reconstruct the UserMessage rows that they missed, and create the rows
|
||||||
|
at that time (or, to avoid a latency spike if/when the user returns to
|
||||||
|
Zulip, this work can be done in a nightly cron job). We can construct
|
||||||
|
those rows later because we already have the data for when the user
|
||||||
|
might have been subscribed or unsubscribed from streams by other
|
||||||
|
users, and, importantly, we also know that the user didn’t interact
|
||||||
|
with the UI since the message was sent (and thus we can safely assume
|
||||||
|
that the messages has not been marked a read by the user). This is
|
||||||
|
done in the `add_missing_messages` function, which is the core of the
|
||||||
|
soft-deactivation implementation.
|
||||||
|
|
||||||
|
* The “usually” above is because there are a few flags that result
|
||||||
|
from content in the message (e.g., a message that mentions a user
|
||||||
|
results in a “mentioned” flag in the UserMessage row), that we need to
|
||||||
|
keep track of. Since parsing a message can be expensive (>10ms of
|
||||||
|
work, depending on message content), it would be too inefficient to
|
||||||
|
need to re-parse every message when a soft-deactivated user comes back
|
||||||
|
to Zulip. Conveniently, those messages are rare, and so we can just
|
||||||
|
create UserMessage rows which would have “interesting” flags at the
|
||||||
|
time they were sent without any material performance impact. And then
|
||||||
|
`add_missing_messages` skips any messages that already have a
|
||||||
|
`UserMessage` row for that user when doing its backfill.
|
||||||
|
|
||||||
|
The end result is the best of both worlds:
|
||||||
|
|
||||||
|
* Nobody's view of the world is different because the user was
|
||||||
|
soft-deactivated (resulting in no visible user-experience impact), at
|
||||||
|
least if one is running the cron job. If one does not run the cron
|
||||||
|
job, then users returning after being away for a very long time will
|
||||||
|
potentially have a (very) slow loading experience as potentially
|
||||||
|
100,000s of UserMessage rows might need to be reconstructed at once.
|
||||||
|
* On the latency-sensitive message sending and fanout code path, the
|
||||||
|
server only needs to do work for users who are currently interacting
|
||||||
|
with Zulip.
|
||||||
|
|
||||||
|
Empirically, we've found this technique completely resolved the "send
|
||||||
|
latency" scaling problem. The latency of sending a message to a stream
|
||||||
|
now scales only with the number of active subscribers, so one can send
|
||||||
|
a message to a stream with 5K subscribers of which 500 are active, and
|
||||||
|
it’ll arrive in the couple hundred milliseconds one would expect if
|
||||||
|
the extra 4500 inactive subscribers didn’t exist.
|
||||||
|
|
||||||
|
There are a few details that require special care with this system:
|
||||||
|
* Email and mobile push notifications. We need to make sure these are
|
||||||
|
still correctly delivered to soft-deactivated users; making this
|
||||||
|
work required careful work for those code paths that assumed a
|
||||||
|
`UserMessage` row would always exist for a message that triggers a
|
||||||
|
notification to a given user.
|
||||||
|
* Digest emails, which use the `UserMessage` table extensively to
|
||||||
|
determine what has happened in streams the user can see. We can use
|
||||||
|
the user's subscriptions to construct what messages they should have
|
||||||
|
access to for this feature.
|
||||||
|
|
Loading…
Reference in New Issue