docs: Add detailed documentation for soft deactivation.

Tim Abbott 2019-03-07 17:46:42 -08:00
parent 8ca4ca1400
commit 010c02af09
1 changed file with 143 additions and 1 deletion

@@ -83,7 +83,10 @@ number of purposes:
* Store one `UserMessage` row in the database for each user who is
a recipient of the message (including the sender), with
appropriate `flags` for whether the user was mentioned, an alert
word appears, etc. See
[the section on soft deactivation](#soft-deactivation) for
a clever optimization we use here that is important for large
open organizations.
* Do all the database queries to fetch relevant data for and then
send a `message` event to the
[events system](../subsystems/events-system.html) containing the
@@ -278,3 +281,142 @@ updated message `rendered_content`.
* We reuse the `update_message` framework (used for
Zulip's message editing feature) in order to avoid needing custom code
to implement the notification-and-rerender part of this implementation.

## Soft deactivation

This section details a somewhat subtle issue: how Zulip uses a
user-invisible technique called "soft deactivation" to scale to
communities with many thousands of inactive users.
For background, Zulip's threading model requires tracking which
individual messages each user has received and read (in other chat
products, the system either doesn't track what the user has read at
all, or just needs to store a pointer for “how far the user has read”
in each room, channel, or stream).
We track these data in the backend in the `UserMessage` table, storing
rows `(message_id, user_id, flags)`, where `flags` is 32 bits of space
for boolean data like whether the user has read or starred the
message. All the key queries needed for accessing message history,
full-text search, and other key features can be done efficiently with
the database indexes on this table (with joins to the `Message` table
containing the actual message content where required).
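
To make the `flags` layout concrete, here is a minimal sketch of
reading and setting bits in such a field; the flag names and bit
positions below are illustrative assumptions, not Zulip's exact layout:

```python
# Illustrative flag bits; the real bit positions are defined in
# Zulip's UserMessage model and may differ.
READ = 1 << 0
STARRED = 1 << 1
MENTIONED = 1 << 2

def has_flag(flags: int, flag: int) -> bool:
    """Return True if the given flag bit is set on a UserMessage row."""
    return bool(flags & flag)

def set_flag(flags: int, flag: int) -> int:
    """Return a new flags value with the given bit set."""
    return flags | flag

# Example: a row for a message the user has read and starred.
flags = set_flag(set_flag(0, READ), STARRED)
assert has_flag(flags, STARRED)
assert not has_flag(flags, MENTIONED)
```
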
The downside of this design is that when a new message is sent to a
stream with `N` recipients, we need to write `N` rows to the
`UserMessage` table to record those users receiving those messages.
Each row is just 3 integers in size, but even with modern databases
and SSDs, writing thousands of rows to a database starts to take a few
seconds.
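
In Django ORM terms, the naive fanout looks roughly like the sketch
below; `UserMessage` and `get_subscriber_ids` are stand-ins for
Zulip's actual models and helpers, not its exact code:

```python
# Sketch of the naive fanout: one UserMessage row per recipient.
# Model and helper names are stand-ins; imports are omitted.
def fanout_user_messages(message, stream):
    subscriber_ids = get_subscriber_ids(stream)  # can be tens of thousands
    rows = [
        UserMessage(user_profile_id=user_id, message_id=message.id, flags=0)
        for user_id in subscriber_ids
    ]
    # A single bulk INSERT is much cheaper than N individual INSERTs,
    # but the work still scales linearly with the number of subscribers.
    UserMessage.objects.bulk_create(rows)
```
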
This isn't a problem for most Zulip servers, but it is a major problem
for communities like chat.zulip.org, where there might be 10,000s of
inactive users who only stopped by briefly to check out the product or
ask a single question, but are subscribed to whatever the default
streams in the organization are.
The total amount of work being done here was acceptable (a few seconds
of total CPU work per message to large public streams), but the
latency was unacceptable: The server backend was introducing a latency
of about 1 second per 2000 users subscribed to receive the message.
While these delays may not be immediately obvious to users (Zulip,
like many other chat applications,
[locally echoes](../subsystems/markdown.html) messages that a user sends
as soon as the user hits “send”), latency beyond a second or two
significantly impacts the feeling of interactivity in a chat
experience (i.e. it feels like everyone takes a long time to reply to
even simple questions).
A key insight for addressing this problem is that there isn't much of
a use case for long chat discussions among 1000s of users who are all
continuously online and actively participating. Streams with a very
large number of active users are likely to only be used for occasional
announcements, where some latency before everyone sees the message is
fine. Even in giant organizations, almost all messages are sent to
smaller streams with dozens or hundreds of active users, representing
some organizational unit within the community or company.
However, large, active streams are common in open source projects,
standards bodies, professional development groups, and other large
communities with the rough structure of the Zulip development
community. These communities usually have thousands of user accounts
subscribed to all the default streams, even if they only have dozens
or hundreds of those users active in any given month. Many of the
other accounts may be from people who signed up just to check the
community out, or who signed up to ask a few questions and may never
be seen again.
The key technical insight is that if we can make the latency scale
with the number of users who actually participate in the community,
not the total size of the community, then our database-write-limited
send latency of 1 second per 2000 users is totally fine. But we need
to do this in a way that doesn't create problems if any of the
thousands of “inactive” users come back (or one of the active users
sends a private message to one of the inactive users), since it's
impossible for the software to know which users are eventually coming
back or will eventually be interacted with by an existing user.
We solved this problem with a technique we call “soft deactivation”;
users that are soft-deactivated consume fewer resources from Zulip in a
way that is designed to be invisible both to other users and to the
user themself. If a user hasn't logged into a given Zulip
organization for a few weeks, they are tagged as soft-deactivated.
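
The tagging step can run in a periodic job; a minimal sketch, assuming
a `long_term_idle` boolean and a last-activity timestamp on the user
model (both field names are assumptions here, not Zulip's exact schema):

```python
from datetime import timedelta

from django.utils import timezone

# Sketch only: the UserProfile model and its field names are assumed.
def find_users_to_soft_deactivate(realm, inactive_for=timedelta(weeks=3)):
    cutoff = timezone.now() - inactive_for
    return UserProfile.objects.filter(
        realm=realm,
        long_term_idle=False,         # not already soft-deactivated
        last_active_time__lt=cutoff,  # no activity for a few weeks
    )
```
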
The way this works internally is:
* We (usually) skip creating UserMessage rows for soft-deactivated
users when a message is sent to a stream where they are subscribed.
* If/when the user ever returns to Zulip, we can at that time
reconstruct the UserMessage rows that they missed, and create the rows
at that time (or, to avoid a latency spike if/when the user returns to
Zulip, this work can be done in a nightly cron job). We can construct
those rows later because we already have the data for when the user
might have been subscribed or unsubscribed from streams by other
users, and, importantly, we also know that the user didn't interact
with the UI since the message was sent (and thus we can safely assume
that the messages have not been marked as read by the user). This is
done in the `add_missing_messages` function, which is the core of the
soft-deactivation implementation (a simplified sketch of this backfill
appears just after this list).
* The “usually” above is because there are a few flags that result
from content in the message (e.g., a message that mentions a user
results in a “mentioned” flag in the UserMessage row) that we need to
keep track of. Since parsing a message can be expensive (>10ms of
work, depending on message content), it would be too inefficient to
re-parse every message when a soft-deactivated user comes back to
Zulip. Conveniently, such messages are rare, so we can just create
the UserMessage rows that would have “interesting” flags at the time
the message is sent, without any material performance impact. And then
`add_missing_messages` skips any messages that already have a
`UserMessage` row for that user when doing its backfill.
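
Here is a simplified sketch of that backfill; the real
`add_missing_messages` also replays the user's subscription history,
and the `get_message_ids_user_should_have` helper below is a
hypothetical stand-in for that logic:

```python
# Sketch of the backfill run for a soft-deactivated user, e.g. when
# they return or from a nightly job.  Names are illustrative.
def backfill_user_messages(user_profile):
    missed_message_ids = get_message_ids_user_should_have(user_profile)

    # Skip messages that already have a UserMessage row, e.g. rows
    # created eagerly at send time because they mentioned this user.
    existing_ids = set(
        UserMessage.objects.filter(
            user_profile=user_profile,
            message_id__in=missed_message_ids,
        ).values_list("message_id", flat=True)
    )

    # flags=0 means unread, which is safe: the user has not touched
    # the UI since these messages were sent.
    rows = [
        UserMessage(user_profile=user_profile, message_id=message_id, flags=0)
        for message_id in missed_message_ids
        if message_id not in existing_ids
    ]
    UserMessage.objects.bulk_create(rows)
```
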
The end result is the best of both worlds:
* Nobody's view of the world is different because the user was
soft-deactivated (resulting in no visible user-experience impact), at
least if one is running the cron job. If one does not run the cron
job, then users returning after being away for a very long time may
have a (very) slow loading experience, since potentially 100,000s of
UserMessage rows might need to be reconstructed at once.
* On the latency-sensitive message sending and fanout code path, the
server only needs to do work for users who are currently interacting
with Zulip (see the sketch below).
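
Concretely, the filtering on the send path looks roughly like this
sketch, which creates rows right away only for users who are active or
need an “interesting” flag recorded (`long_term_idle` and the overall
shape are assumptions, not Zulip's exact code):

```python
# Sketch: limit the send-time fanout to users who need a UserMessage
# row immediately.  `long_term_idle` is an assumed schema field.
def recipients_needing_rows_now(subscribers, mentioned_user_ids):
    return [
        user
        for user in subscribers
        if not user.long_term_idle or user.id in mentioned_user_ids
    ]
```
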
Empirically, we've found this technique completely resolved the "send
latency" scaling problem. The latency of sending a message to a stream
now scales only with the number of active subscribers, so one can send
a message to a stream with 5K subscribers of which 500 are active, and
it'll arrive in the couple hundred milliseconds one would expect if
the extra 4500 inactive subscribers didn't exist.
There are a few details that require special care with this system:
* Email and mobile push notifications. We need to make sure these are
still correctly delivered to soft-deactivated users; making this
work required carefully auditing the code paths that had assumed a
`UserMessage` row would always exist for a message that triggers a
notification to a given user.
* Digest emails, which use the `UserMessage` table extensively to
determine what has happened in streams the user can see. We can use
the user's subscriptions to determine which messages they should have
access to for this feature (see the sketch below).
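
A rough sketch of such a subscription-based query for digests, with
model and field names as assumptions rather than Zulip's exact schema:

```python
# Sketch: find recent stream messages a soft-deactivated user could
# see, using their subscriptions instead of UserMessage rows.  This
# ignores subscription changes during the period, which the real
# implementation has to handle.
def recent_stream_messages_for_digest(user_profile, since):
    subscribed_recipient_ids = Subscription.objects.filter(
        user_profile=user_profile, active=True
    ).values_list("recipient_id", flat=True)

    return Message.objects.filter(
        recipient_id__in=subscribed_recipient_ids,
        date_sent__gte=since,
    )
```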