docs: Add detailed documentation for soft deactivation.

2019-03-07 17:46:42 -08:00 · 2019-03-07 17:46:42 -08:00 · 010c02af09
parent 8ca4ca1400
commit 010c02af09
1 changed files with 143 additions and 1 deletions
--- a/docs/subsystems/sending-messages.md
+++ b/docs/subsystems/sending-messages.md
@@ -83,7 +83,10 @@ number of purposes:
   * Store one `UserMessage` row in the database for each user who is
     a recipient of the message (including the sender), with
     appropriate `flags` for whether the user was mentioned, an alert
-     word appears, etc.
+     word appears, etc.  See
+     [the section on soft deactivation](#soft-deactivation) for
+     a clever optimization we use here that is important for large
+     open organizations.
   * Do all the database queries to fetch relevant data for and then
     send a `message` event to the
     [events system](../subsystems/events-system.html) containing the
@@ -278,3 +281,142 @@ updated message `rendered_content`.
 * We reuse the `update_message` framework (used for
 Zulip's message editing feature) in order to avoid needing custom code
 to implement the notification-and-rerender part of this implementation.
+
+## Soft deactivation
+
+This section details a somewhat subtle issue: how Zulip uses a
+user-invisible technique called "soft deactivation" to scale to
+communities with many thousands of inactive users.
+
+For background, Zulip’s threading model requires tracking which
+individual messages each user has received and read (in other chat
+products, the system either doesn’t track what the user has read at
+all, or just needs to store a pointer for “how far the user has read”
+in each room, channel, or stream).
+
+We track these data in the backend in the `UserMessage` table, storing
+rows `(message_id, user_id, flags)`, where `flags` is 32 bits of space
+for boolean data like whether the user has read or starred the
+message.  All the key queries needed for accessing message history,
+full-text search, and other key features can be done efficiently with
+the database indexes on this table (with joins to the `Message` table
+containing the actual message content where required).
+
+The downside of this design is that when a new message is sent to a
+stream with `N` recipients, we need to write `N` rows to the
+`UserMessage` table to record that those users received the message.
+Each row is just 3 integers in size, but even with modern databases
+and SSDs, writing thousands of rows to a database starts to take a few
+seconds.
+
+This isn’t a problem for most Zulip servers, but is a major problem
+for communities like chat.zulip.org, where there might be tens of
+thousands of inactive users who only stopped by briefly to check out
+the product or ask a single question, but are subscribed to whatever
+the default streams in the organization are.
+
+The total amount of work being done here was acceptable (a few seconds
+of total CPU work per message to large public streams), but the
+latency was unacceptable: The server backend was introducing a latency
+of about 1 second per 2000 users subscribed to receive the message.
+While these delays may not be immediately obvious to users (Zulip,
+like many other chat applications,
+[locally echoes](../subsystems/markdown.html) messages that a user sends
+as soon as the user hits “send”), latency beyond a second or two
+significantly impacts the feeling of interactivity in a chat
+experience (i.e. it feels like everyone takes a long time to reply to
+even simple questions).
+
+A key insight for addressing this problem is that there isn’t much of
+a use case for long chat discussions among 1000s of users who are all
+continuously online and actively participating.  Streams with a very
+large number of active users are likely to only be used for occasional
+announcements, where some latency before everyone sees the message is
+fine.  Even in giant organizations, almost all messages are sent to
+smaller streams with dozens or hundreds of active users, representing
+some organizational unit within the community or company.
+
+However, active streams with many subscribers are common in open
+source projects, standards bodies, professional development groups,
+and other large communities structured like the Zulip development
+community.  These communities usually have thousands of user accounts
+subscribed to all the default streams, even if they only have dozens
+or hundreds of those users active in any given month. Many of the
+other accounts may be from people who signed up just to check the
+community out, or who signed up to ask a few questions and may never
+be seen again.
+
+The key technical insight is that if we can make the latency scale
+with the number of users who actually participate in the community,
+not the total size of the community, then our database-write-limited
+send latency of 1 second per 2000 users is totally fine.  But we need
+to do this in a way that doesn’t create problems if any of the
+thousands of “inactive” users come back (or one of the active users
+sends a private message to one of the inactive users), since it’s
+impossible for the software to know which users are eventually coming
+back or will eventually be interacted with by an existing user.
+
+We solved this problem with a solution we call “soft deactivation”;
+users that are soft-deactivated consume fewer resources from Zulip in
+a way that is designed to be invisible both to other users and to the
+users themselves.  If a user hasn’t logged into a given Zulip
+organization for a few weeks, they are tagged as soft-deactivated.
+
+The way this works internally is:
+
+* We (usually) skip creating `UserMessage` rows for soft-deactivated
+users when a message is sent to a stream where they are subscribed.
+
+* If/when the user ever returns to Zulip, we can at that time
+reconstruct and create the `UserMessage` rows that they missed (or,
+to avoid a latency spike if/when the user returns to Zulip, this
+work can be done in a nightly cron job).  We can construct
+those rows later because we already have the data for when the user
+might have been subscribed or unsubscribed from streams by other
+users, and, importantly, we also know that the user didn’t interact
+with the UI since the message was sent (and thus we can safely assume
+that the messages have not been marked as read by the user).  This is
+done in the `add_missing_messages` function, which is the core of the
+soft-deactivation implementation.
+
+* The “usually” above is because there are a few flags that result
+from content in the message (e.g., a message that mentions a user
+results in a “mentioned” flag in the `UserMessage` row), that we need to
+keep track of.  Since parsing a message can be expensive (>10ms of
+work, depending on message content), it would be too inefficient to
+need to re-parse every message when a soft-deactivated user comes back
+to Zulip.  Conveniently, those messages are rare, and so we can just
+create the `UserMessage` rows that would have “interesting” flags at
+the time the messages are sent, without any material performance
+impact.  And then `add_missing_messages` skips any messages that
+already have a `UserMessage` row for that user when doing its backfill.
+
+The end result is the best of both worlds:
+
+* Nobody's view of the world is different because the user was
+soft-deactivated (resulting in no visible user-experience impact), at
+least if one is running the cron job.  If one does not run the cron
+job, then users returning after being away for a very long time may
+have a (very) slow loading experience, as hundreds of thousands of
+`UserMessage` rows might need to be reconstructed at once.
+* On the latency-sensitive message sending and fanout code path, the
+server only needs to do work for users who are currently interacting
+with Zulip.
+
+Empirically, we've found this technique completely resolved the "send
+latency" scaling problem.  The latency of sending a message to a stream
+now scales only with the number of active subscribers, so one can send
+a message to a stream with 5K subscribers of which 500 are active, and
+it’ll arrive in the couple hundred milliseconds one would expect if
+the extra 4500 inactive subscribers didn’t exist.
+
+There are a few details that require special care with this system:
+* Email and mobile push notifications.  We need to make sure these are
+  still correctly delivered to soft-deactivated users; making this
+  work required careful work for those code paths that assumed a
+  `UserMessage` row would always exist for a message that triggers a
+  notification to a given user.
+* Digest emails, which use the `UserMessage` table extensively to
+  determine what has happened in streams the user can see.  We can use
+  the user's subscriptions to determine which messages they should
+  have access to for this feature.
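
The send-time skip and the later backfill described in the section above can be illustrated with a toy, in-memory model. This is only a sketch and is not Zulip's code: `ToyServer`, `User`, and `send_message` are hypothetical names, the dictionaries stand in for the `Message` and `UserMessage` tables, and subscriptions are assumed never to change (the real `add_missing_messages` also has to account for subscription history).

```python
from dataclasses import dataclass, field


@dataclass
class User:
    user_id: int
    soft_deactivated: bool = False


@dataclass
class ToyServer:
    """Hypothetical in-memory stand-in for the Message/UserMessage tables."""
    subscribers: dict = field(default_factory=dict)    # stream -> list of User
    messages: list = field(default_factory=list)       # (message_id, stream)
    user_messages: dict = field(default_factory=dict)  # (message_id, user_id) -> flags

    def send_message(self, stream, mentioned_ids=()):
        message_id = len(self.messages) + 1
        self.messages.append((message_id, stream))
        for user in self.subscribers.get(stream, []):
            flags = {"mentioned"} if user.user_id in mentioned_ids else set()
            # Skip the write for soft-deactivated users, unless the row
            # would carry an "interesting" flag such as a mention.
            if user.soft_deactivated and not flags:
                continue
            self.user_messages[(message_id, user.user_id)] = flags
        return message_id

    def add_missing_messages(self, user):
        # Toy backfill: create the rows that were skipped at send time,
        # leaving any row that already exists (and its flags) untouched.
        for message_id, stream in self.messages:
            if user not in self.subscribers.get(stream, []):
                continue
            key = (message_id, user.user_id)
            if key not in self.user_messages:
                self.user_messages[key] = set()


active, lurker = User(1), User(2, soft_deactivated=True)
server = ToyServer(subscribers={"general": [active, lurker]})
m1 = server.send_message("general")                     # no row for lurker
m2 = server.send_message("general", mentioned_ids={2})  # row kept: mention
rows_before = len(server.user_messages)
server.add_missing_messages(lurker)                     # backfills the m1 row
rows_after = len(server.user_messages)
```

In this sketch the send path writes rows only for the active user plus the one mention, and the backfill later adds the single skipped row without touching the row that already carries the "mentioned" flag.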