# Sending messages

While sending a message in a chat product might seem simple, there's a
lot of underlying complexity required to make a professional-quality
experience.

This document aims to explain conceptually what happens when a message
is sent in Zulip, and why that is correct behavior. It assumes the
reader is familiar with our
[real-time sync system](../subsystems/events-system.md) for
server-to-client communication and the
[new application feature tutorial](../tutorials/new-feature-tutorial.md),
and we generally don't repeat the content discussed there.

## Message lists

This is just a bit of terminology: A "message list" is what Zulip
calls the frontend concept of a (potentially narrowed) message feed.
There are 3 related structures:

* A `message_list_data` just has the sequencing data of which message
  IDs go in what order.
* A `message_list` is built on top of `message_list_data` and
  additionally contains the data for a visible-to-the-user message
  list (e.g., where trailing bookends should appear, a selected
  message, etc.).
* A `message_list_view` is built on top of `message_list` and
  additionally contains rendering details like a window of up to 400
  messages that is present in the DOM at the time, scroll position
  controls, etc.
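
To make the layering concrete, here is an illustrative sketch of the
three structures. The field names are hypothetical; the real objects
are JavaScript, in `static/js/message_list_data.js`,
`static/js/message_list.js`, and `static/js/message_list_view.js`:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MessageListData:
    # Layer 1: just the sequencing -- which message IDs, in what order.
    message_ids: List[int] = field(default_factory=list)

@dataclass
class MessageList:
    # Layer 2: user-visible state on top of the ordering.
    data: MessageListData
    selected_message_id: Optional[int] = None
    show_trailing_bookend: bool = False

@dataclass
class MessageListView:
    # Layer 3: rendering state -- the window of (up to ~400) messages
    # currently present in the DOM, plus scroll position.
    message_list: MessageList
    rendered_message_ids: List[int] = field(default_factory=list)
    scroll_top: int = 0
```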

(This should later be expanded into a full article on message lists
and narrowing.)

## Compose area

The compose box does a lot of fancy things that are out of scope for
this article. But it also does a decent amount of client-side
validation before sending a message off to the server, especially
around mentions (e.g., checking that the stream name refers to a valid
stream, displaying a warning about the number of recipients before a
user can use `@**all**` or mention a user who is not subscribed to the
current stream, etc.).
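
As a rough illustration of the kinds of checks involved, here is a
hedged sketch; the function name, inputs, and threshold are invented
rather than Zulip's actual frontend logic:

```python
from typing import List

def compose_warnings(stream_exists: bool, num_subscribers: int,
                     uses_wildcard_mention: bool,
                     wildcard_threshold: int = 15) -> List[str]:
    # Collect validation warnings to show before the message is sent.
    warnings = []
    if not stream_exists:
        warnings.append("This stream does not exist.")
    if uses_wildcard_mention and num_subscribers > wildcard_threshold:
        warnings.append(
            f"Are you sure you want to notify all {num_subscribers} "
            "recipients with @**all**?"
        )
    return warnings
```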

## Backend implementation

The backend flow for sending messages is similar in many ways to the
process described in our
[new application feature tutorial](../tutorials/new-feature-tutorial.md).
This section details the ways in which it is different:

* There is significant custom code inside the `process_message_event`
  function in `zerver/tornado/event_queue.py`. This custom code has a
  number of purposes:
  * Triggering [email and mobile push
    notifications](../subsystems/notifications.md) for any users who
    do not have active clients and have settings of the form "push
    notifications when offline". In order to avoid doing any real
    computational work inside the Tornado codebase, this logic aims
    to just do the check for whether a notification should be
    generated, and then put an event into an appropriate
    [queue](../subsystems/queuing.md) to actually send the message.
    See `maybe_enqueue_notifications` and related code for this part
    of the logic; a simplified sketch of the pattern appears after
    this list.
  * Splicing user-dependent data (e.g., `flags` such as whether the
    user was `mentioned`) into the events.
  * Handling the [local echo details](#local-echo).
  * Handling certain client configuration options that affect
    messages, e.g., determining whether to send the
    plaintext/Markdown raw content or the rendered HTML (e.g. the
    `apply_markdown` and `client_gravatar` features in our
    [events API docs](https://zulip.com/api/register-queue)).
* Following our standard naming convention, input validation is done
  inside the `check_message` function in `zerver/lib/actions.py`,
  which is responsible for validating that the user can send to the
  recipient, [rendering the Markdown](../subsystems/markdown.md),
  etc. -- basically everything that can fail due to bad user input.
* The core `do_send_messages` function (which handles actually
  sending the message) in `zerver/lib/actions.py` is one of the most
  optimized and thus complex parts of the system. But in short, its
  job is to atomically do a few key things:
  * Store a `Message` row in the database.
  * Store one `UserMessage` row in the database for each user who is
    a recipient of the message (including the sender), with
    appropriate `flags` for whether the user was mentioned, an alert
    word appears, etc. See
    [the section on soft deactivation](#soft-deactivation) for
    a clever optimization we use here that is important for large
    open organizations.
  * Do all the database queries to fetch relevant data for, and then
    send, a `message` event to the
    [events system](../subsystems/events-system.md) containing the
    data it will need for the calculations described above. This
    step adds a lot of complexity, because the events system cannot
    make queries to the database directly.
  * Trigger any other deferred work caused by the current message,
    e.g. [outgoing webhooks](https://zulip.com/api/outgoing-webhooks)
    or embedded bots.
  * Every query is designed to be a bulk query; we carefully
    unit-test this system for how many database and memcached queries
    it makes when sending messages with large numbers of recipients,
    to ensure its performance.
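
To illustrate the "do the cheap check in Tornado, do the real work in
a queue worker" pattern used for notifications, here is a simplified
sketch. The function signature and queue name are illustrative, not
the real `maybe_enqueue_notifications` API:

```python
# Simplified sketch; the real maybe_enqueue_notifications in
# zerver/tornado/event_queue.py considers many more settings and cases.
def maybe_enqueue_notification(user_id: int, idle: bool, mentioned: bool,
                               wants_offline_push: bool,
                               queue_publish) -> None:
    # Tornado only *decides* whether to notify; rendering emails and
    # talking to push services happens in a separate worker process.
    if idle and (mentioned or wants_offline_push):
        queue_publish(
            "missedmessage_mobile_notifications",  # illustrative queue name
            {
                "user_profile_id": user_id,
                "trigger": "mentioned" if mentioned else "offline_push",
            },
        )
```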

## Local echo

An essential feature for a good chat experience is local echo
(i.e. having the message appear in the feed the moment the user hits
send, before the network round trip to the server). This is essential
both for freeing up the compose box (so the user can send more
messages) and for making the experience feel snappy.

A sloppy local echo experience (like Google Chat had for over a decade
for emoji) would just render the raw text the user entered in the
browser, and then replace it with data from the server when it
changes.

Zulip aims for a near-perfect local echo experience, which is why our
[Markdown system](../subsystems/markdown.md) requires both
an authoritative (backend) Markdown implementation and a secondary
(frontend) Markdown implementation, the latter used only for the local
echo feature. Read our Markdown documentation for all the tricky
details on how that works and is tested.

The rest of this section details how Zulip manages locally echoed
messages.

* The core function in the frontend codebase is
  `echo.try_deliver_locally`. This checks whether correct local echo
  is possible (via `markdown.contains_backend_only_syntax`) and useful
  (whether the message would appear in the current view), and if so,
  causes Zulip to insert the message into the relevant feed(s).
* Since the message hasn't been confirmed by the server yet, it
  doesn't have a message ID. The frontend makes one up, via
  `local_message.next_local_id`, by taking the highest message ID it
  has seen and adding the decimal `0.01`. The use of a floating point
  value is critical, because it means the message should sort
  correctly with other messages (at the bottom) and also won't be
  duplicated by a real confirmed-by-the-backend message ID. We choose
  just above the `max_message_id`, because we want any new messages
  that other users send to the current view to be placed after it in
  the feed (this decision is somewhat arbitrary; in any case, we'll
  re-sort it to its proper place once it is confirmed by the server.
  We do it this way to minimize messages jumping around/reordering
  visually).
* The `POST /messages` API request to the server to send the message
  is passed two special parameters that clients not implementing local
  echo don't use: `queue_id` and `local_id`. The `queue_id` is the ID
  of the client's event queue; here, it is used just as a unique
  identifier for the specific client (e.g. a browser tab) that sent
  the message. And the `local_id` is, by the construction above, a
  unique value within that namespace identifying the message. (A
  sketch of this scheme appears after this list.)
* The `do_send_messages` backend code path includes the `queue_id` and
  `local_id` in the data it passes to the
  [events system](../subsystems/events-system.md). The events
  system will extend the `message` event dictionary it delivers to
  the client containing the `queue_id` with a `local_message_id`
  field, containing the `local_id` that the relevant client used when
  sending the message. This allows the client to know that the
  `message` event it is receiving is the same message it itself had
  sent.
* Using that information, rather than adding the "new message" to the
  relevant message feed, the client updates the (locally echoed)
  message's properties (at the very least, message ID and timestamp)
  and rerenders it in any message lists where it appears. This is
  primarily done in the `process_from_server` function in
  `static/js/echo.js`.
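
To make the ID scheme and API parameters concrete, here is a minimal
sketch in Python (the real logic is JavaScript, in
`static/js/local_message.js` and `static/js/echo.js`; the payload
values below are example data):

```python
def next_local_id(max_message_id: int) -> float:
    # A float just above the highest known message ID sorts the echoed
    # message at the bottom of the feed and can never collide with a
    # server-assigned (integer) message ID.
    return max_message_id + 0.01

queue_id = "1517975029:0"  # example ID of this client's event queue
local_id = next_local_id(max_message_id=4231)  # -> 4231.01

# Extra parameters sent along with the usual POST /messages payload:
payload = {
    "type": "stream",
    "to": "design",
    "topic": "colors",
    "content": "What about **bold** red?",
    "queue_id": queue_id,       # identifies the sending client (tab)
    "local_id": str(local_id),  # unique within that queue_id namespace
}
```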

### Local echo in message editing

Zulip also supports local echo in the message editing code path, for
edits to just the content of a message. The approach is analogous
(using `markdown.contains_backend_only_syntax`, etc.), except we
don't need any of the `local_id` tracking logic, because the message
already has a permanent message ID; as a result, the whole
implementation was under 150 lines of code.

## Putting it all together

This section just has a brief review of the sequence of steps all in
one place:

* User hits send in the compose box.
* Compose box validation runs; if it passes, the browser locally
  echoes the message and then sends a request to the `POST /messages`
  API endpoint.
* The Django URL routes and middleware run, and eventually call the
  `send_message_backend` view function in `zerver/views/messages.py`.
  (Alternatively, for an API request to send a message via Zulip's
  REST API, things start here.)
* `send_message_backend` does some validation before triggering the
  `check_message` + `do_send_messages` backend flow.
* That backend flow saves the data to the database and triggers a
  `message` event in the `notify_tornado` queue (part of the events
  system).
* The events system processes that event, and dispatches it to all
  clients subscribed to receive notifications for users who should
  receive the message (including the sender). As a side effect, it
  adds queue items to the email and push notification queues (which,
  in turn, may trigger those notifications).
* Other clients receive the event and display the new message.
* The client that sent the message instead replaces its
  locally echoed message with the final message it received back
  from the server (it indicates this to the sender by adding a
  display timestamp to the message).
* The `send_message_backend` view function returns
  a 200 HTTP response; the client receives that response and mostly
  does nothing with it other than update some logging details. (This
  may happen before or after the client receives the event notifying
  it about the new message via its event queue.)

## Message editing

Message editing uses a very similar principle to how sending messages
works. A few details are worth mentioning:

* `maybe_enqueue_notifications_for_message_update` is an analogue of
  `maybe_enqueue_notifications`, and exists to handle cases like a
  user being newly mentioned after the message is edited (since that
  should trigger email/push notifications, even if the original
  message didn't trigger one).
* We use a similar technique to what's described in the local echo
  section for doing client-side rerendering to update the message feed.
* In the default configuration, Zulip stores the message edit history
  (which is useful for forensics but also exposed in the UI) in the
  `message.edit_history` attribute.
* We support topic editing, including bulk updates moving several
  messages between topics.

### Inline URL previews

Zulip's inline URL previews feature (`zerver/lib/url_preview/`) uses a
variant of the message editing/local echo behavior. The reason is
that for inline URL previews, the backend needs to fetch the content
from the target URL, and for slow websites, this could result in a
significant delay in rendering the message and delivering it to other
users.

* For this case, Zulip's backend Markdown processor will render the
  message without including the URL embeds/previews, but it will add a
  deferred work item into the `embed_links` queue.
* The [queue processor](../subsystems/queuing.md) for the
  `embed_links` queue will fetch the URLs, and then, if they return
  results, rerun the Markdown processor and notify clients of the
  updated message `rendered_content` (see the sketch after this list).
* We reuse the `update_message` framework (used for
  Zulip's message editing feature) in order to avoid needing custom
  code to implement the notification-and-rerender part of this
  implementation.
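
A hedged sketch of that deferred flow follows; the function names and
event shape are invented for clarity (the real queue processor lives
in `zerver/worker/queue_processors.py`):

```python
from typing import Callable, Dict, Optional

def handle_embed_links_event(
    event: Dict,
    fetch_preview: Callable[[str], Optional[Dict]],
    rerender_and_notify: Callable[[int, Dict[str, Dict]], None],
) -> None:
    # Fetch each URL outside the latency-sensitive send code path.
    previews = {url: fetch_preview(url) for url in event["urls"]}
    previews = {url: data for url, data in previews.items() if data}
    if previews:
        # Rerun the Markdown processor with the preview data and notify
        # clients of the updated rendered_content, reusing the
        # update_message code path.
        rerender_and_notify(event["message_id"], previews)
```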

## Soft deactivation

This section details a somewhat subtle issue: How Zulip uses a
user-invisible technique called "soft deactivation" to handle
scalability to communities with many thousands of inactive users.

For background, Zulip's threading model requires tracking which
individual messages each user has received and read (in other chat
products, the system either doesn't track what the user has read at
all, or just needs to store a pointer for "how far the user has read"
in each room, channel, or stream).

We track these data in the backend in the `UserMessage` table, storing
rows `(message_id, user_id, flags)`, where `flags` is 32 bits of space
for boolean data like whether the user has read or starred the
message. All the key queries needed for accessing message history,
full-text search, and other key features can be done efficiently with
the database indexes on this table (with joins to the `Message` table
containing the actual message content where required).
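
For intuition, here is how boolean flags pack into that 32-bit `flags`
integer. The bit positions below are illustrative; the actual
assignments are defined on the `UserMessage` model in
`zerver/models.py`:

```python
READ      = 1 << 0  # illustrative bit positions
STARRED   = 1 << 1
MENTIONED = 1 << 3

flags = READ | MENTIONED      # read the message, and was mentioned in it

assert flags & MENTIONED      # the mentioned bit is set
assert not (flags & STARRED)  # the starred bit is not set
```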

The downside of this design is that when a new message is sent to a
stream with `N` recipients, we need to write `N` rows to the
`UserMessage` table to record those users receiving those messages.
Each row is just 3 integers in size, but even with modern databases
and SSDs, writing thousands of rows to a database starts to take a few
seconds.

This isn't a problem for most Zulip servers, but is a major problem
for communities like chat.zulip.org, where there might be 10,000s of
inactive users who only stopped by briefly to check out the product or
ask a single question, but are subscribed to whatever the default
streams in the organization are.

The total amount of work being done here was acceptable (a few seconds
of total CPU work per message to large public streams), but the
latency was unacceptable: The server backend was introducing a latency
of about 1 second per 2000 users subscribed to receive the message.
While these delays may not be immediately obvious to users (Zulip,
like many other chat applications,
[locally echoes](../subsystems/markdown.md) messages that a user sends
as soon as the user hits "Send"), latency beyond a second or two
significantly impacts the feeling of interactivity in a chat
experience (i.e. it feels like everyone takes a long time to reply to
even simple questions).

A key insight for addressing this problem is that there isn't much of
a use case for long chat discussions among 1000s of users who are all
continuously online and actively participating. Streams with a very
large number of active users are likely to only be used for occasional
announcements, where some latency before everyone sees the message is
fine. Even in giant organizations, almost all messages are sent to
smaller streams with dozens or hundreds of active users, representing
some organizational unit within the community or company.

However, large, active streams are common in open source projects,
standards bodies, professional development groups, and other large
communities with the rough structure of the Zulip development
community. These communities usually have thousands of user accounts
subscribed to all the default streams, even if they only have dozens
or hundreds of those users active in any given month. Many of the
other accounts may be from people who signed up just to check the
community out, or who signed up to ask a few questions and may never
be seen again.

The key technical insight is that if we can make the latency scale
with the number of users who actually participate in the community,
not the total size of the community, then our database-write-limited
send latency of 1 second per 2000 users is totally fine. But we need
to do this in a way that doesn't create problems if any of the
thousands of "inactive" users come back (or one of the active users
sends a private message to one of the inactive users), since it's
impossible for the software to know which users are eventually coming
back or will eventually be interacted with by an existing user.

We solved this problem with a technique we call "soft deactivation";
users that are soft-deactivated consume fewer resources from Zulip, in
a way that is designed to be invisible both to other users and to the
user themself. If a user hasn't logged into a given Zulip
organization for a few weeks, they are tagged as soft-deactivated.

The way this works internally is:

* We (usually) skip creating UserMessage rows for soft-deactivated
  users when a message is sent to a stream where they are subscribed.

* If/when the user ever returns to Zulip, we can at that time
  reconstruct the UserMessage rows that they missed, and create the
  rows at that time (or, to avoid a latency spike if/when the user
  returns to Zulip, this work can be done in a nightly cron job). We
  can construct those rows later because we already have the data for
  when the user might have been subscribed or unsubscribed from
  streams by other users, and, importantly, we also know that the user
  didn't interact with the UI since the message was sent (and thus we
  can safely assume that the messages have not been marked as read by
  the user). This is done in the `add_missing_messages` function,
  which is the core of the soft-deactivation implementation (a
  simplified sketch appears after this list).

* The "usually" above is because there are a few flags resulting from
  content in the message (e.g., a message that mentions a user
  results in a "mentioned" flag in the UserMessage row) that we need
  to keep track of. Since parsing a message can be expensive (>10ms of
  work, depending on message content), it would be too inefficient to
  need to re-parse every message when a soft-deactivated user comes
  back to Zulip. Conveniently, those messages are rare, so we can just
  create the UserMessage rows that would have "interesting" flags at
  the time they are sent, without any material performance impact. And
  then `add_missing_messages` skips any messages that already have a
  `UserMessage` row for that user when doing its backfill.
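
Here is a simplified sketch of that backfill; the real
`add_missing_messages` (in `zerver/lib/soft_deactivation.py`) also has
to reconstruct, from audit log data, the intervals during which the
user was actually subscribed:

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class UserMessageRow:  # simplified stand-in for the Django model
    user_id: int
    message_id: int
    flags: int = 0     # backfilled rows are unread, with no flags

def add_missing_messages(user_id: int,
                         subscribed_message_ids: List[int],
                         existing_message_ids: Set[int]) -> List[UserMessageRow]:
    # Messages that already have a row (e.g. because they mentioned the
    # user) keep their original flags; everything else is backfilled.
    return [
        UserMessageRow(user_id, message_id)
        for message_id in subscribed_message_ids
        if message_id not in existing_message_ids
    ]
```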

The end result is the best of both worlds:

* Nobody's view of the world is different because the user was
  soft-deactivated (resulting in no visible user-experience impact),
  at least if one is running the cron job. If one does not run the
  cron job, then users returning after being away for a very long time
  will potentially have a (very) slow loading experience, as
  potentially 100,000s of UserMessage rows might need to be
  reconstructed at once.
* On the latency-sensitive message sending and fanout code path, the
  server only needs to do work for users who are currently interacting
  with Zulip.

Empirically, we've found this technique completely resolved the "send
latency" scaling problem. The latency of sending a message to a stream
now scales only with the number of active subscribers, so one can send
a message to a stream with 5K subscribers of which 500 are active, and
it'll arrive in the couple hundred milliseconds one would expect if
the extra 4500 inactive subscribers didn't exist.

There are a few details that require special care with this system:

* [Email and mobile push
  notifications](../subsystems/notifications.md). We need to make
  sure these are still correctly delivered to soft-deactivated users;
  making this work required careful attention to those code paths that
  assumed a `UserMessage` row would always exist for a message that
  triggers a notification to a given user.
* Digest emails, which use the `UserMessage` table extensively to
  determine what has happened in streams the user can see. We can use
  the user's subscriptions to construct what messages they should have
  access to for this feature.