2018-07-31 23:07:42 +02:00
|
|
|
# Caching in Zulip
|
|
|
|
|
|
|
|
Like any product with good performance characteristics, Zulip makes
|
|
|
|
extensive use of caching. This article talks about our caching
|
|
|
|
strategy, focusing on how we use `memcached` (since it's the thing
|
|
|
|
people generally think about when they ask about how a server does
|
|
|
|
caching).
|
|
|
|
|
|
|
|
## Backend caching with memcached
|
|
|
|
|
|
|
|
On the backend, Zulip uses `memcached`, a popular key-value store, for
|
|
|
|
caching. Our `memcached` caching helps let us optimize Zulip's
|
|
|
|
performance and scalability, since most requests don't need to talk to
|
|
|
|
the database (which, even for a trivial query with everything on the
|
|
|
|
same machine, usually takes 3-10x as long as a memcached fetch).
|
|
|
|
|
|
|
|
We use Django's built-in caching integration to manage talking to
|
|
|
|
memcached, and then a small application-layer library
|
|
|
|
(`zerver/lib/cache.py`).
|
|
|
|
|
|
|
|
It's common for projects using a caching system like `memcached` to
|
|
|
|
either have the codebase littered with explicit requests to interact
|
|
|
|
with the cache (or flush data from a cache), or (worse) be littered
|
|
|
|
with weird bugs that disappear after you flush memcached.
|
|
|
|
|
|
|
|
Caching bugs are a pain to track down, because they generally require
|
|
|
|
an extra and difficult-to-guess step to reproduce (namely, putting the
|
|
|
|
wrong data into the cache).
|
|
|
|
|
|
|
|
So we've designed our backend to ensure that if we write a small
|
2018-08-03 21:52:14 +02:00
|
|
|
amount of Zulip's core caching code correctly, then the code most developers
|
|
|
|
naturally write will both benefit from caching and not create any cache
|
|
|
|
consistency problems.
|
2018-07-31 23:07:42 +02:00
|
|
|
|
|
|
|
The overall result of this design is that in the vast majority of
|
|
|
|
Zulip's Django codebase, all one needs to do is call the standard
|
|
|
|
accessor functions for data (like `get_user` or `get_stream` to fetch
|
|
|
|
user and stream objects, or for view code, functions like
|
|
|
|
`access_stream_by_id`, which checks permissions), and everything will
|
|
|
|
work great. The data fetches automatically benefit from `memcached`
|
|
|
|
caching, since those accessor methods have already been written to
|
|
|
|
transparently use Zulip's memcached caching system, and the developer
|
|
|
|
doesn't need to worry about whether the data returned is up-to-date:
|
|
|
|
it is. In the following sections, we'll talk about how we make this
|
|
|
|
work.
|
|
|
|
|
|
|
|
As a sidenote, the policy of using these accessor functions wherever
|
|
|
|
possible is a good idea, regardless of caching, because the functions
|
|
|
|
also generally take care of details you might not think about
|
|
|
|
(e.g. case-insensitive matching of stream names or email addresses).
|
|
|
|
It's amazing how slightly tricky logic that's duplicated in several
|
|
|
|
places invariably ends up buggy in some of those places, and in
|
|
|
|
aggregate we call these accessor functions hundreds of times in
|
|
|
|
Zulip. But the caching is certainly a nice bonus.
|
|
|
|
|
|
|
|
### The core implementation
|
|
|
|
|
|
|
|
The `get_user` function is a pretty typical piece of code using this
|
|
|
|
framework; as you can see, it's very little code on top of our
|
|
|
|
`cache_with_key` decorator:
|
|
|
|
|
|
|
|
``` python
|
|
|
|
def user_profile_cache_key_id(email: str, realm_id: int) -> str:
|
|
|
|
return u"user_profile:%s:%s" % (make_safe_digest(email.strip()), realm_id,)
|
|
|
|
|
|
|
|
def user_profile_cache_key(email: str, realm: 'Realm') -> str:
|
|
|
|
return user_profile_cache_key_id(email, realm.id)
|
|
|
|
|
|
|
|
@cache_with_key(user_profile_cache_key, timeout=3600*24*7)
|
|
|
|
def get_user(email: str, realm: Realm) -> UserProfile:
|
|
|
|
return UserProfile.objects.select_related().get(
|
|
|
|
email__iexact=email.strip(), realm=realm)
|
|
|
|
```
|
|
|
|
|
|
|
|
This decorator implements a pretty classic caching paradigm:
|
|
|
|
* The `user_profile_cache_key` function defines a unique map from a
|
|
|
|
canonical form of its arguments to a string. These strings are
|
|
|
|
namespaced (the `user_profile:` part) so that they won't overlap
|
|
|
|
with other caches, and encode the arguments so that two uses of this
|
|
|
|
cache won't overlap. In this case, a hash of the email address and
|
|
|
|
realm ID are those canonicalized arguments. (The `make_safe_digest`
|
|
|
|
is important to ensure we don't send special characters to
|
|
|
|
memcached). And we have two versions, depending whether the caller
|
|
|
|
has access to a `Realm` or just a `realm_id`.
|
|
|
|
* When `get_user` is called, `cache_with_key` will compute the key,
|
|
|
|
and do a Django `cache_get` query for the key (which goes to
|
|
|
|
memcached). If the key is in the cache, it just returns the value.
|
|
|
|
Otherwise, it fetches the value from the database (using the actual
|
|
|
|
code in the body of `get_user`), and then stores the value back to
|
|
|
|
that memcached key before returning the result to the caller.
|
|
|
|
* Cache entries expire after the timeout; in this case, a week.
|
|
|
|
Though in frequently deployed environments like chat.zulip.org,
|
|
|
|
often cache entries will stop being used long before that, because
|
|
|
|
`KEY_PREFIX` is rotated every time we deploy to production; see
|
|
|
|
below for details.
|
|
|
|
|
|
|
|
We use this decorator in more than 30 places in Zulip, and it saves a
|
|
|
|
huge amount of otherwise very self-similar caching code.
|
|
|
|
|
|
|
|
### Cautions
|
|
|
|
|
|
|
|
The one thing to be really careful with in using `cache_with_key` is
|
|
|
|
that if an item is in the cache, the body of `get_user` (above) is
|
|
|
|
never called. This means some things that might seem like clever code
|
|
|
|
reuse are actually a really bad idea. For example:
|
|
|
|
|
2020-05-17 14:14:57 +02:00
|
|
|
* Don't add a `get_active_user` function that uses the same cache key
|
|
|
|
function as `get_user` (but with a different query that filters our
|
|
|
|
deactivated users). If one called `get_active_user` to access a
|
|
|
|
deactivated user, the right thing would happen, but if you called
|
2018-07-31 23:07:42 +02:00
|
|
|
`get_user` to access that user first, then the `get_active_user`
|
2020-05-17 14:14:57 +02:00
|
|
|
function would happily return the user from the cache, without ever
|
|
|
|
doing your more restrictive query.
|
2018-07-31 23:07:42 +02:00
|
|
|
|
|
|
|
So remember: Use separate cache key functions for different data sets,
|
|
|
|
even if they feature the same objects.
|
|
|
|
|
|
|
|
### Cache invalidation after writes
|
|
|
|
|
|
|
|
The caching strategy described above works pretty well for anything
|
2018-08-03 21:52:14 +02:00
|
|
|
where the state it's storing is immutable (i.e. never changes). With
|
2018-07-31 23:07:42 +02:00
|
|
|
mutable state, one needs to do something to ensure that the Python
|
|
|
|
processes don't end up fetching stale data from the cache after a
|
|
|
|
write to the database.
|
|
|
|
|
|
|
|
We handle this using Django's fancy
|
|
|
|
[post_save signals][post-save-signals] feature. Django signals let
|
|
|
|
you configure some code to run every time Django does something (for
|
|
|
|
`post_save`, right after any write to the database using Django's
|
|
|
|
`.save()`).
|
|
|
|
|
|
|
|
There's a handful of lines in `zerver/models.py` like these that
|
|
|
|
configure this:
|
|
|
|
|
|
|
|
```
|
|
|
|
post_save.connect(flush_realm, sender=Realm)
|
|
|
|
post_save.connect(flush_user_profile, sender=UserProfile)
|
|
|
|
```
|
|
|
|
|
|
|
|
Once this `post_save` hook is registered, whenever one calls
|
|
|
|
`user_profile.save(...)` with a UserProfile object in our Django
|
|
|
|
project, Django will call the `flush_user_profile` function. Zulip is
|
|
|
|
systematic about using the standard Django `.save()` function for
|
|
|
|
modifying `user_profile` objects (and passing the `update_fields`
|
|
|
|
argument to `.save()` consistently, which encodes which fields on an
|
|
|
|
object changed). This means that all we have to do is write those
|
|
|
|
cache-flushing functions correctly, and people writing Zulip code
|
2018-08-03 21:52:14 +02:00
|
|
|
won't need to think about (or even know about!) the caching.
|
2018-07-31 23:07:42 +02:00
|
|
|
|
|
|
|
Each of those flush functions basically just computes the list of
|
|
|
|
cache keys that might contain data that was modified by the
|
|
|
|
`.save(...)` call (based on the object changed and the `update_fields`
|
|
|
|
data), and then sends a bulk delete request to `memcached` to remove
|
|
|
|
those keys from the cache (if present).
|
|
|
|
|
|
|
|
Maintaining these flush functions requires some care (every time we
|
|
|
|
add a new cache, we need to look through them), but overall it's a
|
|
|
|
pretty simple algorithm: If the changed data appears in any form in a
|
|
|
|
given cache key, that cache key needs to be cleared. E.g. the
|
|
|
|
`active_user_ids_cache_key` cache for a realm needs to be flushed
|
|
|
|
whenever a new user is created in that realm, or user is
|
|
|
|
deactivated/reactivated, even though it's just a list of IDs and thus
|
|
|
|
doesn't explicitly contain the `is_active` flag.
|
|
|
|
|
|
|
|
Once you understand how that works, it's pretty easy to reason about
|
|
|
|
when a particular flush function should clear a particular cache; so
|
|
|
|
the main thing that requires care is making sure we remember to reason
|
|
|
|
about that when changing cache semantics.
|
|
|
|
|
|
|
|
But the overall benefit of this cache system is that almost all the
|
|
|
|
code in Zulip just needs to modify Django model objects and call
|
|
|
|
`.save()`, and the caching system will do the right thing.
|
|
|
|
|
|
|
|
### Production deployments and database migrations
|
|
|
|
|
|
|
|
When upgrading a Zulip server, it's important to avoid having one
|
|
|
|
version of the code interact with cached objects from another version
|
|
|
|
that has a different data layout. In Zulip, we avoid this through
|
|
|
|
some clever caching strategies. Each "deployment directory" for Zulip
|
|
|
|
in production has inside it a `var/remote_cache_prefix` file,
|
|
|
|
containing a cache prefix (`KEY_PREFIX` in the code) that is
|
|
|
|
automatically appended to the start of any cache keys accessed by that
|
|
|
|
deployment directory (this is all handled internally by
|
|
|
|
`zerver/lib/cache.py`).
|
|
|
|
|
|
|
|
This completely solves the problem of potentially having contamination
|
|
|
|
from inconsistent versions of the source code / data formats in the cache.
|
|
|
|
|
|
|
|
### Automated testing and memcached
|
|
|
|
|
|
|
|
For Zulip's `test-backend` unit tests, we use the same strategy. In
|
|
|
|
particular, we just edit `KEY_PREFIX` before each unit test; this
|
|
|
|
means each of the thousands of test cases in Zulip has its own
|
|
|
|
independent memcached key namespace on each run of the unit tests. As
|
|
|
|
a result, we never have to worry about memcached caching causing
|
|
|
|
problems across multiple tests.
|
|
|
|
|
|
|
|
This is a really important detail. It makes it possible for us to do
|
|
|
|
assertions in our tests on the number of database queries or memcached
|
2018-08-03 21:52:14 +02:00
|
|
|
queries that are done as part of a particular function/route, and have
|
2018-07-31 23:07:42 +02:00
|
|
|
those checks consistently get the same result (those tests are great
|
|
|
|
for catching bugs where we accidentally do database queries in a
|
|
|
|
loop). And it means one can debug failures in the test suite without
|
|
|
|
having to consider the possibility that memcached is somehow confusing
|
|
|
|
the situation.
|
|
|
|
|
|
|
|
Further, this `KEY_PREFIX` model means that running the backend tests
|
|
|
|
won't potentially conflict with whatever you're doing in a Zulip
|
|
|
|
development environment on the same machine, which also saves a ton of
|
|
|
|
time when debugging, since developers don't need to think about things
|
|
|
|
like whether some test changed Hamlet's email address and that's why
|
|
|
|
login is broken.
|
|
|
|
|
|
|
|
More full-stack test suites like `test-js-with-casper` or `test-api`
|
|
|
|
use a similar strategy (set a random `KEY_PREFIX` at the start of the
|
|
|
|
test run).
|
|
|
|
|
|
|
|
### Performance
|
|
|
|
|
|
|
|
One thing be careful about with memcached queries is to avoid doing
|
|
|
|
them in loops (the same applies for database queries!). Instead, one
|
|
|
|
should use a bulk query. We have a fancy function,
|
|
|
|
`generate_bulk_cached_fetch`, which is super magical and handles this
|
|
|
|
for us, with support for a bunch of fancy features like marshalling
|
|
|
|
data before/after going into the cache (e.g. to compress `message`
|
|
|
|
objects to minimize data transfer between Django and memcached).
|
|
|
|
|
|
|
|
## In-process caching in Django
|
|
|
|
|
|
|
|
We generally try to avoid in-process backend caching in Zulip's Django
|
|
|
|
codebase, because every Zulip production installation involves
|
|
|
|
multiple servers. We do have a few, however:
|
|
|
|
* `per_request_display_recipient_cache`: A cache flushed at the start
|
|
|
|
of every request; this simplifies correctly implementing our goal of
|
|
|
|
not repeatedly fetching the "display recipient" (e.g. stream name)
|
|
|
|
for each message in the `GET /messages` codebase.
|
|
|
|
* Caches of various data, like the SourceMap object, that are
|
|
|
|
expensive to construct, not needed for most requests, and don't
|
|
|
|
change once a Zulip server has been deployed in production.
|
|
|
|
|
|
|
|
## Browser caching of state
|
|
|
|
|
|
|
|
Zulip makes extensive use of caching of data in the browser and mobile
|
|
|
|
apps; details like which users exist, with metadata like names and
|
|
|
|
avatars, similar details for streams, recent message history, etc.
|
|
|
|
|
|
|
|
These days are fetched in the `/register` endpoint (or `page_params`
|
|
|
|
for the webapp), and kept correct over time. The key to keeping these
|
|
|
|
state up to date is Zulip's
|
2019-09-30 19:37:56 +02:00
|
|
|
[real-time events system](../subsystems/events-system.md), which
|
2018-07-31 23:07:42 +02:00
|
|
|
allows the server to notify clients whenever state that might be
|
2018-08-03 21:52:14 +02:00
|
|
|
cached by clients is changed. Clients are responsible for handling
|
2018-07-31 23:07:42 +02:00
|
|
|
the events, updating their state, and rerendering any UI components
|
|
|
|
that might display the modified state.
|
|
|
|
|
|
|
|
[post-save-signals]:
|
|
|
|
https://docs.djangoproject.com/en/2.0/ref/signals/#post-save
|