# Exporting data from a large multi-realm Zulip server

## Draft status

This is a draft design document considering potential future
refinements and improvements to make large migrations easier going
forward, and is not yet a set of recommendations for Zulip systems
administrators to follow.

## Overview

Zulip offers an export tool, `management/export.py`, which works well
to export the data for a single Zulip realm, and which is your best
choice if you're migrating a Zulip realm to a new server.

This document supplements the explanation in `management/export.py`,
but here we focus more on the logistics of a big conversion of a
multi-realm Zulip installation. (For some historical perspective, this
document was originally begun as part of a big Zulip cut-over in
summer 2016.)

There are many major operational aspects to doing a conversion. I will
list them here, noting that several are not within the scope of this
document:

- Get new servers running.
- Export data from the old DB.
- Export files from Amazon S3.
- Import files into new storage.
- Import data into new DB.
- Restart new servers.
- Decommission old server.

This document focuses almost entirely on the **export** piece. Issues
with getting Zulip itself running are out of scope here; see [the
production installation instructions](index.html#prod-install-docs).
As for the import side of things, we only touch on it implicitly. (My
reasoning was that we *had* to get the export piece right in a timely
fashion, even if it meant we would have to sort out some straggling
issues on the import side later.)

## Exporting multiple realms' data when moving to a new server

The main exporting tools in place as of summer 2016 are below:

- We can export single realms (but not yet limit users within the
  realm).
- We can export single users (but then we get no realm-wide data in
  the process).
- We can run exports simultaneously (but have to navigate a bunch of
  /tmp directories).

Things that we still may need:

- We may want to export multiple realms simultaneously.
- We may want to export multiple single users simultaneously.
- We may want to limit users within realm exports.
- We may want more operational robustness/convenience while doing
  several exports simultaneously.
- We may want to merge multiple export files to remove duplicates.

We have a few major classes of data. They are listed below in the order
that we process them in `do_export_realm()`:

#### Public Realm Data

`Realm/RealmDomain/RealmEmoji/RealmFilter/DefaultStream`.

#### Cross Realm Data

`Client/zerver_userprofile_crossrealm`

This includes `Client` and three bots.

`Client` is unique in being a fairly core table that is not tied to
`UserProfile` or `Realm` (unless you somewhat painfully tie it back to
users in a bottom-up fashion through other tables).

#### Disjoint User Data

`UserProfile/UserActivity/UserActivityInterval/UserPresence`.

#### Recipient Data

`Recipient/Stream/Subscription/Huddle`.

These tables are tied back to users, but they introduce complications
when you try to deal with multi-user subsets.

#### File-related Data

`Attachment`

This includes `Attachment`, and it references the `avatar_source` field
of `UserProfile`. Most importantly, of course, it requires us to grab
files from S3. Finally, `Attachment`'s `m2m` relationship ties to
`Message`.

#### Message Data

`Message/UserMessage`

### Summary

Here are the same classes of data, listed in roughly
decreasing order of riskiness:

- Message Data (sheer volume/lack of time/security)
- File-Related Data (S3/security/lots of moving parts)
- Recipient Data (complexity/security/cross-realm considerations)
- Cross Realm Data (duplicate ids)
- Disjoint User Data
- Public Realm Data

(Note the above list is essentially in reverse order of how we
process the data, which isn't surprising for a top-down approach.)

The next section of the document talks about risk factors.

# Risk Mitigation

## Generic considerations

We have two major mechanisms for getting data:

##### Top Down

Get realm data, then all users in realm, then all recipients, then all
messages, etc.

The problem with the top-down approach will be **filtering**. Also,
if errors arise during top-down passes, it may be time-consuming to
re-run the processes.

##### Bottom Up

Start with users, get their recipient data, etc.

The problem with the bottom-up approach will be **merging**. Also,
if we run multiple bottom-up passes, there is the danger of
duplicating some work, particularly on the message side of things.

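To make the merging concern concrete, here is a minimal sketch of
combining the output of several bottom-up passes while dropping
duplicate rows. It assumes each pass produces a dict mapping table
names to lists of row dicts that carry an `id` key; the name
`merge_exports` and that data shape are hypothetical and are not taken
from `management/export.py`.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]
Export = Dict[str, List[Row]]


def merge_exports(exports: List[Export]) -> Export:
    """Merge per-pass exports, keeping the first copy of each (table, id) row."""
    merged: Export = {}
    seen: Dict[str, Set[Any]] = {}
    for export in exports:
        for table, rows in export.items():
            table_rows = merged.setdefault(table, [])
            table_seen = seen.setdefault(table, set())
            for row in rows:
                if row["id"] in table_seen:
                    continue  # already emitted by an earlier pass
                table_seen.add(row["id"])
                table_rows.append(row)
    return merged
```

Keeping the first copy is an arbitrary choice for this sketch; a real
merge would also need to decide what to do when two passes disagree
about the contents of the same row.
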
### Approved Transfers

We have not yet integrated the approved-transfer model, which tells us
which users can be moved.

## Risk factors broken out by data categories

### Message Data

- models: `Message`/`UserMessage`.
- assets: `messages-*.json`, subprocesses, partial files

Rows in the `Message` model depend on `Recipient/UserProfile`.

Rows in the `UserMessage` model depend on `UserProfile/Message`.

The biggest concern here is the **sheer volume** of data, with
security being a close second. (They are interrelated, as without
security concerns, we could just bulk-export everything one time.)

We currently have these measures in place for top-down processing:

- chunking
- multi-processing
- messages are filtered by both sender and recipient

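As a rough illustration of two of the measures listed above, the
sketch below splits a list of message IDs into fixed-size chunks and
filters message rows by both sender and recipient. The helper names,
the `sender_id`/`recipient_id` keys, and the reading of "both" as a
logical AND are assumptions made for this example, not a description
of the actual export code.

```python
from typing import Any, Dict, Iterator, List, Set

Row = Dict[str, Any]


def chunkify(items: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield successive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def filter_messages(
    rows: List[Row], user_ids: Set[int], recipient_ids: Set[int]
) -> List[Row]:
    """Keep rows whose sender AND recipient are both in the exported sets."""
    return [
        row
        for row in rows
        if row["sender_id"] in user_ids and row["recipient_id"] in recipient_ids
    ]
```

Each chunk could then be written to its own `messages-*.json` file, so
that a failure only requires re-running the affected chunk.
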
### File Related Data

- models: `Attachment`
- assets: S3, `attachment.json`, `uploads-temp/`, image files in
  `avatars/`, assorted files in `uploads/`, `avatars/records.json`,
  `uploads/records.json`, `zerver_attachment_messages`

When it comes to exporting attachment data, we have some minor volume
issues, but the main concern is just that there are **lots of moving
parts**:

- S3 needs to be up, and we get some metadata from it as well as
  files.
- We have security concerns about copying over only files that belong
  to users who approved the transfer.
- This piece is just different in how we store data from all the other
  DB-centric pieces.
- At import time we have to populate the `m2m` table (but fortunately,
  this is pretty low risk in terms of breaking anything).

### Recipient Data

- models: `Recipient/Stream/Subscription/Huddle`
- assets: `realm.json`, `(user,stream,huddle)_(recipient,subscription)`

This data is fortunately low to medium in volume. The risk here will
come from **model complexity** and **cross-realm concerns**.

From the top down, here are the dependencies:

- `Recipient` depends on `UserProfile`
- `Subscription` depends on `Recipient`
- `Stream` currently depends on `Realm` (but maybe it should be tied
  to `Subscription`)
- `Huddle` depends on `Subscription` and `UserProfile`

The biggest risk factor here is probably just the possibility that we
could introduce some bug in our code as we try to segment `Recipient`
into user, stream, and huddle components, especially if we try to
handle multiple users or realms. I think this can be largely
mitigated by the new `Config` approach.

And then we also have some complicated `Huddle` logic that will be
customized regardless. The fiddliest part of the `Huddle` logic is
creating the set of `unsafe_huddle_recipient_ids`.

Last but not least, if we go with some hybrid of bottom-up and
top-down, these tables are neither close to the bottom nor close to
the top, so they may have the most fiddly edge cases when it comes to
filtering and merging.

Recommendation: We probably want to get a backup of all this data that
is very simply bulk-exported from the entire DB, and we should
obviously put it in a secure place.

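Returning to the `Huddle` logic mentioned earlier in this section, the
sketch below shows one plausible way to compute a set like
`unsafe_huddle_recipient_ids`. It assumes that a huddle is "unsafe"
when at least one of its subscribers is outside the set of users being
exported; that definition, the function name, and the
`user_profile_id`/`recipient_id` keys are assumptions for illustration
and may not match the real code.

```python
from typing import Any, Dict, Iterable, Set

Row = Dict[str, Any]


def find_unsafe_huddle_recipient_ids(
    huddle_subscriptions: Iterable[Row], exported_user_ids: Set[int]
) -> Set[int]:
    """Return recipient IDs of huddles that include a non-exported user."""
    unsafe: Set[int] = set()
    for sub in huddle_subscriptions:
        if sub["user_profile_id"] not in exported_user_ids:
            # One outside participant is enough to exclude the huddle.
            unsafe.add(sub["recipient_id"])
    return unsafe
```

Under this assumption, messages addressed to any recipient ID in the
returned set would be left out of the export.
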
### Cross Realm Data

- models: `Client`
- assets: `realm.json`, three bots (`notification`/`email`/`welcome`),
  `id_maps`

The good news here is that `Client` is a small table, and there are
only three special bots.

The bad news is that cross-realm data **complicates everything else**,
and we have to avoid **database ID conflicts**.

If we use bottom-up approaches to load small user populations at a
time, we may have **merging** issues here. We will need to
consolidate IDs either by merging exports in `/tmp` or by handling it
at import time.

For the three bots, they live in `zerver_userprofile_crossrealm`, and
we re-map their IDs on the new server.

Recommendation: Do not sweat the exports too much. Deal with all the
messiness at import time, and rely on the tables being really small.
We already have logic to catch `Client.DoesNotExist` exceptions, for
example. As for possibly missing messages that the welcome bot and
friends have sent in the past, I am not sure what our risk profile is
there, but I imagine it is relatively low.

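The ID re-mapping for the cross-realm bots can be pictured with a
small sketch like the one below. It assumes the bots are matched
between the old and new servers by email address and that rows are
plain dicts; `build_id_map` and `remap_field` are hypothetical helpers
written for this example, not functions from the import code.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def build_id_map(exported_bots: List[Row], server_bots: List[Row]) -> Dict[int, int]:
    """Map each exported bot ID to the ID of the same-email bot on the new server."""
    new_id_by_email = {bot["email"]: bot["id"] for bot in server_bots}
    return {
        bot["id"]: new_id_by_email[bot["email"]]
        for bot in exported_bots
        if bot["email"] in new_id_by_email
    }


def remap_field(rows: List[Row], field: str, id_map: Dict[int, int]) -> None:
    """Rewrite a foreign-key field in place; IDs missing from the map are left alone."""
    for row in rows:
        row[field] = id_map.get(row[field], row[field])
```

For example, message rows could be passed through
`remap_field(rows, "sender_id", id_map)` before loading, so that
messages sent by a cross-realm bot point at that bot's row on the new
server.
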
### Disjoint User Data

- models: `UserProfile/UserActivity/UserActivityInterval/UserPresence`
- assets: `realm.json`, `password`, `api_key`, `avatar salt`,
  `id_maps`

On the DB side this data should be fairly easy to deal with. All of
these tables are basically disjoint by user profile ID. Our biggest
risk is **remapped user ids** at import time, but this is mostly
covered in the section above.

We have code in place to exclude `password` and `api_key` from
`UserProfile` rows. The import process calls
`set_unusable_password()`.

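Excluding sensitive fields from `UserProfile` rows can be as simple as
the sketch below. It assumes rows are plain dicts keyed by column
name; `scrub_user_rows` and `EXCLUDED_FIELDS` are hypothetical names
chosen for this example rather than the ones used in the export code.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Fields that should never leave the old server (assumed list for this sketch).
EXCLUDED_FIELDS = frozenset({"password", "api_key"})


def scrub_user_rows(rows: List[Row]) -> List[Row]:
    """Return copies of the rows with the excluded fields removed."""
    return [
        {key: value for key, value in row.items() if key not in EXCLUDED_FIELDS}
        for row in rows
    ]
```

Because the password never reaches the new server, imported accounts
need some other way to log in, which is why the import side marks
their passwords as unusable.
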
### Public Realm Data

- models: `Realm/RealmDomain/RealmEmoji/RealmFilter/DefaultStream`
- assets: `realm.json`

All of these tables are public (per-realm), and they are keyed by
realm ID. There is not a ton to worry about here, except possibly
**merging** if we run multiple bottom-up jobs for a single realm.