docs: Improve export documentation.

Added user and realm export guidance to the production maintenance
docs, linked it to the conversion guide, and revamped the introduction
and styling of the text that Steve wrote.
commit 2083ffa7a6 (parent d9d389f64f)
Author: Sumana Harihareswara, 2016-10-19 05:27:26 -04:00; committed by Tim Abbott
2 changed files with 154 additions and 103 deletions


# Exporting data from a large multi-realm Zulip server
## Draft status
This is a draft design document considering potential future
refinements and improvements to make large migrations easier going
forward, and is not yet a set of recommendations for Zulip systems
administrators to follow.
## Overview
Occasionally Zulip administrators will need to move data from one
server to another.
Zulip offers an export tool, `management/export.py`, which works well
to export the data for a single Zulip realm, and which is your best
choice if you're migrating a Zulip realm to a new server.
This document supplements the explanation in `management/export.py`,
but here we focus more on the logistics of a big conversion of a
multi-realm Zulip installation. (For some historical perspective, this
document was originally begun as part of a big Zulip cut-over in
summer 2016.)
There are many major operational aspects to doing a conversion. I will
list them here, noting that several are not within the scope of this
document:
- Get new servers running.
- Export data from the old DB.
- Export files from Amazon S3.
- Import files into new storage.
- Import data into new DB.
- Restart new servers.
- Decommission old server.
This document focuses almost entirely on the **export** piece. Issues
with getting Zulip itself running are out of scope here; see [the
production installation instructions](index.html#prod-install-docs).
As for the import side of things, we only touch on it implicitly. (My
reasoning was that we *had* to get the export piece right in a timely
fashion, even if it meant we would have to sort out some straggling
issues on the import side later.)
## Exporting multiple realms' data when moving to a new server
The main exporting tools in place as of summer 2016 are below:
- We can export single realms (but not yet limit users within the
realm).
- We can export single users (but then we get no realm-wide data in
the process).
- We can run exports simultaneously (but have to navigate a bunch of
/tmp directories).
Things that we still may need:
- We may want to export multiple realms simultaneously.
- We may want to export multiple single users simultaneously.
- We may want to limit users within realm exports.
- We may want more operational robustness/convenience while doing
several exports simultaneously.
- We may want to merge multiple export files to remove duplicates.
We have a few major classes of data. They are listed below in the order
that we process them in `do_export_realm()`:
#### Public Realm Data
`Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream`.
#### Cross Realm Data
`Client/zerver_userprofile_cross_realm`
This includes `Client` and three bots.
`Client` is unique in being a fairly core table that is not tied to
`UserProfile` or `Realm` (unless you somewhat painfully tie it back to
users in a bottom-up fashion through other tables).
#### Disjoint User Data
`UserProfile/UserActivity/UserActivityInterval/UserPresence`.
#### Recipient Data
`Recipient/Stream/Subscription/Huddle`.
These tables are tied back to users, but they introduce complications
when you try to deal with multi-user subsets.
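For context on that complexity: `Recipient` is a polymorphic join
point, with each row naming a user, a stream, or a huddle. A minimal
sketch of the fan-out, using Zulip's conventional type constants
(treat the exact field names as assumptions and verify against the
models):
```
# A Recipient row names its audience via (type, type_id).
PERSONAL, STREAM, HUDDLE = 1, 2, 3

def describe_recipient(recipient):
    if recipient.type == PERSONAL:
        return "private messages to user id %d" % recipient.type_id
    if recipient.type == STREAM:
        return "messages to stream id %d" % recipient.type_id
    if recipient.type == HUDDLE:
        return "group private messages, huddle id %d" % recipient.type_id
    raise ValueError("unknown recipient type %d" % recipient.type)
```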
#### File-related Data
`Attachment`
This includes `Attachment`, and it references the `avatar_source` field
of `UserProfile`. Most importantly, of course, it requires us to grab
files from S3. Finally, `Attachment`'s `m2m` relationship ties to
`Message`.
#### Message Data
`Message/UserMessage`
### Summary
We have two major mechanisms for getting data:
##### Top Down
Get realm data, then all users in realm, then all recipients, then all
messages, etc.
The problem with the top-down approach will be **filtering**. Also,
if errors arise during top-down passes, it may be time consuming to
re-run the processes.
##### Bottom Up
Start with users, get their recipient data, etc.
The problem with the bottom-up approach will be **merging**. Also,
if we run multiple bottom-up passes, there is the danger of
duplicating some work, particularly on the message side of things.
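To make the bottom-up shape concrete, here is a rough Django ORM
sketch of one pass over a user subset (model names are as in this
document; the field details are assumptions):
```
from zerver.models import Message, Subscription, UserMessage

def bottom_up_pass(user_ids):
    # Users -> Recipients, via their subscriptions.
    recipient_ids = set(
        Subscription.objects.filter(user_profile_id__in=user_ids)
        .values_list("recipient_id", flat=True))

    # Recipients -> Messages addressed to them.
    message_ids = set(
        Message.objects.filter(recipient_id__in=recipient_ids)
        .values_list("id", flat=True))

    # Users -> their per-user UserMessage rows.
    usermessage_ids = set(
        UserMessage.objects.filter(user_profile_id__in=user_ids)
        .values_list("id", flat=True))

    # Overlapping passes rediscover the same message ids, which is
    # the duplication problem described above.
    return recipient_ids, message_ids, usermessage_ids
```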
### Approved Transfers
which users can be moved.
### Message Data
- models: `Message`/`UserMessage`.
- assets: `messages-*.json`, subprocesses, partial files
Rows in the `Message` model depend on `Recipient/UserProfile`.
Rows in the `UserMessage` model depend on `UserProfile/Message`.
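To give a feel for how the `messages-*.json` partial files bound
memory use, here is a toy sketch of dumping messages in fixed-size
chunks (the chunk size and file layout are illustrative, not the
exporter's actual values):
```
import json
import os

from zerver.models import Message

CHUNK_SIZE = 1000  # illustrative only

def dump_messages(message_ids, output_dir):
    ids = sorted(message_ids)
    for chunk_num, start in enumerate(range(0, len(ids), CHUNK_SIZE), 1):
        chunk = ids[start:start + CHUNK_SIZE]
        rows = list(Message.objects.filter(id__in=chunk).values())
        path = os.path.join(output_dir, "messages-%06d.json" % chunk_num)
        with open(path, "w") as f:
            # default=str crudely stringifies datetimes for this sketch.
            json.dump(rows, f, default=str)
```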
The biggest concern here is the **sheer volume** of data, with
security being a close second. (They are interrelated, as without
We currently have these measures in place for top-down processing:
### File Related Data
- models: `Attachment`
- assets: S3, `attachment.json`, `uploads-temp/`, image files in
`avatars/`, assorted files in `uploads/`, `avatars/records.json`,
`uploads/records.json`, `zerver_attachment_messages`
When it comes to exporting attachment data, we have some minor volume
issues, but the main concern is just that there are **lots of moving
parts**:
- S3 needs to be up, and we get some metadata from it as well as
files.
- We have security concerns about copying over only files that belong
to users who approved the transfer.
- This piece is just different in how we store data from all the other
DB-centric pieces.
- At import time we have to populate the `m2m` table (but fortunately,
this is pretty low risk in terms of breaking anything.)
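As a sketch of the security-sensitive piece, copying down only the
files whose owners approved the transfer (the bucket name, key
layout, and metadata record shape are all assumptions here):
```
import os

import boto3

def fetch_approved_uploads(attachment_records, approved_user_ids, dest_dir):
    # attachment_records is assumed to be a list of dicts carrying
    # the owner's user profile id and the S3 key, as recorded in
    # attachment.json.
    s3 = boto3.client("s3")
    for record in attachment_records:
        if record["owner_user_id"] not in approved_user_ids:
            continue  # this owner did not approve the transfer
        local_path = os.path.join(dest_dir, record["s3_key"])
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file("example-zulip-uploads", record["s3_key"],
                         local_path)
```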
### Recipient Data
- models: `Recipient/Stream/Subscription/Huddle`
- assets: `realm.json`, `(user,stream,huddle)_(recipient,subscription)`
This data is fortunately low to medium in volume. The risk here will
come from **model complexity** and **cross-realm concerns**.
From the top down, here are the dependencies:
- `Recipient` depends on `UserProfile`
- `Subscription` depends on `Recipient`
- `Stream` currently depends on `Realm` (but maybe it should be tied
to `Subscription`)
- `Huddle` depends on `Subscription` and `UserProfile`
The biggest risk factor here is probably just the possibility that we
could introduce some bug in our code as we try to segment `Recipient`
into user, stream, and huddle components, especially if we try to
handle multiple users or realms. I think this can be largely
mitigated by the new `Config` approach.
And then we also have some complicated `Huddle` logic that will be
customized regardless. The fiddliest part of the `Huddle` logic is
creating the set of `unsafe_huddle_recipient_ids`.
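To illustrate the idea (this is not the exporter's actual code): a
huddle is unsafe to export when any participant falls outside the
approved user set.
```
from zerver.models import Recipient, Subscription

def get_unsafe_huddle_recipient_ids(approved_user_ids):
    unsafe_ids = set()
    for recipient in Recipient.objects.filter(type=Recipient.HUDDLE):
        participants = set(
            Subscription.objects.filter(recipient=recipient)
            .values_list("user_profile_id", flat=True))
        # One participant outside the approved set makes the whole
        # huddle's history unsafe to copy over.
        if not participants <= approved_user_ids:
            unsafe_ids.add(recipient.id)
    return unsafe_ids
```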
Last but not least, if we go with some hybrid of bottom-up and
top-down, these tables are neither close to the bottom nor close to
the top, so they may have the most fiddly edge cases when it comes to
filtering and merging.
Recommendation: We probably want to get a backup of all this data that
is very simply bulk-exported from the entire DB, and we should
obviously put it in a secure place.
### Cross Realm Data
- models: `Client`
- assets: `realm.json`, three bots (`notification`/`email`/`welcome`),
`id_maps`
The good news here is that `Client` is a small table, and there are
only three special bots.
The bad news is that cross-realm data **complicates everything else**,
and we have to avoid **database ID conflicts**.
If we use bottom-up approaches to load small user populations at a
time, we may have **merging** issues here. We will need to
consolidate IDs either by merging exports in `/tmp` or by handling
the conflicts at import time.
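One way to picture that consolidation (a toy sketch; the real
`id_maps` plumbing is more involved): allocate fresh IDs for each
incoming table and rewrite foreign keys through the resulting map.
```
def build_id_map(rows, next_id):
    # Allocate fresh sequential ids, remembering old -> new so that
    # foreign keys elsewhere can be rewritten consistently.
    id_map = {}
    for row in rows:
        id_map[row["id"]] = next_id
        row["id"] = next_id
        next_id += 1
    return id_map, next_id

def remap_foreign_keys(rows, fk_field, id_map):
    for row in rows:
        row[fk_field] = id_map[row[fk_field]]

# E.g., consolidate Client rows from two exports, then fix referrers:
# client_map, next_id = build_id_map(client_rows, next_id)
# remap_foreign_keys(useractivity_rows, "client_id", client_map)
```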
The three bots live in `zerver_userprofile_crossrealm`, and we
re-map their IDs on the new server.
Recommendation: Do not sweat the exports too much. Deal with all the
messiness at import time, and rely on the tables being really small.
We already have logic to catch `Client.DoesNotExist` exceptions, for
example. As for possibly missing messages that the welcome bot and
friends have sent in the past, I am not sure what our risk profile is
there, but I imagine it is relatively low.
### Disjoint User Data
- models: `UserProfile/UserActivity/UserActivityInterval/UserPresence`
- assets: `realm.json`, `password`, `api_key`, `avatar salt`,
`id_maps`
On the DB side this data should be fairly easy to deal with. All of
these tables are basically disjoint by user profile ID. Our biggest
risk is **remapped user IDs** at import time, but this is mostly
covered in the section above.
We have code in place to exclude `password` and `api_key` from
`UserProfile` rows. The import process calls
`set_unusable_password()`.
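In sketch form (the export-row shape here is an assumption, while
`set_unusable_password()` is the real Django API):
```
SECRET_FIELDS = ("password", "api_key")

def sanitize_user_profile_row(row):
    # Strip credentials from the exported UserProfile dict, rather
    # than shipping password hashes and API keys to the new server.
    for field in SECRET_FIELDS:
        row.pop(field, None)
    return row

# At import time, after the UserProfile row is created:
#     user_profile.set_unusable_password()
#     user_profile.save(update_fields=["password"])
```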
### Public Realm Data
- models: `Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream`
- assets: `realm.json`
All of these tables are public (per-realm), and they are keyed by
realm ID. There is not a ton to worry about here, except possibly
**merging** if we run multiple bottom-up jobs for a single realm.

---

secure Zulip installation, including:
- [Security Model](#security-model)
- [Management commands](#management-commands)
## Upgrading
**We recommend reading this entire section before doing your first
email), etc.
they do get large on a busy server, and it's definitely
lower-priority.
If you are interested in backups because you are moving from one Zulip
server to another server and can't transfer a full postgres dump
(which is definitely the simplest approach), our draft
[conversion and export design document](conversion.html) may help.
The tool is well designed and was tested carefully with dozens of
realms as of mid-2016 but is not integrated into Zulip's regular
testing process, and thus it is worth asking on the Zulip developers
mailing list whether it needs any minor updates to do things like
export newly added tables.
### Restore from backups
To restore from backups, the process is basically the reverse of the above:
Contribution of a step-by-step guide for setting this up (and moving
this configuration to be available in the main `puppet/zulip/` tree)
would be very welcome!
## Monitoring
The complete Nagios configuration (sans secret keys) used to
with the `--permission=api_super_user` argument. See
`bots/irc-mirror.py` and `bots/jabber_mirror.py` for further detail on
these.
#### Exporting users and realms with manage.py export
If you need to do an export of a single user or of an entire realm, we
have tools in `management/` that essentially export Zulip data to the
file system.
`export_single_user.py` exports the message history and realm-public
metadata for a single Zulip user.
A good overview of the process for exporting a single realm when
moving a realm to a new server (without moving a full database dump)
is in
[management/export.py](https://github.com/zulip/zulip/blob/master/zerver/management/commands/export.py). We
recommend you read the comment there for words of wisdom on speed,
what is and is not exported, what will break upon a move to a new
server, and suggested procedure.
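Both commands can be driven like any other Django management
command; the invocations below are illustrative (the argument names
are assumptions, so run each command with `--help` for the real
interface):
```
from django.core.management import call_command

# Export one realm to a scratch directory.
call_command("export", "--output=/tmp/zulip-export", "example-realm")

# Export a single user's message history and realm-public metadata.
call_command("export_single_user", "user@example.com")
```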
### Other useful manage.py commands