docs: Improve export documentation.

Added user and realm export guidance to the production maintenance
docs, linked it to the conversion guide, and revamped the introduction
and styling of the text that Steve wrote.
commit 2083ffa7a6 (parent d9d389f64f)
Author: Sumana Harihareswara, 2016-10-19 05:27:26 -04:00; committed by Tim Abbott
2 changed files with 154 additions and 103 deletions


# Exporting data from a large multi-realm Zulip server
## Draft status
This is a draft design document considering potential future
refinements and improvements to make large migrations easier going
forward, and is not yet a set of recommendations for Zulip systems
administrators to follow.
## Overview
Occasionally Zulip administrators will need to move data from one
server to another.
Zulip offers an export tool, `management/export.py`, which works well
to export the data for a single Zulip realm, and which is your best
choice if you're migrating a Zulip realm to a new server.
This document supplements the explanation in `management/export.py`,
but here we focus more on the logistics of a big conversion of a
multi-realm Zulip installation. (For some historical perspective, this
document was originally begun as part of a big Zulip cut-over in
summer 2016.)
There are many major operational aspects to doing a conversion. I will
list them here, noting that several are not within the scope of this
document:
- Get new servers running.
- Export data from the old DB.
- Export files from Amazon S3.
- Import files into new storage.
- Import data into new DB.
- Restart new servers.
- Decommission old server.
This document focuses almost entirely on the **export** piece. Issues
with getting Zulip itself running are out of scope here; see [the
production installation instructions](index.html#prod-install-docs).
As for the import side of things, we only touch on it implicitly. (My
reasoning was that we *had* to get the export piece right in a timely
fashion, even if it meant we would have to sort out some straggling
issues on the import side later.)
## Exporting multiple realms' data when moving to a new server
The main exporting tools in place as of summer 2016 are below:
- We can export single realms (but not yet limit users within the
realm).
- We can export single users (but then we get no realm-wide data in
the process).
- We can run exports simultaneously (but have to navigate a bunch of
/tmp directories).
Things that we still may need:
- We may want to export multiple realms simultaneously.
- We may want to export multiple single users simultaneously.
- We may want to limit users within realm exports.
- We may want more operational robustness/convenience while doing
several exports simultaneously.
- We may want to merge multiple export files to remove duplicates.
We have a few major classes of data. They are listed below in the order
that we process them in `do_export_realm()`:
#### Public Realm Data
`Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream`.
#### Cross Realm Data
`Client/zerver_userprofile_cross_realm`
This includes `Client` and three bots.
`Client` is unique in being a fairly core table that is not tied to
`UserProfile` or `Realm` (unless you somewhat painfully tie it back to
users in a bottom-up fashion through other tables).
#### Disjoint User Data
`UserProfile/UserActivity/UserActivityInterval/UserPresence`.
#### Recipient Data
`Recipient/Stream/Subscription/Huddle`.
These tables are tied back to users, but they introduce complications
when you try to deal with multi-user subsets.
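For context on that complexity: `Recipient` is a polymorphic join
point, with each row naming a user, a stream, or a huddle. A minimal
sketch of the fan-out, using Zulip's conventional type constants
(treat the exact field names as assumptions and verify against the
models):
```
# A Recipient row names its audience via (type, type_id).
PERSONAL, STREAM, HUDDLE = 1, 2, 3

def describe_recipient(recipient):
    if recipient.type == PERSONAL:
        return "private messages to user id %d" % recipient.type_id
    if recipient.type == STREAM:
        return "messages to stream id %d" % recipient.type_id
    if recipient.type == HUDDLE:
        return "group private messages, huddle id %d" % recipient.type_id
    raise ValueError("unknown recipient type %d" % recipient.type)
```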
#### File-related Data
`Attachment`
This includes `Attachment`, and it references the `avatar_source` field
of `UserProfile`. Most importantly, of course, it requires us to grab
files from S3. Finally, `Attachment`'s `m2m` relationship ties to
`Message`.
#### Message Data
`Message/UserMessage`
### Summary
We have two major mechanisms for getting data:
##### Top Down
Get realm data, then all users in realm, then all recipients, then all
messages, etc.
The problem with the top-down approach will be **filtering**. Also,
if errors arise during top-down passes, it may be time consuming to
re-run the processes.
##### Bottom Up
Start with users, get their recipient data, etc.
The problem with the bottom-up approach will be **merging**. Also,
if we run multiple bottom-up passes, there is the danger of
duplicating some work, particularly on the message side of things.
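To make the bottom-up shape concrete, here is a rough Django ORM
sketch of one pass over a user subset (model names are as in this
document; the field details are assumptions):
```
from zerver.models import Message, Subscription, UserMessage

def bottom_up_pass(user_ids):
    # Users -> Recipients, via their subscriptions.
    recipient_ids = set(
        Subscription.objects.filter(user_profile_id__in=user_ids)
        .values_list("recipient_id", flat=True))

    # Recipients -> Messages addressed to them.
    message_ids = set(
        Message.objects.filter(recipient_id__in=recipient_ids)
        .values_list("id", flat=True))

    # Users -> their per-user UserMessage rows.
    usermessage_ids = set(
        UserMessage.objects.filter(user_profile_id__in=user_ids)
        .values_list("id", flat=True))

    # Overlapping passes rediscover the same message ids, which is
    # the duplication problem described above.
    return recipient_ids, message_ids, usermessage_ids
```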
### Approved Transfers
which users can be moved.
### Message Data
- models: `Message`/`UserMessage`.
- assets: `messages-*.json`, subprocesses, partial files
Rows in the `Message` model depend on `Recipient/UserProfile`.
Rows in the `UserMessage` model depend on `UserProfile/Message`.
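To give a feel for how the `messages-*.json` partial files bound
memory use, here is a toy sketch of dumping messages in fixed-size
chunks (the chunk size and file layout are illustrative, not the
exporter's actual values):
```
import json
import os

from zerver.models import Message

CHUNK_SIZE = 1000  # illustrative only

def dump_messages(message_ids, output_dir):
    ids = sorted(message_ids)
    for chunk_num, start in enumerate(range(0, len(ids), CHUNK_SIZE), 1):
        chunk = ids[start:start + CHUNK_SIZE]
        rows = list(Message.objects.filter(id__in=chunk).values())
        path = os.path.join(output_dir, "messages-%06d.json" % chunk_num)
        with open(path, "w") as f:
            # default=str crudely stringifies datetimes for this sketch.
            json.dump(rows, f, default=str)
```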
The biggest concern here is the **sheer volume** of data, with
security being a close second. (They are interrelated, as without
We currently have these measures in place for top-down processing:
### File Related Data
- models: `Attachment`
- assets: S3, `attachment.json`, `uploads-temp/`, image files in
`avatars/`, assorted files in `uploads/`, `avatars/records.json`,
`uploads/records.json`, `zerver_attachment_messages`
When it comes to exporting attachment data, we have some minor volume
issues, but the main concern is just that there are **lots of moving
parts**:
- S3 needs to be up, and we get some metadata from it as well as
files.
- We have security concerns about copying over only files that belong
to users who approved the transfer.
- This piece is just different in how we store data from all the other
DB-centric pieces.
- At import time we have to populate the `m2m` table (but fortunately,
this is pretty low risk in terms of breaking anything.)
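As a sketch of the security-sensitive piece, copying down only the
files whose owners approved the transfer (the bucket name, key
layout, and metadata record shape are all assumptions here):
```
import os

import boto3

def fetch_approved_uploads(attachment_records, approved_user_ids, dest_dir):
    # attachment_records is assumed to be a list of dicts carrying
    # the owner's user profile id and the S3 key, as recorded in
    # attachment.json.
    s3 = boto3.client("s3")
    for record in attachment_records:
        if record["owner_user_id"] not in approved_user_ids:
            continue  # this owner did not approve the transfer
        local_path = os.path.join(dest_dir, record["s3_key"])
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file("example-zulip-uploads", record["s3_key"],
                         local_path)
```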
### Recipient Data
- models: `Recipient/Stream/Subscription/Huddle`
- assets: `realm.json`, `(user,stream,huddle)_(recipient,subscription)`
This data is fortunately low to medium in volume. The risk here will
come from **model complexity** and **cross-realm concerns**.
From the top down, here are the dependencies:
- `Recipient` depends on `UserProfile`
- `Subscription` depends on `Recipient`
- `Stream` currently depends on `Realm` (but maybe it should be tied
to `Subscription`)
- `Huddle` depends on `Subscription` and `UserProfile`
The biggest risk factor here is probably just the possibility that we
could introduce some bug in our code as we try to segment `Recipient`
into user, stream, and huddle components, especially if we try to
handle multiple users or realms. I think this can be largely
mitigated by the new `Config` approach.
And then we also have some complicated `Huddle` logic that will be
customized regardless. The fiddliest part of the `Huddle` logic is
creating the set of `unsafe_huddle_recipient_ids`.
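To illustrate the idea (this is not the exporter's actual code): a
huddle is unsafe to export when any participant falls outside the
approved user set.
```
from zerver.models import Recipient, Subscription

def get_unsafe_huddle_recipient_ids(approved_user_ids):
    unsafe_ids = set()
    for recipient in Recipient.objects.filter(type=Recipient.HUDDLE):
        participants = set(
            Subscription.objects.filter(recipient=recipient)
            .values_list("user_profile_id", flat=True))
        # One participant outside the approved set makes the whole
        # huddle's history unsafe to copy over.
        if not participants <= approved_user_ids:
            unsafe_ids.add(recipient.id)
    return unsafe_ids
```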
Last but not least, if we go with some hybrid of bottom-up and
top-down, these tables are neither close to the bottom nor close to
the top, so they may have the most fiddly edge cases when it comes to
filtering and merging.
Recommendation: We probably want to get a backup of all this data that
is very simply bulk-exported from the entire DB, and we should
obviously put it in a secure place.
### Cross Realm Data
- models: `Client`
- assets: `realm.json`, three bots (`notification`/`email`/`welcome`),
`id_maps`
The good news here is that `Client` is a small table, and there are
only three special bots.
The bad news is that cross-realm data **complicates everything else**,
and we have to avoid **database ID conflicts**.
If we use bottom-up approaches to load small user populations at a
time, we may have **merging** issues here. We will need to
consolidate IDs either by merging exports in `/tmp` or by handling
the conflicts at import time.
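One way to picture that consolidation (a toy sketch; the real
`id_maps` plumbing is more involved): allocate fresh IDs for each
incoming table and rewrite foreign keys through the resulting map.
```
def build_id_map(rows, next_id):
    # Allocate fresh sequential ids, remembering old -> new so that
    # foreign keys elsewhere can be rewritten consistently.
    id_map = {}
    for row in rows:
        id_map[row["id"]] = next_id
        row["id"] = next_id
        next_id += 1
    return id_map, next_id

def remap_foreign_keys(rows, fk_field, id_map):
    for row in rows:
        row[fk_field] = id_map[row[fk_field]]

# E.g., consolidate Client rows from two exports, then fix referrers:
# client_map, next_id = build_id_map(client_rows, next_id)
# remap_foreign_keys(useractivity_rows, "client_id", client_map)
```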
The three bots live in `zerver_userprofile_crossrealm`, and we
re-map their IDs on the new server.
Recommendation: Do not sweat the exports too much. Deal with all the
messiness at import time, and rely on the tables being really small.
We already have logic to catch `Client.DoesNotExist` exceptions, for
example. As for possibly missing messages that the welcome bot and
friends have sent in the past, I am not sure what our risk profile is
there, but I imagine it is relatively low.
### Disjoint User Data
- models: `UserProfile/UserActivity/UserActivityInterval/UserPresence`
- assets: `realm.json`, `password`, `api_key`, `avatar salt`,
`id_maps`
On the DB side this data should be fairly easy to deal with. All of
these tables are basically disjoint by user profile ID. Our biggest
risk is **remapped user IDs** at import time, but this is mostly
covered in the section above.
We have code in place to exclude `password` and `api_key` from
`UserProfile` rows. The import process calls
`set_unusable_password()`.
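In sketch form (the export-row shape here is an assumption, while
`set_unusable_password()` is the real Django API):
```
SECRET_FIELDS = ("password", "api_key")

def sanitize_user_profile_row(row):
    # Strip credentials from the exported UserProfile dict, rather
    # than shipping password hashes and API keys to the new server.
    for field in SECRET_FIELDS:
        row.pop(field, None)
    return row

# At import time, after the UserProfile row is created:
#     user_profile.set_unusable_password()
#     user_profile.save(update_fields=["password"])
```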
### Public Realm Data
- models: `Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream`
- assets: `realm.json`
All of these tables are public (per-realm), and they are keyed by
realm ID. There is not a ton to worry about here, except possibly
**merging** if we run multiple bottom-up jobs for a single realm.

---

secure Zulip installation, including:
- [Security Model](#security-model)
- [Management commands](#management-commands)
## Upgrading
**We recommend reading this entire section before doing your first
email), etc.
they do get large on a busy server, and it's definitely
lower-priority.
If you are interested in backups because you are moving from one Zulip
server to another server and can't transfer a full postgres dump
(which is definitely the simplest approach), our draft
[conversion and export design document](conversion.html) may help.
The tool is well designed and was tested carefully with dozens of
realms as of mid-2016 but is not integrated into Zulip's regular
testing process, and thus it is worth asking on the Zulip developers
mailing list whether it needs any minor updates to do things like
export newly added tables.
### Restore from backups
To restore from backups, the process is basically the reverse of the above:
Contribution of a step-by-step guide for setting this up (and moving
this configuration to be available in the main `puppet/zulip/` tree)
would be very welcome!
## Monitoring
The complete Nagios configuration (sans secret keys) used to
with the `--permission=api_super_user` argument. See
`bots/irc-mirror.py` and `bots/jabber_mirror.py` for further detail on
these.
#### Exporting users and realms with manage.py export
If you need to do an export of a single user or of an entire realm, we
have tools in `management/` that essentially export Zulip data to the
file system.
`export_single_user.py` exports the message history and realm-public
metadata for a single Zulip user.
A good overview of the process for exporting a single realm when
moving a realm to a new server (without moving a full database dump)
is in
[management/export.py](https://github.com/zulip/zulip/blob/master/zerver/management/commands/export.py). We
recommend you read the comment there for words of wisdom on speed,
what is and is not exported, what will break upon a move to a new
server, and suggested procedure.
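Both commands can be driven like any other Django management
command; the invocations below are illustrative (the argument names
are assumptions, so run each command with `--help` for the real
interface):
```
from django.core.management import call_command

# Export one realm to a scratch directory.
call_command("export", "--output=/tmp/zulip-export", "example-realm")

# Export a single user's message history and realm-public metadata.
call_command("export_single_user", "user@example.com")
```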
### Other useful manage.py commands