# Exporting data from a large multi-realm Zulip server ## Draft status This is a draft design document considering potential future refinements and improvements to make large migrations easier going forward, and is not yet a set of recommendations for Zulip systems administrators to follow. ## Overview Zulip offers an export tool, `management/export.py`, which works well to export the data for a single Zulip realm, and which is your best choice if you're migrating a Zulip realm to a new server. This document supplements the explanation in `management/export.py`, but here we focus more on the logistics of a big conversion of a multi-realm Zulip installation. (For some historical perspective, this document was originally begun as part of a big Zulip cut-over in summer 2016.) There are many major operational aspects to doing a conversion. I will list them here, noting that several are not within the scope of this document: - Get new servers running. - Export data from the old DB. - Export files from Amazon S3. - Import files into new storage. - Import data into new DB. - Restart new servers. - Decommission old server. This document focuses almost entirely on the **export** piece. Issues with getting Zulip itself running are out of scope here; see [the production installation instructions](index.html#zulip-in-production). As for the import side of things, we only touch on it implicitly. (My reasoning was that we *had* to get the export piece right in a timely fashion, even if it meant we would have to sort out some straggling issues on the import side later.) ## Exporting multiple realms' data when moving to a new server The main exporting tools in place as of summer 2016 are below: - We can export single realms (but not yet limit users within the realm). - We can export single users (but then we get no realm-wide data in the process). - We can run exports simultaneously (but have to navigate a bunch of /tmp directories). Things that we still may need: - We may want to export multiple realms simultaneously. - We may want to export multiple single users simultaneously. - We may want to limit users within realm exports. - We may want more operational robustness/convenience while doing several exports simultaneously. - We may want to merge multiple export files to remove duplicates. We have a few major classes of data. They are listed below in the order that we process them in `do_export_realm()`: #### Public Realm Data `Realm/RealmDomain/RealmEmoji/RealmFilter/DefaultStream`. #### Cross Realm Data `Client/zerver_userprofile_cross_realm` This includes `Client` and three bots. `Client` is unique in being a fairly core table that is not tied to `UserProfile` or `Realm` (unless you somewhat painfully tie it back to users in a bottom-up fashion though other tables). #### Disjoint User Data `UserProfile/UserActivity/UserActivityInterval/UserPresence`. #### Recipient Data `Recipient/Stream/Subscription/Huddle`. These tables are tied back to users, but they introduce complications when you try to deal with multi-user subsets. #### File-related Data `Attachment` This includes `Attachment`, and it references the `avatar_source` field of `UserProfile`. Most importantly, of course, it requires us to grab files from S3. Finally, `Attachment`'s `m2m` relationship ties to `Message`. #### Message Data `Message/UserMessage` ### Summary Here are the same classes of data, listed in roughly decreasing order of riskiness: - Message Data (sheer volume/lack of time/security) - File-Related Data (S3/security/lots of moving parts) - Recipient Data (complexity/security/cross-realm considerations) - Cross Realm Data (duplicate ids) - Disjoint User Data - Public Realm Data (Note the above list is essentially in reverse order of how we process the data, which isn't surprising for a top-down approach.) The next section of the document talks about risk factors. # Risk Mitigation ## Generic considerations We have two major mechanisms for getting data: ##### Top Down Get realm data, then all users in realm, then all recipients, then all messages, etc. The problem with the top-down approach will be **filtering**. Also, if errors arise during top-down passes, it may be time consuming to re-run the processes. ##### Bottom Up Start with users, get their recipient data, etc. The problems with the bottom up approach will be **merging**. Also, if we run multiple bottom-up passes, there is the danger of duplicating some work, particularly on the message side of things. ### Approved Transfers We have not yet integrated the approved-transfer model, which tells us which users can be moved. ## Risk factors broken out by data categories ### Message Data - models: `Message`/`UserMessage`. - assets: `messages-*.json`, subprocesses, partial files Rows in the `Message` model depend on `Recipient/UserProfile`. Rows in the `UserMessage` model depend on `UserProfile/Message`. The biggest concern here is the **sheer volume** of data, with security being a close second. (They are interrelated, as without security concerns, we could just bulk-export everything one time.) We currently have these measures in place for top-down processing: - chunking - multi-processing - messages are filtered by both sender and recipient ### File Related Data - models: `Attachment` - assets: S3, `attachment.json`, `uploads-temp/`, image files in `avatars/`, assorted files in `uploads/`, `avatars/records.json`, `uploads/records.json`, `zerver_attachment_messages` When it comes to exporting attachment data, we have some minor volume issues, but the main concern is just that there are **lots of moving parts**: - S3 needs to be up, and we get some metadata from it as well as files. - We have security concerns about copying over only files that belong to users who approved the transfer. - This piece is just different in how we store data from all the other DB-centric pieces. - At import time we have to populate the `m2m` table (but fortunately, this is pretty low risk in terms of breaking anything.) ### Recipient Data - models: `Recipient/Stream/Subscription/Huddle` - assets: `realm.json`, `(user,stream,huddle)_(recipient,subscription)` This data is fortunately low to medium in volume. The risk here will come from **model complexity** and **cross-realm concerns**. From the top down, here are the dependencies: - `Recipient` depends on `UserProfile` - `Subscription` depends on `Recipient` - `Stream` currently depends on `Realm` (but maybe it should be tied to `Subscription`) - `Huddle` depends on `Subscription` and `UserProfile` The biggest risk factor here is probably just the possibility that we could introduce some bug in our code as we try to segment `Recipient` into user, stream, and huddle components, especially if we try to handle multiple users or realms. I think this can be largely mitigated by the new `Config` approach. And then we also have some complicated `Huddle` logic that will be customized regardless. The fiddliest part of the `Huddle` logic is creating the set of `unsafe_huddle_recipient_ids`. Last but not least, if we go with some hybrid of bottom-up and top-down, these tables are neither close to the bottom nor close to the top, so they may have the most fiddly edge cases when it comes to filtering and merging. Recommendation: We probably want to get a backup of all this data that is very simply bulk-exported from the entire DB, and we should obviously put it in a secure place. ### Cross Realm Data - models: `Client` - assets: `realm.json`, three bots (`notification`/`email`/`welcome`), `id_maps` The good news here is that `Client` is a small table, and there are only three special bots. The bad news is that cross-realm data **complicates everything else**, and we have to avoid **database ID conflicts**. If we use bottom-up approaches to load small user populations at a time, we may have **merging** issues here. We will need to consolidate IDs either by merging exports in `/tmp` or handle it at import time. For the three bots, they live in `zerver_userprofile_crossrealm`, and we re-map their IDs on the new server. Recommendation: Do not sweat the exports too much. Deal with all the messiness at import time, and rely on the tables being really small. We already have logic to catch `Client.DoesNotExist` exceptions, for example. As for possibly missing messages that the welcome bot and friends have sent in the past, I am not sure what our risk profile is there, but I imagine it is relatively low. ### Disjoint User Data - models: `UserProfile/UserActivity/UserActivityInterval/UserPresence` - assets: `realm.json`, `password`, `api_key`, `avatar salt`, `id_maps` On the DB side this data should be fairly easy to deal with. All of these tables are basically disjoint by user profile ID. Our biggest risk is **remapped user ids** at import time, but this is mostly covered in the section above. We have code in place to exclude `password` and `api_key` from `UserProfile` rows. The import process calls `set_unusable_password()`. ### Public Realm Data - models: `Realm/RealmDomain/RealmEmoji/RealmFilter/DefaultStream` - asserts: `realm.json` All of these tables are public (per-realm), and they are keyed by realm ID. There is not a ton to worry about here, except possibly **merging** if we run multiple bottom-up jobs for a single realm.