diff --git a/docs/subsystems/conversion.md b/docs/subsystems/conversion.md
deleted file mode 100644
index 153ef15153..0000000000
--- a/docs/subsystems/conversion.md
+++ /dev/null
@@ -1,269 +0,0 @@
-# Exporting data from a large multi-realm Zulip server
-
-## Draft status
-
-This is a draft design document considering potential future
-refinements and improvements to make large migrations easier going
-forward; it is not yet a set of recommendations for Zulip systems
-administrators to follow.
-
-## Overview
-
-Zulip offers an export tool, `management/export.py`, which works well
-for exporting the data of a single Zulip realm, and which is your best
-choice if you're migrating a Zulip realm to a new server.
-
-This document supplements the explanation in `management/export.py`,
-but here we focus more on the logistics of a big conversion of a
-multi-realm Zulip installation. (For some historical perspective, this
-document was originally begun as part of a big Zulip cut-over in
-summer 2016.)
-
-There are many major operational aspects to doing a conversion. I will
-list them here, noting that several are not within the scope of this
-document:
-
-- Get new servers running.
-- Export data from the old DB.
-- Export files from Amazon S3.
-- Import files into new storage.
-- Import data into new DB.
-- Restart new servers.
-- Decommission old server.
-
-This document focuses almost entirely on the **export** piece. Issues
-with getting Zulip itself running are out of scope here; see [the
-production installation instructions](../index.html#zulip-in-production).
-As for the import side of things, we only touch on it implicitly. (My
-reasoning was that we *had* to get the export piece right in a timely
-fashion, even if it meant we would have to sort out some straggling
-issues on the import side later.)
-
-## Exporting multiple realms' data when moving to a new server
-
-The main exporting tools in place as of summer 2016 are below:
-
-- We can export single realms (but not yet limit users within the
-  realm).
-- We can export single users (but then we get no realm-wide data in
-  the process).
-- We can run exports simultaneously (but have to navigate a bunch of
-  `/tmp` directories).
-
-Things that we still may need:
-
-- We may want to export multiple realms simultaneously.
-- We may want to export multiple single users simultaneously.
-- We may want to limit users within realm exports.
-- We may want more operational robustness/convenience while doing
-  several exports simultaneously.
-- We may want to merge multiple export files to remove duplicates.
-
-We have a few major classes of data. They are listed below in the
-order that we process them in `do_export_realm()`:
-
-#### Public realm data
-
-`Realm/RealmDomain/RealmEmoji/RealmFilter/DefaultStream`
-
-#### Cross realm data
-
-`Client/zerver_userprofile_cross_realm`
-
-This includes `Client` and three bots.
-
-`Client` is unique in being a fairly core table that is not tied to
-`UserProfile` or `Realm` (unless you somewhat painfully tie it back to
-users in a bottom-up fashion through other tables).
-
-#### Disjoint user data
-
-`UserProfile/UserActivity/UserActivityInterval/UserPresence`
-
-#### Recipient data
-
-`Recipient/Stream/Subscription/Huddle`
-
-These tables are tied back to users, but they introduce complications
-when you try to deal with multi-user subsets.
-
-#### File-related data
-
-`Attachment`
-
-This includes `Attachment`, which references the `avatar_source` field
-of `UserProfile`. Most importantly, of course, it requires us to grab
-files from S3. Finally, `Attachment`'s `m2m` relationship ties it to
-`Message`.
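-
-To make the moving parts concrete, here is a minimal sketch of what
-grabbing the file-related data could look like. It is illustrative
-only: `export_realm_attachments` is a hypothetical helper, the bucket
-name is a parameter you would supply, and it assumes an `Attachment`
-schema with `path_id`/`owner_id`/`realm_id` fields plus an S3 bucket
-reachable via `boto3`.
-
-```python
-import json
-import os
-
-import boto3
-
-from zerver.models import Attachment
-
-
-def export_realm_attachments(realm_id, bucket_name, output_dir):
-    # Hypothetical sketch: dump Attachment metadata to attachment.json,
-    # then fetch each referenced file from S3 into uploads-temp/.
-    s3 = boto3.client("s3")
-    records = [
-        {"id": a.id, "path_id": a.path_id, "owner_id": a.owner_id}
-        for a in Attachment.objects.filter(realm_id=realm_id)
-    ]
-    with open(os.path.join(output_dir, "attachment.json"), "w") as f:
-        json.dump(records, f)
-    for record in records:
-        local_path = os.path.join(output_dir, "uploads-temp", record["path_id"])
-        os.makedirs(os.path.dirname(local_path), exist_ok=True)
-        # Assumes S3 keys match path_id; the real export also copies
-        # avatars and writes the records.json manifests.
-        s3.download_file(bucket_name, record["path_id"], local_path)
-```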
-
-#### Message data
-
-`Message/UserMessage`
-
-### Summary
-
-Here are the same classes of data, listed in roughly decreasing order
-of riskiness:
-
-- Message data (sheer volume/lack of time/security)
-- File-related data (S3/security/lots of moving parts)
-- Recipient data (complexity/security/cross-realm considerations)
-- Cross realm data (duplicate IDs)
-- Disjoint user data
-- Public realm data
-
-(Note that the above list is essentially in reverse order of how we
-process the data, which isn't surprising for a top-down approach.)
-
-The next section of the document talks about risk factors.
-
-## Risk mitigation
-
-### Generic considerations
-
-We have two major mechanisms for getting data:
-
-##### Top down
-
-Get realm data, then all users in the realm, then all recipients, then
-all messages, etc.
-
-The main problem with the top-down approach is **filtering**. Also, if
-errors arise during top-down passes, it may be time-consuming to
-re-run the processes.
-
-##### Bottom up
-
-Start with users, get their recipient data, etc.
-
-The main problem with the bottom-up approach is **merging**. Also, if
-we run multiple bottom-up passes, there is the danger of duplicating
-some work, particularly on the message side of things.
-
-### Approved transfers
-
-We have not yet integrated the approved-transfer model, which tells us
-which users can be moved.
-
-### Risk factors broken out by data categories
-
-#### Message data
-
-- models: `Message`/`UserMessage`
-- assets: `messages-*.json`, subprocesses, partial files
-
-Rows in the `Message` model depend on `Recipient/UserProfile`.
-
-Rows in the `UserMessage` model depend on `UserProfile/Message`.
-
-The biggest concern here is the **sheer volume** of data, with
-security being a close second. (They are interrelated, as without
-security concerns, we could just bulk-export everything one time.)
-
-We currently have these measures in place for top-down processing:
-
-- chunking
-- multi-processing
-- messages are filtered by both sender and recipient
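-
-To illustrate the chunking-plus-filtering approach, here is a rough
-sketch of a chunked, filtered message export. This is not the actual
-implementation in `management/export.py`; the function name, chunk
-size, and serialized fields are all assumptions.
-
-```python
-import json
-import os
-
-from django.db.models import Q
-
-from zerver.models import Message
-
-MESSAGE_CHUNK_SIZE = 1000  # chunk size here is an assumption
-
-
-def export_messages_in_chunks(user_ids, recipient_ids, output_dir):
-    # Hypothetical sketch: walk the Message table in id order, keeping
-    # only rows sent by (or to) the users being exported, and write
-    # fixed-size chunks as messages-000001.json, messages-000002.json, ...
-    query = Message.objects.filter(
-        Q(sender_id__in=user_ids) | Q(recipient_id__in=recipient_ids)
-    ).order_by("id")
-    min_id = -1
-    chunk_num = 0
-    while True:
-        chunk = list(query.filter(id__gt=min_id)[:MESSAGE_CHUNK_SIZE])
-        if not chunk:
-            break
-        chunk_num += 1
-        path = os.path.join(output_dir, "messages-%06d.json" % (chunk_num,))
-        with open(path, "w") as f:
-            # Serializing only a few fields here; the real export writes
-            # full rows plus the related UserMessage data.
-            json.dump(
-                [{"id": m.id, "sender": m.sender_id} for m in chunk], f
-            )
-        min_id = chunk[-1].id
-```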
-
-#### File-related data
-
-- models: `Attachment`
-- assets: S3, `attachment.json`, `uploads-temp/`, image files in
-  `avatars/`, assorted files in `uploads/`, `avatars/records.json`,
-  `uploads/records.json`, `zerver_attachment_messages`
-
-When it comes to exporting attachment data, we have some minor volume
-issues, but the main concern is just that there are **lots of moving
-parts**:
-
-- S3 needs to be up, and we get some metadata from it as well as
-  files.
-- We have security concerns about copying over only files that belong
-  to users who approved the transfer.
-- This piece is just different from all the other DB-centric pieces in
-  how we store data.
-- At import time we have to populate the `m2m` table (but fortunately,
-  this is pretty low-risk in terms of breaking anything).
-
-#### Recipient data
-
-- models: `Recipient/Stream/Subscription/Huddle`
-- assets: `realm.json`, `(user,stream,huddle)_(recipient,subscription)`
-
-This data is fortunately low to medium in volume. The risk here will
-come from **model complexity** and **cross-realm concerns**.
-
-From the top down, here are the dependencies:
-
-- `Recipient` depends on `UserProfile`
-- `Subscription` depends on `Recipient`
-- `Stream` currently depends on `Realm` (but maybe it should be tied
-  to `Subscription`)
-- `Huddle` depends on `Subscription` and `UserProfile`
-
-The biggest risk factor here is probably just the possibility that we
-could introduce some bug in our code as we try to segment `Recipient`
-into user, stream, and huddle components, especially if we try to
-handle multiple users or realms. I think this can be largely mitigated
-by the new `Config` approach.
-
-And then we also have some complicated `Huddle` logic that will be
-customized regardless. The fiddliest part of the `Huddle` logic is
-creating the set of `unsafe_huddle_recipient_ids`.
-
-Last but not least, if we go with some hybrid of bottom-up and
-top-down, these tables are neither close to the bottom nor close to
-the top, so they may have the most fiddly edge cases when it comes to
-filtering and merging.
-
-Recommendation: We probably want to get a backup of all this data that
-is very simply bulk-exported from the entire DB, and we should
-obviously put it in a secure place.
-
-#### Cross realm data
-
-- models: `Client`
-- assets: `realm.json`, three bots (`notification`/`email`/`welcome`),
-  `id_maps`
-
-The good news here is that `Client` is a small table, and there are
-only three special bots.
-
-The bad news is that cross-realm data **complicates everything else**,
-and we have to avoid **database ID conflicts**.
-
-If we use bottom-up approaches to load small user populations at a
-time, we may have **merging** issues here. We will need to consolidate
-IDs, either by merging exports in `/tmp` or by handling conflicts at
-import time.
-
-The three bots live in `zerver_userprofile_crossrealm`, and we re-map
-their IDs on the new server.
-
-Recommendation: Do not sweat the exports too much. Deal with all the
-messiness at import time, and rely on the tables being really small.
-We already have logic to catch `Client.DoesNotExist` exceptions, for
-example. As for possibly missing messages that the welcome bot and
-friends have sent in the past, I am not sure what our risk profile is
-there, but I imagine it is relatively low.
-
-#### Disjoint user data
-
-- models: `UserProfile/UserActivity/UserActivityInterval/UserPresence`
-- assets: `realm.json`, `password`, `api_key`, `avatar salt`,
-  `id_maps`
-
-On the DB side this data should be fairly easy to deal with. All of
-these tables are basically disjoint by user profile ID. Our biggest
-risk is **remapped user IDs** at import time, but this is mostly
-covered in the section above.
-
-We have code in place to exclude `password` and `api_key` from
-`UserProfile` rows. The import process calls
-`set_unusable_password()`.
-
-#### Public realm data
-
-- models: `Realm/RealmDomain/RealmEmoji/RealmFilter/DefaultStream`
-- assets: `realm.json`
-
-All of these tables are public (per-realm), and they are keyed by
-realm ID. There is not a ton to worry about here, except possibly
-**merging** if we run multiple bottom-up jobs for a single realm.
diff --git a/docs/subsystems/index.rst b/docs/subsystems/index.rst
index c02bdd760f..b19797b703 100644
--- a/docs/subsystems/index.rst
+++ b/docs/subsystems/index.rst
@@ -31,7 +31,6 @@ Subsystems documentation
    django-upgrades
    release-checklist
    api-release-checklist
-   conversion
    input-pills
    thumbnailing
    presence