Our recently-added code for rewriting user IDs on data import didn't
correctly handle wildcard mentions and mentions generated by very old
versions of Zulip (pre data-user-id).
The previous query ended up doing an awkward join that did not
guarantee use of the Recipient index on zerver_message, turning a very
fast query into something that could take much longer for a single
stream than the rest of the import combined.
The lxml parser appends html and body tags to the soup object, which
are not required. There are no other major parsing differences between
the two parsers as long as the HTML input is perfectly formatted.
lxml parser is much faster than html.parser but it hardly matters
in our case.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers
Previously, if you exported a Zulip organization and then re-imported
it, we'd end up renumbering the user IDs and all direct foreign key
references to them in the database, but not the data-user-id
references in mentions. Fix this by parsing the message content and
doing that renumbering.
(Because we import raw markdown, not HTML, from third-party tools,
these changes won't affect data import from Slack etc.)
Fixes the high-priority part of #11293.
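A minimal sketch of the renumbering idea, for illustration only. The
actual change parses the message content with an HTML parser; this
sketch swaps in a regular expression so that it needs only the standard
library. The function name, the shape of the ID map, and the use of "*"
as the wildcard value are assumptions, not taken from the commit.

```python
import re
from typing import Dict

# Matches the data-user-id attribute in rendered mention markup.
USER_ID_ATTR = re.compile(r'data-user-id="([^"]*)"')


def remap_mention_user_ids(html: str, user_id_map: Dict[int, int]) -> str:
    """Rewrite data-user-id attributes using an old-id -> new-id mapping.

    Hypothetical helper; not the function used by the import code.
    """

    def replace(match: "re.Match[str]") -> str:
        old_value = match.group(1)
        # Assumed: wildcard mentions carry a non-numeric value such as "*",
        # so anything that is not a plain integer is left untouched.
        if not old_value.isdigit():
            return match.group(0)
        old_id = int(old_value)
        # IDs missing from the map are kept as-is rather than guessed.
        new_id = user_id_map.get(old_id, old_id)
        return f'data-user-id="{new_id}"'

    return USER_ID_ATTR.sub(replace, html)
```

Mentions without a data-user-id attribute (as produced by very old
versions) simply do not match the pattern and pass through unchanged.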
This field is primarily intended to support avoiding displaying the
"more topics" feature in new organizations and streams, where we might
know that all messages in the stream are already available in the
browser.
Based on original work by Roman Godov, and significantly modified by
tabbott.
The second migration involved here could be expensive on Zulip Cloud,
but is unlikely to be an issue on other servers.
In commit de65a04 we can see that if the need ever arises to modify
how stream descriptions are rendered, we would need to make changes
at 5 different call points which can be quite cumbersome. So this
functionality has been extracted to a new method called
'render_stream_descriptions'.
This commit leverages the Aho-Corasick algorithm to build a set of user_ids
that have their alert_words present in the message. It runs in time linear
in the length of the input message, rather than in the number of
alert_words. This is after building an ahocorasick Automaton, which takes
O(number of alert_words in the entire realm) and is usually cached.
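To make the complexity claim concrete, here is a small pure-Python
sketch of the Aho-Corasick technique. It is not the library-backed code
from this commit: the function names and the `words_by_user` input shape
are hypothetical, and it does plain substring matching, whereas real
alert-word matching may also need word-boundary checks.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Automaton = Tuple[List[Dict[str, int]], List[int], List[Set[int]]]


def build_automaton(words_by_user: Dict[int, List[str]]) -> Automaton:
    # Trie stored as parallel lists: transitions, failure links, outputs.
    goto: List[Dict[str, int]] = [{}]
    fail: List[int] = [0]
    out: List[Set[int]] = [set()]
    for user_id, words in words_by_user.items():
        for word in words:
            if not word:
                continue
            node = 0
            for ch in word.lower():
                if ch not in goto[node]:
                    goto.append({})
                    fail.append(0)
                    out.append(set())
                    goto[node][ch] = len(goto) - 1
                node = goto[node][ch]
            out[node].add(user_id)
    # Breadth-first pass: depth-1 nodes fail to the root, deeper nodes
    # follow their parent's failure chain.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (suffix words).
            out[child] |= out[fail[child]]
            queue.append(child)
    return goto, fail, out


def users_with_alert_words(message: str, automaton: Automaton) -> Set[int]:
    goto, fail, out = automaton
    matched: Set[int] = set()
    node = 0
    # Single pass over the message; each character follows failure links
    # only as far as needed.
    for ch in message.lower():
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        matched |= out[node]
    return matched
```

The automaton is built once per realm and can be cached; only the scan
in `users_with_alert_words` runs per message.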
We want to use the baseline features of bugdown, but not fancy things
like inline URL previews, since the whole structure of stream
descriptions is to have a single-line thing supporting some
formatting.
The migration part of this change fixes a bug encountered by some
organizations upgrading from older versions of Zulip.
For a while, we've had logic to set plan_type to LIMITED when importing
into Zulip Cloud; we need corresponding logic to set it to SELF_HOSTED
when importing into a self-hosted server.
Fixes #11541.
This helps keep the realm.json small and easy to process; previously,
almost the entire size of that file was the analytics data.
We implement this by refactoring the analytics Config objects into a
separate subroutine that writes to a separate file, plus the
corresponding import code.
Manual testing was performed by exporting the 'analytics' realm, and
importing back to a newly created 'test' realm. The 'test' realm was
then exported and the json files were inspected. The data appeared
consistent with no abnormalities.
Fixes: #11220.
This commit does the following three things:
1. Update stream model to accommodate rendered description.
2. Render and save the stream rendered description on update.
3. Render and save stream descriptions on creation.
Further, the stream's rendered description is also sent whenever the
stream's description is being sent.
This is preparatory work for eliminating the use of the
non-authoritative marked.js markdown parser for stream descriptions.
This should eliminate the need to do manual analytics work when
importing organizations imported/exported using the zulip -> zulip
import/export tools.
The octet-stream content type is potentially under-specified, but it's
better than potentially submitting None and increases consistency of
this part of the codebase.
The boto library's s3 interface allows setting only string-format
metadata keys. So we need to cast the last_modified floating-point
timestamp into a string before storing on the S3 object.
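A minimal sketch of the cast, using a plain dict in place of the S3 key
object. The helper name and the `realm_id` entry are hypothetical; the
only point illustrated is that every metadata value, including the
floating-point timestamp, ends up as a string.

```python
from typing import Dict


def build_upload_metadata(realm_id: int, last_modified: float) -> Dict[str, str]:
    # Hypothetical helper: every metadata value is cast to str,
    # including the floating-point last_modified timestamp.
    return {
        "realm_id": str(realm_id),
        "last_modified": str(last_modified),
    }
```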
This bug mostly broke uploading avatars when using the S3 storage backend.
Our HipChat conversion tool didn't properly handle basic avatar
images, resulting in only the medium-size avatar images being imported
properly. This fixes that bug by asking the import tool to do the
thumbnailing for the basic avatar image (from the .original file) as
well as the medium avatar image.
Fixes a bug in import_realm where secondary attributes like message
visibility weren't being set, and also makes bugs like this less likely in
the future.
Also, put the plan_type change at the end of import_realm, so that
future restrictions to LIMITED realms don't affect the import process.
We've had a long stream of bugs that existed because only one of these two
code paths was tested (usually the local uploads backend). By
deduplicating these functions, we ensure that this category of bugs no
longer happens.
Following my recent refactor, this is just a straightforward merge,
with code for one or the other backend ending up inside an if
statement.
Previously, we were incorrectly importing avatar PNGs to a filename
without the .png extension, resulting in them effectively not being
imported.
This was mitigated by the fact that we imported the originals and ran
the appropriate `ensure_` functions, but still a bug.
This commit speeds up the import by avoiding
sender lookups and instead using the data
for users that we already have in memory.
This avoids a few DB hops, many hops to memcached,
plus some object construction.
We now call do_render_markdown() directly. This
also makes it more explicit that the import has
never rendered alert words.
This function requires a message object, whereas
we want to work with JSON data to avoid unnecessary
queries when we import data. Inlining the function
sets us up for a subsequent refactoring.
We change the way we deal with theoretical return
values of `None` to use an assertion; otherwise,
we would have to loosen up a bunch of mypy types
from `str` to `Optional[str]`. It's not clear `None`
is even possible--we've moved toward throwing exceptions
there instead of silently failing.
The previous logic was incorrect, in that if `content_type` was set to
None (which happens with Slack/HipChat export, among other things),
then we wouldn't run the `guess_type` logic to auto-detect the
Content-Type to send to S3.
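The corrected fallback order can be sketched as below. The function
name is hypothetical, and the final octet-stream default is an
assumption borrowed from the related change to that part of the code;
`mimetypes.guess_type` is the standard-library call.

```python
import mimetypes
from typing import Optional


def resolve_content_type(content_type: Optional[str], filename: str) -> str:
    # Only guess when the export did not supply a content type.
    if content_type is None:
        content_type = mimetypes.guess_type(filename)[0]
    # Assumed last resort when the extension is unknown.
    if content_type is None:
        content_type = "application/octet-stream"
    return content_type
```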
The UserMessage table can be huge, so creating a
bunch of entries in `ID_MAP` can overflow memory.
We don't have any tables that depend on `UserMessage`,
and we don't send the 'id' fields from `zerver_usermessage`
to the database, so re-mapping them was just busy-work.
When we create new ids for message rows, we
now sort the new ids by their corresponding
pub_date values in the rows.
This takes a sizable chunk of memory.
This feature only gets turned on if you
set sort_by_date to True in realm.json.
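Roughly, the date-based allocation works like this sketch. The function
name and the row shape (dicts with `id` and `pub_date`) are assumptions
made for illustration; the real code operates on the exported message
rows.

```python
from typing import Any, Dict, List


def sort_ids_by_date(rows: List[Dict[str, Any]], new_ids: List[int]) -> Dict[int, int]:
    """Return an old-id -> new-id map where earlier pub_date gets a smaller id."""
    assert len(rows) == len(new_ids)
    # Holding a sorted copy of all rows is where the extra memory goes.
    rows_by_date = sorted(rows, key=lambda row: row["pub_date"])
    return {
        row["id"]: new_id
        for row, new_id in zip(rows_by_date, sorted(new_ids))
    }
```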
We use UserMessageLite to avoid Django overhead, and we
do updates in chunks of 10000. (The export may be broken
into several files already, but a reasonable chunking at
import time is good defense against running out of memory.)
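The chunking itself is simple; a sketch with a hypothetical helper name
follows. The default of 10000 mirrors the chunk size mentioned above,
and each chunk would be handed to the bulk insert in turn.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(rows: Sequence[T], chunk_size: int = 10000) -> Iterator[List[T]]:
    # Yield successive slices so only one chunk is materialized at a time.
    for start in range(0, len(rows), chunk_size):
        yield list(rows[start:start + chunk_size])
```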
The code was needlessly querying the DB to get full
objects for entities where we only needed user_id,
realm_id, and stream_id.
With my test data of ~1000 records this sped up the
function from ~8s to ~0.5s. The speedup would probably
be even more for larger data sets.
If any user had sent the reply to the welcome bot recommended by our
tutorial, then the Zulip export/import process didn't work properly,
because we weren't including (and then remapping) the recipient ID for
sending PMs to the cross-realm bots. This commit fixes that gap, by
recording the necessary data on the export side, and doing the
appropriate remapping on the import side.
Previously, our realm import logic only did the special remapping
logic for the original notifications_stream_id; when we added the new
signup_notifications_stream_id field, we neglected to handle it in the
same way.
The 'last_modified' value in emoji records is
needed for uploading the file to the S3 backend.
We set the same in the function 'import_uploads_s3'.
We also have to remove the keyword 'last_modified'
while building the RealmEmoji dict, as it is not
a field which exists in RealmEmoji objects.
The s3 import code path made a hard assumption about `user_profile_id`
being set (we'd already fixed this in the local uploads code path).
Ideally, it should be, and I've opened #10268 for fixing that, but for
now this is how it needs to work.
After the messages have been imported, set the rendered_content of the
messages instead of leaving its value as 'None'.
This is important to ensure that:
(1) Performance for users is good after completing the import.
(2) The database's full-text indexes have all of the imported messages
(which only happens properly when Message rows have their
rendered_content field edited).
Fixes #9168.
random_api_key, the function we use to generate random tokens for API
keys, has been moved to zerver/lib/utils.py because it's used in more
parts of the codebase (apart from user creation), and having it in
zerver/lib/create_user.py was prone to cyclic dependencies.
The function has also been renamed to generate_api_key, an
imperative name that makes clearer what it does.
Implement this function in 'bulk_import_model'
and 'update_model_ids'.
This lets us save on redundant-feeling arguments in these
frequently-called helper functions.
The function 'update_model_ids' should be used on
the models BotStorageData and BotConfigData.
It is wrongly added here for UserGroup model.
Also the sequence name for BotStorageData and
BotConfigData is 'zerver_botuserstatedata_id_seq' and
'zerver_botuserconfigdata_id_seq' respectively, which
should be specifically mentioned in the function
'allocate_ids'.
This fixes some nondeterministic test failures.
This will be used for any ManyToMany field that
is being imported.
We add an internal function which takes in the old ID list
of the ManyToMany field and returns the new updated ID list.
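In spirit, the helper looks like the following sketch. The names and
the row/ID-map shapes are hypothetical; the real code works against the
importer's own ID map.

```python
from typing import Any, Dict, List


def remap_many_to_many(
    rows: List[Dict[str, Any]], field_name: str, id_map: Dict[int, int]
) -> None:
    def get_new_ids(old_ids: List[int]) -> List[int]:
        # Internal helper: old ID list in, new ID list out, order preserved.
        return [id_map[old_id] for old_id in old_ids]

    for row in rows:
        row[field_name] = get_new_ids(row[field_name])
```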
For importing huddles we have to have unique huddle hashes.
Huddle hashes are extracted from the list of users participating
in a huddle. So to extract these user ids, we first use the huddle
id to get the matching recipient, and then we use subscription
to get the user ids from the recipient id.
Added tests for the same (tests slightly tweaked by tabbott).
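The lookup chain described above can be sketched as follows. The row
shapes, the `HUDDLE_TYPE` constant, and the exact hash recipe (SHA-1 of
the sorted, comma-joined user ids) are assumptions for illustration and
may not match the server's helper exactly.

```python
import hashlib
from typing import Any, Dict, Iterable, List

HUDDLE_TYPE = 3  # assumed value marking huddle recipients


def user_ids_for_huddle(
    huddle_id: int,
    recipients: List[Dict[str, Any]],
    subscriptions: List[Dict[str, Any]],
) -> List[int]:
    # Step 1: huddle id -> matching recipient row.
    recipient_id = next(
        r["id"]
        for r in recipients
        if r["type"] == HUDDLE_TYPE and r["type_id"] == huddle_id
    )
    # Step 2: recipient id -> user ids via the subscription rows.
    return [s["user_profile"] for s in subscriptions if s["recipient"] == recipient_id]


def huddle_hash(user_ids: Iterable[int]) -> str:
    # Sorting and de-duplicating makes the hash independent of input order.
    key = ",".join(str(user_id) for user_id in sorted(set(user_ids)))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()
```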
* If `zerver_realmauditlog` is present in the exported data,
`RealmAuditLog` would be imported normally.
* If it is not present, the `create_subscription_events`
function would create the `subscription_created`
events for RealmAuditLog. The reason this function
is in `import_realm` module and not in the individual
export tool scripts (like Slack) is because this
function would be common for all export tools.
This fixes #9846 for users who have not already done an import of
their organization from Slack.
Fixes #9846.
For the S3 backend uploads, 'attachment_path' should be
saved with the 's3_path' of the record, as the original
'path' is changed while exporting files from s3. (See
function 'export_files_from_s3' in export.py for reference.)
In 'zerver_reaction', the emoji_code should be updated
with the RealmEmoji allocated id when the 'reaction_type'
is 'realm_emoji'. Hence we add an extra field 'reaction_field'
in 're_map_foreign_keys', to process the above mentioned
condition.
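The conditional remapping amounts to something like this sketch. The
function name, the row shape, and the choice to store emoji_code as a
string are assumptions; the real logic lives inside
're_map_foreign_keys'.

```python
from typing import Any, Dict, List


def remap_reaction_emoji_codes(
    reactions: List[Dict[str, Any]], realm_emoji_id_map: Dict[int, int]
) -> None:
    for reaction in reactions:
        # Only realm emoji store a RealmEmoji id in emoji_code; other
        # reaction types keep their code unchanged.
        if reaction["reaction_type"] == "realm_emoji":
            old_id = int(reaction["emoji_code"])
            reaction["emoji_code"] = str(realm_emoji_id_map[old_id])
```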