The Slack import process would incorrectly issue
CustomProfileFieldValue entries with a value of "" for users who
didn't have a given CustomProfileField (especially common for the
"skype" and "phone" fields). This had no user-visible effect, but
certainly added some clutter in the database.
This works by yielding messages sorted based on timestamp. Because
the Slack exports are broken into files by date, it's convenient to do
a 2-layer sorting process, where we open all the files for a given
day, and then sort their messages by timestamp before yielding them.
Fixes #10930.
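A rough sketch of the two-layer sort, assuming one list of JSON
files per day and a 'ts' timestamp on each message (the real file
layout and field names may differ):

    import json

    def yield_messages_in_order(files_by_date):
        # Layer 1: walk the days in order.
        for date in sorted(files_by_date):
            day_messages = []
            for path in files_by_date[date]:
                with open(path) as f:
                    day_messages.extend(json.load(f))
            # Layer 2: sort the whole day's messages by timestamp.
            day_messages.sort(key=lambda message: message['ts'])
            for message in day_messages:
                yield message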
The previous implementation used run_parallel incorrectly, passing it
a set of very small jobs (each was to download a single file), which
meant that we'd end up forking once for every file to download.
This correct implementation sends each of N threads 1/N of the files
to download, which is more consistent with the goal of distributing
the download work between N threads.
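A hedged sketch of the new division of labor (the job-building
details here are illustrative, not the actual run_parallel call):

    def split_into_chunks(files, num_threads):
        # Give each of the N workers roughly 1/N of the files,
        # instead of forking once per file.
        return [files[i::num_threads] for i in range(num_threads)]

    # Example: 6 workers, each handed a list of files to download.
    all_files = ['avatar-%d.png' % (i,) for i in range(100)]
    jobs = split_into_chunks(all_files, 6)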
We don't import messages from strange senders. Basically,
we only import a message if it has a sender with an id
that maps to a non-deleted user.
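Roughly, the check looks like this sketch; the mapping and the
deleted-user set stand in for the converter's real data structures:

    def should_import(raw_message, sender_id_to_user_id, deleted_user_ids):
        sender_id = raw_message.get('sender_id')
        user_id = sender_id_to_user_id.get(sender_id)
        return user_id is not None and user_id not in deleted_user_ids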
We now account for streams having users that
may be deleted. We do a couple things:
- use a loop instead of map
- only pass in users to hipchat_subscriber
- early-exit if there are no users
- skip owner/members logic for public streams
Normal hipchat exports use integer ids for their
users and "rooms," which we just borrowed during
conversion.
Atlassian Stride uses UUIDs for these instead, but otherwise
has the same export format.
We now introduce IdMapper to handle external ids
that aren't integers. The IdMapper will map UUID
ids to ints and remember them. For ints it just
leaves them alone.
Fixes #10805.
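A minimal sketch of the idea (the real class's API may differ):

    class IdMapper:
        def __init__(self):
            self.map = {}
            self.next_id = 1

        def get(self, external_id):
            # Integer ids pass through untouched.
            if isinstance(external_id, int):
                return external_id
            # UUID-style ids get a small integer, remembered for
            # next time.
            if external_id not in self.map:
                self.map[external_id] = self.next_id
                self.next_id += 1
            return self.map[external_id]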
We (lexically) remove "subject" from the conversion code. The
`build_message` helper calls `set_topic_name` under the hood,
so things still have "subject" in the JSON.
There was good code coverage on `build_message`.
This was a pretty nasty error, where we were accidentally accessing
the parent list in this inner loop function.
This appears to have been introduced as a refactoring bug in
7822ef38c2.
The last_modified field is intended to support setting the
orig-last-modified field in the S3 backend when importing, basically
to keep track of this bit of pre-export data for debugging. In the
event that it isn't available, the correct thing to do is not write
out an invalid `last_modified` field; we should just not write it out
at all.
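Sketched out, with illustrative field names, the record-building
now looks like:

    def build_attachment_record(path, size, last_modified=None):
        record = dict(path=path, size=size)
        # Only write last_modified when we actually have pre-export
        # data; otherwise leave the key out entirely.
        if last_modified is not None:
            record['last_modified'] = last_modified
        return record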
This is mostly an extraction, but it does change the
way we calculate `content`. We append the markdown
links from ALL files to any content that came in the
message itself.
Separating this out also allows us to add more
test coverage for the extracted code.
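A small sketch of the new content calculation (the file shapes here
are assumptions):

    def build_content(raw_content, uploaded_files):
        pieces = [raw_content] if raw_content else []
        # Append a markdown link for every file attached to the message.
        for uploaded_file in uploaded_files:
            pieces.append('[%s](%s)' % (uploaded_file['name'],
                                        uploaded_file['path']))
        return '\n'.join(pieces)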
We now use subscriber_map for building UserMessage
rows in Slack/Gitter conversions.
This is mostly designed to simplify the code, rather
than having to scan the entire subscriber list for each
message.
I am guessing this will improve performance for most
conversions. We sort small lists on every message,
in order to be deterministic, but the sorting cost
is probably more than offset by avoiding the O(N)
scans across all subscriptions. Also, it's probably
negligible in the grand scheme of things, compared
to JSON parsing, file I/O, etc.
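Roughly, and with assumed data shapes and field names, the
UserMessage construction now works like this:

    def make_user_message_rows(message, subscriber_map, mentioned_user_ids):
        # subscriber_map: recipient_id -> set of subscribed user ids
        user_ids = subscriber_map.get(message['recipient_id'], set())
        rows = []
        # Sort the (small) set so the output is deterministic.
        for user_id in sorted(user_ids):
            rows.append(dict(
                user_profile_id=user_id,
                message_id=message['id'],
                is_mentioned=user_id in mentioned_user_ids,
            ))
        return rows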
This commit also fixes some typos with mentioned_users_id ->
mentioned_user_ids and cleans up a test a bit as well.
We now have all three third party
conversions (Gitter/Slack/Hipchat)
go through build_user_message().
Hipchat was already using this helper.
We also avoid callers having to pass in
an id to build_user_message().
Masking content can be useful for testing
out conversions where you're dealing
with data from customers and want to avoid
inadvertently reading their content (while
still having semi-realistic messages).
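One plausible way to mask content, shown only as a sketch (the
converter's actual masking may differ):

    import re

    def mask_content(content):
        # Keep the rough shape of the message but hide the words.
        return re.sub(r'\w', 'x', content)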
Having two smaller functions should make it
easier to customize the behavior for each specific
use case. The only reason they were ever coupled
was to keep ids in sequence, but the recent NEXT_ID
changes make that a non-issue now.
We now instantiate NEXT_ID in sequencer.py, which avoids
having multiple modules each create their own copy of the
sequencer and possibly cause id collisions.
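A minimal sketch of a shared sequencer like the one described (the
real sequencer.py may differ in detail):

    def sequencer():
        state = {}

        def next_id(name):
            state[name] = state.get(name, 0) + 1
            return state[name]

        return next_id

    # One shared instance; every module that needs an id asks this one.
    NEXT_ID = sequencer()

    message_id = NEXT_ID('message')
    user_id = NEXT_ID('user')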
This bug was introduced very recently and is an
aliasing bug. It caused extra UserMessage rows to
be created as we inadvertently updated the underlying
subscriber_map sets for multiple messages.
This probably mostly affected PMs.
It's doubtful the bug ever got out into the field.
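The aliasing pattern, boiled down to a toy illustration:

    subscriber_map = {17: {100, 101}}

    def buggy_user_ids(recipient_id, sender_id):
        # Reuses the set object stored in subscriber_map, so adding
        # the sender mutates the shared map and leaks ids into later
        # messages.
        user_ids = subscriber_map[recipient_id]
        user_ids.add(sender_id)
        return user_ids

    def fixed_user_ids(recipient_id, sender_id):
        # Copy the set first, leaving the shared map untouched.
        user_ids = set(subscriber_map[recipient_id])
        user_ids.add(sender_id)
        return user_ids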
We extract this function and put it in the shared
library `import_util.py`.
Also, we now build it just once, higher up in the call
stack, rather than re-building it for every batch
of messages. I doubt this was super expensive, but
there's no reason to repeatedly execute this.
Before this fix, we were creating two copies of every
PM Message in zerver_message, each with only one
corresponding UserMessage row.
Now we only create one PM Message per message, which
we accomplish by making sure we only use imported
messages from the sender's history.json file. And
then we write UserMessage rows for both participants
by making sure to include sender_id in the set of
user_ids that feeds into making UserMessage. For
the case where you PM yourself, there's just one
UserMessage row.
It does not appear that we need to support huddles
yet.
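A hedged sketch of the idea; the file layout and field names are
assumptions based on the description above:

    def is_primary_copy(message, history_owner_id):
        # Only keep the copy found in the sender's own history.json,
        # so each PM gets imported exactly once.
        return message['sender_id'] == history_owner_id

    def user_ids_for_pm(message):
        # Include the sender so both participants get a UserMessage
        # row; a PM to yourself naturally collapses to a single row.
        return {message['sender_id'], message['receiver_id']}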
When we create new ids for message rows, we
now sort the new ids by their corresponding
pub_date values in the rows.
This takes a sizable chunk of memory.
This feature only gets turned on if you
set sort_by_date to True in realm.json.
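Sketched out with assumed data shapes, the re-ordering works roughly
like this:

    def assign_ids_by_date(rows, new_ids, sort_by_date):
        if sort_by_date:
            # Hand the smallest new ids to the oldest messages.
            rows = sorted(rows, key=lambda row: row['pub_date'])
            new_ids = sorted(new_ids)
        for row, new_id in zip(rows, new_ids):
            row['id'] = new_id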
Even individual "room" files from hipchat can be large,
so we process only 1000 messages at a time
within each file, which produces smaller JSON files.
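Illustratively (the write helper here is assumed, not the real one):

    def write_message_chunks(messages, write_chunk, chunk_size=1000):
        # Emit at most chunk_size messages per output JSON file.
        for start in range(0, len(messages), chunk_size):
            write_chunk(messages[start:start + chunk_size])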
We simplify the code for is_realm_admin
and set is_guest as well.
I verified that build_user() is not used
by Slack/Gitter, so the extra argument there
should be fine.
Fixes #10639
This is a very early version of a tool to convert Hipchat
tar files into data files that can be used by the Zulip
import process.
We include the most fundamental entities--users and
streams. Customers who don't care about past messages
or customizations could start an instance off of this
and start communicating.
Of course, there are a lot of things missing in the
initial version:
* messages!
* file assets -- avatars, emojis, attachments
* probably lots of other minor things
We currently ignore any incoming dates from Hipchat data
and just use the current time. This is consistent with
other imports.
We also don't have any docs yet, although the process
will be extremely similar to the "Slack" process:
https://zulipchat.com/help/import-from-slack
Also, there's a comment at the top of convert_hipchat_data.py
that describes how to test this in dev mode.
I tested this by following the steps in the comment above.
The users just "show up" in /devlogin, so that's nice, and
you can send messages to other users. To verify the stream
data you have to go into the gear menu and click on "All
Streams", then you can subscribe and send a message.
Production users will need to get new passwords and
re-subscribe to streams. We will probably auto-subscribe
all users to public streams.
The missing fields are checked by the `full_clean()` method.
The datetime field errors are ignored as they are fixed
in the `import_realm` script. The fields that are
allowed to be null are not included while building
this object.
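A minimal sketch of the check, assuming it filters out the datetime
errors roughly like this (names are illustrative):

    from django.core.exceptions import ValidationError

    def check_missing_fields(obj, datetime_fields):
        try:
            obj.full_clean()
        except ValidationError as error:
            # Datetime errors are expected here; import_realm fixes
            # those values later, so only re-raise the rest.
            remaining = {field: messages
                         for field, messages in error.message_dict.items()
                         if field not in datetime_fields}
            if remaining:
                raise ValidationError(remaining)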
The 'last_modified' value in emoji records is
needed for uploading the file to the S3 backend.
We set it in the function 'import_uploads_s3'.
We also have to remove the keyword 'last_modified'
while building the RealmEmoji dict, as it is not
a field which exists in RealmEmoji objects.
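Roughly (record layout assumed), the split looks like:

    def build_realm_emoji_record(record):
        # Keep last_modified around for the S3 upload step, but leave
        # it out of the RealmEmoji row, since the model has no such
        # field.
        last_modified = record.get('last_modified')
        realm_emoji = {key: value for key, value in record.items()
                       if key != 'last_modified'}
        return realm_emoji, last_modified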