As mentioned in the TODO this commit deletes, the export with member
consent system was failing to account for the fact that if consenting
users only have access to a subset of messages of a stream with
protected history, only that subset should be exported - rather than all
the stream's messages.
SCIMClient is a type-unsafe workaround for django-scim2’s conflation
of SCIM users with Django users. Given that a SCIMClient is not a
UserProfile, it might as well not be a model at all, since it’s only
used to satisfy django-scim2’s request.user.is_authenticated queries.
This doesn’t solve the type safety issue with assigning a SCIMClient
to request.user, nor the performance issue with running the SCIM
middleware on non-SCIM requests. But it reduces the risk of potential
consequences worse than crashing, since there’s no longer a
request.user.id for Django to confuse with the ID of an actual
UserProfile.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
We do not need direct_members and direct_subgroups field of
UserGroup objects in the export data since we already have
UserGroupMembership and GroupGroupMembership object data.
While importing we keep these fields empty when creating
UserGroup objects and direct_members and direct_subgroups
fields will get set when UserGroupMembership and
GroupGroupMembership objects are created.
This change will also help us in further changes when we
will change the order of importing to import UserGroup
objects just after Realm objects.
Zulip Server 2.1.0 and above have a UI tool, accessible only to server
owners and server administrators, which provides a way to download a
“public data” export. While this export tool is only accessible to
administrators, in many configurations server administrators are not
expected to have access to private messages and private
streams. However, the “public data” export which administrators could
generate contained the attachment contents for all attachments, even
those from private messages and streams.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The pattern of using the same variable to apply filters
or alter the `QuerySet` in other ways might produce `QuerySet`s
with incompatible types. This behavior is not allowed by mypy.
Signed-off-by: Zixuan James Li <p359101898@gmail.com>
To explain the rationale of this change, for example, there is
`get_user_activity_summary` which accepts either a `Collection[UserActivity]`,
where `QuerySet[T]` is not strictly `Sequence[T]` because its slicing behavior
is different from the `Protocol`, making `Collection` necessary.
Similarily, we should have `Iterable[T]` instead of `List[T]` so that
`QuerySet[T]` will also be an acceptable subtype, or `Sequence[T]` when we
also expect it to be indexed.
Signed-off-by: Zixuan James Li <p359101898@gmail.com>
`cachify` is essentially caching the return value of a function using only
the non-keyword-only arguments as the key.
The use case of the function in the backend can be sufficiently covered by
`functools.lru_cache` as an unbound cache. There is no signficant difference
apart from `cachify` overlooking keyword-only arguments, and
`functools.lru_cache` being conveniently typed.
Signed-off-by: Zixuan James Li <359101898@qq.com>
Fixes#18017.
In previous commits, the change to the bouncer API was introduced to
support this and then a series of migrations added .uuid to
UserProfiles.
Now the code for self-hosted servers that makes requests
to the bouncer is changed to make use of it.
These are not considered to be "personal"
info, even if you upload them, so we
don't export them.
Generally the only folks who upload
these are admins, who can easily get
them in other ways. In fact, anybody
can get these via the app.
We now ensure that all message ids are sorted BEFORE
we split them into batches.
We now do a few extra "slim" queries to get message
ids up front.
But, now, when we divide them into batches, we no
longer run 2 or 3 different complicated queries in
a loop. We just basically hydrate our message ids,
so `write_message_partials` should be easy to reason
about.
This change also means that for tiny realms with
< 1000 messages you will always have just one
json file, since we aggregate the ids from the
queries before batching.
This accomplishes a few things:
* It extracts `chunkify` rather than having us
clumsily track chunking-related stuff in a
big loop that is doing other stuff.
* It makes it so that all message ids
in message-000001.json < message-000002.json.
* It makes it easier for us to customize
the messages we send to a single user
(coming soon).
BTW we probably have a slicker version of chunkify
somewhere in our codebase, but I couldn't remember
where.
Now all file writes go through our three
helper functions, and we consistently
write a single log message after the file
gets written.
I killed off write_message_exports, since
all but one of its callers can call
write_table_data, which automatically
sorts data. In particular, our Message
and UserMessage data will now be sorted
by ids.
This probably just postpones the list creation until
Django builds the "IN" query, but semantically it's
good to work in sets where we don't have any
meaningful ordering of the list that gets used.
The immediate benefit of this is stronger mypy
checks (avoiding the ugly union caused by message
files).
The subsequent commit will add sorting.
We have test coverage on all these lines insofar
as if you comment out the lines, tests will
explode (i.e. more than superficial line
coverage).
The distinction here wasn't super meaningful
due to the way we order our "elif" statements,
but we want to reserver "normal_parent" for the
majority of use cases, where you simply tell
the Config what the "foreign_key" is.
For realm-wide exports, there is no reason to query
inefficiently against a list of modified users.
We move the Config out of the common child configs.
Even though Django usually treats foo__in
and foo_id__in identically for filters where
foo is a ForeignKey type, we want to insist
on somewhat more consistent syntax, because
we have the odd combo of type and type_id
in Recipient, where type_id is kinda like a
foreign key, but not a ForeignKey.
So we assert for now that all our include_rows
values end in "_id__in".
We don't have automated test coverage on this yet,
but below are the results from manual testing.
Note that we include the realm icon and logo even
though they were not created by Cordelia.
./manage.py export_single_user cordelia@zulip.com
$ (cd /tmp/zulip-export-4v3mo802/ && find .)
.
./emoji
./emoji/2
./emoji/2/emoji
./emoji/2/emoji/images
./emoji/2/emoji/images/3.jpg
./emoji/records.json
./messages-000001.json
./realm_icons
./realm_icons/2
./realm_icons/2/night_logo.original
./realm_icons/2/night_logo.png
./realm_icons/2/icon.png
./realm_icons/2/icon.original
./realm_icons/records.json
./avatars
./avatars/2
./avatars/2/c5125af0447f4d66ce34c1b32eac75ac27ebe0e7.original
./avatars/2/c5125af0447f4d66ce34c1b32eac75ac27ebe0e7.png
./avatars/records.json
./uploads
./uploads/2
./uploads/2/68
./uploads/2/68/xyEkC5dTIp8m42_6HJ3kBfdt
./uploads/2/68/xyEkC5dTIp8m42_6HJ3kBfdt/denver.jpg
./uploads/2/96
./uploads/2/96/ol5WE6RTUntvuPDSpJUrYTim
./uploads/2/96/ol5WE6RTUntvuPDSpJUrYTim/denver.jpg
./uploads/records.json
./user.json