Under the unicodedata distributed with Python 3.6, some Emoji are
classified as `Cn`, and not `So`:
```
$ unicode 1f929 --long
U+1F929 GRINNING FACE WITH STAR EYES
UTF-8: f0 9f a4 a9 UTF-16BE: d83edd29 Decimal: 🤩 Octal: \0374451
🤩
Category: So (Symbol, Other); East Asian width: W (wide)
Unicode block: 1F900..1F9FF; Supplemental Symbols and Pictographs
Bidi: ON (Other Neutrals)
$ python3.6 -c 'import unicodedata; print(unicodedata.category("\U0001f929"))'
Cn
$ python3.7 -c 'import unicodedata; print(unicodedata.category("\U0001f929"))'
So
```
Drop `Cn` from the list of excluded Unicode character classes, and
replace it with an explicit list of the 66 non-characters, which are
invariant.
Co-authored-by: Shlok Patel <shlokcpatel2001@gmail.com>
An explanatory note on the changes in zulip.yaml and
curl_param_value_generators is warranted here. In our automated
tests for our curl examples, the test for the API endpoint that
changes the posting permissions of a stream comes before our
existing curl test for adding message reactions.
Since there is an extra notification message due to the change in
posting permissions, the message IDs used in tests that come after
need to be incremented by 1.
This is a part of #20289.
Prior to this commit, we wrapped all incoming messages from Slack
in backticks. This led to weird formatting errors when an incom-
ing message from Slack contains backticks, to refer to a function
name, for instance.
Moves `flags` field to top part of object description because
it is always included in the event.
If a field is present only for certain types of message updates,
the description begins by stating when the field is present:
"Only present if ...".
These fields are organized by the type of message update:
stream, stream and/or topic, topic, content.
If a field is not present due to a special event, the description
ends by stating when the field is not present:
"Not present if ...".
Adds documentation for fields currently required to be returned
with any `update_message` event.
do_delete_users had two bugs:
1. Creating the replacement dummy users
with active=True
2. Creating the replacement dummy users with email domain set to
realm.uri, which may not be a valid email domain.
Prior commits fixed the bugs, and this migration fixes the pre-existing
objects.
Otherwise the dummy user can be created with an invalid email domain -
e.g. in development environment with the domain
"@http://localhost:9991". get_fake_email_domain exists exactly for
handling these kinds of scenarios.
Stop using `access_user_group_by_id` in notifications codepaths, as it
is meant to be used to check for _write_ access, not read
access (which is not limited). In the notification codepaths, there
are no ACLs to apply, and the ID is known-good; just load it
directly. The `for_mention` flag is removed, as it was not used in the
mention codepaths at all, only the notification ones.
get_remote_server_by_uuid (called in validate_api_key) raises
ValidationError when given an invalid UUID due to how Django handles
UUIDField. We don't want that exception and prefer the ordinary
DoesNotExist exception to be raised.
APNs payloads nest the zulip-custom data further than the top level,
as Android notifications do. This led to APNs data silently never
being truncated; this case was not caught in tests because the mocks
provided the wrong data for the APNs structure.
Adjust to look in the appropriate place within the APNs data, and
truncate that.
This replaces the temporary (and testless) fix in
24b1439e93 with a more permanent
fix.
Instead of checking if the user is a bot just before
sending the notifications, we now just don't enqueue
notifications for bots. This is done by sending a list
of bot IDs to the event_queue code, just like other
lists which are used for creating NotificationData objects.
Credit @andersk for the test code in `test_notification_data.py`.
As explained in the comments in the code, just doing UUID(string) and
catching ValueError is not enough, because the uuid library sometimes
tries to modify the string to convert it into a valid UUID:
>>> a = '18cedb98-5222-5f34-50a9-fc418e1ba972'
>>> uuid.UUID(a, version=4)
UUID('18cedb98-5222-4f34-90a9-fc418e1ba972')
This diff looks slightly noisy, but the main chunk of
code that we moved here has the same logic as before,
and it just gets realm_id from MentionBackend now, instead
of having our markdown processor have to supply it.
We basically want MentionData to be the gatekeeper of
mention data, and then we delegate backend tasks to
MentionBackend.
Soon we will add a cache to MentionBacked, which will
justify this change a bit more.
We now make it mandatory to pass in the Realm object.
If this function was ever called with None, I am scared
to know what the expected results were at the time of
writing.
It's slightly annoying to plumb Optional[MentionBackend]
down the stack, but it's a one-time change.
I tried to make the cache code relatively unobtrusive
for the single-message use case.
We should be able to eliminate redundant stream queries
using similar techniques.
I considered caching at the level of rendering the message
itself, but this involves nearly as much plumbing, and
you have to account for the fact that several users on
your realm may have distinct default languages (French,
Spanish, Russian, etc.), so you would not eliminate as
many query hops. Also, if multiple streams were involved,
users would get slightly different messages based on
their prior subscriptions.
When our handlers specifically reference self.md.zulip_db_data,
we now use an explicit type.
We probably want a more robust solution here, such as a semgrep
rule.
We now serialize still_url as None for non-animated emojis,
instead of omitting the field. The webapp does proper checks
for falsiness here. The mobile app does not yet use the field
(to my knowledge).
We bump the API version here. More discussion here:
https://chat.zulip.org/#narrow/stream/378-api-design/topic/still_url/near/1302573
Appending to bytes in a loop leads to a quadratic slowdown since
Python doesn’t optimize this for bytes like it does for str.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
While accepting an invitation from a user, there was no condition in
place to check if the user sending the invitation was now
now-deactivated.
Skip sending notifications about newly-joined users to users who are
now disabled.
Fixes#18569.
We don't have to go to the database to get the Recipient
fields for `user_profile.recipient`.
See also 85ed6f332a from a little
over a year ago--it's very similar.
The bug here probably didn't come up too much in
practice, but if we were adding a user to multiple
streams when they already had used all N available
colors, all the new streams would be assigned the same
color, since the size of used_colors would stay at N,
thwarting our little modulo-len hackery.
It's not a terrible bug, since users can obviously
customize their stream colors as they see fit.
Usually when we are adding a user to multiple streams,
the users are fairly new, and thus don't have many
existing streams, so I have never heard this bug
reported in the field.
Anyway, assigning the colors in bulk seems to make more
sense, and I added some tests.
For the situations where all the colors have already
been used, I didn't put a ton of thought into exactly
which repeated colors we want to choose; instead, I
just ensure they're different modulo 24. It's possible
that we should just have more than 24 canned colors, or
we should just assign the same default color every time
and let users change it themselves (once they've gone
beyond the 24, to be clear). Or maybe we can just do
something smarter here. I don't have enough time for a
deep dive on this issue.
Part of our codepath for subscribing users involves
fetching the users' existing subscriptions to make sure
we can do things like properly report to the clients
that the users were already subscribed. This codepath
used to be coupled to code that helped users maintain
unique stream colors.
Suppose you are creating a new stream, and you are
importing users from an older stream with 15k
subscribers, and each of your users is subscribed to
about 20 streams.
The prior code, instead of filtering on recipient_id,
would literally look at every subscription for every
user, which was kind of crazy if you didn't understand
the pick-stream-color complications.
Before this commit, we would fetch 300k rows with 15
columns each (granted, all but one of the columns are
bool/int). That's a total of 4.5 million tiny objects
that we had to glom into Django ORM objects and slice
and dice.
After this commit, we would fetch exactly zero rows
for the are-they-already-subscribed logic.
Yes, ZERO.
If we were to mistakenly try to re-add the same 15k
subscribers to the new stream (under the new code), we
will now fetch 15k Sub rows instead of 300k.
It is worth looking at the prior commit. We go through
great pains to ensure that users get new stream colors
when we invite them to a stream, and we still fetch a
bunch of data for that. Instead of 4.5 million cells,
it's more like 600k cells (2 columns per row), and it's
less than that insofar as some users may only
have 24 distinct colors among their many streams.
It's a lot of work.
This commit sets us up for the next commit, which will
save us a very expensive query.
If you are adding 15k users to a stream, and each user
has about 20 existing streams, then we need to retrieve
300k rows from the database to figure out which stream
colors they already have. We don't need all the extra
fields from Subscription, so now we get just the two
values we need for making a color map.
In the next commit we'll eliminate the other use case
for the big query, and I will explain in greater
depth how splitting out the color-picking code can
be a huge win. It is possible that some product decisions
could make this codepath easier. We could also do some
engineering specific to stream colors, such as caching
which colors users have already used.
This does cost us an extra round trip to the database.
Having the `wildcard_mentions_notify` setting turned on does
not necessarily mean that the user will receive notification
for that message. There is more nuance to this, as explained
in the updated comment.
We recently ran into a payload in production that didn't contain
an event type at all. A payload where we can't figure out the event
type is quite rare. Instead of letting these payloads run amok, we
should raise a more informative exception for such unusual payloads.
If we encounter too many of these, then we can choose to conduct a
deeper investigation on a case-by-case basis.
With some changes by Tim Abbott.
Given that these values are uuids, it's better to use UUIDField which is
meant for exactly that, rather than an arbitrary CharField.
This requires modifying some tests to use valid uuids.
We avoid repeating the same calculations over and
over again for the same stream.
This helps, but the real bottleneck in this function
is that send_event usually takes at least a millisecond,
and that adds up quickly if you're doing something
like subscribing 5k users to a new stream.
GIF files can be `.GIF`, and also we determine the file format by
inspecting the image data, so there's no reason to have this
assertion.
(The code for serving still images does not rely on the file being a
GIF.)
Have kept process_new_human_user out of
the atomic block because it involves many
different operations and also sends events.
Tried enclosing event in on_commit but that
would need many changes in the tests, so have
skipped it for now.
Updates testing helpers in `event_schema.py` for `do_update_message` so
that all stream message fields are present in any edits / updates to
stream messages. Adds verfication tests of events returned from private
message edits and from stream message content-only and topic-only edits.
Updates the `update_message` event type to always include a `stream_id`
field when the message being edited is a stream message. This change
aligns with the current definition of the `\get-events` endpoint
in the OpenAPI documentation.
It is better to press on, than stop halfway through due to a user
whose email no longer works. The exception is already logged, which
is sufficient here, as this is generally run interactively.
These fundamentally tested send_email, not build_email, and thus
belong in TestSendEmail, not TestBuildEmail. They also duplicated the
code in test_send_email_exceptions; reuse it.
This allows verify_uploads to use the database
as the authoritative source for what attachments
we need to look for when we're verifying the
images got exported properly, while still
also verifying attachment.json is correct.
It is better for the verifying code to just explicitly
ensure that the exported file bytes match the bytes
in the test image. This introduces a tiny bit more
of I/O.
It's easier to read the code without the intermediate
full_data dictionary that obscures where the files live.
We also avoid some unnecessary file i/o in the tests.
We do a sanity check for every table
that gets written to user.json as part of
the single-user export.
If we add more tables to the single-user export,
the test that I modified here will now ask
the author to add a new checker function, which
means we should always have at least a basic
sanity check for every exported table as long
as we stay in this new paradigm.
We also remove a little bit of old code that
became redundant.
This replaces the TERMS_OF_SERVICE and PRIVACY_POLICY settings with
just a POLICIES_DIRECTORY setting, in order to support settings (like
Zulip Cloud) where there's more policies than just those two.
With minor changes by Eeshan Garg.
We do s/TOS/TERMS_OF_SERVICE/ on the name, and while we're at it,
remove the assumed zerver/ namespace for the template, which isn't
correct -- Zulip Cloud related content should be in the corporate/
directory.
We now complain if a test author sends a stream message
that does not result in the sender getting a
UserMessage row for the message.
This is basically 100% equivalent to complaining that
the author failed to subscribe the sender to the stream
as part of the test setup, as far as I can tell, so the
AssertionError instructs the author to subscribe the
sender to the stream.
We exempt bots from this check, although it is
plausible we should only exempt the system bots like
the notification bot.
I considered auto-subscribing the sender to the stream,
but that can be a little more expensive than the
current check, and we generally want test setup to be
explicit.
If there is some legitimate way than a subscribed human
sender can't get a UserMessage, then we probably want
an explicit test for that, or we may want to change the
backend to just write a UserMessage row in that
hypothetical situation.
For most tests, including almost all the ones fixed
here, the author just wants their test setup to
realistically reflect normal operation, and often devs
may not realize that Cordelia is not subscribed to
Denmark or not realize that Hamlet is not subscribed to
Scotland.
Some of us don't remember our Shakespeare from high
school, and our stream subscriptions don't even
necessarily reflect which countries the Bard placed his
characters in.
There may also be some legitimate use case where an
author wants to simulate sending a message to an
unsubscribed stream, but for those edge cases, they can
always set allow_unsubscribed_sender to True.
These variables can be unset if the `os.path.exists` check fails.
That should be rare, since we've previously checked the files do
exist before getting here.
While races here are unlikely, it is most correct to enforce this
invariant at the database layer, and having a database-level
constraint makes the models file a bit more readable.
These are not considered to be "personal"
info, even if you upload them, so we
don't export them.
Generally the only folks who upload
these are admins, who can easily get
them in other ways. In fact, anybody
can get these via the app.
We now ensure that all message ids are sorted BEFORE
we split them into batches.
We now do a few extra "slim" queries to get message
ids up front.
But, now, when we divide them into batches, we no
longer run 2 or 3 different complicated queries in
a loop. We just basically hydrate our message ids,
so `write_message_partials` should be easy to reason
about.
This change also means that for tiny realms with
< 1000 messages you will always have just one
json file, since we aggregate the ids from the
queries before batching.
This accomplishes a few things:
* It extracts `chunkify` rather than having us
clumsily track chunking-related stuff in a
big loop that is doing other stuff.
* It makes it so that all message ids
in message-000001.json < message-000002.json.
* It makes it easier for us to customize
the messages we send to a single user
(coming soon).
BTW we probably have a slicker version of chunkify
somewhere in our codebase, but I couldn't remember
where.
Following b3c58f454f, we want to clean up
old topics that may contain the disallowed characters. The Message table
is large, so we go in batches, making sure we limit topic fetches and
UPDATE query to no more than BATCH_SIZE Message rows per query.
Now all file writes go through our three
helper functions, and we consistently
write a single log message after the file
gets written.
I killed off write_message_exports, since
all but one of its callers can call
write_table_data, which automatically
sorts data. In particular, our Message
and UserMessage data will now be sorted
by ids.
This probably just postpones the list creation until
Django builds the "IN" query, but semantically it's
good to work in sets where we don't have any
meaningful ordering of the list that gets used.
The immediate benefit of this is stronger mypy
checks (avoiding the ugly union caused by message
files).
The subsequent commit will add sorting.
We have test coverage on all these lines insofar
as if you comment out the lines, tests will
explode (i.e. more than superficial line
coverage).
The distinction here wasn't super meaningful
due to the way we order our "elif" statements,
but we want to reserver "normal_parent" for the
majority of use cases, where you simply tell
the Config what the "foreign_key" is.
For realm-wide exports, there is no reason to query
inefficiently against a list of modified users.
We move the Config out of the common child configs.
Even though Django usually treats foo__in
and foo_id__in identically for filters where
foo is a ForeignKey type, we want to insist
on somewhat more consistent syntax, because
we have the odd combo of type and type_id
in Recipient, where type_id is kinda like a
foreign key, but not a ForeignKey.
So we assert for now that all our include_rows
values end in "_id__in".
Zulip shows two guides on How to reply, first one by
the welcome bot and second one is intro_reply hotspot.
To simply and avoid redundancy, intro_reply hotspot is
removed.
Fixes#20482.
Force postgres to give reactions in ID order - which
is generally chronological order. Results in frontend
displaying reactions in said order.
Fixes#20060.
In many of our stream notification messages, we make use of the
same silent user mention syntax, the template for which was always
hardcoded. This commit adds a helper function that all relevant
callers can call to get the right syntax when mentioning users.
Thanks to Tim Abbott for this suggestion!
Removed existing empty narrow divs from app/home.html and created
a new javascript module to dynamically load empty narrow messages
using handlebar template.
Fixes#18797
The original intention of this was to prevent coding
errors with realm getters that don't, um, filter
on realm.
Unfortunately, you can still write a broken realm getter
that forgets to filter on realm, but which returns a
Set, and the new safeguards won't see any difference.
We could make all the getters return sorted lists
instead, but that's for another day.
This code does serve another purpose, which is to
prevet egregious bugs in the import itself.
The diff here is ugly, but to summarize:
BEFORE IMPORT:
define get_user_id
define get_huddle_hashes
AFTER IMPORT AND MAKING GETTERS:
check realm id
define assert_realm_values
verify emoji codes
check huddle hashes
We don't have automated test coverage on this yet,
but below are the results from manual testing.
Note that we include the realm icon and logo even
though they were not created by Cordelia.
./manage.py export_single_user cordelia@zulip.com
$ (cd /tmp/zulip-export-4v3mo802/ && find .)
.
./emoji
./emoji/2
./emoji/2/emoji
./emoji/2/emoji/images
./emoji/2/emoji/images/3.jpg
./emoji/records.json
./messages-000001.json
./realm_icons
./realm_icons/2
./realm_icons/2/night_logo.original
./realm_icons/2/night_logo.png
./realm_icons/2/icon.png
./realm_icons/2/icon.original
./realm_icons/records.json
./avatars
./avatars/2
./avatars/2/c5125af0447f4d66ce34c1b32eac75ac27ebe0e7.original
./avatars/2/c5125af0447f4d66ce34c1b32eac75ac27ebe0e7.png
./avatars/records.json
./uploads
./uploads/2
./uploads/2/68
./uploads/2/68/xyEkC5dTIp8m42_6HJ3kBfdt
./uploads/2/68/xyEkC5dTIp8m42_6HJ3kBfdt/denver.jpg
./uploads/2/96
./uploads/2/96/ol5WE6RTUntvuPDSpJUrYTim
./uploads/2/96/ol5WE6RTUntvuPDSpJUrYTim/denver.jpg
./uploads/records.json
./user.json
There are tactical reasons to remove this assertion.
Basically, the reason it's safe to remove is that it's
been around a long time and we would have seen this
operationally. Also, the check to make sure that the
S3 filename thingy matches the avatar hash is a much
stronger check.
We will soon restore a stronger version of this check
that applies to all of our asset types (emojis/avatars/etc.).
This makes it easier to read the calling code and see
the big picture of how the four asset types are
organized.
I also handle uploads first, to be similar to the local
code.
This code is well tested--you can modify any of the callers
to pass in a wrong value of `object_key` and get a failing
test.