The change in 180d8abed6, while correct
for the Django part of the codebase, had the nasty side effect of
exposing a failure mode in the process_notification logic if the users
list was empty.
This, in turn, could cause our process_notification code to fail with
an IndexError when trying to process the event, which would result in
that tornado process not automatically recovering, due to the outer
try/except handler for consume triggering a NACK and thus repeating
the event.
When more than one outgoing webhook is configured,
the message which is send to the webhook bot passes
through finalize_payload function multiple times,
which mutated the message dict in a way that many keys
were lost from the dict obj.
This commit fixes that problem by having
`finalize_payload` return a shallow copy of the
incoming dict, instead of mutating it. We still
mutate dicts inside of `post_process_dicts`, though,
for performance reasons.
This was slightly modified by @showell to fix the
`test_both_codepaths` test that was added concurrently
to this work. (I used a slightly verbose style in the
tests to emphasize the transformation from `wide_dict`
to `narrow_dict`.)
I also removed a deepcopy call inside
`get_client_payload`, since we now no longer mutate
in `finalize_payload`.
Finally, I added some comments here and there.
For testing, I mostly protect against the root
cause of the bug happening again, by adding a line
to make sure that `sender_realm_id` does not get
wiped out from the "wide" dictionary.
A better test would exercise the actual code that
exposed the bug here by sending a message to a bot
with two or more services attached to it. I will
do that in a future commit.
Fixes#14384
If we have an old event that's missing the field
`sender_delivery_email`, we now patch it at the top
of `process_message_event`, rather than for each call
to `get_client_payload`. This will make an upcoming
commit a bit easier to reason about. Basically, it's
simpler to shim the incoming event one time rather
than doing it up to four times. We know that
`get_client_payload` is non-destructive, because it
does a deepcopy.
Instead of trying to set the _requestor_for_logs attribute in all the
relevant places, we try to use request.user when possible (that will be
when it's a UserProfile or RemoteZulipServer as of now). In other
places, we set _requestor_for_logs to avoid manually editing the
request.user attribute, as it should mostly be left for Django to manage
it.
In places where we remove the "request._requestor_for_logs = ..." line,
it is clearly implied by the previous code (or the current surrounding
code) that request.user is of the correct type.
Since essentially the first use of Tornado in Zulip, we've been
maintaining our Tornado+Django system, AsyncDjangoHandler, with
several hundred lines of Django code copied into it.
The goal for that code was simple: We wanted a way to use our Django
middleware (for code sharing reasons) inside a Tornado process (since
we wanted to use Tornado for our async events system).
As part of the Django 2.2.x upgrade, I looked at upgrading this
implementation to be based off modern Django, and it's definitely
possible to do that:
* Continue forking load_middleware to save response middleware.
* Continue manually running the Django response middleware.
* Continue working out a hack involving copying all of _get_response
to change a couple lines allowing us our Tornado code to not
actually return the Django HttpResponse so we can long-poll. The
previous hack of returning None stopped being viable with the Django 2.2
MiddlewareMixin.__call__ implementation.
But I decided to take this opportunity to look at trying to avoid
copying material Django code, and there is a way to do it:
* Replace RespondAsynchronously with a response.asynchronous attribute
on the HttpResponse; this allows Django to run its normal plumbing
happily in a way that should be stable over time, and then we
proceed to discard the response inside the Tornado `get()` method to
implement long-polling. (Better yet might be raising an
exception?). This lets us eliminate maintaining a patched copy of
_get_response.
* Removing the @asynchronous decorator, which didn't add anything now
that we only have one API endpoint backend (with two frontend call
points) that could call into this. Combined with the last bullet,
this lets us remove a significant hack from our
never_cache_responses function.
* Calling the normal Django `get_response` method from zulip_finish
after creating a duplicate request to process, rather than writing
totally custom code to do that. This lets us eliminate maintaining
a patched copy of Django's load_middleware.
* Adding detailed comments explaining how this is supposed to work,
what problems we encounter, and how we solve various problems, which
is critical to being able to modify this code in the future.
A key advantage of these changes is that the exact same code should
work on Django 1.11, Django 2.2, and Django 3.x, because we're no
longer copying large blocks of core Django code and thus should be
much less vulnerable to refactors.
There may be a modest performance downside, in that we now run both
request and response middleware twice when longpolling (once for the
request we discard). We may be able to avoid the expensive part of
it, Zulip's own request/response middleware, with a bit of additional
custom code to save work for requests where we're planning to discard
the response. Profiling will be important to understanding what's
worth doing here.
This fixes a bug where our asynchronous requests were only copying the
Content-Type header (i.e. the one case where we're noticed) from the
Django HttpResponse. I'm not sure what the impact of this would be;
the rate-limiting headers rarely come up when breaking a long-polled
request. But it seems clearly an improvement to do this in a
consistent fashion.
Only the headers piece is a change; in Tornado
self.finish(x)
is equivalent to:
self.write(x)
self.finish()
In e3ad9baf1d, we introduced yet another
bug where we incorrectly shared event dictionaries between multiple
queues.
Fortunately, the logging that reports on "event was not in the queue"
issues worked and detected this on chat.zulip.org, but this is a clear
indication that the comments we have around this system were not
sufficient to produce correct behavior.
We fix this by changing event_queue.push, the code that mutates the
event dictionaries, to do the shallow copies itself. The only
downside here is process_message_event, a relatively low-traffic code
path, does an extra per-queue dictionary copy. Given that presence,
heartbeat, and message reading events are likely more traffic and
dealing with HTTP is likely much more expensive than a dictionary
copy, this probably doesn't matter performance-wise.
(And if profiling later finds it is, there are potential workarounds
like passing a skip_copy argument we can do).
This flag affects page_params and the
payload you get back from POSTs to this
url:
users/me/presence
The flag does not yet affect the
presence events that get sent to a
client.
Zulip has had a small use of WebSockets (specifically, for the code
path of sending messages, via the webapp only) since ~2013. We
originally added this use of WebSockets in the hope that the latency
benefits of doing so would allow us to avoid implementing a markdown
local echo; they were not. Further, HTTP/2 may have eliminated the
latency difference we hoped to exploit by using WebSockets in any
case.
While we’d originally imagined using WebSockets for other endpoints,
there was never a good justification for moving more components to the
WebSockets system.
This WebSockets code path had a lot of downsides/complexity,
including:
* The messy hack involving constructing an emulated request object to
hook into doing Django requests.
* The `message_senders` queue processor system, which increases RAM
needs and must be provisioned independently from the rest of the
server).
* A duplicate check_send_receive_time Nagios test specific to
WebSockets.
* The requirement for users to have their firewalls/NATs allow
WebSocket connections, and a setting to disable them for networks
where WebSockets don’t work.
* Dependencies on the SockJS family of libraries, which has at times
been poorly maintained, and periodically throws random JavaScript
exceptions in our production environments without a deep enough
traceback to effectively investigate.
* A total of about 1600 lines of our code related to the feature.
* Increased load on the Tornado system, especially around a Zulip
server restart, and especially for large installations like
zulipchat.com, resulting in extra delay before messages can be sent
again.
As detailed in
https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it
appears that removing WebSockets moderately increases the time it
takes for the `send_message` API query to return from the server, but
does not significantly change the time between when a message is sent
and when it is received by clients. We don’t understand the reason
for that change (suggesting the possibility of a measurement error),
and even if it is a real change, we consider that potential small
latency regression to be acceptable.
If we later want WebSockets, we’ll likely want to just use Django
Channels.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This change makes it possible for users to control the notification
settings for wildcard mentions as a separate control from PMs and
direct @-mentions.
Eventually, we'll want to replace emails with user IDs here entirely,
but until we make that happen, we should at least use the same email
address present in our other logging.
I think we won't miss updating these in a future migration thanks to
mypy types.
Since years ago, this field hasn't been used for anything other than
some logging that would be better off logging the user ID anyway.
It existed in the first place simply because we weren't passing the
user_profile_id to Tornado at all.
Then, find and fix a predictable number of previous misuses.
With a small change by tabbott to preserve backwards compatibility for
sending `yes` for the `forged` field.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Most of the failures were due to parameters that are not intended to
be used by third-party code, so the correct fix for those was the set
intentionally_undocumented=True.
Fixes#12969.
Historically, Zulip's implementation of wildcard mentions never
triggered either email or push notifications, instead being limited to
desktop notifications and the "mentions" counter.
We fix this just by plumbing the "wildcard_mentioned" flag through our
system.
Implements much of
https://github.com/zulip/zulip/issues/6040#issuecomment-510157264.
We're also now ready to seriously work on #3750.
Apparently, our edit-message events did not guarantee that the outer
wrapper dictionary, which is intended to be unique for each client,
was unique for every client (instead only ensuring it was unique for
each user).
This led to clients unexpectedly getting last_event_id validation
errors in this code path when a user had multiple connected clients,
because the linear ordering of event IDs within a given queue was
corrupted.
In fd2a63b049, we accidentally fixed
this issue with a different set of userdata events, without fixing the
edit-message event bug. This commit fixes the remaining issue.
Previous cleanups (mostly the removals of Python __future__ imports)
were done in a way that introduced leading newlines. Delete leading
newlines from all files, except static/assets/zulip-emoji/NOTICE,
which is a verbatim copy of the Apache 2.0 license.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Apparently, our edit-message events did not guarantee that the outer
wrapper dictionary, which is intended to be unique for each client,
was unique for every client (instead only ensuring it was unique for
each user).
This led to clients unexpectedly getting last_event_id validation
errors in this code path when a user had multiple connected clients,
because the linear ordering of event IDs within a given queue was
corrupted.
This gives us access to typing_extensions.Deque, which was not added
to typing until 3.5.4.
(PROVISION_VERSION is not bumped because the transitive dependency set
in dev.txt hasn’t changed.)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This verifies that the client passed a last_event_id that actually
came from the queue instead of making up an ID from the future. It
turns out one of our tests was making up such an ID, but legitimate
clients are expected not to do so.
The previous version of this commit (commit
e00d4be6d5, #12888) had to be reverted
(commit b86c5cc490) because it was
missing the `to_dict`/`from_dict` migration code.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This fixes two issues:
* The syntax check logic we had for zerver.tornado.autoreload would
end up clearing _reload_hooks if one of the files that had changed
was zerver.tornado.autoreload itself (because we'd had re-imported
the current module), which could be incredibly confusing when trying
to test the autoreload logic. It seems better to just not run the
syntax check for syntax errors in this file.
Similarly, because reloading event_queue.py would destroy the state
in the queues, we avoid that as well.
* We make sure to flush stdout after running and reload hooks, to make
sure their output reaches the user.
We were apparently not running our own forked Tornado autoreload
library when adding reload hooks, which meant that our autoreload
hooks didn't run at all.
This fixes an issue that made dump_event_queues never run and thus the
local development environment difficult to use for testing event queues.
This verifies that the client passed a last_event_id that actually
came from the queue instead of making up an ID from the future. It
turns out one of our tests was making up such an ID, but legitimate
clients are expected not to do so.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This makes it more convenient for developers to set very short values
for this (e.g. 1 minute) for the purposes of testing/debugging; there
aren't obvious problems with letting users set short values for this.
This is a preparator refactor for supporting hosting different Tornado
processes on different servers; to look up which Tornado server we
should be sending the event to, we'll need the realm object.
This should make it possible for there to safely be multiple Tornado
processes running on different ports on the same system.
It may also fix a rare race bug in development, where previously, it
was possible for the Tornados processes for Casper and the main
development server to interfere; I haven't investigated whether this
was a real bug or not, but now those two services will use independent
Tornado files.
We still need to add something to direct traffic between the different
Tornado processes.
Previously, these timer accounting functions could be easily mistaken
for referring to starting/stopping the request. By adding timer to
the name, we make the code easier for the casual observer to read and
understand.
Historically, queue_json_publish had a special third argument that was
basically its default mock behavior in the test suite. We've been
migrating away from that model, because it was confusing and resulted
in poor test coverage of our queue worker code paths; this was one of
the last holdouts.
As it turns out, we don't exercise this code path in a way that
impacts tests much; the main downside of this change is a likely small
penalty to performance of the full test suite when sending private
messages.
As part of our effort to change the data model away from each user
having a single API key, we're eliminating the couple requests that
were made from Django to Tornado (as part of a /register or home
request) where we used the user's API key grabbed from the database
for authentication.
Instead, we use the (already existing) internal_notify_view
authentication mechanism, which uses the SHARED_SECRET setting for
security, for these requests, and just fetch the user object using
get_user_profile_by_id directly.
Tweaked by Yago to include the new /api/v1/events/internal endpoint in
the exempt_patterns list in test_helpers, since it's an endpoint we call
through Tornado. Also added a couple missing return type annotations.
This adds support to the event queue system for triggering
missed-message notifications (whether push or email) to support the
stream push notifications feature.
Previously, maybe_enqueue_notifications had this very subtle logic,
where it set the notice variable only inside the block for push
notifications, but then also used it inside the block for email
notifications.
This "worked", because previously the conditions for push
notifications were always true if the conditions for email
notifications were, but the code was unnecessarily confusing. The
only good reason to write it this way is if build_offline_notification
was expensive; in fact, the most expensive thing it does is calling
time.time(), so that reason does not apply here.
This was further confusing, in that in the original logic, we relied
on the fact that push notification code path edited the "notice"
dictionary for further processing.
Instead, we just call it separately and setup the data separately in
each code path.
This data will be required for correctly implementing the upcoming
stream_push_notify feature; it also helps support cleaning up the code
for the existing stream mentions logic.
We've for a long time been plagued by run-dev.py needing to be
restarted every time one does a rebase that has merge conflicts,
because the Tornado process restarts itself into a syntax error and
crashes.
This fixes the Tornado autoreload process to check explicitly for
whether files actually syntax-check before trying to actually reload
the Tornado process to run that code.
There are a few things that are a bit janky:
* Ideally, this would go into Tornado upstream
* We removed the `_watched_files` feature, which we weren't using.
* Ideally, we'd use something other than `importlib.reload` that just
does the syntax-check without adjusting the state within our current
process.
Fixes#4351.
We're never going to add tests for this block, which is fundamentally
well-tested code from Django with a since line changed which is hard
to screw up (long-polling will not work at all without it). The hope
is to remove it entirely and replace it with a cleaner monkey-patch,
but until then, unit tests for it would be redundant.
We only use this in the direct management command, and it involves
some autoreload process setup stuff that we probably don't want to do
in our unit tests regardless.
This is basically a simple fix, where we consistently set
`flags` to an empty array when we pass it around. The history
here is that we had kind of a nasty bug from setting it to
`None`, which only showed up in the somewhat obscure circumstance
of somebody subscribing to all stream events in our API.
Fixes#7921
A `None` value is not properly handled in this function, which
indicates some lack of testing or a recent regression we don't
understand. We were getting lots of tracebacks from this line
of code on our test server:
mentioned = 'mentioned' in flags and 'read' not in flags
This method was new in Tornado 4.0. It saves us from having to get
the time ourselves and do the arithmetic -- which not only makes the
code a bit shorter, but also easier to get right. Tornado docs (see
http://www.tornadoweb.org/en/stable/ioloop.html) say we should have
been getting the time from `ioloop.time()` rather than hardcoding
`time.time()`, because the loop could e.g. be running on the
`time.monotonic()` clock.
This commit allows clients to register client_gravatar=True, and
then we recognize that flag for message events. If the flag is
True, we will not calculate gravatar URLs and let the clients do
it themselves. (Clients can calculate gravatar URLs based on
emails with just a little bit of code.)
This refactoring doesn't change behavior, but it sets us up
to more easily handle a register setting for `client_gravatar`,
which will allow clients to tell us they're going to compute
their own gravatar URLs.
The `client_gravatar` flag already exists in our code, but it
is only used for Django views (users/messages) but not for
Zulip events.
The main change is to move the call to `set_sender_avatar` into
`finalize_payload`, which adds the boolean `client_gravatar`
parameter to that function. And then we update various callers
to supply that flag.
One small performance benefit of this change is that we now
lazily compute the client message payloads in
`event_queue.process_message_event` now, so this will improve
performance if all interested clients have the same value of
`apply_markdown`. But the change here is really preparing us
for the additional boolean parameter, which will cause us to
have four variations of the payload.
This test helper method duplicated a bunch of logic in
`zerver/worker/queue_processors.py` in a specialized fashion for the
tests. Now that we're using `call_consume_in_tests` in this code
path, we don't need it.
In do_send_messages, we only produce one dictionary for
the event queues, instead of different flavors for text
vs. html. This prevents two unnecessary queries to the
database.
It also means we only put one dictionary on the "message"
event queue instead of two, albeit a wider one that has
some values that won't be sent to the actual clients.
This wider dictionary from MessageDict.wide_dict is also
used for the `feedback_messages` queue and service bot
queues. Since the extra fields are possibly useful down
the road, and they'll just be ignored for now, we don't
bother to remove them. Also, those queue processors won't
have access to `content_type`, which they shouldn't need.
Fixes#6947
This removes the utterly unnecessary `triggers` dict (which always was
a dict with exactly one value True) in favor of a single field,
'trigger'.
Inspired by Kunal Gupta's work in #6659.
While the missedmessage_hook logic originally did a reasonably good
job of avoiding double-sending notifications, there was a corner case
it didn't handle, namely a user who had been presence-idle when a
message was sent and became also event-queue-idle as well within the
next 10 minutes. For those users, they got a notification at message
send time, and the missedmessage_hook would deliver it a second time.
We fix this by just checking the conveniently available push_notified
and email_notified variables that indicate whether the message already
had a notification triggered.
Fixes#7031.
This makes tests of queue processors more realistic,
by adding a parameter to `queue_json_publish` that
calls a queue's consumer function if accessed in a test.
Fixes part of #6542.
This field would get overwritten with an improper value when
we looped over multiple clients, due to not making full copies
of the message dictionary. This failure would be somewhat
random depending on how clients were ordered in the loop.
The only consumers of this field were the mobile app and the
apply-events-to-unread-counts logic. Both of these will now
use `flags` instead.
We now do push notifications and missed message emails
for offline users who are subscribed to the stream for
a message that has been edited, but we short circuit
the offline-notification logic for any user who presumably
would have already received a notification on the original
message.
This effectively boils down to sending notifications to newly
mentioned users. The motivating use case here is that you
forget to mention somebody in a message, and then you edit
the message to mention the person. If they are offline, they
will now get pushed notifications and missed message emails,
with some minor caveats.
We try to mostly use the same techniques here as the
send-message code path, and we share common code with the
send-message path once we get to the Tornado layer and call
maybe_enqueue_notifications.
The major places where we differ are in a function called
maybe_enqueue_notifications_for_message_update, and the top
of that function short circuits a bunch of cases where we
can mostly assume that the original message had an offline
notification.
We can expect a couple changes in the future:
* Requirements may change here, and it might make sense
to send offline notifications on the update side even
in circumstances where the original message had a
notification.
* We may track more notifications in a DB model, which
may simplify our short-circuit logic.
In the view/action layer, we already had two separate codepaths
for send-message and update-message, but this mostly echoes
what the send-message path does in terms of collecting data
about recipients.
This ensures that as we expand the logic for under what circumstances
email and push notifications should be sent, we can be confident about
this code path always doing the right thing.
This fixes a problem introduced in the recent refactoring where
`triggers` would not be set correctly when a push or email
notification was triggered by missedmessage_hook.
Fixes#6612.
Now, the two code paths do the same thing for this check.
It seems like there may be more work to do here, in that
wildcard_mentioned messages seem to not be eligible for sending
email/push notifications. We probably want to add some logic there
for the user doing the mention to control whether or not it does.
This is a nonfunctional refactor, designed primarily to make it
simpler to extend this code path when we later add support for
controlling whether email notifications go out on stream messages.
Previously, due to a logic bug, this feature would also send email
notifications for all messages on the stream, which is definitely not
the intent. The recent refactoring we just did makes the logic more
obvious.
This creates a lot of logging noise, and also causes confusion
for new contributors when something isn't working as they expect
and they aren't sure if this message is normal or an error.
Usually a small minority of users are eligible to receive missed
message emails or mobile notifications.
We now filter users first before hitting UserPresence to find idle
users. We also simply check for the existence of recent activity
rather than borrowing the more complicated data structures that we
use for the buddy list.
Because the Redis client returns exclusively bytes -- even for
hash keys -- even on Python 3, the test `'response' in status`
was always returning false, and the line that tries to decode
as JSON was never running, so we were passing `response`
through as a `bytes` object encoding some JSON.
I'm not sure what the impact of this bug was, and in particular
whether something downstream would have fudged it to make up for
this error.