Since essentially the first use of Tornado in Zulip, we've been
maintaining our Tornado+Django system, AsyncDjangoHandler, with
several hundred lines of Django code copied into it.
The goal for that code was simple: We wanted a way to use our Django
middleware (for code sharing reasons) inside a Tornado process (since
we wanted to use Tornado for our async events system).
As part of the Django 2.2.x upgrade, I looked at upgrading this
implementation to be based off modern Django, and it's definitely
possible to do that:
* Continue forking load_middleware to save response middleware.
* Continue manually running the Django response middleware.
* Continue working out a hack involving copying all of _get_response
to change a couple lines allowing us our Tornado code to not
actually return the Django HttpResponse so we can long-poll. The
previous hack of returning None stopped being viable with the Django 2.2
MiddlewareMixin.__call__ implementation.
But I decided to take this opportunity to look at trying to avoid
copying material Django code, and there is a way to do it:
* Replace RespondAsynchronously with a response.asynchronous attribute
on the HttpResponse; this allows Django to run its normal plumbing
happily in a way that should be stable over time, and then we
proceed to discard the response inside the Tornado `get()` method to
implement long-polling. (Better yet might be raising an
exception?). This lets us eliminate maintaining a patched copy of
_get_response.
* Removing the @asynchronous decorator, which didn't add anything now
that we only have one API endpoint backend (with two frontend call
points) that could call into this. Combined with the last bullet,
this lets us remove a significant hack from our
never_cache_responses function.
* Calling the normal Django `get_response` method from zulip_finish
after creating a duplicate request to process, rather than writing
totally custom code to do that. This lets us eliminate maintaining
a patched copy of Django's load_middleware.
* Adding detailed comments explaining how this is supposed to work,
what problems we encounter, and how we solve various problems, which
is critical to being able to modify this code in the future.
A key advantage of these changes is that the exact same code should
work on Django 1.11, Django 2.2, and Django 3.x, because we're no
longer copying large blocks of core Django code and thus should be
much less vulnerable to refactors.
There may be a modest performance downside, in that we now run both
request and response middleware twice when longpolling (once for the
request we discard). We may be able to avoid the expensive part of
it, Zulip's own request/response middleware, with a bit of additional
custom code to save work for requests where we're planning to discard
the response. Profiling will be important to understanding what's
worth doing here.
This fixes a bug where our asynchronous requests were only copying the
Content-Type header (i.e. the one case where we're noticed) from the
Django HttpResponse. I'm not sure what the impact of this would be;
the rate-limiting headers rarely come up when breaking a long-polled
request. But it seems clearly an improvement to do this in a
consistent fashion.
Only the headers piece is a change; in Tornado
self.finish(x)
is equivalent to:
self.write(x)
self.finish()
In e3ad9baf1d, we introduced yet another
bug where we incorrectly shared event dictionaries between multiple
queues.
Fortunately, the logging that reports on "event was not in the queue"
issues worked and detected this on chat.zulip.org, but this is a clear
indication that the comments we have around this system were not
sufficient to produce correct behavior.
We fix this by changing event_queue.push, the code that mutates the
event dictionaries, to do the shallow copies itself. The only
downside here is process_message_event, a relatively low-traffic code
path, does an extra per-queue dictionary copy. Given that presence,
heartbeat, and message reading events are likely more traffic and
dealing with HTTP is likely much more expensive than a dictionary
copy, this probably doesn't matter performance-wise.
(And if profiling later finds it is, there are potential workarounds
like passing a skip_copy argument we can do).
This flag affects page_params and the
payload you get back from POSTs to this
url:
users/me/presence
The flag does not yet affect the
presence events that get sent to a
client.
Zulip has had a small use of WebSockets (specifically, for the code
path of sending messages, via the webapp only) since ~2013. We
originally added this use of WebSockets in the hope that the latency
benefits of doing so would allow us to avoid implementing a markdown
local echo; they were not. Further, HTTP/2 may have eliminated the
latency difference we hoped to exploit by using WebSockets in any
case.
While we’d originally imagined using WebSockets for other endpoints,
there was never a good justification for moving more components to the
WebSockets system.
This WebSockets code path had a lot of downsides/complexity,
including:
* The messy hack involving constructing an emulated request object to
hook into doing Django requests.
* The `message_senders` queue processor system, which increases RAM
needs and must be provisioned independently from the rest of the
server).
* A duplicate check_send_receive_time Nagios test specific to
WebSockets.
* The requirement for users to have their firewalls/NATs allow
WebSocket connections, and a setting to disable them for networks
where WebSockets don’t work.
* Dependencies on the SockJS family of libraries, which has at times
been poorly maintained, and periodically throws random JavaScript
exceptions in our production environments without a deep enough
traceback to effectively investigate.
* A total of about 1600 lines of our code related to the feature.
* Increased load on the Tornado system, especially around a Zulip
server restart, and especially for large installations like
zulipchat.com, resulting in extra delay before messages can be sent
again.
As detailed in
https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it
appears that removing WebSockets moderately increases the time it
takes for the `send_message` API query to return from the server, but
does not significantly change the time between when a message is sent
and when it is received by clients. We don’t understand the reason
for that change (suggesting the possibility of a measurement error),
and even if it is a real change, we consider that potential small
latency regression to be acceptable.
If we later want WebSockets, we’ll likely want to just use Django
Channels.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This change makes it possible for users to control the notification
settings for wildcard mentions as a separate control from PMs and
direct @-mentions.
Eventually, we'll want to replace emails with user IDs here entirely,
but until we make that happen, we should at least use the same email
address present in our other logging.
I think we won't miss updating these in a future migration thanks to
mypy types.
Since years ago, this field hasn't been used for anything other than
some logging that would be better off logging the user ID anyway.
It existed in the first place simply because we weren't passing the
user_profile_id to Tornado at all.
Then, find and fix a predictable number of previous misuses.
With a small change by tabbott to preserve backwards compatibility for
sending `yes` for the `forged` field.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Most of the failures were due to parameters that are not intended to
be used by third-party code, so the correct fix for those was the set
intentionally_undocumented=True.
Fixes#12969.
Historically, Zulip's implementation of wildcard mentions never
triggered either email or push notifications, instead being limited to
desktop notifications and the "mentions" counter.
We fix this just by plumbing the "wildcard_mentioned" flag through our
system.
Implements much of
https://github.com/zulip/zulip/issues/6040#issuecomment-510157264.
We're also now ready to seriously work on #3750.
Apparently, our edit-message events did not guarantee that the outer
wrapper dictionary, which is intended to be unique for each client,
was unique for every client (instead only ensuring it was unique for
each user).
This led to clients unexpectedly getting last_event_id validation
errors in this code path when a user had multiple connected clients,
because the linear ordering of event IDs within a given queue was
corrupted.
In fd2a63b049, we accidentally fixed
this issue with a different set of userdata events, without fixing the
edit-message event bug. This commit fixes the remaining issue.
Previous cleanups (mostly the removals of Python __future__ imports)
were done in a way that introduced leading newlines. Delete leading
newlines from all files, except static/assets/zulip-emoji/NOTICE,
which is a verbatim copy of the Apache 2.0 license.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Apparently, our edit-message events did not guarantee that the outer
wrapper dictionary, which is intended to be unique for each client,
was unique for every client (instead only ensuring it was unique for
each user).
This led to clients unexpectedly getting last_event_id validation
errors in this code path when a user had multiple connected clients,
because the linear ordering of event IDs within a given queue was
corrupted.
This gives us access to typing_extensions.Deque, which was not added
to typing until 3.5.4.
(PROVISION_VERSION is not bumped because the transitive dependency set
in dev.txt hasn’t changed.)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This verifies that the client passed a last_event_id that actually
came from the queue instead of making up an ID from the future. It
turns out one of our tests was making up such an ID, but legitimate
clients are expected not to do so.
The previous version of this commit (commit
e00d4be6d5, #12888) had to be reverted
(commit b86c5cc490) because it was
missing the `to_dict`/`from_dict` migration code.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This fixes two issues:
* The syntax check logic we had for zerver.tornado.autoreload would
end up clearing _reload_hooks if one of the files that had changed
was zerver.tornado.autoreload itself (because we'd had re-imported
the current module), which could be incredibly confusing when trying
to test the autoreload logic. It seems better to just not run the
syntax check for syntax errors in this file.
Similarly, because reloading event_queue.py would destroy the state
in the queues, we avoid that as well.
* We make sure to flush stdout after running and reload hooks, to make
sure their output reaches the user.
We were apparently not running our own forked Tornado autoreload
library when adding reload hooks, which meant that our autoreload
hooks didn't run at all.
This fixes an issue that made dump_event_queues never run and thus the
local development environment difficult to use for testing event queues.
This verifies that the client passed a last_event_id that actually
came from the queue instead of making up an ID from the future. It
turns out one of our tests was making up such an ID, but legitimate
clients are expected not to do so.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This makes it more convenient for developers to set very short values
for this (e.g. 1 minute) for the purposes of testing/debugging; there
aren't obvious problems with letting users set short values for this.
This is a preparator refactor for supporting hosting different Tornado
processes on different servers; to look up which Tornado server we
should be sending the event to, we'll need the realm object.
This should make it possible for there to safely be multiple Tornado
processes running on different ports on the same system.
It may also fix a rare race bug in development, where previously, it
was possible for the Tornados processes for Casper and the main
development server to interfere; I haven't investigated whether this
was a real bug or not, but now those two services will use independent
Tornado files.
We still need to add something to direct traffic between the different
Tornado processes.