Without this change, pyflakes reports this exception:
pyflakes | zerver/worker/queue_processors.py:152:9 local variable 'e' is assigned to but never used
pyflakes | zerver/worker/queue_processors.py:155:81 undefined name 'e'
We use the EMAIL_TIMEOUT django setting to timeout after 15s of trying
to send an email. This will nicely lead to retries in the email_senders
queue, due to the retry_send_email_failures decorator.
smtlib documentation suggests that socket.timeout can be raised as the
result of timing out, so in attempts I'm getting
smtplib.SMTPServerDisconnected. Either way, seems appropriate to add
socket.timeout to the exception that we catch.
The most import change here is the one in maybe_send_to_registration
codepath, as the insufficient validation there could lead to fetching
an expired PreregistrationUser that was invited as an administrator
admin even years ago, leading to this registration ending up in the
new user being a realm administrator.
Combined with the buggy migration in
0198_preregistrationuser_invited_as.py, this led to users incorrectly
joining as organizations administrators by accident. But even without
that bug, this issue could have allowed a user who was invited as an
administrator but then had that invitation expire and then joined via
social authentication incorrectly join as an organization administrator.
The second change is in ConfirmationEmailWorker, where this wasn't a
security problem, but if the server was stopped for long enough, with
some invites to send out email for in the queue, then after starting it
up again, the queue worker would send out emails for invites that
had already expired.
Fixes#2665.
Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.
Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start. I expect this change will increase pressure for us to split
those files, which isn't a bad thing.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Automatically generated by the following script, based on the output
of lint with flake8-comma:
import re
import sys
last_filename = None
last_row = None
lines = []
for msg in sys.stdin:
m = re.match(
r"\x1b\[35mflake8 \|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg
)
if m:
filename, row_str, col_str, err = m.groups()
row, col = int(row_str), int(col_str)
if filename == last_filename:
assert last_row != row
else:
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
with open(filename) as f:
lines = f.readlines()
last_filename = filename
last_row = row
line = lines[row - 1]
if err in ["C812", "C815"]:
lines[row - 1] = line[: col - 1] + "," + line[col - 1 :]
elif err in ["C819"]:
assert line[col - 2] == ","
lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ")
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This commit adds three `.pysa` model files: `false_positives.pysa`
for ruling out false positive flows with `Sanitize` annotations,
`req_lib.pysa` for educating pysa about Zulip's `REQ()` pattern for
extracting user input, and `redirects.pysa` for capturing the risk
of open redirects within Zulip code. Additionally, this commit
introduces `mark_sanitized`, an identity function which can be used
to selectively clear taint in cases where `Sanitize` models will not
work. This commit also puts `mark_sanitized` to work removing known
false postive flows.
Generated by pyupgrade --py36-plus --keep-percent-format, but with the
NamedTuple changes reverted (see commit
ba7906a3c6, #15132).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This saves the completely unnecessary work of mapping the Client name
to its ID. Because we had in-process caching of the immutable Client
objects, this isn't a material performance win, but it will eventually
let us delete that caching logic and have a simpler system.
While this functionality to post slow queries to a Zulip stream was
very useful in the early days of Zulip, when there were only a few
hundred accounts, it's long since been useless since (1) the total
request volume on larger Zulip servers run by Zulip developers, and
(2) other server operators don't want real-time notifications of slow
backend queries. The right structure for this is just a log file.
We get rid of the queue and replace it with a "zulip.slow_queries"
logger, which will still log to /var/log/zulip/slow_queries.log for
ease of access to this information and propagate to the other logging
handlers. Reducing the amount of queues is good for lowering zulip's
memory footprint and restart performance, since we run at least one
dedicated queue worker process for each one in most configurations.
Prior to this change, there were reports of 500s in
production due to `export.extra_data` being a
Nonetype. This was reproducible using the s3
backend in development when a row was created in
the `RealmAuditLog` table, but the export failed in
the `DeferredWorker`. This left an entry lying
about that was never updated with an `extra_data`
field.
To fix this, we catch any exceptions in the
`DeferredWorker`, and then update `extra_data` to
encode the failure. We also fix the fact that we
never updated the export UI table with pending exports.
These changes also negated the use for the somewhat
hacky `clear_success_banner` logic.
We've had a bug for a while that if any ScheduledEmail objects get
created with the wrong email sender address, even after the sysadmin
corrects the problem, they'll still get errors because of the objects
stored with the wrong format.
We solve this by using FromAddress placeholders strings in
send_future_email function, so that ScheduledEmail objects end up
setting the final `from_address` value when mail is actually sent
using the setting in effect at that time.
Fixes#11008.
Several of our queues are capable of doing work that includes
rendering markdown (outgoing_webhook, embedded_bots, embed_links, and
email_mirror). As a result, it's essential that these don't cache
per-request data (specifically, realm filters) longer than they
should, making editing/deleting linkifiers potentially use old
settings until the relevant process was restarted.
Flushing these caches is extremely cheap (just clearing two
dictionaries) and thus is reasonable to do after every queue event,
rather than trying to do it only the ~1/3 of queues that specifically
do markdown processing. We do the same in our middleware for
reset_queries.
It's not worth writing a test for this because it's very difficult to
create the test setup situation for this bug with a single test worker
process; one needs to edit the linkifier configuration in a different
process than the one sending the message in order to see the bug.
This was a much larger visible bug on Zulip 2.1.x, where the presence
of the message_sender queue meant that this would apply to messages
sent via a browser.
Fixes#14095.
Note that while the test mocks the actual message
send, we now have a `get_stream` call in the queue
worker, so we have to set up a real stream for
testing (or we could have mocked that as well, but
it didn't seem necessary). The setup queries add
to the amount of queries reported by the test,
plus the `get_stream` call. I just made the
query count a digits regex, which is a little bit
lame, but I don't think it's worth risking test
flakes for this.
This legacy cross-realm bot hasn't been used in several years, as far
as I know. If we wanted to re-introduce it, I'd want to implement it
as an embedded bot using those common APIs, rather than the totally
custom hacky code used for it that involves unnecessary queue workers
and similar details.
Fixes#13533.
Zulip has had a small use of WebSockets (specifically, for the code
path of sending messages, via the webapp only) since ~2013. We
originally added this use of WebSockets in the hope that the latency
benefits of doing so would allow us to avoid implementing a markdown
local echo; they were not. Further, HTTP/2 may have eliminated the
latency difference we hoped to exploit by using WebSockets in any
case.
While we’d originally imagined using WebSockets for other endpoints,
there was never a good justification for moving more components to the
WebSockets system.
This WebSockets code path had a lot of downsides/complexity,
including:
* The messy hack involving constructing an emulated request object to
hook into doing Django requests.
* The `message_senders` queue processor system, which increases RAM
needs and must be provisioned independently from the rest of the
server).
* A duplicate check_send_receive_time Nagios test specific to
WebSockets.
* The requirement for users to have their firewalls/NATs allow
WebSocket connections, and a setting to disable them for networks
where WebSockets don’t work.
* Dependencies on the SockJS family of libraries, which has at times
been poorly maintained, and periodically throws random JavaScript
exceptions in our production environments without a deep enough
traceback to effectively investigate.
* A total of about 1600 lines of our code related to the feature.
* Increased load on the Tornado system, especially around a Zulip
server restart, and especially for large installations like
zulipchat.com, resulting in extra delay before messages can be sent
again.
As detailed in
https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it
appears that removing WebSockets moderately increases the time it
takes for the `send_message` API query to return from the server, but
does not significantly change the time between when a message is sent
and when it is received by clients. We don’t understand the reason
for that change (suggesting the possibility of a measurement error),
and even if it is a real change, we consider that potential small
latency regression to be acceptable.
If we later want WebSockets, we’ll likely want to just use Django
Channels.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Addresses point 1 of #13533.
MissedMessageEmailAddress objects get tied to the specific that was
missed by the user. A useful benefit of that is that email message sent
to that address will handle topic changes - if the message that was
missed gets its topic changed, the email response will get posted under
the new topic, while in the old model it would get posted under the
old topic, which could potentially be confusing.
Migrating redis data to this new model is a bit tricky, so the migration
code has comments explaining some of the compromises made there, and
test_migrations.py tests handling of the various possible cases that
could arise.
QueueProcessingWorker and LoopQueueProcessingWorker are abstract classes
meant to be subclassed by a class that will define its own consume()
or consume_batch() method. ABCs are suited for that and we can tag
consume/consume_batch with the @abstractmethod wrapper which will
prevent subclasses that don't define these methods properly to be
impossible to even instantiate (as opposed to only crashing once
consume() is called). It's also nicely detected by mypy, which will
throw errors such as this on invalid use:
error: Only concrete class can be given where "Type[TestWorker]" is
expected
error: Cannot instantiate abstract class 'TestWorker' with abstract
attribute 'consume'
Due to it being detected by mypy, we can remove the test
test_worker_noconsume which just tested the old version of this -
raising an exception when the unimplemented consume() gets called. Now
it can be handled already on the linter level.
LoopQueueProcessingWorker can handle exceptions inside consume_batch in
a similar manner to how QueueProcessingWorker handles exceptions inside
consume.
We use the plumbing introduced in a previous commit, to now raise
PushNotificationBouncerRetryLaterError in send_to_push_bouncer in case
of issues with talking to the bouncer server. That's a better way of
dealing with the errors than the previous approach of returning a
"failed" boolean, which generally wasn't checked in the code anyway and
did nothing.
The PushNotificationBouncerRetryLaterError exception will be nicely
handled by queue processors to retry sending again, and due to being a
JsonableError, it will also communicate the error to API users.
We add PushNotificationBouncerRetryLaterError as an exception to signal
an error occurred when trying to communicate with the bouncer and it
should be retried. We use JsonableError as the base class, because this
signal will need to work in two roles:
1. When the push notification was being issued by the queue worker
PushNotificationsWorker, it will signal to the worker to requeue the
event and try again later.
2. The exception will also possibly be raised (this will be added in the
next commit) on codepaths coming from a request to an API endpoint (for
example to add a token, to users/me/apns_device_token). In that case,
it'll be needed to provide a good error to the API user - and basing
this exception on JsonableError will allow that.
This includes adding a new endpoint to the push notification bouncer
interface, and code to call it appropriately after resetting a user's
personal API key.
When we add support for a user having multiple API keys, we may need
to add an additional key here to support removing keys associated with
just one client.
A confirmation object is already created when
do_send_confirmation_email is called just above.
Tweaked by tabbott to remove an unnecessary somewhat hacky database
query.
This should dramatically improve the queue processor's performance in
cases where there's a very high volume of requests on a given endpoint
by a given user, as described in the new docstring.
Until we test this more broadly in production, we won't know if this
is a full solution to the problem, but I think it's likely. We've
never seen the UserActivityInterval worker end up backlogged without a
total queue processor outage, and it should have a similar workload.
Fixes#13180.
We don't actually need to go to the memcached (falling back to the
database) to fetch either user or client objects on every event. For
user objects, we actually can just pass through the user ID
transparently; for client objects, we can use an in-process cache,
since the mapping of string to ID never changes.
Hopefully this does a better job of spurring people to action, and also
suggests a self-service fix if they don't (i.e. contacting the person that
invited them).
zerver/openapi/python_examples.py:105: error: Argument 1 to "get_user_presence" of "Client" has incompatible type "str"; expected "Dict[str, Any]"
zerver/openapi/python_examples.py:563: error: Argument 1 to "add_reaction" of "Client" has incompatible type "Dict[str, object]"; expected "Dict[str, str]"
zerver/openapi/python_examples.py:576: error: Argument 1 to "remove_reaction" of "Client" has incompatible type "Dict[str, object]"; expected "Dict[str, str]"
zerver/worker/queue_processors.py:587: error: Argument "client" to "extract_query_without_mention" has incompatible type "EmbeddedBotHandler"; expected "ExternalBotHandler"
These were only missed because mypy daemon mode requires us to set
`follow_imports = skip` for the `zulip` package.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This feature is intended to cover all of our ways of exporting a
realm, not just the initial "public export" feature, so we should name
things appropriately for that goal.
Additionally, we don't want to include data exports in page_params;
the original implementation was actually buggy and would have.
We were seeing errors when pubishing typical events in the form of
`Dict[str, Any]` as the expected type to be a `Union`. So we instead
change the only non-dictionary call, to pass a dict instead of `str`.
The RealmAuditLog object ID was stored in the event sent to the
deferred_work queue as a means to update the row's extra_data field.
The extra_data field then stores the location of the export.
A unique path was created using the `LOCAL_UPLOADS_DIR` backend, similar
to the code used in `LocalUploadBackend`. The exported tarball was
copied to the directory, and an nginx url was created to serve the file
publicly.
Tweaked by tabbott to output an actual URL.
The upload option will no longer be limited to strictly S3 uploads. This
commit serves as a preliminary step for supporting LOCAL_UPLOADS_DIR as
part of the public only export feature.
An endpoint was created in zerver/views. Basic rate-limiting was
implemented using RealmAuditLog. The idea here is to simply log each
export event as a realm_exported event. The number of events
occurring in the time delta is checked to ensure that the weekly
limit is not exceeded.
The event is published to the 'deferred_work' queue processor to
prevent the export process from being killed after 60s.
Upon completion of the export the realm admin(s) are notified.
Closes#2420
We add rate limiting (max X emails withing Y seconds per realm) to the
email mirror. By creating RateLimitedRealmMirror class, inheriting from
RateLimitedObject, and rate_limit_mirror_by_realm function, following a
mechanism used by rate_limit_user, we're able to have this
implementation mostly rely on the already existing, and proven over
time, rate_limiter.py code. The rules are configurable in settings.py in
RATE_LIMITING_MIRROR_REALM_RULES, analogically to RATE_LIMITING_RULES.
Rate limit verification happens in the MirrorWorker in
queue_processors.py. We don't rate limit missed message emails, as due
to using one time addresses, they're not a spam threat.
test_mirror_worker is adapted to the altered MirrorWorker code and a new
test - test_mirror_worker_rate_limiting is added in test_queue_worker.py
to provide coverage for these changes.
When a bunch of messages with active notifications are all read at
once -- e.g. by the user choosing to mark all messages, or all in a
stream, as read, or just scrolling quickly through a PM conversation
-- there can be a large batch of this information to convey. Doing it
in a single GCM/FCM message is better for server congestion, and for
the device's battery.
The corresponding client-side logic is in zulip/zulip-mobile#3343 .
Existing clients today only understand one message ID at a time; so
accommodate them by sending individual GCM/FCM messages up to an
arbitrary threshold, with the rest only as a batch.
Also add an explicit test for this logic. The existing tests
that happen to cause this function to run don't exercise the
last condition, so without a new test `--coverage` complains.
This adds language paramater to send_future_email. As a result, this
properly internationalizes invitation reminder emails, by passing
correct language into send_future_email.
Fixes#11240.
Apparently, we have a second code path where we might try to call
send_email library functions on old data, namely in the
queue_processors codebase. So we apply the same migration logic here.
This adds a function that sends provided email to all administrators
of a realm, but in a single email. As a result, send_email now takes
arguments to_user_ids and to_emails instead of to_user_id and
to_email.
We adjust other APIs to match, but note that send_future_email does
not yet support the multiple recipients model for good reasons.
Tweaked by tabbott to modify `manage.py deliver_email` to handle
backwards-compatibily for any ScheduledEmail objects already in the
database.
Fixes#10896.
This library was absolutely essential as part of our Python 2->3
migration process, but all of its calls should be either no-ops or
encode/decode operations.
Note also that the library has been wrong since the incorrect
refactoring in 1f9244e060.
Fixes#10807.
While it could make sense to print these logging statements at WARN
level on server startup, it doesn't make sense to do so on every
message (though it perhaps did make sense to do so before more recent
changes added good ways to discover you forgot to configure push
notifications).
Instead, we now just do a WARN log on queue processor startup, and
then at DEBUG level for individual messages.
Fixes#10894.
Previously, MissedMessageWorker used a batching strategy of just
grabbing all the events from the last 2 minutes, and then sending them
off as emails. This suffered from the problem that you had a random
time, between 0s and 120s, to edit your message before it would be
sent out via an email.
Additionally, this made the queue had to monitor, because it was
expected to pile up large numbers of events, even if everything was
fine.
We fix this by batching together the events using a timer; the queue
processor itself just tracks the items, and then a timer-handler
process takes care of ensuring that the emails get sent at least 120s
(and at most 130s) after the first triggering message was sent in Zulip.
This introduces a new unpleasant bug, namely that when we restart a
Zulip server, we can now lose some missed_message email events;
further work is required on this point.
Fixes#6839.
This fixes a couple things:
* process_event() is a pretty vague name
* returning tuples should generally be avoided
* we were producing the same REST parameters in both
subclasses
* relative_url_path was always blank
* request_kwargs was always empty
Now process_event() is called build_bot_request(),
and it only returns request data,
not a tuple of `rest_operation` and `request_data`.
By no longer returning `rest_operation`, there are
fewer moving parts. We just have `do_rest_call` make
a POST call.
Before this change, we instantiated base_url into a superclass
of subclasses that returned base_url into a dictionary that
gets returned to our caller.
Now we just pull base_url out of service when we need to make
the REST call.
This uses the recently introduced active_mobile_push_notification
flag; messages that have had a mobile push notification sent will have
a removal push notification sent as soon as they are marked as read.
Note that this feature is behind a setting,
SEND_REMOVE_PUSH_NOTIFICATIONS, since the notification format is not
supported by the mobile apps yet, and we want to give a grace period
before we start sending notifications that appear as (null) to
clients. But the tracking logic to maintain the set of message IDs
with an active push notification runs unconditionally.
This is designed with at-least-once semantics; so mobile clients need
to handle the possibility that they receive duplicat requests to
remove a push notification.
We reuse the existing missedmessage_mobile_notifications queue
processor for the work, to avoid materially impacting the latency of
marking messages as read.
Fixes#7459, though we'll need to open a follow-up issue for
using these data on iOS.
Private messages are not supported in Slack-format webhook.
Instead of raising a NotImplementedError, we warn the user
that PM service is not supported by sending a message to the
user.
Added tests for the same.
Fixes#9239
We already had a setting for whether these logs were enabled; now it
also controls which stream the messages go to.
As part of this migration, we disable the feature in dev/production by
default; it's not useful for most environments.
Fixes the proximal data-export issue reported in #10078 (namely, a
stream with nobody ever subscribed to having been created).
Slow queries during backend tests sends messages to Error Bot
which affects the database state causing the tests to fail.
This fixes the occasional flakes due to that.
These two classes are tricky to test, and nocoverage-ing them
allows us to mark queue_processors.py as fully covered. We
still want to cover these two workers at some point, but for
now, it's nice to enforce full coverage for any future changes
to queue_processors.py.
Fixes (sort of) #6542.
We already check in get_service_bot_events() if a bot is mentioned,
and then only pass on the call to the bot handler if it is. The
commit removes the additional check in the embedded bot queue
processor simply because it is impossible to obtain test coverage
for it (there is no meaningful way to trigger the content of the
if-clause, because there will never be a message reaching the bot
without @-mentioning it.
To alleviate the danger of a potential regression, the check is not
removed completely, but rather replaced by an assert statement.
Revert c8f034e9a "queue: Remove missedmessage_email_senders code."
As the comment in the code says, it ensures a smooth upgrade path
from 1.7.x; we can delete it in master after 1.8.0 is released.
The removal commit was merged early due to a communication failure.
External bots may call bot_handler.quit() when
they wish to terminate, e.g. due to a misconfiguration.
Currently, embedded bots ignore calls to quit(), even
though they signal a problem. This commit does the first
step in handling quit() calls by logging a warning.
If an exception was thrown inside `send_email` resulting in a retry,
we would include the `failed_tries` data in the event, which turned
out to thrown an exception itself.
This fixes that flow, including deepening the test so that it would
fail if we didn't have the new logic.
This commit just copies all the code from MissedMessageSendingWorker
class to a new EmailSendingWorker class. All the logic to send an email
through a queue was already there. This commit only makes the logic
generic. It does so by creating a special purpose queue called
'email_senders' to send any type of email. To make
MissedMessageSendingWorker still work we derive it from
EmailSendingWorker. All the tests that were testing
MissedMessageSendingWorker now run against EmailSendingWorker.
The original logic is buggy now that emails can belong to (and be
invited to) multiple realms.
The new logic in the `invites` queue worker also avoids the bug where
when the PreregistrationUser was gone by the time the queue worker got
to the invite (e.g., because it'd been revoked), we threw an exception.
[greg: fix upgrade-compatibility logic; add test; explain
revoked-invite race above]
Because we use access_stream_by_id here, and that checks for an active
subscription to interact with a private stream, this didn't work.
The correct fix to add an option to active_stream_by_id to accept an
argument indicating whether we need an active subscription; for this
use case, we definitely do not.
There's one migration required by this release:
* queue_processors: Stop passing state_handler to handle_message.
state_handler is now a property of bot_handler and thus, does
not need to be passed to bot_handler.handle_message().
The commit responsible is:
2a74ad11c5
This fixes a bug where, when a user is unsubscribed from a stream,
they might have unread messages on that stream leak. While it might
seem to be a minor problem, it can cause significant problems for
computing the `unread_msgs` data structures, since it means we need to
add an extra filter for whether the user is still subscribed, either
in the backend or in the UI.
Fixes#7095.
This fixes a regression in 25c669df52.
We were draining the queue in both the superclass and the subclass,
so by the time the subclass started processing events, there were
no events to process. Now the subclass properly uses the events
passed in from the superclass.
The os.mkdir call is straightforward and doesn't testing.
Workers relying on LoopQueueProcessingWorker are tested
with its consume method that exists solely for this purpose.
This should help make it easier to add tests coverage for these queue
processors, since they now use a system more similar to the other
queue processors.
This replaces the former non-functional StateHandler
stub with a dictionary-like state object. Accessing it will
will read and store strings in the BotUserStateData model.
Each bot has a limited state size. To enforce this limit while
keeping data updates efficient, StateHandler caches the expensive
query for getting a bot's total state size. Assignments to a key
then only need to fetch that entry's previous size, if any, and
compare it to the new entry's size.
Previously, invitation reminder emails were only being cleared after a
successful signup if newsletter_data was available, since that was the
circumstance in which we were calling the relevant queue processor
code. Now, we (1) clear them when a human user finishes signing up
and (2) correctly clear them using the 'address' field of
ScheduleEmail, not user_id.
There is no reason for either render_incoming_message() or
render_markdown() to require full UserProfile objects just to
triage alert words.
By only asking for user_ids, we save extra queries in two
callpaths and we make it easier to start using user_ids in
do_send_messages().
Add test to check if the embedded bot service being used is in the
registry or not.
Add test to check if the bot being added to the registry has a valid
bot corresponding to it.
Move 'get_bot_handler' to 'zerver/lib/bot_lib.py' as it is an independent
function, not related to the 'EmbeddedBotWorker' class that it was
previously a part of.
This fixes some error message strings and skips converting request_data
into json. From now, conversion would be the responsibility of interface.
Also, base_url is now not passed into event structure.
Splitting bot_lib.py file into 2 files led to unnecessary
redirection of the code workflow. For an embedded bot/service to
send a reply, it was being redirected 3 times.
First, the code flow comes to "EmbeddedBotHandler" class to send
reply, then it goes to the common function in "zulip_bots/lib.py",
then it would come back to "EmbeddedBotHandler". Later on, if we
create an abstract class, from where the bot work flow would
directly hit and then from there it is classified into
EmbeddedBotHandler or ExternalBotHandler and accordingly it would
get redirected.
Now, first the bot flow goes to it's handler class External or
Embedded (where we pass that this is External or Embedded bot as
parameter) and then goes to a common point and then comes back to
the same class.
This is required, since we just reorganized the python-zulip-api
repository into 3 packages.
A nice side effect is that we get to eliminate some now-unnecessary
code for editing sys.path.