Commit Graph

312 Commits

Author SHA1 Message Date
Mateusz Mandera 9406bfbc0a analytics: Store realm disk space used as a CountStat.
Fixes #29632.

The issue description explains this well:

We currently recalculate `currently_used_upload_space_bytes` every file
upload, by dint of calling `flush_used_upload_space_cache`  on
save/delete, and then immediately calling
`user_profile.realm.currently_used_upload_space_bytes()` in
`notify_attachment_update`.  Since this walks the Attachments table,
recalculating this can take seconds in large realms.

Switch this to using a CountStat, so we don't need to walk significant
chunks of the Attachment table when we upload an attachment.  This will
also give us a historical daily graph of usage.
2024-05-09 10:54:44 -07:00
Anders Kaseorg 96fbe060a6 python: Mark regexes as raw strings.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-04-26 12:30:31 -07:00
Mateusz Mandera 8038e2322c realm: Change implementation approach for upload_quota_gb.
Most importantly, fixes a bug where a realm with a custom
.upload_quota_gb value (set by changing it in the database via e.g.
manage.py shell) would end up having it lowered while upgrading their
plan via the do_change_realm_plan_type function, which used to just set
it to the value implied by the new plan without caring about whether
that isn't lower than the original limit.

The new approach is cleaner since we don't do db queries by
upload_quota_gb so it's nicer to just generate these dynamically, making
changes to our limit-per-plan rules much easier - skipping the need for
migrations.
2024-04-15 15:08:56 -07:00
Alex Vandiver 7f46773ef1 tests: Clear in-memory Client caches before testing query counts.
This makes counts more apples-to-apples comparable when run
back-to-back.
2024-02-14 12:27:03 -08:00
Anders Kaseorg cff0b78771 models: Move some functions to zerver.lib.attachments.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-16 22:08:44 -08:00
Anders Kaseorg cd96193768 models: Extract zerver.models.realms.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-16 22:08:44 -08:00
Anders Kaseorg 45bb8d2580 models: Extract zerver.models.users.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-16 22:08:44 -08:00
Anders Kaseorg 3853fa875a python: Consistently use from…import for urllib.parse.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-05 13:03:07 -08:00
Alex Vandiver 82960d9bc2 upload: Redirect unauthorized anonymous requests to login.
Note that this also redirects rate-limited anonymous requests to the
login page, as we do not currently differentiate the cases.
2023-11-28 09:44:55 -08:00
Alex Vandiver f9884af114 upload: Return images for 404/403 responses with image Accept: headers.
If the request's `Accept:` header signals a preference for serving
images over text, return an image representing the 404/403 instead of
serving a `text/html` response.

Fixes: #23739.
2023-11-28 09:44:55 -08:00
Sahil Batra 58461660c3 users: Restrict accessing avatar for inaccessible users.
We now return the special avatar used for inaccessible users
when a guest user tries to access avatar of an inaccessibe
user using "/avatar" endpoint.
2023-11-21 23:58:45 -08:00
Anders Kaseorg a50eb2e809 mypy: Enable new error explicit-override.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-10-12 12:28:41 -07:00
Anders Kaseorg 6988622fe8 ruff: Enable B023 Function definition does not bind loop variable.
Python’s loop scoping is misdesigned, resulting in a very common
gotcha for functions that close over loop variables [1].  The general
problem is so bad that even the Go developers plan to break
compatibility in order to fix the same design mistake in their
language [2].

Enable the Ruff rule function-uses-loop-variable (B023) [3], which
conservatively prohibits functions from binding loop variables at all.

[1] https://docs.python-guide.org/writing/gotchas/#late-binding-closures
[2] https://go.dev/s/loopvar-design
[3] https://beta.ruff.rs/docs/rules/function-uses-loop-variable/

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-09-11 18:03:45 -07:00
Anders Kaseorg 81bd63cb46 ruff: Fix PIE808 Unnecessary `start` argument in `range`.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-09-01 14:57:01 -07:00
Alex Vandiver b67108c8c6 retention: Prevent deletion of partially-archived messages.
Previously, this code:
```python3
old_archived_attachments = ArchivedAttachment.objects.annotate(
    has_other_messages=Exists(
        Attachment.objects.filter(id=OuterRef("id"))
        .exclude(messages=None)
        .exclude(scheduled_messages=None)
    )
).filter(messages=None, create_time__lt=delta_weeks_ago, has_other_messages=False)
```

...protected from removal any ArchivedAttachment objects where there
was an Attachment which had _both_ a message _and_ a scheduled
message, instead of _either_ a message _or_ a scheduled message.
Since files are removed from disk when the ArchivedAttachment rows are
deleted, this meant that if an upload was referenced in two messages,
and one was deleted, the file was permanently deleted when the
ArchivedMessage and ArchivedAttachment were cleaned up, despite being
still referenced in live Messages and Attachments.

Switch from `.exclude(messages=None).exclude(scheduled_messages=None)`
to `.exclude(messages=None, scheduled_messages=None)` which "OR"s
those conditions appropriately.

Pull the relevant test into its own file, and expand it significantly
to cover this, and other, corner cases.
2023-08-06 13:40:02 -07:00
Anders Kaseorg e932e2ce52 ruff: Fix UP032 Use f-string instead of `format` call.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-08-02 15:58:55 -07:00
arghyadeep10 1808cdec90 uploads: Improve file not found message.
It replaces the "File not found." text with:
"This file does not exist or has been deleted."

At present when a file is deleted it results in a confusing
experience when looking at the "File not found." message.
In order to clarify the situation is not a bug, the message
has been replaced with a better alternative.

Fixes part of Issue #23739.
2023-07-06 09:32:41 -07:00
Alex Vandiver e2847790b6 upload: Provide a default upload file name, rather than 500. 2023-07-03 21:51:58 -07:00
Ujjawal Modi f7346f36fc attachments: Refactor code for flushing used_upload_space cache.
Subsequent commits will add "on_delete=models.RESTRICT"
relationships, which will result in the Attachment
objects being deleted after Realm has been deleted from
the database.

In order to handle this, we update
get_realm_used_upload_space_cache_key function to accept
realm_id as parameter instead of realm object, so that
the code for flushing the cache works even after the
realm is deleted. This change is fine because eventually
only realm_id is used by this function and there is no
need of the complete realm object.
2023-06-28 18:03:32 -07:00
Lauryn Menard d53b854a7c backend-tests: Update "private message" or "PM" to "direct message".
Updates comments and test strings/names with "private message" or
"PM" to use "direct message" instead.
2023-06-23 11:24:13 -07:00
Anders Kaseorg c09e7d6407 codespell: Correct “requestor” to “requester”.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-06-20 16:17:55 -07:00
Anders Kaseorg 92c83c1df4 tests: Remove assert_streaming_content helper in favor of getvalue.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-06-15 16:49:27 -07:00
Alex Vandiver fbb831ff3b uploads: Allow access to the /download/ variant anonymously.
This was mistakenly left off of b799ec32b0.
2023-06-12 12:55:27 -07:00
Alex Vandiver 0dbe111ab3 test_helpers: Switch add/remove_ratelimit to a contextmanager.
Failing to remove all of the rules which were added causes action at a
distance with other tests.  The two methods were also only used by
test code, making their existence in zerver.lib.rate_limiter clearly
misplaced.

This fixes one instance of a mis-balanced add/remove, which caused
tests to start failing if run non-parallel and one more anonymous
request was added within a rate-limit-enabled block.
2023-06-12 12:55:27 -07:00
Lauryn Menard 154af5bb6b scheduled-messages: Remove ID from create scheduled message.
Part of splitting creating and editing scheduled messages.
Should be merged with final commit in series. Breaks tests.

Removes `scheduled_message_id` parameter from the create scheduled
message path.
2023-05-26 18:05:55 -07:00
Mateusz Mandera 414658fc8e scheduled_message: Handle attachments properly.
Fixes #25414.

We add Attachment.scheduled_messages relation to track ScheduledMessages
which reference the attachment.

The import bits can be done after merging this, by updating #25345.
2023-05-08 09:56:02 -07:00
Mateusz Mandera 4598607a46 test_uploads: Fix two typos. 2023-05-08 09:56:02 -07:00
AcKindle3 b0ef8f0822 test: Replace occurences of `uri` with `url`.
In all the tests files, replaced all occurences of `uri` with `url`
appeared in comments, local variablles, function names and their callers.
2023-04-08 16:27:55 -07:00
Anders Kaseorg a881918a05 requirements: Upgrade Python requirements.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-04-03 22:39:21 -07:00
Lauryn Menard 7b225245c0 tests: Update ZulipTestCase.tearDown to remove local uploads.
Previously, tests that exercised code paths that added local
uploads did not always clean up `settings.LOCAL_UPLOADS_DIR`
after the test was complete.

Updates the `ZulipTestCase` class to remove any local uploads
in the unique `settings.LOCAL_UPLOADS_DIR` in `tearDown` for
all tests.
2023-03-28 14:38:06 -07:00
Anders Kaseorg 2d9b2a2a05 models: Remove type prefixes from __str__ values.
The Django convention is for __repr__ to include the type and __str__
to omit it.  In fact its default __repr__ implementation for models
automatically adds a type prefix to __str__, which has resulted in the
type being duplicated:

    >>> UserProfile.objects.first()
    <UserProfile: <UserProfile: emailgateway@zulip.com <Realm: zulipinternal 1>>>

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-03-08 22:56:55 -08:00
Alex Vandiver 880a3f95a7 tests: Split out s3 and local tests.
This mirrors the split of the code in 7c0d414aff.
2023-03-02 16:36:19 -08:00
Alex Vandiver bd80c048be upload: Rename delete_message_image to use word "attachment".
The table is named Attachment, and not all of them are images.
2023-03-02 16:36:19 -08:00
Alex Vandiver 567d1d54e7 upload: Rename upload_message_file to use word "attachment".
For consistency with the table, which is named Attachment.
2023-03-02 16:36:19 -08:00
Alex Vandiver e31767dda4 settings: Make DEFAULT_LOGO_URI/DEFAULT_AVATAR_URI use staticfiles. 2023-02-14 17:17:06 -05:00
Anders Kaseorg df001db1a9 black: Reformat with Black 23.
Black 23 enforces some slightly more specific rules about empty line
counts and redundant parenthesis removal, but the result is still
compatible with Black 22.

(This does not actually upgrade our Python environment to Black 23
yet.)

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-02-02 10:40:13 -08:00
Alex Vandiver 04cf68b45e uploads: Serve S3 uploads directly from nginx.
When file uploads are stored in S3, this means that Zulip serves as a
302 to S3.  Because browsers do not cache redirects, this means that
no image contents can be cached -- and upon every page load or reload,
every recently-posted image must be re-fetched.  This incurs extra
load on the Zulip server, as well as potentially excessive bandwidth
usage from S3, and on the client's connection.

Switch to fetching the content from S3 in nginx, and serving the
content from nginx.  These have `Cache-control: private, immutable`
headers set on the response, allowing browsers to cache them locally.

Because nginx fetching from S3 can be slow, and requests for uploads
will generally be bunched around when a message containing them are
first posted, we instruct nginx to cache the contents locally.  This
is safe because uploaded file contents are immutable; access control
is still mediated by Django.  The nginx cache key is the URL without
query parameters, as those parameters include a time-limited signed
authentication parameter which lets nginx fetch the non-public file.

This adds a number of nginx-level configuration parameters to control
the caching which nginx performs, including the amount of in-memory
index for he cache, the maximum storage of the cache on disk, and how
long data is retained in the cache.  The currently-chosen figures are
reasonable for small to medium deployments.

The most notable effect of this change is in allowing browsers to
cache uploaded image content; however, while there will be many fewer
requests, it also has an improvement on request latency.  The
following tests were done with a non-AWS client in SFO, a server and
S3 storage in us-east-1, and with 100 requests after 10 requests of
warm-up (to fill the nginx cache).  The mean and standard deviation
are shown.

|                   | Redirect to S3      | Caching proxy, hot  | Caching proxy, cold |
| ----------------- | ------------------- | ------------------- | ------------------- |
| Time in Django    | 263.0 ms ±  28.3 ms | 258.0 ms ±  12.3 ms | 258.0 ms ±  12.3 ms |
| Small file (842b) | 586.1 ms ±  21.1 ms | 266.1 ms ±  67.4 ms | 288.6 ms ±  17.7 ms |
| Large file (660k) | 959.6 ms ± 137.9 ms | 609.5 ms ±  13.0 ms | 648.1 ms ±  43.2 ms |

The hot-cache performance is faster for both large and small files,
since it saves the client the time having to make a second request to
a separate host.  This performance improvement remains at least 100ms
even if the client is on the same coast as the server.

Cold nginx caches are only slightly slower than hot caches, because
VPC access to S3 endpoints is extremely fast (assuming it is in the
same region as the host), and nginx can pool connections to S3 and
reuse them.

However, all of the 648ms taken to serve a cold-cache large file is
occupied in nginx, as opposed to the only 263ms which was spent in
nginx when using redirects to S3.  This means that to overall spend
less time responding to uploaded-file requests in nginx, clients will
need to find files in their local cache, and skip making an
uploaded-file request, at least 60% of the time.  Modeling shows a
reduction in the number of client requests by about 70% - 80%.

The `Content-Disposition` header logic can now also be entirely shared
with the local-file codepath, as can the `url_only` path used by
mobile clients.  While we could provide the direct-to-S3 temporary
signed URL to mobile clients, we choose to provide the
served-from-Zulip signed URL, to better control caching headers on it,
and greater consistency.  In doing so, we adjust the salt used for the
URL; since these URLs are only valid for 60s, the effect of this salt
change is minimal.
2023-01-09 18:23:58 -05:00
Alex Vandiver ed6d62a9e7 avatars: Serve /user_avatars/ through Django, which offloads to nginx.
Moving `/user_avatars/` to being served partially through Django
removes the need for the `no_serve_uploads` nginx reconfiguring when
switching between S3 and local backends.  This is important because a
subsequent commit will move S3 attachments to being served through
nginx, which would make `no_serve_uploads` entirely nonsensical of a
name.

Serve the files through Django, with an offload for the actual image
response to an internal nginx route.  In development, serve the files
directly in Django.

We do _not_ mark the contents as immutable for caching purposes, since
the path for avatar images is hashed only by their user-id and a salt,
and as such are reused when a user's avatar is updated.
2023-01-09 18:23:58 -05:00
Alex Vandiver f0f4aa66e0 uploads: Inline the one callsite of get_local_file_path.
This helps make more explicit the assert_is_local_storage_path which
makes using local_path safe.
2023-01-09 18:23:58 -05:00
Alex Vandiver 7ad06473b6 uploads: Add LOCAL_AVATARS_DIR / LOCAL_FILES_DIR computed settings.
This avoids strewing "avatars" and "files" constants throughout.
2023-01-09 18:23:58 -05:00
Alex Vandiver 24f95a3788 uploads: Move internal upload serving path to under /internal/. 2023-01-09 18:23:58 -05:00
Alex Vandiver cc9b028312 uploads: Set X-Accel-Redirect manually, without using django-sendfile2.
The `django-sendfile2` module unfortunately only supports a single
`SENDFILE` root path -- an invariant which subsequent commits need to
break.  Especially as Zulip only runs with a single webserver, and
thus sendfile backend, the functionality is simple to inline.

It is worth noting that the following headers from the initial Django
response are _preserved_, if present, and sent unmodified to the
client; all other headers are overridden by those supplied by the
internal redirect[^1]:
 - Content-Type
 - Content-Disposition
 - Accept-Ranges
 - Set-Cookie
 - Cache-Control
 - Expires

As such, we explicitly unset the Content-type header to allow nginx to
set it from the static file, but set Content-Disposition and
Cache-Control as we want them to be.

[^1]: https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/
2023-01-09 18:23:58 -05:00
Alex Vandiver 679fb76acf uploads: Provide our own Content-Disposition header.
sendfile already applied a Content-Disposition header, but the
algorithm may provide both `filename=` and `filename*=` values (which
is potentially confusing to clients) and incorrectly slash-escapes
quotes in Unicode strings.

Django provides a correct implementation, but it is only accessible to
FileResponse objects.  Since the entire point is to offload the
filehandle handling, we cannot use a FileResponse.

Django 4.2 will make the function available outside of FileResponse.
Until then, extract our own Content-Disposition handling, based on
Django's.

We remove the very verbose comment added in d4360e2287, describing
Content-Disposition headers, as it does not add much.
2023-01-09 18:23:58 -05:00
Alex Vandiver 7c0d414aff uploads: Split out S3 and local file backends into separate files.
The uploads file is large, and conceptually the S3 and local-file
backends are separable.
2023-01-09 18:23:58 -05:00
Zixuan James Li 46329a2710 test_classes: Create a dedicate helper for query count check.
This adds a helper based on testing patterns of using the "queries_captured"
context manager with "assert_length" to check the number of queries
executed for preventing performance regression.

It explains the rationale of checking the query count through an
"AssertionError" and prints the queries captured as assert_length does,
but with a format optimized for displaying the queries in a more
readable manner.

Signed-off-by: Zixuan James Li <p359101898@gmail.com>
2022-10-17 11:32:52 -07:00
madrix01 4303ba1efc actions: Create a separate message_delete.py file.
This is preparatory commit for #18941.
Importing `do_delete_message` from `message_edit.py` was causing a
circular import error. In order to avoid that, we create a separate
message_delete.py file which has all the functions related to deleting
messages.
The tests for deleting messages are present in
`zerver/tests/test_message_edit.py`.

Fixes a part of #18941
2022-09-01 14:18:38 -07:00
Lauryn Menard aa796af0a8 upload: Remove `mimetype` url parameter in `get_file_info`.
This `mimetype` parameter was introduced in c4fa29a and its last
usage removed in 5bab2a3. This parameter was undocumented in the
OpenAPI endpoint documentation for `/user_uploads`, therefore
there shouldn't be client implementations that rely on it's
presence.

Removes the `request.GET` call for the `mimetype` parameter and
replaces it by getting the `content_type` value from the file,
which is an instance of Django's `UploadedFile` class and stores
that file metadata as a property.

If that returns `None` or an empty string, then we try to guess
the `content_type` from the filename, which is the same as the
previous behaviour when `mimetype` was `None` (which we assume
has been true since it's usage was removed; see above).

If unable to guess the `content_type` from the filename, we now
fallback to "application/octet-stream", instead of an empty string
or `None` value.

Also, removes the specific test written for having `mimetype` as
a url parameter in the request, and replaces it with a test that
covers when we try to guess `content_type` from the filename.
2022-08-08 16:06:09 -07:00
Zixuan James Li a142fbff85 tests: Refactor away result.json() calls with helpers.
Signed-off-by: Zixuan James Li <p359101898@gmail.com>
2022-06-06 23:06:00 -07:00
Mateusz Mandera 09dc166b45 do_delete_old_unclaimed_attachments: Consider ArchivedAttachment rows.
This function is oblivious to the existence of ArchivedAttachment, which
is incorrect. A file can be removed if and only if it is not referenced
by any Messages or ArchivedMessages.
2022-06-02 17:32:23 -07:00
Mateusz Mandera 5ff4754090 test_upload: Fix some URLs to uploaded files.
Using http://localhost:9991 is incorrect - e.g. messages sent with file
urls constructed trigger do_claim_attachments to be called with empty
list in potential_path_ids.

realm.host should be used in all these places, like in the other tests
in the file.
2022-06-02 17:32:23 -07:00