For resizing the icon.png files, we use resize_avatar, not resize_logo.
This is confusing: icons are resized with the same logic as avatars, but
the function called in the icon context should have a name to match. So
this commit also adds resize_realm_icon, and changes the calls to
resize_avatar in icon contexts to resize_realm_icon.
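As a sketch, the new name can be a thin wrapper, assuming `resize_avatar`'s
existing bytes-in, bytes-out signature and the `DEFAULT_AVATAR_SIZE`
constant:

```python
def resize_realm_icon(image_data: bytes, size: int = DEFAULT_AVATAR_SIZE) -> bytes:
    # Realm icons are resized exactly like avatars; this name just makes
    # call sites in the icon context read correctly.
    return resize_avatar(image_data, size)
```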
The Content-Type, Content-Disposition, StorageClass, and general
metadata are not set according to our patterns by tusd; copy the file
to itself to update those properties.
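A sketch of the copy-to-itself call with boto3; the helper name, bucket
name, and metadata values here are illustrative placeholders:

```python
import boto3


def rewrite_object_properties(
    bucket: str, path_id: str, content_type: str, content_disposition: str
) -> None:
    # Copying the object onto itself rewrites its properties in place;
    # MetadataDirective="REPLACE" is required, or S3 keeps the old values.
    boto3.client("s3").copy_object(
        Bucket=bucket,
        Key=path_id,
        CopySource={"Bucket": bucket, "Key": path_id},
        MetadataDirective="REPLACE",
        ContentType=content_type,
        ContentDisposition=content_disposition,
        StorageClass="STANDARD",
    )
```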
Setting `ResponseContentDisposition=attachment` means that we override
the stored `ContentDisposition`, which includes a filename. This meant
that using the "Download" link on servers with S3 storage produced a
file named after the sanitized version we stored, rather than the
filename the user uploaded.
Explicitly build a `ContentDisposition` for S3 to return, which
includes both `attachment` and the filename (if we have it locally).
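A sketch of the construction, assuming a boto3 `client` and the upload's
`bucket_name`, `path_id`, and stored `filename` in scope; the helper name
is illustrative, and RFC 5987 encoding keeps non-ASCII filenames intact:

```python
from urllib.parse import quote


def content_disposition_for(filename: str | None) -> str:
    # Fall back to a bare "attachment" when we never stored a filename.
    if filename is None:
        return "attachment"
    return "attachment; filename*=UTF-8''" + quote(filename)


# Ask S3 to return our constructed header, not a bare "attachment":
url = client.generate_presigned_url(
    "get_object",
    Params={
        "Bucket": bucket_name,
        "Key": path_id,
        "ResponseContentDisposition": content_disposition_for(filename),
    },
)
```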
This allows finer-grained access control and auditing. The links
generated also expire after one week, and the suggested configuration
is that the underlying data does as well.
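Concretely, this might look like the following sketch, assuming a boto3
`client` with `bucket_name` and `key` in scope; the `exports/` prefix is
illustrative, not the server's actual layout:

```python
# Links are signed with a one-week validity; ExpiresIn is in seconds.
url = client.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket_name, "Key": key},
    ExpiresIn=60 * 60 * 24 * 7,
)

# The suggested lifecycle rule expires the underlying data after a week too.
client.put_bucket_lifecycle_configuration(
    Bucket=bucket_name,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-after-one-week",
                "Status": "Enabled",
                "Filter": {"Prefix": "exports/"},  # illustrative prefix
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```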
Co-authored-by: Prakhar Pratyush <prakhar@zulip.com>
The `get_signed_upload_url` code is called for every S3 file serve
request, and is thus in the hot path. Caching the boto3 client there is
potentially useful as a performance optimization.
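A minimal sketch of that caching, assuming a shared module-level client
is acceptable:

```python
from functools import lru_cache

import boto3


@lru_cache(maxsize=None)
def get_boto_client():
    # Client construction re-reads credentials and service definitions,
    # which is too slow to repeat on every file-serve request.
    return boto3.client("s3")
```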
We may not always have trivial access to all of the bytes of the
uploaded file -- for instance, if the file was uploaded previously, or
by some other process. Downloading the entire image in order to check
its headers is an inefficient use of time and bandwidth.
Adjust `maybe_thumbnail` and dependencies to potentially take a
`pyvips.Source` which supports streaming data from S3 or disk. This
allows making the ImageAttachment row, if deemed appropriate, based on
only a few KB of data, and not the entire image.
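A sketch of the streaming path, with the bucket and key as placeholders;
pyvips pulls only the bytes it needs to parse the image header:

```python
import boto3
import pyvips


def s3_streaming_source(bucket: str, key: str) -> pyvips.Source:
    # Wrap the S3 response body so pyvips can pull bytes on demand,
    # instead of us downloading the whole object up front.
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    source = pyvips.SourceCustom()
    source.on_read(body.read)
    return source


# Only a few KB are read to learn the dimensions and format:
image = pyvips.Image.new_from_source(
    s3_streaming_source("example-uploads", "path/to/image"), "", access="sequential"
)
print(image.width, image.height, image.get("vips-loader"))
```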
A new table is created to track which path_id attachments are images,
their image metadata, and which thumbnails have been created for them.
Using path_id as the effective primary key lets us ignore whether the
attachment is archived or not, saving some foreign key messes.
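Sketched as a Django model, with illustrative (not actual) field names:

```python
from django.db import models


class ImageAttachment(models.Model):
    # path_id is the effective primary key; it stays valid whether the
    # Attachment row is live or archived, so no foreign key is needed.
    path_id = models.TextField(unique=True, db_index=True)
    original_width_px = models.IntegerField()
    original_height_px = models.IntegerField()
    frames = models.IntegerField()  # >1 for animated images
    # Records which (size, format) thumbnails exist so far.
    thumbnail_metadata = models.JSONField(default=list)
```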
A new worker is added to observe events when rows are added to this
table, and to generate and store thumbnails for those images in
differing sizes and formats.
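The worker's core might look like the following sketch; `open_source` and
`store_thumbnail` are hypothetical helpers, and the size/format lists are
illustrative:

```python
import pyvips

THUMBNAIL_SIZES = [(150, 100), (300, 200)]  # illustrative
THUMBNAIL_FORMATS = [".webp", ".jpg"]       # illustrative


def generate_thumbnails(path_id: str) -> None:
    for width, height in THUMBNAIL_SIZES:
        # thumbnail_source decodes only as much of the image as is
        # needed to produce a thumbnail of the requested size.
        thumb = pyvips.Image.thumbnail_source(
            open_source(path_id), width, height=height  # open_source: hypothetical
        )
        for fmt in THUMBNAIL_FORMATS:
            # store_thumbnail (hypothetical) writes to the storage backend.
            store_thumbnail(path_id, width, height, fmt, thumb.write_to_buffer(fmt))
```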
Hash the salt, user-id, and now avatar version into the filename.
This allows the URL contents to be immutable, and thus to be marked as
immutable and cacheable. Since avatars are served unauthenticated,
hashing with a server-side salt makes the current and past avatars not
enumerable.
This requires plumbing the current (or future) avatar version through
various parts of the upload process.
Since this already requires a full migration of current avatars, also
take the opportunity to fix the missing `.png` on S3 uploads (#12852).
We switch from SHA-1 to SHA-256, but truncate it such that avatar URL
data does not substantially increase in size.
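A sketch of the hashing, with illustrative names; SHA-1 hex digests are
40 characters, so truncating SHA-256 to the same length keeps URLs the
same size:

```python
import hashlib


def avatar_path_hash(user_id: int, avatar_version: int, salt: str) -> str:
    # The server-side salt keeps current and past avatar URLs from being
    # enumerable; the version makes each URL's contents immutable.
    payload = f"{user_id}:{avatar_version}:{salt}".encode()
    return hashlib.sha256(payload).hexdigest()[:40]
```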
Fixes: #12852.
boto3 has two different modalities of making API calls -- through
resources, and through clients. Resources are a higher-level
abstraction, and thus more generally useful, but some APIs are only
accessible through clients. It is possible to get to a client object
from a resource, but not vice versa.
Use `get_bucket(...).meta.client` when we need direct access to the
client object for more complex API calls; this lets all of the
configuration for how to access S3 sit within `get_bucket`. Client
objects are not bound to only one bucket, but we obtain them via the
bucket we will be interacting with, for clarity.
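A sketch of the pattern; the bucket name and keys are placeholders:

```python
import boto3


def get_bucket(bucket_name: str):
    # All of the configuration for how to access S3 lives here.
    session = boto3.session.Session()
    return session.resource("s3").Bucket(bucket_name)


bucket = get_bucket("example-uploads")
bucket.upload_file("/tmp/avatar.png", "some/path")  # resource-level API
# Drop to the client for calls the resource abstraction does not expose:
url = bucket.meta.client.generate_presigned_url(
    "get_object", Params={"Bucket": bucket.name, "Key": "some/path"}
)
```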
We removed the cached session object, as it serves no real purpose.
e883ab057f started caching the boto client, which we had identified
as a slow call. e883ab057f went further, calling
`get_boto_client().generate_presigned_url()` once and caching that
result.
This makes the inner cache on the client useless. Remove it.
Uploads are well-positioned to use S3's "intelligent tiering" storage
class. Add a setting to let uploaded files declare their desired
storage class at upload time, and document how to move existing files
to the same storage class.
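A sketch of the upload side, assuming a setting along the lines of
`S3_UPLOADS_STORAGE_CLASS`, with `bucket`, `uploaded_file`, `path_id`,
and `content_type` standing in for the upload in progress:

```python
# The storage class rides along with the upload itself:
bucket.upload_fileobj(
    uploaded_file,
    path_id,
    ExtraArgs={
        "StorageClass": "INTELLIGENT_TIERING",  # from the hypothetical setting
        "ContentType": content_type,
    },
)
```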
Actions like deleting realms may leave unreferenced uploads in the
attachment storage backend.
Fix these by walking the complete contents of the attachment storage
backend, and removing files which are no longer present in the
database. This may take quite some time, as it is necessarily O(n) in
the number of files uploaded to the system.
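A sketch of the walk, assuming the `Attachment` model's `path_id` column
and a boto3 `client` and `bucket_name` in scope; the one-query-per-key
check is what makes it O(n):

```python
# Page through every key in the bucket and delete the unreferenced ones.
paginator = client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get("Contents", []):
        if not Attachment.objects.filter(path_id=obj["Key"]).exists():
            # No database row references this key; it is an orphan.
            client.delete_object(Bucket=bucket_name, Key=obj["Key"])
```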
The Content-Type of user-provided uploads was provided by the browser
at initial upload time, and stored in S3; however, 04cf68b45e
switched to determining the Content-Disposition merely from the
filename. This makes uploads vulnerable to a stored XSS, wherein a
file uploaded with a content-type of `text/html` and an extension of
`.png` would be served to browsers as `Content-Disposition: inline`,
which is unsafe.
The `Content-Security-Policy` headers in the previous commit mitigate
this, but only for browsers which support them.
Revert parts of 04cf68b45e, specifically by allowing S3 to provide
the Content-Disposition header, and using the
`ResponseContentDisposition` argument when necessary to override it to
`attachment`. Because we expect S3 responses to vary based on this
argument, we include it in the cache key; since the query parameter
has dashes in it, we can't use the helper `$arg_` variables, and
must parse it from the query parameters manually.
Adding the disposition may decrease the cache hit rate somewhat, but
downloads are infrequent enough that it is unlikely to have a
noticeable effect. We take care to not adjust the cache key for
requests which do not specify the disposition.
When file uploads are stored in S3, Zulip responds to requests for them
with a 302 redirect to S3. Because browsers do not cache redirects,
no image contents can be cached -- and upon every page load or reload,
every recently-posted image must be re-fetched. This incurs extra
load on the Zulip server, as well as potentially excessive bandwidth
usage from S3 and on the client's connection.
Switch to fetching the content from S3 in nginx, and serving the
content from nginx. These have `Cache-control: private, immutable`
headers set on the response, allowing browsers to cache them locally.
Because nginx fetching from S3 can be slow, and requests for uploads
will generally be bunched around when the message containing them is
first posted, we instruct nginx to cache the contents locally. This
is safe because uploaded file contents are immutable; access control
is still mediated by Django. The nginx cache key is the URL without
query parameters, as those parameters include a time-limited signed
authentication parameter which lets nginx fetch the non-public file.
This adds a number of nginx-level configuration parameters to control
the caching which nginx performs, including the size of the in-memory
index for the cache, the maximum storage of the cache on disk, and how
long data is retained in the cache. The currently-chosen figures are
reasonable for small to medium deployments.
The most notable effect of this change is in allowing browsers to
cache uploaded image content; but beyond there being many fewer
requests, it also improves request latency. The
following tests were done with a non-AWS client in SFO, a server and
S3 storage in us-east-1, and with 100 requests after 10 requests of
warm-up (to fill the nginx cache). The mean and standard deviation
are shown.
|                     | Redirect to S3      | Caching proxy, hot  | Caching proxy, cold |
| ------------------- | ------------------- | ------------------- | ------------------- |
| Time in Django      | 263.0 ms ± 28.3 ms  | 258.0 ms ± 12.3 ms  | 258.0 ms ± 12.3 ms  |
| Small file (842 B)  | 586.1 ms ± 21.1 ms  | 266.1 ms ± 67.4 ms  | 288.6 ms ± 17.7 ms  |
| Large file (660 KB) | 959.6 ms ± 137.9 ms | 609.5 ms ± 13.0 ms  | 648.1 ms ± 43.2 ms  |
The hot-cache performance is faster for both large and small files,
since it saves the client from having to make a second request to a
separate host. This performance improvement remains at least 100ms
even if the client is on the same coast as the server.
Cold nginx caches are only slightly slower than hot caches, because
VPC access to S3 endpoints is extremely fast (assuming it is in the
same region as the host), and nginx can pool connections to S3 and
reuse them.
However, all of the 648ms taken to serve a cold-cache large file is
spent in nginx, as opposed to the mere 263ms spent in nginx when using
redirects to S3. This means that, for nginx to spend less time overall
responding to uploaded-file requests, clients will need to find files
in their local cache, and skip making an uploaded-file request, at
least 60% of the time. Modeling shows a reduction in the number of
client requests by about 70%-80%.
The `Content-Disposition` header logic can now also be entirely shared
with the local-file codepath, as can the `url_only` path used by
mobile clients. While we could provide the direct-to-S3 temporary
signed URL to mobile clients, we choose to provide the
served-from-Zulip signed URL, to better control caching headers on it,
and for greater consistency. In doing so, we adjust the salt used for
the URL; since these URLs are only valid for 60s, the effect of this
salt change is minimal.