uploads: Extra-escape internal S3 paths.

In nginx, `location` blocks operate on the _decoded_ URI[^1]:

> The matching is performed against a normalized URI, after decoding
> the text encoded in the “%XX” form

This means that if a user-uploaded file contains characters that are
not URI-safe, the browser encodes them in UTF-8 and then URI-encodes
them -- and nginx decodes them and reassembles the original character
before running the `location ~ ^/...` match.  This means that the `$2`
_is not URI-encoded_ and _may contain non-ASCII characters.

When `proxy_pass` is passed a value containing one or more variables,
it does no encoding on that expanded value, assuming that the bytes
are exactly as they should be passed to the upstream.  This means that
directly calling `proxy_pass https://$1/$2` would result in sending
high-bit characters to the S3 upstream, which would rightly balk.

However, a longstanding bug in nginx's `set` directive[^2] means that
the following line:

```nginx
set $download_url https://$1/$2;
```

...results in nginx accidentally URI-encoding $1 and $2 when they are
inserted, resulting in a `$download_url` which is suitable to pass to
`proxy_pass`.  This bug is only present with numeric capture
variables, not named captures; this is particularly relevant because
numeric captures are easily overridden by additional regexes
elsewhere, as subsequent commits will add.

Fixing this is complicated; nginx does not supply any way to escape
values[^3], besides a third-party module[^4] which is an undue
complication to begin using.  The only variable which nginx exposes
which is _not_ un-escaped already is `$request_uri`, which contains
the very original URL sent by the browser -- and thus can't respect
any work done in Django to generate the `X-Accel-Redirect` (e.g., for
`/user_uploads/temporary/` URLs).  We also cannot pass these URLs to
nginx via query-parameters, since `$arg_foo` values are not
URI-decoded by nginx, there is no function to do so[^3], and the
values must be URI-encoded because they themselves are URLs with query
parameters.

Extra-URI-encode the path that we pass to the `X-Accel-Redirect`
location, for S3 redirects.  We rely on the `location` block
un-escaping that layer, leaving `$s3_hostname` and `$s3_path` as they
were intended in Django.

This works around the nginx bug, with no behaviour change.

[^1]: http://nginx.org/en/docs/http/ngx_http_core_module.html#location
[^2]: https://trac.nginx.org/nginx/ticket/348
[^3]: https://trac.nginx.org/nginx/ticket/52
[^4]: https://github.com/openresty/set-misc-nginx-module#set_escape_uri
This commit is contained in:
Alex Vandiver 2023-01-31 20:05:12 +00:00
parent a955f52904
commit d41a00b83b
2 changed files with 16 additions and 4 deletions

View File

@ -1,10 +1,15 @@
# Handle redirects to S3
location ~ ^/internal/s3/([^/]+)/(.*) {
location ~ ^/internal/s3/(?<s3_hostname>[^/]+)/(?<s3_path>.*) {
internal;
include /etc/nginx/zulip-include/headers;
set $download_url https://$1/$2;
proxy_set_header Host $1;
# The components of this path are originally double-URI-escaped
# (see zerver/view/upload.py). "location" matches are on
# unescaped values, which fills $s3_path with a properly
# single-escaped path to pass to the upstream server.
# (see associated commit message for more details)
set $download_url https://$s3_hostname/$s3_path;
proxy_set_header Host $s3_hostname;
# Ensure that we only get _one_ of these headers: the one that
# Django added, not the one from S3.

View File

@ -99,7 +99,14 @@ def serve_s3(request: HttpRequest, path_id: str, download: bool = False) -> Http
# this is acceptable in development.
return redirect(url)
response = internal_nginx_redirect("/internal/s3/" + url[len("https://") :])
# We over-escape the path, to work around it being impossible to
# get the _unescaped_ new internal request URI in nginx.
parsed_url = urlparse(url)
assert parsed_url.hostname is not None
assert parsed_url.path is not None
assert parsed_url.query is not None
escaped_path_parts = parsed_url.hostname + quote(parsed_url.path) + "?" + parsed_url.query
response = internal_nginx_redirect("/internal/s3/" + escaped_path_parts)
patch_disposition_header(response, path_id, download)
patch_cache_control(response, private=True, immutable=True)
return response