og:image is supposed to be an absolute URL, but some sites incorrectly
provide a relative URL. In this case, it makes more sense to
interpret it relative to the full page URL after redirects, rather
than relative to just the domain part of the page URL before
redirects.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This adds the X-Smokescreen-Role header to proxy connections, to track
usage from various codepaths, and enforces a timeout. Timeouts were
kept consistent with their previous values, or set to 5s if they had
none previously.
django.utils.encoding.smart_text is a deprecated alias of
django.utils.encoding.smart_str as of Django 3.0, and will be removed
in Django 4.0.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This commit updates the Zulip User-Agent to
'Mozilla/5.0 (compatible; ZulipURLPreview/{version}; +{external_host})'
as the older User-Agent was rendering Markdown YouTube titles as
'YouTube - YouTube'.
Fixes#16970.
An HTML document sent without a charset in the Content-Type header
needs to be scanned for a charset in <meta> tags. We need to pass
bytes instead of str to Beautiful Soup to allow it to do this.
Fixes#16843.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Some `<img>` tags do not have an SRC, if they are rewritten using JS
to have one later. Attempting to access `first_image['src']` on these
will raise an exception, as they have no such attribute.
Only look for images which have a defined `src` attribute on them. We
could instead check if `first_image.has_attr('src')`, but this seems
only likely to produce fewer valid images.
There seems to have been a confusion between two different uses of the
word “optional”:
• An optional parameter may be omitted and replaced with a default
value.
• An Optional type has None as a possible value.
Sometimes an optional parameter has a default value of None, or None
is otherwise a meaningful value to provide, in which case it makes
sense for the optional parameter to have an Optional type. But in
other cases, optional parameters should not have Optional type. Fix
them.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Fixes#2665.
Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.
Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start. I expect this change will increase pressure for us to split
those files, which isn't a bad thing.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This commit adds three `.pysa` model files: `false_positives.pysa`
for ruling out false positive flows with `Sanitize` annotations,
`req_lib.pysa` for educating pysa about Zulip's `REQ()` pattern for
extracting user input, and `redirects.pysa` for capturing the risk
of open redirects within Zulip code. Additionally, this commit
introduces `mark_sanitized`, an identity function which can be used
to selectively clear taint in cases where `Sanitize` models will not
work. This commit also puts `mark_sanitized` to work removing known
false postive flows.
Some sites don't render correctly unless you are one of the latest browsers.
YouTube Music, for instance, changes the page title to "Your browser is
deprecated, please upgrade.", which makes our URL previews look bad.
Our open graph parser logic sloppily mixed data obtained by parsing
open graph properties with trusted data set by our oembed parser.
We fix this by consistenly using our explicit whitelist of generic
properties (image, title, and description) in both places where we
interact with open graph properties. The fixes are redundant with
each other, but doing both helps in making the intent of the code
clearer.
This issue fixed here was originally reported as an XSS vulnerability
in the upcoming Inline URL Previews feature found by Graham Bleaney
and Ibrahim Mohamed using Pysa. The recent Oembed changes close that
vulnerability, but this change is still worth doing to make the
implementation do what it looks like it does.
Ensure that the html is safe, before using it. The html is considered if it is
in an iframe with a http/https src, based on the recommendations here:
https://oembed.com/#section3
We directly embed the `iframe` html into the lightbox overlay.
We may be successfully able to get the page once, to get the content type, but
the server or network may go down and cause problems when fetching the page for
parsing its meta tags.
Currently, we only show previews for URLs which are HTML pages, which could
contain other media. We don't show previews for links to non-HTML pages, like
pdf documents or audio/video files. To verify that the URL posted is an HTML
page, we verify the content-type of the page, either using server headers or by
sniffing the content.
Closes#8358
We don't want really long urls to lead to truncated
keys, or we could theoretically have two different
urls get mixed up previews.
Also, this suppresses warnings about exceeding the
250 char limit.
Finally, this gives the key a proper prefix.
This change adds support for displaying inline open graph previews for
links posted into Zulip.
It is designed to interact correctly with message editing.
This adds the new settings.INLINE_URL_EMBED_PREVIEW setting to control
whether this feature is enabled.
By default, this setting is currently disabled, so that we can burn it
in for a bit before it impacts users more broadly.
Eventually, we may want to make this manageable via a (set of?)
per-realm settings. E.g. I can imagine a realm wanting to be able to
enable/disable it for certain URLs.