55 Commits

Author SHA1 Message Date
Anders Kaseorg a2825e5984 python: Use Python 3.8 typing.{Protocol,TypedDict}.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-27 12:57:49 -07:00
Alex Vandiver 56058f3316 caches: Remove unnecessary "in-memory" cache.
This cache was added in da33b72848 to serve as a replacement for the
durable database cache, in development; the previous commit has
switched that to be the non-durable memcached backend.

The special case for "in-memory" in development is mostly unnecessary
in contrast to memcached -- `./tools/run-dev.py` flushes memcached on
every startup.  This differs in behaviour slightly, in that if the
codepath is changed and `run-dev` restarts Django, the cache is not
cleared.  This seems an unlikely occurrence, however, and the code
cleanup from its removal is worth it.
2022-04-15 14:48:12 -07:00
Alex Vandiver 04ca2e92f7 caches: Cache link preview data in memcached, not in PostgreSQL.
The choice to cache these in the database dates back to c93f1d4eda,
with the comment added in da33b72848 while working around the
durability of the "database" cache in local development.

The values were stored in a durable cache, as they needed to be
ensured to persist between when they were inserted in
`get_link_embed_data` and when they were used in
`render_incoming_message` via `link_embed_data_from_cache`.

However, database accesses are not fast compared to memcached, and we
wish to avoid the overhead of the database connection from the
`embed_links` worker.  Specifically, making the connection may not be
thread-safe -- and in low-memory (and Docker) configurations, all
workers run as separate threads in a single process.  This can lead to
stalled database connections in `embed_links` workers, and failed
previews.

Since the previous commit made the durability of the cache no longer
necessary, this will have minimal effect; at worst, posting the same
URL twice, on either side of an upgrade, will result in two preview
fetches of it.
2022-04-15 14:48:12 -07:00
Alex Vandiver 351bdfaf78 preview: Use cache only as a non-durable cache, not an IPC.
The `get_link_embed_data` / `link_embed_data_from_cache` pair as
introduced in c93f1d4eda uses the cache
as a temporary store inside of the `embed_links` worker; this means
that it must be durable storage, or the worker will stall and re-fetch
the same links to preview them.

Switch to plumbing through the fetched URL embed data as a parameter
to the Markdown evaluation which uses them, rather than using the
cache as an intermediary.  This frees up the cache to be merely a
non-durable cache.

As a side effect, this removes get_cache_with_key, along with
link_embed_data_from_cache, which was its only call site.
2022-04-15 14:48:12 -07:00
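The change described in 351bdfaf78 can be illustrated with a minimal sketch. The names `fetch_embed_data`, `render_message`, and `fake_fetch`, and the dictionary shape, are hypothetical stand-ins and not Zulip's actual API; the sketch only shows fetched data travelling as an argument rather than through a cache lookup.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical shape: URL -> preview title, or None when no preview was found.
UrlEmbedData = Dict[str, Optional[str]]


def fetch_embed_data(
    urls: List[str], fetch: Callable[[str], Optional[str]]
) -> UrlEmbedData:
    """Fetch preview data once per URL and hand it back to the caller."""
    return {url: fetch(url) for url in urls}


def render_message(content: str, url_embed_data: UrlEmbedData) -> str:
    """Render using data passed in as a parameter; no cache read is needed."""
    lines = [content]
    for url, title in url_embed_data.items():
        if title is not None:
            lines.append(f"[preview] {title} ({url})")
    return "\n".join(lines)


def fake_fetch(url: str) -> Optional[str]:
    # Stand-in for a network fetch, so the sketch runs offline.
    return "Example Domain" if url == "https://example.com" else None


data = fetch_embed_data(["https://example.com", "https://invalid.example"], fake_fetch)
rendered = render_message("See https://example.com", data)
```

Because the rendering step no longer reads from the cache, losing a cache entry between fetch and render cannot stall it; the cache only saves repeated fetches.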
Alex Vandiver 327ff9ea0f preview: Use a dataclass for the embed data.
This is significantly cleaner than passing around `Dict[str, Any]` all
of the time.
2022-04-15 14:48:12 -07:00
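As a rough illustration of the point made in 327ff9ea0f, a dataclass gives the embed data named, typed fields. The class name `EmbedDataSketch` and its fields are assumptions made for this example (they mirror the image, title, and description properties mentioned elsewhere in this log), not the class the commit introduced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbedDataSketch:
    # Hypothetical fields; each defaults to None when the page lacks it.
    title: Optional[str] = None
    description: Optional[str] = None
    image: Optional[str] = None


preview = EmbedDataSketch(title="Example Domain", image="https://example.com/a.png")

# A type checker can flag a misspelled attribute such as preview.titel,
# whereas a misspelled key on a Dict[str, Any] only fails at runtime.
summary = f"{preview.title}: {preview.description or 'no description'}"
```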
Alex Vandiver e53f9fad29 url_preview: Only return image URLs that validate as URLs. 2022-02-18 15:32:27 -08:00
Anders Kaseorg b0ce4f1bce docs: Fix many spelling mistakes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-07 18:51:06 -08:00
Anders Kaseorg 4922632601 mypy: Add types-beautifulsoup4.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-01-23 23:39:40 -08:00
Anders Kaseorg 4839b7ed27 url_preview: Interpret og:image relative to full page URL.
og:image is supposed to be an absolute URL, but some sites incorrectly
provide a relative URL.  In this case, it makes more sense to
interpret it relative to the full page URL after redirects, rather
than relative to just the domain part of the page URL before
redirects.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-10-21 12:20:37 -07:00
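The difference described in 4839b7ed27 can be seen with `urllib.parse.urljoin` from the standard library. The URLs below are made-up examples, and this is only a sketch of the resolution rule, not the code from the commit.

```python
from urllib.parse import urljoin

# Hypothetical values: the posted URL redirected to this deeper page.
final_page_url = "https://example.com/blog/2021/post.html"
og_image = "images/cover.png"  # relative, although og:image should be absolute

# Resolved against the full page URL after redirects.
resolved = urljoin(final_page_url, og_image)

# Resolved against only the domain part, for comparison.
domain_only = urljoin("https://example.com/", og_image)

# An og:image that is already absolute passes through unchanged.
absolute = urljoin(final_page_url, "https://cdn.example.com/cover.png")
```

Here `resolved` points inside `/blog/2021/`, which is where a browser would look for the image, while `domain_only` points at the site root.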
Alex Vandiver 4d428490fd outgoing_http: Use OutgoingSession subclasses in more places.
This adds the X-Smokescreen-Role header to proxy connections, to track
usage from various codepaths, and enforces a timeout.  Timeouts were
kept consistent with their previous values, or set to 5s if they had
none previously.
2021-09-01 05:34:13 -07:00
Anders Kaseorg 2939d29b6d python: Convert deprecated Django smart_text alias to smart_str.
django.utils.encoding.smart_text is a deprecated alias of
django.utils.encoding.smart_str as of Django 3.0, and will be removed
in Django 4.0.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-04-15 18:01:34 -07:00
Anders Kaseorg 9864907985 mypy: Correct typing.re imports to typing.
Although typing.re exists in the standard library, mypy has never
recognized it.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-03-17 18:41:46 -07:00
Anders Kaseorg 6e4c3e41dc python: Normalize quotes with Black.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg 11741543da python: Reformat with Black, except quotes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
akshatdalton 5f8a10124e url preview: Update Zulip User-Agent.
This commit updates the Zulip User-Agent to
'Mozilla/5.0 (compatible; ZulipURLPreview/{version}; +{external_host})'
as the older User-Agent was rendering Markdown YouTube titles as
'YouTube - YouTube'.

Fixes #16970.
2021-01-25 14:24:48 -08:00
Anders Kaseorg bf45f921a7 url_preview: Allow Beautiful Soup to get the charset from <meta>.
An HTML document sent without a charset in the Content-Type header
needs to be scanned for a charset in <meta> tags.  We need to pass
bytes instead of str to Beautiful Soup to allow it to do this.

Fixes #16843.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-12-15 11:30:57 -08:00
Alex Vandiver ad8943a64a url_preview: Only extract img tags with an `src`.
Some `<img>` tags have no `src` attribute, for example when JS rewrites
them to add one later.  Attempting to access `first_image['src']` on these
will raise an exception, as they have no such attribute.

Only look for images which have a defined `src` attribute on them.  We
could instead check if `first_image.has_attr('src')`, but this seems
only likely to produce fewer valid images.
2020-08-18 14:26:21 -04:00
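The idea in ad8943a64a, skipping `<img>` tags that lack a `src`, can be sketched with the standard library's `html.parser` instead of Beautiful Soup, which the commit itself concerns. The class name `FirstImageWithSrc` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class FirstImageWithSrc(HTMLParser):
    """Record the src of the first <img> tag that actually has one."""

    def __init__(self) -> None:
        super().__init__()
        self.src: Optional[str] = None

    def handle_starttag(
        self, tag: str, attrs: List[Tuple[str, Optional[str]]]
    ) -> None:
        if tag != "img" or self.src is not None:
            return
        # dict.get returns None for a missing attribute instead of raising.
        src = dict(attrs).get("src")
        if src:
            self.src = src


html = '<p><img class="lazy" data-id="1"><img src="/real.png"></p>'
parser = FirstImageWithSrc()
parser.feed(html)
first_src = parser.src
```

The first tag is ignored because it has no `src`, so the lookup never touches a missing attribute.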
Anders Kaseorg 69c0959f34 python: Fix misuse of Optional types for optional parameters.
There seems to have been a confusion between two different uses of the
word “optional”:

• An optional parameter may be omitted and replaced with a default
  value.
• An Optional type has None as a possible value.

Sometimes an optional parameter has a default value of None, or None
is otherwise a meaningful value to provide, in which case it makes
sense for the optional parameter to have an Optional type.  But in
other cases, optional parameters should not have Optional type.  Fix
them.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-13 15:31:27 -07:00
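The two meanings of "optional" distinguished in 69c0959f34 can be shown side by side. The functions `truncate` and `display_name` are invented for this illustration.

```python
from typing import Optional


def truncate(text: str, limit: int = 80) -> str:
    # Optional parameter (it has a default) without an Optional type:
    # None is never a meaningful limit, so the annotation is plain int.
    return text[:limit]


def display_name(name: str, title: Optional[str] = None) -> str:
    # Optional parameter with an Optional type: None means "no title".
    return f"{title} {name}" if title is not None else name


short = truncate("abcdef", 3)
plain = display_name("Ada")
titled = display_name("Ada", "Dr.")
```

Annotating `limit` as `Optional[int] = 80` would wrongly tell callers that passing `None` is acceptable.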
Anders Kaseorg 365fe0b3d5 python: Sort imports with isort.
Fixes #2665.

Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.

Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start.  I expect this change will increase pressure for us to split
those files, which isn't a bad thing.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-11 16:45:32 -07:00
Graham Bleaney 461d5b1a3e pysa: Introduce sanitizers, models, and inline marking safe.
This commit adds three `.pysa` model files: `false_positives.pysa`
for ruling out false positive flows with `Sanitize` annotations,
`req_lib.pysa` for educating pysa about Zulip's `REQ()` pattern for
extracting user input, and `redirects.pysa` for capturing the risk
of open redirects within Zulip code. Additionally, this commit
introduces `mark_sanitized`, an identity function which can be used
to selectively clear taint in cases where `Sanitize` models will not
work. This commit also puts `mark_sanitized` to work removing known
false positive flows.
2020-06-11 12:57:49 -07:00
Puneeth Chaganti 2a65be2bf5 url preview: Use Chrome's user agent instead of a Zulip one.
Some sites don't render correctly unless you are one of the latest browsers.
YouTube Music, for instance, changes the page title to "Your browser is
deprecated, please upgrade.", which makes our URL previews look bad.
2020-04-26 10:16:43 -07:00
Mateusz Mandera 770086f983 url_preview: Discard url in oembed if server returns invalid json.
This fixes the scenario where we'd get errors in the
FetchLinksEmbedData queue processor if oembed got invalid json from the
URL.
2020-04-11 11:54:54 -07:00
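A minimal sketch of the behavior described in 770086f983, assuming the response body is available as a string. The function name `parse_oembed_response` is hypothetical, and the extra check for a non-object payload is an assumption added for this example.

```python
import json
from typing import Any, Dict, Optional


def parse_oembed_response(body: str) -> Optional[Dict[str, Any]]:
    """Return the decoded oEmbed payload, or None if it is unusable."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Invalid JSON: discard the response instead of raising in the worker.
        return None
    # Valid JSON that is not an object is also treated as unusable here.
    return data if isinstance(data, dict) else None


good = parse_oembed_response('{"type": "photo", "url": "https://example.com/a.png"}')
bad = parse_oembed_response("<html>not json</html>")
```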
Tim Abbott 4901dc3795 url_preview: Fix parsing of open graph tags.
Our open graph parser logic sloppily mixed data obtained by parsing
open graph properties with trusted data set by our oembed parser.

We fix this by consistently using our explicit whitelist of generic
properties (image, title, and description) in both places where we
interact with open graph properties.  The fixes are redundant with
each other, but doing both helps in making the intent of the code
clearer.

The issue fixed here was originally reported as an XSS vulnerability
in the upcoming Inline URL Previews feature found by Graham Bleaney
and Ibrahim Mohamed using Pysa.  The recent Oembed changes close that
vulnerability, but this change is still worth doing to make the
implementation do what it looks like it does.
2019-12-12 15:24:38 -08:00
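The allowlist approach described in 4901dc3795 might look roughly like this. Only the three property names come from the commit message; `ALLOWED_PROPERTIES` and `filter_open_graph` are names made up for the sketch.

```python
from typing import Dict

# The generic properties named in the commit message.
ALLOWED_PROPERTIES = ("image", "title", "description")


def filter_open_graph(parsed: Dict[str, str]) -> Dict[str, str]:
    """Keep only allowlisted Open Graph properties from untrusted page data."""
    return {key: parsed[key] for key in ALLOWED_PROPERTIES if key in parsed}


untrusted = {"title": "A page", "html": "<script>alert(1)</script>", "image": "a.png"}
safe = filter_open_graph(untrusted)
```

Any property a page supplies outside the allowlist, such as `html` here, is dropped before it can be mixed with trusted data.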
Anders Kaseorg faa3ea0b8e oembed: Remove unsound HTML filtering.
The frontend now takes care of confining the HTML.

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2019-12-12 15:24:38 -08:00
Tim Abbott 9f223bb7c2 url_preview: Simplify path to oembed code. 2019-12-12 13:34:49 -08:00
Puneeth Chaganti 64c40287f1 url preview: Rename type_ variable to oembed_resource_type. 2019-06-02 14:31:39 -07:00
Puneeth Chaganti 9aa5a2b369 url preview: Use oEmbed html for videos.
Ensure that the html is safe before using it. The html is considered safe if it
is in an iframe with a http/https src, based on the recommendations here:
https://oembed.com/#section3

We directly embed the `iframe` html into the lightbox overlay.
2019-05-31 15:59:03 -07:00
Puneeth Chaganti c8cb785950 url preview: Show inline images as previews for oEmbed photo pages. 2019-05-31 15:59:03 -07:00
Puneeth Chaganti 22d0cd9696 url preview: Don't cache embed data when fetch has network errors. 2019-05-30 16:45:22 -07:00
Puneeth Chaganti 4ac9778d69 url preview: Catch network errors during get for page content.
We may successfully get the page once, to get the content type, but
the server or network may go down and cause problems when fetching the page for
parsing its meta tags.
2019-05-13 13:55:00 -07:00
Puneeth Chaganti 9fd1c40bb1 url preview: Timeout requests after 15 seconds. 2019-05-13 13:54:59 -07:00
Puneeth Chaganti 0b76b16101 url preview: Set a custom user agent for requests.
Some sites seem to block the default user agent of the requests
library. Using a custom user agent lets us show previews for some of
these sites.
2019-05-13 13:54:43 -07:00
Puneeth Chaganti 59555ee7e5 url preview: Confirm content-type before trying to show previews.
Currently, we only show previews for URLs which are HTML pages, which could
contain other media. We don't show previews for links to non-HTML pages, like
pdf documents or audio/video files. To verify that the URL posted is an HTML
page, we verify the content-type of the page, either using server headers or by
sniffing the content.

Closes #8358
2019-05-13 13:45:17 -07:00
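The check described in 59555ee7e5 can be approximated as follows. This is a sketch under stated assumptions: the function name `looks_like_html` is hypothetical, and the sniffing fallback is a deliberately crude stand-in for real content sniffing.

```python
from typing import Optional


def looks_like_html(content_type: Optional[str], body_prefix: bytes = b"") -> bool:
    """Decide whether a response is an HTML page worth previewing."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        return media_type == "text/html"
    # No header: fall back to inspecting the first bytes of the body.
    # This heuristic is an assumption made for the example.
    start = body_prefix.lstrip().lower()
    return start.startswith(b"<!doctype html") or start.startswith(b"<html")


from_header = looks_like_html("text/html; charset=UTF-8")
pdf = looks_like_html("application/pdf")
sniffed = looks_like_html(None, b"  <!DOCTYPE html><html>")
```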
Puneeth Chaganti da33b72848 url preview: Use in-memory caching in dev environment. 2019-05-06 12:37:32 -07:00
Puneeth Chaganti 1f6306a5a7 url preview: Cleanup import ordering. 2019-05-06 12:37:32 -07:00
Puneeth Chaganti d56b16b275 url preview: Ignore open graph tags without a content attribute. 2019-05-06 12:37:32 -07:00
Puneeth Chaganti d02eb99831 url preview: Return generic parser <p> text as str (not bs4 string). 2019-05-06 12:37:32 -07:00
Anders Kaseorg 649235cfec python: Remove unused imports.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
2019-02-22 16:54:36 -08:00
Tim Abbott a4b294da98 url preview: Remove useless logging.error in open graph code path.
As detailed in the comment, someone pasting a broken URL isn't a
situation that a server administrator needs to be notified about.
2019-02-05 13:25:47 -08:00
Steve Howell 76deb30312 preview: Hash cache keys for preview urls.
We don't want really long urls to lead to truncated
keys, or two different urls could theoretically
get each other's previews.

Also, this suppresses warnings about exceeding the
250 char limit.

Finally, this gives the key a proper prefix.
2018-10-14 09:28:57 -07:00
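The approach in 76deb30312 can be sketched with `hashlib`. The prefix string and the choice of SHA-256 are assumptions made for this example; the commit message does not say which hash or prefix was used.

```python
import hashlib

# Hypothetical prefix for illustration.
KEY_PREFIX = "preview_url:"


def preview_url_cache_key(url: str) -> str:
    """Build a fixed-length cache key from an arbitrarily long URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return KEY_PREFIX + digest


long_url = "https://example.com/?q=" + "a" * 1000
key = preview_url_cache_key(long_url)
other = preview_url_cache_key(long_url + "b")
```

The key length depends only on the prefix and the digest, so it stays under the 250 character limit however long the URL is, and two URLs that share a long common prefix still get distinct keys.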
Tim Abbott 4d03c15848 url_preview: Don't import beautifulsoup at import time.
This is a small performance optimization to Django startup, in line
with other recent commits.
2018-08-08 14:19:42 -07:00
neiljp (Neil Pilgrim) e4821875f7 mypy: Improve typing of oembed data, to Dict[str, Any]. 2018-06-19 10:48:38 -07:00
Tim Abbott 3006b3f52f url_preview: Fix crash when description has no content.
There are several things we'll want to clean up with this feature, but
for now we're content to just make this not crash.
2018-05-17 12:40:43 -07:00
Aditya Bansal 1f9244e060 zerver/lib: Change use of typing.Text to str. 2018-05-10 14:19:49 -07:00
rht 3f4bf2d22f zerver/lib: Use python 3 syntax for typing.
Extracted from a larger commit by tabbott because these changes will
not create significant merge conflicts.
2017-11-21 20:56:40 -08:00
neiljp (Neil Pilgrim) 1dcc981af8 mypy: Add explicit Any type parameters for embedded data Dicts. 2017-11-07 11:26:46 -08:00
rht e311842a1b zerver/lib: Remove inheritance from object. 2017-11-06 08:53:48 -08:00
neiljp (Neil Pilgrim) be856bad46 mypy: Reduce use of Any in zerver/lib/url_preview/ return types. 2017-11-04 16:18:27 -07:00
rht f43e54d352 zerver/lib: Remove absolute_import. 2017-09-27 10:00:39 -07:00
Aditya Bansal f32c1892ff preview.py: Fix error raised on uploading file with unicode filename. 2017-06-19 14:58:44 -04:00