Commit Graph

64 Commits

Author SHA1 Message Date
Anders Kaseorg 08db41660a python: Avoid deprecated cgi module, removed in Python 3.13.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-10-22 10:05:01 -07:00
Anders Kaseorg 0fa5e7f629 ruff: Fix UP035 Import from `collections.abc`, `typing` instead.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-07-13 22:28:22 -07:00
Anders Kaseorg 531b34cb4c ruff: Fix UP007 Use `X | Y` for type annotations.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-07-13 22:28:22 -07:00
Anders Kaseorg 7b1bb984b3 ruff: Fix RUF022 `__all__` is not sorted.
This is a preview rule, not yet enabled by default.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-03-01 09:30:04 -08:00
Anders Kaseorg 223b626256 python: Use urlsplit instead of urlparse.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-12-05 13:03:07 -08:00
Anders Kaseorg a50eb2e809 mypy: Enable new error explicit-override.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-10-12 12:28:41 -07:00
Anders Kaseorg 50e6cba1af ruff: Fix UP032 Use f-string instead of `format` call.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-07-19 16:14:59 -07:00
Anders Kaseorg 9db3451333 Remove statsd support.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-04-25 19:58:16 -07:00
Anders Kaseorg da3cf5ea7a ruff: Fix RSE102 Unnecessary parentheses on raised exception.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-02-04 16:34:55 -08:00
Anders Kaseorg a2825e5984 python: Use Python 3.8 typing.{Protocol,TypedDict}.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-27 12:57:49 -07:00
Alex Vandiver 56058f3316 caches: Remove unnecessary "in-memory" cache.
This cache was added in da33b72848 to serve as a replacement for the
durable database cache, in development; the previous commit has
switched that to be the non-durable memcached backend.

The special-case for "in-memory" in development is mostly-unnecessary
in contrast to memcached -- `./tools/run-dev.py` flushes memcached on
every startup.  This differs in behaviour slightly, in that if the
codepath is changed and `run-dev` restarts Django, the cache is not
cleared.  This seems an unlikely occurrence, however, and the code
cleanup from its removal is worth it.
2022-04-15 14:48:12 -07:00
Alex Vandiver 04ca2e92f7 caches: Cache link preview data in memcached, not in PostgreSQL.
The choice to cache these in the database dates back to c93f1d4eda,
with the comment added in da33b72848 while working around the
durability of the "database" cache in local development.

The values were stored in a durable cache, as they needed to be
ensured to persist between when they were inserted in
`get_link_embed_data` and when they were used in
`render_incoming_message` via `link_embed_data_from_cache`.

However, database accesses are not fast compared to memcached, and we
wish to avoid the overhead of the database connection from the
`embed_links` worker.  Specifically, making the connection may not be
thread-safe -- and in low-memory (and Docker) configurations, all
workers run as separate threads in a single process.  This can lead to
stalled database connections in `embed_links` workers, and failed
previews.

Since the previous commit made the durability of the cache no longer
necessary, this will have minimal effect; at worst, posting the same
URL twice, on either side of an upgrade, will result in two preview
fetches of it.
2022-04-15 14:48:12 -07:00
Alex Vandiver 351bdfaf78 preview: Use cache only as a non-durable cache, not an IPC.
The `get_link_embed_data` / `link_embed_data_from_cache` pair as
introduced in c93f1d4eda uses the cache
as a temporary store inside of the `embed_links` worker; this means
that it must be durable storage, or the worker will stall and re-fetch
the same links to preview them.

Switch to plumbing through the fetched URL embed data as an parameter
to the Markdown evaluation which uses them, rather than using the
cache as an intermediary.  This frees up the cache to be merely a
non-durable cache.

As a side-effect, this removes get_cache_with_key, and
link_embed_data_from_cache which was its only callsite.
2022-04-15 14:48:12 -07:00
Alex Vandiver 327ff9ea0f preview: Use a dataclass for the embed data.
This is significantly cleaner than passing around `Dict[str, Any]` all
of the time.
2022-04-15 14:48:12 -07:00
Alex Vandiver e53f9fad29 url_preview: Only return image URLs that validate as URLs. 2022-02-18 15:32:27 -08:00
Anders Kaseorg b0ce4f1bce docs: Fix many spelling mistakes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-07 18:51:06 -08:00
Anders Kaseorg 4922632601 mypy: Add types-beautifulsoup4.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-01-23 23:39:40 -08:00
Anders Kaseorg 4839b7ed27 url_preview: Interpret og:image relative to full page URL.
og:image is supposed to be an absolute URL, but some sites incorrectly
provide a relative URL.  In this case, it makes more sense to
interpret it relative to the full page URL after redirects, rather
than relative to just the domain part of the page URL before
redirects.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-10-21 12:20:37 -07:00
Alex Vandiver 4d428490fd outgoing_http: Use OutgoingSession subclasses in more places.
This adds the X-Smokescreen-Role header to proxy connections, to track
usage from various codepaths, and enforces a timeout.  Timeouts were
kept consistent with their previous values, or set to 5s if they had
none previously.
2021-09-01 05:34:13 -07:00
Anders Kaseorg 2939d29b6d python: Convert deprecated Django smart_text alias to smart_str.
django.utils.encoding.smart_text is a deprecated alias of
django.utils.encoding.smart_str as of Django 3.0, and will be removed
in Django 4.0.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-04-15 18:01:34 -07:00
Anders Kaseorg 9864907985 mypy: Correct typing.re imports to typing.
Although typing.re exists in the standard library, mypy has never
recognized it.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-03-17 18:41:46 -07:00
Anders Kaseorg 6e4c3e41dc python: Normalize quotes with Black.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg 11741543da python: Reformat with Black, except quotes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
akshatdalton 5f8a10124e url preview: Update Zulip User-Agent.
This commit updates the Zulip User-Agent to
'Mozilla/5.0 (compatible; ZulipURLPreview/{version}; +{external_host})'
as the older User-Agent was rendering Markdown YouTube titles as
'YouTube - YouTube'.

Fixes #16970.
2021-01-25 14:24:48 -08:00
Anders Kaseorg bf45f921a7 url_preview: Allow Beautiful Soup to get the charset from <meta>.
An HTML document sent without a charset in the Content-Type header
needs to be scanned for a charset in <meta> tags.  We need to pass
bytes instead of str to Beautiful Soup to allow it to do this.

Fixes #16843.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-12-15 11:30:57 -08:00
Alex Vandiver ad8943a64a url_preview: Only extract img tags with an `src`.
Some `<img>` tags do not have an SRC, if they are rewritten using JS
to have one later.  Attempting to access `first_image['src']` on these
will raise an exception, as they have no such attribute.

Only look for images which have a defined `src` attribute on them.  We
could instead check if `first_image.has_attr('src')`, but this seems
only likely to produce fewer valid images.
2020-08-18 14:26:21 -04:00
Anders Kaseorg 69c0959f34 python: Fix misuse of Optional types for optional parameters.
There seems to have been a confusion between two different uses of the
word “optional”:

• An optional parameter may be omitted and replaced with a default
  value.
• An Optional type has None as a possible value.

Sometimes an optional parameter has a default value of None, or None
is otherwise a meaningful value to provide, in which case it makes
sense for the optional parameter to have an Optional type.  But in
other cases, optional parameters should not have Optional type.  Fix
them.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-13 15:31:27 -07:00
Anders Kaseorg 365fe0b3d5 python: Sort imports with isort.
Fixes #2665.

Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.

Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start.  I expect this change will increase pressure for us to split
those files, which isn't a bad thing.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-11 16:45:32 -07:00
Graham Bleaney 461d5b1a3e pysa: Introduce sanitizers, models, and inline marking safe.
This commit adds three `.pysa` model files: `false_positives.pysa`
for ruling out false positive flows with `Sanitize` annotations,
`req_lib.pysa` for educating pysa about Zulip's `REQ()` pattern for
extracting user input, and `redirects.pysa` for capturing the risk
of open redirects within Zulip code. Additionally, this commit
introduces `mark_sanitized`, an identity function which can be used
to selectively clear taint in cases where `Sanitize` models will not
work. This commit also puts `mark_sanitized` to work removing known
false postive flows.
2020-06-11 12:57:49 -07:00
Puneeth Chaganti 2a65be2bf5 url preview: Use Chrome's user agent instead of a Zulip one.
Some sites don't render correctly unless you are one of the latest browsers.
YouTube Music, for instance, changes the page title to "Your browser is
deprecated, please upgrade.", which makes our URL previews look bad.
2020-04-26 10:16:43 -07:00
Mateusz Mandera 770086f983 url_preview: Discard url in oembed if server returns invalid json.
This fixes the scenario where we'd get errors in the
FetchLinksEmbedData queue processor if oembed got invalid json from the
URL.
2020-04-11 11:54:54 -07:00
Tim Abbott 4901dc3795 url_preview: Fix parsing of open graph tags.
Our open graph parser logic sloppily mixed data obtained by parsing
open graph properties with trusted data set by our oembed parser.

We fix this by consistenly using our explicit whitelist of generic
properties (image, title, and description) in both places where we
interact with open graph properties.  The fixes are redundant with
each other, but doing both helps in making the intent of the code
clearer.

This issue fixed here was originally reported as an XSS vulnerability
in the upcoming Inline URL Previews feature found by Graham Bleaney
and Ibrahim Mohamed using Pysa.  The recent Oembed changes close that
vulnerability, but this change is still worth doing to make the
implementation do what it looks like it does.
2019-12-12 15:24:38 -08:00
Anders Kaseorg faa3ea0b8e oembed: Remove unsound HTML filtering.
The frontend now takes care of confining the HTML.

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2019-12-12 15:24:38 -08:00
Tim Abbott 9f223bb7c2 url_preview: Simplify path to oembed code. 2019-12-12 13:34:49 -08:00
Puneeth Chaganti 64c40287f1 url preview: Rename type_ variable to oembed_resource_type. 2019-06-02 14:31:39 -07:00
Puneeth Chaganti 9aa5a2b369 url preview: Use oEmbed html for videos.
Ensure that the html is safe, before using it. The html is considered if it is
in an iframe with a http/https src, based on the recommendations here:
https://oembed.com/#section3

We directly embed the `iframe` html into the lightbox overlay.
2019-05-31 15:59:03 -07:00
Puneeth Chaganti c8cb785950 url preview: Show inline images as previews for oEmbed photo pages. 2019-05-31 15:59:03 -07:00
Puneeth Chaganti 22d0cd9696 url preview: Don't cache embed data when fetch has network errors. 2019-05-30 16:45:22 -07:00
Puneeth Chaganti 4ac9778d69 url preview: Catch network errors during get for page content.
We may be successfully able to get the page once, to get the content type, but
the server or network may go down and cause problems when fetching the page for
parsing its meta tags.
2019-05-13 13:55:00 -07:00
Puneeth Chaganti 9fd1c40bb1 url preview: Timeout requests after 15 seconds. 2019-05-13 13:54:59 -07:00
Puneeth Chaganti 0b76b16101 url preview: Set a custom user agent for requests.
Some sites seem to block the default user agent of the requests
library. Using a custom user agent lets us show previews for some of
these sites.
2019-05-13 13:54:43 -07:00
Puneeth Chaganti 59555ee7e5 url preview: Confirm content-type before trying to show previews.
Currently, we only show previews for URLs which are HTML pages, which could
contain other media. We don't show previews for links to non-HTML pages, like
pdf documents or audio/video files. To verify that the URL posted is an HTML
page, we verify the content-type of the page, either using server headers or by
sniffing the content.

Closes #8358
2019-05-13 13:45:17 -07:00
Puneeth Chaganti da33b72848 url preview: Use in-memory caching in dev environment. 2019-05-06 12:37:32 -07:00
Puneeth Chaganti 1f6306a5a7 url preview: Cleanup import ordering. 2019-05-06 12:37:32 -07:00
Puneeth Chaganti d56b16b275 url preview: Ignore open graph tags without a content attribute. 2019-05-06 12:37:32 -07:00
Puneeth Chaganti d02eb99831 url preview: Return generic parser <p> text as str (not bs4 string). 2019-05-06 12:37:32 -07:00
Anders Kaseorg 649235cfec python: Remove unused imports.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
2019-02-22 16:54:36 -08:00
Tim Abbott a4b294da98 url preview: Remove useless logging.error in open graph code path.
As detailed in the comment, someone pasting a broken URL isn't a
situation that a server administrator needs to be notified about.
2019-02-05 13:25:47 -08:00
Steve Howell 76deb30312 preview: Hash cache keys for preview urls.
We don't want really long urls to lead to truncated
keys, or we could theoretically have two different
urls get mixed up previews.

Also, this suppresses warnings about exceeding the
250 char limit.

Finally, this gives the key a proper prefix.
2018-10-14 09:28:57 -07:00
Tim Abbott 4d03c15848 url_preview: Don't import beautifulsoup at import time.
This is a small performance optimization to Django startup, in line
with other recent commits.
2018-08-08 14:19:42 -07:00