zulip/zerver/lib/url_preview/oembed.py

import json
from typing import Optional

import requests
from pyoembed import PyOembedException, oEmbed

from zerver.lib.url_preview.types import UrlEmbedData, UrlOEmbedData


def get_oembed_data(url: str, maxwidth: int = 640, maxheight: int = 480) -> Optional[UrlEmbedData]:
    try:
        data = oEmbed(url, maxwidth=maxwidth, maxheight=maxheight)
    except (PyOembedException, json.decoder.JSONDecodeError, requests.exceptions.ConnectionError):
        # Provider errors, invalid JSON, and connection failures all mean
        # "no oEmbed data" for this URL.
        return None

    oembed_resource_type = data.get("type", "")
    image = data.get("url", data.get("image"))
    thumbnail = data.get("thumbnail_url")
    html = data.get("html", "")
    if oembed_resource_type == "photo" and image:
        return UrlOEmbedData(
            image=image,
            type="photo",
            title=data.get("title"),
            description=data.get("description"),
        )

    if oembed_resource_type == "video" and html and thumbnail:
        return UrlOEmbedData(
            image=thumbnail,
            type="video",
            html=strip_cdata(html),
            title=data.get("title"),
            description=data.get("description"),
        )

    # Otherwise, use the title/description from pyoembed as the basis
    # for our other parsers.
    return UrlEmbedData(
        title=data.get("title"),
        description=data.get("description"),
    )


def strip_cdata(html: str) -> str:
    # Work around a bug in SoundCloud's XML generation:
    # <html>&lt;![CDATA[&lt;iframe ...&gt;&lt;/iframe&gt;]]&gt;</html>
    if html.startswith("<![CDATA[") and html.endswith("]]>"):
        # Drop the 9-character "<![CDATA[" prefix and the 3-character "]]>" suffix.
        html = html[9:-3]
    return html
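
The CDATA workaround in strip_cdata can be exercised on its own. The following is a minimal sketch, not part of the Zulip source: the function body is repeated so that the snippet runs without zerver or pyoembed installed, and the payload strings and the example.com URL are made up for illustration.

```python
def strip_cdata(html: str) -> str:
    # Same logic as strip_cdata above, repeated so this sketch is self-contained.
    if html.startswith("<![CDATA[") and html.endswith("]]>"):
        html = html[9:-3]
    return html


# Hypothetical payloads, invented for this example.
plain = '<iframe src="https://example.com/player"></iframe>'
wrapped = "<![CDATA[" + plain + "]]>"
half_wrapped = "<![CDATA[" + plain

# A fully wrapped payload loses its CDATA markers.
unwrapped = strip_cdata(wrapped)

# HTML without the wrapper passes through unchanged.
untouched = strip_cdata(plain)

# The slice only applies when BOTH the prefix and the suffix are present,
# so a payload with only the opening marker is returned as-is.
still_half_wrapped = strip_cdata(half_wrapped)

print(unwrapped == plain, untouched == plain, still_half_wrapped == half_wrapped)
```

Requiring both markers before slicing means an unwrapped or partially wrapped string is never truncated by the fixed 9 and 3 character offsets.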