zulip/zerver/lib/url_preview/parsers/open_graph.py

from typing import Dict
from urllib.parse import urlparse

from .base import BaseParser


class OpenGraphParser(BaseParser):
    # Explicit allowlist: only these generic properties are ever returned.
    allowed_og_properties = {
        "og:title",
        "og:description",
        "og:image",
    }

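    def extract_data(self) -> Dict[str, str]:
        """Return the allowed Open Graph properties, keyed without the "og:" prefix."""
        # find_all is the current Beautiful Soup 4 name; findAll is a legacy alias.
        meta = self._soup.find_all("meta")
        result: Dict[str, str] = {}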
        for tag in meta:
            if not tag.has_attr("property"):
                continue
            if tag["property"] not in self.allowed_og_properties:
                continue

            og_property_name = tag["property"][len("og:") :]
            if not tag.has_attr("content"):
                continue

            if og_property_name == "image":
                try:
                    # We use urlparse and not URLValidator because we
                    # need to support relative URLs.
                    urlparse(tag["content"])
                except ValueError:
                    continue

            result[og_property_name] = tag["content"]

        return result
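
The `og:image` check above depends on `urlparse` raising `ValueError` for input it cannot parse, while still accepting relative URLs. A minimal sketch of that behavior follows; the helper name `is_parseable_url` is hypothetical and is not part of the parser.

```python
from urllib.parse import urlparse


def is_parseable_url(value: str) -> bool:
    """Hypothetical helper mirroring the try/except around urlparse."""
    try:
        urlparse(value)
    except ValueError:
        return False
    return True


print(is_parseable_url("https://example.com/logo.png"))  # absolute URL
print(is_parseable_url("/static/images/logo.png"))  # relative URL
print(is_parseable_url("http://[::1"))  # unbalanced IPv6 bracket
```

Only the last call is rejected, because `urlparse` raises `ValueError` for an unbalanced IPv6 bracket. The check is about parseability only: it does not confirm that the URL is reachable or uses a particular scheme.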
# Change history (from git blame; each commit listed once, oldest first).
#
# 2016-10-27 12:06:44 +02:00
#   Add oembed/Open Graph/Meta tags data retrieval from inline links.
#   This change adds support for displaying inline open graph previews
#   for links posted into Zulip. It is designed to interact correctly
#   with message editing. This adds the new
#   settings.INLINE_URL_EMBED_PREVIEW setting to control whether this
#   feature is enabled. By default, this setting is currently disabled,
#   so that we can burn it in for a bit before it impacts users more
#   broadly. Eventually, we may want to make this manageable via a
#   (set of?) per-realm settings. E.g. I can imagine a realm wanting to
#   be able to enable/disable it for certain URLs.
#   Lines: `from .base import BaseParser`,
#   `class OpenGraphParser(BaseParser):`, `for tag in meta:`
#
# 2018-05-10 19:13:36 +02:00
#   zerver/lib: Change use of typing.Text to str.
#   Lines: `from typing import Dict`,
#   `def extract_data(self) -> Dict[str, str]:`
#
# 2019-12-12 02:10:50 +01:00
#   url_preview: Fix parsing of open graph tags.
#   Our open graph parser logic sloppily mixed data obtained by parsing
#   open graph properties with trusted data set by our oembed parser.
#   We fix this by consistently using our explicit whitelist of generic
#   properties (image, title, and description) in both places where we
#   interact with open graph properties. The fixes are redundant with
#   each other, but doing both helps in making the intent of the code
#   clearer. This issue fixed here was originally reported as an XSS
#   vulnerability in the upcoming Inline URL Previews feature found by
#   Graham Bleaney and Ibrahim Mohamed using Pysa. The recent Oembed
#   changes close that vulnerability, but this change is still worth
#   doing to make the implementation do what it looks like it does.
#   Lines: `allowed_og_properties = {` and its closing `}`,
#   `result = {}`, the three `continue` statements that follow the
#   property and content checks, `return result`
#
# 2020-06-11 00:54:34 +02:00
#   python: Sort imports with isort. Fixes #2665.
#   Regenerated by tabbott with `lint --fix` after a rebase and change
#   in parameters. Note from tabbott: In a few cases, this converts
#   technical debt in the form of unsorted imports into different
#   technical debt in the form of our largest files having very long,
#   ugly import sequences at the start. I expect this change will
#   increase pressure for us to split those files, which isn't a bad
#   thing.
#   Signed-off-by: Anders Kaseorg <anders@zulip.com>
#   Lines: a blank line in the import block
#
# 2021-02-12 08:20:45 +01:00
#   python: Normalize quotes with Black.
#   Signed-off-by: Anders Kaseorg <anders@zulip.com>
#   Lines: `"og:title",`, `"og:description",`, `"og:image",`,
#   `meta = self._soup.findAll("meta")`,
#   `if not tag.has_attr("property"):`,
#   `if tag["property"] not in self.allowed_og_properties:`,
#   `og_property_name = tag["property"][len("og:") :]`,
#   `if not tag.has_attr("content"):`,
#   `result[og_property_name] = tag["content"]`
#
# 2022-02-18 22:48:53 +01:00
#   url_preview: Only return image URLs that validate as URLs.
#   Lines: `from urllib.parse import urlparse`, the
#   `if og_property_name == "image":` block including its try/except
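
To see the allowlist filtering end to end without a Beautiful Soup dependency, here is a rough standard-library sketch. It swaps in `html.parser.HTMLParser` for the `BaseParser`/`_soup` machinery, so `OpenGraphMetaCollector`, `extract_og_data`, and `SAMPLE_HTML` are hypothetical names for illustration. It is not the Zulip implementation and may differ from it on edge cases such as valueless attributes.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse

# Same three properties as allowed_og_properties in the parser.
ALLOWED_OG_PROPERTIES = {"og:title", "og:description", "og:image"}


class OpenGraphMetaCollector(HTMLParser):
    """Hypothetical stand-in for OpenGraphParser built on the stdlib parser."""

    def __init__(self) -> None:
        super().__init__()
        self.result: Dict[str, str] = {}

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = attributes.get("content")
        # Skip tags without an allowed property or without a content value.
        if prop is None or prop not in ALLOWED_OG_PROPERTIES or content is None:
            return
        name = prop[len("og:"):]
        if name == "image":
            try:
                # urlparse accepts relative URLs and raises ValueError
                # only for input it cannot parse.
                urlparse(content)
            except ValueError:
                return
        self.result[name] = content


def extract_og_data(html: str) -> Dict[str, str]:
    collector = OpenGraphMetaCollector()
    collector.feed(html)
    collector.close()
    return collector.result


SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Zulip">
<meta property="og:description" content="Team chat">
<meta property="og:image" content="/static/images/logo.png">
<meta property="og:type" content="website">
<meta property="og:image" content="http://[::1">
<meta name="description" content="ignored: no property attribute">
</head></html>
"""

print(extract_og_data(SAMPLE_HTML))
```

With this sample, `og:type` is dropped because it is not on the allowlist, the second `og:image` is skipped because `urlparse` rejects it, and the `name="description"` tag is ignored because it has no `property` attribute. The result keeps `title`, `description`, and the relative `image` path.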