from urllib.parse import urlsplit

from bs4.element import Tag
from typing_extensions import override

from zerver.lib.url_preview.parsers.base import BaseParser
from zerver.lib.url_preview.types import UrlEmbedData


class GenericParser(BaseParser):
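    """
    Parser that extracts a preview directly from a page's HTML using simple
    heuristics: the title comes from <title> or the first <h1>, the
    description from the meta description tag or a nearby <p>, and the
    image from the first <img> sibling of the <h1>.
    """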
    @override
    def extract_data(self) -> UrlEmbedData:
        return UrlEmbedData(
            title=self._get_title(),
            description=self._get_description(),
            image=self._get_image(),
        )

    def _get_title(self) -> str | None:
        soup = self._soup
        if soup.title and soup.title.text != "":
            return soup.title.text
        if soup.h1 and soup.h1.text != "":
            return soup.h1.text
        return None

    def _get_description(self) -> str | None:
        soup = self._soup
        meta_description = soup.find("meta", attrs={"name": "description"})
        if isinstance(meta_description, Tag) and meta_description.get("content", "") != "":
            assert isinstance(meta_description["content"], str)
            return meta_description["content"]
        first_h1 = soup.find("h1")
        if first_h1:
            first_p = first_h1.find_next("p")
            if first_p and first_p.text != "":
                return first_p.text
        first_p = soup.find("p")
        if first_p and first_p.text != "":
            return first_p.text
        return None

    def _get_image(self) -> str | None:
        """
        Find the first image after the h1 header.
        Presumably it will be the main image.
        """
        soup = self._soup
        first_h1 = soup.find("h1")
        if first_h1:
            first_image = first_h1.find_next_sibling("img", src=True)
            if isinstance(first_image, Tag) and first_image["src"] != "":
                assert isinstance(first_image["src"], str)
                try:
                    # We use urlsplit and not URLValidator because we
                    # need to support relative URLs.
                    urlsplit(first_image["src"])
                except ValueError:
                    return None
                return first_image["src"]
        return None
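# Illustrative usage sketch (not part of the parser): exercises the
# heuristics above on a small hand-written page. Because BaseParser's
# constructor is not shown in this file, the demo subclass below builds
# self._soup directly rather than guessing that signature; the HTML and
# the _DemoParser name are purely hypothetical.
if __name__ == "__main__":
    from bs4 import BeautifulSoup

    class _DemoParser(GenericParser):
        def __init__(self, html: str) -> None:
            self._soup = BeautifulSoup(html, "html.parser")

    demo_html = (
        "<html><head><title>Example page</title></head>"
        "<body><h1>Example page</h1><p>A short summary.</p>"
        '<img src="/hero.png"></body></html>'
    )
    data = _DemoParser(demo_html).extract_data()
    # Expected: title "Example page", description "A short summary.",
    # image "/hero.png" (a relative URL, which urlsplit accepts).
    print(data)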