zulip/zerver/lib/url_preview/parsers/base.py

import cgi

from zerver.lib.url_preview.types import UrlEmbedData


class BaseParser:
    def __init__(self, html_source: bytes, content_type: str | None) -> None:
        # We import BeautifulSoup here, because it's not used by most
        # processes in production, and bs4 is big enough that
        # importing it adds 10s of milliseconds to manage.py startup.
        from bs4 import BeautifulSoup

        charset = None
        if content_type is not None:
            charset = cgi.parse_header(content_type)[1].get("charset")
        self._soup = BeautifulSoup(html_source, "lxml", from_encoding=charset)

    def extract_data(self) -> UrlEmbedData:
        raise NotImplementedError
url_preview: Allow Beautiful Soup to get the charset from <meta>. An HTML document sent without a charset in the Content-Type header needs to be scanned for a charset in <meta> tags. We need to pass bytes instead of str to Beautiful Soup to allow it to do this. Fixes #16843. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2020-12-08 04:26:30 +01:00
			`import cgi`
preview: Use a dataclass for the embed data. This is significantly cleaner than passing around `Dict[str, Any]` all of the time. 2022-04-14 21:52:41 +02:00
			`from zerver.lib.url_preview.types import UrlEmbedData`
Add oembed/Open Graph/Meta tags data retrieval from inline links. This change adds support for displaying inline open graph previews for links posted into Zulip. It is designed to interact correctly with message editing. This adds the new settings.INLINE_URL_EMBED_PREVIEW setting to control whether this feature is enabled. By default, this setting is currently disabled, so that we can burn it in for a bit before it impacts users more broadly. Eventually, we may want to make this manageable via a (set of?) per-realm settings. E.g. I can imagine a realm wanting to be able to enable/disable it for certain URLs. 2016-10-27 12:06:44 +02:00
python: Sort imports with isort. Fixes #2665. Regenerated by tabbott with `lint --fix` after a rebase and change in parameters. Note from tabbott: In a few cases, this converts technical debt in the form of unsorted imports into different technical debt in the form of our largest files having very long, ugly import sequences at the start. I expect this change will increase pressure for us to split those files, which isn't a bad thing. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2020-06-11 00:54:34 +02:00
zerver/lib: Remove inheritance from object. 2017-11-05 11:37:41 +01:00
			`class BaseParser:`
ruff: Fix UP007 Use `X | Y` for type annotations. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2024-07-12 02:30:23 +02:00
			`def __init__(self, html_source: bytes, content_type: str | None) -> None:`
url_preview: Don't import beautifulsoup at import time. This is a small performance optimization to Django startup, in line with other recent commits. 2018-08-08 22:24:20 +02:00
			`# We import BeautifulSoup here, because it's not used by most`
			`# processes in production, and bs4 is big enough that`
			`# importing it adds 10s of milliseconds to manage.py startup.`
			`from bs4 import BeautifulSoup`
python: Reformat with Black, except quotes. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2021-02-12 08:19:30 +01:00
url_preview: Allow Beautiful Soup to get the charset from <meta>. An HTML document sent without a charset in the Content-Type header needs to be scanned for a charset in <meta> tags. We need to pass bytes instead of str to Beautiful Soup to allow it to do this. Fixes #16843. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2020-12-08 04:26:30 +01:00
			`charset = None`
			`if content_type is not None:`
			`charset = cgi.parse_header(content_type)[1].get("charset")`
			`self._soup = BeautifulSoup(html_source, "lxml", from_encoding=charset)`
Add oembed/Open Graph/Meta tags data retrieval from inline links. This change adds support for displaying inline open graph previews for links posted into Zulip. It is designed to interact correctly with message editing. This adds the new settings.INLINE_URL_EMBED_PREVIEW setting to control whether this feature is enabled. By default, this setting is currently disabled, so that we can burn it in for a bit before it impacts users more broadly. Eventually, we may want to make this manageable via a (set of?) per-realm settings. E.g. I can imagine a realm wanting to be able to enable/disable it for certain URLs. 2016-10-27 12:06:44 +02:00
preview: Use a dataclass for the embed data. This is significantly cleaner than passing around `Dict[str, Any]` all of the time. 2022-04-14 21:52:41 +02:00
			`def extract_data(self) -> UrlEmbedData:`
ruff: Fix RSE102 Unnecessary parentheses on raised exception. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2023-02-04 02:07:20 +01:00
			`raise NotImplementedError`
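
The charset lookup in `__init__` relies on `cgi.parse_header`, and the `cgi` module was deprecated in Python 3.11 and removed in Python 3.13. The snippet below is a minimal sketch of the same header parsing using only `email.message` from the standard library. The helper name `charset_from_content_type` is hypothetical and is not part of this file. One behavioural difference to be aware of: `get_content_charset()` lower-cases its result, whereas `cgi.parse_header` returns the value as written.

```python
from __future__ import annotations

from email.message import EmailMessage


def charset_from_content_type(content_type: str | None) -> str | None:
    """Return the charset parameter of a Content-Type header, or None.

    Hypothetical helper: a stand-in for
    cgi.parse_header(content_type)[1].get("charset").
    """
    if content_type is None:
        return None
    message = EmailMessage()
    # Assigning the header lets the email package parse its parameters.
    message["Content-Type"] = content_type
    # Returns the lower-cased charset, or None when no charset is given.
    return message.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html"))
```

When the helper returns `None`, passing `from_encoding=None` together with the raw `bytes` body leaves Beautiful Soup free to detect the encoding itself, including from `<meta>` tags, which is the behaviour the 2020-12-08 commit describes.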