zulip/zerver/lib/html_to_text.py

from collections.abc import Mapping

from bs4 import BeautifulSoup
from django.utils.html import escape

from zerver.lib.cache import cache_with_key, open_graph_description_cache_key


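def html_to_text(content: str | bytes, tags: Mapping[str, str] = {"p": " | "}) -> str:
    """Convert HTML into a short, escaped plain-text summary.

    `tags` maps each tag name to extract to the delimiter inserted before
    that element's text when earlier text has already been collected.
    Collection stops after the first element that pushes the text past
    500 characters.
    """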
    bs = BeautifulSoup(content, features="lxml")
    # Skip any admonition (warning) blocks, since they're
    # usually something about users needing to be an
    # organization administrator, and not useful for
    # describing the page.
    for tag in bs.find_all("div", class_="admonition"):
        tag.clear()

    # Skip tabbed-sections, which just contain navigation instructions.
    for tag in bs.find_all("div", class_="tabbed-section"):
        tag.clear()

    text = ""
    for element in bs.find_all(tags.keys()):
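        # Ignore empty elements
        if not element.text:
            continue
        # Separate this element from the previous text with its tag's delimiter
        if text:
            text += tags[element.name]
        # .text converts it from HTML to text
        text += element.text
        # Stop once the description is long enough
        if len(text) > 500:
            break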
    return escape(" ".join(text.split()))


@cache_with_key(open_graph_description_cache_key, timeout=3600 * 24)
def get_content_description(content: bytes, request_url: str) -> str:
    # Cached for 24 hours (3600 * 24 seconds).
    return html_to_text(content)

ruff: Fix UP035 Import from `collections.abc`, `typing` instead. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2024-07-12 02:30:25 +02:00
    `from collections.abc import Mapping`

html_to_text: Add arg to specify html tags for generating text. Closes #11497 2019-04-24 04:10:56 +02:00
    (blank line)

html_to_text: Extract code for html to plain text conversion. 2019-04-24 02:50:25 +02:00
    `from bs4 import BeautifulSoup`

html_to_text: Escape text when using as description. 2019-04-24 03:37:34 +02:00
    `from django.utils.html import escape`

html_to_text: Extract code for html to plain text conversion. 2019-04-24 02:50:25 +02:00
    (blank line)
    `from zerver.lib.cache import cache_with_key, open_graph_description_cache_key`
    (blank line)

python: Sort imports with isort. Fixes #2665. Regenerated by tabbott with `lint --fix` after a rebase and change in parameters. Note from tabbott: In a few cases, this converts technical debt in the form of unsorted imports into different technical debt in the form of our largest files having very long, ugly import sequences at the start. I expect this change will increase pressure for us to split those files, which isn't a bad thing. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2020-06-11 00:54:34 +02:00
    (blank line)

ruff: Fix UP007 Use `X | Y` for type annotations. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2024-07-12 02:30:23 +02:00
    `def html_to_text(content: str | bytes, tags: Mapping[str, str] = {"p": " | "}) -> str:`

python: Normalize quotes with Black. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2021-02-12 08:20:45 +01:00
    `bs = BeautifulSoup(content, features="lxml")`

html_to_text: Extract code for html to plain text conversion. 2019-04-24 02:50:25 +02:00
    `# Skip any admonition (warning) blocks, since they're`
    `# usually something about users needing to be an`
    `# organization administrator, and not useful for`
    `# describing the page.`

python: Normalize quotes with Black. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2021-02-12 08:20:45 +01:00
    `for tag in bs.find_all("div", class_="admonition"):`

html_to_text: Extract code for html to plain text conversion. 2019-04-24 02:50:25 +02:00
    `tag.clear()`
    (blank line)

widgets: Rename confusing attribute name in `tabbed_sections.py`. Renames misleading attribute in HTML template using `code-section` to refer to both language toggles in API docs and app toggles in help center docs. 2023-08-30 00:58:01 +02:00
    `# Skip tabbed-sections, which just contain navigation instructions.`
    `for tag in bs.find_all("div", class_="tabbed-section"):`

html_to_text: Extract code for html to plain text conversion. 2019-04-24 02:50:25 +02:00
    `tag.clear()`
    (blank line)

python: Normalize quotes with Black. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2021-02-12 08:20:45 +01:00
    `text = ""`

html_to_text: Add delimiters between text from different elements. This module is used to render the HTML of pages like our user documentation into text for use in open graph previews of those articles. It provided somewhat confusing output in the case that there were paragraph breaks in the original message, because text with multiple paragraphs and list items doesn't read very well. This commit adds `|` as a delimiter between paragraphs, and prefixes list items with a `*`. Closes #12228 2019-05-02 02:35:20 +02:00
    `for element in bs.find_all(tags.keys()):`
    `# Ignore empty elements`
    `if not element.text:`
    `continue`

html_to_text: Extract code for html to plain text conversion. 2019-04-24 02:50:25 +02:00
    `# .text converts it from HTML to text`

html_to_text: Add delimiters between text from different elements. 2019-05-02 02:35:20 +02:00
    `if text:`
    `text += tags[element.name]`
    `text += element.text`

html_to_text: Extract code for html to plain text conversion. 2019-04-24 02:50:25 +02:00
    `if len(text) > 500:`

html_to_text: Escape text when using as description. 2019-04-24 03:37:34 +02:00
    `break`

python: Normalize quotes with Black. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2021-02-12 08:20:45 +01:00
    `return escape(" ".join(text.split()))`

html_to_text: Extract code for html to plain text conversion. 2019-04-24 02:50:25 +02:00
    (blank line)

python: Reformat with Black, except quotes. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2021-02-12 08:19:30 +01:00
    (blank line)
    `@cache_with_key(open_graph_description_cache_key, timeout=3600 * 24)`

documentation: Move OpenGraph description updating out of middleware. This middleware was highly-specific to a set of URLs, and pulled in a beautifulsoup dependency for Tornado. Move it closer to where it is used, minimizing action at a distance, as well as trimming out a dependency. 2024-04-16 18:27:55 +02:00
    `def get_content_description(content: bytes, request_url: str) -> str:`

python: Skip unnecessary decode before BeautifulSoup parsing. Signed-off-by: Anders Kaseorg <anders@zulip.com> 2020-10-30 01:21:40 +01:00
    `return html_to_text(content)`