zulip/zerver/lib/html_to_text.py

from typing import Dict, Optional

from bs4 import BeautifulSoup
from django.http import HttpRequest
from django.utils.html import escape

from zerver.lib.cache import cache_with_key, open_graph_description_cache_key


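def html_to_text(content: str, tags: Optional[Dict[str, str]]=None) -> str:
    """Convert HTML to escaped plain text for open graph descriptions.

    `tags` maps each tag name to extract to the delimiter inserted
    before that element's text; it defaults to {'p': ' | '}.
    """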
    bs = BeautifulSoup(content, features='lxml')
    # Skip any admonition (warning) blocks, since they're
    # usually something about users needing to be an
    # organization administrator, and not useful for
    # describing the page.
    for tag in bs.find_all('div', class_="admonition"):
        tag.clear()

    # Skip code-sections, which just contain navigation instructions.
    for tag in bs.find_all('div', class_="code-section"):
        tag.clear()

    text = ''
    if tags is None:
        tags = {'p': ' | '}
    for element in bs.find_all(tags.keys()):
        # Ignore empty elements
        if not element.text:
            continue
        # .text converts it from HTML to text
        if text:
            text += tags[element.name]
        text += element.text
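        # Stop collecting once we have enough text; the result is not
        # truncated, so it may be somewhat longer than 500 characters.
        if len(text) > 500:
            break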
    return escape(' '.join(text.split()))

@cache_with_key(open_graph_description_cache_key, timeout=3600*24)
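def get_content_description(content: bytes, request: HttpRequest) -> str:
    # `request` is not used in the body; cache_with_key passes it to
    # open_graph_description_cache_key to compute the cache key.
    str_content = content.decode("utf-8")
    return html_to_text(str_content)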

Commit history (newest first):

2020-06-11 00:54:34 +02:00
python: Sort imports with isort. Fixes #2665. Regenerated by tabbott with `lint --fix` after a rebase and change in parameters. Note from tabbott: In a few cases, this converts technical debt in the form of unsorted imports into different technical debt in the form of our largest files having very long, ugly import sequences at the start. I expect this change will increase pressure for us to split those files, which isn't a bad thing. Signed-off-by: Anders Kaseorg <anders@zulip.com>

2019-05-02 02:35:20 +02:00
html_to_text: Add delimiters between text from different elements. This module is used to render the HTML of pages like our user documentation into text for use in open graph previews of those articles. It provided somewhat confusing output in the case that there were paragraph breaks in the original message, because text with multiple paragraphs and list items doesn't read very well. This commit adds `|` as a delimiter between paragraphs, and prefixes list items with a `*`. Closes #12228

2019-04-24 04:10:56 +02:00
html_to_text: Add arg to specify html tags for generating text. Closes #11497

2019-04-24 03:37:34 +02:00
html_to_text: Escape text when using as description.

2019-04-24 02:50:25 +02:00
html_to_text: Extract code for html to plain text conversion.
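
The delimiter behaviour described in the 2019-05-02 commit can be illustrated with a small standalone sketch. This is not the module's implementation: it swaps BeautifulSoup for the standard library's `html.parser`, uses `html.escape` in place of Django's `escape`, and does not skip `admonition` or `code-section` blocks. The names `sketch_html_to_text` and `_TagTextCollector` are hypothetical, and the `' * '` delimiter for `li` is an assumed value chosen to match the commit message. Unlike `find_all`, the sketch only reports the outermost matching element when matching tags are nested.

```python
from html import escape
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class _TagTextCollector(HTMLParser):
    """Collect (tag, text) pairs for the outermost elements named in `delimiters`."""

    def __init__(self, delimiters: Dict[str, str]) -> None:
        super().__init__()
        self.delimiters = delimiters
        self.chunks: List[Tuple[str, str]] = []
        self._open: List[str] = []  # stack of currently open matching tags
        self._buf: List[str] = []   # text seen inside the current outermost match

    def handle_starttag(self, tag, attrs):
        if tag in self.delimiters:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and tag == self._open[-1]:
            self._open.pop()
            if not self._open:
                self.chunks.append((tag, ''.join(self._buf)))
                self._buf = []

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)


def sketch_html_to_text(content: str, tags: Optional[Dict[str, str]] = None) -> str:
    if tags is None:
        tags = {'p': ' | '}
    collector = _TagTextCollector(tags)
    collector.feed(content)
    collector.close()

    text = ''
    for name, chunk in collector.chunks:
        # Ignore empty elements
        if not chunk:
            continue
        # Put the element's delimiter between it and the preceding text
        if text:
            text += tags[name]
        text += chunk
        if len(text) > 500:
            break
    # Collapse runs of whitespace, then escape for use in an HTML attribute
    return escape(' '.join(text.split()))


print(sketch_html_to_text('<p>First</p><p>Second &amp; third</p>'))
print(sketch_html_to_text(
    '<p>Intro</p><ul><li>One</li><li>Two</li></ul>',
    tags={'p': ' | ', 'li': ' * '},  # assumed delimiter for list items
))
```

With the default `tags`, only paragraph text is kept and joined with ` | `; passing a larger mapping lets each element type carry its own delimiter, which is the design the `tags` argument enables.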