HTML To Markdown

The HTML To Markdown pipeline transforms HTML to Markdown.

Markdown formatting is applied to headings, blockquotes, lists, code, tables and text. Inline formatting such as bold and italic is also preserved.

This pipeline searches for the node most likely to hold the relevant text. It looks for an article tag first, then a main tag, and falls back to the body tag when neither is present.
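The selection order can be sketched with the standard library alone. This is only an illustration of the preference order, not the pipeline's implementation, which uses BeautifulSoup and also skips article sections that contain no paragraph. `TagCollector` and `pick_container` are hypothetical names introduced for this sketch.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every start tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case
        self.tags.add(tag)


def pick_container(html):
    """Returns the tag name that would be searched for content (hypothetical helper)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()

    # Prefer article, then main, otherwise fall back to body
    return next((x for x in ["article", "main"] if x in collector.tags), "body")


print(pick_container("<html><body><main><p>This is a test</p></main></body></html>"))
```

A document that has neither an article nor a main tag resolves to body, which matches the simple example used throughout this page.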

The HTML To Markdown pipeline requires the BeautifulSoup4 library, which is installed with the "pipeline" extra.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import HTMLToMarkdown

# Create and run pipeline
md = HTMLToMarkdown()
md("<html><body>This is a test</body></html>")

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
htmltomarkdown:

# Run pipeline with workflow
workflow:
  markdown:
    tasks:
      - action: htmltomarkdown

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("markdown", ["<html><body>This is a test</body></html>"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"markdown", "elements":["<html><body>This is a test</body></html>"]}'
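The same request can be built from Python with the standard library. This is a minimal sketch, assuming the API server started above is listening on localhost:8000 and returns a JSON body. `build_workflow_request` and `run_workflow` are hypothetical helper names, not part of txtai.

```python
import json
import urllib.request


def build_workflow_request(name, elements, url="http://localhost:8000/workflow"):
    """Builds a POST request equivalent to the curl command (hypothetical helper)."""
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_workflow(request):
    """Sends the request. Requires the API server to be running."""
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body is JSON
        return json.loads(response.read().decode("utf-8"))


request = build_workflow_request("markdown", ["<html><body>This is a test</body></html>"])

# Uncomment once the server is up:
# results = run_workflow(request)
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.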

Methods

Python documentation for the pipeline.

__init__(paragraphs=False, sections=False)

Create a new HTMLToMarkdown instance.

Parameters:

Name        Description                                         Default
paragraphs  True if paragraph parsing enabled, False otherwise  False
sections    True if section parsing enabled, False otherwise    False
Source code in txtai/pipeline/data/htmltomd.py
def __init__(self, paragraphs=False, sections=False):
    """
    Create a new HTMLToMarkdown instance.

    Args:
        paragraphs: True if paragraph parsing enabled, False otherwise
        sections: True if section parsing enabled, False otherwise
    """

    if not SOUP:
        raise ImportError('HTMLToMarkdown pipeline is not available - install "pipeline" extra to enable')

    self.paragraphs = paragraphs
    self.sections = sections
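The listing above checks a `SOUP` flag that is defined elsewhere in the module. How it is set is not shown here. A common pattern for optional dependencies, offered as an assumption rather than as the module's actual code, looks like this (`require_soup` is a hypothetical helper):

```python
# Assumed pattern: record whether the optional dependency imported cleanly
try:
    from bs4 import BeautifulSoup  # provided by the BeautifulSoup4 package

    SOUP = True
except ImportError:
    SOUP = False


def require_soup():
    """Raises ImportError when BeautifulSoup4 is missing (hypothetical helper)."""
    if not SOUP:
        raise ImportError('HTMLToMarkdown pipeline is not available - install "pipeline" extra to enable')
```

With this pattern, importing the module never fails. The error is only raised when the pipeline is instantiated without the dependency installed.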

__call__(html)

Transforms input HTML into Markdown formatted text.

Parameters:

Name  Description  Default
html  input html   required

Returns:

markdown formatted text

Source code in txtai/pipeline/data/htmltomd.py
def __call__(self, html):
    """
    Transforms input HTML into Markdown formatted text.

    Args:
        html: input html

    Returns:
        markdown formatted text
    """

    # HTML Parser
    soup = BeautifulSoup(html, features="html.parser")

    # Ignore script and style tags
    for script in soup.find_all(["script", "style"]):
        script.decompose()

    # Check for article sections
    article = next((x for x in ["article", "main"] if soup.find(x)), None)

    # Extract text from each section element
    nodes = []
    for node in soup.find_all(article if article else "body"):
        # Skip article sections without at least 1 paragraph
        if not article or node.find("p"):
            nodes.append(self.process(node, article))

    # Return extracted text, fallback to default text extraction if no nodes found
    return "\n".join(self.metadata(soup) + nodes) if nodes else self.default(soup)
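The final return statement can be read in isolation. The sketch below mirrors only that expression. `assemble` is a hypothetical function, and its arguments stand in for the results of `self.metadata`, `self.process` and `self.default`, which are not shown in this listing and are assumed to return a list of strings, a string per node, and a string respectively.

```python
def assemble(metadata, nodes, fallback):
    """Mirrors the return expression above with plain values (hypothetical helper)."""
    # Metadata is only prepended when at least one content node was extracted
    return "\n".join(metadata + nodes) if nodes else fallback


# At least one node found: metadata lines come first, then each node
print(assemble(["Title"], ["First section", "Second section"], "fallback text"))

# No nodes found: the fallback is returned and metadata is dropped
print(assemble(["Title"], [], "fallback text"))
```

One consequence of the loop above is that a page with an article tag but no paragraph inside it produces no nodes, so it takes the fallback path.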