HTML To Markdown

The HTML To Markdown pipeline transforms HTML to Markdown.

Markdown formatting is applied to headings, blockquotes, lists, code, tables and text. Inline formatting such as bold and italic is also preserved.
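
To make the mapping from HTML elements to Markdown concrete, the following is a minimal sketch that uses only the Python standard library. The `TinyMarkdown` class and `to_markdown` function are hypothetical names for illustration. This is not the txtai implementation, and it only covers headings, paragraphs, list items, blockquotes and a few inline tags.

```python
import re
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Hypothetical, heavily simplified converter (not the txtai implementation)."""

    # Inline tags mapped to the Markdown marker that wraps their text
    INLINE = {"b": "**", "strong": "**", "i": "*", "em": "*", "code": "`"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            # Heading level maps to the number of leading '#' characters
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "blockquote":
            self.parts.append("\n\n> ")
        elif tag in ("p", "ul", "ol"):
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        # Close inline markers; block elements need no closing marker here
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag])

    def handle_data(self, data):
        self.parts.append(data)


def to_markdown(html):
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()

    # Collapse runs of blank lines left behind by adjacent block elements
    return re.sub(r"\n{3,}", "\n\n", "".join(parser.parts)).strip()


html = "<h1>Title</h1><p>Some <b>bold</b> and <i>italic</i> text</p><ul><li>one</li><li>two</li></ul>"
print(to_markdown(html))
```

Tables, nested lists, ordered list numbering and links are deliberately left out to keep the sketch short.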

This pipeline searches for the node that best holds the relevant text, which is typically an article, main or body element.
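
One way to picture this node search is sketched below with the standard library. The preference order (article, then main, then body) and the names `ContentFinder` and `best_text` are assumptions made for illustration. The actual pipeline relies on BeautifulSoup4 and may score nodes differently.

```python
from html.parser import HTMLParser


class ContentFinder(HTMLParser):
    """Collects the text found inside article, main and body elements."""

    # Assumed preference order, most specific container first
    CANDIDATES = ("article", "main", "body")

    def __init__(self):
        super().__init__()
        self.stack = []  # candidate tags that are currently open
        self.chunks = {tag: [] for tag in self.CANDIDATES}

    def handle_starttag(self, tag, attrs):
        if tag in self.CANDIDATES:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self.stack.remove(tag)

    def handle_data(self, data):
        # Text counts toward every candidate element that encloses it
        for tag in set(self.stack):
            self.chunks[tag].append(data)


def best_text(html):
    finder = ContentFinder()
    finder.feed(html)
    finder.close()

    # Return the first candidate, in preference order, that holds any text
    for tag in ContentFinder.CANDIDATES:
        text = " ".join("".join(finder.chunks[tag]).split())
        if text:
            return tag, text

    return None, ""


page = "<html><body><nav>Menu</nav><article><p>Story text</p></article></body></html>"
print(best_text(page))
```

In this sketch the navigation text is ignored because the article element is preferred over the enclosing body.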

The HTML To Markdown pipeline requires the BeautifulSoup4 library, which is installed with the txtai "pipeline" extra.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import HTMLToMarkdown

# Create and run pipeline
md = HTMLToMarkdown()
md("<html><body>This is a test</body></html>")

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
htmltomarkdown:

# Run pipeline with workflow
workflow:
  markdown:
    tasks:
      - action: htmltomarkdown

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("markdown", ["<html><body>This is a test</body></html>"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"markdown", "elements":["<html><body>This is a test</body></html>"]}'
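
The same request can be issued from Python. The sketch below mirrors the curl command using only the standard library. The helper names `workflow_request` and `run_workflow` are hypothetical, and the code assumes the API server from the command above is listening on localhost port 8000 and returns a JSON body.

```python
import json
from urllib.request import Request, urlopen


def workflow_request(name, elements, url="http://localhost:8000/workflow"):
    """Builds the POST request that the curl command sends."""
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return Request(url, data=payload, headers={"Content-Type": "application/json"}, method="POST")


def run_workflow(name, elements):
    """Sends the request. Requires the API server to be running."""
    with urlopen(workflow_request(name, elements)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not need a running server
request = workflow_request("markdown", ["<html><body>This is a test</body></html>"])
print(request.get_method(), request.full_url)
```

Call `run_workflow("markdown", [...])` once the server is up to get the workflow output.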

Methods

Python documentation for the pipeline.

`__init__(paragraphs=False, sections=False)`

Create a new HTMLToMarkdown instance.

Parameters:

Name	Type	Description	Default
`paragraphs`		True if paragraph parsing enabled, False otherwise	`False`
`sections`		True if section parsing enabled, False otherwise	`False`

Source code in txtai/pipeline/data/htmltomd.py

def __init__(self, paragraphs=False, sections=False):
    """
    Create a new Extract instance.

    Args:
        paragraphs: True if paragraph parsing enabled, False otherwise
        sections: True if section parsing enabled, False otherwise
    """

    if not SOUP:
        raise ImportError('HTMLToMarkdown pipeline is not available - install "pipeline" extra to enable')

    self.paragraphs = paragraphs
    self.sections = sections
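
The `SOUP` flag checked in the constructor is not defined in the excerpt above. A common pattern for this kind of optional dependency is an import guard at module level, sketched below. This is an assumption about how the flag is set, and `require_soup` is a hypothetical helper added only to make the sketch self-contained.

```python
# Optional dependency guard: record whether BeautifulSoup4 can be imported
try:
    from bs4 import BeautifulSoup  # noqa: F401

    SOUP = True
except ImportError:
    SOUP = False


def require_soup():
    """Raises a descriptive error when the optional dependency is missing."""
    if not SOUP:
        raise ImportError('HTMLToMarkdown pipeline is not available - install "pipeline" extra to enable')


print("BeautifulSoup4 available:", SOUP)
```

With this pattern the module can always be imported, and the error only surfaces when the pipeline is actually constructed.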

`__call__(html)`

Transforms input HTML into Markdown formatted text.