HTML To Markdown
The HTML To Markdown pipeline transforms HTML to Markdown.
Markdown formatting is applied for headings, blockquotes, lists, code, tables and text. Visual formatting (bold, italic, etc.) is also included.
This pipeline searches for the best node that has relevant text, often found with an article, main or body tag.
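The node search described above can be illustrated with a minimal sketch. This is not the pipeline's implementation: it uses only the Python standard library instead of BeautifulSoup, and the names `PRIORITY`, `ContainerText` and `best_text` are hypothetical. It assumes "best" means the first of article, main or body that contains any text.

```python
from html.parser import HTMLParser

# Candidate container tags in order of preference (an assumption for illustration)
PRIORITY = ("article", "main", "body")


class ContainerText(HTMLParser):
    """Collects the text found inside each candidate container tag."""

    def __init__(self):
        super().__init__()
        self.depth = {tag: 0 for tag in PRIORITY}
        self.text = {tag: [] for tag in PRIORITY}

    def handle_starttag(self, tag, attrs):
        if tag in self.depth:
            self.depth[tag] += 1

    def handle_endtag(self, tag):
        if tag in self.depth and self.depth[tag] > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        # Record the text for every candidate container that is currently open
        for tag in PRIORITY:
            if self.depth[tag] > 0:
                self.text[tag].append(data)


def best_text(html):
    """Returns text from the first candidate container that has any, else an empty string."""
    parser = ContainerText()
    parser.feed(html)
    parser.close()

    for tag in PRIORITY:
        # Join the collected chunks and normalize whitespace
        content = " ".join(" ".join(parser.text[tag]).split())
        if content:
            return content

    return ""


page = "<html><body><nav>Menu</nav><article><p>Story text</p></article></body></html>"
print(best_text(page))
```

Here the navigation text sits inside body but outside article, so preferring the article node leaves it out of the result.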
The HTML to Markdown pipeline requires the BeautifulSoup4 library to be installed.
Example
The following shows a simple example using this pipeline.
from txtai.pipeline import HTMLToMarkdown
# Create and run pipeline
md = HTMLToMarkdown()
md("<html><body>This is a test</body></html>")
Configuration-driven example
Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.
config.yml
# Create pipeline using lower case class name
htmltomarkdown:

# Run pipeline with workflow
workflow:
  markdown:
    tasks:
      - action: htmltomarkdown
Run with Workflows
from txtai import Application
# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("markdown", ["<html><body>This is a test</body></html>"]))
Run with API
CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"markdown", "elements":["<html><body>This is a test</body></html>"]}'
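The same request can be sent from Python. The sketch below mirrors the curl command using only the standard library. The helper names `workflow_request` and `run_workflow` are hypothetical, and the response is assumed to be JSON.

```python
import json
from urllib.request import Request, urlopen


def workflow_request(name, elements, url="http://localhost:8000/workflow"):
    """Builds a POST request with the same URL, header and body as the curl command."""
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return Request(url, data=payload, headers={"Content-Type": "application/json"}, method="POST")


def run_workflow(name, elements, url="http://localhost:8000/workflow"):
    """Sends the request and decodes the response, assumed to be JSON. Needs a running API server."""
    with urlopen(workflow_request(name, elements, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Build the request without sending it
request = workflow_request("markdown", ["<html><body>This is a test</body></html>"])
print(request.get_method(), request.full_url)
```

Calling `run_workflow("markdown", [...])` with the same arguments sends the request once the API has been started as shown above.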
Methods
Python documentation for the pipeline.
__init__(paragraphs=False, sections=False)
Create a new HTMLToMarkdown instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paragraphs | | True if paragraph parsing enabled, False otherwise | False |
sections | | True if section parsing enabled, False otherwise | False |
Source code in txtai/pipeline/data/htmltomd.py
__call__(html)
Transforms input HTML into Markdown formatted text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
html | | input html | required |
Returns:
Type | Description |
---|---|
markdown formatted text |
Source code in txtai/pipeline/data/htmltomd.py