File To HTML
The File To HTML pipeline transforms files to HTML. It supports the following text extraction backends.
Apache Tika
Apache Tika detects and extracts metadata and text from over a thousand different file types. See the Apache Tika documentation for a list of supported document formats.
Apache Tika requires Java to be installed. Alternatively, start a separate Apache Tika service from its Docker image and configure the connection to it through environment variables.
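As a rough sketch of the separate-service setup, the snippet below sets two environment variables before the pipeline is created. The names `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` are the ones read by the tika Python client as far as we know, and port 9998 is the usual Tika server default. Neither is stated on this page, so treat them as assumptions and check them against the environment variable list referenced above. `use_remote_tika` is a hypothetical helper name.

```python
import os


def use_remote_tika(endpoint="http://localhost:9998"):
    """Hypothetical helper: point the Tika client at an already running server."""
    # Assumed variable names (from the tika Python client, not from this page)
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint


# Call this before creating the FileToHTML pipeline so the settings are in place
use_remote_tika()
```

With these set, the client should talk to the running service rather than trying to start a local Java process.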
Docling
Docling parses documents and exports them to the desired format with ease and speed. The library has rapidly gained popularity since late 2024 and excels at parsing formatting elements from PDFs (tables, sections, etc.).
See the Docling documentation for a list of supported document formats.
Example
The following shows a simple example using this pipeline.
from txtai.pipeline import FileToHTML
# Create and run pipeline
html = FileToHTML()
html("/path/to/file")
Configuration-driven example
Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.
config.yml
# Create pipeline using lower case class name
filetohtml:

# Run pipeline with workflow
workflow:
  html:
    tasks:
      - action: filetohtml
Run with Workflows
from txtai import Application
# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("html", ["/path/to/file"]))
Run with API
CONFIG=config.yml uvicorn "txtai.api:app" &
curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"html", "elements":["/path/to/file"]}'
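The same request can be issued from Python. The sketch below mirrors the curl call above using only the standard library; `workflow_request` and `run_workflow` are hypothetical helper names, and the URL and payload are copied from the curl example.

```python
import json
from urllib.request import Request, urlopen


def workflow_request(name, elements, base_url="http://localhost:8000"):
    """Build the POST request that the curl command above sends."""
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return Request(
        f"{base_url}/workflow",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_workflow(name, elements, base_url="http://localhost:8000"):
    """Send the request and decode the JSON response (requires the API server to be running)."""
    with urlopen(workflow_request(name, elements, base_url)) as response:
        return json.load(response)


# Building the request does not touch the network
request = workflow_request("html", ["/path/to/file"])
```

Calling `run_workflow("html", ["/path/to/file"])` only works once the API process shown above has started.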
Methods
Python documentation for the pipeline.
__init__(backend='available')
Creates a new File to HTML pipeline.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
backend | | backend to use to extract content, supports "tika", "docling" or "available" (default) which finds the first available | 'available' |
Source code in txtai/pipeline/data/filetohtml.py
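Before choosing a `backend` value, it can help to see which backend packages are importable in the current environment. The sketch below is a diagnostic under stated assumptions, not the pipeline's own availability check: `installed_backends` is a hypothetical helper, it assumes the backends are importable as `tika` and `docling`, and it does not verify that Java is present for Tika.

```python
from importlib.util import find_spec


def installed_backends(candidates=("tika", "docling"), find=find_spec):
    """Hypothetical helper: list candidate backends whose Python package can be found."""
    return [name for name in candidates if find(name) is not None]


# Pick the first importable package, or fall back to the documented default
found = installed_backends()
backend = found[0] if found else "available"

# The choice would then be passed to the pipeline:
# from txtai.pipeline import FileToHTML
# html = FileToHTML(backend=backend)
```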
__call__(path)
Converts file at path to HTML. Returns None if no backend is available.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | | input file path | required |
Returns:
Type | Description |
---|---|
| | html if a backend is available, otherwise returns None |
Source code in txtai/pipeline/data/filetohtml.py
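Because the call returns None when no backend is available, callers may want to fail loudly instead of passing None downstream. A minimal sketch, assuming any callable with the same `path -> html or None` contract; `convert_or_fail` is a hypothetical helper and not part of txtai.

```python
def convert_or_fail(pipeline, path):
    """Hypothetical helper: run the pipeline and raise if no backend produced HTML."""
    html = pipeline(path)
    if html is None:
        raise RuntimeError(
            "No text extraction backend available: install Apache Tika (with Java) or Docling"
        )
    return html


# Intended usage with the real pipeline:
# from txtai.pipeline import FileToHTML
# html = convert_or_fail(FileToHTML(), "/path/to/file")
```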