File To HTML

The File To HTML pipeline transforms files to HTML. It supports the following text extraction backends.

Apache Tika

Apache Tika detects and extracts metadata and text from over a thousand different file types. The Apache Tika documentation lists the supported document formats.

Apache Tika requires Java to be installed. Alternatively, start a separate Apache Tika service from the Apache Tika Docker image and set the environment variables that point the client at that service.

Docling

Docling parses documents and exports them to the desired format with ease and speed. The library has rapidly gained popularity since late 2024. Docling excels at parsing formatting elements from PDFs (tables, sections, etc.).

The Docling documentation lists the supported document formats.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import FileToHTML

# Create and run pipeline
html = FileToHTML()
html("/path/to/file")

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
filetohtml:

# Run pipeline with workflow
workflow:
  html:
    tasks:
      - action: filetohtml

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("html", ["/path/to/file"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"html", "elements":["/path/to/file"]}'

Methods

Python documentation for the pipeline.

`__init__(backend='available')`

Creates a new File to HTML pipeline.

Parameters:

Name	Type	Description	Default
`backend`		backend to use to extract content, supports "tika", "docling" or "available" (default) which finds the first available	`'available'`

Source code in txtai/pipeline/data/filetohtml.py

def __init__(self, backend="available"):
    """
    Creates a new File to HTML pipeline.

    Args:
        backend: backend to use to extract content, supports "tika", "docling" or "available" (default) which finds the first available
    """

    # Lowercase backend parameter
    backend = backend.lower() if backend else None

    # Check for available backend
    if backend == "available":
        backend = "tika" if Tika.available() else "docling" if Docling.available() else None

    # Create backend instance
    self.backend = Tika() if backend == "tika" else Docling() if backend == "docling" else None
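
The selection order in the constructor can be traced with a small standalone sketch. It does not import txtai: `resolve_backend` and its two availability flags are hypothetical names introduced here for illustration, standing in for `Tika.available()` and `Docling.available()`. Under those assumptions, it shows that "available" prefers Tika over Docling and that an unrecognized name results in no backend.

```python
def resolve_backend(backend="available", tika_available=False, docling_available=False):
    """
    Hypothetical helper that mirrors the constructor's selection order.
    Returns "tika", "docling" or None.
    """

    # Lowercase backend parameter, treating empty values as no backend
    backend = backend.lower() if backend else None

    # "available" picks Tika first, then Docling, then nothing
    if backend == "available":
        backend = "tika" if tika_available else "docling" if docling_available else None

    # Any name other than the two supported backends yields no backend
    return backend if backend in ("tika", "docling") else None


# Print the resolved backend for each combination of availability flags
for flags in [(True, True), (False, True), (False, False)]:
    print(flags, resolve_backend("available", *flags))
```

Note that an explicit "tika" or "docling" is returned as-is regardless of the flags, which matches the constructor: it only checks availability when asked for "available".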

`__call__(path)`

Converts file at path to HTML. Returns None if no backend is available.
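
Because the pipeline returns None rather than raising when no backend is available, callers may want to check the result before using it. The following is a minimal sketch of one way to do that. `convert` is a hypothetical wrapper, not part of txtai, and the two lambdas are stand-ins for a pipeline instance so that the sketch runs without txtai installed.

```python
def convert(pipeline, path):
    """
    Hypothetical wrapper: runs a File To HTML style callable and fails loudly on None.
    """

    html = pipeline(path)
    if html is None:
        raise RuntimeError(f"No HTML produced for {path}: is Apache Tika or Docling installed?")

    return html


# Stand-in callables (assumptions for illustration, not txtai objects)
working = lambda path: "<html><body>text</body></html>"
missing = lambda path: None

print(convert(working, "/path/to/file"))

try:
    convert(missing, "/path/to/file")
except RuntimeError as error:
    print(error)
```

With a real pipeline, the same wrapper would be called as `convert(FileToHTML(), "/path/to/file")`.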