File To HTML

The File To HTML pipeline transforms files to HTML. It supports the following text extraction backends.

Apache Tika

Apache Tika detects and extracts metadata and text from over a thousand different file types. See the Apache Tika documentation for a list of supported document formats.

Apache Tika requires Java to be installed. Alternatively, start a separate Apache Tika service with the Apache Tika Docker image and set the Tika client environment variables to point to that service.
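As a rough sketch of the separate-service option, the snippet below sets client environment variables before the pipeline is created. The variable names `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` are taken from the tika-python client, not from this page. The helper name `configure_remote_tika` and the `http://localhost:9998` endpoint are illustrative assumptions to adapt to the actual service address.

```python
import os


def configure_remote_tika(endpoint="http://localhost:9998"):
    """Points the Tika client at an already running server (hypothetical helper)."""

    # Assumption: these variable names are the ones read by the tika-python client
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint


# Assumption: a Tika service is listening on this local address
configure_remote_tika()
```

The client reads these values when it is first imported, so this should run before the pipeline is created.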

Docling

Docling parses documents and exports them to a target format. The library has rapidly gained popularity since late 2024 and excels at parsing formatting elements from PDFs, such as tables and sections.

See the Docling documentation for a list of supported document formats.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import FileToHTML

# Create and run pipeline
html = FileToHTML()
html("/path/to/file")

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
filetohtml:

# Run pipeline with workflow
workflow:
  html:
    tasks:
      - action: filetohtml

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("html", ["/path/to/file"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"html", "elements":["/path/to/file"]}'
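The same request can be built from Python. This is a minimal sketch using only the standard library. The endpoint, headers and payload mirror the curl call above, and the helper name `build_workflow_request` is hypothetical.

```python
import json
from urllib.request import Request


def build_workflow_request(name, elements, base_url="http://localhost:8000"):
    """Builds the POST /workflow request shown in the curl example."""

    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return Request(
        f"{base_url}/workflow",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_workflow_request("html", ["/path/to/file"])

# Sending the request requires the API server started above to be running:
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(json.load(response))
```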

Methods

Python documentation for the pipeline.

__init__(backend='available')

Creates a new File to HTML pipeline.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| backend | | backend to use to extract content, supports "tika", "docling" or "available" (default) which finds the first available | 'available' |

Source code in txtai/pipeline/data/filetohtml.py
def __init__(self, backend="available"):
    """
    Creates a new File to HTML pipeline.

    Args:
        backend: backend to use to extract content, supports "tika", "docling" or "available" (default) which finds the first available
    """

    # Lowercase backend parameter
    backend = backend.lower() if backend else None

    # Check for available backend
    if backend == "available":
        backend = "tika" if Tika.available() else "docling" if Docling.available() else None

    # Create backend instance
    self.backend = Tika() if backend == "tika" else Docling() if backend == "docling" else None
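To make the selection order easier to follow, here is a standalone sketch of the same decision logic. It is not part of txtai. `resolve_backend` is a hypothetical function, and the two boolean flags stand in for `Tika.available()` and `Docling.available()`.

```python
def resolve_backend(backend="available", tika_available=False, docling_available=False):
    """Returns "tika", "docling" or None, mirroring the constructor's selection order."""

    # Lowercase backend parameter, treating empty values as no backend
    backend = backend.lower() if backend else None

    # "available" prefers Tika, then falls back to Docling
    if backend == "available":
        backend = "tika" if tika_available else "docling" if docling_available else None

    # Any other value results in no backend
    return backend if backend in ("tika", "docling") else None


# Tika wins when both backends are available
print(resolve_backend("available", tika_available=True, docling_available=True))
```

As in the constructor, an explicitly requested backend is returned without an availability check, and an unrecognized name yields no backend instead of an error.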

__call__(path)

Converts file at path to HTML. Returns None if no backend is available.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | | input file path | required |

Returns:

| Type | Description |
| --- | --- |
| | html if a backend is available, otherwise returns None |

Source code in txtai/pipeline/data/filetohtml.py
def __call__(self, path):
    """
    Converts file at path to HTML. Returns None if no backend is available.

    Args:
        path: input file path

    Returns:
        html if a backend is available, otherwise returns None
    """

    return self.backend(path) if self.backend else None
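Since the call returns None when no backend is available, callers may want to fail loudly instead of passing None downstream. The wrapper below is one possible sketch. `to_html_or_raise` is a hypothetical helper, and the stand-in callables exist only to make the example run without txtai installed.

```python
def to_html_or_raise(pipeline, path):
    """Runs a File To HTML style callable and raises if no HTML comes back."""

    html = pipeline(path)
    if html is None:
        raise RuntimeError(f"No HTML produced for {path}; check that a backend is installed")

    return html


# Stand-in callables for illustration; in practice pass a FileToHTML instance
def fake_pipeline(path):
    return "<html><body>content</body></html>"


def no_backend_pipeline(path):
    return None


print(to_html_or_raise(fake_pipeline, "/path/to/file"))
```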