Textractor

The Textractor pipeline extracts and splits text from documents. This pipeline extends the Segmentation pipeline.

Each document goes through the following process.

Content is retrieved if it's not local
If the document mime-type isn't plain text or HTML, it's converted to HTML via the FileToHTML pipeline
HTML is converted to Markdown via the HTMLToMarkdown pipeline
Content is split/chunked based on the segmentation parameters and returned

The backend parameter sets the FileToHTML pipeline backend. If a backend isn't available, this pipeline assumes input is HTML content and only converts it to Markdown.
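The four steps above, including the fallback when no backend is set, can be sketched in plain Python. This is only an illustrative outline, not txtai's implementation: every function name here (`retrieve`, `to_html`, `to_markdown`, `split`, `extract`) is hypothetical, the Markdown conversion is a toy regex stand-in for HTMLToMarkdown, and the input is treated as raw content rather than a file path to keep the example self-contained.

```python
import re


def retrieve(source, fetch):
    """Step 1: fetch remote content; anything else passes through unchanged."""
    return fetch(source) if source.startswith(("http://", "https://")) else source


def to_html(content, backend):
    """Step 2 (stand-in): without a backend, input is assumed to already be HTML."""
    return backend(content) if backend else content


def to_markdown(html):
    """Step 3 (toy stand-in): map <h1> to a Markdown heading and <p> to paragraphs."""
    text = re.sub(r"<h1>(.*?)</h1>", r"# \1\n\n", html)
    text = re.sub(r"<p>(.*?)</p>", r"\1\n\n", text)
    # Drop any remaining tags
    return re.sub(r"<[^>]+>", "", text).strip()


def split(markdown, paragraphs):
    """Step 4: chunk on blank lines when paragraphs=True, else return one string."""
    if not paragraphs:
        return markdown
    return [p.strip() for p in markdown.split("\n\n") if p.strip()]


def extract(source, fetch=None, backend=None, paragraphs=False):
    """Chains the four steps in order."""
    content = retrieve(source, fetch)
    return split(to_markdown(to_html(content, backend)), paragraphs)


# No backend is set, so the input is treated as HTML and only converted to Markdown
chunks = extract(
    "<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>",
    paragraphs=True,
)
```

With `paragraphs=True` the result is a list of chunks; with the default it is a single Markdown string, which parallels how the segmentation parameters decide the return shape.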

See the FileToHTML and HTMLToMarkdown pipelines to learn more about the dependencies each of those pipelines requires.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Textractor

# Create and run pipeline
textract = Textractor()
textract("https://github.com/neuml/txtai")

See the links below for more detailed examples.

Notebook	Description
Extract text from documents	Extract text from PDF, Office, HTML and more
Chunking your data for RAG	Extract, chunk and index content for effective retrieval

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
textractor:

# Run pipeline with workflow
workflow:
  textract:
    tasks:
      - action: textractor

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("textract", ["https://github.com/neuml/txtai"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"textract", "elements":["https://github.com/neuml/txtai"]}'

Methods

Python documentation for the pipeline.

`__init__(sentences=False, lines=False, paragraphs=False, minlength=None, join=False, sections=False, cleantext=True, chunker=None, headers=None, backend='available', **kwargs)`

Source code in txtai/pipeline/data/textractor.py

def __init__(
    self,
    sentences=False,
    lines=False,
    paragraphs=False,
    minlength=None,
    join=False,
    sections=False,
    cleantext=True,
    chunker=None,
    headers=None,
    backend="available",
    **kwargs
):
    super().__init__(sentences, lines, paragraphs, minlength, join, sections, cleantext, chunker, **kwargs)

    # Get backend parameter - handle legacy tika flag
    backend = "tika" if "tika" in kwargs and kwargs["tika"] else None if "tika" in kwargs else backend

    # File to HTML pipeline
    self.html = FileToHTML(backend) if backend else None

    # HTML to Markdown pipeline
    self.markdown = HTMLToMarkdown(self.paragraphs, self.sections)

    # HTTP headers
    self.headers = headers if headers else {}
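The chained conditional that handles the legacy tika flag is compact but easy to misread. The sketch below restates it as explicit branches. `resolve_backend` is a hypothetical helper written for illustration and is not part of txtai.

```python
def resolve_backend(backend="available", **kwargs):
    """Restates the backend selection in __init__ as explicit branches."""
    if "tika" in kwargs:
        # Legacy flag present: a truthy value selects Tika,
        # a falsy value disables file-to-HTML conversion entirely
        return "tika" if kwargs["tika"] else None

    # No legacy flag: use the backend argument as given
    return backend


default = resolve_backend()            # backend argument is used as-is
legacy_on = resolve_backend(tika=True)   # legacy flag overrides backend
legacy_off = resolve_backend(tika=False) # legacy flag disables conversion
```

When the resolved value is `None`, `self.html` is `None` as well, which is the case where the pipeline assumes its input is already HTML.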

`__call__(text)`

Segments text into semantic units.

This method accepts text as a string or a list. If the input is a string, the return type is text|list. If the input is a list, a list is returned; depending on the tokenization strategy, this is either a list of text or a list of lists.

Parameters:

Name	Type	Description	Default
`text`		text|list	required

Returns:

Type	Description
	segmented text

Source code in txtai/pipeline/data/segmentation.py

def __call__(self, text):
    """
    Segments text into semantic units.

    This method supports text as a string or a list. If the input is a string, the return
    type is text|list. If text is a list, a list of returned, this could be a
    list of text or a list of lists depending on the tokenization strategy.

    Args:
        text: text|list

    Returns:
        segmented text
    """

    # Get inputs
    texts = [text] if not isinstance(text, list) else text

    # Extract text for each input file
    results = []
    for value in texts:
        # Get text
        value = self.text(value)

        # Parse and add extracted results
        results.append(self.parse(value))

    return results[0] if isinstance(text, str) else results
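The return shape depends on the input type, which the following stand-in makes concrete. `ShapeDemo` is a hypothetical class that copies the control flow of `__call__` above but replaces `text` and `parse` with trivial stubs (the real methods retrieve content and apply the configured segmentation).

```python
class ShapeDemo:
    """Hypothetical stand-in that mirrors the __call__ control flow."""

    def text(self, value):
        # Stub: the real method retrieves and converts the document
        return value.strip()

    def parse(self, value):
        # Stub tokenization: one segment per line
        return value.split("\n")

    def __call__(self, text):
        # Wrap a single input so both cases share one loop
        texts = [text] if not isinstance(text, list) else text
        results = [self.parse(self.text(value)) for value in texts]

        # Unwrap again only when the caller passed a single string
        return results[0] if isinstance(text, str) else results


demo = ShapeDemo()
single = demo("first\nsecond")            # one result for one input
batch = demo(["first\nsecond", "third"])  # one result per input
```

A single-element list still returns a list, so callers that batch inputs can rely on a consistent outer shape.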