Textractor

The Textractor pipeline extracts and splits text from documents. This pipeline uses either an Apache Tika backend (if Java is available) or BeautifulSoup4.

Note: BeautifulSoup4 only supports HTML documents. Any other format requires Tika and Java to be installed.
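The backend choice described above can be sketched roughly as follows. This is an illustrative sketch, not txtai's implementation: `choose_backend` is a hypothetical helper, and it assumes Java availability can be approximated by looking for a `java` executable on the PATH.

```python
import shutil


def choose_backend(tika=True):
    """
    Hypothetical helper that mirrors the documented rule: use Tika when it is
    requested and Java is available, otherwise fall back to BeautifulSoup4.
    """

    # Assumption: a `java` executable on the PATH means Java is available
    java = shutil.which("java") is not None

    return "tika" if tika and java else "beautifulsoup4"


# Disabling Tika always selects the HTML-only backend
print(choose_backend(tika=False))
```

With `tika=False` the result does not depend on whether Java is installed, which matches the `tika` constructor argument shown in the source listing further down.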

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Textractor

# Create and run pipeline
textract = Textractor()
textract("https://github.com/neuml/txtai")

See the link below for a more detailed example.

Notebook: Extract text from documents
Description: Extract text from PDF, Office, HTML and more

Configuration-driven example

Pipelines can be run from Python or from configuration. In configuration, a pipeline is instantiated using the lowercase name of its class. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
textractor:

# Run pipeline with workflow
workflow:
  textract:
    tasks:
      - action: textractor

Run with Workflows

from txtai.app import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("textract", ["https://github.com/neuml/txtai"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"textract", "elements":["https://github.com/neuml/txtai"]}'
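The same request can also be built from Python. The sketch below uses only the standard library and assumes the API is running locally on port 8000, as in the curl command above. `build_workflow_request` is a hypothetical helper name and not part of txtai.

```python
import json
import urllib.request


def build_workflow_request(name, elements, url="http://localhost:8000/workflow"):
    """
    Hypothetical helper that builds the same POST request as the curl command.
    """

    # JSON body with the workflow name and the elements to process
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")

    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_workflow_request("textract", ["https://github.com/neuml/txtai"])

# Sending the request requires the API server to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Building the request separately from sending it keeps the payload easy to inspect before the server is started.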

Methods

Python documentation for the pipeline.

Source code in txtai/pipeline/data/textractor.py
def __init__(self, sentences=False, lines=False, paragraphs=False, minlength=None, join=False, tika=True, sections=False):
    if not TIKA:
        raise ImportError('Textractor pipeline is not available - install "pipeline" extra to enable')

    super().__init__(sentences, lines, paragraphs, minlength, join, sections)

    # Determine if Tika (default if Java is available) or Beautiful Soup should be used
    # Beautiful Soup only supports HTML, Tika supports a wide variety of file formats, including HTML.
    self.tika = self.checkjava() if tika else False

    # HTML to Text extractor
    self.extract = Extract(self.sections)

Segments text into semantic units.

This method supports text as a string or a list. If the input is a string, the return type is text|list. If the input is a list, a list is returned; this could be a list of text or a list of lists depending on the tokenization strategy.

Parameters:

Name: text
Description: text|list
Default: required

Returns:

Description: segmented text
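The return-type rule above can be sketched in isolation. In this simplified sketch, `segment` and its `parse` argument are hypothetical stand-ins for the pipeline's own methods, and whitespace splitting stands in for the real tokenization strategy.

```python
def segment(text, parse=str.split):
    # Wrap a single input so strings and lists share one code path
    texts = text if isinstance(text, list) else [text]

    # Parse each input; str.split is only a placeholder for real segmentation
    results = [parse(value) for value in texts]

    # A string input returns a single result, a list input returns a list of results
    return results[0] if isinstance(text, str) else results


print(segment("extract this text"))
print(segment(["first document", "second"]))
```

A string input yields one parsed result, while a list input yields one parsed result per element. This is why the output can be a list of lists when the parser itself returns a list.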

Source code in txtai/pipeline/data/segmentation.py
def __call__(self, text):
    """
    Segments text into semantic units.

    This method supports text as a string or a list. If the input is a string, the return
    type is text|list. If the input is a list, a list is returned; this could be a
    list of text or a list of lists depending on the tokenization strategy.

    Args:
        text: text|list

    Returns:
        segmented text
    """

    # Get inputs
    texts = [text] if not isinstance(text, list) else text

    # Extract text for each input file
    results = []
    for value in texts:
        # Get text
        value = self.text(value)

        # Parse and add extracted results
        results.append(self.parse(value))

    return results[0] if isinstance(text, str) else results