Segmentation

The Segmentation pipeline segments text into semantic units.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Segmentation

# Create and run pipeline
segment = Segmentation(sentences=True)
segment("This is a test. And another test.")

# Segment text with a Chonkie chunker (word, sentence, semantic, late, etc.)
segment = Segmentation(chunker="semantic")
segment("This is a test. And another test.")

The Segmentation pipeline supports segmenting sentences, lines, paragraphs and sections using a rules-based approach. Each of these modes can be set when creating the pipeline.
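
For instance, a minimal sketch of the paragraph mode (paragraphs is a documented constructor parameter; the input string is illustrative):

from txtai.pipeline import Segmentation

# Split on blank-line separated paragraphs instead of sentences
segment = Segmentation(paragraphs=True)
segment("First paragraph here.\n\nSecond paragraph here.")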

More advanced functionality is supported via a Chonkie chunker. The chunker keyword dynamically creates a Chonkie chunker. For example, chunker='token' creates a TokenChunker, chunker='semantic' creates a SemanticChunker and so forth. Additional keyword arguments are passed to the chunker.
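
As a sketch of passing arguments through to the chunker, the example below forwards chunk_size. Note that chunk_size is assumed to be a Chonkie TokenChunker argument, not a Segmentation parameter:

from txtai.pipeline import Segmentation

# Extra keyword arguments are forwarded to the Chonkie chunker
# chunk_size is assumed to be a TokenChunker parameter
segment = Segmentation(chunker="token", chunk_size=256)
segment("This is a test. And another test.")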

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
segmentation:
  sentences: true

# Run pipeline with workflow
workflow:
  segment:
    tasks:
      - action: segmentation

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("segment", ["This is a test. And another test."]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"segment", "elements":["This is a test. And another test."]}'

Methods

Python documentation for the pipeline.

__init__(sentences=False, lines=False, paragraphs=False, minlength=None, join=False, sections=False, cleantext=True, chunker=None, **kwargs)

Creates a new Segmentation pipeline.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| sentences | tokenize text into sentences if True | False |
| lines | tokenizes text into lines if True | False |
| paragraphs | tokenizes text into paragraphs if True | False |
| minlength | require at least minlength characters per text element | None |
| join | joins tokenized sections back together if True | False |
| sections | tokenizes text into sections if True. Splits using section or page breaks, depending on what's available | False |
| cleantext | apply text cleaning rules | True |
| chunker | creates a chonkie chunker to tokenize text if set | None |
| kwargs | additional keyword arguments | {} |
Source code in txtai/pipeline/data/segmentation.py
def __init__(
    self, sentences=False, lines=False, paragraphs=False, minlength=None, join=False, sections=False, cleantext=True, chunker=None, **kwargs
):
    """
    Creates a new Segmentation pipeline.

    Args:
        sentences: tokenize text into sentences if True, defaults to False
        lines: tokenizes text into lines if True, defaults to False
        paragraphs: tokenizes text into paragraphs if True, defaults to False
        minlength: require at least minlength characters per text element, defaults to None
        join: joins tokenized sections back together if True, defaults to False
        sections: tokenizes text into sections if True, defaults to False. Splits using section or page breaks, depending on what's available
        cleantext: apply text cleaning rules, defaults to True
        chunker: creates a chonkie chunker to tokenize text if set, defaults to None
        kwargs: additional keyword arguments
    """

    if not NLTK and sentences:
        raise ImportError('NLTK is not available - install "pipeline" extra to enable')

    if not CHONKIE and chunker:
        raise ImportError('Chonkie is not available - install "pipeline" extra to enable')

    self.sentences = sentences
    self.lines = lines
    self.paragraphs = paragraphs
    self.sections = sections
    self.minlength = minlength
    self.join = join
    self.cleantext = cleantext

    # Create a chonkie chunker, if applicable
    self.chunker = self.createchunker(chunker, **kwargs) if chunker else None
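
As a sketch, the documented parameters can be combined; for example, filtering out short units or re-joining segmented text (behavior inferred from the parameter descriptions above):

from txtai.pipeline import Segmentation

# Drop sentences shorter than 10 characters
segment = Segmentation(sentences=True, minlength=10)
segment("Hi. This sentence is long enough to keep.")

# Segment, then join the units back into a single cleaned string
segment = Segmentation(sentences=True, join=True)
segment("This is a test. And another test.")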

__call__(text)

Segments text into semantic units.

This method supports text as a string or a list. If the input is a string, the return type is text|list. If the input is a list, a list is returned; this could be a list of text or a list of lists depending on the tokenization strategy.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| text | text\|list | required |

Returns:

segmented text

Source code in txtai/pipeline/data/segmentation.py
def __call__(self, text):
    """
    Segments text into semantic units.

    This method supports text as a string or a list. If the input is a string, the return
    type is text|list. If text is a list, a list is returned; this could be a
    list of text or a list of lists depending on the tokenization strategy.

    Args:
        text: text|list

    Returns:
        segmented text
    """

    # Get inputs
    texts = [text] if not isinstance(text, list) else text

    # Extract text for each input file
    results = []
    for value in texts:
        # Get text
        value = self.text(value)

        # Parse and add extracted results
        results.append(self.parse(value))

    return results[0] if isinstance(text, str) else results
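
A short sketch of the string vs list behavior described above (output shapes inferred from the docstring):

from txtai.pipeline import Segmentation

segment = Segmentation(sentences=True)

# String input returns a single segmented result
segment("This is a test. And another test.")

# List input returns one result per element - a list of lists here
segment(["This is a test. And another test.", "More text here. And more."])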