Tokenizer

The Tokenizer pipeline splits text into tokens. This is primarily used for keyword / term indexing.

Note: Transformers-based models have their own tokenizers, so this pipeline isn't designed to work with Transformers models.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Tokenizer

# Create and run pipeline
tokenizer = Tokenizer()
tokenizer("text to tokenize")

# Whitespace tokenization
tokenizer = Tokenizer(whitespace=True)
tokenizer("text to tokenize")

# Tokenize using a regular expression
tokenizer = Tokenizer(regexp=r"\w{5,}")
tokenizer("text to tokenize")

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
tokenizer:

# Run pipeline with workflow
workflow:
  tokenizer:
    tasks:
      - action: tokenizer

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("tokenizer", ["text to tokenize"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"tokenizer", "elements":["text"]}'

Methods

Python documentation for the pipeline.

__init__(lowercase=True, emoji=True, alphanum=False, stopwords=False, whitespace=False, regexp=None)

Creates a new tokenizer. The default parameters segment text per Unicode Standard Annex #29.

Parameters:

lowercase: lower cases all tokens if True, defaults to True
emoji: tokenize emoji in text if True, defaults to True
alphanum: requires 2+ character alphanumeric tokens if True, defaults to False
stopwords: removes provided stop words if a list, removes default English stop words if True, defaults to False
whitespace: tokenize on whitespace if True, defaults to False
regexp: tokenize using the provided regular expression, defaults to None

Source code in txtai/pipeline/data/tokenizer.py
def __init__(self, lowercase=True, emoji=True, alphanum=False, stopwords=False, whitespace=False, regexp=None):
    """
    Creates a new tokenizer. The default parameters segment text per Unicode Standard Annex #29.

    Args:
        lowercase: lower cases all tokens if True, defaults to True
        emoji: tokenize emoji in text if True, defaults to True
        alphanum: requires 2+ character alphanumeric tokens if True, defaults to False
        stopwords: removes provided stop words if a list, removes default English stop words if True, defaults to False
        whitespace: tokenize on whitespace if True, defaults to False
        regexp: tokenize using the provided regular expression, defaults to None
    """

    # Lowercase
    self.lowercase = lowercase

    # Text segmentation
    self.alphanum, self.whitespace, self.regexp, self.segment = None, whitespace, None, None
    if alphanum:
        # Alphanumeric regex that accepts tokens that meet following rules:
        #  - Strings to be at least 2 characters long AND
        #  - At least 1 non-trailing alpha character in string
        # Note: The standard Python re module is much faster than regex for this expression
        self.alphanum = re.compile(r"^\d*[a-z][\-.0-9:_a-z]{1,}$")
    elif regexp:
        # Regular expression for tokenization
        self.regexp = regex.compile(regexp)
    else:
        # Text segmentation per Unicode Standard Annex #29
        pattern = r"\w\p{Extended_Pictographic}\p{WB:RegionalIndicator}" if emoji else r"\w"
        self.segment = regex.compile(rf"[{pattern}](?:\B\S)*", flags=regex.WORD)

    # Stop words
    self.stopwords = stopwords if isinstance(stopwords, list) else Tokenizer.STOP_WORDS if stopwords else False
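
The stopwords and alphanum parameters aren't covered by the example at the top of this page. The following is a minimal sketch of how they behave; the outputs in the comments are illustrative expectations, not captured output.

from txtai.pipeline import Tokenizer

# Remove default English stop words
tokenizer = Tokenizer(stopwords=True)
tokenizer("text to tokenize")   # expected: ["text", "tokenize"]

# Remove a custom stop word list
tokenizer = Tokenizer(stopwords=["text"])
tokenizer("text to tokenize")   # expected: ["to", "tokenize"]

# Only keep 2+ character alphanumeric tokens with a non-trailing alpha character
tokenizer = Tokenizer(alphanum=True)
tokenizer("a b2b x1")           # expected: ["b2b", "x1"]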

__call__(text)

Tokenizes text into a list of tokens.

Parameters:

text: input text, required

Returns:

list of tokens

Source code in txtai/pipeline/data/tokenizer.py
def __call__(self, text):
    """
    Tokenizes text into a list of tokens.

    Args:
        text: input text

    Returns:
        list of tokens
    """

    # Check for None and skip processing
    if text is None:
        return None

    # Lowercase
    text = text.lower() if self.lowercase else text

    if self.alphanum:
        # Text segmentation using standard split
        tokens = [token.strip(string.punctuation) for token in text.split()]

        # Filter on alphanumeric strings.
        tokens = [token for token in tokens if re.match(self.alphanum, token)]
    elif self.whitespace:
        # Text segmentation using whitespace
        tokens = text.split()
    elif self.regexp:
        # Text segmentation using a custom regular expression
        tokens = regex.findall(self.regexp, text)
    else:
        # Text segmentation per Unicode Standard Annex #29
        tokens = regex.findall(self.segment, text)

    # Stop words
    if self.stopwords:
        tokens = [token for token in tokens if token not in self.stopwords]

    return tokens
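
A brief usage sketch of __call__ behavior: None inputs pass through unchanged, all other inputs are tokenized per the constructor settings. The outputs shown are illustrative expectations.

from txtai.pipeline import Tokenizer

tokenizer = Tokenizer()

# None is returned as-is
tokenizer(None)                 # None

# Lowercasing is applied by default before tokenization
tokenizer("Text to TOKENIZE")   # expected: ["text", "to", "tokenize"]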