Tokenizer

The Tokenizer pipeline splits text into tokens. This is primarily used for keyword / term indexing.

Note: Transformers-based models have their own tokenizers, so this pipeline isn't designed to work with Transformers models.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Tokenizer

# Create and run pipeline
tokenizer = Tokenizer()
tokenizer("text to tokenize")

# Whitespace tokenization
tokenizer = Tokenizer(whitespace=True)
tokenizer("text to tokenize")

# Tokenize using a regular expression
tokenizer = Tokenizer(regexp=r"\w{5,}")
tokenizer("text to tokenize")

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
tokenizer:

# Run pipeline with workflow
workflow:
  tokenizer:
    tasks:
      - action: tokenizer

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("tokenizer", ["text to tokenize"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"tokenizer", "elements":["text"]}'

Methods

Python documentation for the pipeline.

__init__(lowercase=True, emoji=True, alphanum=False, stopwords=False, whitespace=False, regexp=None)

Creates a new tokenizer. The default parameters segment text per Unicode Standard Annex #29.

Parameters:

lowercase: lower cases all tokens if True, defaults to True
emoji: tokenize emoji in text if True, defaults to True
alphanum: requires 2+ character alphanumeric tokens if True, defaults to False
stopwords: removes provided stop words if a list, removes default English stop words if True, defaults to False
whitespace: tokenize on whitespace if True, defaults to False
regexp: tokenize using the provided regular expression, defaults to None

Source code in txtai/pipeline/data/tokenizer.py
def __init__(self, lowercase=True, emoji=True, alphanum=False, stopwords=False, whitespace=False, regexp=None):
    """
    Creates a new tokenizer. The default parameters segment text per Unicode Standard Annex #29.

    Args:
        lowercase: lower cases all tokens if True, defaults to True
        emoji: tokenize emoji in text if True, defaults to True
        alphanum: requires 2+ character alphanumeric tokens if True, defaults to False
        stopwords: removes provided stop words if a list, removes default English stop words if True, defaults to False
        whitespace: tokenize on whitespace if True, defaults to False
        regexp: tokenize using the provided regular expression, defaults to None
    """

    # Lowercase
    self.lowercase = lowercase

    # Text segmentation
    self.alphanum, self.whitespace, self.regexp, self.segment = None, whitespace, None, None
    if alphanum:
        # Alphanumeric regex that accepts tokens that meet following rules:
        #  - Strings to be at least 2 characters long AND
        #  - At least 1 non-trailing alpha character in string
        # Note: The standard Python re module is much faster than regex for this expression
        self.alphanum = re.compile(r"^\d*[a-z][\-.0-9:_a-z]{1,}$")
    elif regexp:
        # Regular expression for tokenization
        self.regexp = regex.compile(regexp)
    else:
        # Text segmentation per Unicode Standard Annex #29
        pattern = r"\w\p{Extended_Pictographic}\p{WB:RegionalIndicator}" if emoji else r"\w"
        self.segment = regex.compile(rf"[{pattern}](?:\B\S)*", flags=regex.WORD)

    # Stop words
    self.stopwords = stopwords if isinstance(stopwords, list) else Tokenizer.STOP_WORDS if stopwords else False
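
The stopwords and alphanum parameters aren't covered by the example at the top of this page. The following is a minimal sketch of how they behave; the outputs in the comments are illustrative expectations, not captured output.

from txtai.pipeline import Tokenizer

# Remove default English stop words
tokenizer = Tokenizer(stopwords=True)
tokenizer("text to tokenize")   # expected: ["text", "tokenize"]

# Remove a custom stop word list
tokenizer = Tokenizer(stopwords=["text"])
tokenizer("text to tokenize")   # expected: ["to", "tokenize"]

# Only keep 2+ character alphanumeric tokens with a non-trailing alpha character
tokenizer = Tokenizer(alphanum=True)
tokenizer("a b2b x1")           # expected: ["b2b", "x1"]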

__call__(text)

Tokenizes text into a list of tokens.

Parameters:

text: input text, required

Returns:

list of tokens

Source code in txtai/pipeline/data/tokenizer.py
def __call__(self, text):
    """
    Tokenizes text into a list of tokens.

    Args:
        text: input text

    Returns:
        list of tokens
    """

    # Check for None and skip processing
    if text is None:
        return None

    # Lowercase
    text = text.lower() if self.lowercase else text

    if self.alphanum:
        # Text segmentation using standard split
        tokens = [token.strip(string.punctuation) for token in text.split()]

        # Filter on alphanumeric strings.
        tokens = [token for token in tokens if re.match(self.alphanum, token)]
    elif self.whitespace:
        # Text segmentation using whitespace
        tokens = text.split()
    elif self.regexp:
        # Text segmentation using a custom regular expression
        tokens = regex.findall(self.regexp, text)
    else:
        # Text segmentation per Unicode Standard Annex #29
        tokens = regex.findall(self.segment, text)

    # Stop words
    if self.stopwords:
        tokens = [token for token in tokens if token not in self.stopwords]

    return tokens
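
A brief usage sketch of __call__ behavior: None inputs pass through unchanged, all other inputs are tokenized per the constructor settings. The outputs shown are illustrative expectations.

from txtai.pipeline import Tokenizer

tokenizer = Tokenizer()

# None is returned as-is
tokenizer(None)                 # None

# Lowercasing is applied by default before tokenization
tokenizer("Text to TOKENIZE")   # expected: ["text", "to", "tokenize"]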