# Tokenizer

The Tokenizer pipeline splits text into tokens. This is primarily used for keyword / term indexing.
Note: Transformers-based models have their own tokenizers and this pipeline isn't designed for use with Transformers models.
## Example
The following shows a simple example using this pipeline.
```python
from txtai.pipeline import Tokenizer

# Create and run pipeline
tokenizer = Tokenizer()
tokenizer("text to tokenize")

# Whitespace tokenization
tokenizer = Tokenizer(whitespace=True)
tokenizer("text to tokenize")

# Tokenize using a regular expression
tokenizer = Tokenizer(regexp=r"\w{5,}")
tokenizer("text to tokenize")
```
## Configuration-driven example
Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.
### config.yml
```yaml
# Create pipeline using lower case class name
tokenizer:

# Run pipeline with workflow
workflow:
  tokenizer:
    tasks:
      - action: tokenizer
```
### Run with Workflows
```python
from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("tokenizer", ["text to tokenize"]))
```
### Run with API
```bash
CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"tokenizer", "elements":["text"]}'
```
## Methods
Python documentation for the pipeline.
### `__init__(lowercase=True, emoji=True, alphanum=False, stopwords=False, whitespace=False, regexp=None)`
Creates a new tokenizer. The default parameters segment text per Unicode Standard Annex #29.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `lowercase` | | lower cases all tokens if True, defaults to True | `True` |
| `emoji` | | tokenize emoji in text if True, defaults to True | `True` |
| `alphanum` | | requires 2+ character alphanumeric tokens if True, defaults to False | `False` |
| `stopwords` | | removes provided stop words if a list, removes default English stop words if True, defaults to False | `False` |
| `whitespace` | | tokenize on whitespace if True, defaults to False | `False` |
| `regexp` | | tokenize using the provided regular expression, defaults to None | `None` |
Source code in `txtai/pipeline/data/tokenizer.py`
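Building on the parameter table above, the sketch below shows stop word removal and the alphanumeric filter in use. The input sentence is illustrative and the comments describe expected behavior based on the parameter descriptions, not captured output.

```python
from txtai.pipeline import Tokenizer

# Remove default English stop words and keep only 2+ character alphanumeric tokens
tokenizer = Tokenizer(stopwords=True, alphanum=True)
tokenizer("The quick brown fox jumps over the lazy dog")

# Pass a custom list to remove specific stop words instead
tokenizer = Tokenizer(stopwords=["quick", "lazy"])
tokenizer("The quick brown fox jumps over the lazy dog")
```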
### `__call__(text)`
Tokenizes text into a list of tokens.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | | input text | required |

Returns:

| Type | Description |
|---|---|
| | list of tokens |
Source code in `txtai/pipeline/data/tokenizer.py`
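As a minimal sketch of the call signature, the following runs the tokenizer with default settings; the tokens shown in the comment assume the default Unicode segmentation and lowercasing described above.

```python
from txtai.pipeline import Tokenizer

tokenizer = Tokenizer()
tokens = tokenizer("Tokenize THIS text")

# With the defaults, tokens is expected to be ['tokenize', 'this', 'text']
print(tokens)
```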