Translation

The Translation pipeline translates text between languages. It supports more than 100 languages, with automatic source language detection built in. For each input text row, the pipeline detects the language, loads a model for the source-target combination and translates the text to the target language.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Translation

# Create and run pipeline
translate = Translation()
translate("This is a test translation into Spanish", "es")

See the link below for a more detailed example.

| Notebook | Description |
|---|---|
| Translate text between languages | Streamline machine translation and language detection |

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
translation:

# Run pipeline with workflow
workflow:
  translate:
    tasks:
      - action: translation
        args: ["es"]

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("translate", ["This is a test translation into Spanish"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"translate", "elements":["This is a test translation into Spanish"]}'
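The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the API server started above is listening on localhost:8000. `build_request` and `run_workflow` are hypothetical helper names, not part of txtai.

```python
import json
import urllib.request

# Assumed endpoint, taken from the curl example above
API_URL = "http://localhost:8000/workflow"


def build_request(texts, name="translate", url=API_URL):
    """Builds a POST request equivalent to the curl command (hypothetical helper)."""
    payload = json.dumps({"name": name, "elements": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_workflow(texts):
    """Sends the request and decodes the JSON response. Requires a running server."""
    with urllib.request.urlopen(build_request(texts)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `run_workflow(["This is a test translation into Spanish"])` would submit the same workflow call as the curl command.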

Methods

Python documentation for the pipeline.

__init__(path=None, quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True)

Constructs a new language translation pipeline.

Parameters:

| Name | Description | Default |
|---|---|---|
| path | optional path to model, accepts Hugging Face model hub id or local path, uses default model for task if not provided | None |
| quantize | if model should be quantized, defaults to False | False |
| gpu | True/False if GPU should be enabled, also supports a GPU device id | True |
| batch | batch size used to incrementally process content | 64 |
| langdetect | set a custom language detection function, method must take a list of strings and return language codes for each, uses default language detector if not provided | None |
| findmodels | True/False if the Hugging Face Hub will be searched for source-target translation models | True |
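
The langdetect parameter accepts any callable that maps a list of strings to a list of language codes, one per string. The sketch below is a deliberately naive detector based on Unicode character names, offered only to illustrate the expected signature. `detect_by_script` and its script-to-language table are hypothetical, and a real detector would be far more accurate.

```python
import unicodedata

# Hypothetical mapping from Unicode script keywords to language codes
SCRIPTS = {"CYRILLIC": "ru", "HIRAGANA": "ja", "KATAKANA": "ja", "HANGUL": "ko", "ARABIC": "ar"}


def detect_by_script(texts, default="en"):
    """Returns one language code per input text, based on the first recognized script."""
    codes = []
    for text in texts:
        code = default
        for char in text:
            # Character names start with the script, e.g. "CYRILLIC SMALL LETTER A"
            script = unicodedata.name(char, "").split(" ")[0]
            if script in SCRIPTS:
                code = SCRIPTS[script]
                break
        codes.append(code)
    return codes
```

Assuming txtai is installed, a function like this would be passed to the constructor as `Translation(langdetect=detect_by_script)`.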
Source code in txtai/pipeline/text/translation.py
def __init__(self, path=None, quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True):
    """
    Constructs a new language translation pipeline.

    Args:
        path: optional path to model, accepts Hugging Face model hub id or local path,
              uses default model for task if not provided
        quantize: if model should be quantized, defaults to False
        gpu: True/False if GPU should be enabled, also supports a GPU device id
        batch: batch size used to incrementally process content
        langdetect: set a custom language detection function, method must take a list of strings and return
                    language codes for each, uses default language detector if not provided
        findmodels: True/False if the Hugging Face Hub will be searched for source-target translation models
    """

    # Call parent constructor
    super().__init__(path if path else "facebook/m2m100_418M", quantize, gpu, batch)

    # Language detection
    self.detector = None
    self.langdetect = langdetect
    self.findmodels = findmodels

    # Language models
    self.models = {}
    self.ids = self.modelids()

__call__(texts, target='en', source=None, showmodels=False)

Translates text from source language into target language.

This method supports texts as a string or a list. If the input is a string, the return type is string. If text is a list, the return type is a list.

Parameters:

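| Name | Description | Default |
|---|---|---|
| texts | text to translate, as a string or a list of strings | required |
| target | target language code, defaults to "en" | 'en' |
| source | source language code, detects language if not provided | None |
| showmodels | if True, each result is returned as a (translated text, source language, model) tuple instead of only the translated text | False |

Returns:

Translated text: a string if texts is a string, otherwise a list of translated text.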

Source code in txtai/pipeline/text/translation.py
def __call__(self, texts, target="en", source=None, showmodels=False):
    """
    Translates text from source language into target language.

    This method supports texts as a string or a list. If the input is a string,
    the return type is string. If text is a list, the return type is a list.

    Args:
        texts: text|list
        target: target language code, defaults to "en"
        source: source language code, detects language if not provided

    Returns:
        list of translated text
    """

    values = [texts] if not isinstance(texts, list) else texts

    # Detect source languages
    languages = self.detect(values) if not source else [source] * len(values)
    unique = set(languages)

    # Build a dict from language to list of (index, text)
    langdict = {}
    for x, lang in enumerate(languages):
        if lang not in langdict:
            langdict[lang] = []
        langdict[lang].append((x, values[x]))

    results = {}
    for language in unique:
        # Get all indices and text values for a language
        inputs = langdict[language]

        # Translate text in batches
        outputs = []
        for chunk in self.batch([text for _, text in inputs], self.batchsize):
            outputs.extend(self.translate(chunk, language, target, showmodels))

        # Store output value
        for y, (x, _) in enumerate(inputs):
            if showmodels:
                model, op = outputs[y]
                results[x] = (op.strip(), language, model)
            else:
                results[x] = outputs[y].strip()

    # Return results in same order as input
    results = [results[x] for x in sorted(results)]
    return results[0] if isinstance(texts, str) else results
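
The core of `__call__` is a group, translate and reorder pattern: texts are grouped by source language, each group is translated in batches, and outputs are written back by original index. The following standalone sketch reproduces that flow without txtai. `fake_detect`, `fake_translate` and `translate_all` are hypothetical stand-ins for the pipeline's detection and model calls, not library functions.

```python
def fake_detect(texts):
    """Stand-in detector: treats text containing 'hola' as Spanish, everything else as English."""
    return ["es" if "hola" in text.lower() else "en" for text in texts]


def fake_translate(texts, source, target):
    """Stand-in translator: tags each text with its language pair instead of translating."""
    return [f"[{source}->{target}] {text}" for text in texts]


def translate_all(texts, target="en", source=None, batchsize=2):
    values = [texts] if not isinstance(texts, list) else texts

    # Use the given source language for every row, or detect one per row
    languages = [source] * len(values) if source else fake_detect(values)

    # Group (index, text) pairs by source language
    groups = {}
    for index, language in enumerate(languages):
        groups.setdefault(language, []).append((index, values[index]))

    results = {}
    for language, rows in groups.items():
        batch = [text for _, text in rows]

        # Translate each group in fixed-size chunks
        outputs = []
        for start in range(0, len(batch), batchsize):
            outputs.extend(fake_translate(batch[start:start + batchsize], language, target))

        # Write each output back to its original position
        for (index, _), output in zip(rows, outputs):
            results[index] = output.strip()

    # Return results in input order, unwrapping single string inputs
    ordered = [results[index] for index in sorted(results)]
    return ordered[0] if isinstance(texts, str) else ordered
```

Keying results by original index is what lets mixed-language input come back in the order it was given, even though each language group is processed separately.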