Extractor

The Extractor pipeline is a combination of a similarity instance (embeddings or similarity pipeline) to build a question context and a model that answers questions. The model can be a prompt-driven large language model (LLM), an extractive question-answering model or a custom pipeline.

Example

The following shows a simple example using this pipeline.

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Embeddings model ranks candidates before passing to QA pipeline
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create and run pipeline
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad")
extractor([("What was won", "What was won", "What was won", False)],
          ["Maine man wins $1M from $25 lottery ticket"])

See the links below for more detailed examples.

Notebook Description
Extractive QA with txtai: Introduction to extractive question-answering with txtai
Extractive QA with Elasticsearch: Run extractive question-answering queries with Elasticsearch
Extractive QA to build structured data: Build structured datasets using extractive question-answering
Prompt-driven search with LLMs: Embeddings-guided and prompt-driven search with Large Language Models (LLMs)

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Embeddings index used to rank candidate context
embeddings:
  path: sentence-transformers/nli-mpnet-base-v2

# Create pipeline using lower case class name
extractor:
  path: distilbert-base-cased-distilled-squad

Run with Workflows

from txtai.app import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.extract([{"name": "What was won", "query": "What was won",
                   "question": "What was won", "snippet": False}],
                 ["Maine man wins $1M from $25 lottery ticket"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/extract" \
  -H "Content-Type: application/json" \
  -d '{"queue": [{"name":"What was won", "query": "What was won", "question": "What was won", "snippet": false}], "texts": ["Maine man wins $1M from $25 lottery ticket"]}'

Methods

Python documentation for the pipeline.

__init__(self, similarity, path, quantize=False, gpu=True, model=None, tokenizer=None, minscore=None, mintokens=None, context=None, task=None) special

Builds a new extractor.

Parameters:

Name Description Default
similarity similarity instance (embeddings or similarity pipeline) required
path path to model, supports Questions, Generator, Sequences or custom pipeline required
quantize True if model should be quantized before inference, False otherwise False
gpu if gpu inference should be used (only works if GPUs are available) True
model optional existing pipeline model to wrap None
tokenizer Tokenizer class None
minscore minimum score to include context match, defaults to None None
mintokens minimum number of tokens to include context match, defaults to None None
context topn context matches to include, defaults to 3 None
task model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect None
Source code in txtai/pipeline/text/extractor.py
def __init__(
    self, similarity, path, quantize=False, gpu=True, model=None, tokenizer=None, minscore=None, mintokens=None, context=None, task=None
):
    """
    Builds a new extractor.

    Args:
        similarity: similarity instance (embeddings or similarity pipeline)
        path: path to model, supports Questions, Generator, Sequences or custom pipeline
        quantize: True if model should be quantized before inference, False otherwise.
        gpu: if gpu inference should be used (only works if GPUs are available)
        model: optional existing pipeline model to wrap
        tokenizer: Tokenizer class
        minscore: minimum score to include context match, defaults to None
        mintokens: minimum number of tokens to include context match, defaults to None
        context: topn context matches to include, defaults to 3
        task: model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect
    """

    # Similarity instance
    self.similarity = similarity

    # Question-Answer model. Can be prompt-driven LLM or extractive qa
    self.model = self.load(path, quantize, gpu, model, task)

    # Tokenizer class use default method if not set
    self.tokenizer = tokenizer if tokenizer else Tokenizer() if hasattr(self.similarity, "scoring") and self.similarity.scoring else None

    # Minimum score to include context match
    self.minscore = minscore if minscore is not None else 0.0

    # Minimum number of tokens to include context match
    self.mintokens = mintokens if mintokens is not None else 0.0

    # Top n context matches to include for context
    self.context = context if context else 3
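
The optional parameters tune how context is built and which model task runs. A sketch of a constructor call that sets a few of them; the values here are illustrative, not recommendations:

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Keep the top 5 context matches, drop matches scoring below 0.2 and
# run the extractive question-answering task instead of auto-detecting it
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad",
                      minscore=0.2, context=5, task="question-answering")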

__call__(self, queue, texts=None) special

Finds answers to input questions. This method runs queries to find the top n best matches and uses those as the context. A model is then run against the context for each input question, with the answer returned.

Parameters:

Name Description Default
queue input question queue (name, query, question, snippet) required
texts optional list of text for context, otherwise runs embeddings search None

Returns:

list of (name, answer)

Source code in txtai/pipeline/text/extractor.py
def __call__(self, queue, texts=None):
    """
    Finds answers to input questions. This method runs queries to find the top n best matches and uses those as the context.
    A model is then run against the context for each input question, with the answer returned.

    Args:
        queue: input question queue (name, query, question, snippet)
        texts: optional list of text for context, otherwise runs embeddings search

    Returns:
        list of (name, answer)
    """

    # Rank texts by similarity for each query
    results = self.query([query for _, query, _, _ in queue], texts)

    # Build question-context pairs
    names, questions, contexts, topns, snippets = [], [], [], [], []
    for x, (name, _, question, snippet) in enumerate(queue):
        # Build context using top n best matching segments
        topn = sorted(results[x], key=lambda y: y[2], reverse=True)[: self.context]
        context = " ".join([text for _, text, _ in sorted(topn, key=lambda y: y[0])])

        names.append(name)
        questions.append(question)
        contexts.append(context)
        topns.append([text for _, text, _ in topn])
        snippets.append(snippet)

    # Run pipeline and return answers
    return self.answers(names, questions, contexts, topns, snippets)
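
A sketch of a call with multiple questions against the same text. The names are arbitrary labels used to match answers back to questions, the query ranks candidate text and the question is what the model answers:

queue = [
    ("winnings", "lottery winnings", "What was won", False),
    ("amount", "winning amount", "How much was won", False),
]

# Returns one (name, answer) tuple per queue entry
extractor(queue, ["Maine man wins $1M from $25 lottery ticket"])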