RAG

The RAG pipeline (aka Extractor) joins a prompt, context data store and generative model together to extract knowledge.

The data store can be an embeddings database or a similarity instance with associated input text. The generative model can be a prompt-driven large language model (LLM), an extractive question-answering model or a custom pipeline. This is known as retrieval augmented generation (RAG).
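For instance, a similarity pipeline paired with an extractive question-answering model is one such combination. The sketch below is illustrative only (the model paths are common Hugging Face choices, not requirements); context texts are passed directly at query time.

from txtai import RAG
from txtai.pipeline import Similarity

# Similarity pipeline ranks candidate texts passed in at query time
similarity = Similarity("valhalla/distilbart-mnli-12-3")

# Extractive question-answering model instead of a prompt-driven LLM
rag = RAG(similarity, "distilbert-base-cased-distilled-squad")

# (name, query, question, snippet) tuple with context texts supplied directly
rag([("answer", "What was won?", "What was won?", False)],
    texts=["Maine man wins $1M from $25 lottery ticket"])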

Example

The following shows a simple example using this pipeline.

from txtai import Embeddings, RAG

# Input data
data = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, " +
  "forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends " +
  "in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
]

# Build embeddings index
embeddings = Embeddings(content=True)
embeddings.index(data)

# Create and run pipeline
rag = RAG(embeddings, "google/flan-t5-base", template="""
  Answer the following question using the provided context.

  Question:
  {question}

  Context:
  {context}
""")

rag("What was won?")

See the Embeddings and LLM pages for additional configuration options.
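A minimal sketch of two such options, assuming the index built above: output="flatten" returns plain answer strings, and call-time keyword arguments (such as maxlength) are passed through to the underlying LLM.

# Return plain answer strings instead of (name, answer) results
rag = RAG(embeddings, "google/flan-t5-base", output="flatten", template="""
  Answer the following question using the provided context.

  Question:
  {question}

  Context:
  {context}
""")

# Keyword arguments are passed through to the underlying LLM
rag("What was won?", maxlength=512)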

See the links below for more detailed examples.

Notebook | Description
Prompt-driven search with LLMs | Embeddings-guided and Prompt-driven search with Large Language Models (LLMs)
Prompt templates and task chains | Build model prompts and connect tasks together with workflows
Build RAG pipelines with txtai | Guide on retrieval augmented generation including how to create citations
Integrate LLM frameworks | Integrate llama.cpp, LiteLLM and custom generation frameworks
Generate knowledge with Semantic Graphs and RAG | Knowledge exploration and discovery with Semantic Graphs and RAG
Build knowledge graphs with LLMs | Build knowledge graphs with LLM-driven entity extraction
Advanced RAG with graph path traversal | Graph path traversal to collect complex sets of data for advanced RAG
Advanced RAG with guided generation | Retrieval Augmented and Guided Generation
RAG with llama.cpp and external API services | RAG with additional vector and LLM frameworks
How RAG with txtai works | Create RAG processes, API services and Docker instances
Speech to Speech RAG ▶️ | Full cycle speech to speech workflow with RAG
Generative Audio | Storytelling with generative audio workflows
Extractive QA with txtai | Introduction to extractive question-answering with txtai
Extractive QA with Elasticsearch | Run extractive question-answering queries with Elasticsearch
Extractive QA to build structured data | Build structured datasets using extractive question-answering

Configuration-driven example

Pipelines can be run with Python or configuration. In configuration, a pipeline is declared under the lower case name of its class. Configuration-driven pipelines are then run with workflows or the API.

config.yml

# Allow documents to be indexed
writable: True

# Content is required for extractor pipeline
embeddings:
  content: True

rag:
  path: google/flan-t5-base
  template: |
    Answer the following question using the provided context.

    Question:
    {question}

    Context:
    {context}

workflow:
  search:
    tasks:
      - action: rag

Run with Workflows

Built-in tasks make using the RAG pipeline easier.

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
app.add([
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, " +
  "forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends " +
  "in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
])
app.index()

list(app.workflow("search", ["What was won?"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name": "search", "elements": ["What was won"]}'

Methods

Python documentation for the pipeline.

__init__(similarity, path, quantize=False, gpu=True, model=None, tokenizer=None, minscore=None, mintokens=None, context=None, task=None, output='default', template=None, separator=' ', system=None, **kwargs)

Builds a new RAG pipeline.

Parameters:

Name | Description | Default
similarity | similarity instance (embeddings or similarity pipeline) | required
path | path to model, supports a LLM, Questions or custom pipeline | required
quantize | True if model should be quantized before inference, False otherwise | False
gpu | if gpu inference should be used (only works if GPUs are available) | True
model | optional existing pipeline model to wrap | None
tokenizer | Tokenizer class | None
minscore | minimum score to include context match | None
mintokens | minimum number of tokens to include context match | None
context | topn context matches to include, defaults to 3 | None
task | model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect | None
output | output format, 'default' returns (name, answer), 'flatten' returns answers and 'reference' returns (name, answer, reference) | 'default'
template | prompt template, it must have a parameter for {question} and {context}, defaults to "{question} {context}" | None
separator | context separator | ' '
system | system prompt | None
kwargs | additional keyword arguments to pass to pipeline model | {}
Source code in txtai/pipeline/llm/rag.py
def __init__(
    self,
    similarity,
    path,
    quantize=False,
    gpu=True,
    model=None,
    tokenizer=None,
    minscore=None,
    mintokens=None,
    context=None,
    task=None,
    output="default",
    template=None,
    separator=" ",
    system=None,
    **kwargs,
):
    """
    Builds a new RAG pipeline.

    Args:
        similarity: similarity instance (embeddings or similarity pipeline)
        path: path to model, supports a LLM, Questions or custom pipeline
        quantize: True if model should be quantized before inference, False otherwise.
        gpu: if gpu inference should be used (only works if GPUs are available)
        model: optional existing pipeline model to wrap
        tokenizer: Tokenizer class
        minscore: minimum score to include context match, defaults to None
        mintokens: minimum number of tokens to include context match, defaults to None
        context: topn context matches to include, defaults to 3
        task: model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect
        output: output format, 'default' returns (name, answer), 'flatten' returns answers and 'reference' returns (name, answer, reference)
        template: prompt template, it must have a parameter for {question} and {context}, defaults to "{question} {context}"
        separator: context separator
        system: system prompt, defaults to None
        kwargs: additional keyword arguments to pass to pipeline model
    """

    # Similarity instance
    self.similarity = similarity

    # Model can be a LLM, Questions or custom pipeline
    self.model = self.load(path, quantize, gpu, model, task, **kwargs)

    # Tokenizer class use default method if not set
    self.tokenizer = tokenizer if tokenizer else Tokenizer() if hasattr(self.similarity, "scoring") and self.similarity.isweighted() else None

    # Minimum score to include context match
    self.minscore = minscore if minscore is not None else 0.0

    # Minimum number of tokens to include context match
    self.mintokens = mintokens if mintokens is not None else 0.0

    # Top n context matches to include for context
    self.context = context if context else 3

    # Output format
    self.output = output

    # Prompt template
    self.template = template if template else "{question} {context}"

    # Context separator
    self.separator = separator

    # System prompt template
    self.system = system
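The output and context parameters above can be combined to return citations alongside answers. A minimal sketch, assuming the embeddings index built in the first example; with output="reference", each result additionally carries a reference to the best matching context record, per the parameter table above.

# Return results with a reference to the context record used and limit the
# context to the single best match
rag = RAG(embeddings, "google/flan-t5-base", output="reference", context=1,
          template="Answer the following question using the provided context. "
                   "Question: {question} Context: {context}")

# The reference can be resolved back to the indexed text for a citation
rag("What was won?")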

__call__(queue, texts=None, **kwargs)

Finds answers to input questions. This method runs queries to find the top n best matches and uses that as the context. A model is then run against the context for each input question, with the answer returned.

Parameters:

Name | Description | Default
queue | input question queue (name, query, question, snippet), can be list of tuples/dicts/strings or a single input element | required
texts | optional list of text for context, otherwise runs embeddings search | None
kwargs | additional keyword arguments to pass to pipeline model | {}

Returns:

list of answers matching input format (tuple or dict) containing fields as specified by output format

Source code in txtai/pipeline/llm/rag.py
def __call__(self, queue, texts=None, **kwargs):
    """
    Finds answers to input questions. This method runs queries to find the top n best matches and uses that as the context.
    A model is then run against the context for each input question, with the answer returned.

    Args:
        queue: input question queue (name, query, question, snippet), can be list of tuples/dicts/strings or a single input element
        texts: optional list of text for context, otherwise runs embeddings search
        kwargs: additional keyword arguments to pass to pipeline model

    Returns:
        list of answers matching input format (tuple or dict) containing fields as specified by output format
    """

    # Save original queue format
    inputs = queue

    # Convert queue to list, if necessary
    queue = queue if isinstance(queue, list) else [queue]

    # Convert dictionary inputs to tuples
    if queue and isinstance(queue[0], dict):
        # Convert dict to tuple
        queue = [tuple(row.get(x) for x in ["name", "query", "question", "snippet"]) for row in queue]

    if queue and isinstance(queue[0], str):
        # Convert string questions to tuple
        queue = [(None, row, row, None) for row in queue]

    # Rank texts by similarity for each query
    results = self.query([query for _, query, _, _ in queue], texts)

    # Build question-context pairs
    names, queries, questions, contexts, topns, snippets = [], [], [], [], [], []
    for x, (name, query, question, snippet) in enumerate(queue):
        # Get top n best matching segments
        topn = sorted(results[x], key=lambda y: y[2], reverse=True)[: self.context]

        # Generate context using ordering from texts, if available, otherwise order by score
        context = self.separator.join(text for _, text, _ in (sorted(topn, key=lambda y: y[0]) if texts else topn))

        names.append(name)
        queries.append(query)
        questions.append(question)
        contexts.append(context)
        topns.append(topn)
        snippets.append(snippet)

    # Run pipeline and return answers
    answers = self.answers(questions, contexts, **kwargs)

    # Apply output formatting to answers and return
    return self.apply(inputs, names, queries, answers, topns, snippets) if isinstance(answers, list) else answers
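A hedged sketch of the input formats accepted by __call__, reusing the rag pipeline and data list from the first example (names and query strings are illustrative only).

# Single string: used as both the context query and the question
rag("What was won?")

# Dict input: query drives the context search, question is asked of the model
# and name labels the result
rag([{"name": "win", "query": "lottery ticket", "question": "What was won?"}])

# Tuple input with texts: ranks the supplied texts with the similarity
# instance instead of running an embeddings search
rag([("win", "lottery ticket", "What was won?", None)], texts=data)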