Similarity

The Similarity pipeline computes similarity between queries and a list of text using a text classifier.

This pipeline supports both standard text classification models and zero-shot classification models. The pipeline uses the queries as labels for the input text. The classifier returns scores per input text, so the results are transposed to give scores per query/label instead.
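The transposition step can be sketched without loading a model. The snippet below is a minimal illustration in plain Python; the second query and all scores in `per_text` are made-up values standing in for classifier output, not results from a real model.

```python
# Hypothetical classifier output: one row per input text, each row holding
# (label id, score) pairs, where label id is the index of the query.
queries = ["feel good story", "wildlife safety"]
texts = [
    "Maine man wins $1M from $25 lottery ticket",
    "Don't sacrifice slower friends in a bear attack",
]

per_text = [
    [(0, 0.9), (1, 0.1)],  # texts[0] scored against each query
    [(1, 0.7), (0, 0.2)],  # texts[1], labels returned in a different order
]

# Order each row by label id so column j always refers to queries[j]
rows = [[score for _, score in sorted(row)] for row in per_text]

# Transpose: one row per query, one column per text
per_query = [list(column) for column in zip(*rows)]

# Rank texts for each query as (text id, score), highest score first
ranked = [sorted(enumerate(row), key=lambda x: x[1], reverse=True) for row in per_query]

for query, row in zip(queries, ranked):
    print(query, row)
```

Sorting each row by label id before transposing matters because a classifier may return labels ordered by score rather than by position.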

Cross-encoder models are supported via the crossencode=True constructor parameter. These models are loaded with a CrossEncoder pipeline that can also be instantiated directly. The CrossEncoder pipeline has the same methods and functionality as described below.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Similarity

# Create and run pipeline
similarity = Similarity()
similarity("feel good story", [
    "Maine man wins $1M from $25 lottery ticket", 
    "Don't sacrifice slower friends in a bear attack"
])

See the link below for a more detailed example.

Notebook	Description
Add semantic search to Elasticsearch	Add semantic search to existing search systems

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
similarity:

Run with Workflows

from txtai.app import Application

# Create and run pipeline with workflow
app = Application("config.yml")
app.similarity("feel good story", [
    "Maine man wins $1M from $25 lottery ticket", 
    "Don't sacrifice slower friends in a bear attack"
])

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/similarity" \
  -H "Content-Type: application/json" \
  -d '{"query": "feel good story", "texts": ["Maine man wins $1M from $25 lottery ticket", "Dont sacrifice slower friends in a bear attack"]}'

Methods

Python documentation for the pipeline.

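__init__(path=None, quantize=False, gpu=True, model=None, dynamic=True, crossencode=False, **kwargs)

Source code in txtai/pipeline/text/similarity.py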

def __init__(self, path=None, quantize=False, gpu=True, model=None, dynamic=True, crossencode=False, **kwargs):
    # Use zero-shot classification if dynamic is True and crossencode is False, otherwise use standard text classification
    super().__init__(path, quantize, gpu, model, False if crossencode else dynamic, **kwargs)

    # Load as a cross-encoder if crossencode set to True
    self.crossencoder = CrossEncoder(model=self.pipeline) if crossencode else None

__call__(query, texts, multilabel=True)

Computes the similarity between query and list of text. Returns a list of (id, score) sorted by highest score, where id is the index in texts.

This method supports query as a string or a list. If query is a string, the return type is a 1D list of (id, score). If query is a list, a 2D list of (id, score) is returned with a row per query string.

Parameters:

Name	Description	Default
`query`	query text|list	required
`texts`	list of text	required
`multilabel`	labels are independent if True, scores are normalized to sum to 1 per text item if False, raw scores returned if None	`True`

Returns:

Type	Description
	list of (id, score)
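Because each id is an index into texts, results usually need to be mapped back to the original strings. The following is a minimal sketch of that step; the `single` and `multiple` values are hypothetical return values with made-up scores, and `resolve` is an illustrative helper, not part of txtai.

```python
texts = [
    "Maine man wins $1M from $25 lottery ticket",
    "Don't sacrifice slower friends in a bear attack",
]

# Hypothetical result for a single string query: 1D list of (id, score)
single = [(0, 0.85), (1, 0.04)]

# Hypothetical result for a list of two queries: 2D list, one row per query
multiple = [
    [(0, 0.85), (1, 0.04)],
    [(1, 0.62), (0, 0.11)],
]

def resolve(results, texts):
    """Replace each id with the text it indexes (illustrative helper)."""
    return [(texts[uid], score) for uid, score in results]

# Best match for the single query is the first element of the sorted list
best = resolve(single, texts)[0]

# For a list of queries, resolve each row separately
per_query = [resolve(row, texts) for row in multiple]

print(best)
print(per_query[1][0])
```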

Source code in txtai/pipeline/text/similarity.py

def __call__(self, query, texts, multilabel=True):
    """
    Computes the similarity between query and list of text. Returns a list of
    (id, score) sorted by highest score, where id is the index in texts.

    This method supports query as a string or a list. If query is a string,
    the return type is a 1D list of (id, score). If query is a list, a 2D list
    of (id, score) is returned with a row per query string.

    Args:
        query: query text|list
        texts: list of text
        multilabel: labels are independent if True, scores are normalized to sum to 1 per text item if False, raw scores returned if None

    Returns:
        list of (id, score)
    """

    if self.crossencoder:
        # pylint: disable=E1102
        return self.crossencoder(query, texts, multilabel)

    # Call Labels pipeline for texts using input query as the candidate label
    scores = super().__call__(texts, [query] if isinstance(query, str) else query, multilabel)

    # Sort on query index id
    scores = [[score for _, score in sorted(row)] for row in scores]

    # Transpose axes to get a list of text scores for each query
    scores = np.array(scores).T.tolist()

    # Build list of (id, score) per query sorted by highest score
    scores = [sorted(enumerate(row), key=lambda x: x[1], reverse=True) for row in scores]

    return scores[0] if isinstance(query, str) else scores