URL Retrieve

The URL Retrieve pipeline retrieves content from an HTTP(S) URL.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import URLRetrieve

# Create and run pipeline
urlretrieve = URLRetrieve()
urlretrieve("https://github.com/neuml/txtai")

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
urlretrieve:

# Run pipeline with workflow
workflow:
  retrieve:
    tasks:
      - action: urlretrieve

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
# The workflow name must match the key defined under "workflow" in config.yml
list(app.workflow("retrieve", ["https://github.com/neuml/txtai"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"retrieve", "elements":["https://github.com/neuml/txtai"]}'
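
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the API from the command above is listening on http://localhost:8000. The helper names build_workflow_request and run_workflow are hypothetical and not part of txtai, and the response body is returned as raw bytes because its format is not described here.

```python
import json
from urllib.request import Request, urlopen


def build_workflow_request(name, elements, base_url="http://localhost:8000"):
    """Build a POST request for the /workflow endpoint (hypothetical helper)."""
    body = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return Request(
        f"{base_url}/workflow",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_workflow(name, elements, base_url="http://localhost:8000", timeout=30):
    """Send the request and return the raw response body (hypothetical helper)."""
    request = build_workflow_request(name, elements, base_url)
    with urlopen(request, timeout=timeout) as response:
        return response.read()


# Usage, with the API server running:
# run_workflow("retrieve", ["https://github.com/neuml/txtai"])
```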

Methods

Python documentation for the pipeline.

__init__(headers=None, safeopen=False, timeout=30, readlimit=100 * 1024 * 1024)

Creates a new URLRetrieve pipeline.

Parameters:

Name       Description                                   Default
headers    HTTP headers                                  None
safeopen   if safe validation checks should be enabled   False
timeout    default socket timeout, in seconds            30
readlimit  default read limit, in bytes                  100 * 1024 * 1024

Source code in txtai/pipeline/data/urlretrieve.py
def __init__(self, headers=None, safeopen=False, timeout=30, readlimit=100 * 1024 * 1024):
    """
    Creates a new URLRetrieve pipeline.

    Args:
        headers: http headers
        safeopen: if safe validation checks should be enabled
        timeout: default socket timeout
        readlimit: default read limit
    """

    # HTTP headers
    self.headers = headers if headers else {}

    # Safeopen mode
    self.safeopen = safeopen

    # Socket timeout
    self.timeout = timeout

    # Read limit
    self.readlimit = readlimit

    # Create a blank opener
    self.opener = OpenerDirector()

    # Register handlers
    for handler in [
        UnknownHandler(),
        HTTPDefaultErrorHandler(),
        HTTPErrorProcessor(),
        SafeHTTPHandler(self),
        SafeHTTPSHandler(self),
        SafeRedirectHandler(self),
    ]:
        self.opener.add_handler(handler)
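
The constructor above starts from a blank OpenerDirector and registers handlers one by one, so only the schemes it names can be opened. The sketch below illustrates that pattern with the standard library alone. It substitutes the stock HTTPHandler and HTTPSHandler for the SafeHTTPHandler, SafeHTTPSHandler and SafeRedirectHandler classes, which are internal to txtai, so it shows the scheme restriction only and none of the safeopen checks. The function names are hypothetical.

```python
from urllib.error import URLError
from urllib.request import (
    HTTPDefaultErrorHandler,
    HTTPErrorProcessor,
    HTTPHandler,
    HTTPSHandler,
    OpenerDirector,
    Request,
    UnknownHandler,
)


def build_restricted_opener():
    """Create an opener that only knows how to open http and https URLs."""
    opener = OpenerDirector()
    for handler in [
        UnknownHandler(),
        HTTPDefaultErrorHandler(),
        HTTPErrorProcessor(),
        HTTPHandler(),
        HTTPSHandler(),
    ]:
        opener.add_handler(handler)
    return opener


def scheme_rejected(opener, url):
    """Return True if the opener has no handler for the URL's scheme."""
    try:
        opener.open(Request(url), timeout=5).close()
    except URLError as error:
        # UnknownHandler raises this when no handler matches the scheme
        return "unknown url type" in str(error.reason)
    return False


opener = build_restricted_opener()
print(scheme_rejected(opener, "file:///etc/hostname"))
print(scheme_rejected(opener, "ftp://example.com/file.txt"))
```

By contrast, urllib.request.build_opener() registers file, FTP and data handlers by default, which is why a blank opener is the stricter starting point.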

__call__(url)

Retrieves content from url.

Parameters:

Name  Description  Default
url   input url    required

Returns:

The retrieved content as bytes, truncated to at most readlimit bytes.

Source code in txtai/pipeline/data/urlretrieve.py
def __call__(self, url):
    """
    Retrieves content from url.

    Args:
        url: input url

    Returns:
        data
    """

    with contextlib.closing(self.opener.open(Request(url, headers=self.headers), timeout=self.timeout)) as connection:
        # Read up to readlimit bytes
        return connection.read(self.readlimit)
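
The effect of the read limit can be seen without a network connection. The sketch below is an illustration only: it uses an in-memory io.BytesIO stream as a stand-in for the HTTP response, and read_limited is a hypothetical helper that mirrors the pattern in the method above (wrap in contextlib.closing, then read at most readlimit bytes).

```python
import contextlib
import io


def read_limited(connection, readlimit):
    """Read at most readlimit bytes, then close the connection."""
    with contextlib.closing(connection) as conn:
        return conn.read(readlimit)


# Stand-in for a response body of 1000 bytes
stream = io.BytesIO(b"x" * 1000)
data = read_limited(stream, 100)

print(len(data))      # 100
print(stream.closed)  # True
```

Content beyond the limit is discarded silently, so callers that need the full body should set readlimit high enough for the largest response they expect.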