When you install pgai on Timescale Cloud, the vectorizer worker is automatically activated and runs on a schedule. You don't need to do anything; it works out of the box.
How it works: When you deploy a pgai vectorizer on Timescale Cloud, a scheduled job checks whether any vectorizer has work to do. If so, the job triggers a cloud function to embed the data.
Disable the cloud function: In some cases you might want to run the vectorizer worker manually and prevent the cloud function from running. To do this, set scheduling => ai.scheduling_none() in the configuration for your vectorizer. You can then run the vectorizer worker manually using the pgai vectorizer worker command or any other method discussed below.
Manually trigger an execution: You can also trigger an execution of the vectorizer cloud function manually at any time by running select ai.execute_vectorizer(<vectorizer_id>).
Note
Timescale Cloud currently does not support Ollama. To use Ollama on the data in your Timescale Cloud service you need to disable the cloud function and run the vectorizer worker yourself.
When you use pgai vectorizers on a self-hosted Postgres installation or on a cloud provider other than Timescale Cloud, you have to run the vectorizer worker yourself. The vectorizer worker connects to your database and processes the vectorizers you have defined. You can run the vectorizer worker:
- Through the pgai CLI tool, as pgai vectorizer worker (see instructions below)
- By integrating the vectorizer worker as a background process into your own Python application (see instructions below)
- Using the vectorizer worker Docker image (see instructions below)
- In a Docker Compose configuration (see instructions below)
Prerequisites: Python (>= 3.10) and pip
- Install pgai from PyPI
  pip install pgai[vectorizer-worker]
  The pgai command line tool should now be in your $PATH.
- Create a .env file
  Configure the API keys for your embedding providers by adding them to a .env file. For example, if you are using OpenAI, you can add the following:
  OPENAI_API_KEY=<your-openai-api-key>
  Alternatively, you can set the API key through an environment variable directly.
- Run the vectorizer worker
  After you define a vectorizer in your database, you run a vectorizer worker to generate and update your embeddings:
  pgai vectorizer worker -d "postgres://user:password@host:5432/database"
For more configuration options, see Advanced configuration options below.
Prerequisites: Python (>= 3.10) and pip
- Add the pgai package dependency to your project
  pip install pgai[vectorizer-worker]
  or add pgai to the dependencies in your requirements.txt file, pyproject.toml, or similar configuration file.
- Add the vectorizer worker to run in the background of your application
  import asyncio
  from pgai import Worker

  # asyncio.create_task must be called from inside a running event loop
  worker = Worker(db_url="postgres://user:password@host:5432/database")
  task = asyncio.create_task(worker.run())

  We recommend shutting down the worker gracefully when your application shuts down. Skipping this step and force-killing the worker won't cause problems, but a graceful shutdown lets the worker finish its last batch of work, if possible, before the process stops. You can do this by calling request_graceful_shutdown on the worker. Example:
  await worker.request_graceful_shutdown()
  try:
      result = await asyncio.wait_for(task, timeout=20)
      if result is not None:
          print("Worker shutdown with exception:", result)
      else:
          print("Worker shutdown successfully")
  except asyncio.TimeoutError:
      print("Worker did not shutdown in time, it was killed")
- Make sure you add the API keys for your embedding providers to the environment variables when you run your application.
  We recommend using a .env file to set the API key and then having your application load the .env file using the load_dotenv function from the python-dotenv package.
  Alternatively, you can set the API key through an environment variable directly.
- Run your application
For more configuration options, see Advanced configuration options below.
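The background-task and graceful-shutdown pattern described above can be tried without a database. The following is a minimal sketch, not pgai code: StandInWorker is a hypothetical class that imitates only the two methods used in this section (run and request_graceful_shutdown), and the sleep durations are arbitrary.

```python
import asyncio


class StandInWorker:
    """Hypothetical stand-in; it mimics only the run/shutdown shape used above."""

    def __init__(self) -> None:
        self._stop = asyncio.Event()
        self.batches_done = 0

    async def run(self):
        # Keep processing "batches" until a shutdown is requested.
        while not self._stop.is_set():
            await asyncio.sleep(0.01)  # pretend to embed one batch
            self.batches_done += 1
        return None  # None signals a clean exit, as in the example above

    async def request_graceful_shutdown(self) -> None:
        self._stop.set()


async def main() -> str:
    worker = StandInWorker()
    task = asyncio.create_task(worker.run())  # worker runs in the background

    await asyncio.sleep(0.05)  # the application does its own work here

    await worker.request_graceful_shutdown()
    try:
        result = await asyncio.wait_for(task, timeout=20)
    except asyncio.TimeoutError:
        return "killed"  # wait_for cancels the task when the timeout expires
    return "clean" if result is None else f"failed: {result}"


outcome = asyncio.run(main())
print(outcome)
```

Swapping StandInWorker for the real Worker should leave the structure of main unchanged, since the shutdown logic only depends on the task and the two method calls.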
Prerequisites: Docker
- Create a .env file
  Configure the API keys for your embedding providers by adding them to a .env file. For example, if you are using OpenAI, you can add the following:
  OPENAI_API_KEY=<your-openai-api-key>
  Alternatively, you can set the API key by passing it as an environment variable in the docker run command below.
- Run the vectorizer worker
  After you define a vectorizer in your database, you run a vectorizer worker to generate and update your embeddings.
  docker run --env-file=.env timescale/pgai-vectorizer-worker:{tag version} --db-url "postgres://user:password@host:5432/database"
For more configuration options, see Advanced configuration options below.
Below is an end-to-end, batteries-included Docker Compose configuration that you can use to test pgai vectorizers and the vectorizer worker locally. It includes:
- a local Postgres instance
- an Ollama embedding API service
- a pgai vectorizer worker
On your local machine:
- Copy the following configuration into a file named compose.yaml
  name: pgai
  services:
    db:
      image: timescale/timescaledb-ha:pg17
      environment:
        POSTGRES_PASSWORD: postgres
      ports:
        - "5432:5432"
      volumes:
        - data:/var/lib/postgresql/data
    vectorizer-worker:
      image: timescale/pgai-vectorizer-worker:latest
      environment:
        PGAI_VECTORIZER_WORKER_DB_URL: postgres://postgres:postgres@db:5432/postgres
        OLLAMA_HOST: http://ollama:11434
      command: [ "--poll-interval", "5s" ]
    ollama:
      image: ollama/ollama
  volumes:
    data:
- Start the services locally
  docker compose up -d
- Connect to your self-hosted database
  - Docker: docker compose exec -it db psql
  - psql: psql postgres://postgres:postgres@localhost:5432/postgres
For more configuration options, see Advanced configuration options below.
Most users of the vectorizer worker will need to set the following configuration options:
- The database connection string
- API keys for embedding providers, through environment variables or a .env file
Other, advanced configuration options are also available. See Advanced configuration options below.
The vectorizer worker needs to know how to connect to your database. You can do this by setting the -d command line argument or the PGAI_VECTORIZER_WORKER_DB_URL environment variable.
For example, if you are using a local Postgres database, you can set the database connection string as follows:
pgai vectorizer worker -d "postgres://user:password@host:5432/database"
If you are using an embedding provider that requires an API key (which most do), you can set the API key through an environment variable or a .env file.
For example, if you are using OpenAI, you can set the API key in a .env file as follows:
OPENAI_API_KEY=<your-openai-api-key>
Or you can set the API key through an environment variable:
export OPENAI_API_KEY=<your-openai-api-key>
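To make the .env mechanism concrete, here is a minimal sketch of what loading such a file amounts to. It is a simplified illustration, not the python-dotenv implementation: load_env_file is a hypothetical helper, it handles only plain KEY=VALUE lines and # comments, and it assumes that variables already present in the environment should win.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path, environ=os.environ):
    """Copy KEY=VALUE pairs from a .env file into environ.

    Simplified on purpose: skips blank lines and # comments, strips optional
    surrounding quotes, and never overrides variables that are already set.
    """
    loaded = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        if key not in environ:
            environ[key] = value
            loaded[key] = value
    return loaded


# Demo against a throwaway file and a plain dict instead of the real environment.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("# embedding provider\nOPENAI_API_KEY=sk-example\n")
    fake_environ = {}
    loaded = load_env_file(env_path, fake_environ)

print(loaded)
```

In a real application, a maintained package such as python-dotenv is the safer choice, since it covers quoting and escaping rules that this sketch ignores.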
Most users of the vectorizer worker will be happy with the default configuration for all other options. If you need to, however, you can control the following:
- The vectorizer ids that are processed by the vectorizer worker (section below)
- The time between vectorizer worker runs (section below)
- The number of asynchronous tasks running in a vectorizer worker (section below)
- Whether to run the vectorizer worker once and then exit (section below)
You can set all of these options through command line arguments, environment variables, or arguments to the Worker class constructor in the pgai Python package.
| Option | Command line argument | Environment variable | Worker class constructor argument |
|---|---|---|---|
| Control which vectorizers are processed | -i / --vectorizer-id | PGAI_VECTORIZER_WORKER_VECTORIZER_IDS | vectorizer_ids |
| Set the time between vectorizer worker runs | --poll-interval | PGAI_VECTORIZER_WORKER_POLL_INTERVAL | poll_interval |
| Set the number of asynchronous tasks running in a vectorizer worker | -c / --concurrency | PGAI_VECTORIZER_WORKER_CONCURRENCY | concurrency |
| Run the vectorizer worker once and then exit | --once | PGAI_VECTORIZER_WORKER_ONCE | once |
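If you launch the CLI from a script, the command line column of the table can be assembled programmatically. The sketch below is one possible way to do that; build_worker_command is a hypothetical helper (not part of pgai) that uses only the flags listed in this document.

```python
import shlex


def build_worker_command(db_url, vectorizer_ids=(), poll_interval=None,
                         concurrency=None, once=False):
    """Build the argument list for `pgai vectorizer worker` (hypothetical helper)."""
    cmd = ["pgai", "vectorizer", "worker", "-d", db_url]
    for vectorizer_id in vectorizer_ids:
        cmd += ["-i", str(vectorizer_id)]  # -i is repeated once per id
    if poll_interval is not None:
        cmd += ["--poll-interval", str(poll_interval)]
    if concurrency is not None:
        cmd += ["-c", str(concurrency)]
    if once:
        cmd.append("--once")
    return cmd


cmd = build_worker_command(
    "postgres://user:password@host:5432/database",
    vectorizer_ids=[42, 64],
    poll_interval="45m",
    concurrency=3,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True). Keeping it as a list rather than a single string avoids shell quoting problems with the connection string.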
To run a vectorizer worker that processes only a subset of the vectorizers in your database, specify the ids of the vectorizers you want to process with the -i / --vectorizer-id command line argument.
A vectorizer worker can:
- Run all vectorizers:
  To run all current and future vectorizers:
  - cli: pgai vectorizer worker
  - python: worker = Worker(db_url="postgres://user:password@host:5432/database")
  - Docker: docker run timescale/pgai-vectorizer-worker:{tag version}
  - Docker Compose: command: []
- Run a single vectorizer:
  To run the vectorizer with id 42:
  - cli: pgai vectorizer worker -i 42
  - python: worker = Worker(db_url="postgres://user:password@host:5432/database", vectorizer_ids=[42])
  - Docker: docker run timescale/pgai-vectorizer-worker:{tag version} -i 42
  - Docker Compose: command: ["-i", "42"]
- Run multiple specific vectorizers:
  To run the vectorizers with ids 42, 64, and 8:
  - cli: pgai vectorizer worker -i 42 -i 64 -i 8
  - python: worker = Worker(db_url="postgres://user:password@host:5432/database", vectorizer_ids=[42, 64, 8])
  - Docker: docker run timescale/pgai-vectorizer-worker:{tag version} -i 42 -i 64 -i 8
  - Docker Compose: command: ["-i", "42", "-i", "64", "-i", "8"]
- Run multiple vectorizers in concurrent vectorizer workers:
  To run the vectorizers with ids 42 and 64 in different vectorizer workers:
  - In a first shell, run:
    - cli: pgai vectorizer worker -i 42
    - python: worker = Worker(db_url="postgres://user:password@host:5432/database", vectorizer_ids=[42])
    - Docker: docker run timescale/pgai-vectorizer-worker:{tag version} -i 42
    - Docker Compose: command: ["-i", "42"]
  - In another shell, run:
    - cli: pgai vectorizer worker -i 64
    - python: worker = Worker(db_url="postgres://user:password@host:5432/database", vectorizer_ids=[64])
    - Docker: docker run timescale/pgai-vectorizer-worker:{tag version} -i 64
    - Docker Compose: command: ["-i", "64"]
- Run concurrent vectorizer workers on a single vectorizer:
  More than one vectorizer worker can efficiently process the same vectorizer id at the same time. To run the vectorizer with id 42 in different vectorizer workers:
  - In a first shell, run:
    - cli: pgai vectorizer worker -i 42
    - python: worker = Worker(db_url="postgres://user:password@host:5432/database", vectorizer_ids=[42])
    - Docker: docker run timescale/pgai-vectorizer-worker:{tag version} -i 42
    - Docker Compose: command: ["-i", "42"]
  - In another shell, run:
    - cli: pgai vectorizer worker -i 42
    - python: worker = Worker(db_url="postgres://user:password@host:5432/database", vectorizer_ids=[42])
    - Docker: docker run timescale/pgai-vectorizer-worker:{tag version} -i 42
    - Docker Compose: command: ["-i", "42"]
You can find the vectorizer ids in the ai.vectorizer table.
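As a rough illustration, the ids could be read from Python through any DB-API 2.0 connection. The sketch below makes two assumptions that are not stated in this document: that the identifier column in ai.vectorizer is named id, and that your driver exposes the standard cursor interface. To stay runnable without Postgres, the demo uses an in-memory SQLite database as a stand-in.

```python
import sqlite3


def list_vectorizer_ids(conn):
    """Return the vectorizer ids visible through a DB-API 2.0 connection."""
    cur = conn.cursor()
    # Assumption: ai.vectorizer exposes its identifier in a column named id.
    cur.execute("SELECT id FROM ai.vectorizer ORDER BY id")
    return [row[0] for row in cur.fetchall()]


# Stand-in database so the sketch runs without Postgres: an in-memory SQLite
# database attached under the name "ai" makes ai.vectorizer resolvable.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS ai")
conn.execute("CREATE TABLE ai.vectorizer (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO ai.vectorizer (id) VALUES (?)", [(42,), (64,), (8,)])

ids = list_vectorizer_ids(conn)
print(ids)
```

Against a real database, you would pass a connection from your Postgres driver instead of the SQLite stand-in; the returned ids are the values to give to -i or vectorizer_ids.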
When you run a vectorizer worker, it loops over the vectorizers defined in your database. Each vectorizer worker processes the vectorizer queue until it is empty. By default, the vectorizer worker then sleeps for five minutes and starts over.
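The loop described above can be pictured with a short sketch. This is an illustration of the drain, sleep, repeat behavior only, not the worker's actual implementation: run_worker_loop, process_queue, and max_iterations are hypothetical names, and max_iterations exists solely so the demo terminates.

```python
import time
from datetime import timedelta


def run_worker_loop(process_queue, poll_interval=timedelta(minutes=5),
                    once=False, sleep=time.sleep, max_iterations=None):
    """Drain the queue, sleep for poll_interval, repeat (illustrative only)."""
    iterations = 0
    while True:
        # Keep working until the queue reports nothing left to do.
        while process_queue():
            pass
        iterations += 1
        if once or iterations == max_iterations:
            return iterations
        sleep(poll_interval.total_seconds())


# Demo: a fake queue with three pending items and a recorded (not real) sleep.
pending = [1, 2, 3]
naps = []


def process_queue():
    if pending:
        pending.pop()
        return True
    return False


iterations = run_worker_loop(process_queue, sleep=naps.append, max_iterations=2)
print(iterations, naps)
```

The once parameter in the sketch corresponds to the idea of a single pass: the queue is drained one time and the function returns without sleeping.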
To control the time between vectorizer worker iterations, pass either an integer number of seconds or a duration string to the --poll-interval parameter:
- Run every hour:
  - cli: pgai vectorizer worker --poll-interval=1h
  - python: worker = Worker(db_url="postgres://user:password@host:5432/database", poll_interval=timedelta(hours=1))
  - Docker: docker run timescale/pgai-vectorizer-worker:{tag version} --poll-interval=1h
  - Docker Compose: command: ["--poll-interval", "1h"]
- Run every 45 minutes:
  - cli: pgai vectorizer worker --poll-interval=45m
  - python: worker = Worker(db_url="postgres://user:password@host:5432/database", poll_interval=timedelta(minutes=45))
  - Docker: docker run timescale/pgai-vectorizer-worker:{tag version} --poll-interval=45m
  - Docker Compose: command: ["--poll-interval", "45m"]
- Run every 900 seconds:
  - cli: pgai vectorizer worker --poll-interval=900
  - python: worker = Worker(db_url="postgres://user:password@host:5432/database", poll_interval=timedelta(seconds=900))
  - Docker: docker run timescale/pgai-vectorizer-worker:{tag version} --poll-interval=900
  - Docker Compose: command: ["--poll-interval", "900"]
You can also make the vectorizer worker run only once by setting the --once flag. See Run the vectorizer worker once and then exit for more details.
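The CLI takes the interval as a string, while the Python examples pass a timedelta. If you keep one setting for both, a small converter can bridge the two. The sketch below is a hypothetical helper, not pgai's own parser, and it deliberately supports only the forms that appear in this document: a bare integer number of seconds, or an integer followed by s, m, or h.

```python
import re
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_poll_interval(value):
    """Convert "900", "5s", "45m", or "1h" into a timedelta."""
    text = str(value).strip()
    match = re.fullmatch(r"(\d+)([smh]?)", text)
    if match is None:
        raise ValueError(f"unsupported poll interval: {value!r}")
    amount, unit = match.groups()
    # A bare integer has no unit suffix and is treated as seconds.
    return timedelta(**{_UNITS[unit or "s"]: int(amount)})


intervals = [parse_poll_interval(v) for v in ("1h", "45m", "900", "5s")]
print(intervals)
```

The result can be passed as the poll_interval argument shown in the Python examples above. Other duration spellings that the CLI may accept are out of scope for this sketch.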
Use the -c / --concurrency option to make the vectorizer worker use multiple asynchronous tasks to process a queue:
- cli: pgai vectorizer worker -c 3
- python: worker = Worker(db_url="postgres://user:password@host:5432/database", concurrency=3)
- Docker: docker run timescale/pgai-vectorizer-worker:{tag version} -c 3
- Docker Compose: command: ["-c", "3"]
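To give an intuition for what several asynchronous tasks processing one queue means, here is a generic asyncio sketch. It does not reproduce the worker's internals: drain_queue and consumer are hypothetical names, and the short sleep stands in for a call to an embedding provider.

```python
import asyncio


async def drain_queue(items, concurrency):
    """Process items with `concurrency` asyncio tasks sharing one queue."""
    queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    processed = []

    async def consumer(name):
        while True:
            try:
                item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left, this task is done
            await asyncio.sleep(0.01)  # stands in for an embedding API call
            processed.append((name, item))

    await asyncio.gather(*(consumer(f"task-{i}") for i in range(concurrency)))
    return processed


processed = asyncio.run(drain_queue(range(6), concurrency=3))
print(sorted(item for _, item in processed))
```

Because the tasks spend most of their time waiting on I/O, raising the number of tasks lets more requests be in flight at once within a single process.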
You can run the vectorizer worker once and then exit by using the --once flag. This is useful for debugging or if you want to run the vectorizer worker in a cron job.
- cli: pgai vectorizer worker --once
- python: worker = Worker(db_url="postgres://user:password@host:5432/database", once=True)
- Docker: docker run timescale/pgai-vectorizer-worker:{tag version} --once
- Docker Compose: command: ["--once"]