Install the project's requirements by running:

```bash
pip install -r requirements.txt
```

As with most Python projects, we recommend setting up a virtual environment first.
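A minimal setup could look like this (the environment name .venv is just an example):

```bash
# Create and activate a virtual environment, then install the dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```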
To run the benchmark locally, first start the pgvector Docker container by running the helper script:

```bash
sh ./utils/start_pgvector.sh
```
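Under the hood this amounts to starting a Postgres container with the pgvector extension. A rough manual equivalent is sketched below; the image tag, password, and port are assumptions, so check the helper script for the values the benchmark actually uses.

```bash
# Approximate manual equivalent of the helper script (values are assumptions)
docker run -d --name pgvector \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```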
The data used for the trace in this benchmark comes from the MS MARCO passage ranking dataset. As this dataset does not ship with embeddings, they have to be generated by changing into the /data directory and running:
```bash
python data_handling.py
```

The embeddings are generated with the all-mpnet-base-v2 model from Hugging Face sentence-transformers. As the dataset is relatively large, you should have a GPU at your disposal.
To create the trace, which is later used to execute the benchmark, change into the /benchmark directory and run:
```bash
python make_trace.py
```

If you intend to run the benchmark on Google Cloud and use the deployment scripts from this project, you need to upload a gzipped archive of the /benchmark/trace directory to a Google Cloud Storage bucket (name: bench-data-bucket) in the same region you intend to run the project in.
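Assuming you run the commands from the repository root and name the archive trace.tar.gz (the exact archive name and layout expected by the deployment scripts may differ), the upload could look like this:

```bash
# Create the bucket in the region you intend to run the benchmark in (region is an example)
gsutil mb -l europe-west3 gs://bench-data-bucket

# Archive the trace directory and upload it
tar -czf trace.tar.gz -C benchmark trace
gsutil cp trace.tar.gz gs://bench-data-bucket/
```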
Install the gcloud CLI and authenticate for the first time. When asked for a passphrase for the SSH key file, leave it blank; otherwise you will be prompted for it constantly.
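A typical first-time setup looks roughly like this:

```bash
# Log in and pick the default project, region and zone
gcloud init

# Optional: application default credentials, in case client libraries need them
gcloud auth application-default login
```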
The project is currently set up so that the trace data is fetched from a GCS bucket before the benchmark is executed. This keeps benchmark start-up fast: the trace data is ~3 GB in size, and uploading it on every run might be prohibitively slow depending on your upload speed. To do this, first add a new storage bucket to your project and upload a gzipped version of the trace directory. To use the deployment script as-is, the bucket should be named 'bench-data-bucket'. Then grant the project's default service account access to this bucket by running:
```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
    --role="roles/storage.objectViewer"
```

To execute the benchmark, run main.py. The following parameters can be passed to adjust the benchmark:
- --requests_per_second: the arrival rate at which the requests are sent to the SUT (5 by default)
- --db_host: the IP address of the database host (localhost by default)
- --indexing_method: the indexing method pgvector should use [choices: "hnsw", "ivfflat", "none"] ("none" by default); see the index sketch after this list
- --run_number: ID used to identify the benchmark run later on
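For context, "hnsw" and "ivfflat" correspond to pgvector's two index types. The benchmark sets up the chosen index itself; the snippet below only illustrates the underlying SQL, and the table and column names (items, embedding) are hypothetical.

```bash
# Illustration only: table "items" and column "embedding" are made up, not the benchmark's schema
psql -h localhost -U postgres -c \
  "CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);"
psql -h localhost -U postgres -c \
  "CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);"
```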
Example run → Run Nr. 1, hnsw indexing method, 10 requests per second:
```bash
python main.py --run_number 1 --indexing_method hnsw --requests_per_second 10
```

For running the project on Google Cloud, a deployment script is supplied under /deployment. It sets up the needed infrastructure, executes the benchmark, gathers the results, and tears down the infrastructure after everything has finished.
For the script to work properly, set the values in env.py to match your configuration.
After this, initialize Terraform by running:
```bash
terraform init
```

Example run → Run Nr. 1, hnsw indexing method, 10 requests per second:
```bash
python deploy_and_run.py --indexing_method hnsw --requests_per_second 10 --run_number 1
```

There are two evaluation scripts for this project, which can be found under /eval.
The visualizations (throughput and latency) can be obtained by running:
```bash
python visualizations.py
```

The query accuracy metrics can be obtained by running:
```bash
python query_accuracy.py
```

The resulting plots are all placed under ./eval/plots.