vDawgg/pgvector_benchmark

Setup

Running locally

Install the project's requirements by running:

pip install -r requirements.txt

As with most Python projects, we recommend setting up a virtual environment.

To run the benchmark locally, first start the pgvector Docker container by running the helper script:

sh ./utils/start_pgvector.sh
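
As a quick sanity check, you can confirm that the container accepts connections and that the vector extension is available. The connection parameters below are assumptions about a typical pgvector Docker setup, not values read from start_pgvector.sh; adjust them to whatever the script actually configures:

# Hypothetical connectivity check: host, credentials, and database name are assumptions.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, user="postgres", password="postgres", dbname="postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # Make sure the pgvector extension is installed in this database
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    print("pgvector version:", cur.fetchone()[0])
conn.close()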

Making embeddings for the dataset

The data used for the trace in this benchmark comes from the MS MARCO passage ranking dataset. As this dataset does not contain embeddings, they have to be generated by changing into the /data directory and running:

python data_handling.py

Embeddings are generated with the all-mpnet-base-v2 model from the Hugging Face sentence-transformers library. As the dataset is relatively large, you should have a GPU at your disposal.
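
For orientation, this is roughly what the encoding step looks like with the sentence-transformers API. It is a minimal sketch, not the actual contents of data_handling.py; the example passages are placeholders:

# Minimal sketch of embedding generation with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # runs on GPU automatically if one is available

passages = ["example passage one", "example passage two"]  # in practice, the MS MARCO passages
embeddings = model.encode(passages, batch_size=64, show_progress_bar=True)
print(embeddings.shape)  # (num_passages, 768) for all-mpnet-base-v2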

Preparing the trace workload

To create the trace, which is later used to execute the benchmark, change into the /benchmark directory and run:

python make_trace.py

If you intend to run the benchmark on Google Cloud and use this project's deployment scripts, you need to upload a gzipped archive of the /benchmark/trace directory to a Google Cloud Storage bucket (name: bench-data-bucket) in the same region you intend to run the project in.
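
One way to build and upload that archive is sketched below, assuming the google-cloud-storage client library; the archive and object names are illustrative, and gsutil or the Cloud Console work just as well:

# Sketch: archive the trace directory and upload it to the bucket used by the deployment scripts.
# The archive/object names are assumptions.
import tarfile
from google.cloud import storage  # pip install google-cloud-storage

with tarfile.open("trace.tar.gz", "w:gz") as archive:
    archive.add("benchmark/trace", arcname="trace")

client = storage.Client()
bucket = client.bucket("bench-data-bucket")
bucket.blob("trace.tar.gz").upload_from_filename("trace.tar.gz")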

Setup for running on GCE

Install the gcloud CLI and authenticate for the first time. When asked for a passphrase for the SSH key file, leave it blank; otherwise you will be prompted for it constantly.

The project is currently set up so that the trace data is fetched from a GCS bucket before the benchmark is executed. This keeps benchmark runs timely: the trace data is ~3 GB in size, and uploading it on every run might be prohibitively slow depending on your upload speed. To set this up, first add a new storage bucket to your project and upload a gzipped version of the trace directory. To use the deployment script as-is, the bucket should be named 'bench-data-bucket'. Then grant the default project service account access to this bucket by running:

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
  --role="roles/storage.objectViewer"

Running the benchmark

Running locally

To run the benchmark, execute main.py. The following parameters can be passed to adjust the benchmark:

  • --requests_per_second: the arrival rate at which requests are sent to the SUT (5 by default)
  • --db_host: the IP address of the database host (localhost by default)
  • --indexing_method: the indexing method pgvector should use [choices: "hnsw", "ivfflat", "none"] ("none" by default; see the index sketch after the example below)
  • --run_number: ID used to identify the benchmark run later on

Example run → Run Nr. 1, hnsw indexing method, 10 requests per second:

python main.py --run_number 1 --indexing_method hnsw --requests_per_second 10
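
For reference, the hnsw and ivfflat choices correspond to pgvector's two index types. The snippet below is a generic illustration of how such indexes are created, not the code main.py runs; the table name, column name, and index parameters are assumptions:

# Illustration only: table/column names and index parameters are assumptions.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, user="postgres", password="postgres", dbname="postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # HNSW index over an "embedding" vector column, using cosine distance
    cur.execute(
        "CREATE INDEX IF NOT EXISTS items_hnsw_idx "
        "ON items USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 64);"
    )
    # The ivfflat variant would instead use:
    #   ... USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
    # (IVFFlat indexes should be built after the table is populated.)
conn.close()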

Running on Google Cloud

To run the project on Google Cloud, a deployment script is supplied under /deployment. It sets up the needed infrastructure, executes the benchmark, gathers the results, and tears down the infrastructure after everything has finished.

For the script to work properly, set the values in env.py to match your configuration.

After this, initialize Terraform by running:

terraform init

Example run → Run Nr. 1, hnsw indexing method, 10 requests per second:

python deploy_and_run.py --indexing_method hnsw --requests_per_second 10 --run_number 1

Evaluation

There are two evaluation scripts for this project, which can be found under /eval.

The visualizations (throughput and latency) can be obtained by running:

python visualizations.py

The query accuracy metrics can be obtained by running:

python query_accuracy.py

The resulting plots are all placed under ./eval/plots.
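
To give a rough idea of what such plots involve, here is a standalone sketch of computing throughput and latency percentiles from timestamped results. The file name and column names are hypothetical and do not describe this project's actual result format:

# Hypothetical example: results.csv and its columns are illustrative only.
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("results.csv")  # assumed columns: timestamp (s), latency_ms

# Latency percentiles
p50, p95, p99 = results["latency_ms"].quantile([0.5, 0.95, 0.99])
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")

# Throughput: completed requests per one-second window
throughput = results.groupby(results["timestamp"].astype(int)).size()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
throughput.plot(ax=ax1, title="Throughput (requests/s)")
results["latency_ms"].plot(ax=ax2, kind="hist", bins=50, title="Latency (ms)")
fig.savefig("eval/plots/example.png")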
