Install the project's requirements by running:

```bash
pip install -r requirements.txt
```

As with most Python projects, we recommend setting up a virtual environment first.
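A minimal setup could look like this (the environment name .venv is just an example):

```bash
# Create and activate a virtual environment, then install the dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```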
To run the benchmark locally, first start the pgvector Docker container by running the helper script:

```bash
sh ./utils/start_pgvector.sh
```
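Under the hood this amounts to starting a Postgres container with the pgvector extension. A rough manual equivalent is sketched below; the image tag, password, and port are assumptions, so check the helper script for the values the benchmark actually uses.

```bash
# Approximate manual equivalent of the helper script (values are assumptions)
docker run -d --name pgvector \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```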
The data used for the trace in this benchmark comes from the MS MARCO passage ranking dataset. As this dataset does not ship with embeddings, they have to be generated by changing into the /data directory and running:
```bash
python data_handling.py
```

The embeddings are generated with the all-mpnet-base-v2 model from Hugging Face sentence-transformers. As the dataset is relatively large, you should have a GPU at your disposal.
To create the trace, which is later used to execute the benchmark, change into the /benchmark directory and run:
```bash
python make_trace.py
```

If you intend to run the benchmark on Google Cloud and use the deployment scripts from this project, you need to upload a gzipped archive of the /benchmark/trace directory to a Google Cloud Storage bucket (name: bench-data-bucket) in the same region you intend to run the project in.
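Assuming you run the commands from the repository root and name the archive trace.tar.gz (the exact archive name and layout expected by the deployment scripts may differ), the upload could look like this:

```bash
# Create the bucket in the region you intend to run the benchmark in (region is an example)
gsutil mb -l europe-west3 gs://bench-data-bucket

# Archive the trace directory and upload it
tar -czf trace.tar.gz -C benchmark trace
gsutil cp trace.tar.gz gs://bench-data-bucket/
```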
Install the gcloud CLI and authenticate for the first time. When asked for a passphrase for the SSH key file, leave it blank; otherwise you will be prompted for it constantly.
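A typical first-time setup looks roughly like this:

```bash
# Log in and pick the default project, region and zone
gcloud init

# Optional: application default credentials, in case client libraries need them
gcloud auth application-default login
```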
The project is currently set up so that the trace data is fetched from a GCS bucket before the benchmark is executed. This keeps benchmark start-up fast: the trace data is ~3 GB in size, and uploading it on every run might be prohibitively slow depending on your upload speed. To do this, first add a new storage bucket to your project and upload a gzipped version of the trace directory. To use the deployment script as-is, the bucket should be named 'bench-data-bucket'. Then grant the project's default service account access to this bucket by running:
```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
    --role="roles/storage.objectViewer"
```

To execute the benchmark, run main.py. The following parameters can be passed to adjust the benchmark:
- --requests_per_second: the arrival rate at which the requests are sent to the SUT (5 by default)
- --db_host: the IP address of the database host (localhost by default)
- --indexing_method: the indexing method pgvector should use [choices: "hnsw", "ivfflat", "none"] ("none" by default); see the index sketch after this list
- --run_number: ID used to identify the benchmark run later on
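For context, "hnsw" and "ivfflat" correspond to pgvector's two index types. The benchmark sets up the chosen index itself; the snippet below only illustrates the underlying SQL, and the table and column names (items, embedding) are hypothetical.

```bash
# Illustration only: table "items" and column "embedding" are made up, not the benchmark's schema
psql -h localhost -U postgres -c \
  "CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);"
psql -h localhost -U postgres -c \
  "CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);"
```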
Example run → Run Nr. 1, hnsw indexing method, 10 requests per second:
```bash
python main.py --run_number 1 --indexing_method hnsw --requests_per_second 10
```

For running the project on Google Cloud, a deployment script is supplied under /deployment. It sets up the needed infrastructure, executes the benchmark, gathers the results, and tears down the infrastructure after everything has finished.
For the script to work properly, set the values in env.py to match your configuration.
After this, initialize Terraform by running:
```bash
terraform init
```

Example run → Run Nr. 1, hnsw indexing method, 10 requests per second:
```bash
python deploy_and_run.py --indexing_method hnsw --requests_per_second 10 --run_number 1
```

There are two evaluation scripts for this project, which can be found under /eval.
The visualizations (throughput and latency) can be obtained by running:
```bash
python visualizations.py
```

The query accuracy metrics can be obtained by running:
```bash
python query_accuracy.py
```

The resulting plots are all placed under ./eval/plots.