75 changes: 72 additions & 3 deletions README.md
@@ -102,12 +102,15 @@ python run.py --dataset app_reviews-384-euclidean-filter --count 100 --runs 3 --
- redis-hnsw
- [Elasticsearch](https://www.elastic.co/)
- elasticsearch-hnsw

- [SPTAG](https://github.com/microsoft/SPTAG)
- sptag-bkt
- [pgvector](https://github.com/pgvector/pgvector)
- pgvector-hnsw
- pgvector-ivfflat

**TODO**

- [Vespa](https://vespa.ai/)

## Use-cases for Compound Queries

@@ -237,6 +240,72 @@ The dataset at [Hugging Face - dbpedia-entities-openai3-text-embedding-3-large-3
| dbpedia-entities-openai3-text-embedding-3-large-1536-1000k-euclidean | 990,000 / 10,000 | [OpenAI text-embedding-3-large](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) | 1536 | Euclidean | [link1](https://huggingface.co/datasets/Patrickcode/BigVectorBench/resolve/main/dbpedia-entities-openai3-text-embedding-3-large-1536-1000k-euclidean.hdf5), [link2](https://hf-mirror.com/datasets/Patrickcode/BigVectorBench/resolve/main/dbpedia-entities-openai3-text-embedding-3-large-1536-1000k-euclidean.hdf5) | [dbpedia-entities](https://huggingface.co/datasets/BeIR/dbpedia-entity) |
| dbpedia-entities-openai3-text-embedding-3-large-3072-1000k-euclidean | 990,000 / 10,000 | [OpenAI text-embedding-3-large](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-3072-1M) | 3072 | Euclidean | [link1](https://huggingface.co/datasets/Patrickcode/BigVectorBench/resolve/main/dbpedia-entities-openai3-text-embedding-3-large-3072-1000k-euclidean.hdf5), [link2](https://hf-mirror.com/datasets/Patrickcode/BigVectorBench/resolve/main/dbpedia-entities-openai3-text-embedding-3-large-3072-1000k-euclidean.hdf5) | [dbpedia-entities](https://huggingface.co/datasets/BeIR/dbpedia-entity) |

## Artificial Workloads

### Build

The command below creates artificial datasets for testing; a sample invocation follows the argument list below.

```bash
python create_artificial_datasets.py
```

Arguments:

- `--n`: the number of training vectors to generate (default: 10000)
- `--m`: the number of query vectors to generate (default: 1000)
- `--d`: the dimension of the generated vectors (default: 128)
- `--l`: the number of labels attached to each vector (default: 1)
- `--metric`: the distance metric to use (default: angular)
- `--maxlabel`: the maximum label value to generate (default: 100000)
- `--center`: the number of cluster centers for the generated vectors (default: 100)
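
For example, a hypothetical invocation that generates 1,000,000 training vectors and 10,000 queries of dimension 128, each carrying 2 labels, under Euclidean distance (the flag values here are illustrative only):

```bash
# Hypothetical invocation; any flag omitted falls back to the defaults above
python create_artificial_datasets.py \
    --n 1000000 \
    --m 10000 \
    --d 128 \
    --l 2 \
    --metric euclidean \
    --maxlabel 100000 \
    --center 100
```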

### Format

- HDF5 format:
- Attributes:
- `type`: the type of the dataset (default: `ann`)
- `ann` or `dense`: ann datasets and large-scale datasets
- `filter-ann`: filter-ann datasets
- `mm-ann`: multi-modal datasets
- `mv-ann`: multi-vector datasets
- `sparse`: sparse datasets
- `distance`: the distance computation method (must be specified)
- `euclidean`: Euclidean distance
- `angular`: Angular distance
- `hamming`: Hamming distance
- `jaccard`: Jaccard distance
- `filter_expr_func`: the filter expression function (only available for the filter-ann datasets)
- `label_names`: the names of the labels (only available for the filter-ann datasets)
- `label_types`: the types of the labels (only available for the filter-ann datasets, e.g., `int32`)
- Datasets:
<!-- - `train`: the training vectors (available except for the filter-ann datasets)
- `test`: the query vectors (available except for the filter-ann datasets) -->
- `train_vec`: the training vectors (only available for the filter-ann datasets)
- `train_label`: the training labels (only available for the filter-ann datasets)
- `test_vec`: the query vectors (only available for the filter-ann datasets)
- `test_label`: the query labels (only available for the filter-ann datasets)
- `distances`: the ground truth distances between the query vectors and the training vectors
- `neighbors`: the ground truth neighbors containing the indices of the nearest neighbors
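
The snippet below is a minimal sketch of reading a filter-ann dataset with this layout using `h5py`; the file name is hypothetical, and any filter-ann dataset under `./data` exposes the same attributes and datasets:

```python
import h5py

# Minimal sketch: inspect a filter-ann dataset with the layout described above.
# The file name is hypothetical.
with h5py.File("data/artificial-average-128d-1l-80a-euclidean-107.hdf5", "r") as f:
    print(f.attrs["type"])         # e.g. "filter-ann"
    print(f.attrs["distance"])     # e.g. "euclidean"
    print(f.attrs["label_names"])  # names of the attached labels
    train_vec = f["train_vec"][:]      # training vectors
    train_label = f["train_label"][:]  # labels attached to the training vectors
    test_vec = f["test_vec"][:]        # query vectors
    test_label = f["test_label"][:]    # labels attached to the queries
    neighbors = f["neighbors"][:]      # ground-truth nearest-neighbor indices
    distances = f["distances"][:]      # ground-truth distances
    print(train_vec.shape, train_label.shape, test_vec.shape, neighbors.shape)
```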

### Store and Use

- The generated datasets are stored in the `./data` directory and named `artificial-*-*d-*l-*a.hdf5`.
- Add the new dataset's name as a key of `ART_DATASETS` in `./bigvectorbench/datasets.py` (around line 947) so that the benchmark can find it; see the sketch below.
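
A minimal sketch of such a registration, for a hypothetical dataset named `my-artificial-128d-1l-80a` whose HDF5 file already sits in `./data`:

```python
# In ./bigvectorbench/datasets.py -- the dataset name below is hypothetical.
# This mirrors the existing ART_DATASETS entries; if the file already exists
# in ./data, the benchmark's downloader is assumed to skip the fetch.
ART_DATASETS["my-artificial-128d-1l-80a"] = lambda out_fn: artificial_dataset(
    out_fn, "my-artificial-128d-1l-80a"
)
```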

### Completed artificial workloads

| Dataset | Data / Query Points | Type | Dimension | Distance | Label Numbers | Filter Ratio | Download | Raw Data |
| :--------------------------------------------------------------------: | :------------------------------------: | :-------------------------------------------------------------------------------------------------------------------------------: | :---------: | :---------: | :--------: | :---------: |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------: |
| msong-1filter-80a | 990,000 / 10,000 | real vectors with artificial labels | 420 | Euclidean | 1 | 80% | [link1](https://huggingface.co/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/msong-1filter-80a.hdf5), [link2](https://hf-mirror.com/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/msong-1filter-80a.hdf5) | [msong](https://www.cse.cuhk.edu.hk/systems/hash/gqr/datasets.html) |
| deep1M-2filter-50a | 1,000,000 / 10,000 | real vectors with artificial labels | 256 | Euclidean | 2 | 50% | [link1](https://huggingface.co/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/deep1M-2filter-50a.hdf5), [link2](https://hf-mirror.com/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/deep1M-2filter-50a.hdf5) | [deep1M](https://www.cse.cuhk.edu.hk/systems/hash/gqr/datasets.html) |
| tiny5m-6filter-12a | 5,000,000 / 10,000 | real vectors with artificial labels | 384 | Euclidean | 6 | 12% | [link1](https://huggingface.co/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/tiny5m-6filter-12a.hdf5), [link2](https://hf-mirror.com/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/tiny5m-6filter-12a.hdf5) | [tiny5m](https://www.cse.cuhk.edu.hk/systems/hash/gqr/datasets.html) |
| sift10m-6filter-6a | 10,000,000 / 10,000 | real vectors with artificial labels | 128 | Euclidean | 6 | 6% | [link1](https://huggingface.co/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/sift10m-6filter-6a.hdf5), [link2](https://hf-mirror.com/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/sift10m-6filter-6a.hdf5) | [sift10m](https://www.cse.cuhk.edu.hk/systems/hash/gqr/datasets.html) |
| artificial-average-128d-1l-80a-euclidean-107 | 10,000,000 / 10,000 |artificial vectors with labels | 128 | Euclidean | 1 | 80% | [link1](https://huggingface.co/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/artificial-average-128d-1l-80a-euclidean-107.hdf5), [link2](https://hf-mirror.com/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/artificial-average-128d-1l-80a-euclidean-107.hdf5) | - |
| artificial-average-128d-2l-50a-euclidean-107 | 10,000,000 / 10,000 |artificial vectors with labels | 128 | Euclidean | 2 | 50% | [link1](https://huggingface.co/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/artificial-average-128d-2l-50a-euclidean-107.hdf5), [link2](https://hf-mirror.com/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/artificial-average-128d-2l-50a-euclidean-107.hdf5) | - |
| artificial-average-384d-6l-12a-euclidean-107 | 10,000,000 / 10,000 |artificial vectors with labels | 384 | Euclidean | 6 | 12% | [link1](https://huggingface.co/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/artificial-average-384d-6l-12a-euclidean-107.hdf5), [link2](https://hf-mirror.com/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/artificial-average-384d-6l-12a-euclidean-107.hdf5) | - |
| artificial-average-768d-6l-6a-euclidean-107 | 10,000,000 / 10,000 |artificial vectors with labels | 768 | Euclidean | 6 | 6% | [link1](https://huggingface.co/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/artificial-average-768d-6l-6a-euclidean-107.hdf5), [link2](https://hf-mirror.com/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/artificial-average-768d-6l-6a-euclidean-107.hdf5) | - |

## Contributing

For the development of BigVectorBench, we welcome contributions from the community. If you are interested in contributing to this project, please follow the [Guidelines for Contributing](./CONTRIBUTING.md).
33 changes: 31 additions & 2 deletions bigvectorbench/datasets.py
@@ -57,6 +57,7 @@ def get_dataset_fn(dataset_name: str) -> str:
"""
if not os.path.exists("data"):
os.mkdir("data")

return os.path.join("data", f"{dataset_name}.hdf5")


@@ -77,10 +78,14 @@ def get_dataset(dataset_name: str) -> Tuple[h5py.File, int]:
    if dataset_name in ANN_DATASETS or dataset_name in RANDOM_DATASETS:
        dataset_url = f"https://ann-benchmarks.com/{dataset_name}.hdf5"
    elif dataset_name in BVB_DATASETS:
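        # Use resolve/ (not blob/) URLs: blob/ pages return HTML, while
        # resolve/ serves the raw HDF5 file.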
        dataset_url = f"https://huggingface.co/datasets/Patrickcode/BigVectorBench/resolve/main/{dataset_name}.hdf5"
        # dataset_url = f"https://hf-mirror.com/datasets/Patrickcode/BigVectorBench/resolve/main/{dataset_name}.hdf5"
    elif dataset_name in ART_DATASETS:
        dataset_url = f"https://huggingface.co/datasets/Patrickcode/BigVectorBench/resolve/main/{dataset_name}.hdf5"
        # dataset_url = f"https://hf-mirror.com/datasets/Patrickcode/BigVectorBench/resolve/main/{dataset_name}.hdf5"
    else:
        raise ValueError(
            f"Unknown dataset: {dataset_name}. Datasets should be one of {DATASETS.keys()}, "
            "or be created by create_artificial_datasets.py and registered in ART_DATASETS."
        )

    try:
        download(dataset_url, hdf5_filename)
    except Exception:
@@ -931,7 +936,31 @@ def dbpedia_entities_openai3_text_embedding_3_large_1536_1M(out_fn, i, distance)
        }
    )

def artificial_dataset(out_fn: str, dataset_name: str) -> None:
    """
    artificial_dataset: Downloads an artificial dataset from the Bigvectorbench-artificial-datasets repository on the Hugging Face Datasets Hub
    """
    dataset_url = f"https://huggingface.co/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/{dataset_name}.hdf5"
    # dataset_url = f"https://hf-mirror.com/datasets/AnnaZh/Bigvectorbench-artificial-datasets/resolve/main/{dataset_name}.hdf5"
    download(dataset_url, out_fn)

ART_DATASETS: Dict[str, Callable[[str], None]] = {
    "deep1M-2filter-50a": lambda out_fn: artificial_dataset(
        out_fn, "deep1M-2filter-50a"
    ),
    "msong-1filter-80a": lambda out_fn: artificial_dataset(
        out_fn, "msong-1filter-80a"
    ),
    "sift10m-6filter-6a": lambda out_fn: artificial_dataset(
        out_fn, "sift10m-6filter-6a"
    ),
    "tiny5m-6filter-12a": lambda out_fn: artificial_dataset(
        out_fn, "tiny5m-6filter-12a"
    ),
}

DATASETS: Dict[str, Callable[[str], None]] = {}
DATASETS.update(RANDOM_DATASETS)
DATASETS.update(ANN_DATASETS)
DATASETS.update(BVB_DATASETS)
DATASETS.update(ART_DATASETS)
8 changes: 4 additions & 4 deletions bigvectorbench/runner.py
@@ -132,7 +132,7 @@ def batch_query(
"""
exprs = None
if filter_expr_func is not None:
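        # Label rows may carry an extra dimension; flatten them so the filter
        # expression receives one scalar argument per label.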
        exprs = [filter_expr(*(labels.flatten())) for labels in X_labels]
    # TODO: consider using a dataclass to represent return value.
    if prepared_queries:
        algo.prepare_batch_query(X, count, exprs)
@@ -207,7 +207,7 @@ def single_multi_vector_query(vs: np.ndarray):
            results = batch_query(X_test, X_test_label)
        else:
            results = [
                single_query(x, labels.flatten())
                for x, labels in zip(X_test, X_test_label)
            ]
    else:
@@ -267,7 +267,7 @@ def run_individual_insert(
    else:
        for i, (x, labels) in enumerate(zip(X_test, X_test_label)):
            start = time.time()
            algo.insert(x, labels.flatten())
            latencies.append(time.time() - start)
            if i % 1000 == 0:
                print(f"Processed {i}/{len(X_test)} inserts...")
@@ -305,7 +305,7 @@ def run_individual_update(
    for i, (x, labels) in enumerate(zip(X_test, X_test_label)):
        idx = np.random.randint(num_entities)
        start = time.time()
        algo.update(idx, x, labels.flatten())
        latencies.append(time.time() - start)
        if i % 1000 == 0:
            print(f"Processed {i}/{len(X_test)} updates...")