
Add vector search benchmarks to benchmarking suite#7399

Closed
connortsui20 wants to merge 10 commits into develop from ct/vector-bench

Conversation

@connortsui20
Contributor

Summary

Tracking issue: #7297

Adds a vector-search-bench crate similar to VectorDBBench.

The benchmark brute-forces cosine similarity search over public VectorDBBench datasets (Cohere, OpenAI, etc).

Since Vortex is not a database, we do not measure things like vector inserts and deletes; instead, we measure storage size and query throughput for four targets:

  • Hand-rolled Rust baseline over &[f32]
  • Uncompressed (canonical) Vortex
  • Default compressed (which ends up being ALPrd)
  • TurboQuant (plus Recall@10 for the lossy TurboQuant path).

Every variant goes through a correctness check against the uncompressed scan before timing.
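For reference, the brute-force cosine scan the hand-rolled baseline performs can be sketched as follows. This is illustrative only; `cosine` and `count_matches` are hypothetical names, not the benchmark's actual code.

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for (&x, &y) in a.iter().zip(b.iter()) {
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Count rows (flattened row-major in `vectors`) whose similarity to
/// `query` exceeds `threshold` -- the brute-force filter the bench times.
fn count_matches(vectors: &[f32], dim: usize, query: &[f32], threshold: f32) -> usize {
    vectors
        .chunks_exact(dim)
        .filter(|row| cosine(row, query) > threshold)
        .count()
}

fn main() {
    let dim = 4;
    // Two rows: the query itself (cosine = 1.0) and an orthogonal vector.
    let vectors = [1.0f32, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0];
    let query = [1.0f32, 0.0, 0.0, 0.0];
    println!("{}", count_matches(&vectors, dim, &query, 0.9)); // prints 1
}
```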

Not sure this makes sense as a per-PR benchmark yet, since it only exercises one very specific array tree that will have a very specific optimized implementation.

Testing

N/A

@connortsui20 connortsui20 added the changelog/feature A new feature label Apr 11, 2026
@connortsui20 connortsui20 force-pushed the ct/vector-bench branch 5 times, most recently from 4362b28 to c37689c Compare April 13, 2026 15:06
@connortsui20 connortsui20 force-pushed the ct/vector-bench branch 3 times, most recently from 851ec23 to c06ba43 Compare April 14, 2026 17:28
@connortsui20 connortsui20 marked this pull request as ready for review April 14, 2026 17:30
@connortsui20 connortsui20 enabled auto-merge (squash) April 14, 2026 17:33
@connortsui20 connortsui20 disabled auto-merge April 14, 2026 17:33
connortsui20 added a commit that referenced this pull request Apr 14, 2026

## Summary

Tracking issue: #7297

Optimizes the inner product with a manual partial-sum decomposition (the
compiler cannot perform this transformation itself because float addition
is not associative).
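The partial-sum trick can be sketched like this (illustrative, not the PR's actual code): splitting the accumulation into independent lanes lets the CPU pipeline the additions, a reordering the compiler may not do on its own for floats.

```rust
/// Dot product with four independent partial sums. Because float addition
/// is not associative, a single-accumulator loop forms one serial dependency
/// chain; splitting into four lanes breaks that chain manually.
fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    let mut sums = [0.0f32; 4];
    let chunks = a.len() / 4 * 4;
    for i in (0..chunks).step_by(4) {
        for lane in 0..4 {
            sums[lane] += a[i + lane] * b[i + lane];
        }
    }
    // Handle the remainder that doesn't fill a full group of four.
    let tail: f32 = (chunks..a.len()).map(|i| a[i] * b[i]).sum();
    tail + sums.iter().sum::<f32>()
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0];
    let b = [1.0f32; 5];
    println!("{}", dot_unrolled(&a, &b)); // prints 15
}
```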

Also removes the old benchmarks, as they no longer time the right thing. The
real benchmarks will be finished in #7399. This change also adds
the `vortex-tensor/src/vector_search.rs` file to support that work.

## Testing

N/A

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20 connortsui20 marked this pull request as draft April 14, 2026 21:14
@codspeed-hq

codspeed-hq bot commented Apr 15, 2026

Merging this PR will degrade performance by 14.69%

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1151 untouched benchmarks
🆕 10 new benchmarks
⏩ 1455 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | old_alp_prim_test_between[f32, 32768] | 269.3 µs | 223.2 µs | +20.66% |
| Simulation | new_alp_prim_test_between[f64, 16384] | 127.7 µs | 149.7 µs | -14.69% |
| 🆕 Simulation | turboquant_decompress_dim128_4bit | N/A | 5.5 ms | N/A |
| 🆕 Simulation | turboquant_decompress_dim1024_8bit | N/A | 43.9 ms | N/A |
| 🆕 Simulation | turboquant_decompress_dim768_4bit | N/A | 42.1 ms | N/A |
| 🆕 Simulation | turboquant_compress_dim128_4bit | N/A | 6.6 ms | N/A |
| 🆕 Simulation | turboquant_decompress_dim1024_4bit | N/A | 43.9 ms | N/A |
| 🆕 Simulation | turboquant_compress_dim1024_2bit | N/A | 47.5 ms | N/A |
| 🆕 Simulation | turboquant_decompress_dim1024_2bit | N/A | 43.9 ms | N/A |
| 🆕 Simulation | turboquant_compress_dim768_4bit | N/A | 52.1 ms | N/A |
| 🆕 Simulation | turboquant_compress_dim1024_4bit | N/A | 52.6 ms | N/A |
| 🆕 Simulation | turboquant_compress_dim1024_8bit | N/A | 62.8 ms | N/A |

Comparing ct/vector-bench (b6ea352) with develop (4a5b7d7)

Open in CodSpeed

Footnotes

  1. 1455 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

connortsui20 and others added 9 commits April 15, 2026 09:24
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Useful for callers that want explicit, scheme-by-scheme control over the
compressor — for example, the vector-search benchmark wants `empty()`
for a vortex-uncompressed flavor and `empty().with_turboquant()` for a
TurboQuant-only flavor.
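The scheme-by-scheme builder pattern this commit describes can be sketched generically (a hypothetical `CompressorBuilder` standing in for the real `BtrBlocksCompressorBuilder`; the actual API lives in vortex-btrblocks):

```rust
/// Hypothetical stand-in for the builder: `empty()` starts with no
/// compression schemes, and each `with_*` method opts one in explicitly.
struct CompressorBuilder {
    schemes: Vec<&'static str>,
}

impl CompressorBuilder {
    fn empty() -> Self {
        Self { schemes: Vec::new() } // nothing enabled by default
    }

    fn with_turboquant(mut self) -> Self {
        self.schemes.push("turboquant");
        self
    }
}

fn main() {
    let uncompressed = CompressorBuilder::empty();
    let turboquant = CompressorBuilder::empty().with_turboquant();
    println!("{} {}", uncompressed.schemes.len(), turboquant.schemes.len()); // prints "0 1"
}
```

The design point is that callers like the vector-search bench get a known-empty starting state rather than having to subtract schemes from a default set.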

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Replaces vortex-bench/src/vector_dataset.rs with a four-module package:

  catalog.rs   the static VectorDataset enum with per-dataset
               metadata (dim, num_rows, element_ptype, metric,
               layouts, has_neighbors, has_scalar_labels)
  layout.rs    TrainLayout (Single, SingleShuffled, Partitioned,
               PartitionedShuffled), LayoutSpec, VectorMetric
  download.rs  URL builders + idempotent download driver returning
               DatasetPaths { train_files, test, neighbors }
  paths.rs     local cache layout under vortex-bench/data/vector-search/

The catalog now covers all 16 published VectorDBBench corpora — including
the partitioned cohere-large-10m, openai-large-5m, bioasq-large-10m,
sift-large-50m, and laion-large-100m datasets that the previous single-file
catalog couldn't model — and is parameterized over layout so callers can
pick the hosted shape per dataset.

The example and the (now-stub) vector-search-bench crate are updated to
use the new API; the bench is rebuilt from scratch in subsequent commits.
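The catalog pattern described above can be sketched as a static enum with per-dataset metadata accessors. Names and metadata values here are illustrative, not the crate's actual API:

```rust
/// Illustrative stand-in for catalog.rs: each variant is one hosted corpus,
/// and methods return its static metadata. Values are approximate examples.
#[derive(Clone, Copy, Debug)]
enum VectorDataset {
    CohereSmall100k,
    OpenAiLarge5m,
}

impl VectorDataset {
    fn dim(self) -> usize {
        match self {
            VectorDataset::CohereSmall100k => 768,
            VectorDataset::OpenAiLarge5m => 1536,
        }
    }

    fn num_rows(self) -> usize {
        match self {
            VectorDataset::CohereSmall100k => 100_000,
            VectorDataset::OpenAiLarge5m => 5_000_000,
        }
    }
}

fn main() {
    let d = VectorDataset::OpenAiLarge5m;
    println!("{:?}: dim={}, rows={}", d, d.dim(), d.num_rows());
}
```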

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Two flavors:

  vortex-uncompressed  BtrBlocksCompressorBuilder::empty()
  vortex-turboquant    BtrBlocksCompressorBuilder::empty().with_turboquant()

The TurboQuant flavor extends the default file ALLOWED_ENCODINGS with the
two scalar-fn array IDs the scheme emits (L2Denorm, SorfTransform) so the
write strategy will accept the L2Denorm(SorfTransform(...)) tree.

Wires the unstable_encodings feature through the bench crate's Cargo.toml
so vortex-btrblocks::with_turboquant is available.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Pieces added:

  vortex-bench/src/conversions.rs::write_parquet_as_vortex_with_options
                       streams a parquet file into a Vortex file using
                       caller-provided VortexWriteOptions
  src/session.rs       process-wide VortexSession with the tensor scalar-fn
                       array plugins registered (env-gated)
  src/paths.rs         per-flavor vortex path translator
  src/ingest.rs        per-chunk transform: project emb, wrap as
                       Extension<Vector<f32>>, lossy cast f64→f32, optional
                       scalar_labels passthrough
  src/prepare.rs       per-flavor driver: streams every train shard through
                       the ChunkTransform into one .vortex file per shard,
                       idempotent, sequential, sums wall-time / byte counters

The transform always produces f32 vectors so all downstream code (scan,
recall, handrolled baseline) drops the f32/f64 dual-pathing the previous
benchmark carried.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
  query_scalar          wrap a query &[f32] as a Scalar::extension::<Vector>
                        suitable for use as a lit() RHS
  similarity_filter     gt(cosine_similarity(col("emb"), lit(query)),
                           lit(threshold))
  emb_projection        col("emb"), used by the throughput-only scan path

Also adds an end-to-end smoke test under tests/end_to_end_smoke.rs that
writes a synthetic Struct { emb: Vector<f32, dim> } to a real .vortex file
under both flavors and runs the filter expression through file.scan(). The
self-matching row (cosine = 1.0) must survive any reasonable threshold —
this is the first proof the write strategy and the filter pipeline agree.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Pieces added:

  src/scan_util.rs        median() helper shared between scan and handrolled
  src/scan.rs             per-iteration vortex file scan driver — re-opens
                          every shard fresh per iteration, drains the stream,
                          tracks best-of / median across runs
  src/query.rs            sampler that pulls one query vector from
                          test.parquet (seeded random row, f64 → f32 cast
                          when needed)
  src/handrolled.rs       sequential parquet scan baseline + 4-way unrolled
                          cosine loop; takes query as parameter, f32-only
  src/handrolled_decode.rs
                          parquet → flat Vec<f32> decoder (List, LargeList,
                          FixedSizeList — Float32 + Float64 narrowing)
  src/display.rs          local column-per-flavor renderer (compress wall,
                          input/output bytes, ratio, scan best/median,
                          matches, throughput)
  src/main.rs             clap CLI: --dataset (single), --layout (validated),
                          --flavors (comma list incl. handrolled), iterations,
                          threshold, query-seed; orchestrates download →
                          prepare → query → scan → render

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
  src/recall.rs        per-flavor recall driver: samples N test rows,
                       runs brute-force top-K cosine over every shard via
                       a bounded BinaryHeap, compares against the
                       neighbors.parquet ground truth, reports mean +
                       p05 recall
  src/main.rs          --recall / --recall-k / --recall-queries /
                       --recall-seed flags; bails when the dataset has
                       no neighbors hosted; skips lossless flavors
                       (trivially 1.0)
  src/display.rs       extra recall@K (mean) and (p05) rows, only emitted
                       when --recall produced results

  tests/recall_smoke.rs
                       8-row standard-basis dataset where train row i is
                       basis e_i and neighbors_id[i] = i. Lossless flavor
                       must hit recall@1 = 1.0.
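The bounded-BinaryHeap top-K described above can be sketched with the standard library (illustrative only; the real driver lives in src/recall.rs):

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

/// (similarity, row id) with a total order, since f32 alone isn't Ord.
struct Score(f32, usize);

impl PartialEq for Score {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Score {}
impl PartialOrd for Score {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Score {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0).then(self.1.cmp(&other.1))
    }
}

/// Keep only the K highest-scoring row ids. `Reverse` flips Rust's max-heap
/// into a min-heap, so once the heap exceeds K, `pop` evicts the current
/// minimum -- the heap never grows past K+1 entries.
fn top_k(scores: impl IntoIterator<Item = (usize, f32)>, k: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Reverse<Score>> = BinaryHeap::with_capacity(k + 1);
    for (id, s) in scores {
        heap.push(Reverse(Score(s, id)));
        if heap.len() > k {
            heap.pop();
        }
    }
    let mut ids: Vec<usize> = heap.into_iter().map(|Reverse(Score(_, id))| id).collect();
    ids.sort_unstable();
    ids
}

fn main() {
    let scores = [(0, 0.1f32), (1, 0.9), (2, 0.5), (3, 0.8)];
    println!("{:?}", top_k(scores, 2)); // prints [1, 3]
}
```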

README is fully rewritten to reflect the new on-disk file-scan
benchmark, the layout / partitioned model, the f32-only pipeline, and
the future-work backlog.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
connortsui20 added a commit that referenced this pull request Apr 15, 2026
## Summary

Tracking issue: #7297

We will want to add vector benchmarking soon (see
#7399 for a draft).

This adds a simple catalog for the vector datasets hosted at
`https://assets.zilliz.com/benchmark` for
[VectorDBBench](https://github.com/zilliztech/vectordbbench). The catalog
describes the shape of each dataset (whether it is partitioned or randomly
shuffled, whether neighbor lists for top-k queries are hosted, etc.) and
also handles downloading everything.

I verified that all of this was correct by listing the S3 buckets directly:

```sh
aws s3 ls s3://assets.zilliz.com/benchmark/ --region us-west-2 --no-sign-request
```

<details>

```sh
for d in bioasq_large_10m bioasq_medium_1m cohere_large_10m cohere_medium_1m \
         cohere_small_100k gist_medium_1m gist_small_100k glove_medium_1m \
         glove_small_100k laion_large_100m  \
         openai_large_5m openai_medium_500k openai_small_50k \
         sift_large_50m sift_medium_5m sift_small_500k; do
  echo "=== $d ==="
  aws s3 ls s3://assets.zilliz.com/benchmark/$d/ --region us-west-2 --no-sign-request
done
```

</details>

And this script from the main repo helped too:
https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/dataset.py

---

Things that are not implemented that I would like to add:

- Whether each dataset is pre-normalized for cosine similarity. This is not
obvious without actually working with the data, so I will check later.
- Some datasets have scalar labels for every vector, meant to mimic
similarity search filtered by another column, and some also host neighbor
lists for those specific filtered queries. That is something we'll probably
want to add in the future.
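For the normalization question, a check like the following could settle it per corpus (illustrative helper, not part of this PR): if every row passes, cosine similarity reduces to a plain dot product and the norm terms can be skipped.

```rust
/// True if `v` has (approximately) unit L2 norm. Comparing the squared
/// norm against 1.0 avoids a sqrt per row.
fn is_unit_norm(v: &[f32], eps: f32) -> bool {
    let norm_sq: f32 = v.iter().map(|x| x * x).sum();
    (norm_sq - 1.0).abs() <= eps
}

fn main() {
    println!("{}", is_unit_norm(&[0.6, 0.8], 1e-5)); // prints true
    println!("{}", is_unit_norm(&[1.0, 1.0], 1e-5)); // prints false
}
```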

## Testing

N/A

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>