vector-store: return similarity scores #354
Pull request overview
Extends the Vector Store ANN API to return similarity scores in addition to distances, deriving scores from distance values and the index space type.
Changes:
- Add `SimilarityScore` type and compute similarity scores from ANN distances based on `SpaceType`.
- Propagate `space_type` through the engine and expose similarity scores in the HTTP ANN response and OpenAPI spec.
- Update index backends and clients/tests to handle validated distances and the expanded ANN response.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crates/vector-store/tests/integration/usearch.rs | Updates integration tests to handle the additional similarity_scores return value. |
| crates/vector-store/tests/integration/opensearch.rs | Updates integration test to validate presence/shape of similarity scores in ANN results. |
| crates/vector-store/src/lib.rs | Introduces SimilarityScore and adds Distance::new validation + related unit test. |
| crates/vector-store/src/index/usearch.rs | Validates distances from USearch and propagates possible distance validation errors via iterators. |
| crates/vector-store/src/index/opensearch.rs | Validates ANN distances parsed from OpenSearch results and returns errors on invalid values. |
| crates/vector-store/src/httproutes.rs | Extends ANN HTTP response schema and computes similarity_scores using per-index space_type. |
| crates/vector-store/src/engine.rs | Stores SpaceType alongside index actors and returns it via get_index_ids(). |
| crates/httpclient/src/lib.rs | Extends HttpClient::ann to return (primary_keys, distances, similarity_scores). |
| api/openapi.json | Updates OpenAPI schema/description to include similarity_scores and SimilarityScore schema. |
knowack1 left a comment:
Looks super good. One nit, which could be handled as a follow-up if you find it useful.
Add support for binary quantization space types. As of this patch, the only supported type is Hamming.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Add a check when creating a `Distance` value to ensure that the distance satisfies the mathematical bounds of its space type. Move `Distance` to a separate module so that it cannot be constructed directly by value, bypassing the check.
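The validation idea can be sketched roughly as follows. This is not the code from the PR: the module name `distance`, the `String` error type, the reduced two-variant `SpaceType`, and the specific bounds (cosine distance within `[0, 2]`, Euclidean distance non-negative) are assumptions made for illustration. The point is that a private field inside a module forces every caller through the checked constructor.

```rust
mod distance {
    /// Reduced set of space types for this sketch (assumed, not the PR's full enum).
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub enum SpaceType {
        Cosine,
        Euclidean,
    }

    /// The inner field is private, so code outside this module
    /// cannot write `Distance(…)` and skip validation.
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct Distance(f32);

    impl Distance {
        /// Checked constructor; the bounds below are illustrative assumptions.
        pub fn new(value: f32, space: SpaceType) -> Result<Self, String> {
            if !value.is_finite() {
                return Err(format!("distance {} is not finite", value));
            }
            let in_bounds = match space {
                SpaceType::Cosine => (0.0..=2.0).contains(&value),
                SpaceType::Euclidean => value >= 0.0,
            };
            if in_bounds {
                Ok(Distance(value))
            } else {
                Err(format!("distance {} is out of range for {:?}", value, space))
            }
        }

        /// Read-only access to the validated value.
        pub fn value(self) -> f32 {
            self.0
        }
    }
}
```

With this layout, `distance::Distance::new(2.5, distance::SpaceType::Cosine)` returns an error, while a value constructed successfully can be trusted by downstream conversions.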
Extend the API to return similarity scores calculated from the distance values and the index's space type. The similarity scores are needed for an optimization: when `similarity_*()` functions are called within ANN vector queries with exactly the same parameters as the ANN request, the scores pre-calculated in Vector Store can be returned directly.
The conversion from distance to similarity score is based on comparing USearch's distance calculations with jVector's similarity score calculations (which both Scylla and Cassandra use). The conversion from Hamming distance to Hamming similarity was introduced by us as the quotient of matched bits and all stored bits, where the number of stored bits is the number of vector dimensions rounded up to the nearest multiple of 8 (as packed in bytes).
References: VECTOR-469
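The conversion described in this commit message can be sketched as below. The formulas for cosine, dot product and Euclidean are taken from the diff quoted in the review thread; the enum layout (in particular `Hamming(u32, u32)` holding differing bits and dimensions), the function name `similarity_score`, and the use of `f32` are simplifying assumptions, not the PR's actual definitions. The sketch also assumes `dim > 0`.

```rust
/// Simplified stand-in for the PR's `Distance` type (layout assumed).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Distance {
    Cosine(f32),
    DotProduct(f32),
    Euclidean(f32),
    /// (number of differing bits, number of vector dimensions)
    Hamming(u32, u32),
}

/// Map a distance to a similarity score, following the formulas in the quoted diff.
pub fn similarity_score(distance: Distance) -> f32 {
    match distance {
        // The diff applies the same mapping to cosine and dot product.
        Distance::Cosine(d) | Distance::DotProduct(d) => (2.0 - d) / 2.0,
        // Distance 0 maps to 1, and the score decays towards 0.
        Distance::Euclidean(d) => 1.0 / (1.0 + d),
        Distance::Hamming(d, dim) => {
            // Stored bits: dimensions rounded up to the nearest multiple of 8.
            let stored_bits = (dim + 7) / 8 * 8;
            // Matched bits divided by all stored bits.
            stored_bits.saturating_sub(d) as f32 / stored_bits as f32
        }
    }
}
```

For example, under these assumptions a 57-dimensional Hamming index has 64 stored bits, so a distance of 32 yields a score of 0.5.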
@m-szymon please take a look and let me know whether you're OK with the validation now.
Distance::Cosine(_) | Distance::DotProduct(_) => (2.0 - d) / 2.0,
Distance::Euclidean(_) => 1.0 / (1.0 + d),
Distance::Hamming((_, dim)) => {
    // Round up to nearest multiple of 8 to reflect that we store hamming distances packed in bytes
I don't understand why we care about packing here. Isn't it just padding? If we define 57 dimensions, is it possible to get a distance of 64?
When you have a 57-dimensional vector, quantization gives you 64 bits representing that vector. We always round up to a full byte.
Yes, but the additional 7 bits are just padding. They will be the same for all vectors.
[zero] => [0, 0, ... 0, 0, 0, 0, 0, 0, 0, 0]
[one] => [1, 1, ... 1, 0, 0, 0, 0, 0, 0, 0]
The maximum distance is still 57.
You are right! You will never get a distance of 64 from USearch, as the padding bits are always equal.
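The padding argument in this thread can be checked with a small sketch. The helper names `pack_bits` and `hamming_distance` are hypothetical, and the bit order within a byte (least significant bit first) is an assumption that may not match how USearch packs bits; the conclusion does not depend on it, since padding bits are zero in every packed vector.

```rust
/// Pack one bool per dimension into bytes, zero-padding the last byte.
/// Bit order within a byte is an assumption of this sketch.
pub fn pack_bits(bits: &[bool]) -> Vec<u8> {
    let mut bytes = vec![0u8; (bits.len() + 7) / 8];
    for (i, &bit) in bits.iter().enumerate() {
        if bit {
            bytes[i / 8] |= 1 << (i % 8);
        }
    }
    bytes
}

/// Count differing bits between two packed vectors of equal length.
pub fn hamming_distance(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

Packing an all-zero and an all-one 57-dimensional vector produces two 8-byte (64-bit) buffers whose Hamming distance is 57, because the 7 padding bits are identical in both.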