vector-store: return similarity scores #354

Open
QuerthDP wants to merge 3 commits into scylladb:master from QuerthDP:return-similarity-scores

Conversation

@QuerthDP
Member

@QuerthDP QuerthDP commented Jan 27, 2026

Extend the API to return similarity scores calculated
from the distance values and index's space type.

The similarity scores are needed to implement the optimization
to return similarity scores pre-calculated in Vector Store
when calling similarity_*() functions within ANN vector queries
with the exact same parameters as the ANN request.

The conversion from distance to similarity score is based on
comparing USearch's distance calculations with jVector's
similarity score calculations (which both Scylla and Cassandra use).

References: VECTOR-469

@QuerthDP QuerthDP force-pushed the return-similarity-scores branch from 0c4570d to 0fc4799 Compare January 27, 2026 15:25
@QuerthDP QuerthDP changed the title Return similarity scores vector-store: return similarity scores Jan 27, 2026
@QuerthDP QuerthDP marked this pull request as ready for review January 27, 2026 15:26
@QuerthDP QuerthDP requested review from ewienik, knowack1 and smoczy123 and removed request for ewienik January 27, 2026 15:26
@QuerthDP QuerthDP force-pushed the return-similarity-scores branch from 0fc4799 to ef9dfbe Compare January 28, 2026 09:53
@QuerthDP
Member Author

Changelog:

  • remove the GetSpaceType message and use GetIndexIdsR instead

@QuerthDP QuerthDP force-pushed the return-similarity-scores branch from ef9dfbe to d0a64ac Compare January 28, 2026 11:15
@QuerthDP QuerthDP requested a review from ewienik January 28, 2026 11:15
@QuerthDP
Member Author

Changelog:

  • refactor the Distance struct to check on creation that the distance value is finite and >= 0
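
For illustration, the check described in this changelog entry could look roughly like the sketch below. The `Distance` name comes from the PR, but the constructor signature, the `String` error type, and the `get` accessor are assumptions made for this example and are not taken from the patch.

```rust
/// Hypothetical validated distance wrapper (a sketch, not the PR's code).
/// The inner field is private, so callers have to go through `new`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Distance(f32);

impl Distance {
    /// Accepts only finite, non-negative values; rejects NaN, infinities
    /// and negative numbers. The `String` error type is an assumption.
    pub fn new(value: f32) -> Result<Self, String> {
        if value.is_finite() && value >= 0.0 {
            Ok(Self(value))
        } else {
            Err(format!("invalid distance: {value}"))
        }
    }

    /// Returns the validated raw value (hypothetical accessor).
    pub fn get(self) -> f32 {
        self.0
    }
}
```

With a constructor like this, `Distance::new(0.25)` succeeds, while `Distance::new(-1.0)` and `Distance::new(f32::NAN)` return an error instead of letting an invalid distance reach the similarity conversion.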

@QuerthDP QuerthDP force-pushed the return-similarity-scores branch from d0a64ac to d08bc63 Compare January 28, 2026 11:30
@QuerthDP
Member Author

Changelog:

  • add unit test for Distance validation

Copilot AI left a comment

Pull request overview

Extends the Vector Store ANN API to return similarity scores in addition to distances, deriving scores from distance values and the index space type.

Changes:

  • Add SimilarityScore type and compute similarity scores from ANN distances based on SpaceType.
  • Propagate space_type through the engine and expose similarity scores in the HTTP ANN response and OpenAPI spec.
  • Update index backends and clients/tests to handle validated distances and the expanded ANN response.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| crates/vector-store/tests/integration/usearch.rs | Updates integration tests to handle the additional similarity_scores return value. |
| crates/vector-store/tests/integration/opensearch.rs | Updates integration test to validate presence/shape of similarity scores in ANN results. |
| crates/vector-store/src/lib.rs | Introduces SimilarityScore and adds Distance::new validation + related unit test. |
| crates/vector-store/src/index/usearch.rs | Validates distances from USearch and propagates possible distance validation errors via iterators. |
| crates/vector-store/src/index/opensearch.rs | Validates ANN distances parsed from OpenSearch results and returns errors on invalid values. |
| crates/vector-store/src/httproutes.rs | Extends ANN HTTP response schema and computes similarity_scores using per-index space_type. |
| crates/vector-store/src/engine.rs | Stores SpaceType alongside index actors and returns it via get_index_ids(). |
| crates/httpclient/src/lib.rs | Extends HttpClient::ann to return (primary_keys, distances, similarity_scores). |
| api/openapi.json | Updates OpenAPI schema/description to include similarity_scores and SimilarityScore schema. |


@QuerthDP QuerthDP force-pushed the return-similarity-scores branch from d08bc63 to 40edafc Compare January 28, 2026 13:01
@QuerthDP
Member Author

Changelog:

  • fix validator test using unnormalized vectors with dot product similarity
  • move Distance and Similarity to separate modules to disallow creating them by value
  • add tests for Distance to Similarity conversion
  • resolve Copilot's comments

ewienik previously approved these changes Jan 28, 2026
@ewienik
Collaborator

ewienik commented Jan 28, 2026

@knowack1 @m-szymon Do you have any comments?

knowack1 previously approved these changes Jan 29, 2026
Collaborator

@knowack1 knowack1 left a comment

Looks super good. One nit comment; it could be done as a follow-up if you find it useful.

Add support for binary quantization space types.
As of this patch, the only supported type is Hamming.
@QuerthDP QuerthDP dismissed stale reviews from knowack1 and ewienik via bb56708 February 10, 2026 12:59
@QuerthDP QuerthDP force-pushed the return-similarity-scores branch from 40edafc to bb56708 Compare February 10, 2026 12:59
@QuerthDP
Member Author

Changelog:

  • refactored Distance to validate its values according to the space type of the index
  • added support for Hamming space type and its similarity score
  • rebased on master

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.


@QuerthDP QuerthDP force-pushed the return-similarity-scores branch from bb56708 to 24cb867 Compare February 10, 2026 13:59
@QuerthDP
Member Author

Changelog:

  • check upper boundary for Hamming distance
  • clamp the similarity score of DotProduct to [0.0, 1.0]
  • return Dimension as an option, as the code doesn't prevent the creation of 0-dimensional vectors

@QuerthDP QuerthDP force-pushed the return-similarity-scores branch from 24cb867 to bb61f61 Compare February 10, 2026 14:19
@QuerthDP
Member Author

Changelog:

  • remove the similarity clamp to stay compatible with Cassandra

Add a check when creating a `Distance` value to ensure that
distances corresponding to the space types satisfy the mathematical
boundaries.

Move `Distance` to a separate module to disallow creation by value.

Extend the API to return similarity scores calculated
from the distance values and index's space type.

The similarity scores are needed to implement the optimization
to return similarity scores pre-calculated in Vector Store
when calling `similarity_*()` functions within ANN vector queries
with the exact same parameters as the ANN request.

The conversion from distance to similarity score is based on
comparing USearch's distance calculations with jVector's
similarity score calculations (which both Scylla and Cassandra use).

The conversion from Hamming distance to Hamming similarity was
introduced by us as the quotient of matched bits and all stored bits,
where the number of stored bits is the number of vector dimensions
rounded up to the nearest multiple of 8 (as bits are packed into bytes).

References: VECTOR-469
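
As a rough sketch of the conversion this commit message describes, the formulas quoted later in the review thread can be written as one function. `SpaceType` is named in the PR, but the free function `similarity_score`, its signature, and passing the dimension count as a plain `usize` are assumptions for illustration only. The Hamming arm follows the "matched bits over stored bits" wording above and assumes `dimensions > 0`.

```rust
/// Space types mentioned in this PR (variant names assumed for the sketch).
#[derive(Debug, Clone, Copy)]
pub enum SpaceType {
    Cosine,
    DotProduct,
    Euclidean,
    Hamming,
}

/// Hypothetical helper: converts a distance reported by the index into a
/// similarity score. `dimensions` is only used for Hamming and is assumed
/// to be greater than zero.
pub fn similarity_score(space: SpaceType, distance: f32, dimensions: usize) -> f32 {
    match space {
        // Formula quoted in the review thread for Cosine and DotProduct.
        SpaceType::Cosine | SpaceType::DotProduct => (2.0 - distance) / 2.0,
        // Formula quoted in the review thread for Euclidean.
        SpaceType::Euclidean => 1.0 / (1.0 + distance),
        SpaceType::Hamming => {
            // Stored bits: dimensions rounded up to a multiple of 8.
            let stored_bits = ((dimensions + 7) / 8 * 8) as f32;
            // Matched bits divided by all stored bits.
            (stored_bits - distance) / stored_bits
        }
    }
}
```

Under these assumptions, a 57-dimensional Hamming index stores 64 bits, so a distance of 0 maps to a score of 1.0 and the largest reachable distance of 57 maps to 7/64 rather than 0.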
@QuerthDP QuerthDP force-pushed the return-similarity-scores branch from bb61f61 to 6fe0458 Compare February 11, 2026 14:07
@QuerthDP
Member Author

Changelog:

  • add httproutes::Distance to split the API related Distance struct from internal Distance

@QuerthDP QuerthDP requested a review from ewienik February 11, 2026 14:08
@QuerthDP
Member Author

@m-szymon please take a look if you're OK with the validation now

Distance::Cosine(_) | Distance::DotProduct(_) => (2.0 - d) / 2.0,
Distance::Euclidean(_) => 1.0 / (1.0 + d),
Distance::Hamming((_, dim)) => {
// Round up to nearest multiple of 8 to reflect that we store hamming distances packed in bytes

I don't understand why we care about packing here. Isn't it just padding? If we define 57 dimensions, is it possible to get a distance of 64?

Collaborator

When you have a 57-dimensional vector, after quantization you will get 64 bits representing this vector. We always round up to a full byte.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the additional 7 bits are just padding. They will be the same for all vectors.

[zero] => [0, 0, ... 0, 0, 0, 0, 0, 0, 0, 0]
[one]  => [1, 1, ... 1, 0, 0, 0, 0, 0, 0, 0]

max distance is still 57.

Collaborator

You are right! You will never get a distance of 64 from USearch, as the padding bits are always equal.
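
The point agreed on in this thread can be checked with a small standalone sketch. It packs one bit per dimension into bytes and counts differing bits with XOR. The helper names `pack_bits` and `hamming_distance` and the MSB-first bit order are assumptions for this example; they are not taken from the PR or from USearch.

```rust
/// Packs one bit per dimension into bytes, most significant bit first.
/// Unused trailing bits in the last byte stay zero (the padding).
pub fn pack_bits(bits: &[bool]) -> Vec<u8> {
    let mut bytes = vec![0u8; (bits.len() + 7) / 8];
    for (i, &bit) in bits.iter().enumerate() {
        if bit {
            bytes[i / 8] |= 0x80 >> (i % 8);
        }
    }
    bytes
}

/// Counts the bits that differ between two equally sized packed vectors.
pub fn hamming_distance(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

Packing a 57-dimensional all-zero vector and a 57-dimensional all-one vector yields 8 bytes each, and their distance is 57, because the 7 padding bits are zero in both and never contribute to the XOR.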
