@michaeljmarshall
Member

What is the issue

Fixes: https://github.com/riptano/cndb/issues/16055

What does this PR fix and why was it fixed

#2030 introduced a bug for CNDB due to the way we inject the file reader. The solution is quite simple: I just needed to follow a convention that I wasn't familiar with. By calling `tmpFileFor`, I get the proper file extension (which ensures the file is cleaned up after a restart) and a local-only file.
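
To illustrate the convention, here is a minimal, self-contained sketch. It is not the Cassandra or CNDB code: the `TmpFiles` class, its `tmpFileFor` helper, and the `.tmp` suffix are hypothetical stand-ins. They only show why a dedicated temporary-file name makes leftover files easy to find and delete after a restart.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the convention; not the real Cassandra/CNDB API.
class TmpFiles {
    // Assumed suffix that marks a file as temporary and local-only.
    static final String TMP_SUFFIX = ".tmp";

    // Derive the temporary sibling of a component file, e.g. graph.db -> graph.db.tmp.
    static Path tmpFileFor(Path componentFile) {
        return componentFile.resolveSibling(componentFile.getFileName() + TMP_SUFFIX);
    }

    // On startup, delete any temporary files left behind by an interrupted compaction.
    static int cleanLeftovers(Path dir) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*" + TMP_SUFFIX)) {
            for (Path p : stream) {
                Files.delete(p);
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("compaction-demo");
        // Simulate a crash that leaves a temporary file on local disk.
        Files.createFile(tmpFileFor(dir.resolve("graph.db")));
        System.out.println("Deleted leftovers: " + cleanLeftovers(dir));
    }
}
```

Because the temporary name is derived from the component name in one place, the cleanup pass needs no bookkeeping beyond the suffix.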

@michaeljmarshall michaeljmarshall self-assigned this Nov 19, 2025
@github-actions

github-actions bot commented Nov 19, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@jkni jkni self-requested a review November 19, 2025 23:52
@jkni jkni left a comment

LGTM

@cassci-bot
❌ Build ds-cassandra-pr-gate/PR-2133 rejected by Butler


2 regressions found
See build details here


Found 2 new test failures

| Test | Explanation | Runs | Upstream |
|---|---|---|---|
| o.a.c.index.sai.cql.VectorCompaction100dTest.testOneToManyCompactionTooManyHoles[eb false] | NEW | 🔴 | 0 / 17 |
| o.a.c.index.sai.cql.VectorSiftSmallTest.testSiftSmall[db false] | NEW | 🔴 | 0 / 17 |

No known test failures found

@michaeljmarshall michaeljmarshall merged commit b1010a2 into main Nov 20, 2025
492 of 499 checks passed
@michaeljmarshall michaeljmarshall deleted the cndb-16055 branch November 20, 2025 04:37
michaelsembwever pushed a commit that referenced this pull request Dec 9, 2025
…action (#2030)

+ CNDB-16055: Use tmpFileFor in CompactionGraph to prevent CNDB from remote loading (#2133)

Fixes: riptano/cndb#15469
CNDB PR: riptano/cndb#15813

This PR integrates the jvector NVQ feature into SAI vector indexes built
via compaction. This feature is disabled by default (controlled by
`cassandra.sai.vector.enable_nvq`) so that we continue to provide the
best recall when storage savings are not an explicit concern. The jvector
library describes NVQ as follows:

> Support for Non-uniform Vector Quantization (NVQ, pronounced as "new
vec"). This new technique quantizes the values in each vector with high
accuracy by first applying a nonlinear transformation that is
individually fit to each vector. These nonlinearities are designed to be
lightweight and have a negligible impact on distance computation
performance.

This feature is only available in SAI on-disk version `EC` and later. It
can be enabled by setting `cassandra.sai.vector.enable_nvq` to `true`
and setting `cassandra.sai.latest.version` to `ec` or later.
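
As a rough illustration of how the two settings combine, the sketch below reads both system properties and reports whether NVQ would take effect. The `NvqGate` class is hypothetical and is not the check SAI performs internally. It also assumes that on-disk versions are two-letter codes that order lexicographically (for example `db` < `eb` < `ec`) and treats an unset version as not eligible.

```java
import java.util.Locale;

// Illustrative gate only; the real check lives inside SAI and may differ.
class NvqGate {
    // Assumption: version codes compare lexicographically, so "ec" or later qualifies.
    static boolean nvqEffective(boolean enableNvq, String latestVersion) {
        return enableNvq && latestVersion.toLowerCase(Locale.ROOT).compareTo("ec") >= 0;
    }

    public static void main(String[] args) {
        // Both property names come from the PR description above.
        boolean enabled = Boolean.getBoolean("cassandra.sai.vector.enable_nvq");
        String version = System.getProperty("cassandra.sai.latest.version", "");
        System.out.println("NVQ effective: " + nvqEffective(enabled, version));
    }
}
```

Running it with `-Dcassandra.sai.vector.enable_nvq=true -Dcassandra.sai.latest.version=ec` reports that NVQ is effective; dropping either flag reports that it is not.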

When enabled, we can expect NVQ to reduce the storage footprint of the
graph (stored in the `TERMS` file) because quantized vectors are stored
inline instead of the full precision vectors. A possible result of
storing these smaller vectors is fewer IOPS, because a graph node is
more likely to fit within a single 4 KB page.
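
The page-fit argument can be made concrete with some back-of-the-envelope arithmetic. Everything in this sketch is an assumption chosen for illustration: a 1536-dimension vector, 32 neighbors stored as 4-byte ids, a quantized encoding of one byte per dimension, and no per-vector quantization parameters or other node overhead. The actual on-disk layout may differ.

```java
// Back-of-the-envelope page math; all sizes here are illustrative assumptions.
class NodePageMath {
    static final int PAGE_BYTES = 4096;

    // Bytes for one graph node: the inline vector plus neighbor ids stored as 4-byte ints.
    static int nodeBytes(int dimension, int bytesPerDimension, int maxDegree) {
        return dimension * bytesPerDimension + maxDegree * Integer.BYTES;
    }

    // Pages touched when reading one node that starts on a page boundary.
    static int pagesPerNode(int nodeBytes) {
        return (nodeBytes + PAGE_BYTES - 1) / PAGE_BYTES;
    }

    public static void main(String[] args) {
        int fullPrecision = nodeBytes(1536, Float.BYTES, 32); // 4 bytes per dimension
        int quantized = nodeBytes(1536, 1, 32);               // assumed 1 byte per dimension
        System.out.println("full precision: " + fullPrecision + " bytes, "
                           + pagesPerNode(fullPrecision) + " page(s)");
        System.out.println("quantized: " + quantized + " bytes, "
                           + pagesPerNode(quantized) + " page(s)");
    }
}
```

Under these assumptions the full precision node needs 6272 bytes and spans two pages, while the quantized node needs 1664 bytes and fits in one.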

We do not have any new metrics exposed to track this feature beyond disk
utilization.

When troubleshooting, this log line will help determine which features an
on-disk graph is using:

```java
        logger.debug("Opened graph for {} for sstable row id offset {} with {} features", source, segmentMetadata.segmentRowIdOffset, features);
```

NVQ will be in the list if it is in use.

tl;dr: NVQ works for earlier versions of CC because the on disk format
hasn't changed and jvector knows how to read it. If you enable NVQ on CC
without this PR and with `ann_use_synthetic_score = true`, you might see
out of order results.

One side effect of NVQ is that the NVQ vector similarity score differs
slightly from the full precision (FP) score. This is primarily a
problem when the synthetic score is in use
(`cassandra.sai.ann_use_synthetic_score`) because the synthetic score
was based on the score from the index. Now that the index score does not
necessarily equal the FP similarity score, we must compute the FP
similarity score before sending the synthetic score to the coordinator.
Otherwise, we will end up with out-of-order vectors.
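
The ordering problem can be reproduced in a few lines. The sketch below is not the SAI implementation: `RescoreSketch`, `Candidate`, and the use of a dot product as the similarity function are assumptions made for illustration. It shows two candidates whose approximate index scores rank them in the opposite order from their full precision scores, and how recomputing the exact score restores the expected order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of re-scoring; not the SAI code path.
class RescoreSketch {
    // A candidate row: id, approximate score reported by the index, and full precision vector.
    static final class Candidate {
        final String id;
        final float approxScore;
        final float[] vector;

        Candidate(String id, float approxScore, float[] vector) {
            this.id = id;
            this.approxScore = approxScore;
            this.vector = vector;
        }
    }

    // Assumed similarity function: plain dot product.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Rank by the approximate (quantized) score, best first.
    static List<String> rankByApproxScore(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((x, y) -> Float.compare(y.approxScore, x.approxScore));
        return ids(sorted);
    }

    // Recompute the full precision score for each candidate, then rank best first.
    static List<String> rankByExactScore(List<Candidate> candidates, float[] query) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((x, y) -> Float.compare(dot(query, y.vector), dot(query, x.vector)));
        return ids(sorted);
    }

    private static List<String> ids(List<Candidate> candidates) {
        List<String> out = new ArrayList<>();
        for (Candidate c : candidates) out.add(c.id);
        return out;
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        List<Candidate> candidates = new ArrayList<>();
        // Quantization error makes "b" look better than "a" even though "a" is closer.
        candidates.add(new Candidate("a", 0.88f, new float[] {0.90f, 0f}));
        candidates.add(new Candidate("b", 0.91f, new float[] {0.89f, 0f}));
        System.out.println("approximate order: " + rankByApproxScore(candidates));
        System.out.println("exact order: " + rankByExactScore(candidates, query));
    }
}
```

If each replica reported the approximate scores, a coordinator merging results by score would see `b` ahead of `a`; reporting the recomputed score avoids that.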

Because older versions of CC do not correct for this, they can send the
wrong score to the coordinator when NVQ is enabled. However, because NVQ
is disabled by default, there is little risk of this happening in
practice.

CNDB-16055: Use tmpFileFor in CompactionGraph to prevent CNDB from remote loading (#2133)

Fixes: riptano/cndb#16055

#2030 introduced a bug for
CNDB due to the way we inject the file reader. The solution is
quite simple: I just needed to follow a convention that I wasn't
familiar with. By calling `tmpFileFor`, I get the proper file extension
(which ensures the file is cleaned up after a restart) and a local-only
file.
michaelsembwever pushed a commit that referenced this pull request Dec 10, 2025
michaelsembwever pushed a commit that referenced this pull request Dec 12, 2025