forked from apache/cassandra
CNDB-16055: Use tmpFileFor in CompactionGraph to prevent CNDB from remote loading #2133
Merged
Conversation
JeremiahDJordan
approved these changes
Nov 19, 2025
jkni
approved these changes
Nov 19, 2025
jkni
left a comment
LGTM
❌ Build ds-cassandra-pr-gate/PR-2133 rejected by Butler2: regressions found.
Found 2 new test failures. No known test failures found.
michaelsembwever
pushed a commit
that referenced
this pull request
Dec 9, 2025
…action (#2030) + CNDB-16055: Use tmpFileFor in CompactionGraph to prevent CNDB from remote loading (#2133)

Fixes: riptano/cndb#15469
CNDB PR: riptano/cndb#15813

This PR integrates the jvector NVQ feature into SAI vector indexes built via compaction. This feature is disabled by default (by `cassandra.sai.vector.enable_nvq`) to continue providing the best recall when storage savings is not an explicit concern.

The jvector library describes NVQ with:

> Support for Non-uniform Vector Quantization (NVQ, pronounced as "new vec"). This new technique quantizes the values in each vector with high accuracy by first applying a nonlinear transformation that is individually fit to each vector. These nonlinearities are designed to be lightweight and have a negligible impact on distance computation performance.

This feature is only available in SAI on disk version `EC` and later. It can be enabled by setting `cassandra.sai.vector.enable_nvq` to `true` and selecting `cassandra.sai.latest.version=ec` or greater.

When enabled, we can expect NVQ to reduce the storage footprint of the graph (stored in the `TERMS` file) because quantized vectors are stored inline instead of the full precision vectors. A possible result of storing these smaller vectors is fewer iops due to improved efficiency of a graph node fitting within a single 4 kb page.

We do not have any new metrics exposed to track this feature beyond disk utilization. When troubleshooting, this log line will help determine what features an on disk graph is using:

```java
logger.debug("Opened graph for {} for sstable row id offset {} with {} features", source, segmentMetadata.segmentRowIdOffset, features);
```

NVQ will be in the list if it is in use.

tl;dr: NVQ works for earlier versions of CC because the on disk format hasn't changed and jvector knows how to read it. If you enable NVQ on CC without this PR and with `ann_use_synthetic_score = true`, you might see out of order results.

One side effect of NVQ is that the NVQ vector similarity score is slightly different than the full precision score. This is primarily a problem when the synthetic score is in use (`cassandra.sai.ann_use_synthetic_score`) because the synthetic score was based on the score from the index. Now that this score does not necessarily equal the FP sim score, we must compute the FP sim score before sending the synthetic score to the coordinator. Otherwise, we will end up with out of order vectors. Because older versions of CC do not correct for this, it is possible to send the wrong score to the coordinator. However, because this feature is disabled by default, there is not really a risk of sending the wrong score.

CNDB-16055: Use tmpFileFor in CompactionGraph to prevent CNDB from remote loading (#2133)

Fixes: riptano/cndb#16055

#2030 introduced a bug for CNDB due to the way we inject the file reader. The solution is actually quite simple, I just needed to follow a convention that I wasn't familiar with. By calling `tmpFileFor`, I get the proper file extension (which cleans the file in case of restart) and I get the local only file.
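The synthetic score issue described in the commit message above can be illustrated with a toy sketch. It assumes a crude rounding quantizer as a stand-in for a lossy encoding (it is not NVQ), dot-product similarity, and hypothetical names throughout; none of this is the SAI or jvector code. It only shows that ranking by scores computed on quantized vectors can disagree with ranking by full-precision scores, which is why the full-precision score has to be computed before it is sent to the coordinator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not the SAI or jvector implementation.
final class RescoreSketch {
    // Dot-product similarity between two vectors of equal length.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Crude lossy quantizer (assumption): round each component to a multiple of 0.5.
    static float[] quantize(float[] v) {
        float[] q = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            q[i] = Math.round(v[i] / 0.5f) * 0.5f;
        }
        return q;
    }

    // Return vector indices ordered by descending score.
    static List<Integer> rank(float[] scores) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            ids.add(i);
        }
        ids.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return ids;
    }

    public static void main(String[] args) {
        float[] query = {1f, 1f};
        float[][] vectors = {{1.2f, 0f}, {0.8f, 0.3f}};

        float[] approx = new float[vectors.length];
        float[] exact = new float[vectors.length];
        for (int i = 0; i < vectors.length; i++) {
            approx[i] = dot(query, quantize(vectors[i])); // score as seen by a quantized index
            exact[i] = dot(query, vectors[i]);            // full-precision score
        }

        System.out.println("order by quantized score:      " + rank(approx));
        System.out.println("order by full-precision score: " + rank(exact));
    }
}
```

If a replica reported the quantized score while the coordinator merged results assuming full-precision scores, the two orderings printed here would be in conflict, which matches the out-of-order symptom described above.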
michaelsembwever
pushed a commit
that referenced
this pull request
Dec 10, 2025
michaelsembwever
pushed a commit
that referenced
this pull request
Dec 12, 2025
What is the issue
Fixes: https://github.com/riptano/cndb/issues/16055
What does this PR fix and why was it fixed
#2030 introduced a bug for CNDB due to the way we inject the file reader. The solution is quite simple: I just needed to follow a convention that I wasn't familiar with. By calling `tmpFileFor`, I get the proper file extension (so the file is cleaned up in case of a restart) and a local-only file, which keeps CNDB from attempting a remote load.
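The temporary-file convention can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the Cassandra or CNDB code: the `.tmp` suffix, the class name `TmpFileSketch`, and the helpers `tmpPathFor` and `cleanLeftovers` are all hypothetical, and the real `tmpFileFor` may behave differently. The idea shown is only that a recognizable temporary suffix lets a startup sweep delete files left behind by an interrupted build.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not the Cassandra or CNDB API.
final class TmpFileSketch {
    // Assumed suffix; the extension produced by the real tmpFileFor may differ.
    static final String TMP_SUFFIX = ".tmp";

    // Derive a temporary path that sits next to the final component path.
    static Path tmpPathFor(Path finalPath) {
        return finalPath.resolveSibling(finalPath.getFileName() + TMP_SUFFIX);
    }

    // Startup sweep: delete temporary files left over from an interrupted run.
    static int cleanLeftovers(Path dir) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*" + TMP_SUFFIX)) {
            for (Path p : stream) {
                Files.delete(p);
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tmp-file-sketch");
        Path tmp = tmpPathFor(dir.resolve("graph-component.db"));
        Files.write(tmp, new byte[] {1, 2, 3}); // simulate a partially written file
        System.out.println("temporary file: " + tmp.getFileName());
        System.out.println("files cleaned on restart: " + cleanLeftovers(dir));
    }
}
```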