Add trie-based User Defined Index (UDI) plugin #14310
zaidoon1 wants to merge 11 commits into facebook:main from
Conversation
046a09d to 507dd64 (Compare)
  user_defined_index_builder_(std::move(user_defined_index_builder)) {}

- ~UserDefinedIndexBuilderWrapper() override = default;
+ ~UserDefinedIndexBuilderWrapper() override { status_.PermitUncheckedError(); }
this matches the same pattern used: https://github.com/facebook/rocksdb/blob/main/table/block_based/block.h#L392
without it, we see: https://github.com/facebook/rocksdb/actions/runs/21760182332/job/62781432926?pr=14310
although it's not gated behind the debug builds, let me know if you want me to do the same
cc @xingbowang this is my first attempt at part 1 of many
d7d1969 to 02a5136 (Compare)
Overall, the change looks very good. Thank you for the contribution. This is Claude-generated code review feedback; I will take a closer look later.

Correctness
1. Unbounded
57723e7 to 5ef21cd (Compare)
Codex review result:

Trie UDI: Bounds Checking Findings

Commit:
Both findings are in TrieIndexIterator::CheckBounds().

Finding 1: Upper-bound pruning drops valid blocks (High)

Problem
// trie_index_factory.cc, CheckBounds()
if (comparator_->Compare(Slice(current_key_scratch_), limit) >= 0) {
return IterBoundCheck::kOutOfBound;
}

The trie index stores separator keys (produced by FindShortestSeparator()), not first-in-block keys, so a separator only upper-bounds its block's contents. The relevant UDI API contract is in user_defined_index.h:114-121.
Example

Three blocks with separators "b", "d", "f" (computed by FindShortestSeparator()). Upper bound (limit) = "d".

Block 1 contains keys like "c" and "cz", all below "d", yet its separator "d" >= limit "d", so the buggy check rejects it. The correct check: compare the previous separator ("b"), which lower-bounds block 1's keys, against the limit.

Reproducer test

TEST_F(TrieIndexFactoryTest, UpperBoundPruningDropsValidBlock) {
UserDefinedIndexOption option;
option.comparator = BytewiseComparator();
std::unique_ptr<UserDefinedIndexBuilder> udi_builder;
ASSERT_OK(factory_->NewBuilder(option, udi_builder));
// Block 0: last="az", next_first="c" → sep "b"
{
UserDefinedIndexBuilder::BlockHandle handle{0, 1000};
std::string scratch;
Slice next("c");
udi_builder->AddIndexEntry(Slice("az"), &next, handle, &scratch);
}
// Block 1: last="cz", next_first="e" → sep "d"
{
UserDefinedIndexBuilder::BlockHandle handle{1000, 1000};
std::string scratch;
Slice next("e");
udi_builder->AddIndexEntry(Slice("cz"), &next, handle, &scratch);
}
// Block 2: last="ez", no next → sep "f"
{
UserDefinedIndexBuilder::BlockHandle handle{2000, 1000};
std::string scratch;
udi_builder->AddIndexEntry(Slice("ez"), nullptr, handle, &scratch);
}
Slice index_contents;
ASSERT_OK(udi_builder->Finish(&index_contents));
std::unique_ptr<UserDefinedIndexReader> reader;
ASSERT_OK(factory_->NewReader(option, index_contents, reader));
ReadOptions ro;
auto iter = reader->NewIterator(ro);
ScanOptions scan_opts(Slice("a"), Slice("d"));
iter->Prepare(&scan_opts, 1);
IterateResult result;
ASSERT_OK(iter->SeekAndGetResult(Slice("a"), &result));
ASSERT_EQ(result.bound_check_result, IterBoundCheck::kInbound);
ASSERT_EQ(iter->value().offset, 0); // block 0
ASSERT_OK(iter->NextAndGetResult(&result));
// BUG: returns kOutOfBound because separator "d" >= limit "d".
// Should be kInbound — block 1 contains keys < "d".
ASSERT_EQ(result.bound_check_result, IterBoundCheck::kOutOfBound); // buggy
}

Fix

The root cause is that CheckBounds() uses the current separator key as the reference for the limit comparison.
Changes to the iterator class (new members plus a CheckBounds signature change):

// Add members to TrieIndexIterator:
std::string prev_key_scratch_;
bool has_prev_key_;
// Change CheckBounds signature:
IterBoundCheck CheckBounds(const Slice& reference_key) const;

Changes to the iterator implementation:

Status TrieIndexIterator::SeekAndGetResult(const Slice& target,
IterateResult* result) {
// ... (multi-scan advancement, see Finding 2) ...
has_prev_key_ = false;
if (!iter_.Seek(target)) {
result->bound_check_result = IterBoundCheck::kOutOfBound;
result->key = Slice();
return Status::OK();
}
result->key = iter_.Key();
current_key_scratch_ = result->key.ToString();
result->key = Slice(current_key_scratch_);
// Use target as reference: if target < limit, block may have keys in bounds.
result->bound_check_result = CheckBounds(target);
return Status::OK();
}
Status TrieIndexIterator::NextAndGetResult(IterateResult* result) {
// Save current separator as "previous" before advancing.
prev_key_scratch_ = current_key_scratch_;
has_prev_key_ = true;
if (!iter_.Next()) {
result->bound_check_result = IterBoundCheck::kOutOfBound;
result->key = Slice();
return Status::OK();
}
result->key = iter_.Key();
current_key_scratch_ = result->key.ToString();
result->key = Slice(current_key_scratch_);
// Use previous separator: if prev >= limit, current block is out of bounds.
result->bound_check_result = CheckBounds(Slice(prev_key_scratch_));
return Status::OK();
}
IterBoundCheck TrieIndexIterator::CheckBounds(
const Slice& reference_key) const {
if (!prepared_ || scan_opts_.empty()) {
return IterBoundCheck::kInbound;
}
if (current_scan_idx_ >= scan_opts_.size()) {
return IterBoundCheck::kOutOfBound;
}
const auto& opts = scan_opts_[current_scan_idx_];
if (opts.range.limit.has_value()) {
const Slice& limit = opts.range.limit.value();
if (comparator_->Compare(reference_key, limit) >= 0) {
return IterBoundCheck::kOutOfBound;
}
}
return IterBoundCheck::kInbound;
}

This is conservative: when the reference key is still below the limit, the block is reported as kInbound even if none of its keys end up qualifying, which is safe.

Finding 2: Multi-scan bounds applied to the wrong scan (Medium)

Problem
// trie_index_factory.cc, Prepare()
current_scan_idx_ = 0;
prepared_ = true;
// CheckBounds() always reads scan_opts_[0]:
const auto& opts = scan_opts_[current_scan_idx_]; // always index 0
Example

Five blocks with separators "b", "d", "f", "h", "j". Two scans: ["a", "c") and ["e", "g").

After scan 0 finishes (Seek("a") returns block 0, then Next() goes out of bounds), the caller seeks into scan 1's range with Seek("e"). Block 2's separator "f" is compared against scan 0's limit "c" (current_scan_idx_ is still 0) instead of scan 1's limit "g", so the block is wrongly reported out of bounds. This bug only affects multi-scan workloads. Single-scan iteration (the common case) is unaffected.

Reproducer test

TEST_F(TrieIndexFactoryTest, MultiScanBoundsAppliedToWrongScan) {
UserDefinedIndexOption option;
option.comparator = BytewiseComparator();
std::unique_ptr<UserDefinedIndexBuilder> udi_builder;
ASSERT_OK(factory_->NewBuilder(option, udi_builder));
struct BlockDef {
const char* last_key;
const char* next_first;
uint64_t offset;
};
BlockDef blocks[] = {
{"az", "c", 0},
{"cz", "e", 1000},
{"ez", "g", 2000},
{"gz", "i", 3000},
{"iz", nullptr, 4000},
};
for (const auto& b : blocks) {
UserDefinedIndexBuilder::BlockHandle handle{b.offset, 500};
std::string scratch;
if (b.next_first) {
Slice next(b.next_first);
udi_builder->AddIndexEntry(Slice(b.last_key), &next, handle, &scratch);
} else {
udi_builder->AddIndexEntry(Slice(b.last_key), nullptr, handle, &scratch);
}
}
Slice index_contents;
ASSERT_OK(udi_builder->Finish(&index_contents));
std::unique_ptr<UserDefinedIndexReader> reader;
ASSERT_OK(factory_->NewReader(option, index_contents, reader));
ReadOptions ro;
auto iter = reader->NewIterator(ro);
ScanOptions scans[] = {
ScanOptions(Slice("a"), Slice("c")), // scan 0
ScanOptions(Slice("e"), Slice("g")), // scan 1
};
iter->Prepare(scans, 2);
// Scan 0
IterateResult result;
ASSERT_OK(iter->SeekAndGetResult(Slice("a"), &result));
ASSERT_EQ(result.bound_check_result, IterBoundCheck::kInbound);
ASSERT_EQ(iter->value().offset, 0);
ASSERT_OK(iter->NextAndGetResult(&result));
ASSERT_EQ(result.bound_check_result, IterBoundCheck::kOutOfBound);
// Scan 1 — seek into second range
ASSERT_OK(iter->SeekAndGetResult(Slice("e"), &result));
// BUG: returns kOutOfBound because current_scan_idx_ is still 0,
// so CheckBounds() compares "f" against scan 0's limit "c".
// Should be kInbound — "f" < scan 1's limit "g".
ASSERT_EQ(result.bound_check_result, IterBoundCheck::kOutOfBound); // buggy
}

Fix

In SeekAndGetResult(), advance current_scan_idx_ past scans whose limit <= the seek target before checking bounds:

Status TrieIndexIterator::SeekAndGetResult(const Slice& target,
IterateResult* result) {
// Advance current_scan_idx_ past any scans whose limit <= target.
// This handles the multi-scan case where the caller seeks into a later
// scan range after the previous scan returned kOutOfBound.
if (prepared_) {
while (current_scan_idx_ < scan_opts_.size()) {
const auto& opts = scan_opts_[current_scan_idx_];
if (opts.range.limit.has_value() &&
comparator_->Compare(target, opts.range.limit.value()) >= 0) {
current_scan_idx_++;
} else {
break;
}
}
}
has_prev_key_ = false;
// ... rest of SeekAndGetResult (see Finding 1 fix) ...
}

Also reset current_scan_idx_ and has_prev_key_ in Prepare():

void TrieIndexIterator::Prepare(const ScanOptions scan_opts[],
size_t num_opts) {
// ...
current_scan_idx_ = 0;
has_prev_key_ = false;
prepared_ = true;
}

With both fixes applied, the multi-scan test verifies end-to-end:
There are a great number of unit tests added. Thank you for doing this.
f70d1bb to c08e895 (Compare)
Will do! Just working on fixing some failing builds right now. I'm running profiling code locally, identifying hotspots, and fixing them too. Will address the AI review comments and start working on the stress test after that.
4240604 to ad02d56 (Compare)
Just discussed this change with AI. I assume the index is densely packed, so that it is relatively small and can be loaded into memory completely. If I am wrong, please correct me. If this is the case, the performance is likely bottlenecked on cache misses. AI pointed out these issues:

A. Multiple Separate Arrays (Problematic)

Meantime, I am wondering whether we could do batch execution to hide the cache-miss latency. It would interleave the execution of multiple queries. This could boost the throughput significantly while losing a bit of latency. If you are interested, maybe explore this direction as well. Maybe expand the UDI to support a batch API, then extend the MultiGet API to leverage it.
You're correct, the trie index is densely packed into a single contiguous meta-block and loaded entirely into memory via the block cache. On cache misses from multiple separate arrays:
Only the child position lookup tables (s_child_start_pos_, s_child_end_pos_) and chain metadata vectors are copied into separate std::vectors during deserialization. These are small — 8 bytes per internal node. Concrete trie sizes from benchmarks:
For typical RocksDB SST files (4 KB data blocks, 64–256 MB file → 16K–64K separator keys), the entire trie fits in L2. Per-level memory accesses for one sparse iteration (the hot path):
5 memory accesses per trie level, where accesses 1–3 are into the same contiguous buffer (close addresses → likely same or adjacent cache lines) and accesses 4–5 are adjacent vector elements. With path compression, entire chains of fanout-1 nodes are skipped with a single memcmp on contiguous suffix bytes instead of 5 accesses per level.

On batch execution / MultiGet integration: A batch trie Seek could use software pipelining to interleave multiple queries, similar to what RocksDB already does for bloom filters. The UDI interface already has Prepare() which could be extended for point-lookup batches. The benefit is also uncertain without profiling a real MultiGet workload with perf stat to measure actual cache miss rates. If MultiGet keys for the same SST are sorted (which they typically are), consecutive Seeks follow similar trie paths and get good cache reuse after the first cold access. The cold first-Seek per SST is the main scenario where batch prefetching would help. I think this is worth exploring as a follow-up once the basic trie index is stable, but would want to measure actual L2/L3 miss rates on a production-like workload first to quantify the opportunity before committing to the implementation complexity.
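To illustrate the batch-execution direction discussed here, a minimal lockstep-descent sketch might look like the following. It uses a toy trie layout and hypothetical names (ToyNode, Lookup, BatchSeek) plus GCC/Clang's __builtin_prefetch; none of this is code from the PR.

#include <cstdint>
#include <string>
#include <vector>

struct ToyNode {
  int32_t children[256];  // index of child node per byte, -1 if absent
  int32_t leaf_id;        // >= 0 if a key terminates at this node, else -1
};

struct Lookup {
  const std::string* key;
  size_t depth = 0;
  int32_t node = 0;        // start at the root (node 0)
  int32_t result = -1;     // leaf id on success, -1 otherwise
  bool done = false;
};

// Advance every in-flight lookup by one trie level per round and prefetch the
// node each will touch next round, so one lookup's cache miss overlaps with
// the work done for the other lookups in the batch.
void BatchSeek(const std::vector<ToyNode>& nodes, std::vector<Lookup>& batch) {
  size_t active = batch.size();
  while (active > 0) {
    for (Lookup& lu : batch) {
      if (lu.done) continue;
      const ToyNode& n = nodes[static_cast<size_t>(lu.node)];
      if (lu.depth == lu.key->size()) {  // key exhausted: report leaf (if any)
        lu.result = n.leaf_id;
        lu.done = true;
        --active;
        continue;
      }
      const uint8_t c = static_cast<uint8_t>((*lu.key)[lu.depth++]);
      const int32_t next = n.children[c];
      if (next < 0) {  // no child on this byte: lookup fails
        lu.done = true;
        --active;
        continue;
      }
      lu.node = next;
      __builtin_prefetch(&nodes[static_cast<size_t>(next)]);  // hide the miss
    }
  }
}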
Also, optimization-wise, I'm done. I think this is as good as it gets for now as a first pass. I'll look at the stress testing next.
Implement a Fast Succinct Trie (FST) index based on LOUDS encoding as a User Defined Index plugin for RocksDB's block-based tables. This is the first step toward supporting trie-based indexing per issue facebook#12396.

The trie uses a hybrid LOUDS-Dense (upper levels, 256-bit bitmaps) + LOUDS-Sparse (lower levels, label arrays) encoding inspired by the SuRF paper (Zhang et al., SIGMOD 2018, https://www.pdl.cmu.edu/PDL-FTP/Storage/surf_sigmod18.pdf). The boundary between dense and sparse levels is automatically chosen to minimize total space.

Key components:
- Bitvector with O(1) rank and O(log n) select using a rank LUT sampled every 256 bits with popcount intrinsics (a minimal sketch of the rank LUT follows below). Uses uint32_t rank LUT entries (halving LUT memory overhead vs uint64_t, safe because trie bitvectors are bounded well below 4 billion bits). Includes word-level AppendWord() for efficient dense bitmap construction and AppendMultiple() optimized at word granularity for bulk bit fills.
- Streaming trie builder using flat per-level arrays with deferred internal marking, handle migration for prefix keys, and lazy node creation. Infers trie structure directly from sorted keys via LCP analysis in a single pass (no intermediate tree). Merged node-per-level and label-per-level cutoff computation into a single pass over the key set.
- LoudsTrie for immutable querying with BFS-ordered handle reordering built into the single-pass level-by-level serialization. Move-only semantics with correct pointer re-seating after std::string move.
- LoudsTrieIterator with rank-based trie traversal, key reconstruction from the trie path, and stack-based backtracking for Next(). Uses packed 8-byte LevelPos (is_dense flag encoded in bit 63) and autovector<LevelPos, 24> to avoid heap allocation. Key reconstruction uses a raw char buffer (key_buf_/key_len_) allocated once to MaxDepth()+1 bytes, so appending each byte is a single inlined store + increment with no function call overhead.
- TrieIndexFactory/Builder/Reader/Iterator implementing the UserDefinedIndexFactory interface.
- Zero-copy block handle loading using two fixed-width uint64_t arrays (offsets + sizes) with 8-byte alignment, enabling O(1) initialization via direct pointer assignment into the serialized data.

Seek hot path optimizations:
- Fanout-1 sparse fast path: most sparse nodes in tries built from zero-padded numeric keys have exactly one child. The Seek loop detects this case (start_pos + 1 == end_pos) and inlines a single byte comparison, avoiding the full SparseSeekLabel function call and reducing branch logic to a single comparison + conditional.
- Linear scan for small sparse nodes: SparseSeekLabel uses sequential scan for nodes with <=16 labels instead of binary search. This is faster for the common 10-child digit nodes where the branch misprediction cost of binary search outweighs the linear scan cost.
- Rank reuse: DenseLeafIndexFromRankAndHasChildRank and SparseLeafIndexFromHasChildRank overloads accept pre-computed has_child_rank from the Seek descent, avoiding redundant Rank1 calls when computing the final leaf index.

General performance optimizations:
- Select-free sparse traversal: precomputed child position lookup tables (s_child_start_pos_/s_child_end_pos_, uint32_t per internal node) eliminate Select1 calls during Seek. Sparse traversal tracks (start_pos, end_pos) directly instead of node_num, using only Rank1 (O(1)) + array lookup (O(1)) for child descent.
- Binary search for sparse label lookup (std::lower_bound) as fallback for nodes with >16 labels.
- Popcount-based Select64 via 6-step binary search within 64-bit words.
- Cached label_rank pattern to eliminate redundant Rank1 calls in hot paths (Seek, Next, Advance all cache and reuse the label_rank computed for has_child checking).
- Leaf index fast path: when there are no prefix keys (the common case), SparseLeafIndex and DenseLeafIndexFromRank skip the prefix-key Rank1 calls entirely, reducing SparseLeafIndex from 3 Rank1 calls to 1.
- MSVC portability using RocksDB's BitsSetToOne/CountTrailingZeroBits.
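For readers unfamiliar with rank LUTs, here is a minimal, illustrative sketch of the idea (cumulative popcounts sampled every 256 bits, uint32_t entries). It simplifies the PR's Bitvector and uses the GCC/Clang popcount builtin instead of RocksDB's BitsSetToOne.

#include <cstdint>
#include <vector>

class RankBitvectorSketch {
 public:
  explicit RankBitvectorSketch(std::vector<uint64_t> words)
      : words_(std::move(words)) {
    uint32_t running = 0;
    for (size_t w = 0; w < words_.size(); ++w) {
      if (w % 4 == 0) rank_lut_.push_back(running);  // every 256 bits
      running += static_cast<uint32_t>(__builtin_popcountll(words_[w]));
    }
    rank_lut_.push_back(running);  // sentinel so Rank1(num_bits) is valid
  }

  // Number of 1-bits in positions [0, pos).
  uint64_t Rank1(uint64_t pos) const {
    uint64_t rank = rank_lut_[pos / 256];
    for (uint64_t w = (pos / 256) * 4; w < pos / 64; ++w) {
      rank += static_cast<uint64_t>(__builtin_popcountll(words_[w]));
    }
    const uint64_t rem = pos % 64;
    if (rem != 0) {
      rank += static_cast<uint64_t>(
          __builtin_popcountll(words_[pos / 64] & ((uint64_t{1} << rem) - 1)));
    }
    return rank;
  }

 private:
  std::vector<uint64_t> words_;     // raw bits, 64 per word
  std::vector<uint32_t> rank_lut_;  // cumulative ones at each 256-bit boundary
};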
…ration

- Security: fix integer overflow in InitFromData (remaining/16 check)
- Correctness: fix alignment padding to use buffer offset, not pointer
- Validation: reject max_depth > 65536 to prevent key_cap overflow
- Validation: reject non-BytewiseComparator in NewBuilder/NewReader
- Safety: use abort() for deprecated virtual methods
- Memory: include aux vectors in ApproximateMemoryUsage
- Tests: add move semantics, corruption, comparator, memory tests
- db_bench: add --use_trie_index flag for benchmarking
- Style: named constants, constexpr methods, noexcept move ctor
Add unsigned suffix (u) to integer literals compared against unsigned return types (uint64_t, size_t) in ASSERT_EQ/ASSERT_GT macros. GCC with -Werror -Wsign-compare rejects mixed signed/unsigned comparisons in gtest's CmpHelper template instantiations.
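A hypothetical example of the pattern this commit applies (illustrative only, not a test from the PR; gtest assumed available as in the RocksDB test suite):

#include <cstdint>
#include <gtest/gtest.h>

// The unsigned suffix makes both sides of the comparison unsigned, so gtest's
// CmpHelperEQ instantiation no longer triggers -Wsign-compare under -Werror.
TEST(SignCompareExample, UnsignedLiteral) {
  uint64_t rank = 16;
  ASSERT_EQ(16u, rank);  // instead of ASSERT_EQ(16, rank)
}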
…no class Key changes:

- Add select1/select0 hint arrays to Bitvector for O(1) Select (was O(log n) binary search). Hints store rank LUT indices at every 256th one/zero bit, narrowing the search to a linear scan of 1-2 rank samples.
- Replace raw uint64_t block handle arrays (16 bytes/key) with packed uint32_t arrays for offsets and sizes (8 bytes/key). BFS leaf order does not match key-sorted order for keys of different lengths, so Elias-Fano cannot be used.
- Add EliasFano class (bitvector.h/cc) for compressed monotone sequences. Not used for handles due to the BFS ordering issue, but available for future use with other monotone sequences.
- Remove all debug fprintf/fflush trace statements from SerializeAll().
- Add standalone profiling tools (local-only, not tracked in repo).

Space reduction: 19.2 -> 11.3 bytes/key (41% reduction) at 32K 16-byte keys. Seek performance: ~125 ns/op (no regression).
On Linux, uint64_t is 'unsigned long' not 'unsigned long long', so %llu triggers -Werror,-Wformat with clang-18. Use the portable PRIu64 macro.
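For illustration, the portable pattern looks like this (hypothetical function, not code from the PR):

#include <cinttypes>
#include <cstdint>
#include <cstdio>

// %llu assumes unsigned long long; on Linux uint64_t is unsigned long, so
// clang-18 flags the mismatch under -Werror,-Wformat. PRIu64 expands to the
// correct conversion specifier for the platform.
void PrintKeyCount(uint64_t num_keys) {
  std::fprintf(stderr, "num_keys=%" PRIu64 "\n", num_keys);
}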
Two correctness bugs in TrieIndexIterator::CheckBounds():

1. Upper-bound pruning dropped valid blocks (High). CheckBounds() compared the current separator key against the scan limit. Since the trie stores separator keys (upper bounds on block contents), not first-in-block keys, this prematurely rejected blocks that still contained keys within the limit. Fix: use the seek target (for Seek) or previous separator (for Next) as the reference key. This matches the UDI API contract in user_defined_index.h:114-121.

2. Multi-scan bounds applied to wrong scan (Medium). current_scan_idx_ was never advanced, so all bounds checks evaluated against scan 0's limit. Fix: in SeekAndGetResult(), advance current_scan_idx_ past scans whose limit <= the seek target before checking bounds.

Added regression tests:
- UpperBoundDoesNotDropValidBlocks: verifies blocks with keys < limit are not skipped when their separator >= limit.
- MultiScanBoundsAdvanceCorrectly: verifies multi-scan iteration uses the correct scan's limit for each seek.
- Guard right-shift by low_bits_ when low_bits_==64 in EliasFano::BuildFrom to avoid undefined behavior (shift >= type width).
- Guard word-boundary crossing shift: check bit_idx > 0 before shifting by (64 - bit_idx) to avoid shift by 64.
- Initialize consumed to 0 in EliasFano::InitFromData to satisfy core.uninitialized.Assign checker.
- Remove dead stores to p and remaining at end of LoudsTrie::InitFromData.
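A minimal sketch of the kind of guard described above (illustrative, not the PR's EliasFano code):

#include <cstdint>

// Shifting a 64-bit value by 64 or more is undefined behavior in C++, so any
// computed shift amount that can reach 64 needs an explicit special case.
inline uint64_t HighPart(uint64_t value, uint32_t low_bits) {
  return (low_bits >= 64) ? 0 : (value >> low_bits);
}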
Add Bitvector::Rank1AndBit() that computes both GetBit(pos) and Rank1(pos+1) in a single pass, avoiding a redundant word access when both operations target the same bitvector position. This is the common pattern in sparse trie traversal where we check has_child then compute rank for child/leaf index lookup.

Wire Rank1AndBit into three hot paths in LoudsTrieIterator::Seek():
- Fanout-1 exact match: replaces GetBit + two separate Rank1 calls (internal vs leaf branches) with a single Rank1AndBit call
- Fanout-1 mismatch (label > target): replaces GetBit + Rank1
- General sparse path (fanout > 1): replaces GetBit + Rank1

Benchmark (32K random 16-byte hex keys, 10M lookups):
Before: ~160 ns/op
After: ~153 ns/op (~4.4% faster)
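A rough, free-standing sketch of the combined operation, assuming the rank-LUT layout sketched earlier (the PR's member function differs in detail):

#include <cstdint>
#include <utility>
#include <vector>

// Returns {bit at pos, number of 1-bits in [0, pos]} (i.e. GetBit(pos) and
// Rank1(pos + 1)) while touching the word containing pos only once.
// rank_lut holds cumulative popcounts at every 256-bit boundary.
inline std::pair<bool, uint64_t> Rank1AndBitSketch(
    const std::vector<uint64_t>& words, const std::vector<uint32_t>& rank_lut,
    uint64_t pos) {
  uint64_t rank = rank_lut[pos / 256];
  for (uint64_t w = (pos / 256) * 4; w < pos / 64; ++w) {
    rank += static_cast<uint64_t>(__builtin_popcountll(words[w]));
  }
  const uint64_t word = words[pos / 64];  // single access serves both results
  const uint64_t rem = pos % 64;
  const bool bit = (word >> rem) & 1u;
  // Count the 1-bits up to and including position pos within this word,
  // special-casing rem == 63 to avoid an undefined shift by 64.
  const uint64_t mask =
      (rem == 63) ? ~uint64_t{0} : ((uint64_t{1} << (rem + 1)) - 1);
  rank += static_cast<uint64_t>(__builtin_popcountll(word & mask));
  return {bit, rank};
}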
Implement path compression for the LOUDS-sparse trie by detecting and collapsing fanout-1 chains (sequences of single-child nodes) into single edges that can be compared with memcmp instead of per-level Rank1 traversal.

Builder changes (louds_trie.cc):
- Detect fanout-1 chains >= 8 nodes during serialization
- Store chain metadata: bitmap (1 bit per internal label), suffix bytes, chain lengths (uint16), and end-child indices (uint32)
- Filter chains with num_chains <= num_keys heuristic to disable chains for key patterns where they hurt (random hex, URLs) and enable them where they help (numeric keys with long shared prefixes)
- Apply 10% space budget cap to prevent metadata explosion

Seek changes (louds_trie.h, louds_trie.cc):
- Template SeekImpl<bool kHasChains> with if constexpr guards around all chain-handling code (~200 lines in two blocks)
- When kHasChains=false, compiler eliminates all chain code entirely, producing zero i-cache overhead for tries without chains
- Dispatch once at iterator construction via has_chains_ flag
- Follows the same pattern as RocksDB's BlockIter::ParseNextKey template

Chain matching handles all cases (see the sketch below):
- Full chain match: skip entire chain, continue at end node
- Partial match with divergence: find mismatch point, descend or advance
- Target exhausted mid-chain: descend to leftmost leaf
- Chain ending at leaf vs internal node with fanout > 1

Benchmark changes (trie_index_test.cc):
- Replace raw trie vs binary search benchmark with trie vs real RocksDB IndexBlockIter comparison using InternalKeys and production code paths

Performance (16-byte zero-padded numeric keys, 500K lookups):
- 500 keys: 97ns vs 142ns baseline (+32% faster)
- 1000 keys: 98ns vs 146ns baseline (+33% faster)
- 32000 keys: 136ns vs 170ns baseline (+20% faster)

Zero regression for non-chain patterns (hex, URL, short keys): all within noise of baseline measurements.
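To make the chain-matching cases concrete, here is an illustrative, standalone helper (not the PR's code; names are hypothetical) that classifies how a target compares against a compressed chain's suffix bytes using a single memcmp:

#include <algorithm>
#include <cstddef>
#include <cstring>

enum class ChainMatch { kFullMatch, kTargetExhausted, kTargetLess, kTargetGreater };

struct ChainMatchResult {
  ChainMatch kind;
  size_t matched;  // suffix bytes matched before divergence or exhaustion
};

// Compare the remaining target bytes against a chain's collapsed suffix.
// One memcmp replaces a per-level Rank1 descent through fanout-1 nodes.
inline ChainMatchResult MatchChain(const char* target, size_t target_len,
                                   const char* suffix, size_t suffix_len) {
  const size_t n = std::min(target_len, suffix_len);
  const int cmp = std::memcmp(target, suffix, n);
  if (cmp == 0) {
    if (target_len >= suffix_len) {
      return {ChainMatch::kFullMatch, suffix_len};  // continue at end node
    }
    // Target exhausted mid-chain: descend to the chain's leftmost leaf.
    return {ChainMatch::kTargetExhausted, target_len};
  }
  // Divergence: locate the first mismatching byte to decide descend vs advance.
  size_t i = 0;
  while (i < n && target[i] == suffix[i]) ++i;
  return {cmp < 0 ? ChainMatch::kTargetLess : ChainMatch::kTargetGreater, i};
}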
4ffdde3 to c214056 (Compare)
The blackbox crash test reuses the same DB across iterations. When use_trie_index was a lambda, it could change between iterations: iteration 1 might run without trie index (writing deletes), then iteration 2 enables trie index and fails on reopen because existing SSTs contain non-Put types. Fix by making use_trie_index a non-lambda constant, matching the pattern used by use_put_entity_one_in for the same reason.
Implement a Fast Succinct Trie (FST) index based on LOUDS encoding as a User Defined Index plugin for RocksDB's block-based tables. First step toward trie-based indexing per #12396.
The trie uses hybrid LOUDS-Dense (upper levels, 256-bit bitmaps) + LOUDS-Sparse (lower levels, label arrays) encoding inspired by the SuRF paper (Zhang et al., SIGMOD 2018). The boundary between dense and sparse levels is automatically chosen to minimize total space.
Key Components
- Bitvector with O(1) rank and O(log n) select, including word-level AppendWord() for efficient dense bitmap construction and AppendMultiple() optimized at word granularity for bulk bit fills.
- LoudsTrie for immutable querying; move-only semantics with correct pointer re-seating after std::string move.
- LoudsTrieIterator with rank-based traversal, key reconstruction from the trie path, and stack-based backtracking for Next(). Uses packed 8-byte LevelPos (is_dense flag in bit 63; see the sketch after this list) and autovector<LevelPos, 24> to avoid heap allocation. Key reconstruction uses a raw char buffer allocated once to MaxDepth()+1 bytes.
- TrieIndexFactory/Builder/Reader/Iterator implementing the UserDefinedIndexFactory interface.
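A minimal sketch of the LevelPos packing, assuming the flag lives in the top bit and the position in the low 63 bits (the PR's actual struct may differ):

#include <cstdint>

struct LevelPosSketch {
  uint64_t packed;

  static LevelPosSketch Make(uint64_t pos, bool is_dense) {
    // Position occupies the low 63 bits; the dense/sparse flag is bit 63.
    return LevelPosSketch{pos | (is_dense ? (uint64_t{1} << 63) : uint64_t{0})};
  }
  bool IsDense() const { return (packed >> 63) != 0; }
  uint64_t Pos() const { return packed & ~(uint64_t{1} << 63); }
};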
Seek Hot Path Optimizations

- Fanout-1 sparse fast path: detected via start_pos + 1 == end_pos and inlined as a single byte comparison, avoiding the full SparseSeekLabel call (see the sketch after this list).
- Linear scan for small sparse nodes: SparseSeekLabel uses sequential scan for nodes with ≤16 labels instead of binary search. Faster for common 10-child digit nodes where branch misprediction cost outweighs linear scan cost.
- Rank reuse: DenseLeafIndexFromRankAndHasChildRank and SparseLeafIndexFromHasChildRank overloads accept pre-computed has_child_rank from the Seek descent, avoiding redundant Rank1 calls.
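A minimal sketch of how the three label-lookup strategies could fit together, assuming a contiguous labels array indexed by sparse position (illustrative only; the real SparseSeekLabel and Seek loop handle more cases):

#include <algorithm>
#include <cstdint>

// Returns the first position in [start_pos, end_pos) whose label is >= target,
// or end_pos if every label is smaller (lower_bound semantics).
inline uint64_t SeekLabelSketch(const uint8_t* labels, uint64_t start_pos,
                                uint64_t end_pos, uint8_t target) {
  if (start_pos + 1 == end_pos) {
    // Fanout-1 fast path: a single byte comparison instead of any scan.
    return (labels[start_pos] >= target) ? start_pos : end_pos;
  }
  if (end_pos - start_pos <= 16) {
    // Linear scan beats binary search for tiny nodes (e.g. 10 digit children).
    uint64_t pos = start_pos;
    while (pos < end_pos && labels[pos] < target) ++pos;
    return pos;
  }
  // Fallback for larger nodes: binary search.
  return static_cast<uint64_t>(
      std::lower_bound(labels + start_pos, labels + end_pos, target) - labels);
}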
General Performance Optimizations

- Select-free sparse traversal: precomputed child position lookup tables (s_child_start_pos_/s_child_end_pos_) eliminate Select1 calls during Seek. Sparse traversal tracks (start_pos, end_pos) directly, using only Rank1 (O(1)) + array lookup (O(1)) for child descent.
- Cached label_rank pattern eliminates redundant Rank1 calls in hot paths (Seek, Next, Advance all cache and reuse the label_rank computed for has_child checking).
- Leaf index fast path: when there are no prefix keys (the common case), SparseLeafIndex and DenseLeafIndexFromRank skip the prefix-key Rank1 calls entirely, reducing from 3 to 1 Rank1 call.
- MSVC portability using RocksDB's BitsSetToOne/CountTrailingZeroBits.

Benchmark Results
Trie Seek at 32K keys (16-byte keys, 5M lookups, median of 5 runs):