
Add trie-based User Defined Index (UDI) plugin#14310

Open
zaidoon1 wants to merge 11 commits into facebook:main from zaidoon1:zaidoon/trie-based-udi-index

Conversation


@zaidoon1 zaidoon1 commented Feb 6, 2026

Implement a Fast Succinct Trie (FST) index based on LOUDS encoding as a User Defined Index plugin for RocksDB's block-based tables. This is the first step toward trie-based indexing per #12396.
The trie uses hybrid LOUDS-Dense (upper levels, 256-bit bitmaps) + LOUDS-Sparse (lower levels, label arrays) encoding inspired by the SuRF paper (Zhang et al., SIGMOD 2018). The boundary between dense and sparse levels is automatically chosen to minimize total space.
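
For illustration, a hedged sketch of how such a cutoff could be chosen by total size (the function name, inputs, and per-node/per-label cost constants are assumptions in the spirit of the SuRF encoding, not the PR's actual builder code):

#include <cstdint>
#include <vector>

// Pick the level at which to switch from LOUDS-Dense to LOUDS-Sparse by
// minimizing estimated total bits. Levels [0, cutoff) are encoded dense.
inline uint32_t ChooseDenseSparseCutoff(const std::vector<uint64_t>& nodes_per_level,
                                        const std::vector<uint64_t>& labels_per_level) {
  const uint32_t num_levels = static_cast<uint32_t>(nodes_per_level.size());
  uint64_t best_bits = UINT64_MAX;
  uint32_t best_cutoff = 0;
  for (uint32_t cutoff = 0; cutoff <= num_levels; ++cutoff) {
    uint64_t bits = 0;
    for (uint32_t l = 0; l < cutoff; ++l) {
      bits += nodes_per_level[l] * 2 * 256;   // dense: label + has_child bitmaps per node
    }
    for (uint32_t l = cutoff; l < num_levels; ++l) {
      bits += labels_per_level[l] * (8 + 2);  // sparse: 8-bit label + 2 flag bits per label
    }
    if (bits < best_bits) {
      best_bits = bits;
      best_cutoff = cutoff;
    }
  }
  return best_cutoff;
}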

Key Components

  • Bitvector with O(1) rank and O(log n) select using a rank LUT sampled every 256 bits with popcount intrinsics. Uses uint32_t rank LUT entries (halving memory vs uint64_t). Includes word-level AppendWord() for efficient dense bitmap construction and AppendMultiple() optimized at word granularity for bulk bit fills. (A rank sketch follows this list.)
  • Streaming trie builder using flat per-level arrays with deferred internal marking, handle migration for prefix keys, and lazy node creation. Infers trie structure directly from sorted keys via LCP analysis in a single pass (no intermediate tree).
  • LoudsTrie for immutable querying with BFS-ordered handle reordering built into single-pass level-by-level serialization. Move-only semantics with correct pointer re-seating after std::string move.
  • LoudsTrieIterator with rank-based traversal, key reconstruction from trie path, and stack-based backtracking for Next(). Uses packed 8-byte LevelPos (is_dense flag in bit 63) and autovector<LevelPos, 24> to avoid heap allocation. Key reconstruction uses a raw char buffer allocated once to MaxDepth()+1 bytes.
  • TrieIndexFactory/Builder/Reader/Iterator implementing the UserDefinedIndexFactory interface.
  • Zero-copy block handle loading using two fixed-width uint64_t arrays (offsets + sizes) with 8-byte alignment, enabling O(1) initialization via direct pointer assignment.
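
For reference, a minimal sketch of the rank-LUT idea behind the Bitvector bullet above (class and member names are illustrative and assume C++20 std::popcount; this is not the PR's actual Bitvector):

#include <bit>
#include <cstdint>
#include <vector>

class RankBitvectorSketch {
 public:
  // Rank1(pos) = number of set bits in [0, pos).
  uint32_t Rank1(uint32_t pos) const {
    uint32_t rank = rank_lut_[pos / 256];              // coarse sample, one per 256 bits
    uint32_t w = (pos / 256) * 4;                      // first 64-bit word of that block
    for (; w < pos / 64; ++w) {
      rank += static_cast<uint32_t>(std::popcount(words_[w]));   // whole words
    }
    if (pos % 64 != 0) {                               // partial last word
      rank += static_cast<uint32_t>(
          std::popcount(words_[w] & ((uint64_t{1} << (pos % 64)) - 1)));
    }
    return rank;
  }

 private:
  std::vector<uint64_t> words_;      // raw bits, 64 per word
  std::vector<uint32_t> rank_lut_;   // rank_lut_[i] = set bits before bit i * 256
};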

Seek Hot Path Optimizations

  • Fanout-1 sparse fast path: Most sparse nodes in tries built from zero-padded numeric keys have exactly one child. Detected via start_pos + 1 == end_pos and inlined as a single byte comparison, avoiding the full SparseSeekLabel call.
  • Linear scan for small sparse nodes: SparseSeekLabel uses a sequential scan for nodes with ≤16 labels instead of binary search. Faster for the common 10-child digit nodes, where branch misprediction cost outweighs the linear scan cost. (See the sketch after this list.)
  • Rank reuse: DenseLeafIndexFromRankAndHasChildRank and SparseLeafIndexFromHasChildRank overloads accept pre-computed has_child_rank from the Seek descent, avoiding redundant Rank1 calls.
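
A hedged sketch of the small-node label lookup described above (function name, signature, and threshold handling are illustrative, not the PR's exact SparseSeekLabel):

#include <algorithm>
#include <cstdint>

// Return the index of the first label >= target in [start, end), or end if none.
inline uint32_t SeekLabelSketch(const uint8_t* labels, uint32_t start, uint32_t end,
                                uint8_t target) {
  if (end - start <= 16) {                 // small node: predictable forward scan
    for (uint32_t i = start; i < end; ++i) {
      if (labels[i] >= target) {
        return i;
      }
    }
    return end;
  }
  // Larger node: fall back to binary search.
  return static_cast<uint32_t>(
      std::lower_bound(labels + start, labels + end, target) - labels);
}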

General Performance Optimizations

  • Select-free sparse traversal: Precomputed child position lookup tables (s_child_start_pos_/s_child_end_pos_) eliminate Select1 calls during Seek. Sparse traversal tracks (start_pos, end_pos) directly, using only Rank1 (O(1)) + array lookup (O(1)) for child descent.
  • Cached label_rank pattern: Eliminates redundant Rank1 calls in hot paths (Seek, Next, Advance all cache and reuse the label_rank computed for has_child checking).
  • Leaf index fast path: When no prefix keys exist (common case), SparseLeafIndex and DenseLeafIndexFromRank skip the prefix-key Rank1 calls entirely, reducing from 3 to 1 Rank1 call.
  • Popcount-based Select64 via 6-step binary search within 64-bit words (sketched after this list).
  • MSVC portability using RocksDB's BitsSetToOne/CountTrailingZeroBits.
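
The popcount-based Select64 above can be sketched as six halving steps over a 64-bit word (an illustration under the assumption that k is 0-based and k < popcount(word); the PR's exact implementation may differ):

#include <bit>
#include <cstdint>

// Position of the (k+1)-th set bit in `word` (k is 0-based).
inline uint32_t Select64Sketch(uint64_t word, uint32_t k) {
  uint32_t pos = 0;
  for (uint32_t width = 32; width > 0; width /= 2) {   // 32, 16, 8, 4, 2, 1: six steps
    uint32_t cnt = static_cast<uint32_t>(
        std::popcount(word & ((uint64_t{1} << width) - 1)));
    if (k >= cnt) {       // target bit lies in the upper half: shift it down
      k -= cnt;
      word >>= width;
      pos += width;
    }                     // otherwise keep narrowing within the lower half
  }
  return pos;
}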

Benchmark Results

Trie Seek at 32K keys (16-byte keys, 5M lookups, median of 5 runs):

Configuration            ns/op
Trie (optimized)           118
Binary search (native)     134

Trie Seek is ~12% faster than the native binary search index at 32K keys per block.

@meta-cla meta-cla bot added the CLA Signed label Feb 6, 2026
@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch 3 times, most recently from 046a09d to 507dd64 on February 6, 2026 at 17:44
user_defined_index_builder_(std::move(user_defined_index_builder)) {}

- ~UserDefinedIndexBuilderWrapper() override = default;
+ ~UserDefinedIndexBuilderWrapper() override { status_.PermitUncheckedError(); }

@zaidoon1 (Contributor Author)

It's not gated behind debug builds, though; let me know if you want me to do the same.

@zaidoon1

zaidoon1 commented Feb 6, 2026

cc @xingbowang this is my first attempt at part 1 of many

@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch 3 times, most recently from d7d1969 to 02a5136 on February 9, 2026 at 08:53
@xingbowang

Overall, the change looks very good. Thank you for the contribution.

This is Claude-generated code review feedback; I will take a closer look later.

Correctness

1. Unbounded max_depth_ from untrusted data (P1)

LoudsTrieIterator allocates key_buf_ of size MaxDepth() + 1, where MaxDepth() comes from the serialized trie header. If a corrupted SST provides max_depth_ = UINT32_MAX, then key_cap_ = trie_->MaxDepth() + 1 overflows uint32_t to 0, and make_unique<char[]>(0) creates a zero-length buffer. Any subsequent write to key_buf_ is a buffer overflow.

// louds_trie.cc, LoudsTrieIterator constructor
key_cap_ = trie_->MaxDepth() + 1;  // overflow if MaxDepth() == UINT32_MAX
key_buf_ = std::make_unique<char[]>(key_cap_);

Suggestion: Add validation in LoudsTrie::InitFromData():

// A key longer than 64 KB is unrealistic for a block index separator.
if (max_depth_ > 65536) {
  return Status::Corruption("Trie index: max_depth exceeds reasonable limit");
}

2. Comparator compatibility not validated (P1)

The trie's internal traversal is inherently bytewise-ordered (it's a trie over byte sequences). If the user supplies a non-bytewise comparator, separator keys may not be in bytewise-sorted order, causing incorrect Seek results. The TrieIndexFactory should either validate that the comparator is bytewise-compatible or document this limitation.

Suggestion: Add a check in TrieIndexFactory::NewBuilder():

if (option.comparator != nullptr &&
    option.comparator != BytewiseComparator() &&
    option.comparator != ReverseBytewiseComparator()) {
  // At minimum, document. Ideally, return NotSupported.
}

Or add a comment in trie_index_factory.h documenting the restriction.


Error Handling

3. Deprecated API methods crash in release builds (P2)

The TrieIndexFactory deprecated overloads (NewBuilder() and NewReader() without UserDefinedIndexOption) call assert(false) and return nullptr. In release builds (where asserts are stripped), this silently returns null, likely causing a null-pointer dereference.

// trie_index_factory.h
UserDefinedIndexBuilder* NewBuilder() const override {
    assert(false);
    return nullptr;
}

Suggestion: Return a proper error or use ROCKSDB_UNREACHABLE / abort() if the intent is to never call these.


Testing

4. Missing test coverage (P2)

  • Move semantics for Bitvector and LoudsTrie. The move constructor has subtle pointer re-seating logic (RecomputePointers()) that deserves explicit testing.
  • ApproximateMemoryUsage() accuracy — currently returns only data_size_, but s_child_start_pos_ and s_child_end_pos_ vectors are additional heap allocations (8 bytes per sparse internal node). Consider adding them.

Performance

5. No benchmark results provided (P2)

Per CLAUDE.md, performance-sensitive changes should include db_bench numbers. For a space-reduction feature, a comparison of trie index size and Seek latency vs. the default binary search index would strengthen the case. The commit message references the SuRF paper but doesn't include empirical results for RocksDB's workloads.

Note: typically, for a performance-related feature, we would include db_bench results.

6. ApproximateMemoryUsage() under-reports (P2)

Returns only data_size_ but the s_child_start_pos_ and s_child_end_pos_ vectors (8 bytes per sparse internal node) are heap-allocated during InitFromData(). These should be included.
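
A hedged sketch of the suggested accounting (member names are taken from this review; the PR's actual method may be shaped differently):

size_t LoudsTrie::ApproximateMemoryUsage() const {
  // Serialized buffer plus the aux vectors materialized in InitFromData().
  return data_size_ +
         s_child_start_pos_.capacity() * sizeof(uint32_t) +
         s_child_end_pos_.capacity() * sizeof(uint32_t);
}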


@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch 2 times, most recently from 57723e7 to 5ef21cd on February 15, 2026 at 06:54
@xingbowang

xingbowang commented Feb 15, 2026

Codex review result:

Trie UDI: Bounds Checking Findings

Commit: 02a51362203abe0bec43ddc5678f78ee35fcd52b — Add trie-based User Defined Index (UDI) plugin

Both findings are in utilities/trie_index/trie_index_factory.cc, specifically in TrieIndexIterator::CheckBounds().


Finding 1: Upper-bound pruning drops valid blocks (High)

Problem

CheckBounds() compares the current separator key against the scan limit and returns kOutOfBound when separator >= limit:

// trie_index_factory.cc, CheckBounds()
if (comparator_->Compare(Slice(current_key_scratch_), limit) >= 0) {
  return IterBoundCheck::kOutOfBound;
}

The trie index stores separator keys (produced by FindShortestSeparator), not block-start keys. A separator is always >= every key in the block it covers. When the separator equals or exceeds the limit, the block may still contain keys that are strictly less than the limit. Returning kOutOfBound causes the wrapper to skip the block entirely, losing those keys.

The UDI API contract in include/rocksdb/user_defined_index.h explicitly warns about this:

The UDI implementation needs to be careful about returning kOutOfBound. If a limit key is specified in ScanOptions, an implementation that does not store the first key in the block for the corresponding index entry cannot reliably determine if the block is out of bounds. It must compare against the previous index key to determine if the current block is out of bounds w.r.t the limit.

Example

Three blocks with separators computed by FindShortestSeparator:

Block 0: keys ["a".."az"],  last_key="az",  next_first="c"  → separator = "b"
Block 1: keys ["c".."cz"],  last_key="cz",  next_first="e"  → separator = "d"
Block 2: keys ["e".."ez"],  last_key="ez",  no next          → separator = "f"

Upper bound (limit) = "d". Scan starts at "a":

Step  Action     Separator       CheckBounds                 Result
1     Seek("a")  "b" (block 0)   "b" < "d" → kInbound        Block 0 returned (correct)
2     Next()     "d" (block 1)   "d" >= "d" → kOutOfBound    Block 1 skipped (incorrect)

Block 1 contains keys like "c", "ca", "cz" — all less than "d". These keys are lost.

The correct check: compare the previous separator "b" against the limit. Since "b" < "d", block 1 may contain keys within bounds and should return kInbound.

Reproducer test

TEST_F(TrieIndexFactoryTest, UpperBoundPruningDropsValidBlock) {
  UserDefinedIndexOption option;
  option.comparator = BytewiseComparator();

  std::unique_ptr<UserDefinedIndexBuilder> udi_builder;
  ASSERT_OK(factory_->NewBuilder(option, udi_builder));

  // Block 0: last="az", next_first="c" → sep "b"
  {
    UserDefinedIndexBuilder::BlockHandle handle{0, 1000};
    std::string scratch;
    Slice next("c");
    udi_builder->AddIndexEntry(Slice("az"), &next, handle, &scratch);
  }
  // Block 1: last="cz", next_first="e" → sep "d"
  {
    UserDefinedIndexBuilder::BlockHandle handle{1000, 1000};
    std::string scratch;
    Slice next("e");
    udi_builder->AddIndexEntry(Slice("cz"), &next, handle, &scratch);
  }
  // Block 2: last="ez", no next → sep "f"
  {
    UserDefinedIndexBuilder::BlockHandle handle{2000, 1000};
    std::string scratch;
    udi_builder->AddIndexEntry(Slice("ez"), nullptr, handle, &scratch);
  }

  Slice index_contents;
  ASSERT_OK(udi_builder->Finish(&index_contents));

  std::unique_ptr<UserDefinedIndexReader> reader;
  ASSERT_OK(factory_->NewReader(option, index_contents, reader));

  ReadOptions ro;
  auto iter = reader->NewIterator(ro);

  ScanOptions scan_opts(Slice("a"), Slice("d"));
  iter->Prepare(&scan_opts, 1);

  IterateResult result;
  ASSERT_OK(iter->SeekAndGetResult(Slice("a"), &result));
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kInbound);
  ASSERT_EQ(iter->value().offset, 0);  // block 0

  ASSERT_OK(iter->NextAndGetResult(&result));
  // BUG: returns kOutOfBound because separator "d" >= limit "d".
  // Should be kInbound — block 1 contains keys < "d".
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kOutOfBound);  // buggy
}

Fix

The root cause is that CheckBounds() compares the current separator against the limit. Since the trie stores separator keys (upper bounds on block contents), the correct reference key depends on context:

  • For Seek: Use the seek target. If target < limit, the block may contain keys within bounds → kInbound. If target >= limit, nothing the caller wants is within bounds → kOutOfBound.
  • For Next: Use the previous separator (the entry we just moved from). If prev_sep >= limit, all keys in the current block are >= prev_sep >= limit → kOutOfBound. Otherwise → kInbound.

Changes to trie_index_factory.h:

// Add members to TrieIndexIterator:
std::string prev_key_scratch_;
bool has_prev_key_;

// Change CheckBounds signature:
IterBoundCheck CheckBounds(const Slice& reference_key) const;

Changes to trie_index_factory.cc:

Status TrieIndexIterator::SeekAndGetResult(const Slice& target,
                                           IterateResult* result) {
  // ... (multi-scan advancement, see Finding 2) ...

  has_prev_key_ = false;

  if (!iter_.Seek(target)) {
    result->bound_check_result = IterBoundCheck::kOutOfBound;
    result->key = Slice();
    return Status::OK();
  }

  result->key = iter_.Key();
  current_key_scratch_ = result->key.ToString();
  result->key = Slice(current_key_scratch_);

  // Use target as reference: if target < limit, block may have keys in bounds.
  result->bound_check_result = CheckBounds(target);
  return Status::OK();
}

Status TrieIndexIterator::NextAndGetResult(IterateResult* result) {
  // Save current separator as "previous" before advancing.
  prev_key_scratch_ = current_key_scratch_;
  has_prev_key_ = true;

  if (!iter_.Next()) {
    result->bound_check_result = IterBoundCheck::kOutOfBound;
    result->key = Slice();
    return Status::OK();
  }

  result->key = iter_.Key();
  current_key_scratch_ = result->key.ToString();
  result->key = Slice(current_key_scratch_);

  // Use previous separator: if prev >= limit, current block is out of bounds.
  result->bound_check_result = CheckBounds(Slice(prev_key_scratch_));
  return Status::OK();
}

IterBoundCheck TrieIndexIterator::CheckBounds(
    const Slice& reference_key) const {
  if (!prepared_ || scan_opts_.empty()) {
    return IterBoundCheck::kInbound;
  }
  if (current_scan_idx_ >= scan_opts_.size()) {
    return IterBoundCheck::kOutOfBound;
  }

  const auto& opts = scan_opts_[current_scan_idx_];
  if (opts.range.limit.has_value()) {
    const Slice& limit = opts.range.limit.value();
    if (comparator_->Compare(reference_key, limit) >= 0) {
      return IterBoundCheck::kOutOfBound;
    }
  }
  return IterBoundCheck::kInbound;
}

This is conservative: when prev_sep < limit but current_sep >= limit, the block is returned as kInbound even though it might contain no qualifying keys. The data-level iterator handles per-key filtering, which is correct per the UDI API contract.


Finding 2: Multi-scan bounds applied to the wrong scan (Medium)

Problem

current_scan_idx_ is initialized to 0 in Prepare() and never advanced anywhere in the code:

// trie_index_factory.cc, Prepare()
current_scan_idx_ = 0;
prepared_ = true;

// CheckBounds() always reads scan_opts_[0]:
const auto& opts = scan_opts_[current_scan_idx_];  // always index 0

SeekAndGetResult() and NextAndGetResult() never increment current_scan_idx_. When multiple ScanOptions are provided via Prepare(), all bounds checks evaluate against the first scan's limit, regardless of which scan range the caller is currently operating in.

Example

Five blocks with separators "b", "d", "f", "h", "j". Two scan ranges:

Scan 0: ["a", "c")  — should cover block 0 (separator "b")
Scan 1: ["e", "g")  — should cover block 2 (separator "f")

Step  Action     Separator       Scan idx used      Limit checked          Result
1     Seek("a")  "b" (block 0)   0                  "c"                    kInbound (correct)
2     Next()     "d" (block 1)   0                  "c"                    kOutOfBound — scan 0 done (correct)
3     Seek("e")  "f" (block 2)   0 (should be 1)    "c" (should be "g")    kOutOfBound (incorrect)

At step 3, the caller seeks into scan 1's range. Block 2's separator "f" is within scan 1's limit "g", so it should be kInbound. But current_scan_idx_ is still 0, so CheckBounds() compares "f" against scan 0's limit "c": "f" >= "c" → kOutOfBound. Block 2 is skipped.

This bug only affects multi-scan workloads. Single-scan iteration (the common case) is unaffected.

Reproducer test

TEST_F(TrieIndexFactoryTest, MultiScanBoundsAppliedToWrongScan) {
  UserDefinedIndexOption option;
  option.comparator = BytewiseComparator();

  std::unique_ptr<UserDefinedIndexBuilder> udi_builder;
  ASSERT_OK(factory_->NewBuilder(option, udi_builder));

  struct BlockDef {
    const char* last_key;
    const char* next_first;
    uint64_t offset;
  };
  BlockDef blocks[] = {
      {"az", "c", 0},
      {"cz", "e", 1000},
      {"ez", "g", 2000},
      {"gz", "i", 3000},
      {"iz", nullptr, 4000},
  };
  for (const auto& b : blocks) {
    UserDefinedIndexBuilder::BlockHandle handle{b.offset, 500};
    std::string scratch;
    if (b.next_first) {
      Slice next(b.next_first);
      udi_builder->AddIndexEntry(Slice(b.last_key), &next, handle, &scratch);
    } else {
      udi_builder->AddIndexEntry(Slice(b.last_key), nullptr, handle, &scratch);
    }
  }

  Slice index_contents;
  ASSERT_OK(udi_builder->Finish(&index_contents));

  std::unique_ptr<UserDefinedIndexReader> reader;
  ASSERT_OK(factory_->NewReader(option, index_contents, reader));

  ReadOptions ro;
  auto iter = reader->NewIterator(ro);

  ScanOptions scans[] = {
      ScanOptions(Slice("a"), Slice("c")),   // scan 0
      ScanOptions(Slice("e"), Slice("g")),   // scan 1
  };
  iter->Prepare(scans, 2);

  // Scan 0
  IterateResult result;
  ASSERT_OK(iter->SeekAndGetResult(Slice("a"), &result));
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kInbound);
  ASSERT_EQ(iter->value().offset, 0);

  ASSERT_OK(iter->NextAndGetResult(&result));
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kOutOfBound);

  // Scan 1 — seek into second range
  ASSERT_OK(iter->SeekAndGetResult(Slice("e"), &result));
  // BUG: returns kOutOfBound because current_scan_idx_ is still 0,
  // so CheckBounds() compares "f" against scan 0's limit "c".
  // Should be kInbound — "f" < scan 1's limit "g".
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kOutOfBound);  // buggy
}

Fix

In SeekAndGetResult(), advance current_scan_idx_ past any scans whose limit is <= the seek target before checking bounds. Scans are ordered and non-overlapping, so a simple forward scan suffices:

Status TrieIndexIterator::SeekAndGetResult(const Slice& target,
                                           IterateResult* result) {
  // Advance current_scan_idx_ past any scans whose limit <= target.
  // This handles the multi-scan case where the caller seeks into a later
  // scan range after the previous scan returned kOutOfBound.
  if (prepared_) {
    while (current_scan_idx_ < scan_opts_.size()) {
      const auto& opts = scan_opts_[current_scan_idx_];
      if (opts.range.limit.has_value() &&
          comparator_->Compare(target, opts.range.limit.value()) >= 0) {
        current_scan_idx_++;
      } else {
        break;
      }
    }
  }

  has_prev_key_ = false;

  // ... rest of SeekAndGetResult (see Finding 1 fix) ...
}

Also reset has_prev_key_ in Prepare():

void TrieIndexIterator::Prepare(const ScanOptions scan_opts[],
                                size_t num_opts) {
  // ...
  current_scan_idx_ = 0;
  has_prev_key_ = false;
  prepared_ = true;
}

With both fixes applied, the multi-scan test verifies end-to-end:

Step  Action     Separator     Scan idx       Reference key  Limit  Result
1     Seek("a")  "b" (blk 0)   0              target "a"     "c"    kInbound
2     Next()     "d" (blk 1)   0              prev "b"       "c"    kInbound (conservative)
3     Next()     "f" (blk 2)   0              prev "d"       "c"    kOutOfBound — scan 0 done
4     Seek("e")  "f" (blk 2)   1 (advanced)   target "e"     "g"    kInbound
5     Next()     "h" (blk 3)   1              prev "f"       "g"    kInbound (conservative)
6     Next()     "j" (blk 4)   1              prev "h"       "g"    kOutOfBound — scan 1 done

@xingbowang

A great number of unit tests have been added. Thank you for doing this.
As RocksDB has many features, internally we use stress tests to validate random feature combinations. It would be great if you could integrate the trie index into the stress test. It will help reveal bugs when it is used with other features.

@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch from f70d1bb to c08e895 on February 15, 2026 at 15:32
@zaidoon1

A great number of unit tests have been added. Thank you for doing this. As RocksDB has many features, internally we use stress tests to validate random feature combinations. It would be great if you could integrate the trie index into the stress test. It will help reveal bugs when it is used with other features.

Will do! I'm just working on fixing some failing builds right now. I'm also running profiling locally, identifying hotspots and fixing them. I'll address the AI review comments and start working on the stress test after that.

@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch 2 times, most recently from 4240604 to ad02d56 on February 15, 2026 at 15:55
@xingbowang

I just discussed this change with AI. I assume the index is densely packed, so that it is relatively small and can be loaded into memory completely. If I am wrong, please correct me.

If this is the case, performance is likely bottlenecked on cache misses. AI pointed out these issues.

A. Multiple Separate Arrays (Problematic)
∙ LOUDS-Sparse uses separate sequences: s_labels_, s_has_child_, s_louds_, s_child_start_pos_, s_child_end_pos_
∙ Accessing these in sequence causes cache misses
∙ However, the SuRF paper mentions this is mitigated by:
  ∙ Position correspondence between arrays
  ∙ Prefetching computed addresses

In the meantime, I am wondering whether we could do batch execution to hide the cache-miss latency. It would interleave the execution of multiple queries. This could boost throughput significantly while losing a bit of latency.

If you are interested, maybe explore this direction as well: expand the UDI to support a batch API, then extend the MultiGet API to leverage it.

@zaidoon1

I just discussed this change with AI. I assume the index is densely packed, so that it is relatively small and can be loaded into memory completely. If I am wrong, please correct me.

If this is the case, performance is likely bottlenecked on cache misses. AI pointed out these issues.

A. Multiple Separate Arrays (Problematic) ∙ LOUDS-Sparse uses separate sequences: s_labels_, s_has_child_, s_louds_, s_child_start_pos_, s_child_end_pos_ ∙ Accessing these in sequence causes cache misses ∙ However, the SuRF paper mentions this is mitigated by: ∙ Position correspondence between arrays ∙ Prefetching computed addresses

In the meantime, I am wondering whether we could do batch execution to hide the cache-miss latency. It would interleave the execution of multiple queries. This could boost throughput significantly while losing a bit of latency.

If you are interested, maybe explore this direction as well: expand the UDI to support a batch API, then extend the MultiGet API to leverage it.

You're correct: the trie index is densely packed into a single contiguous meta-block and loaded entirely into memory via the block cache.

On cache misses from multiple separate arrays:
The arrays are not separate heap allocations. Everything is serialized into a single contiguous buffer during Finish(), and InitFromData() uses zero-copy pointers into that buffer for all hot-path data:

  • s_labels_data_ → raw pointer into buffer
  • s_has_child_.words_ / .rank_lut_ → raw pointers into buffer
  • s_louds_.words_ / .rank_lut_ → raw pointers into buffer
  • s_chain_bitmap_, s_chain_suffix_data_, handle_offsets_, handle_sizes_ → raw pointers into buffer

Only the child position lookup tables (s_child_start_pos_, s_child_end_pos_) and chain metadata vectors are copied into separate std::vectors during deserialization. These are small — 8 bytes per internal node.
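
A hedged illustration of that zero-copy pattern (struct and function names are assumptions; the real InitFromData() also validates sizes and 8-byte alignment before pointing into the buffer):

#include <cstddef>
#include <cstdint>

struct HandleArrayView {
  const uint64_t* offsets = nullptr;   // points directly into the index block
  const uint64_t* sizes = nullptr;
  size_t num = 0;
};

// `data` must be 8-byte aligned and outlive the view; it holds `num` offsets
// followed by `num` sizes, each a fixed-width uint64_t.
inline HandleArrayView LoadHandlesZeroCopy(const char* data, size_t num) {
  HandleArrayView v;
  v.offsets = reinterpret_cast<const uint64_t*>(data);
  v.sizes = v.offsets + num;
  v.num = num;
  return v;
}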

Concrete trie sizes from benchmarks:

Pattern      Keys    Total Trie Size  Fits In
Numeric 16B   1,000  ~10.7 KB         L1 (64 KB)
Numeric 16B  32,000  ~333 KB          L2 (256 KB–1 MB)
Hex 16B       1,000  ~134 KB          L2
Short 4B      1,000  ~10.6 KB         L1

For typical RocksDB SST files (4 KB data blocks, 64–256 MB file → 16K–64K separator keys), the entire trie fits in L2.
After the first Seek warms the cache, subsequent Seeks on the same SST file hit L1/L2.

Per-level memory accesses for one sparse iteration (the hot path):

Order  Array                              Storage    What
1      s_labels_data_[start]              zero-copy  1 byte read
2      s_has_child_.words_[start/64]      zero-copy  1 word (Rank1AndBit fuses GetBit + Rank1)
3      s_has_child_.rank_lut_[start/256]  zero-copy  1 word
4      s_child_start_pos_[child_idx]      vector     1 uint32
5      s_child_end_pos_[child_idx]        vector     1 uint32

That is five memory accesses per trie level, where accesses 1–3 go into the same contiguous buffer (close addresses → likely the same or adjacent cache lines) and 4–5 are adjacent vector elements. With path compression, entire chains of fanout-1 nodes are skipped with a single memcmp on contiguous suffix bytes instead of five accesses per level.
The SuRF paper's point about position correspondence applies here: position P in s_labels_, s_has_child_, and s_louds_ all describe the same logical edge, so their word-level accesses at P/64 are numerically close, improving spatial locality even across arrays.
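
A hedged sketch of the Rank1AndBit fusion mentioned in the table (written here as a free function with illustrative parameters; the PR adds it as Bitvector::Rank1AndBit()):

#include <bit>
#include <cstdint>
#include <utility>

// Returns {GetBit(pos), Rank1(pos + 1)} with a single fetch of the word containing pos.
inline std::pair<bool, uint32_t> Rank1AndBitSketch(const uint64_t* words,
                                                   const uint32_t* rank_lut,
                                                   uint32_t pos) {
  const uint64_t w = words[pos / 64];
  const bool bit = (w >> (pos % 64)) & 1;
  uint32_t rank = rank_lut[pos / 256];                          // coarse sample
  for (uint32_t i = (pos / 256) * 4; i < pos / 64; ++i) {
    rank += static_cast<uint32_t>(std::popcount(words[i]));     // whole words before w
  }
  const uint32_t r = pos % 64;
  const uint64_t mask = (r == 63) ? ~uint64_t{0} : ((uint64_t{1} << (r + 1)) - 1);
  rank += static_cast<uint32_t>(std::popcount(w & mask));       // bits [word start, pos]
  return {bit, rank};
}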

On batch execution / MultiGet integration:
I looked at the current MultiGet pipeline, and the index lookup is actually the only unbatched stage. Bloom filter checks use a two-pass prefetch+check pattern (PrepareHash → PREFETCH → HashMayMatchPrepared), block cache lookups are batched (StartAsyncLookupFull + WaitAll), and disk I/O is batched (MultiRead with coalescing). But index Seeks are strictly sequential: one iiter->Seek() per key.

A batch trie Seek could use software pipelining to interleave multiple queries, similar to what RocksDB already does for bloom filters. The UDI interface already has Prepare() which could be extended for point-lookup batches.
That said, the implementation complexity would be significant: trie Seek has complex control flow (dense/sparse transitions, prefix keys, chain matching, backtracking), and interleaving N queries through that state machine requires either coroutine-style suspension or manually maintaining per-query state. The bloom filter two-pass pattern works because bloom probing is a fixed set of independent memory accesses; trie traversal is data-dependent, where each level's access depends on the previous level's result.

The benefit is also uncertain without profiling a real MultiGet workload with perf stat to measure actual cache miss rates. If MultiGet keys for the same SST are sorted (which they typically are), consecutive Seeks follow similar trie paths and get good cache reuse after the first cold access. The cold first-Seek per SST is the main scenario where batch prefetching would help.

I think this is worth exploring as a follow-up once the basic trie index is stable, but would want to measure actual L2/L3 miss rates on a production-like workload first to quantify the opportunity before committing to the implementation complexity.

@zaidoon1

Also, optimization-wise, I'm done. I think this is as good as it gets for now as a first pass. I'll look at the stress testing next.

Implement a Fast Succinct Trie (FST) index based on LOUDS encoding as a
User Defined Index plugin for RocksDB's block-based tables. This is the
first step toward supporting trie-based indexing per issue facebook#12396.

The trie uses a hybrid LOUDS-Dense (upper levels, 256-bit bitmaps) +
LOUDS-Sparse (lower levels, label arrays) encoding inspired by the SuRF
paper (Zhang et al., SIGMOD 2018)
https://www.pdl.cmu.edu/PDL-FTP/Storage/surf_sigmod18.pdf . The boundary
between dense and sparse levels is automatically chosen to minimize total
space.

Key components:

- Bitvector with O(1) rank and O(log n) select using a rank LUT sampled
  every 256 bits with popcount intrinsics. Uses uint32_t rank LUT entries
  (halving LUT memory overhead vs uint64_t, safe because trie bitvectors
  are bounded well below 4 billion bits). Includes word-level AppendWord()
  for efficient dense bitmap construction and AppendMultiple() optimized
  at word granularity for bulk bit fills.
- Streaming trie builder using flat per-level arrays with deferred internal
  marking, handle migration for prefix keys, and lazy node creation. Infers
  trie structure directly from sorted keys via LCP analysis in a single
  pass (no intermediate tree). Merged node-per-level and label-per-level
  cutoff computation into a single pass over the key set.
- LoudsTrie for immutable querying with BFS-ordered handle reordering built
  into the single-pass level-by-level serialization. Move-only semantics
  with correct pointer re-seating after std::string move.
- LoudsTrieIterator with rank-based trie traversal, key reconstruction from
  the trie path, and stack-based backtracking for Next(). Uses packed 8-byte
  LevelPos (is_dense flag encoded in bit 63) and autovector<LevelPos, 24>
  to avoid heap allocation. Key reconstruction uses a raw char buffer
  (key_buf_/key_len_) allocated once to MaxDepth()+1 bytes, so appending
  each byte is a single inlined store + increment with no function call
  overhead.
- TrieIndexFactory/Builder/Reader/Iterator implementing the
  UserDefinedIndexFactory interface.
- Zero-copy block handle loading using two fixed-width uint64_t arrays
  (offsets + sizes) with 8-byte alignment, enabling O(1) initialization via
  direct pointer assignment into the serialized data.

Seek hot path optimizations:

- Fanout-1 sparse fast path: most sparse nodes in tries built from
  zero-padded numeric keys have exactly one child. The Seek loop detects
  this case (start_pos + 1 == end_pos) and inlines a single byte comparison,
  avoiding the full SparseSeekLabel function call and reducing branch logic
  to a single comparison + conditional.
- Linear scan for small sparse nodes: SparseSeekLabel uses sequential scan
  for nodes with <=16 labels instead of binary search. This is faster for
  the common 10-child digit nodes where the branch misprediction cost of
  binary search outweighs the linear scan cost.
- Rank reuse: DenseLeafIndexFromRankAndHasChildRank and
  SparseLeafIndexFromHasChildRank overloads accept pre-computed
  has_child_rank from the Seek descent, avoiding redundant Rank1 calls
  when computing the final leaf index.

General performance optimizations:

- Select-free sparse traversal: precomputed child position lookup tables
  (s_child_start_pos_/s_child_end_pos_, uint32_t per internal node)
  eliminate Select1 calls during Seek. Sparse traversal tracks
  (start_pos, end_pos) directly instead of node_num, using only Rank1
  (O(1)) + array lookup (O(1)) for child descent.
- Binary search for sparse label lookup (std::lower_bound) as fallback
  for nodes with >16 labels.
- Popcount-based Select64 via 6-step binary search within 64-bit words.
- Cached label_rank pattern to eliminate redundant Rank1 calls in hot paths
  (Seek, Next, Advance all cache and reuse the label_rank computed for
  has_child checking).
- Leaf index fast path: when there are no prefix keys (the common case),
  SparseLeafIndex and DenseLeafIndexFromRank skip the prefix-key Rank1
  calls entirely, reducing SparseLeafIndex from 3 Rank1 calls to 1.
- MSVC portability using RocksDB's BitsSetToOne/CountTrailingZeroBits.
…ration

- Security: fix integer overflow in InitFromData (remaining/16 check)
- Correctness: fix alignment padding to use buffer offset, not pointer
- Validation: reject max_depth > 65536 to prevent key_cap overflow
- Validation: reject non-BytewiseComparator in NewBuilder/NewReader
- Safety: use abort() for deprecated virtual methods
- Memory: include aux vectors in ApproximateMemoryUsage
- Tests: add move semantics, corruption, comparator, memory tests
- db_bench: add --use_trie_index flag for benchmarking
- Style: named constants, constexpr methods, noexcept move ctor
Add unsigned suffix (u) to integer literals compared against unsigned
return types (uint64_t, size_t) in ASSERT_EQ/ASSERT_GT macros. GCC with
-Werror -Wsign-compare rejects mixed signed/unsigned comparisons in
gtest's CmpHelper template instantiations.
…no class

Key changes:
- Add select1/select0 hint arrays to Bitvector for O(1) Select (was O(log n)
  binary search). Hints store rank LUT indices at every 256th one/zero bit,
  narrowing the search to a linear scan of 1-2 rank samples.
- Replace raw uint64_t block handle arrays (16 bytes/key) with packed uint32_t
  arrays for offsets and sizes (8 bytes/key). BFS leaf order does not match
  key-sorted order for keys of different lengths, so Elias-Fano cannot be used.
- Add EliasFano class (bitvector.h/cc) for compressed monotone sequences.
  Not used for handles due to the BFS ordering issue, but available for future
  use with other monotone sequences.
- Remove all debug fprintf/fflush trace statements from SerializeAll().
- Add standalone profiling tools (local-only, not tracked in repo).

Space reduction: 19.2 -> 11.3 bytes/key (41% reduction) at 32K 16-byte keys.
Seek performance: ~125 ns/op (no regression).
On Linux, uint64_t is 'unsigned long' not 'unsigned long long', so %llu
triggers -Werror,-Wformat with clang-18. Use the portable PRIu64 macro.
Two correctness bugs in TrieIndexIterator::CheckBounds():

1. Upper-bound pruning dropped valid blocks (High). CheckBounds()
   compared the current separator key against the scan limit. Since the
   trie stores separator keys (upper bounds on block contents), not
   first-in-block keys, this prematurely rejected blocks that still
   contained keys within the limit. Fix: use the seek target (for Seek)
   or previous separator (for Next) as the reference key. This matches
   the UDI API contract in user_defined_index.h:114-121.

2. Multi-scan bounds applied to wrong scan (Medium). current_scan_idx_
   was never advanced, so all bounds checks evaluated against scan 0's
   limit. Fix: in SeekAndGetResult(), advance current_scan_idx_ past
   scans whose limit <= the seek target before checking bounds.

Added regression tests:
- UpperBoundDoesNotDropValidBlocks: verifies blocks with keys < limit
  are not skipped when their separator >= limit.
- MultiScanBoundsAdvanceCorrectly: verifies multi-scan iteration uses
  the correct scan's limit for each seek.
- Guard right-shift by low_bits_ when low_bits_==64 in EliasFano::BuildFrom
  to avoid undefined behavior (shift >= type width).
- Guard word-boundary crossing shift: check bit_idx > 0 before shifting by
  (64 - bit_idx) to avoid shift by 64.
- Initialize consumed to 0 in EliasFano::InitFromData to satisfy
  core.uninitialized.Assign checker.
- Remove dead stores to p and remaining at end of LoudsTrie::InitFromData.
Add Bitvector::Rank1AndBit() that computes both GetBit(pos) and
Rank1(pos+1) in a single pass, avoiding a redundant word access when
both operations target the same bitvector position. This is the common
pattern in sparse trie traversal where we check has_child then compute
rank for child/leaf index lookup.

Wire Rank1AndBit into three hot paths in LoudsTrieIterator::Seek():
 - Fanout-1 exact match: replaces GetBit + two separate Rank1 calls
   (internal vs leaf branches) with a single Rank1AndBit call
 - Fanout-1 mismatch (label > target): replaces GetBit + Rank1
 - General sparse path (fanout > 1): replaces GetBit + Rank1

Benchmark (32K random 16-byte hex keys, 10M lookups):
 Before: ~160 ns/op
 After:  ~153 ns/op  (~4.4% faster)
Implement path compression for the LOUDS-sparse trie by detecting and
collapsing fanout-1 chains (sequences of single-child nodes) into
single edges that can be compared with memcmp instead of per-level
Rank1 traversal.

Builder changes (louds_trie.cc):
- Detect fanout-1 chains >= 8 nodes during serialization
- Store chain metadata: bitmap (1 bit per internal label), suffix bytes,
  chain lengths (uint16), and end-child indices (uint32)
- Filter chains with num_chains <= num_keys heuristic to disable chains
  for key patterns where they hurt (random hex, URLs) and enable them
  where they help (numeric keys with long shared prefixes)
- Apply 10% space budget cap to prevent metadata explosion

Seek changes (louds_trie.h, louds_trie.cc):
- Template SeekImpl<bool kHasChains> with if constexpr guards around
  all chain-handling code (~200 lines in two blocks)
- When kHasChains=false, compiler eliminates all chain code entirely,
  producing zero i-cache overhead for tries without chains
- Dispatch once at iterator construction via has_chains_ flag
- Follows the same pattern as RocksDB's BlockIter::ParseNextKey template

Chain matching handles all cases:
- Full chain match: skip entire chain, continue at end node
- Partial match with divergence: find mismatch point, descend or advance
- Target exhausted mid-chain: descend to leftmost leaf
- Chain ending at leaf vs internal node with fanout > 1

Benchmark changes (trie_index_test.cc):
- Replace raw trie vs binary search benchmark with trie vs real RocksDB
  IndexBlockIter comparison using InternalKeys and production code paths

Performance (16-byte zero-padded numeric keys, 500K lookups):
- 500 keys:   97ns vs 142ns baseline  (+32% faster)
- 1000 keys:  98ns vs 146ns baseline  (+33% faster)
- 32000 keys: 136ns vs 170ns baseline (+20% faster)

Zero regression for non-chain patterns (hex, URL, short keys):
all within noise of baseline measurements.
@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch from 4ffdde3 to c214056 on February 16, 2026 at 06:23
The blackbox crash test reuses the same DB across iterations. When
use_trie_index was a lambda, it could change between iterations: iteration
1 might run without trie index (writing deletes), then iteration 2 enables
trie index and fails on reopen because existing SSTs contain non-Put types.

Fix by making use_trie_index a non-lambda constant, matching the pattern
used by use_put_entity_one_in for the same reason.