
Add trie-based User Defined Index (UDI) plugin#14310

Open
zaidoon1 wants to merge 11 commits into facebook:main from zaidoon1:zaidoon/trie-based-udi-index

Conversation


@zaidoon1 zaidoon1 commented Feb 6, 2026

Implement a Fast Succinct Trie (FST) index based on LOUDS encoding as a User Defined Index plugin for RocksDB's block-based tables. This is the first step toward trie-based indexing per #12396.
The trie uses hybrid LOUDS-Dense (upper levels, 256-bit bitmaps) + LOUDS-Sparse (lower levels, label arrays) encoding inspired by the SuRF paper (Zhang et al., SIGMOD 2018). The boundary between dense and sparse levels is automatically chosen to minimize total space.
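
For illustration, a hedged sketch of how such a cutoff could be chosen by total size (the function name, inputs, and per-node/per-label cost constants are assumptions in the spirit of the SuRF encoding, not the PR's actual builder code):

#include <cstdint>
#include <vector>

// Pick the level at which to switch from LOUDS-Dense to LOUDS-Sparse by
// minimizing estimated total bits. Levels [0, cutoff) are encoded dense.
inline uint32_t ChooseDenseSparseCutoff(const std::vector<uint64_t>& nodes_per_level,
                                        const std::vector<uint64_t>& labels_per_level) {
  const uint32_t num_levels = static_cast<uint32_t>(nodes_per_level.size());
  uint64_t best_bits = UINT64_MAX;
  uint32_t best_cutoff = 0;
  for (uint32_t cutoff = 0; cutoff <= num_levels; ++cutoff) {
    uint64_t bits = 0;
    for (uint32_t l = 0; l < cutoff; ++l) {
      bits += nodes_per_level[l] * 2 * 256;   // dense: label + has_child bitmaps per node
    }
    for (uint32_t l = cutoff; l < num_levels; ++l) {
      bits += labels_per_level[l] * (8 + 2);  // sparse: 8-bit label + 2 flag bits per label
    }
    if (bits < best_bits) {
      best_bits = bits;
      best_cutoff = cutoff;
    }
  }
  return best_cutoff;
}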

Key Components

  • Bitvector with O(1) rank and O(log n) select using a rank LUT sampled every 256 bits with popcount intrinsics. Uses uint32_t rank LUT entries (halving memory vs uint64_t). Includes word-level AppendWord() for efficient dense bitmap construction and AppendMultiple() optimized at word granularity for bulk bit fills. (A rank sketch follows this list.)
  • Streaming trie builder using flat per-level arrays with deferred internal marking, handle migration for prefix keys, and lazy node creation. Infers trie structure directly from sorted keys via LCP analysis in a single pass (no intermediate tree).
  • LoudsTrie for immutable querying with BFS-ordered handle reordering built into single-pass level-by-level serialization. Move-only semantics with correct pointer re-seating after std::string move.
  • LoudsTrieIterator with rank-based traversal, key reconstruction from trie path, and stack-based backtracking for Next(). Uses packed 8-byte LevelPos (is_dense flag in bit 63) and autovector<LevelPos, 24> to avoid heap allocation. Key reconstruction uses a raw char buffer allocated once to MaxDepth()+1 bytes.
  • TrieIndexFactory/Builder/Reader/Iterator implementing the UserDefinedIndexFactory interface.
  • Zero-copy block handle loading using two fixed-width uint64_t arrays (offsets + sizes) with 8-byte alignment, enabling O(1) initialization via direct pointer assignment.
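
For reference, a minimal sketch of the rank-LUT idea behind the Bitvector bullet above (class and member names are illustrative and assume C++20 std::popcount; this is not the PR's actual Bitvector):

#include <bit>
#include <cstdint>
#include <vector>

class RankBitvectorSketch {
 public:
  // Rank1(pos) = number of set bits in [0, pos).
  uint32_t Rank1(uint32_t pos) const {
    uint32_t rank = rank_lut_[pos / 256];              // coarse sample, one per 256 bits
    uint32_t w = (pos / 256) * 4;                      // first 64-bit word of that block
    for (; w < pos / 64; ++w) {
      rank += static_cast<uint32_t>(std::popcount(words_[w]));   // whole words
    }
    if (pos % 64 != 0) {                               // partial last word
      rank += static_cast<uint32_t>(
          std::popcount(words_[w] & ((uint64_t{1} << (pos % 64)) - 1)));
    }
    return rank;
  }

 private:
  std::vector<uint64_t> words_;      // raw bits, 64 per word
  std::vector<uint32_t> rank_lut_;   // rank_lut_[i] = set bits before bit i * 256
};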

Seek Hot Path Optimizations

  • Fanout-1 sparse fast path: Most sparse nodes in tries built from zero-padded numeric keys have exactly one child. Detected via start_pos + 1 == end_pos and inlined as a single byte comparison, avoiding the full SparseSeekLabel call.
  • Linear scan for small sparse nodes: SparseSeekLabel uses a sequential scan for nodes with ≤16 labels instead of binary search. Faster for the common 10-child digit nodes, where branch misprediction cost outweighs the linear scan cost. (See the sketch after this list.)
  • Rank reuse: DenseLeafIndexFromRankAndHasChildRank and SparseLeafIndexFromHasChildRank overloads accept pre-computed has_child_rank from the Seek descent, avoiding redundant Rank1 calls.
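
A hedged sketch of the small-node label lookup described above (function name, signature, and threshold handling are illustrative, not the PR's exact SparseSeekLabel):

#include <algorithm>
#include <cstdint>

// Return the index of the first label >= target in [start, end), or end if none.
inline uint32_t SeekLabelSketch(const uint8_t* labels, uint32_t start, uint32_t end,
                                uint8_t target) {
  if (end - start <= 16) {                 // small node: predictable forward scan
    for (uint32_t i = start; i < end; ++i) {
      if (labels[i] >= target) {
        return i;
      }
    }
    return end;
  }
  // Larger node: fall back to binary search.
  return static_cast<uint32_t>(
      std::lower_bound(labels + start, labels + end, target) - labels);
}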

General Performance Optimizations

  • Select-free sparse traversal: Precomputed child position lookup tables (s_child_start_pos_/s_child_end_pos_) eliminate Select1 calls during Seek. Sparse traversal tracks (start_pos, end_pos) directly, using only Rank1 (O(1)) + array lookup (O(1)) for child descent.
  • Cached label_rank pattern: Eliminates redundant Rank1 calls in hot paths (Seek, Next, Advance all cache and reuse the label_rank computed for has_child checking).
  • Leaf index fast path: When no prefix keys exist (common case), SparseLeafIndex and DenseLeafIndexFromRank skip the prefix-key Rank1 calls entirely, reducing from 3 to 1 Rank1 call.
  • Popcount-based Select64 via 6-step binary search within 64-bit words (sketched after this list).
  • MSVC portability using RocksDB's BitsSetToOne/CountTrailingZeroBits.
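
The popcount-based Select64 above can be sketched as six halving steps over a 64-bit word (an illustration under the assumption that k is 0-based and k < popcount(word); the PR's exact implementation may differ):

#include <bit>
#include <cstdint>

// Position of the (k+1)-th set bit in `word` (k is 0-based).
inline uint32_t Select64Sketch(uint64_t word, uint32_t k) {
  uint32_t pos = 0;
  for (uint32_t width = 32; width > 0; width /= 2) {   // 32, 16, 8, 4, 2, 1: six steps
    uint32_t cnt = static_cast<uint32_t>(
        std::popcount(word & ((uint64_t{1} << width) - 1)));
    if (k >= cnt) {       // target bit lies in the upper half: shift it down
      k -= cnt;
      word >>= width;
      pos += width;
    }                     // otherwise keep narrowing within the lower half
  }
  return pos;
}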

Benchmark Results

Trie Seek at 32K keys (16-byte keys, 5M lookups, median of 5 runs):

Configuration            ns/op
Trie (optimized)           118
Binary search (native)     134

Trie Seek is ~12% faster than the native binary search index at 32K keys per block.

@meta-cla meta-cla bot added the CLA Signed label Feb 6, 2026
@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch 3 times, most recently from 046a09d to 507dd64 on February 6, 2026 at 17:44
user_defined_index_builder_(std::move(user_defined_index_builder)) {}

- ~UserDefinedIndexBuilderWrapper() override = default;
+ ~UserDefinedIndexBuilderWrapper() override { status_.PermitUncheckedError(); }

@zaidoon1 (Contributor Author)

It's not gated behind debug builds, though; let me know if you want me to do the same.

@zaidoon1

zaidoon1 commented Feb 6, 2026

cc @xingbowang this is my first attempt at part 1 of many

@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch 3 times, most recently from d7d1969 to 02a5136 on February 9, 2026 at 08:53
@xingbowang

Overall, the change looks very good. Thank you for the contribution.

This is Claude-generated code review feedback; I will take a closer look later.

Correctness

1. Unbounded max_depth_ from untrusted data (P1)

LoudsTrieIterator allocates key_buf_ of size MaxDepth() + 1, where MaxDepth() comes from the serialized trie header. If a corrupted SST provides max_depth_ = UINT32_MAX, then key_cap_ = trie_->MaxDepth() + 1 overflows uint32_t to 0, and make_unique<char[]>(0) creates a zero-length buffer. Any subsequent write to key_buf_ is a buffer overflow.

// louds_trie.cc, LoudsTrieIterator constructor
key_cap_ = trie_->MaxDepth() + 1;  // overflow if MaxDepth() == UINT32_MAX
key_buf_ = std::make_unique<char[]>(key_cap_);

Suggestion: Add validation in LoudsTrie::InitFromData():

// A key longer than 64 KB is unrealistic for a block index separator.
if (max_depth_ > 65536) {
  return Status::Corruption("Trie index: max_depth exceeds reasonable limit");
}

2. Comparator compatibility not validated (P1)

The trie's internal traversal is inherently bytewise-ordered (it's a trie over byte sequences). If the user supplies a non-bytewise comparator, separator keys may not be in bytewise-sorted order, causing incorrect Seek results. The TrieIndexFactory should either validate that the comparator is bytewise-compatible or document this limitation.

Suggestion: Add a check in TrieIndexFactory::NewBuilder():

if (option.comparator != nullptr &&
    option.comparator != BytewiseComparator() &&
    option.comparator != ReverseBytewiseComparator()) {
  // At minimum, document. Ideally, return NotSupported.
}

Or add a comment in trie_index_factory.h documenting the restriction.


Error Handling

3. Deprecated API methods crash in release builds (P2)

The TrieIndexFactory deprecated overloads (NewBuilder() and NewReader() without UserDefinedIndexOption) call assert(false) and return nullptr. In release builds (where asserts are stripped), this silently returns null, likely causing a null-pointer dereference.

// trie_index_factory.h
UserDefinedIndexBuilder* NewBuilder() const override {
    assert(false);
    return nullptr;
}

Suggestion: Return a proper error or use ROCKSDB_UNREACHABLE / abort() if the intent is to never call these.


Testing

4. Missing test coverage (P2)

  • Move semantics for Bitvector and LoudsTrie. The move constructor has subtle pointer re-seating logic (RecomputePointers()) that deserves explicit testing.
  • ApproximateMemoryUsage() accuracy — currently returns only data_size_, but s_child_start_pos_ and s_child_end_pos_ vectors are additional heap allocations (8 bytes per sparse internal node). Consider adding them.

Performance

5. No benchmark results provided (P2)

Per CLAUDE.md, performance-sensitive changes should include db_bench numbers. For a space-reduction feature, a comparison of trie index size and Seek latency vs. the default binary search index would strengthen the case. The commit message references the SuRF paper but doesn't include empirical results for RocksDB's workloads.

Note: typically, for a performance-related feature, we would include db_bench results.

6. ApproximateMemoryUsage() under-reports (P2)

Returns only data_size_ but the s_child_start_pos_ and s_child_end_pos_ vectors (8 bytes per sparse internal node) are heap-allocated during InitFromData(). These should be included.
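
A hedged sketch of the suggested accounting (member names are taken from this review; the PR's actual method may be shaped differently):

size_t LoudsTrie::ApproximateMemoryUsage() const {
  // Serialized buffer plus the aux vectors materialized in InitFromData().
  return data_size_ +
         s_child_start_pos_.capacity() * sizeof(uint32_t) +
         s_child_end_pos_.capacity() * sizeof(uint32_t);
}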


@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch 2 times, most recently from 57723e7 to 5ef21cd on February 15, 2026 at 06:54
@xingbowang

xingbowang commented Feb 15, 2026

Codex review result:

Trie UDI: Bounds Checking Findings

Commit: 02a51362203abe0bec43ddc5678f78ee35fcd52b — Add trie-based User Defined Index (UDI) plugin

Both findings are in utilities/trie_index/trie_index_factory.cc, specifically in TrieIndexIterator::CheckBounds().


Finding 1: Upper-bound pruning drops valid blocks (High)

Problem

CheckBounds() compares the current separator key against the scan limit and returns kOutOfBound when separator >= limit:

// trie_index_factory.cc, CheckBounds()
if (comparator_->Compare(Slice(current_key_scratch_), limit) >= 0) {
  return IterBoundCheck::kOutOfBound;
}

The trie index stores separator keys (produced by FindShortestSeparator), not block-start keys. A separator is always >= every key in the block it covers. When the separator equals or exceeds the limit, the block may still contain keys that are strictly less than the limit. Returning kOutOfBound causes the wrapper to skip the block entirely, losing those keys.

The UDI API contract in include/rocksdb/user_defined_index.h explicitly warns about this:

The UDI implementation needs to be careful about returning kOutOfBound. If a limit key is specified in ScanOptions, an implementation that does not store the first key in the block for the corresponding index entry cannot reliably determine if the block is out of bounds. It must compare against the previous index key to determine if the current block is out of bounds w.r.t the limit.

Example

Three blocks with separators computed by FindShortestSeparator:

Block 0: keys ["a".."az"],  last_key="az",  next_first="c"  → separator = "b"
Block 1: keys ["c".."cz"],  last_key="cz",  next_first="e"  → separator = "d"
Block 2: keys ["e".."ez"],  last_key="ez",  no next          → separator = "f"

Upper bound (limit) = "d". Scan starts at "a":

Step  Action     Separator       CheckBounds                 Result
1     Seek("a")  "b" (block 0)   "b" < "d" → kInbound        Block 0 returned (correct)
2     Next()     "d" (block 1)   "d" >= "d" → kOutOfBound    Block 1 skipped (incorrect)

Block 1 contains keys like "c", "ca", "cz" — all less than "d". These keys are lost.

The correct check: compare the previous separator "b" against the limit. Since "b" < "d", block 1 may contain keys within bounds and should return kInbound.

Reproducer test

TEST_F(TrieIndexFactoryTest, UpperBoundPruningDropsValidBlock) {
  UserDefinedIndexOption option;
  option.comparator = BytewiseComparator();

  std::unique_ptr<UserDefinedIndexBuilder> udi_builder;
  ASSERT_OK(factory_->NewBuilder(option, udi_builder));

  // Block 0: last="az", next_first="c" → sep "b"
  {
    UserDefinedIndexBuilder::BlockHandle handle{0, 1000};
    std::string scratch;
    Slice next("c");
    udi_builder->AddIndexEntry(Slice("az"), &next, handle, &scratch);
  }
  // Block 1: last="cz", next_first="e" → sep "d"
  {
    UserDefinedIndexBuilder::BlockHandle handle{1000, 1000};
    std::string scratch;
    Slice next("e");
    udi_builder->AddIndexEntry(Slice("cz"), &next, handle, &scratch);
  }
  // Block 2: last="ez", no next → sep "f"
  {
    UserDefinedIndexBuilder::BlockHandle handle{2000, 1000};
    std::string scratch;
    udi_builder->AddIndexEntry(Slice("ez"), nullptr, handle, &scratch);
  }

  Slice index_contents;
  ASSERT_OK(udi_builder->Finish(&index_contents));

  std::unique_ptr<UserDefinedIndexReader> reader;
  ASSERT_OK(factory_->NewReader(option, index_contents, reader));

  ReadOptions ro;
  auto iter = reader->NewIterator(ro);

  ScanOptions scan_opts(Slice("a"), Slice("d"));
  iter->Prepare(&scan_opts, 1);

  IterateResult result;
  ASSERT_OK(iter->SeekAndGetResult(Slice("a"), &result));
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kInbound);
  ASSERT_EQ(iter->value().offset, 0);  // block 0

  ASSERT_OK(iter->NextAndGetResult(&result));
  // BUG: returns kOutOfBound because separator "d" >= limit "d".
  // Should be kInbound — block 1 contains keys < "d".
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kOutOfBound);  // buggy
}

Fix

The root cause is that CheckBounds() compares the current separator against the limit. Since the trie stores separator keys (upper bounds on block contents), the correct reference key depends on context:

  • For Seek: Use the seek target. If target < limit, the block may contain keys within bounds → kInbound. If target >= limit, nothing the caller wants is within bounds → kOutOfBound.
  • For Next: Use the previous separator (the entry we just moved from). If prev_sep >= limit, all keys in the current block are >= prev_sep >= limit → kOutOfBound. Otherwise → kInbound.

Changes to trie_index_factory.h:

// Add members to TrieIndexIterator:
std::string prev_key_scratch_;
bool has_prev_key_;

// Change CheckBounds signature:
IterBoundCheck CheckBounds(const Slice& reference_key) const;

Changes to trie_index_factory.cc:

Status TrieIndexIterator::SeekAndGetResult(const Slice& target,
                                           IterateResult* result) {
  // ... (multi-scan advancement, see Finding 2) ...

  has_prev_key_ = false;

  if (!iter_.Seek(target)) {
    result->bound_check_result = IterBoundCheck::kOutOfBound;
    result->key = Slice();
    return Status::OK();
  }

  result->key = iter_.Key();
  current_key_scratch_ = result->key.ToString();
  result->key = Slice(current_key_scratch_);

  // Use target as reference: if target < limit, block may have keys in bounds.
  result->bound_check_result = CheckBounds(target);
  return Status::OK();
}

Status TrieIndexIterator::NextAndGetResult(IterateResult* result) {
  // Save current separator as "previous" before advancing.
  prev_key_scratch_ = current_key_scratch_;
  has_prev_key_ = true;

  if (!iter_.Next()) {
    result->bound_check_result = IterBoundCheck::kOutOfBound;
    result->key = Slice();
    return Status::OK();
  }

  result->key = iter_.Key();
  current_key_scratch_ = result->key.ToString();
  result->key = Slice(current_key_scratch_);

  // Use previous separator: if prev >= limit, current block is out of bounds.
  result->bound_check_result = CheckBounds(Slice(prev_key_scratch_));
  return Status::OK();
}

IterBoundCheck TrieIndexIterator::CheckBounds(
    const Slice& reference_key) const {
  if (!prepared_ || scan_opts_.empty()) {
    return IterBoundCheck::kInbound;
  }
  if (current_scan_idx_ >= scan_opts_.size()) {
    return IterBoundCheck::kOutOfBound;
  }

  const auto& opts = scan_opts_[current_scan_idx_];
  if (opts.range.limit.has_value()) {
    const Slice& limit = opts.range.limit.value();
    if (comparator_->Compare(reference_key, limit) >= 0) {
      return IterBoundCheck::kOutOfBound;
    }
  }
  return IterBoundCheck::kInbound;
}

This is conservative: when prev_sep < limit but current_sep >= limit, the block is returned as kInbound even though it might contain no qualifying keys. The data-level iterator handles per-key filtering, which is correct per the UDI API contract.


Finding 2: Multi-scan bounds applied to the wrong scan (Medium)

Problem

current_scan_idx_ is initialized to 0 in Prepare() and never advanced anywhere in the code:

// trie_index_factory.cc, Prepare()
current_scan_idx_ = 0;
prepared_ = true;

// CheckBounds() always reads scan_opts_[0]:
const auto& opts = scan_opts_[current_scan_idx_];  // always index 0

SeekAndGetResult() and NextAndGetResult() never increment current_scan_idx_. When multiple ScanOptions are provided via Prepare(), all bounds checks evaluate against the first scan's limit, regardless of which scan range the caller is currently operating in.

Example

Five blocks with separators "b", "d", "f", "h", "j". Two scan ranges:

Scan 0: ["a", "c")  — should cover block 0 (separator "b")
Scan 1: ["e", "g")  — should cover block 2 (separator "f")

Step  Action     Separator       Scan idx used      Limit checked          Result
1     Seek("a")  "b" (block 0)   0                  "c"                    kInbound (correct)
2     Next()     "d" (block 1)   0                  "c"                    kOutOfBound — scan 0 done (correct)
3     Seek("e")  "f" (block 2)   0 (should be 1)    "c" (should be "g")    kOutOfBound (incorrect)

At step 3, the caller seeks into scan 1's range. Block 2's separator "f" is within scan 1's limit "g", so it should be kInbound. But current_scan_idx_ is still 0, so CheckBounds() compares "f" against scan 0's limit "c": "f" >= "c" → kOutOfBound. Block 2 is skipped.

This bug only affects multi-scan workloads. Single-scan iteration (the common case) is unaffected.

Reproducer test

TEST_F(TrieIndexFactoryTest, MultiScanBoundsAppliedToWrongScan) {
  UserDefinedIndexOption option;
  option.comparator = BytewiseComparator();

  std::unique_ptr<UserDefinedIndexBuilder> udi_builder;
  ASSERT_OK(factory_->NewBuilder(option, udi_builder));

  struct BlockDef {
    const char* last_key;
    const char* next_first;
    uint64_t offset;
  };
  BlockDef blocks[] = {
      {"az", "c", 0},
      {"cz", "e", 1000},
      {"ez", "g", 2000},
      {"gz", "i", 3000},
      {"iz", nullptr, 4000},
  };
  for (const auto& b : blocks) {
    UserDefinedIndexBuilder::BlockHandle handle{b.offset, 500};
    std::string scratch;
    if (b.next_first) {
      Slice next(b.next_first);
      udi_builder->AddIndexEntry(Slice(b.last_key), &next, handle, &scratch);
    } else {
      udi_builder->AddIndexEntry(Slice(b.last_key), nullptr, handle, &scratch);
    }
  }

  Slice index_contents;
  ASSERT_OK(udi_builder->Finish(&index_contents));

  std::unique_ptr<UserDefinedIndexReader> reader;
  ASSERT_OK(factory_->NewReader(option, index_contents, reader));

  ReadOptions ro;
  auto iter = reader->NewIterator(ro);

  ScanOptions scans[] = {
      ScanOptions(Slice("a"), Slice("c")),   // scan 0
      ScanOptions(Slice("e"), Slice("g")),   // scan 1
  };
  iter->Prepare(scans, 2);

  // Scan 0
  IterateResult result;
  ASSERT_OK(iter->SeekAndGetResult(Slice("a"), &result));
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kInbound);
  ASSERT_EQ(iter->value().offset, 0);

  ASSERT_OK(iter->NextAndGetResult(&result));
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kOutOfBound);

  // Scan 1 — seek into second range
  ASSERT_OK(iter->SeekAndGetResult(Slice("e"), &result));
  // BUG: returns kOutOfBound because current_scan_idx_ is still 0,
  // so CheckBounds() compares "f" against scan 0's limit "c".
  // Should be kInbound — "f" < scan 1's limit "g".
  ASSERT_EQ(result.bound_check_result, IterBoundCheck::kOutOfBound);  // buggy
}

Fix

In SeekAndGetResult(), advance current_scan_idx_ past any scans whose limit is <= the seek target before checking bounds. Scans are ordered and non-overlapping, so a simple forward scan suffices:

Status TrieIndexIterator::SeekAndGetResult(const Slice& target,
                                           IterateResult* result) {
  // Advance current_scan_idx_ past any scans whose limit <= target.
  // This handles the multi-scan case where the caller seeks into a later
  // scan range after the previous scan returned kOutOfBound.
  if (prepared_) {
    while (current_scan_idx_ < scan_opts_.size()) {
      const auto& opts = scan_opts_[current_scan_idx_];
      if (opts.range.limit.has_value() &&
          comparator_->Compare(target, opts.range.limit.value()) >= 0) {
        current_scan_idx_++;
      } else {
        break;
      }
    }
  }

  has_prev_key_ = false;

  // ... rest of SeekAndGetResult (see Finding 1 fix) ...
}

Also reset has_prev_key_ in Prepare():

void TrieIndexIterator::Prepare(const ScanOptions scan_opts[],
                                size_t num_opts) {
  // ...
  current_scan_idx_ = 0;
  has_prev_key_ = false;
  prepared_ = true;
}

With both fixes applied, the multi-scan test verifies end-to-end:

Step  Action     Separator     Scan idx       Reference key  Limit  Result
1     Seek("a")  "b" (blk 0)   0              target "a"     "c"    kInbound
2     Next()     "d" (blk 1)   0              prev "b"       "c"    kInbound (conservative)
3     Next()     "f" (blk 2)   0              prev "d"       "c"    kOutOfBound — scan 0 done
4     Seek("e")  "f" (blk 2)   1 (advanced)   target "e"     "g"    kInbound
5     Next()     "h" (blk 3)   1              prev "f"       "g"    kInbound (conservative)
6     Next()     "j" (blk 4)   1              prev "h"       "g"    kOutOfBound — scan 1 done

@xingbowang

A great number of unit tests have been added. Thank you for doing this.
As RocksDB has many features, internally we use stress tests to validate random feature combinations. It would be great if you could integrate the trie index into the stress test. It will help reveal bugs when it is used with other features.

@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch from f70d1bb to c08e895 on February 15, 2026 at 15:32
@zaidoon1

A great number of unit tests have been added. Thank you for doing this. As RocksDB has many features, internally we use stress tests to validate random feature combinations. It would be great if you could integrate the trie index into the stress test. It will help reveal bugs when it is used with other features.

Will do! I'm just working on fixing some failing builds right now. I'm also running profiling locally, identifying hotspots and fixing them. I'll address the AI review comments and start working on the stress test after that.

@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch 2 times, most recently from 4240604 to ad02d56 on February 15, 2026 at 15:55
@xingbowang

I just discussed this change with AI. I assume the index is densely packed, so that it is relatively small and can be loaded into memory completely. If I am wrong, please correct me.

If this is the case, performance is likely bottlenecked on cache misses. AI pointed out these issues.

A. Multiple Separate Arrays (Problematic)
∙ LOUDS-Sparse uses separate sequences: s_labels_, s_has_child_, s_louds_, s_child_start_pos_, s_child_end_pos_
∙ Accessing these in sequence causes cache misses
∙ However, the SuRF paper mentions this is mitigated by:
  ∙ Position correspondence between arrays
  ∙ Prefetching computed addresses

In the meantime, I am wondering whether we could do batch execution to hide the cache-miss latency. It would interleave the execution of multiple queries. This could boost throughput significantly while losing a bit of latency.

If you are interested, maybe explore this direction as well: expand the UDI to support a batch API, then extend the MultiGet API to leverage it.

@zaidoon1

I just discussed this change with AI. I assume the index is densely packed, so that it is relatively small and can be loaded into memory completely. If I am wrong, please correct me.

If this is the case, performance is likely bottlenecked on cache misses. AI pointed out these issues.

A. Multiple Separate Arrays (Problematic) ∙ LOUDS-Sparse uses separate sequences: s_labels_, s_has_child_, s_louds_, s_child_start_pos_, s_child_end_pos_ ∙ Accessing these in sequence causes cache misses ∙ However, the SuRF paper mentions this is mitigated by: ∙ Position correspondence between arrays ∙ Prefetching computed addresses

In the meantime, I am wondering whether we could do batch execution to hide the cache-miss latency. It would interleave the execution of multiple queries. This could boost throughput significantly while losing a bit of latency.

If you are interested, maybe explore this direction as well: expand the UDI to support a batch API, then extend the MultiGet API to leverage it.

You're correct: the trie index is densely packed into a single contiguous meta-block and loaded entirely into memory via the block cache.

On cache misses from multiple separate arrays:
The arrays are not separate heap allocations. Everything is serialized into a single contiguous buffer during Finish(), and InitFromData() uses zero-copy pointers into that buffer for all hot-path data:

  • s_labels_data_ → raw pointer into buffer
  • s_has_child_.words_ / .rank_lut_ → raw pointers into buffer
  • s_louds_.words_ / .rank_lut_ → raw pointers into buffer
  • s_chain_bitmap_, s_chain_suffix_data_, handle_offsets_, handle_sizes_ → raw pointers into buffer

Only the child position lookup tables (s_child_start_pos_, s_child_end_pos_) and chain metadata vectors are copied into separate std::vectors during deserialization. These are small — 8 bytes per internal node.
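
A hedged illustration of that zero-copy pattern (struct and function names are assumptions; the real InitFromData() also validates sizes and 8-byte alignment before pointing into the buffer):

#include <cstddef>
#include <cstdint>

struct HandleArrayView {
  const uint64_t* offsets = nullptr;   // points directly into the index block
  const uint64_t* sizes = nullptr;
  size_t num = 0;
};

// `data` must be 8-byte aligned and outlive the view; it holds `num` offsets
// followed by `num` sizes, each a fixed-width uint64_t.
inline HandleArrayView LoadHandlesZeroCopy(const char* data, size_t num) {
  HandleArrayView v;
  v.offsets = reinterpret_cast<const uint64_t*>(data);
  v.sizes = v.offsets + num;
  v.num = num;
  return v;
}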

Concrete trie sizes from benchmarks:

Pattern      Keys    Total Trie Size  Fits In
Numeric 16B   1,000  ~10.7 KB         L1 (64 KB)
Numeric 16B  32,000  ~333 KB          L2 (256 KB–1 MB)
Hex 16B       1,000  ~134 KB          L2
Short 4B      1,000  ~10.6 KB         L1

For typical RocksDB SST files (4 KB data blocks, 64–256 MB file → 16K–64K separator keys), the entire trie fits in L2.
After the first Seek warms the cache, subsequent Seeks on the same SST file hit L1/L2.

Per-level memory accesses for one sparse iteration (the hot path):

Order  Array                              Storage    What
1      s_labels_data_[start]              zero-copy  1 byte read
2      s_has_child_.words_[start/64]      zero-copy  1 word (Rank1AndBit fuses GetBit + Rank1)
3      s_has_child_.rank_lut_[start/256]  zero-copy  1 word
4      s_child_start_pos_[child_idx]      vector     1 uint32
5      s_child_end_pos_[child_idx]        vector     1 uint32

That is five memory accesses per trie level, where accesses 1–3 go into the same contiguous buffer (close addresses → likely the same or adjacent cache lines) and 4–5 are adjacent vector elements. With path compression, entire chains of fanout-1 nodes are skipped with a single memcmp on contiguous suffix bytes instead of five accesses per level.
The SuRF paper's point about position correspondence applies here: position P in s_labels_, s_has_child_, and s_louds_ all describe the same logical edge, so their word-level accesses at P/64 are numerically close, improving spatial locality even across arrays.
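
A hedged sketch of the Rank1AndBit fusion mentioned in the table (written here as a free function with illustrative parameters; the PR adds it as Bitvector::Rank1AndBit()):

#include <bit>
#include <cstdint>
#include <utility>

// Returns {GetBit(pos), Rank1(pos + 1)} with a single fetch of the word containing pos.
inline std::pair<bool, uint32_t> Rank1AndBitSketch(const uint64_t* words,
                                                   const uint32_t* rank_lut,
                                                   uint32_t pos) {
  const uint64_t w = words[pos / 64];
  const bool bit = (w >> (pos % 64)) & 1;
  uint32_t rank = rank_lut[pos / 256];                          // coarse sample
  for (uint32_t i = (pos / 256) * 4; i < pos / 64; ++i) {
    rank += static_cast<uint32_t>(std::popcount(words[i]));     // whole words before w
  }
  const uint32_t r = pos % 64;
  const uint64_t mask = (r == 63) ? ~uint64_t{0} : ((uint64_t{1} << (r + 1)) - 1);
  rank += static_cast<uint32_t>(std::popcount(w & mask));       // bits [word start, pos]
  return {bit, rank};
}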

On batch execution / MultiGet integration:
I looked at the current MultiGet pipeline, and the index lookup is actually the only unbatched stage. Bloom filter checks use a two-pass prefetch+check pattern (PrepareHash → PREFETCH → HashMayMatchPrepared), block cache lookups are batched (StartAsyncLookupFull + WaitAll), and disk I/O is batched (MultiRead with coalescing). But index Seeks are strictly sequential: one iiter->Seek() per key.

A batch trie Seek could use software pipelining to interleave multiple queries, similar to what RocksDB already does for bloom filters. The UDI interface already has Prepare() which could be extended for point-lookup batches.
That said, the implementation complexity would be significant: trie Seek has complex control flow (dense/sparse transitions, prefix keys, chain matching, backtracking), and interleaving N queries through that state machine requires either coroutine-style suspension or manually maintaining per-query state. The bloom filter two-pass pattern works because bloom probing is a fixed set of independent memory accesses; trie traversal is data-dependent, where each level's access depends on the previous level's result.

The benefit is also uncertain without profiling a real MultiGet workload with perf stat to measure actual cache miss rates. If MultiGet keys for the same SST are sorted (which they typically are), consecutive Seeks follow similar trie paths and get good cache reuse after the first cold access. The cold first-Seek per SST is the main scenario where batch prefetching would help.

I think this is worth exploring as a follow-up once the basic trie index is stable, but would want to measure actual L2/L3 miss rates on a production-like workload first to quantify the opportunity before committing to the implementation complexity.

@zaidoon1

Also, optimization-wise, I'm done. I think this is as good as it gets for now as a first pass. I'll look at the stress testing next.

Implement a Fast Succinct Trie (FST) index based on LOUDS encoding as a
User Defined Index plugin for RocksDB's block-based tables. This is the
first step toward supporting trie-based indexing per issue facebook#12396.

The trie uses a hybrid LOUDS-Dense (upper levels, 256-bit bitmaps) +
LOUDS-Sparse (lower levels, label arrays) encoding inspired by the SuRF
paper (Zhang et al., SIGMOD 2018)
https://www.pdl.cmu.edu/PDL-FTP/Storage/surf_sigmod18.pdf . The boundary
between dense and sparse levels is automatically chosen to minimize total
space.

Key components:

- Bitvector with O(1) rank and O(log n) select using a rank LUT sampled
  every 256 bits with popcount intrinsics. Uses uint32_t rank LUT entries
  (halving LUT memory overhead vs uint64_t, safe because trie bitvectors
  are bounded well below 4 billion bits). Includes word-level AppendWord()
  for efficient dense bitmap construction and AppendMultiple() optimized
  at word granularity for bulk bit fills.
- Streaming trie builder using flat per-level arrays with deferred internal
  marking, handle migration for prefix keys, and lazy node creation. Infers
  trie structure directly from sorted keys via LCP analysis in a single
  pass (no intermediate tree). Merged node-per-level and label-per-level
  cutoff computation into a single pass over the key set.
- LoudsTrie for immutable querying with BFS-ordered handle reordering built
  into the single-pass level-by-level serialization. Move-only semantics
  with correct pointer re-seating after std::string move.
- LoudsTrieIterator with rank-based trie traversal, key reconstruction from
  the trie path, and stack-based backtracking for Next(). Uses packed 8-byte
  LevelPos (is_dense flag encoded in bit 63) and autovector<LevelPos, 24>
  to avoid heap allocation. Key reconstruction uses a raw char buffer
  (key_buf_/key_len_) allocated once to MaxDepth()+1 bytes, so appending
  each byte is a single inlined store + increment with no function call
  overhead.
- TrieIndexFactory/Builder/Reader/Iterator implementing the
  UserDefinedIndexFactory interface.
- Zero-copy block handle loading using two fixed-width uint64_t arrays
  (offsets + sizes) with 8-byte alignment, enabling O(1) initialization via
  direct pointer assignment into the serialized data.

Seek hot path optimizations:

- Fanout-1 sparse fast path: most sparse nodes in tries built from
  zero-padded numeric keys have exactly one child. The Seek loop detects
  this case (start_pos + 1 == end_pos) and inlines a single byte comparison,
  avoiding the full SparseSeekLabel function call and reducing branch logic
  to a single comparison + conditional.
- Linear scan for small sparse nodes: SparseSeekLabel uses sequential scan
  for nodes with <=16 labels instead of binary search. This is faster for
  the common 10-child digit nodes where the branch misprediction cost of
  binary search outweighs the linear scan cost.
- Rank reuse: DenseLeafIndexFromRankAndHasChildRank and
  SparseLeafIndexFromHasChildRank overloads accept pre-computed
  has_child_rank from the Seek descent, avoiding redundant Rank1 calls
  when computing the final leaf index.

General performance optimizations:

- Select-free sparse traversal: precomputed child position lookup tables
  (s_child_start_pos_/s_child_end_pos_, uint32_t per internal node)
  eliminate Select1 calls during Seek. Sparse traversal tracks
  (start_pos, end_pos) directly instead of node_num, using only Rank1
  (O(1)) + array lookup (O(1)) for child descent.
- Binary search for sparse label lookup (std::lower_bound) as fallback
  for nodes with >16 labels.
- Popcount-based Select64 via 6-step binary search within 64-bit words.
- Cached label_rank pattern to eliminate redundant Rank1 calls in hot paths
  (Seek, Next, Advance all cache and reuse the label_rank computed for
  has_child checking).
- Leaf index fast path: when there are no prefix keys (the common case),
  SparseLeafIndex and DenseLeafIndexFromRank skip the prefix-key Rank1
  calls entirely, reducing SparseLeafIndex from 3 Rank1 calls to 1.
- MSVC portability using RocksDB's BitsSetToOne/CountTrailingZeroBits.
…ration

- Security: fix integer overflow in InitFromData (remaining/16 check)
- Correctness: fix alignment padding to use buffer offset, not pointer
- Validation: reject max_depth > 65536 to prevent key_cap overflow
- Validation: reject non-BytewiseComparator in NewBuilder/NewReader
- Safety: use abort() for deprecated virtual methods
- Memory: include aux vectors in ApproximateMemoryUsage
- Tests: add move semantics, corruption, comparator, memory tests
- db_bench: add --use_trie_index flag for benchmarking
- Style: named constants, constexpr methods, noexcept move ctor
Add unsigned suffix (u) to integer literals compared against unsigned
return types (uint64_t, size_t) in ASSERT_EQ/ASSERT_GT macros. GCC with
-Werror -Wsign-compare rejects mixed signed/unsigned comparisons in
gtest's CmpHelper template instantiations.
…no class

Key changes:
- Add select1/select0 hint arrays to Bitvector for O(1) Select (was O(log n)
  binary search). Hints store rank LUT indices at every 256th one/zero bit,
  narrowing the search to a linear scan of 1-2 rank samples.
- Replace raw uint64_t block handle arrays (16 bytes/key) with packed uint32_t
  arrays for offsets and sizes (8 bytes/key). BFS leaf order does not match
  key-sorted order for keys of different lengths, so Elias-Fano cannot be used.
- Add EliasFano class (bitvector.h/cc) for compressed monotone sequences.
  Not used for handles due to the BFS ordering issue, but available for future
  use with other monotone sequences.
- Remove all debug fprintf/fflush trace statements from SerializeAll().
- Add standalone profiling tools (local-only, not tracked in repo).

Space reduction: 19.2 -> 11.3 bytes/key (41% reduction) at 32K 16-byte keys.
Seek performance: ~125 ns/op (no regression).
On Linux, uint64_t is 'unsigned long' not 'unsigned long long', so %llu
triggers -Werror,-Wformat with clang-18. Use the portable PRIu64 macro.
Two correctness bugs in TrieIndexIterator::CheckBounds():

1. Upper-bound pruning dropped valid blocks (High). CheckBounds()
   compared the current separator key against the scan limit. Since the
   trie stores separator keys (upper bounds on block contents), not
   first-in-block keys, this prematurely rejected blocks that still
   contained keys within the limit. Fix: use the seek target (for Seek)
   or previous separator (for Next) as the reference key. This matches
   the UDI API contract in user_defined_index.h:114-121.

2. Multi-scan bounds applied to wrong scan (Medium). current_scan_idx_
   was never advanced, so all bounds checks evaluated against scan 0's
   limit. Fix: in SeekAndGetResult(), advance current_scan_idx_ past
   scans whose limit <= the seek target before checking bounds.

Added regression tests:
- UpperBoundDoesNotDropValidBlocks: verifies blocks with keys < limit
  are not skipped when their separator >= limit.
- MultiScanBoundsAdvanceCorrectly: verifies multi-scan iteration uses
  the correct scan's limit for each seek.
- Guard right-shift by low_bits_ when low_bits_==64 in EliasFano::BuildFrom
  to avoid undefined behavior (shift >= type width).
- Guard word-boundary crossing shift: check bit_idx > 0 before shifting by
  (64 - bit_idx) to avoid shift by 64.
- Initialize consumed to 0 in EliasFano::InitFromData to satisfy
  core.uninitialized.Assign checker.
- Remove dead stores to p and remaining at end of LoudsTrie::InitFromData.
Add Bitvector::Rank1AndBit() that computes both GetBit(pos) and
Rank1(pos+1) in a single pass, avoiding a redundant word access when
both operations target the same bitvector position. This is the common
pattern in sparse trie traversal where we check has_child then compute
rank for child/leaf index lookup.

Wire Rank1AndBit into three hot paths in LoudsTrieIterator::Seek():
 - Fanout-1 exact match: replaces GetBit + two separate Rank1 calls
   (internal vs leaf branches) with a single Rank1AndBit call
 - Fanout-1 mismatch (label > target): replaces GetBit + Rank1
 - General sparse path (fanout > 1): replaces GetBit + Rank1

Benchmark (32K random 16-byte hex keys, 10M lookups):
 Before: ~160 ns/op
 After:  ~153 ns/op  (~4.4% faster)
Implement path compression for the LOUDS-sparse trie by detecting and
collapsing fanout-1 chains (sequences of single-child nodes) into
single edges that can be compared with memcmp instead of per-level
Rank1 traversal.

Builder changes (louds_trie.cc):
- Detect fanout-1 chains >= 8 nodes during serialization
- Store chain metadata: bitmap (1 bit per internal label), suffix bytes,
  chain lengths (uint16), and end-child indices (uint32)
- Filter chains with num_chains <= num_keys heuristic to disable chains
  for key patterns where they hurt (random hex, URLs) and enable them
  where they help (numeric keys with long shared prefixes)
- Apply 10% space budget cap to prevent metadata explosion

Seek changes (louds_trie.h, louds_trie.cc):
- Template SeekImpl<bool kHasChains> with if constexpr guards around
  all chain-handling code (~200 lines in two blocks)
- When kHasChains=false, compiler eliminates all chain code entirely,
  producing zero i-cache overhead for tries without chains
- Dispatch once at iterator construction via has_chains_ flag
- Follows the same pattern as RocksDB's BlockIter::ParseNextKey template

Chain matching handles all cases:
- Full chain match: skip entire chain, continue at end node
- Partial match with divergence: find mismatch point, descend or advance
- Target exhausted mid-chain: descend to leftmost leaf
- Chain ending at leaf vs internal node with fanout > 1

Benchmark changes (trie_index_test.cc):
- Replace raw trie vs binary search benchmark with trie vs real RocksDB
  IndexBlockIter comparison using InternalKeys and production code paths

Performance (16-byte zero-padded numeric keys, 500K lookups):
- 500 keys:   97ns vs 142ns baseline  (+32% faster)
- 1000 keys:  98ns vs 146ns baseline  (+33% faster)
- 32000 keys: 136ns vs 170ns baseline (+20% faster)

Zero regression for non-chain patterns (hex, URL, short keys):
all within noise of baseline measurements.
@zaidoon1 zaidoon1 force-pushed the zaidoon/trie-based-udi-index branch from 4ffdde3 to c214056 on February 16, 2026 at 06:23
The blackbox crash test reuses the same DB across iterations. When
use_trie_index was a lambda, it could change between iterations: iteration
1 might run without trie index (writing deletes), then iteration 2 enables
trie index and fails on reopen because existing SSTs contain non-Put types.

Fix by making use_trie_index a non-lambda constant, matching the pattern
used by use_put_entity_one_in for the same reason.