This document provides a deep dive into each component of the three storage engines implemented in this project.
Purpose: In-memory sorted structure for recent writes before they're flushed to disk.
Data Structure:
type MemTableEntry struct {
Key string
Value []byte
Sequence uint64 // Monotonically increasing, determines version
Deleted bool // Tombstone marker
}Implementation Details:
- Uses a sorted slice with binary search (simple and effective)
- Thread-safe with RWMutex (multiple readers, single writer)
- Entries maintained in sorted order by key
- Sequence numbers provide total ordering of all writes
Why Sorted Slice Instead of Tree?
Alternatives considered:
1. Red-Black Tree: O(log n) insert, O(log n) search
- Pros: Balanced, predictable
- Cons: Complex, cache-unfriendly
2. Skip List: O(log n) average insert/search
- Pros: Simpler than RB-tree, concurrent-friendly
- Cons: Probabilistic, variable height
3. Sorted Slice: O(n) insert, O(log n) search
- Pros: Cache-friendly, simple, fast iteration
- Cons: Slow inserts for large sizes
Choice: Sorted slice because memtable is small (4MB),
flushed frequently, and iteration is fast (needed for flush).
Operations:
// Put: O(n) insertion (binary search + shift)
Put(key, value, seq) -> Updates existing or inserts new
// Get: O(log n) binary search
Get(key) -> (value, sequence, deleted, found)
// GetAllEntries: O(1) (already sorted!)
GetAllEntries() -> []MemTableEntry // For flushingMemory Management:
- Size tracked approximately:
len(key) + len(value) + 16 bytes overhead - Becomes immutable when
size >= maxSize(default 4MB) - Old entries replaced (not appended) to save space
Purpose: Provide durability for writes before they're persisted to SSTables.
Record Format:
[CRC32: 4 bytes] - Checksum for corruption detection
[Sequence: 8 bytes] - Global sequence number
[KeySize: 4 bytes] - Length of key
[ValueSize: 4 bytes] - Length of value
[Deleted: 1 byte] - Tombstone flag (0 or 1)
[Key: variable] - Actual key bytes
[Value: variable] - Actual value bytes
Why This Format?
- CRC first: Detect corruption before parsing
- Sizes before data: Know how much to read
- Self-contained: Each record independent
Recovery Process:
1. On startup, read WAL from beginning to end
2. For each record:
a. Validate CRC
b. If valid, apply to memtable
c. Track max sequence number
3. Resume from max sequence + 1
4. Keep WAL until memtable flushed
5. Delete old WAL, create new one
Crash Scenarios:
Scenario 1: Crash before memtable flush
- WAL has all writes
- Recovery replays WAL → memtable reconstructed
- Result: No data loss ✅
Scenario 2: Crash during memtable flush
- Partial SSTable written
- Recovery replays WAL → memtable reconstructed
- Re-flush memtable → new SSTable
- Result: Duplicate SSTable (cleaned up by compaction)
Scenario 3: Crash after memtable flush, before WAL delete
- SSTable persisted
- WAL still exists
- Recovery loads SSTable + replays WAL
- Result: Duplicate entries (sequence numbers deduplicate)
Purpose: Probabilistic data structure to skip SSTables that definitely don't contain a key.
Math Behind It:
Given:
- n = expected number of keys
- p = desired false positive rate (e.g., 0.01 for 1%)
Optimal parameters:
- m (number of bits) = -(n * ln(p)) / (ln(2)²)
- k (number of hashes) = (m/n) * ln(2)
Example with n=1000, p=0.01:
- m = -(1000 * ln(0.01)) / 0.48 ≈ 9585 bits (1.2 KB)
- k = (9585/1000) * 0.69 ≈ 7 hash functions
Implementation:
// Uses double hashing for efficiency
h_i(x) = (h1(x) + i * h2(x)) mod m
// Instead of k different hash functions, compute just 2:
h1 = FNV64a(key)
h2 = FNV64(key)
// Then derive k hashes via linear combinationWhy Double Hashing?
- Efficient: Only 2 hash computations instead of k
- Good distribution: Linear combination spreads bits well
- Fast: FNV is a fast non-cryptographic hash
Performance Impact:
Without Bloom Filter:
Get("nonexistent_key"):
- Check L0: 4 files × 100µs = 400µs
- Check L1: 10 files × 100µs = 1000µs
- Check L2: 100 files × 100µs = 10000µs
Total: ~11.4ms per negative lookup!
With Bloom Filter (1% FP):
- Bloom checks: 5 levels × 1µs = 5µs
- 99% stop here (negative result)
- 1% false positive → 11.4ms
Average: 5µs + 0.01 * 11400µs = 119µs
Speedup: 96x faster! 🚀
Purpose: Immutable sorted file on disk with efficient search.
File Layout:
┌─────────────────────────────────────┐
│ Data Block 1 (4KB) │ ← Sorted entries
├─────────────────────────────────────┤
│ Data Block 2 (4KB) │
├─────────────────────────────────────┤
│ ... │
├─────────────────────────────────────┤
│ Data Block N (4KB) │
├─────────────────────────────────────┤
│ Index Block │ ← Maps first key → block offset
├─────────────────────────────────────┤
│ Metadata Block │ ← MinKey, MaxKey
├─────────────────────────────────────┤
│ Bloom Filter │ ← Membership test
├─────────────────────────────────────┤
│ Footer (28 bytes) │ ← Offsets + magic number
└─────────────────────────────────────┘
Data Block Format:
[NumEntries: 4 bytes]
[Entry1]
[Entry2]
...
Entry format:
[KeySize: 4 bytes]
[ValueSize: 4 bytes]
[Deleted: 1 byte]
[Key: variable]
[Value: variable]
Index Block Format:
[NumEntries: 4 bytes]
[IndexEntry1]
[IndexEntry2]
...
IndexEntry format:
[KeySize: 4 bytes]
[BlockOffset: 8 bytes] ← Position of data block in file
[Key: variable] ← First key in that block
Search Algorithm:
func Get(key string) (value []byte, found bool, err error) {
// Step 1: Check bloom filter (1µs)
if !bloomFilter.MayContain(key) {
return nil, false, nil // Definitely not present
}
// Step 2: Binary search index to find block (log n)
blockIdx := binarySearch(index, key)
// Step 3: Read block from disk (100µs)
block := readBlock(blockIdx)
// Step 4: Linear search within block (10µs)
return searchBlock(block, key)
}Why 4KB Blocks?
Alternatives:
- 512 bytes: Too small, index overhead too high
- 64KB: Too large, waste I/O for small values
4KB is optimal because:
1. Matches page size (efficient I/O)
2. ~100 entries per block (good granularity)
3. Small enough for fast linear search within block
4. Large enough to amortize index overhead
Purpose: Construct immutable SSTables from sorted entries.
Build Process:
1. Initialize:
- Create file
- Init bloom filter (based on expected keys)
- Init current block buffer
2. For each entry (must be in sorted order!):
- Add key to bloom filter
- Track minKey (first key) and maxKey (last key)
- Append entry to current block
- If block full (>4KB):
a. Flush block to disk
b. Add index entry (first key → block offset)
c. Reset block buffer
3. Finish:
- Flush remaining block
- Write index block
- Write metadata block (minKey, maxKey)
- Write bloom filter
- Write footer (offsets + magic)
- Sync to disk
- Close file
Critical Invariant: Entries MUST be added in sorted order!
Why? Because we build index as we go:
- First key of each block goes in index
- If unsorted, index would be wrong
- Result: Binary search would fail!
Enforcement: Caller's responsibility (flush, compaction)
Block Flushing:
// When block exceeds 4KB:
1. Write block header (numEntries)
2. Write block to file
3. Pad to 4KB boundary (for alignment)
4. Create index entry:
- Key: first key in block
- Offset: file position of block start
5. Reset block bufferPurpose: Merge sorted files to reclaim space and reduce read amplification.
K-Way Merge Algorithm:
Problem: Merge N sorted files into 1 sorted file
Algorithm:
1. Create iterator for each input file
2. Build min-heap with first entry from each iterator
3. While heap not empty:
a. Pop smallest entry (by key)
b. If duplicate key, keep entry with highest sequence
c. Write entry to output file
d. Advance iterator that produced this entry
e. Push next entry from that iterator to heap
Time: O(k * n * log k) where k=files, n=total entries
Space: O(k) for heap
Handling Duplicates:
Example: Same key in multiple files
File 1: key="user1", seq=5, value="Alice"
File 2: key="user1", seq=8, value="Bob"
File 3: key="user1", seq=3, value="Charlie"
Merge process:
1. All 3 appear in heap simultaneously
2. Heap orders by (key, -sequence):
("user1", -8) ← Pop this (highest sequence)
("user1", -5)
("user1", -3)
3. Write ("user1", seq=8, "Bob")
4. Skip ("user1", seq=5) and ("user1", seq=3)
Result: Only latest version kept!
Tombstone Handling:
Rule: Drop tombstones ONLY in final level (L4)
Why?
- Tombstone in L2 might delete key in L3/L4
- If dropped too early, old version resurfaces!
Example:
L1: ("user1", DELETED, seq=10)
L2: ("user1", "Alice", seq=5)
If we drop tombstone in L1→L2 compaction:
Result: ("user1", "Alice") ← WRONG! Should be deleted
Correct: Keep tombstone until L4
Compaction Strategies:
L0 → L1 (Special Case):
Problem: L0 files may overlap (from memtable flushes)
Solution: Merge ALL L0 files at once
Example:
L0-001: [a-m, z] ← Overlaps!
L0-002: [b-n, x] ← Overlaps!
L0→L1 compaction:
- Pick ALL L0 files
- Find overlapping L1 files
- Merge all together
- Output: Non-overlapping L1 files
L1 → L2, L2 → L3, L3 → L4 (Leveled):
Strategy: Pick 1 file from source, merge with overlapping target files
Example L1→L2:
L1: [a-m], [n-z]
L2: [a-f], [g-m], [n-s], [t-z]
Step 1: Pick L1 file [a-m]
Step 2: Find overlapping L2 files: [a-f], [g-m]
Step 3: Merge these 3 files
Step 4: Output: New L2 files [a-m] (merged and compacted)
Result:
- Removed 3 old files ([a-m] from L1, [a-f] and [g-m] from L2)
- Created 1 new file ([a-m] in L2)
- Reclaimed space from deleted/overwritten keys
Purpose: Organize SSTables across multiple levels and track when compaction needed.
Level Configuration:
L0: 40 MB (4-10 files) ← May overlap
L1: 400 MB (10-100 files) ← Non-overlapping
L2: 4 GB (100-1000 files) ← Non-overlapping
L3: 40 GB (1000-10000 files) ← Non-overlapping
L4: 400 GB (10000+ files) ← Non-overlappingWhy 10x Size Ratio?
Write amplification = 1 + (size_ratio × num_levels)
With 10x ratio and 4 levels:
WA = 1 + (10 × 4) = 41x
Seems high, but alternatives worse:
- 5x ratio, 6 levels: WA = 1 + (5 × 6) = 31x (more levels to check)
- 20x ratio, 2 levels: WA = 1 + (20 × 2) = 41x (huge compactions)
10x is industry standard (LevelDB, RocksDB)
Compaction Triggers:
L0: Trigger when >= 4 files (count-based)
Why? L0 files overlap, more files = slower reads
L1+: Trigger when size >= max_size (size-based)
Why? Non-overlapping, size matters more than count
File Organization:
L0: Unsorted (by time created)
- Don't sort, order doesn't matter
- All checked anyway (overlap)
L1-L4: Sorted by minKey
- Binary search possible
- Skip non-overlapping files
Purpose: Coordinate all components, handle reads/writes, manage background workers.
Write Path:
Put(key, value):
1. Generate sequence number (atomic increment)
2. Append to WAL (for durability)
3. Insert into active memtable
4. If memtable full:
a. Lock
b. Swap active ← new empty memtable
c. Set immutable ← old active
d. Signal flush worker (non-blocking)
e. Unlock
5. Return (fast!)
Read Path:
Get(key):
1. Check active memtable (fastest)
- If found, return immediately
2. Check immutable memtable (if exists)
- If found, return immediately
3. Check L0 SSTables (ALL of them, may overlap)
- For each: bloom filter → binary search
- If found, return immediately
4. Check L1 SSTables (binary search by key range)
- Find files where minKey <= key <= maxKey
- For each: bloom filter → binary search
- If found, return immediately
5. Repeat for L2, L3, L4
6. Not found → return nil
Read amplification: Worst case checks all levels
Background Workers:
Flush Worker:
for {
select {
case <-flushChan:
if immutableMemtable != nil {
1. Get all entries from immutable memtable
2. Build SSTable from entries
3. Add SSTable to L0
4. Delete WAL (data now durable in SSTable)
5. Set immutableMemtable = nil
6. Check if L0 needs compaction
}
case <-closeChan:
return
}
}Compaction Worker:
for {
select {
case <-compactionChan:
if L0.needsCompaction() {
compactL0ToL1()
triggerNextIfNeeded(L1)
} else if L1.needsCompaction() {
compactL1ToL2()
triggerNextIfNeeded(L2)
} else if L2.needsCompaction() {
compactL2ToL3()
triggerNextIfNeeded(L3)
} else if L3.needsCompaction() {
compactL3ToL4()
}
case <-closeChan:
return
}
}Purpose: Divide keyspace into independent shards to reduce lock contention.
Architecture:
256 shards (power of 2 for fast modulo)
Each shard independently manages:
- In-memory index (map[uint64]Location)
- Active segment
- List of immutable segments
- Own lock (RWMutex)
Shard selection: hash(key) % 256
Why 256 Shards?
Trade-offs:
- 16 shards: Simple, but contention with >16 threads
- 256 shards: Sweet spot for modern CPUs (8-16 cores)
- 1024 shards: Overhead, diminishing returns
256 chosen because:
1. Enough parallelism for 16-32 threads
2. Low overhead (256 × small struct = <1MB)
3. Power of 2 (fast modulo with bitwise AND)
Entry Format:
[CRC32: 4 bytes] - Checksum
[Timestamp: 8 bytes] - Unix timestamp
[KeySize: 4 bytes] - Length of key
[ValueSize: 4 bytes] - Length of value
[Key: variable] - Key bytes
[Value: variable] - Value bytes
Append-Only Writes:
Benefits:
1. Sequential I/O (fast!)
2. No random seeks
3. Simple crash recovery
4. No corruption from partial writes
Cost:
- Space amplification (old values not deleted)
- Need compaction
Strategy: Merge-rewrite all segments into one.
Process:
1. Scan all segments
2. Build map: key → latest value
3. Write map to new segment (sorted order)
4. Atomically replace old segments
5. Update in-memory index
Problem: Blocks writes during merge!
Solution: LSM-Tree (incremental compaction)
Purpose: Fixed 4KB page-based storage unit for B-tree nodes.
Page Layout:
┌───────────────────────────────────────────────────┐
│ Header (13 bytes) │
├───────────────────────────────────────────────────┤
│ Cell Directory (grows forward →) │
│ [offset1][offset2][offset3]... │
├───────────────────────────────────────────────────┤
│ Free Space │
├───────────────────────────────────────────────────┤
│ Cells (grow backward ←) │
│ [cell N]...[cell 3][cell 2][cell 1] │
└───────────────────────────────────────────────────┘
Size: 4096 bytes (4KB)
Header Format:
[Version: 1 byte] - Page format version (0=V1, 1=V2)
[PageType: 1 byte] - LEAF=1, INTERNAL=2
[NumCells: 2 bytes] - Number of cells in page
[FreeOffset: 2 bytes] - Start of free space
[RightPtr: 4 bytes] - Right sibling (leaf) or rightmost child (internal)
[Reserved: 3 bytes] - Future use
Cell Directory:
Array of 2-byte offsets pointing to cells
- Grows forward from byte 13
- Each entry: offset to cell in page
- Binary searchable (cells kept sorted by key)
Leaf Cell Format (V1):
[KeySize: 2 bytes] - Length of key
[ValueSize: 2 bytes] - Length of value
[Key: variable] - Actual key bytes
[Value: variable] - Actual value bytes
Example: key="user", value="Alice"
[0x00, 0x04][0x00, 0x05]['u','s','e','r']['A','l','i','c','e']
Total: 2 + 2 + 4 + 5 = 13 bytes
Leaf Cell Format (V2 - Varint Optimized):
[KeySize: 1-3 bytes] - Varint-encoded key length
[ValueSize: 1-3 bytes] - Varint-encoded value length
[Key: variable] - Actual key bytes
[Value: variable] - Actual value bytes
Example: key="user", value="Alice"
[0x04][0x05]['u','s','e','r']['A','l','i','c','e']
Total: 1 + 1 + 4 + 5 = 11 bytes (2 bytes saved!)
Internal Cell Format (V1):
[KeySize: 2 bytes] - Length of separator key
[ChildPageID: 4 bytes] - Left child page ID
[Key: variable] - Separator key
Interpretation: All keys in ChildPageID < Key
Internal Cell Format (V2):
[KeySize: 1-3 bytes] - Varint-encoded key length
[ChildPageID: 4 bytes] - Left child page ID
[Key: variable] - Separator key
Why 4KB Pages?
Alternatives considered:
1. 512 bytes: Too small, poor fanout (20-30 keys)
2. 8KB: Larger than OS page, wasted I/O
3. 64KB: Huge, cache-unfriendly
4KB is optimal because:
1. Matches OS page size (single syscall)
2. Good fanout: 128-357 keys per page (depending on key size)
3. Fits in L1/L2 cache
4. Standard in PostgreSQL, MySQL, SQLite
Why Two-Way Growth (Directory Forward, Cells Backward)?
Problem with linear layout:
- Insert in middle → shift all cells → expensive!
Two-way growth:
- Directory grows forward (small, fixed-size offsets)
- Cells grow backward (variable size)
- Free space in middle
- No cell shifting needed! Just update offsets
- Fragmentation isolated to free space calculation
Page Format Versioning:
V1 (Legacy):
- Fixed 2-byte size encoding
- Version byte = 0 (or unset)
- Backward compatible
V2 (Varint):
- Variable-length size encoding (1-3 bytes)
- Version byte = 1
- 18% space savings for small keys
- 17% more keys per page (306 → 357)
Key Operations:
// Insert cell (O(n) for directory shift, but no cell shift!)
func (p *Page) InsertCell(key, value []byte) bool {
// Check space
cellSize := varintSize16(len(key)) + varintSize16(len(value)) + len(key) + len(value)
if p.freeBytes() < cellSize + 2 {
return false // Page full
}
// Find insertion index (binary search)
idx := p.searchCell(key)
// Write cell at free space end (grows backward)
p.freeOffset -= cellSize
p.writeLeafCell(p.freeOffset, key, value)
// Insert offset in directory (grows forward)
p.insertOffsetAt(idx, p.freeOffset)
p.numCells++
return true
}
// Search cell (O(log n) binary search in directory)
func (p *Page) SearchCell(key []byte) int {
left, right := 0, int(p.numCells)
for left < right {
mid := (left + right) / 2
cellKey := p.cellAt(mid).Key
if bytes.Compare(cellKey, key) < 0 {
left = mid + 1
} else {
right = mid
}
}
return left
}Purpose: Manage page cache, disk I/O, and page lifecycle.
Architecture:
┌─────────────────────────────────────────────┐
│ Pager │
│ ┌─────────────────────────────────────────┐ │
│ │ LRU Page Cache (default 100 pages) │ │
│ │ pageID → *Page │ │
│ │ [1]->[5]->[12]->[3]->[8] │ │
│ │ ↑ MRU LRU ↓ │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ Dirty Pages Set │ │
│ │ {1, 5, 8} → Need flush │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ Free List │ │
│ │ Deleted page IDs for reuse │ │
│ │ [42, 137, 89] → Available │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ Metadata (Page 0) │ │
│ │ - Root page ID │ │
│ │ - Total pages │ │
│ │ - Free list head │ │
│ └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
↕ (I/O)
┌─────────────────────────────────────────────┐
│ Disk File (btree.db) │
│ [Page 0][Page 1][Page 2]...[Page N] │
│ Meta Root Internal Leaf │
└─────────────────────────────────────────────┘
LRU Cache Design:
Why LRU?
- Hot pages stay in memory (temporal locality)
- Automatic eviction of cold pages
- Simple to implement (doubly-linked list + hash map)
Cache hit: Page in memory (fast, ~50ns)
Cache miss: Read from disk (slow, ~100µs = 2000x slower!)
Hit rate target: >95% for typical workloads
Page Lifecycle:
1. Allocation:
- Check free list first (reuse deleted pages)
- If empty, append new page to file
- Return page ID
2. Load:
- Check cache (O(1) hash lookup)
- If miss, read from disk:
file.Seek(pageID * 4096)
file.Read(4096 bytes)
- Add to cache, update LRU
3. Modification:
- Modify page in cache
- Mark as dirty (add to dirty set)
- Don't flush immediately (lazy)
4. Eviction:
- When cache full, evict LRU page
- If dirty, flush to disk first
- Remove from cache
5. Flush:
- Write all dirty pages to disk
- Clear dirty set
- Keep pages in cache (still usable)
6. Free:
- Add page ID to free list
- Remove from cache
- Update metadata
Metadata Page (Page 0):
[Magic: 8 bytes] - "BTREEDB\0" (file type marker)
[Version: 4 bytes] - File format version
[PageSize: 4 bytes] - Always 4096
[RootPageID: 4 bytes] - Current root page
[NumPages: 4 bytes] - Total pages in file
[FreeListHead: 4 bytes] - First free page ID (linked list)
[Reserved: 4064 bytes] - Future use
Free List Structure (Linked List):
Deleted page IDs stored in linked list:
Page 42: [NextFree=137][...]
Page 137: [NextFree=89][...]
Page 89: [NextFree=0][...] (0 = end)
Allocation:
1. Pop head from free list (O(1))
2. Update metadata's FreeListHead
3. Return page ID
Better than bitmap (no fragmentation tracking needed)
Critical Operations:
// Get page (cache-aware)
func (p *Pager) GetPage(pageID uint32) (*Page, error) {
// Check cache
if page, ok := p.cache[pageID]; ok {
p.updateLRU(pageID) // Move to front
return page, nil
}
// Cache miss - read from disk
page := &Page{id: pageID, data: make([]byte, 4096)}
offset := int64(pageID) * 4096
_, err := p.file.ReadAt(page.data, offset)
if err != nil {
return nil, err
}
// Add to cache (may evict LRU)
p.addToCache(pageID, page)
return page, nil
}
// Flush dirty pages
func (p *Pager) Flush() error {
for pageID := range p.dirtyPages {
page := p.cache[pageID]
offset := int64(pageID) * 4096
_, err := p.file.WriteAt(page.data, offset)
if err != nil {
return err
}
}
p.dirtyPages = make(map[uint32]bool) // Clear dirty set
return p.file.Sync() // fsync
}Performance Tuning:
Cache Size Trade-off:
- Small (10 pages = 40KB):
+ Low memory
- High cache miss rate (many disk reads)
- Large (1000 pages = 4MB):
+ High cache hit rate
- High memory usage
- Default (100 pages = 400KB):
* Balanced for most workloads
* ~95% hit rate for working set < 400KB
Purpose: Helper functions for internal and leaf node operations.
Internal Node Operations:
Internal node structure:
Key1 < Key2 < Key3
↓ ↓ ↓ ↓
[P0] [P1] [P2] [P3]
P0: All keys < Key1
P1: Keys where Key1 ≤ k < Key2
P2: Keys where Key2 ≤ k < Key3
P3: All keys ≥ Key3 (RightPtr)
Search Child Algorithm:
func searchChild(page *Page, key []byte) uint32 {
// Binary search for separator key
idx := page.searchCell(key)
if idx == 0 {
// Key smaller than all separators → leftmost child
return page.cellAt(0).ChildPageID
} else if idx >= page.numCells() {
// Key larger than all separators → rightmost child
return page.rightPtr
} else {
// Key falls between separators
return page.cellAt(idx).ChildPageID
}
}
Example:
Internal page: [10, 20, 30], RightPtr=P3
Search(5): idx=0 → return P0 (leftmost)
Search(15): idx=1 → return P1
Search(25): idx=2 → return P2
Search(35): idx=3 (>= numCells) → return P3 (RightPtr)Insert Pointer Algorithm:
func insertPointer(page *Page, key []byte, childPageID uint32) error {
// Find insertion point
idx := page.searchCell(key)
// Insert cell (key + child pointer)
cell := InternalCell{
Key: key,
ChildPageID: childPageID,
}
// Write cell and update directory
return page.insertInternalCell(idx, cell)
}Leaf Node Operations:
Leaf node structure:
Sorted keys with values
[k1:v1][k2:v2][k3:v3]...[kN:vN] → RightPtr (next leaf)
All leaves form doubly-linked list for range scans:
[Leaf1] ⇄ [Leaf2] ⇄ [Leaf3] ⇄ [Leaf4]
Search Key Algorithm:
func searchKey(page *Page, key []byte) ([]byte, bool, error) {
// Binary search in leaf cells
idx := page.searchCell(key)
if idx < page.numCells() {
cell := page.cellAt(idx)
if bytes.Equal(cell.Key, key) {
return cell.Value, true, nil
}
}
return nil, false, nil
}Purpose: Maintain balanced tree by splitting full pages.
Why Splitting is Needed:
Problem: Page full, can't insert
4KB page can hold ~357 keys (with varint)
Once full, must split to maintain B-tree properties
Leaf Page Split:
Before:
┌────────────────────────────────────────────┐
│ Leaf: [1,2,3,4,5,6,7,8,9,10] ← FULL │
└────────────────────────────────────────────┘
After inserting key=4.5:
┌────────────────────────┐ ┌────────────────────────┐
│ Leaf1: [1,2,3,4,4.5] │→│ Leaf2: [5,6,7,8,9,10] │
└────────────────────────┘ └────────────────────────┘
↑
Separator: "5"
Parent Internal node gets: [5, PageID(Leaf2)]
Leaf Split Algorithm:
func (b *BTree) splitLeaf(pageID uint32, key, value []byte) ([]byte, uint32, error) {
oldPage := b.pager.GetPage(pageID)
// Create new page
newPage := b.pager.NewPage(LEAF)
// Find midpoint
midIdx := oldPage.numCells() / 2
// Move second half to new page
for i := midIdx; i < oldPage.numCells(); i++ {
cell := oldPage.cellAt(i)
newPage.insertCell(cell.Key, cell.Value)
}
// Truncate old page
oldPage.truncateCells(midIdx)
// Link pages (doubly-linked list for scans)
newPage.rightPtr = oldPage.rightPtr
oldPage.rightPtr = newPage.id
// Insert new key into appropriate page
separatorKey := newPage.cellAt(0).Key
if bytes.Compare(key, separatorKey) < 0 {
oldPage.insertCell(key, value)
} else {
newPage.insertCell(key, value)
}
b.pager.markDirty(oldPage.id)
b.pager.markDirty(newPage.id)
return separatorKey, newPage.id, nil
}Internal Page Split:
Before:
┌───────────────────────────────────────────────────┐
│ Internal: [10,20,30,40,50] ← FULL │
│ ↓ ↓ ↓ ↓ ↓ ↓ │
│ [P0][P1][P2][P3][P4][P5] │
└───────────────────────────────────────────────────┘
After inserting separator 25:
┌────────────────────────┐ ┌────────────────────────┐
│ Internal1: [10,20] │ │ Internal2: [40,50] │
│ ↓ ↓ ↓ │ │ ↓ ↓ ↓ │
│ [P0][P1][P2] │ │ [P4][P5][P6] │
└────────────────────────┘ └────────────────────────┘
↑
Separator: "30" (middle key)
Key difference: Middle key (30) MOVES UP to parent
(Doesn't stay in either child)
Internal Split Algorithm:
func (b *BTree) splitInternal(pageID uint32, key []byte, childPageID uint32) ([]byte, uint32, error) {
oldPage := b.pager.GetPage(pageID)
newPage := b.pager.NewPage(INTERNAL)
midIdx := oldPage.numCells() / 2
midCell := oldPage.cellAt(midIdx)
separatorKey := midCell.Key // Middle key goes UP
// Move second half to new page (EXCLUDING middle)
for i := midIdx + 1; i < oldPage.numCells(); i++ {
cell := oldPage.cellAt(i)
newPage.insertPointer(cell.Key, cell.ChildPageID)
}
// Set new page's rightmost child
newPage.rightPtr = oldPage.rightPtr
// Truncate old page (remove second half + middle)
oldPage.truncateCells(midIdx)
oldPage.rightPtr = midCell.ChildPageID // Middle's child becomes old page's rightmost
// Insert new pointer into appropriate page
if bytes.Compare(key, separatorKey) < 0 {
oldPage.insertPointer(key, childPageID)
} else {
newPage.insertPointer(key, childPageID)
}
return separatorKey, newPage.id, nil
}Root Split (Tree Growth):
Problem: Root page is full, needs split
But root has no parent!
Solution: Create NEW root
Before:
[Root: Full]
After:
[New Root: separator]
↓
[Old Root] [New Page]
Tree height increases by 1!
Root Split Algorithm:
func (b *BTree) splitRoot() error {
oldRootID := b.rootPageID
oldRoot := b.pager.GetPage(oldRootID)
// Create new root
newRoot := b.pager.NewPage(INTERNAL)
// Split old root
separatorKey, newPageID, err := b.split(oldRootID, ...)
if err != nil {
return err
}
// New root points to both children
newRoot.insertPointer(separatorKey, oldRootID) // Left child
newRoot.rightPtr = newPageID // Right child
// Update root page ID
b.rootPageID = newRoot.id
b.height++ // Tree grew taller
// Persist root change
b.pager.saveMetadata()
return nil
}Split Propagation:
Insert can trigger cascade of splits:
1. Insert into leaf → Leaf full → Split leaf
2. Insert separator into parent → Parent full → Split parent
3. Insert separator into grandparent → Grandparent full → Split grandparent
4. Eventually may reach root → Split root → Tree grows
Example:
[Root] [Root: Full] [New Root]
/ \ / \ / | \
[A] [B:Full] [A] [B][C] [A] [B] [C]
↓ ↓ ↓ ↓ ↓ ↓
Split B Insert C Cascade up if needed
Purpose: Reclaim space from underfull pages after deletions.
Underflow Detection:
Trigger: Page falls below 25% capacity
Why 25%? Balance between:
- Too high (30%+): Frequent rebalancing
- Too low (10%-): Wasted space
Min cells threshold:
- Internal: order/4 (e.g., 128/4 = 32 cells)
- Leaf: order/4 (e.g., 128/4 = 32 cells)
Rebalancing Strategies:
Strategy 1: Redistribution (Borrow from sibling)
Before:
Parent: [50]
↓ ↓
Left: [10,20,30,40] (4 cells)
Right: [60] (1 cell ← UNDERFULL)
After:
Parent: [40] ← Updated separator
↓ ↓
Left: [10,20,30] (3 cells)
Right: [40,60] (2 cells ← Borrowed)
Redistribution possible if sibling has enough cells
Strategy 2: Merge (Combine with sibling)
Before:
Parent: [50, 80]
↓ ↓ ↓
Left: [10,20] (2 cells ← UNDERFULL)
Middle: [60] (1 cell ← UNDERFULL)
Right: [90]
After:
Parent: [80] ← Removed separator "50"
↓ ↓
LeftMerged: [10,20,60] (3 cells)
Right: [90]
Merge when both siblings underfull
Merge Algorithm:
func (b *BTree) mergeLeaf(leftPageID, rightPageID uint32, parentPageID uint32, separatorIdx int) error {
leftPage := b.pager.GetPage(leftPageID)
rightPage := b.pager.GetPage(rightPageID)
parentPage := b.pager.GetPage(parentPageID)
// Copy all cells from right to left
for i := 0; i < rightPage.numCells(); i++ {
cell := rightPage.cellAt(i)
leftPage.insertCell(cell.Key, cell.Value)
}
// Update linked list (skip right page)
leftPage.rightPtr = rightPage.rightPtr
// Remove separator from parent
parentPage.removeCell(separatorIdx)
// Free right page
b.pager.freePage(rightPageID)
// Mark pages dirty
b.pager.markDirty(leftPageID)
b.pager.markDirty(parentPageID)
// Check if parent now underfull (recursive!)
if parentPage.isUnderfull() && parentPageID != b.rootPageID {
return b.rebalanceParent(parentPageID)
}
return nil
}Root Shrinking:
Special case: Root has only 1 child after merge
Before:
[Root: empty]
↓
[Only Child]
/ | \
[A] [B] [C]
After:
[Only Child ← NEW ROOT]
/ | \
[A] [B] [C]
Tree height decreases by 1!
When Merge is Triggered:
Delete operation:
1. Remove key from leaf
2. Check if leaf underfull (<25%)
3. If yes:
a. Try redistribution from sibling
b. If can't redistribute, merge
c. Recursively check parent
4. If root becomes empty, shrink tree
Performance Impact:
Without merge:
- Delete 50% of keys
- Still occupy 100% space
- Space amp = 2.0x
With merge:
- Delete 50% of keys
- Space reclaimed via merge
- Space amp = 1.1x (near optimal!)
Purpose: Provide crash recovery by logging page modifications before applying them.
Physical vs Logical WAL:
Logical WAL (LSM-Tree):
- Logs operations: Put(key="user1", value="Alice")
- Pros: Small records, easy to understand
- Cons: Must replay operations to reconstruct state
Physical WAL (B-Tree):
- Logs page modifications: Set bytes 100-200 of page 42 to [data]
- Pros: Direct replay, byte-accurate
- Cons: Larger records, more complex
WAL Record Types:
1. PAGE_WRITE: Before modifying page
[Type: 1 byte] = 0x01
[PageID: 4 bytes] - Which page
[Offset: 4 bytes] - Where in page
[Length: 4 bytes] - How many bytes
[Data: variable] - Byte data
[CRC32: 4 bytes] - Checksum
2. PAGE_SPLIT: Before splitting
[Type: 1 byte] = 0x02
[OldPageID: 4 bytes] - Page being split
[NewPageID: 4 bytes] - New page created
[SeparatorLen: 2 bytes] - Length of separator key
[Separator: variable] - Separator key
[CRC32: 4 bytes]
3. ROOT_CHANGE: When root changes
[Type: 1 byte] = 0x03
[NewRootPageID: 4 bytes]- New root page ID
[CRC32: 4 bytes]
4. CHECKPOINT: Mark safe point
[Type: 1 byte] = 0x04
[Timestamp: 8 bytes] - When checkpoint taken
[CRC32: 4 bytes]
WAL Lifecycle:
1. Operation begins
├─→ Write WAL record
├─→ fsync WAL file
└─→ Guaranteed on disk
2. Apply operation
├─→ Modify pages in cache
├─→ Mark pages dirty
└─→ Don't flush yet (lazy)
3. Eventually flush
├─→ Write dirty pages to disk
├─→ fsync data file
└─→ Pages durable
4. Checkpoint
├─→ Write CHECKPOINT record
├─→ fsync WAL
└─→ Truncate WAL (can discard old records)
Crash Recovery:
func (b *BTree) recoverFromWAL() error {
wal := openWAL(b.config.DataDir + "/wal.log")
entries := wal.ReadAll()
for _, entry := range entries {
switch entry.Type {
case PAGE_WRITE:
// Replay page modification
page := b.pager.GetPage(entry.PageID)
copy(page.data[entry.Offset:], entry.Data)
b.pager.markDirty(entry.PageID)
case PAGE_SPLIT:
// Re-split operation
// (Usually not needed, splits already durable)
case ROOT_CHANGE:
// Update root page ID
b.rootPageID = entry.NewRootPageID
case CHECKPOINT:
// Safe to discard all records before this
break
}
}
// Flush recovered state
b.pager.Flush()
// Truncate WAL (recovery complete)
wal.Truncate()
return nil
}Crash Scenarios:
Scenario 1: Crash before WAL sync
Timeline:
1. Write WAL record (in buffer)
2. CRASH! (before fsync)
Recovery:
- WAL entry not on disk
- Operation never happened
- Consistent state restored
Result: Lost operation (acceptable, not synced)
Scenario 2: Crash after WAL sync, before page flush
Timeline:
1. Write WAL record
2. fsync WAL ✓
3. Modify page in cache
4. CRASH! (before flush)
Recovery:
- WAL has operation
- Page not modified on disk
- Replay WAL → Apply operation
Result: No data loss! ✓
Scenario 3: Crash during page flush
Timeline:
1. Write WAL ✓
2. Modify page ✓
3. Start flush (partial write)
4. CRASH! (mid-write)
Recovery:
- WAL has operation
- Page partially written (corrupted)
- Replay WAL → Overwrite with correct data
Result: No data loss! ✓
Scenario 4: Crash after flush, before checkpoint
Timeline:
1. Write WAL ✓
2. Modify page ✓
3. Flush page ✓
4. CRASH! (before checkpoint)
Recovery:
- WAL has operation
- Page already on disk
- Replay WAL → Re-apply (idempotent)
- Result: Duplicate write (harmless)
- Checkpoint later → Truncate WAL
Performance Impact:
Write amplification with WAL:
- User write: 1x
- WAL write: 1x (small record)
- Page flush: 1x (4KB page)
Total: ~2-3x write amp
Trade-off:
- Cost: 2-3x write amp
- Gain: Crash safety, durability guarantees
- Verdict: Essential for production!
Checkpoint Strategy:
When to checkpoint:
1. After every N page flushes (e.g., N=100)
2. On explicit Sync() call
3. On normal Close()
Why checkpoint:
- Keeps WAL small (truncate old records)
- Faster recovery (fewer records to replay)
- Reclaims disk space
Purpose: Allow concurrent reads and writes without global lock.
Latch Coupling Protocol:
Traditional approach: Global RWMutex
- Problem: Readers block writers, writers block everyone
- Throughput: Limited to serial execution
Latch coupling:
- Each page has its own RWMutex
- Lock parent → Lock child → Unlock parent
- Multiple threads traverse different paths simultaneously
- Throughput: 2-5x improvement on multi-core
Latch Coupling Algorithm (Read):
Search for key "75":
Step 1: Lock root (READ)
[Root: 50, 80] ← LOCKED
↓ ↓ ↓
Step 2: Lock child (READ), unlock root
[Root: 50, 80] ← UNLOCKED
↓ ↓ ↓
[P2: 60-90] ← LOCKED
Step 3: Lock leaf (READ), unlock parent
[P2: 60-90] ← UNLOCKED
↓
[Leaf: 70,75,80] ← LOCKED
Step 4: Read value, unlock leaf
[Leaf] ← UNLOCKED
Return value
Only one page locked at a time!
Other threads can traverse different paths simultaneously.
Latch Coupling Algorithm (Write):
Insert key "75":
Step 1: Lock root (WRITE if might split, else READ)
[Root] ← WRITE LOCK (safe node check)
Step 2: Descend to child
- If child won't split → downgrade to READ lock
- If child might split → keep WRITE lock
Step 3: Continue until leaf
[Leaf] ← WRITE LOCK
Step 4: Insert, unlock all
Safe Node Concept:
Safe node = Node that WON'T split on insert
For inserts:
- Page is safe if has space for new cell
- If safe, no split propagation → release parent locks
For deletes:
- Page is safe if won't underflow after deletion
- If safe, no merge propagation → release parent locks
Optimization:
- Check if node is safe BEFORE locking deeply
- Release ancestor locks early if safe
- Reduces lock contention dramatically
Implementation:
type Page struct {
id uint32
data []byte
latch sync.RWMutex ← Per-page lock
dirty bool
}
func (b *BTree) ConcurrentGet(key []byte) ([]byte, error) {
pageID := b.rootPageID
var parentLatch *sync.RWMutex = nil
for {
page := b.pager.GetPage(pageID)
// Lock current page
page.latch.RLock()
// Unlock parent (latch coupling!)
if parentLatch != nil {
parentLatch.RUnlock()
}
if page.isLeaf() {
// Found leaf, search for key
value, found := searchKey(page, key)
page.latch.RUnlock()
if found {
return value, nil
}
return nil, ErrKeyNotFound
}
// Internal node, descend
childPageID := searchChild(page, key)
parentLatch = &page.latch
pageID = childPageID
}
}Deadlock Prevention:
Potential deadlock:
Thread 1: Lock A → Lock B
Thread 2: Lock B → Lock A
Result: Deadlock!
Latch coupling avoids this:
- Always lock top-down (root → leaf)
- Never lock sibling simultaneously
- Release parent before locking child
- Result: No cycles possible!
Proof: Locks form a DAG (directed acyclic graph)
Root always locked before children
No back edges → No cycles → No deadlock
Performance Comparison:
Benchmark: Concurrent reads (8 threads)
Global RWMutex:
- Throughput: 300K ops/sec
- All threads wait for lock
Per-page latching:
- Throughput: 1.2M ops/sec
- Threads traverse independently
- 4x improvement!
Why 4x (not 8x)?
- Some contention on root (hot page)
- Cache line bouncing
- Still significant improvement
Write Contention:
Writes are harder:
- Must WRITE-lock path to leaf
- Blocks other writers on same path
- Still allows concurrent reads
Mixed workload (80% reads, 20% writes):
- Global lock: 300K ops/sec
- Latch coupling: 900K ops/sec
- 3x improvement
Purpose: Reduce space overhead for small key/value sizes.
Problem with Fixed Encoding:
V1 format uses fixed 2-byte size prefixes:
Key="user" (4 bytes), Value="Alice" (5 bytes)
Encoding: [0x00, 0x04][0x00, 0x05][...data...]
└─ Wasted byte (size < 128)
For small keys/values (most common!):
- 1st byte always 0x00
- Wasted 1 byte per size field
- 2 size fields per cell = 2 bytes wasted per cell
Varint Encoding (Protocol Buffers Style):
Values 0-127: 1 byte
[0xxx xxxx]
Values 128-16383: 2 bytes
[1xxx xxxx][xxxx xxxx]
↑ MSB=1 means "continue"
Values 16384+: 3 bytes
[1xxx xxxx][1xxx xxxx][xxxx xxxx]
Example:
5 → [0x05] (1 byte)
200 → [0x81, 0x48] (2 bytes)
20000 → [0x82, 0xBC, 0x20] (3 bytes)
Space Savings:
Typical key size distribution:
- 95% keys < 128 bytes
- 4% keys 128-16383 bytes
- 1% keys > 16383 bytes
V1 encoding: 100% use 2 bytes
V2 encoding: 95% use 1 byte, 4% use 2 bytes, 1% use 3 bytes
Average: 1.06 bytes per size field
Savings: 2 size fields × (2 - 1.06) = 1.88 bytes per cell
With 306 cells/page: 1.88 × 306 = 575 bytes saved
Extra cells: 575 / (avg cell size 11) = 51 more cells
Improvement: 306 → 357 cells per page (17% increase!)
Implementation:
func putUvarint16(buf []byte, x uint16) int {
i := 0
for x >= 0x80 {
buf[i] = byte(x) | 0x80 // Set MSB
x >>= 7
i++
}
buf[i] = byte(x)
return i + 1
}
func uvarint16(buf []byte) (uint16, int) {
var x uint16
var s uint
for i := 0; i < 3; i++ {
b := buf[i]
if b < 0x80 {
// Last byte
return x | uint16(b)<<s, i + 1
}
// More bytes coming
x |= uint16(b&0x7F) << s
s += 7
}
return 0, -1 // Overflow
}
func varintSize16(x uint16) int {
if x < 128 {
return 1
} else if x < 16384 {
return 2
}
return 3
}Backward Compatibility:
Challenge: Old pages use V1, new pages use V2
Solution: Page format versioning
Page header version byte:
- 0 (or unset): V1 format (fixed 2-byte encoding)
- 1: V2 format (varint encoding)
Parsing:
func (p *Page) parseLeafCell(offset int) (*Cell, error) {
if p.Version() == PageFormatV1 {
// Fixed 2-byte encoding
keySize := binary.BigEndian.Uint16(p.data[offset:])
valueSize := binary.BigEndian.Uint16(p.data[offset+2:])
offset += 4
} else {
// Varint encoding
keySize, n1 := uvarint16(p.data[offset:])
valueSize, n2 := uvarint16(p.data[offset+n1:])
offset += n1 + n2
}
// ... rest of parsing
}
Gradual migration:
- Old pages remain V1 (still readable)
- New pages use V2 (better space efficiency)
- No downtime required!
Performance Impact:
CPU overhead:
- Varint decode: ~5-10ns per call
- Fixed decode: ~2ns per call
- Overhead: ~3-8ns (negligible!)
Space savings:
- 18% fewer bytes per page
- 17% more keys per page
- Fewer pages → fewer disk I/O
- Net result: Faster despite tiny CPU cost!
Tree height reduction:
- 1M keys, 306 keys/page: height = ceil(log306(1M)) = 3
- 1M keys, 357 keys/page: height = ceil(log357(1M)) = 3
- 100M keys, 306 keys/page: height = 4
- 100M keys, 357 keys/page: height = 3 ← One level shorter!
Purpose: Support efficient range scans over sorted keys.
Linked Leaf Pages:
Internal nodes:
[Root: 50, 80]
↓ ↓ ↓
Leaf pages (linked list):
[Leaf1: 10,20,30] → [Leaf2: 40,50,60] → [Leaf3: 70,80,90] → NULL
↑ RightPtr ↑ RightPtr ↑ RightPtr
Range scan exploits this structure!
Iterator State:
type BTreeIterator struct {
btree *BTree
currentPage *Page
cellIndex int
endKey []byte
err error
started bool
}Seek Algorithm:
Scan("user:100", "user:200"):
1. Seek to start key "user:100"
- Traverse tree to leaf containing "user:100"
- Binary search within leaf for first key >= "user:100"
2. Set iterator state:
currentPage = leaf page
cellIndex = index of first key >= "user:100"
endKey = "user:200"
Example tree:
[Root: user:150]
↓ ↓
[Leaf1: user:050-120] → [Leaf2: user:130-180] → [Leaf3: user:190-250]
Seek("user:100"):
- Traverse to Leaf1
- Binary search finds index 5 (key="user:105")
- Return Leaf1, index=5
Next Algorithm:
func (it *BTreeIterator) Next() bool {
if !it.started {
it.Seek(it.startKey)
it.started = true
}
// Check end condition
if it.cellIndex >= it.currentPage.numCells() {
// Move to next page
rightPtr := it.currentPage.rightPtr
if rightPtr == 0 {
return false // End of tree
}
it.currentPage = it.btree.pager.GetPage(rightPtr)
it.cellIndex = 0
}
// Check end key
key := it.currentPage.cellAt(it.cellIndex).Key
if it.endKey != nil && bytes.Compare(key, it.endKey) >= 0 {
return false // Beyond range
}
it.cellIndex++
return true
}Full Scan Example:
Scan("a", "z"):
Step 1: Seek to "a" (first key in tree)
Step 2: Iterate through Leaf1 cells
Step 3: Move to Leaf2 via rightPtr
Step 4: Iterate through Leaf2 cells
Step 5: Move to Leaf3 via rightPtr
Step 6: Stop when key >= "z"
Performance:
- Sequential I/O (follow linked list)
- No tree traversal after initial seek
- Cache-friendly (scan hot pages)
Usage:
iter, _ := btree.Scan([]byte("user:100"), []byte("user:200"))
defer iter.Close()
for iter.Next() {
key := iter.Key()
value := iter.Value()
fmt.Printf("%s: %s\n", key, value)
}
if iter.Error() != nil {
log.Fatal(iter.Error())
}Performance Characteristics:
Range scan [start, end]:
- Seek: O(log n) tree traversal
- Scan: O(k) where k = keys in range
- I/O: ceil(k / keys_per_page) page reads
Example: 1M keys, scan 1000 keys
- Seek: log(1M) = ~20 comparisons
- Scan: 1000 keys / 357 keys_per_page = 3 page reads
- Total: 20 comparisons + 3 disk reads
Compare to Hash Index: No range scans possible!
Compare to LSM-Tree: Must merge all levels (slower)
Purpose: Coordinate all components, provide user-facing API.
Architecture:
┌─────────────────────────────────────────────────────┐
│ BTree (Main Engine) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Config: DataDir, Order, CacheSize │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Pager: Page cache, I/O, free list │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ WAL: Crash recovery, durability │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Metadata: RootPageID, Height, NumKeys │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Stats: Read/Write counters, amplification │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Lock: RWMutex for structural changes │ │
│ │ (Per-page latches for data operations) │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Configuration:
type Config struct {
DataDir string // Directory for data files
Order int // Max keys per page (default 128)
CacheSize int // Page cache size (default 100 pages)
SyncWAL bool // fsync WAL on every write (default true)
}
func DefaultConfig(dataDir string) Config {
return Config{
DataDir: dataDir,
Order: 128,
CacheSize: 100,
SyncWAL: true,
}
}Initialization:
func New(config Config) (*BTree, error) {
// Create directory
os.MkdirAll(config.DataDir, 0755)
// Open pager
pager := newPager(config.DataDir+"/btree.db", config.CacheSize)
// Open WAL
wal := openWAL(config.DataDir + "/wal.log")
// Check if existing database
if fileExists(config.DataDir + "/btree.db") {
// Load metadata (root page ID, etc.)
metadata := pager.loadMetadata()
// Recover from WAL if needed
wal.recover(pager)
return &BTree{
config: config,
pager: pager,
wal: wal,
rootPageID: metadata.RootPageID,
height: metadata.Height,
numKeys: metadata.NumKeys,
}, nil
}
// New database - create root page
rootPage := pager.NewPage(LEAF)
metadata := Metadata{
RootPageID: rootPage.id,
Height: 1,
NumKeys: 0,
}
pager.saveMetadata(metadata)
return &BTree{
config: config,
pager: pager,
wal: wal,
rootPageID: rootPage.id,
height: 1,
numKeys: 0,
}, nil
}Put Operation:
func (b *BTree) Put(key, value []byte) error {
// Validate inputs
if len(key) == 0 {
return ErrEmptyKey
}
// Write to WAL (durability)
b.wal.logPut(key, value)
// Global lock for structural changes
b.mu.Lock()
defer b.mu.Unlock()
// Insert into tree
splitKey, newPageID, err := b.insertRecursive(b.rootPageID, key, value, 0)
if err != nil {
return err
}
// If root split, create new root
if splitKey != nil {
b.splitRoot(splitKey, newPageID)
}
b.numKeys++
b.stats.writeCount.Add(1)
return nil
}Get Operation:
func (b *BTree) Get(key []byte) ([]byte, error) {
// Read lock (concurrent reads allowed)
b.mu.RLock()
defer b.mu.RUnlock()
pageID := b.rootPageID
// Traverse tree
for {
page := b.pager.GetPage(pageID)
if page.isLeaf() {
// Search in leaf
value, found := searchKey(page, key)
if found {
b.stats.readCount.Add(1)
return value, nil
}
return nil, ErrKeyNotFound
}
// Internal node - find child
pageID = searchChild(page, key)
}
}Delete Operation:
func (b *BTree) Delete(key []byte) error {
// Write to WAL
b.wal.logDelete(key)
b.mu.Lock()
defer b.mu.Unlock()
// Find and remove key
found, err := b.deleteRecursive(b.rootPageID, key, nil, 0)
if err != nil {
return err
}
if !found {
return ErrKeyNotFound
}
b.numKeys--
return nil
}Sync Operation:
func (b *BTree) Sync() error {
// Flush all dirty pages
if err := b.pager.Flush(); err != nil {
return err
}
// Checkpoint WAL
if err := b.wal.Checkpoint(); err != nil {
return err
}
// Update metadata
metadata := Metadata{
RootPageID: b.rootPageID,
Height: b.height,
NumKeys: b.numKeys,
}
return b.pager.saveMetadata(metadata)
}Close Operation:
func (b *BTree) Close() error {
// Sync everything
if err := b.Sync(); err != nil {
return err
}
// Close WAL
if err := b.wal.Close(); err != nil {
return err
}
// Close pager
return b.pager.Close()
}Stats Implementation:
func (b *BTree) Stats() common.Stats {
b.mu.RLock()
defer b.mu.RUnlock()
// Calculate space amplification
numPages := b.pager.NumPages()
totalDiskSize := numPages * 4096
// Estimate logical size
avgKeySize := 8 // Approximate
avgValueSize := 32 // Approximate
logicalSize := b.numKeys * (avgKeySize + avgValueSize)
spaceAmp := float64(totalDiskSize) / float64(logicalSize)
// Write amplification (WAL doubles writes)
writeAmp := 2.0 + 1.0 // WAL + page flush + occasional split
return common.Stats{
NumKeys: b.numKeys,
NumSegments: numPages,
TotalDiskSize: uint64(totalDiskSize),
WriteCount: b.stats.writeCount.Load(),
ReadCount: b.stats.readCount.Load(),
WriteAmplification: writeAmp,
SpaceAmplification: spaceAmp,
}
}Compact (No-Op):
func (b *BTree) Compact() error {
// B-Tree doesn't need compaction!
// In-place updates mean no old versions to clean up
// Space reclaimed automatically via page merging
return nil
}Performance Characteristics Summary:
Operation Time Complexity Disk I/O
─────────────────────────────────────────────
Put O(log n) O(height) reads + O(1) write
Get O(log n) O(height) reads
Delete O(log n) O(height) reads + O(1) write
Scan O(log n + k) O(height + k/fanout)
Where:
- n = total keys
- k = keys in range
- height = typically 3-4 for millions of keys
- fanout = keys per page (128-357)
Space Amplification: 1.0-1.1x (BEST!)
Write Amplification: 2-3x (WAL + page flush)
Read Amplification: 1x (direct lookup)
Why Blocks?
Alternative 1: One entry per disk read
- Pro: Simple
- Con: 100 entries = 100 disk seeks = 10ms!
Alternative 2: Entire file in memory
- Pro: Fast
- Con: Doesn't scale
Block-based (4KB blocks):
- Pro: Amortizes disk seeks
- Pro: Fits in page cache
- Pro: Small enough for fast search
- Balance: ~100 entries per block
Purpose: Total ordering of all writes for conflict resolution.
Timeline:
t=1: Put("user1", "Alice") → seq=1
t=2: Put("user2", "Bob") → seq=2
t=3: Put("user1", "Carol") → seq=3
t=4: Delete("user2") → seq=4
During merge:
- "user1" has seq=1 and seq=3 → Keep seq=3 ("Carol")
- "user2" has seq=2 and seq=4 → Keep seq=4 (deleted)
Implementation:
var sequence uint64 // Global atomic counter
func Put(key, value) {
seq := atomic.AddUint64(&sequence, 1)
entry := Entry{key, value, seq, deleted=false}
...
}Purpose: Mark keys as deleted without immediate removal.
Why Needed?
Problem without tombstones:
1. Delete "key1" from memtable
2. Flush memtable (no entry for "key1")
3. "key1" still in L1 SSTable!
4. Get("key1") returns old value ← BUG!
Solution with tombstones:
1. Delete "key1" → Add tombstone to memtable
2. Flush memtable → Tombstone in L0
3. Compaction merges L0+L1
4. Tombstone overrides old value
5. Drop tombstone in final level (L4)
Format:
type Entry struct {
Key string
Value []byte
Deleted bool ← Tombstone marker
Seq uint64
}Definition: Total bytes written to disk / User bytes written
Example:
User writes 100MB of data
Compaction rewrites it 3 times
Total disk writes: 100MB + 300MB = 400MB
Write amplification: 400MB / 100MB = 4x
LSM Write Amplification Calculation:
For LSM with 10x ratio and 4 levels:
L0→L1: Write 40MB, read 40MB (L0) + 400MB (L1) = 440MB write
L1→L2: Write 400MB, read 400MB (L1) + 4GB (L2) = 4.4GB write
L2→L3: Write 4GB, read 4GB (L2) + 40GB (L3) = 44GB write
L3→L4: Write 40GB, read 40GB (L3) + 400GB (L4) = 440GB write
Total amplification ≈ 1 + 10 × num_levels = 41x
Definition: Actual disk usage / Logical data size
Example:
Logical data: 100MB (latest versions)
Disk usage: 250MB (includes old versions, tombstones)
Space amplification: 250MB / 100MB = 2.5x
Factors:
- Old versions not yet compacted
- Tombstones not yet dropped
- Fragmentation in blocks
- Bloom filters, indexes, metadata
WAL-Based Recovery:
On Startup:
1. Check if WAL exists
2. If yes:
a. Replay all entries into memtable
b. Restore sequence number
3. Load existing SSTables
4. Resume operation
Guarantees:
- Durability: All synced writes recovered
- Consistency: Sequence numbers maintain order
- Isolation: No partial writes visible
Small (1MB):
+ Faster flush
+ Less memory
- More L0 files
- More compaction
Large (64MB):
+ Fewer L0 files
+ Less compaction
- Slower flush
- More memory
- Longer WAL replay
Sweet spot: 4-8MB
Small (512B):
+ Fine-grained access
- Large index overhead
- More seeks
Large (64KB):
+ Small index
+ Fewer seeks
- Wasted I/O
- Slower search
Sweet spot: 4KB (page size)
Lower FP (0.1%):
+ Fewer false positives
+ Fewer wasted reads
- Larger bloom filter
- More memory
Higher FP (5%):
+ Smaller bloom filter
+ Less memory
- More false positives
- More wasted reads
Sweet spot: 1% (balanced)
for level := 0; level < 5; level++ {
numFiles := lsm.levels.NumFiles(level)
size := lsm.levels.LevelSize(level)
fmt.Printf("L%d: %d files, %d MB\n", level, numFiles, size/1024/1024)
}Track:
- Bytes written by user
- Bytes written by compaction
- Ratio = total / user
High ratio (>50x)?
→ Level sizes too small
→ Increase size ratio or thresholds
Add read counter to SSTable:
- Which SSTables get most reads?
- Which blocks get most reads?
- Optimize: Cache hot blocks in memory
This guide provides the foundation for understanding and extending these storage engines. Each design decision represents a trade-off optimized for specific workloads.