Skip to content

Latest commit

 

History

History
2566 lines (2086 loc) · 65 KB

File metadata and controls

2566 lines (2086 loc) · 65 KB

Storage Engines: Comprehensive Component Guide

This document provides a deep dive into each component of the three storage engines implemented in this project.

Table of Contents

  1. LSM-Tree Components
  2. Hash Index Components
  3. B-Tree Components
  4. SSTable Format
  5. Common Concepts

LSM-Tree Components

1. MemTable (lsm/memtable.go)

Purpose: In-memory sorted structure for recent writes before they're flushed to disk.

Data Structure:

type MemTableEntry struct {
    Key      string
    Value    []byte
    Sequence uint64  // Monotonically increasing, determines version
    Deleted  bool     // Tombstone marker
}

Implementation Details:

  • Uses a sorted slice with binary search (simple and effective)
  • Thread-safe with RWMutex (multiple readers, single writer)
  • Entries maintained in sorted order by key
  • Sequence numbers provide total ordering of all writes

Why Sorted Slice Instead of Tree?

Alternatives considered:
1. Red-Black Tree: O(log n) insert, O(log n) search
   - Pros: Balanced, predictable
   - Cons: Complex, cache-unfriendly

2. Skip List: O(log n) average insert/search
   - Pros: Simpler than RB-tree, concurrent-friendly
   - Cons: Probabilistic, variable height

3. Sorted Slice: O(n) insert, O(log n) search
   - Pros: Cache-friendly, simple, fast iteration
   - Cons: Slow inserts for large sizes

Choice: Sorted slice because memtable is small (4MB),
flushed frequently, and iteration is fast (needed for flush).

Operations:

// Put: O(n) insertion (binary search + shift)
Put(key, value, seq) -> Updates existing or inserts new

// Get: O(log n) binary search
Get(key) -> (value, sequence, deleted, found)

// GetAllEntries: O(1) (already sorted!)
GetAllEntries() -> []MemTableEntry  // For flushing

Memory Management:

  • Size tracked approximately: len(key) + len(value) + 16 bytes overhead
  • Becomes immutable when size >= maxSize (default 4MB)
  • Old entries replaced (not appended) to save space

2. Write-Ahead Log (lsm/wal.go)

Purpose: Provide durability for writes before they're persisted to SSTables.

Record Format:

[CRC32: 4 bytes]      - Checksum for corruption detection
[Sequence: 8 bytes]   - Global sequence number
[KeySize: 4 bytes]    - Length of key
[ValueSize: 4 bytes]  - Length of value
[Deleted: 1 byte]     - Tombstone flag (0 or 1)
[Key: variable]       - Actual key bytes
[Value: variable]     - Actual value bytes

Why This Format?

  • CRC first: Detect corruption before parsing
  • Sizes before data: Know how much to read
  • Self-contained: Each record independent

Recovery Process:

1. On startup, read WAL from beginning to end
2. For each record:
   a. Validate CRC
   b. If valid, apply to memtable
   c. Track max sequence number
3. Resume from max sequence + 1
4. Keep WAL until memtable flushed
5. Delete old WAL, create new one

Crash Scenarios:

Scenario 1: Crash before memtable flush
- WAL has all writes
- Recovery replays WAL → memtable reconstructed
- Result: No data loss ✅

Scenario 2: Crash during memtable flush
- Partial SSTable written
- Recovery replays WAL → memtable reconstructed
- Re-flush memtable → new SSTable
- Result: Duplicate SSTable (cleaned up by compaction)

Scenario 3: Crash after memtable flush, before WAL delete
- SSTable persisted
- WAL still exists
- Recovery loads SSTable + replays WAL
- Result: Duplicate entries (sequence numbers deduplicate)

3. Bloom Filter (lsm/bloom.go)

Purpose: Probabilistic data structure to skip SSTables that definitely don't contain a key.

Math Behind It:

Given:
- n = expected number of keys
- p = desired false positive rate (e.g., 0.01 for 1%)

Optimal parameters:
- m (number of bits) = -(n * ln(p)) / (ln(2)²)
- k (number of hashes) = (m/n) * ln(2)

Example with n=1000, p=0.01:
- m = -(1000 * ln(0.01)) / 0.48 ≈ 9585 bits (1.2 KB)
- k = (9585/1000) * 0.69 ≈ 7 hash functions

Implementation:

// Uses double hashing for efficiency
h_i(x) = (h1(x) + i * h2(x)) mod m

// Instead of k different hash functions, compute just 2:
h1 = FNV64a(key)
h2 = FNV64(key)

// Then derive k hashes via linear combination

Why Double Hashing?

  • Efficient: Only 2 hash computations instead of k
  • Good distribution: Linear combination spreads bits well
  • Fast: FNV is a fast non-cryptographic hash

Performance Impact:

Without Bloom Filter:
Get("nonexistent_key"):
- Check L0: 4 files × 100µs = 400µs
- Check L1: 10 files × 100µs = 1000µs
- Check L2: 100 files × 100µs = 10000µs
Total: ~11.4ms per negative lookup!

With Bloom Filter (1% FP):
- Bloom checks: 5 levels × 1µs = 5µs
- 99% stop here (negative result)
- 1% false positive → 11.4ms
Average: 5µs + 0.01 * 11400µs = 119µs

Speedup: 96x faster! 🚀

4. SSTable Structure (lsm/sstable.go)

Purpose: Immutable sorted file on disk with efficient search.

File Layout:

┌─────────────────────────────────────┐
│ Data Block 1 (4KB)                  │ ← Sorted entries
├─────────────────────────────────────┤
│ Data Block 2 (4KB)                  │
├─────────────────────────────────────┤
│ ...                                  │
├─────────────────────────────────────┤
│ Data Block N (4KB)                  │
├─────────────────────────────────────┤
│ Index Block                         │ ← Maps first key → block offset
├─────────────────────────────────────┤
│ Metadata Block                      │ ← MinKey, MaxKey
├─────────────────────────────────────┤
│ Bloom Filter                        │ ← Membership test
├─────────────────────────────────────┤
│ Footer (28 bytes)                   │ ← Offsets + magic number
└─────────────────────────────────────┘

Data Block Format:

[NumEntries: 4 bytes]
[Entry1]
[Entry2]
...

Entry format:
[KeySize: 4 bytes]
[ValueSize: 4 bytes]
[Deleted: 1 byte]
[Key: variable]
[Value: variable]

Index Block Format:

[NumEntries: 4 bytes]
[IndexEntry1]
[IndexEntry2]
...

IndexEntry format:
[KeySize: 4 bytes]
[BlockOffset: 8 bytes]  ← Position of data block in file
[Key: variable]         ← First key in that block

Search Algorithm:

func Get(key string) (value []byte, found bool, err error) {
    // Step 1: Check bloom filter (1µs)
    if !bloomFilter.MayContain(key) {
        return nil, false, nil  // Definitely not present
    }

    // Step 2: Binary search index to find block (log n)
    blockIdx := binarySearch(index, key)

    // Step 3: Read block from disk (100µs)
    block := readBlock(blockIdx)

    // Step 4: Linear search within block (10µs)
    return searchBlock(block, key)
}

Why 4KB Blocks?

Alternatives:
- 512 bytes: Too small, index overhead too high
- 64KB: Too large, waste I/O for small values

4KB is optimal because:
1. Matches page size (efficient I/O)
2. ~100 entries per block (good granularity)
3. Small enough for fast linear search within block
4. Large enough to amortize index overhead

5. SSTable Builder (lsm/sstable_builder.go)

Purpose: Construct immutable SSTables from sorted entries.

Build Process:

1. Initialize:
   - Create file
   - Init bloom filter (based on expected keys)
   - Init current block buffer

2. For each entry (must be in sorted order!):
   - Add key to bloom filter
   - Track minKey (first key) and maxKey (last key)
   - Append entry to current block
   - If block full (>4KB):
     a. Flush block to disk
     b. Add index entry (first key → block offset)
     c. Reset block buffer

3. Finish:
   - Flush remaining block
   - Write index block
   - Write metadata block (minKey, maxKey)
   - Write bloom filter
   - Write footer (offsets + magic)
   - Sync to disk
   - Close file

Critical Invariant: Entries MUST be added in sorted order!

Why? Because we build index as we go:
- First key of each block goes in index
- If unsorted, index would be wrong
- Result: Binary search would fail!

Enforcement: Caller's responsibility (flush, compaction)

Block Flushing:

// When block exceeds 4KB:
1. Write block header (numEntries)
2. Write block to file
3. Pad to 4KB boundary (for alignment)
4. Create index entry:
   - Key: first key in block
   - Offset: file position of block start
5. Reset block buffer

6. Compaction (lsm/compaction.go)

Purpose: Merge sorted files to reclaim space and reduce read amplification.

K-Way Merge Algorithm:

Problem: Merge N sorted files into 1 sorted file

Algorithm:
1. Create iterator for each input file
2. Build min-heap with first entry from each iterator
3. While heap not empty:
   a. Pop smallest entry (by key)
   b. If duplicate key, keep entry with highest sequence
   c. Write entry to output file
   d. Advance iterator that produced this entry
   e. Push next entry from that iterator to heap

Time: O(k * n * log k) where k=files, n=total entries
Space: O(k) for heap

Handling Duplicates:

Example: Same key in multiple files
File 1: key="user1", seq=5, value="Alice"
File 2: key="user1", seq=8, value="Bob"
File 3: key="user1", seq=3, value="Charlie"

Merge process:
1. All 3 appear in heap simultaneously
2. Heap orders by (key, -sequence):
   ("user1", -8)  ← Pop this (highest sequence)
   ("user1", -5)
   ("user1", -3)
3. Write ("user1", seq=8, "Bob")
4. Skip ("user1", seq=5) and ("user1", seq=3)

Result: Only latest version kept!

Tombstone Handling:

Rule: Drop tombstones ONLY in final level (L4)

Why?
- Tombstone in L2 might delete key in L3/L4
- If dropped too early, old version resurfaces!

Example:
L1: ("user1", DELETED, seq=10)
L2: ("user1", "Alice", seq=5)

If we drop tombstone in L1→L2 compaction:
Result: ("user1", "Alice") ← WRONG! Should be deleted

Correct: Keep tombstone until L4

Compaction Strategies:

L0 → L1 (Special Case):

Problem: L0 files may overlap (from memtable flushes)
Solution: Merge ALL L0 files at once

Example:
L0-001: [a-m, z]     ← Overlaps!
L0-002: [b-n, x]     ← Overlaps!

L0→L1 compaction:
- Pick ALL L0 files
- Find overlapping L1 files
- Merge all together
- Output: Non-overlapping L1 files

L1 → L2, L2 → L3, L3 → L4 (Leveled):

Strategy: Pick 1 file from source, merge with overlapping target files

Example L1→L2:
L1: [a-m], [n-z]
L2: [a-f], [g-m], [n-s], [t-z]

Step 1: Pick L1 file [a-m]
Step 2: Find overlapping L2 files: [a-f], [g-m]
Step 3: Merge these 3 files
Step 4: Output: New L2 files [a-m] (merged and compacted)

Result:
- Removed 3 old files ([a-m] from L1, [a-f] and [g-m] from L2)
- Created 1 new file ([a-m] in L2)
- Reclaimed space from deleted/overwritten keys

7. Level Manager (lsm/levels.go)

Purpose: Organize SSTables across multiple levels and track when compaction needed.

Level Configuration:

L0: 40 MB    (4-10 files)         ← May overlap
L1: 400 MB   (10-100 files)       ← Non-overlapping
L2: 4 GB     (100-1000 files)     ← Non-overlapping
L3: 40 GB    (1000-10000 files)   ← Non-overlapping
L4: 400 GB   (10000+ files)       ← Non-overlapping

Why 10x Size Ratio?

Write amplification = 1 + (size_ratio × num_levels)

With 10x ratio and 4 levels:
WA = 1 + (10 × 4) = 41x

Seems high, but alternatives worse:
- 5x ratio, 6 levels: WA = 1 + (5 × 6) = 31x (more levels to check)
- 20x ratio, 2 levels: WA = 1 + (20 × 2) = 41x (huge compactions)

10x is industry standard (LevelDB, RocksDB)

Compaction Triggers:

L0: Trigger when >= 4 files (count-based)
    Why? L0 files overlap, more files = slower reads

L1+: Trigger when size >= max_size (size-based)
     Why? Non-overlapping, size matters more than count

File Organization:

L0: Unsorted (by time created)
    - Don't sort, order doesn't matter
    - All checked anyway (overlap)

L1-L4: Sorted by minKey
       - Binary search possible
       - Skip non-overlapping files

8. Main LSM Engine (lsm/lsm.go)

Purpose: Coordinate all components, handle reads/writes, manage background workers.

Write Path:

Put(key, value):
1. Generate sequence number (atomic increment)
2. Append to WAL (for durability)
3. Insert into active memtable
4. If memtable full:
   a. Lock
   b. Swap active ← new empty memtable
   c. Set immutable ← old active
   d. Signal flush worker (non-blocking)
   e. Unlock
5. Return (fast!)

Read Path:

Get(key):
1. Check active memtable (fastest)
   - If found, return immediately

2. Check immutable memtable (if exists)
   - If found, return immediately

3. Check L0 SSTables (ALL of them, may overlap)
   - For each: bloom filter → binary search
   - If found, return immediately

4. Check L1 SSTables (binary search by key range)
   - Find files where minKey <= key <= maxKey
   - For each: bloom filter → binary search
   - If found, return immediately

5. Repeat for L2, L3, L4

6. Not found → return nil

Read amplification: Worst case checks all levels

Background Workers:

Flush Worker:

for {
    select {
    case <-flushChan:
        if immutableMemtable != nil {
            1. Get all entries from immutable memtable
            2. Build SSTable from entries
            3. Add SSTable to L0
            4. Delete WAL (data now durable in SSTable)
            5. Set immutableMemtable = nil
            6. Check if L0 needs compaction
        }
    case <-closeChan:
        return
    }
}

Compaction Worker:

for {
    select {
    case <-compactionChan:
        if L0.needsCompaction() {
            compactL0ToL1()
            triggerNextIfNeeded(L1)
        } else if L1.needsCompaction() {
            compactL1ToL2()
            triggerNextIfNeeded(L2)
        } else if L2.needsCompaction() {
            compactL2ToL3()
            triggerNextIfNeeded(L3)
        } else if L3.needsCompaction() {
            compactL3ToL4()
        }
    case <-closeChan:
        return
    }
}

Hash Index Components

1. Sharded Index (hashindex/shard.go)

Purpose: Divide keyspace into independent shards to reduce lock contention.

Architecture:

256 shards (power of 2 for fast modulo)
Each shard independently manages:
- In-memory index (map[uint64]Location)
- Active segment
- List of immutable segments
- Own lock (RWMutex)

Shard selection: hash(key) % 256

Why 256 Shards?

Trade-offs:
- 16 shards: Simple, but contention with >16 threads
- 256 shards: Sweet spot for modern CPUs (8-16 cores)
- 1024 shards: Overhead, diminishing returns

256 chosen because:
1. Enough parallelism for 16-32 threads
2. Low overhead (256 × small struct = <1MB)
3. Power of 2 (fast modulo with bitwise AND)

2. Segment Format (hashindex/segment.go)

Entry Format:

[CRC32: 4 bytes]      - Checksum
[Timestamp: 8 bytes]  - Unix timestamp
[KeySize: 4 bytes]    - Length of key
[ValueSize: 4 bytes]  - Length of value
[Key: variable]       - Key bytes
[Value: variable]     - Value bytes

Append-Only Writes:

Benefits:
1. Sequential I/O (fast!)
2. No random seeks
3. Simple crash recovery
4. No corruption from partial writes

Cost:
- Space amplification (old values not deleted)
- Need compaction

3. Compaction (hashindex/compaction.go)

Strategy: Merge-rewrite all segments into one.

Process:
1. Scan all segments
2. Build map: key → latest value
3. Write map to new segment (sorted order)
4. Atomically replace old segments
5. Update in-memory index

Problem: Blocks writes during merge!
Solution: LSM-Tree (incremental compaction)

B-Tree Components

1. Page Structure (btree/page.go)

Purpose: Fixed 4KB page-based storage unit for B-tree nodes.

Page Layout:

┌───────────────────────────────────────────────────┐
│ Header (13 bytes)                                 │
├───────────────────────────────────────────────────┤
│ Cell Directory (grows forward →)                  │
│   [offset1][offset2][offset3]...                  │
├───────────────────────────────────────────────────┤
│ Free Space                                        │
├───────────────────────────────────────────────────┤
│ Cells (grow backward ←)                           │
│   [cell N]...[cell 3][cell 2][cell 1]            │
└───────────────────────────────────────────────────┘
Size: 4096 bytes (4KB)

Header Format:

[Version: 1 byte]       - Page format version (0=V1, 1=V2)
[PageType: 1 byte]      - LEAF=1, INTERNAL=2
[NumCells: 2 bytes]     - Number of cells in page
[FreeOffset: 2 bytes]   - Start of free space
[RightPtr: 4 bytes]     - Right sibling (leaf) or rightmost child (internal)
[Reserved: 3 bytes]     - Future use

Cell Directory:

Array of 2-byte offsets pointing to cells
- Grows forward from byte 13
- Each entry: offset to cell in page
- Binary searchable (cells kept sorted by key)

Leaf Cell Format (V1):

[KeySize: 2 bytes]      - Length of key
[ValueSize: 2 bytes]    - Length of value
[Key: variable]         - Actual key bytes
[Value: variable]       - Actual value bytes

Example: key="user", value="Alice"
[0x00, 0x04][0x00, 0x05]['u','s','e','r']['A','l','i','c','e']
Total: 2 + 2 + 4 + 5 = 13 bytes

Leaf Cell Format (V2 - Varint Optimized):

[KeySize: 1-3 bytes]    - Varint-encoded key length
[ValueSize: 1-3 bytes]  - Varint-encoded value length
[Key: variable]         - Actual key bytes
[Value: variable]       - Actual value bytes

Example: key="user", value="Alice"
[0x04][0x05]['u','s','e','r']['A','l','i','c','e']
Total: 1 + 1 + 4 + 5 = 11 bytes (2 bytes saved!)

Internal Cell Format (V1):

[KeySize: 2 bytes]      - Length of separator key
[ChildPageID: 4 bytes]  - Left child page ID
[Key: variable]         - Separator key

Interpretation: All keys in ChildPageID < Key

Internal Cell Format (V2):

[KeySize: 1-3 bytes]    - Varint-encoded key length
[ChildPageID: 4 bytes]  - Left child page ID
[Key: variable]         - Separator key

Why 4KB Pages?

Alternatives considered:
1. 512 bytes: Too small, poor fanout (20-30 keys)
2. 8KB: Larger than OS page, wasted I/O
3. 64KB: Huge, cache-unfriendly

4KB is optimal because:
1. Matches OS page size (single syscall)
2. Good fanout: 128-357 keys per page (depending on key size)
3. Fits in L1/L2 cache
4. Standard in PostgreSQL, MySQL, SQLite

Why Two-Way Growth (Directory Forward, Cells Backward)?

Problem with linear layout:
- Insert in middle → shift all cells → expensive!

Two-way growth:
- Directory grows forward (small, fixed-size offsets)
- Cells grow backward (variable size)
- Free space in middle
- No cell shifting needed! Just update offsets
- Fragmentation isolated to free space calculation

Page Format Versioning:

V1 (Legacy):
- Fixed 2-byte size encoding
- Version byte = 0 (or unset)
- Backward compatible

V2 (Varint):
- Variable-length size encoding (1-3 bytes)
- Version byte = 1
- 18% space savings for small keys
- 17% more keys per page (306 → 357)

Key Operations:

// Insert cell (O(n) for directory shift, but no cell shift!)
func (p *Page) InsertCell(key, value []byte) bool {
    // Check space
    cellSize := varintSize16(len(key)) + varintSize16(len(value)) + len(key) + len(value)
    if p.freeBytes() < cellSize + 2 {
        return false  // Page full
    }

    // Find insertion index (binary search)
    idx := p.searchCell(key)

    // Write cell at free space end (grows backward)
    p.freeOffset -= cellSize
    p.writeLeafCell(p.freeOffset, key, value)

    // Insert offset in directory (grows forward)
    p.insertOffsetAt(idx, p.freeOffset)
    p.numCells++

    return true
}

// Search cell (O(log n) binary search in directory)
func (p *Page) SearchCell(key []byte) int {
    left, right := 0, int(p.numCells)
    for left < right {
        mid := (left + right) / 2
        cellKey := p.cellAt(mid).Key
        if bytes.Compare(cellKey, key) < 0 {
            left = mid + 1
        } else {
            right = mid
        }
    }
    return left
}

2. Pager (btree/pager.go)

Purpose: Manage page cache, disk I/O, and page lifecycle.

Architecture:

┌─────────────────────────────────────────────┐
│ Pager                                        │
│ ┌─────────────────────────────────────────┐ │
│ │ LRU Page Cache (default 100 pages)      │ │
│ │   pageID → *Page                        │ │
│ │   [1]->[5]->[12]->[3]->[8]             │ │
│ │    ↑ MRU              LRU ↓             │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ Dirty Pages Set                         │ │
│ │   {1, 5, 8} → Need flush                │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ Free List                               │ │
│ │   Deleted page IDs for reuse            │ │
│ │   [42, 137, 89] → Available             │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ Metadata (Page 0)                       │ │
│ │   - Root page ID                        │ │
│ │   - Total pages                         │ │
│ │   - Free list head                      │ │
│ └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
         ↕ (I/O)
┌─────────────────────────────────────────────┐
│ Disk File (btree.db)                        │
│ [Page 0][Page 1][Page 2]...[Page N]        │
│  Meta   Root    Internal    Leaf           │
└─────────────────────────────────────────────┘

LRU Cache Design:

Why LRU?
- Hot pages stay in memory (temporal locality)
- Automatic eviction of cold pages
- Simple to implement (doubly-linked list + hash map)

Cache hit: Page in memory (fast, ~50ns)
Cache miss: Read from disk (slow, ~100µs = 2000x slower!)

Hit rate target: >95% for typical workloads

Page Lifecycle:

1. Allocation:
   - Check free list first (reuse deleted pages)
   - If empty, append new page to file
   - Return page ID

2. Load:
   - Check cache (O(1) hash lookup)
   - If miss, read from disk:
     file.Seek(pageID * 4096)
     file.Read(4096 bytes)
   - Add to cache, update LRU

3. Modification:
   - Modify page in cache
   - Mark as dirty (add to dirty set)
   - Don't flush immediately (lazy)

4. Eviction:
   - When cache full, evict LRU page
   - If dirty, flush to disk first
   - Remove from cache

5. Flush:
   - Write all dirty pages to disk
   - Clear dirty set
   - Keep pages in cache (still usable)

6. Free:
   - Add page ID to free list
   - Remove from cache
   - Update metadata

Metadata Page (Page 0):

[Magic: 8 bytes]        - "BTREEDB\0" (file type marker)
[Version: 4 bytes]      - File format version
[PageSize: 4 bytes]     - Always 4096
[RootPageID: 4 bytes]   - Current root page
[NumPages: 4 bytes]     - Total pages in file
[FreeListHead: 4 bytes] - First free page ID (linked list)
[Reserved: 4064 bytes]  - Future use

Free List Structure (Linked List):

Deleted page IDs stored in linked list:
Page 42: [NextFree=137][...]
Page 137: [NextFree=89][...]
Page 89: [NextFree=0][...]  (0 = end)

Allocation:
1. Pop head from free list (O(1))
2. Update metadata's FreeListHead
3. Return page ID

Better than bitmap (no fragmentation tracking needed)

Critical Operations:

// Get page (cache-aware)
func (p *Pager) GetPage(pageID uint32) (*Page, error) {
    // Check cache
    if page, ok := p.cache[pageID]; ok {
        p.updateLRU(pageID)  // Move to front
        return page, nil
    }

    // Cache miss - read from disk
    page := &Page{id: pageID, data: make([]byte, 4096)}
    offset := int64(pageID) * 4096
    _, err := p.file.ReadAt(page.data, offset)
    if err != nil {
        return nil, err
    }

    // Add to cache (may evict LRU)
    p.addToCache(pageID, page)
    return page, nil
}

// Flush dirty pages
func (p *Pager) Flush() error {
    for pageID := range p.dirtyPages {
        page := p.cache[pageID]
        offset := int64(pageID) * 4096
        _, err := p.file.WriteAt(page.data, offset)
        if err != nil {
            return err
        }
    }
    p.dirtyPages = make(map[uint32]bool)  // Clear dirty set
    return p.file.Sync()  // fsync
}

Performance Tuning:

Cache Size Trade-off:
- Small (10 pages = 40KB):
  + Low memory
  - High cache miss rate (many disk reads)

- Large (1000 pages = 4MB):
  + High cache hit rate
  - High memory usage

- Default (100 pages = 400KB):
  * Balanced for most workloads
  * ~95% hit rate for working set < 400KB

3. Node Operations (btree/node.go)

Purpose: Helper functions for internal and leaf node operations.

Internal Node Operations:

Internal node structure:
  Key1 < Key2 < Key3
  ↓      ↓      ↓      ↓
[P0]  [P1]  [P2]  [P3]

P0: All keys < Key1
P1: Keys where Key1 ≤ k < Key2
P2: Keys where Key2 ≤ k < Key3
P3: All keys ≥ Key3 (RightPtr)

Search Child Algorithm:

func searchChild(page *Page, key []byte) uint32 {
    // Binary search for separator key
    idx := page.searchCell(key)

    if idx == 0 {
        // Key smaller than all separators → leftmost child
        return page.cellAt(0).ChildPageID
    } else if idx >= page.numCells() {
        // Key larger than all separators → rightmost child
        return page.rightPtr
    } else {
        // Key falls between separators
        return page.cellAt(idx).ChildPageID
    }
}

Example:
Internal page: [10, 20, 30], RightPtr=P3
Search(5):  idx=0return P0 (leftmost)
Search(15): idx=1return P1
Search(25): idx=2return P2
Search(35): idx=3 (>= numCells) → return P3 (RightPtr)

Insert Pointer Algorithm:

func insertPointer(page *Page, key []byte, childPageID uint32) error {
    // Find insertion point
    idx := page.searchCell(key)

    // Insert cell (key + child pointer)
    cell := InternalCell{
        Key:         key,
        ChildPageID: childPageID,
    }

    // Write cell and update directory
    return page.insertInternalCell(idx, cell)
}

Leaf Node Operations:

Leaf node structure:
  Sorted keys with values
  [k1:v1][k2:v2][k3:v3]...[kN:vN] → RightPtr (next leaf)

All leaves form doubly-linked list for range scans:
  [Leaf1] ⇄ [Leaf2] ⇄ [Leaf3] ⇄ [Leaf4]

Search Key Algorithm:

func searchKey(page *Page, key []byte) ([]byte, bool, error) {
    // Binary search in leaf cells
    idx := page.searchCell(key)

    if idx < page.numCells() {
        cell := page.cellAt(idx)
        if bytes.Equal(cell.Key, key) {
            return cell.Value, true, nil
        }
    }

    return nil, false, nil
}

4. Split Algorithm (btree/split.go)

Purpose: Maintain balanced tree by splitting full pages.

Why Splitting is Needed:

Problem: Page full, can't insert
4KB page can hold ~357 keys (with varint)
Once full, must split to maintain B-tree properties

Leaf Page Split:

Before:
┌────────────────────────────────────────────┐
│ Leaf: [1,2,3,4,5,6,7,8,9,10] ← FULL       │
└────────────────────────────────────────────┘

After inserting key=4.5:
┌────────────────────────┐  ┌────────────────────────┐
│ Leaf1: [1,2,3,4,4.5]  │→│ Leaf2: [5,6,7,8,9,10] │
└────────────────────────┘  └────────────────────────┘
                              ↑
                         Separator: "5"

Parent Internal node gets: [5, PageID(Leaf2)]

Leaf Split Algorithm:

func (b *BTree) splitLeaf(pageID uint32, key, value []byte) ([]byte, uint32, error) {
    oldPage := b.pager.GetPage(pageID)

    // Create new page
    newPage := b.pager.NewPage(LEAF)

    // Find midpoint
    midIdx := oldPage.numCells() / 2

    // Move second half to new page
    for i := midIdx; i < oldPage.numCells(); i++ {
        cell := oldPage.cellAt(i)
        newPage.insertCell(cell.Key, cell.Value)
    }

    // Truncate old page
    oldPage.truncateCells(midIdx)

    // Link pages (doubly-linked list for scans)
    newPage.rightPtr = oldPage.rightPtr
    oldPage.rightPtr = newPage.id

    // Insert new key into appropriate page
    separatorKey := newPage.cellAt(0).Key
    if bytes.Compare(key, separatorKey) < 0 {
        oldPage.insertCell(key, value)
    } else {
        newPage.insertCell(key, value)
    }

    b.pager.markDirty(oldPage.id)
    b.pager.markDirty(newPage.id)

    return separatorKey, newPage.id, nil
}

Internal Page Split:

Before:
┌───────────────────────────────────────────────────┐
│ Internal: [10,20,30,40,50] ← FULL                │
│           ↓  ↓  ↓  ↓  ↓  ↓                       │
│          [P0][P1][P2][P3][P4][P5]                │
└───────────────────────────────────────────────────┘

After inserting separator 25:
┌────────────────────────┐  ┌────────────────────────┐
│ Internal1: [10,20]     │  │ Internal2: [40,50]     │
│            ↓  ↓  ↓     │  │            ↓  ↓  ↓     │
│           [P0][P1][P2] │  │           [P4][P5][P6] │
└────────────────────────┘  └────────────────────────┘
                             ↑
                    Separator: "30" (middle key)

Key difference: Middle key (30) MOVES UP to parent
                (Doesn't stay in either child)

Internal Split Algorithm:

func (b *BTree) splitInternal(pageID uint32, key []byte, childPageID uint32) ([]byte, uint32, error) {
    oldPage := b.pager.GetPage(pageID)
    newPage := b.pager.NewPage(INTERNAL)

    midIdx := oldPage.numCells() / 2
    midCell := oldPage.cellAt(midIdx)
    separatorKey := midCell.Key  // Middle key goes UP

    // Move second half to new page (EXCLUDING middle)
    for i := midIdx + 1; i < oldPage.numCells(); i++ {
        cell := oldPage.cellAt(i)
        newPage.insertPointer(cell.Key, cell.ChildPageID)
    }

    // Set new page's rightmost child
    newPage.rightPtr = oldPage.rightPtr

    // Truncate old page (remove second half + middle)
    oldPage.truncateCells(midIdx)
    oldPage.rightPtr = midCell.ChildPageID  // Middle's child becomes old page's rightmost

    // Insert new pointer into appropriate page
    if bytes.Compare(key, separatorKey) < 0 {
        oldPage.insertPointer(key, childPageID)
    } else {
        newPage.insertPointer(key, childPageID)
    }

    return separatorKey, newPage.id, nil
}

Root Split (Tree Growth):

Problem: Root page is full, needs split
But root has no parent!

Solution: Create NEW root
Before:
       [Root: Full]

After:
       [New Root: separator]
              ↓
     [Old Root] [New Page]

Tree height increases by 1!

Root Split Algorithm:

func (b *BTree) splitRoot() error {
    oldRootID := b.rootPageID
    oldRoot := b.pager.GetPage(oldRootID)

    // Create new root
    newRoot := b.pager.NewPage(INTERNAL)

    // Split old root
    separatorKey, newPageID, err := b.split(oldRootID, ...)
    if err != nil {
        return err
    }

    // New root points to both children
    newRoot.insertPointer(separatorKey, oldRootID)  // Left child
    newRoot.rightPtr = newPageID                     // Right child

    // Update root page ID
    b.rootPageID = newRoot.id
    b.height++  // Tree grew taller

    // Persist root change
    b.pager.saveMetadata()

    return nil
}

Split Propagation:

Insert can trigger cascade of splits:

1. Insert into leaf → Leaf full → Split leaf
2. Insert separator into parent → Parent full → Split parent
3. Insert separator into grandparent → Grandparent full → Split grandparent
4. Eventually may reach root → Split root → Tree grows

Example:
       [Root]           [Root: Full]        [New Root]
       /    \           /    \              /   |   \
    [A]      [B:Full] [A]   [B][C]    [A]  [B]  [C]
                 ↓           ↓  ↓       ↓   ↓    ↓
             Split B    Insert C   Cascade up if needed

5. Page Merge (btree/merge.go)

Purpose: Reclaim space from underfull pages after deletions.

Underflow Detection:

Trigger: Page falls below 25% capacity
Why 25%? Balance between:
- Too high (30%+): Frequent rebalancing
- Too low (10%-): Wasted space

Min cells threshold:
- Internal: order/4 (e.g., 128/4 = 32 cells)
- Leaf: order/4 (e.g., 128/4 = 32 cells)

Rebalancing Strategies:

Strategy 1: Redistribution (Borrow from sibling)

Before:
Parent: [50]
         ↓  ↓
Left: [10,20,30,40]  (4 cells)
Right: [60]          (1 cell ← UNDERFULL)

After:
Parent: [40]  ← Updated separator
         ↓  ↓
Left: [10,20,30]     (3 cells)
Right: [40,60]       (2 cells ← Borrowed)

Redistribution possible if sibling has enough cells

Strategy 2: Merge (Combine with sibling)

Before:
Parent: [50, 80]
         ↓   ↓   ↓
Left: [10,20]  (2 cells ← UNDERFULL)
Middle: [60]   (1 cell ← UNDERFULL)
Right: [90]

After:
Parent: [80]  ← Removed separator "50"
         ↓   ↓
LeftMerged: [10,20,60]  (3 cells)
Right: [90]

Merge when both siblings underfull

Merge Algorithm:

func (b *BTree) mergeLeaf(leftPageID, rightPageID uint32, parentPageID uint32, separatorIdx int) error {
    leftPage := b.pager.GetPage(leftPageID)
    rightPage := b.pager.GetPage(rightPageID)
    parentPage := b.pager.GetPage(parentPageID)

    // Copy all cells from right to left
    for i := 0; i < rightPage.numCells(); i++ {
        cell := rightPage.cellAt(i)
        leftPage.insertCell(cell.Key, cell.Value)
    }

    // Update linked list (skip right page)
    leftPage.rightPtr = rightPage.rightPtr

    // Remove separator from parent
    parentPage.removeCell(separatorIdx)

    // Free right page
    b.pager.freePage(rightPageID)

    // Mark pages dirty
    b.pager.markDirty(leftPageID)
    b.pager.markDirty(parentPageID)

    // Check if parent now underfull (recursive!)
    if parentPage.isUnderfull() && parentPageID != b.rootPageID {
        return b.rebalanceParent(parentPageID)
    }

    return nil
}

Root Shrinking:

Special case: Root has only 1 child after merge

Before:
      [Root: empty]
           ↓
      [Only Child]
       /    |    \
    [A]   [B]   [C]

After:
      [Only Child ← NEW ROOT]
       /    |    \
    [A]   [B]   [C]

Tree height decreases by 1!

When Merge is Triggered:

Delete operation:
1. Remove key from leaf
2. Check if leaf underfull (<25%)
3. If yes:
   a. Try redistribution from sibling
   b. If can't redistribute, merge
   c. Recursively check parent
4. If root becomes empty, shrink tree

Performance Impact:

Without merge:
- Delete 50% of keys
- Still occupy 100% space
- Space amp = 2.0x

With merge:
- Delete 50% of keys
- Space reclaimed via merge
- Space amp = 1.1x (near optimal!)

6. Physical Write-Ahead Log (btree/wal.go)

Purpose: Provide crash recovery by logging page modifications before applying them.

Physical vs Logical WAL:

Logical WAL (LSM-Tree):
- Logs operations: Put(key="user1", value="Alice")
- Pros: Small records, easy to understand
- Cons: Must replay operations to reconstruct state

Physical WAL (B-Tree):
- Logs page modifications: Set bytes 100-200 of page 42 to [data]
- Pros: Direct replay, byte-accurate
- Cons: Larger records, more complex

WAL Record Types:

1. PAGE_WRITE: Before modifying page

[Type: 1 byte]          = 0x01
[PageID: 4 bytes]       - Which page
[Offset: 4 bytes]       - Where in page
[Length: 4 bytes]       - How many bytes
[Data: variable]        - Byte data
[CRC32: 4 bytes]        - Checksum

2. PAGE_SPLIT: Before splitting

[Type: 1 byte]          = 0x02
[OldPageID: 4 bytes]    - Page being split
[NewPageID: 4 bytes]    - New page created
[SeparatorLen: 2 bytes] - Length of separator key
[Separator: variable]   - Separator key
[CRC32: 4 bytes]

3. ROOT_CHANGE: When root changes

[Type: 1 byte]          = 0x03
[NewRootPageID: 4 bytes]- New root page ID
[CRC32: 4 bytes]

4. CHECKPOINT: Mark safe point

[Type: 1 byte]          = 0x04
[Timestamp: 8 bytes]    - When checkpoint taken
[CRC32: 4 bytes]

WAL Lifecycle:

1. Operation begins
   ├─→ Write WAL record
   ├─→ fsync WAL file
   └─→ Guaranteed on disk

2. Apply operation
   ├─→ Modify pages in cache
   ├─→ Mark pages dirty
   └─→ Don't flush yet (lazy)

3. Eventually flush
   ├─→ Write dirty pages to disk
   ├─→ fsync data file
   └─→ Pages durable

4. Checkpoint
   ├─→ Write CHECKPOINT record
   ├─→ fsync WAL
   └─→ Truncate WAL (can discard old records)

Crash Recovery:

func (b *BTree) recoverFromWAL() error {
    wal := openWAL(b.config.DataDir + "/wal.log")
    entries := wal.ReadAll()

    for _, entry := range entries {
        switch entry.Type {
        case PAGE_WRITE:
            // Replay page modification
            page := b.pager.GetPage(entry.PageID)
            copy(page.data[entry.Offset:], entry.Data)
            b.pager.markDirty(entry.PageID)

        case PAGE_SPLIT:
            // Re-split operation
            // (Usually not needed, splits already durable)

        case ROOT_CHANGE:
            // Update root page ID
            b.rootPageID = entry.NewRootPageID

        case CHECKPOINT:
            // Safe to discard all records before this
            break
        }
    }

    // Flush recovered state
    b.pager.Flush()

    // Truncate WAL (recovery complete)
    wal.Truncate()

    return nil
}

Crash Scenarios:

Scenario 1: Crash before WAL sync

Timeline:
1. Write WAL record (in buffer)
2. CRASH! (before fsync)

Recovery:
- WAL entry not on disk
- Operation never happened
- Consistent state restored
Result: Lost operation (acceptable, not synced)

Scenario 2: Crash after WAL sync, before page flush

Timeline:
1. Write WAL record
2. fsync WAL ✓
3. Modify page in cache
4. CRASH! (before flush)

Recovery:
- WAL has operation
- Page not modified on disk
- Replay WAL → Apply operation
Result: No data loss! ✓

Scenario 3: Crash during page flush

Timeline:
1. Write WAL ✓
2. Modify page ✓
3. Start flush (partial write)
4. CRASH! (mid-write)

Recovery:
- WAL has operation
- Page partially written (corrupted)
- Replay WAL → Overwrite with correct data
Result: No data loss! ✓

Scenario 4: Crash after flush, before checkpoint

Timeline:
1. Write WAL ✓
2. Modify page ✓
3. Flush page ✓
4. CRASH! (before checkpoint)

Recovery:
- WAL has operation
- Page already on disk
- Replay WAL → Re-apply (idempotent)
- Result: Duplicate write (harmless)
- Checkpoint later → Truncate WAL

Performance Impact:

Write amplification with WAL:
- User write: 1x
- WAL write: 1x (small record)
- Page flush: 1x (4KB page)
Total: ~2-3x write amp

Trade-off:
- Cost: 2-3x write amp
- Gain: Crash safety, durability guarantees
- Verdict: Essential for production!

Checkpoint Strategy:

When to checkpoint:
1. After every N page flushes (e.g., N=100)
2. On explicit Sync() call
3. On normal Close()

Why checkpoint:
- Keeps WAL small (truncate old records)
- Faster recovery (fewer records to replay)
- Reclaims disk space

7. Fine-Grained Locking (btree/latch.go)

Purpose: Allow concurrent reads and writes without global lock.

Latch Coupling Protocol:

Traditional approach: Global RWMutex
- Problem: Readers block writers, writers block everyone
- Throughput: Limited to serial execution

Latch coupling:
- Each page has its own RWMutex
- Lock parent → Lock child → Unlock parent
- Multiple threads traverse different paths simultaneously
- Throughput: 2-5x improvement on multi-core

Latch Coupling Algorithm (Read):

Search for key "75":

Step 1: Lock root (READ)
       [Root: 50, 80] ← LOCKED
        ↓    ↓    ↓

Step 2: Lock child (READ), unlock root
       [Root: 50, 80] ← UNLOCKED
        ↓    ↓    ↓
            [P2: 60-90] ← LOCKED

Step 3: Lock leaf (READ), unlock parent
            [P2: 60-90] ← UNLOCKED
                 ↓
            [Leaf: 70,75,80] ← LOCKED

Step 4: Read value, unlock leaf
            [Leaf] ← UNLOCKED
            Return value

Only one page locked at a time!
Other threads can traverse different paths simultaneously.

Latch Coupling Algorithm (Write):

Insert key "75":

Step 1: Lock root (WRITE if might split, else READ)
       [Root] ← WRITE LOCK (safe node check)

Step 2: Descend to child
       - If child won't split → downgrade to READ lock
       - If child might split → keep WRITE lock

Step 3: Continue until leaf
       [Leaf] ← WRITE LOCK

Step 4: Insert, unlock all

Safe Node Concept:

Safe node = Node that WON'T split on insert

For inserts:
- Page is safe if has space for new cell
- If safe, no split propagation → release parent locks

For deletes:
- Page is safe if won't underflow after deletion
- If safe, no merge propagation → release parent locks

Optimization:
- Check if node is safe BEFORE locking deeply
- Release ancestor locks early if safe
- Reduces lock contention dramatically

Implementation:

type Page struct {
    id       uint32
    data     []byte
    latch    sync.RWMutexPer-page lock
    dirty    bool
}

func (b *BTree) ConcurrentGet(key []byte) ([]byte, error) {
    pageID := b.rootPageID
    var parentLatch *sync.RWMutex = nil

    for {
        page := b.pager.GetPage(pageID)

        // Lock current page
        page.latch.RLock()

        // Unlock parent (latch coupling!)
        if parentLatch != nil {
            parentLatch.RUnlock()
        }

        if page.isLeaf() {
            // Found leaf, search for key
            value, found := searchKey(page, key)
            page.latch.RUnlock()
            if found {
                return value, nil
            }
            return nil, ErrKeyNotFound
        }

        // Internal node, descend
        childPageID := searchChild(page, key)
        parentLatch = &page.latch
        pageID = childPageID
    }
}

Deadlock Prevention:

Potential deadlock:
Thread 1: Lock A → Lock B
Thread 2: Lock B → Lock A
Result: Deadlock!

Latch coupling avoids this:
- Always lock top-down (root → leaf)
- Never lock sibling simultaneously
- Release parent before locking child
- Result: No cycles possible!

Proof: Locks form a DAG (directed acyclic graph)
       Root always locked before children
       No back edges → No cycles → No deadlock

Performance Comparison:

Benchmark: Concurrent reads (8 threads)

Global RWMutex:
- Throughput: 300K ops/sec
- All threads wait for lock

Per-page latching:
- Throughput: 1.2M ops/sec
- Threads traverse independently
- 4x improvement!

Why 4x (not 8x)?
- Some contention on root (hot page)
- Cache line bouncing
- Still significant improvement

Write Contention:

Writes are harder:
- Must WRITE-lock path to leaf
- Blocks other writers on same path
- Still allows concurrent reads

Mixed workload (80% reads, 20% writes):
- Global lock: 300K ops/sec
- Latch coupling: 900K ops/sec
- 3x improvement

8. Variable-Length Encoding (btree/varint.go)

Purpose: Reduce space overhead for small key/value sizes.

Problem with Fixed Encoding:

V1 format uses fixed 2-byte size prefixes:
Key="user" (4 bytes), Value="Alice" (5 bytes)
Encoding: [0x00, 0x04][0x00, 0x05][...data...]
          └─ Wasted byte (size < 128)

For small keys/values (most common!):
- 1st byte always 0x00
- Wasted 1 byte per size field
- 2 size fields per cell = 2 bytes wasted per cell

Varint Encoding (Protocol Buffers Style):

Values 0-127: 1 byte
[0xxx xxxx]

Values 128-16383: 2 bytes
[1xxx xxxx][xxxx xxxx]
 ↑ MSB=1 means "continue"

Values 16384+: 3 bytes
[1xxx xxxx][1xxx xxxx][xxxx xxxx]

Example:
5 → [0x05]                    (1 byte)
200 → [0x81, 0x48]            (2 bytes)
20000 → [0x82, 0xBC, 0x20]    (3 bytes)

Space Savings:

Typical key size distribution:
- 95% keys < 128 bytes
- 4% keys 128-16383 bytes
- 1% keys > 16383 bytes

V1 encoding: 100% use 2 bytes
V2 encoding: 95% use 1 byte, 4% use 2 bytes, 1% use 3 bytes
Average: 1.06 bytes per size field

Savings: 2 size fields × (2 - 1.06) = 1.88 bytes per cell
With 306 cells/page: 1.88 × 306 = 575 bytes saved
Extra cells: 575 / (avg cell size 11) = 51 more cells
Improvement: 306 → 357 cells per page (17% increase!)

Implementation:

func putUvarint16(buf []byte, x uint16) int {
    i := 0
    for x >= 0x80 {
        buf[i] = byte(x) | 0x80  // Set MSB
        x >>= 7
        i++
    }
    buf[i] = byte(x)
    return i + 1
}

func uvarint16(buf []byte) (uint16, int) {
    var x uint16
    var s uint
    for i := 0; i < 3; i++ {
        b := buf[i]
        if b < 0x80 {
            // Last byte
            return x | uint16(b)<<s, i + 1
        }
        // More bytes coming
        x |= uint16(b&0x7F) << s
        s += 7
    }
    return 0, -1  // Overflow
}

func varintSize16(x uint16) int {
    if x < 128 {
        return 1
    } else if x < 16384 {
        return 2
    }
    return 3
}

Backward Compatibility:

Challenge: Old pages use V1, new pages use V2
Solution: Page format versioning

Page header version byte:
- 0 (or unset): V1 format (fixed 2-byte encoding)
- 1: V2 format (varint encoding)

Parsing:
func (p *Page) parseLeafCell(offset int) (*Cell, error) {
    if p.Version() == PageFormatV1 {
        // Fixed 2-byte encoding
        keySize := binary.BigEndian.Uint16(p.data[offset:])
        valueSize := binary.BigEndian.Uint16(p.data[offset+2:])
        offset += 4
    } else {
        // Varint encoding
        keySize, n1 := uvarint16(p.data[offset:])
        valueSize, n2 := uvarint16(p.data[offset+n1:])
        offset += n1 + n2
    }
    // ... rest of parsing
}

Gradual migration:
- Old pages remain V1 (still readable)
- New pages use V2 (better space efficiency)
- No downtime required!

Performance Impact:

CPU overhead:
- Varint decode: ~5-10ns per call
- Fixed decode: ~2ns per call
- Overhead: ~3-8ns (negligible!)

Space savings:
- 18% fewer bytes per page
- 17% more keys per page
- Fewer pages → fewer disk I/O
- Net result: Faster despite tiny CPU cost!

Tree height reduction:
- 1M keys, 306 keys/page: height = ceil(log306(1M)) = 3
- 1M keys, 357 keys/page: height = ceil(log357(1M)) = 3
- 100M keys, 306 keys/page: height = 4
- 100M keys, 357 keys/page: height = 3 ← One level shorter!

9. Iterator (btree/iterator.go)

Purpose: Support efficient range scans over sorted keys.

Linked Leaf Pages:

Internal nodes:
       [Root: 50, 80]
        ↓    ↓    ↓

Leaf pages (linked list):
[Leaf1: 10,20,30] → [Leaf2: 40,50,60] → [Leaf3: 70,80,90] → NULL
      ↑ RightPtr         ↑ RightPtr          ↑ RightPtr

Range scan exploits this structure!

Iterator State:

type BTreeIterator struct {
    btree       *BTree
    currentPage *Page
    cellIndex   int
    endKey      []byte
    err         error
    started     bool
}

Seek Algorithm:

Scan("user:100", "user:200"):

1. Seek to start key "user:100"
   - Traverse tree to leaf containing "user:100"
   - Binary search within leaf for first key >= "user:100"

2. Set iterator state:
   currentPage = leaf page
   cellIndex = index of first key >= "user:100"
   endKey = "user:200"

Example tree:
       [Root: user:150]
        ↓            ↓
[Leaf1: user:050-120] → [Leaf2: user:130-180] → [Leaf3: user:190-250]

Seek("user:100"):
- Traverse to Leaf1
- Binary search finds index 5 (key="user:105")
- Return Leaf1, index=5

Next Algorithm:

func (it *BTreeIterator) Next() bool {
    if !it.started {
        it.Seek(it.startKey)
        it.started = true
    }

    // Check end condition
    if it.cellIndex >= it.currentPage.numCells() {
        // Move to next page
        rightPtr := it.currentPage.rightPtr
        if rightPtr == 0 {
            return false  // End of tree
        }
        it.currentPage = it.btree.pager.GetPage(rightPtr)
        it.cellIndex = 0
    }

    // Check end key
    key := it.currentPage.cellAt(it.cellIndex).Key
    if it.endKey != nil && bytes.Compare(key, it.endKey) >= 0 {
        return false  // Beyond range
    }

    it.cellIndex++
    return true
}

Full Scan Example:

Scan("a", "z"):

Step 1: Seek to "a" (first key in tree)
Step 2: Iterate through Leaf1 cells
Step 3: Move to Leaf2 via rightPtr
Step 4: Iterate through Leaf2 cells
Step 5: Move to Leaf3 via rightPtr
Step 6: Stop when key >= "z"

Performance:
- Sequential I/O (follow linked list)
- No tree traversal after initial seek
- Cache-friendly (scan hot pages)

Usage:

iter, _ := btree.Scan([]byte("user:100"), []byte("user:200"))
defer iter.Close()

for iter.Next() {
    key := iter.Key()
    value := iter.Value()
    fmt.Printf("%s: %s\n", key, value)
}

if iter.Error() != nil {
    log.Fatal(iter.Error())
}

Performance Characteristics:

Range scan [start, end]:
- Seek: O(log n) tree traversal
- Scan: O(k) where k = keys in range
- I/O: ceil(k / keys_per_page) page reads

Example: 1M keys, scan 1000 keys
- Seek: log(1M) = ~20 comparisons
- Scan: 1000 keys / 357 keys_per_page = 3 page reads
- Total: 20 comparisons + 3 disk reads

Compare to Hash Index: No range scans possible!
Compare to LSM-Tree: Must merge all levels (slower)

10. Main B-Tree Engine (btree/btree.go)

Purpose: Coordinate all components, provide user-facing API.

Architecture:

┌─────────────────────────────────────────────────────┐
│ BTree (Main Engine)                                 │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Config: DataDir, Order, CacheSize               │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Pager: Page cache, I/O, free list               │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ WAL: Crash recovery, durability                 │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Metadata: RootPageID, Height, NumKeys           │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Stats: Read/Write counters, amplification       │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Lock: RWMutex for structural changes            │ │
│ │       (Per-page latches for data operations)    │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘

Configuration:

type Config struct {
    DataDir   string  // Directory for data files
    Order     int     // Max keys per page (default 128)
    CacheSize int     // Page cache size (default 100 pages)
    SyncWAL   bool    // fsync WAL on every write (default true)
}

func DefaultConfig(dataDir string) Config {
    return Config{
        DataDir:   dataDir,
        Order:     128,
        CacheSize: 100,
        SyncWAL:   true,
    }
}

Initialization:

func New(config Config) (*BTree, error) {
    // Create directory
    os.MkdirAll(config.DataDir, 0755)

    // Open pager
    pager := newPager(config.DataDir+"/btree.db", config.CacheSize)

    // Open WAL
    wal := openWAL(config.DataDir + "/wal.log")

    // Check if existing database
    if fileExists(config.DataDir + "/btree.db") {
        // Load metadata (root page ID, etc.)
        metadata := pager.loadMetadata()

        // Recover from WAL if needed
        wal.recover(pager)

        return &BTree{
            config:     config,
            pager:      pager,
            wal:        wal,
            rootPageID: metadata.RootPageID,
            height:     metadata.Height,
            numKeys:    metadata.NumKeys,
        }, nil
    }

    // New database - create root page
    rootPage := pager.NewPage(LEAF)
    metadata := Metadata{
        RootPageID: rootPage.id,
        Height:     1,
        NumKeys:    0,
    }
    pager.saveMetadata(metadata)

    return &BTree{
        config:     config,
        pager:      pager,
        wal:        wal,
        rootPageID: rootPage.id,
        height:     1,
        numKeys:    0,
    }, nil
}

Put Operation:

func (b *BTree) Put(key, value []byte) error {
    // Validate inputs
    if len(key) == 0 {
        return ErrEmptyKey
    }

    // Write to WAL (durability)
    b.wal.logPut(key, value)

    // Global lock for structural changes
    b.mu.Lock()
    defer b.mu.Unlock()

    // Insert into tree
    splitKey, newPageID, err := b.insertRecursive(b.rootPageID, key, value, 0)
    if err != nil {
        return err
    }

    // If root split, create new root
    if splitKey != nil {
        b.splitRoot(splitKey, newPageID)
    }

    b.numKeys++
    b.stats.writeCount.Add(1)

    return nil
}

Get Operation:

func (b *BTree) Get(key []byte) ([]byte, error) {
    // Read lock (concurrent reads allowed)
    b.mu.RLock()
    defer b.mu.RUnlock()

    pageID := b.rootPageID

    // Traverse tree
    for {
        page := b.pager.GetPage(pageID)

        if page.isLeaf() {
            // Search in leaf
            value, found := searchKey(page, key)
            if found {
                b.stats.readCount.Add(1)
                return value, nil
            }
            return nil, ErrKeyNotFound
        }

        // Internal node - find child
        pageID = searchChild(page, key)
    }
}

Delete Operation:

func (b *BTree) Delete(key []byte) error {
    // Write to WAL
    b.wal.logDelete(key)

    b.mu.Lock()
    defer b.mu.Unlock()

    // Find and remove key
    found, err := b.deleteRecursive(b.rootPageID, key, nil, 0)
    if err != nil {
        return err
    }

    if !found {
        return ErrKeyNotFound
    }

    b.numKeys--
    return nil
}

Sync Operation:

func (b *BTree) Sync() error {
    // Flush all dirty pages
    if err := b.pager.Flush(); err != nil {
        return err
    }

    // Checkpoint WAL
    if err := b.wal.Checkpoint(); err != nil {
        return err
    }

    // Update metadata
    metadata := Metadata{
        RootPageID: b.rootPageID,
        Height:     b.height,
        NumKeys:    b.numKeys,
    }
    return b.pager.saveMetadata(metadata)
}

Close Operation:

func (b *BTree) Close() error {
    // Sync everything
    if err := b.Sync(); err != nil {
        return err
    }

    // Close WAL
    if err := b.wal.Close(); err != nil {
        return err
    }

    // Close pager
    return b.pager.Close()
}

Stats Implementation:

func (b *BTree) Stats() common.Stats {
    b.mu.RLock()
    defer b.mu.RUnlock()

    // Calculate space amplification
    numPages := b.pager.NumPages()
    totalDiskSize := numPages * 4096

    // Estimate logical size
    avgKeySize := 8  // Approximate
    avgValueSize := 32  // Approximate
    logicalSize := b.numKeys * (avgKeySize + avgValueSize)

    spaceAmp := float64(totalDiskSize) / float64(logicalSize)

    // Write amplification (WAL doubles writes)
    writeAmp := 2.0 + 1.0  // WAL + page flush + occasional split

    return common.Stats{
        NumKeys:            b.numKeys,
        NumSegments:        numPages,
        TotalDiskSize:      uint64(totalDiskSize),
        WriteCount:         b.stats.writeCount.Load(),
        ReadCount:          b.stats.readCount.Load(),
        WriteAmplification: writeAmp,
        SpaceAmplification: spaceAmp,
    }
}

Compact (No-Op):

func (b *BTree) Compact() error {
    // B-Tree doesn't need compaction!
    // In-place updates mean no old versions to clean up
    // Space reclaimed automatically via page merging
    return nil
}

Performance Characteristics Summary:

Operation     Time Complexity    Disk I/O
─────────────────────────────────────────────
Put           O(log n)           O(height) reads + O(1) write
Get           O(log n)           O(height) reads
Delete        O(log n)           O(height) reads + O(1) write
Scan          O(log n + k)       O(height + k/fanout)

Where:
- n = total keys
- k = keys in range
- height = typically 3-4 for millions of keys
- fanout = keys per page (128-357)

Space Amplification:  1.0-1.1x (BEST!)
Write Amplification:  2-3x (WAL + page flush)
Read Amplification:   1x (direct lookup)

SSTable Format

Block-Based Storage

Why Blocks?

Alternative 1: One entry per disk read
- Pro: Simple
- Con: 100 entries = 100 disk seeks = 10ms!

Alternative 2: Entire file in memory
- Pro: Fast
- Con: Doesn't scale

Block-based (4KB blocks):
- Pro: Amortizes disk seeks
- Pro: Fits in page cache
- Pro: Small enough for fast search
- Balance: ~100 entries per block

Common Concepts

Sequence Numbers

Purpose: Total ordering of all writes for conflict resolution.

Timeline:
t=1: Put("user1", "Alice") → seq=1
t=2: Put("user2", "Bob")   → seq=2
t=3: Put("user1", "Carol") → seq=3
t=4: Delete("user2")       → seq=4

During merge:
- "user1" has seq=1 and seq=3 → Keep seq=3 ("Carol")
- "user2" has seq=2 and seq=4 → Keep seq=4 (deleted)

Implementation:

var sequence uint64  // Global atomic counter

func Put(key, value) {
    seq := atomic.AddUint64(&sequence, 1)
    entry := Entry{key, value, seq, deleted=false}
    ...
}

Tombstones

Purpose: Mark keys as deleted without immediate removal.

Why Needed?

Problem without tombstones:
1. Delete "key1" from memtable
2. Flush memtable (no entry for "key1")
3. "key1" still in L1 SSTable!
4. Get("key1") returns old value ← BUG!

Solution with tombstones:
1. Delete "key1" → Add tombstone to memtable
2. Flush memtable → Tombstone in L0
3. Compaction merges L0+L1
4. Tombstone overrides old value
5. Drop tombstone in final level (L4)

Format:

type Entry struct {
    Key     string
    Value   []byte
    Deleted boolTombstone marker
    Seq     uint64
}

Write Amplification

Definition: Total bytes written to disk / User bytes written

Example:
User writes 100MB of data
Compaction rewrites it 3 times
Total disk writes: 100MB + 300MB = 400MB
Write amplification: 400MB / 100MB = 4x

LSM Write Amplification Calculation:

For LSM with 10x ratio and 4 levels:

L0→L1: Write 40MB, read 40MB (L0) + 400MB (L1) = 440MB write
L1→L2: Write 400MB, read 400MB (L1) + 4GB (L2) = 4.4GB write
L2→L3: Write 4GB, read 4GB (L2) + 40GB (L3) = 44GB write
L3→L4: Write 40GB, read 40GB (L3) + 400GB (L4) = 440GB write

Total amplification ≈ 1 + 10 × num_levels = 41x

Space Amplification

Definition: Actual disk usage / Logical data size

Example:
Logical data: 100MB (latest versions)
Disk usage: 250MB (includes old versions, tombstones)
Space amplification: 250MB / 100MB = 2.5x

Factors:

  1. Old versions not yet compacted
  2. Tombstones not yet dropped
  3. Fragmentation in blocks
  4. Bloom filters, indexes, metadata

Crash Recovery

WAL-Based Recovery:

On Startup:
1. Check if WAL exists
2. If yes:
   a. Replay all entries into memtable
   b. Restore sequence number
3. Load existing SSTables
4. Resume operation

Guarantees:
- Durability: All synced writes recovered
- Consistency: Sequence numbers maintain order
- Isolation: No partial writes visible

Performance Tuning Guide

MemTable Size

Small (1MB):
+ Faster flush
+ Less memory
- More L0 files
- More compaction

Large (64MB):
+ Fewer L0 files
+ Less compaction
- Slower flush
- More memory
- Longer WAL replay

Sweet spot: 4-8MB

Block Size

Small (512B):
+ Fine-grained access
- Large index overhead
- More seeks

Large (64KB):
+ Small index
+ Fewer seeks
- Wasted I/O
- Slower search

Sweet spot: 4KB (page size)

Bloom Filter False Positive Rate

Lower FP (0.1%):
+ Fewer false positives
+ Fewer wasted reads
- Larger bloom filter
- More memory

Higher FP (5%):
+ Smaller bloom filter
+ Less memory
- More false positives
- More wasted reads

Sweet spot: 1% (balanced)

Debugging Tips

Check Level Distribution

for level := 0; level < 5; level++ {
    numFiles := lsm.levels.NumFiles(level)
    size := lsm.levels.LevelSize(level)
    fmt.Printf("L%d: %d files, %d MB\n", level, numFiles, size/1024/1024)
}

Monitor Write Amplification

Track:
- Bytes written by user
- Bytes written by compaction
- Ratio = total / user

High ratio (>50x)?
→ Level sizes too small
→ Increase size ratio or thresholds

Find Hot Keys

Add read counter to SSTable:
- Which SSTables get most reads?
- Which blocks get most reads?
- Optimize: Cache hot blocks in memory

This guide provides the foundation for understanding and extending these storage engines. Each design decision represents a trade-off optimized for specific workloads.