Skip to content

Step 3: Pure Elixir Backend #181

@thanos

Description

@thanos

Step 3: Pure Elixir Backend

Modify lib/ex_data_sketch/backend/pure.ex

Binary state layout (ULL1, identical structure to HLL):

Offset  Size    Field
0       4       Header: version(u8=1) + p(u8) + reserved(u16 LE=0)
4       2^p     Registers (one u8 per register)
Total: 4 + 2^p bytes.
Magic: "ULL1"

Wait -- to maintain consistency with other sketches that use 4-byte magic prefixes (BLM1, CKO1, QOT1, etc.), use a 4-byte magic "ULL1" as the state header prefix:

Offset  Size    Field
0       4       Magic "ULL1"
4       1       Version (u8, 1)
5       1       Precision p (u8, 4..26)
6       2       Reserved (u16 LE, 0)
8       2^p     Registers (one u8 per register)
Total: 8 + 2^p bytes.

ULL register encoding from Ertl 2023:

  • Given 64-bit hash h and precision p:
    • bucket = h >>> (64 - p) (same as HLL)
    • w = h <<< p (remaining bits, shifted left)
    • geometric_rank = clz(w) + 1 (count leading zeros + 1, like HLL)
    • sub_bit = bit at position (p + geometric_rank) in original hash (the bit just after the leading zeros)
    • register_value = 2 * geometric_rank - sub_bit (encodes rank + sub-bucket in single byte)
    • Clamp to 0..255

Update rule: new_reg = max(old_reg, register_value) (same as HLL)

FGRA estimator:

  • For each register value r, compute contribution q_r:
    • If r == 0: use q_0 = (2^(1-p) * C0) where C0 is the small-range correction
    • Otherwise: q_r = 2^(-floor((r+1)/2)) adjusted by the sub-bucket bit
  • raw_estimate = alpha_m * m^2 / sum(q_r)
  • Apply bias correction table (pre-computed for each p value)

The exact FGRA formulas from the paper:

tau(x) = (1 - x) * sum_{k>=1} (1 - x^(2^k)) / 2^k   [converges quickly]
sigma(x) = x * sum_{k>=1} x^(2^k) / (1 + x^(2^k))   [converges quickly]

estimate = alpha_inf * m^2 / (m * sigma(C0/m) + sum_{r=1..q} C_r * 2^(-r) + m * tau((m - C_q)/m))

where C_r is the count of registers with value r, q = max register value, and alpha_inf = 1 / (2 * ln(2)).

Note: The FGRA estimator is actually the "new HLL" estimator from Ertl's earlier 2017 paper, applied to ULL register values. The sigma and tau functions converge in ~50 iterations.

Functions to implement:

  • ull_new/1 -- allocate header + zeroed registers
  • ull_update/3 -- compute register value from hash, max-update
  • ull_update_many/3 -- tuple-based bulk update (same pattern as hll_update_many)
  • ull_merge/3 -- register-wise max (reuse zip_max_binary helper from HLL)
  • ull_estimate/3 -- FGRA estimator with sigma/tau helper functions

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions