English | 中文
High-performance shared primitives for the uniyak ecosystem
Zero external dependencies · Generics · Lock-free · Zero-allocation optimized
go get -u github.com/uniyakcom/yakutil
Go 1.25+
| Package | Purpose | Core Types / Functions |
|---|---|---|
| yakutil | Root: utilities + constants + byte conversion + sentinel errors | B2S S2B IsPow2 Pow2Ceil Native ErrStr Pad NoCopy |
| hash | FNV-1a 64-bit hash + maphash AES acceleration | Sum64 Sum64s Sum64Map Sum64sMap |
| percpu | Per-CPU contention-free counter/gauge | Counter Gauge |
| mpsc | MPSC lock-free Ring (Group Commit) | Ring[T] |
| spsc | SPSC wait-free Ring | Ring[T] |
| backoff | Three-phase adaptive backoff | Backoff |
| arena | CAS bump allocator | Arena |
| bufpool | Tiered byte-slice pool | Pool / Get / Put / GetPtr / PutPtr |
| cow | COW atomic value (generic) | Value[T] Swap UpdateCAS |
| smap | Sharded concurrent Map | Map[V] Map64[V] |
| swar | SWAR byte-parallel scan | FindByte FindQuote FindEscape HasZero HasByte HasLess FirstByte |
| fold | Fast ASCII case-insensitive comparison | Lower Upper Equal EqualStr EqualBytes ToLowerBytes ToLowerString ToUpperBytes ToUpperString |
| ring | Auto-growing ring byte buffer | Buffer |
| wheel | Generic timing wheel (configurable resolution) | Wheel[T] |
| wpool | Worker pool + adaptive scaling | Submitter Pool Stack Adaptive |
| itable | Integer-key high-performance lookup table | Table[V] |
| lru | Sharded LRU cache | Cache[V] |
| lz4 | LZ4 block compression / decompression (pure Go, no frame headers) | Compress CompressHC Decompress CompressBound |
| ratelimit | Token-bucket rate limiter (CAS lock-free, zero-alloc) | Limiter |
| semaphore | Counting semaphore (bounded concurrency control) | Semaphore |
| art | Adaptive Radix Tree (ordered dictionary, prefix queries) | Tree[V] |
| hist | Equi-height histogram (CBO selectivity estimation) | Hist Bucket |
| hll | HyperLogLog cardinality estimator | Sketch |
| sketch | Count-Min Sketch frequency estimator | CMS |
| topn | Generic Top-N selection | TopN ByKey |
| coarsetime | Coarse-grained clock at 500µs precision (~1 ns/op) | NowNano Now Stop |
| scram | SCRAM-SHA-256 cryptographic primitives (RFC 5802), zero external deps | PBKDF2SHA256 HMACSHA256 SHA256Sum |
Zero-copy conversions and bitwise utility toolkit.
import "github.com/uniyakcom/yakutil"
s := yakutil.B2S(buf) // []byte → string (zero-copy)
b := yakutil.S2B(s) // string → []byte (zero-copy, read-only)
yakutil.IsPow2(1024) // true
n, err := yakutil.Pow2Ceil(1000) // n=1024, err=nil
// overflow: n=0, err=yakutil.ErrOverflow
// Native byte order
order := yakutil.Native // binary.BigEndian or LittleEndian
// Zero-allocation sentinel error
const ErrNotFound = yakutil.ErrStr("not found")
fmt.Println(ErrNotFound.Error()) // "not found"
// Cache-line padding
type HotStruct struct {
counter int64
_ yakutil.Pad // 64-byte padding, avoids false sharing
flag int64
}

Constants and Types

- CacheLine = 64 — x86/ARM cache line size
- Pad — [64]byte padding type
- Native — binary.ByteOrder, CPU native byte order detected at runtime
- ErrStr — zero-allocation string error type, declarable as const
- ErrOverflow — returned by Pow2Ceil on integer overflow
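For illustration, here is a minimal sketch of how zero-copy conversions like B2S/S2B are commonly implemented on Go 1.20+. The names b2s/s2b are hypothetical, and the library's actual code may differ; the usual caveats still apply either way: the string view must not observe later mutations of the byte slice, and the byte view of a string is strictly read-only.

```go
package main

import (
	"fmt"
	"unsafe"
)

// b2s reinterprets a byte slice as a string without copying.
// The caller must not mutate b while the string is in use.
func b2s(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// s2b reinterprets a string as a byte slice without copying.
// The result is read-only: Go strings are immutable.
func s2b(s string) []byte {
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	buf := []byte("hello")
	fmt.Println(b2s(buf))          // hello
	fmt.Println(len(s2b("world"))) // 5
}
```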
NoCopy
Embedding yakutil.NoCopy makes go vet copylocks detect illegal copies:
type MyMutex struct {
yakutil.NoCopy
// ...
}

import "github.com/uniyakcom/yakutil/hash"
h := hash.Sum64(key) // []byte input
h := hash.Sum64s("key") // string input (zero-alloc)

- Produces the same result as hash/fnv.New64a(); results can be compared cross-process or persisted
- Pure computation, no state allocation, inlinable
- Sum64s iterates directly over the string's underlying bytes, avoiding the []byte(s) conversion
- Measured ~2.8× faster than fnv.New64a(), zero allocations (fnv.New64a costs 1 alloc / 8 B per call)
h := hash.Sum64Map(key) // []byte input, AES-NI accelerated
h := hash.Sum64sMap("key") // string input (zero-alloc)

- Leverages Go runtime's built-in random seed + AES-NI hardware instructions
- Note: the seed is randomized on process restart; hash values cannot be persisted or compared cross-process
- Zero allocations; suitable for in-process routing, sharding, Bloom filters, etc.
Performance Comparison (Intel Xeon E-2186G @ 3.80GHz, Linux amd64, Go 1.26.1, bench_linux_6c12t.txt)
| Data Size | FNV Sum64s | maphash Sum64Map | maphash Sum64sMap | maphash/FNV ratio |
|---|---|---|---|---|
| 8 B | 1,344 MB/s | 1,751 MB/s | 1,272 MB/s | ×1.3 (small gap at 8B) |
| 32 B | 1,712 MB/s | 6,882 MB/s | 5,545 MB/s | ×3.2–4.0 |
| 128 B | 1,257 MB/s | 15,381 MB/s | 14,200 MB/s | ×11.3–12.2 |
| 256 B | 1,194 MB/s | 14,544 MB/s | 13,480 MB/s | ×11.3–12.2 |
| 1 KB | 1,151 MB/s | 13,016 MB/s | 12,781 MB/s | ×11.1–11.3 |
| 4 KB | 1,141 MB/s | 12,492 MB/s | 12,520 MB/s | ×10.9–11.0 |
All implementations: 0 B/op, 0 allocs/op.
| Implementation | ns/op | allocs | Notes |
|---|---|---|---|
| Sum64s (internal FNV) | 16.4 | 0 / 0 B | Reference key; recommended stdlib replacement |
| fnv.New64a() (stdlib) | 50.5 | 2 / 40 B | Interface boxing; ~3.1× slower, allocates |
Parallel throughput (32B key, 12 threads, -race=false)
| Implementation | ns/op | MB/s | allocs |
|---|---|---|---|
| FNV Sum64s | 2.04 | 15,707 | 0 |
| maphash Sum64sMap | 0.95 | 33,583 | 0 |
| Parallel speedup | — | ×2.1 | — |
| Scenario | Recommended | Reason |
|---|---|---|
| Cross-process, persisted, deterministic hash | Sum64 / Sum64s | FNV-1a spec compliant, stable seed |
| Short key (≤16 B), in-process | Sum64 / Sum64s | Small perf gap (≤1.3×), no seed restriction |
| Medium-to-long key (≥32 B), in-process routing | Sum64Map / Sum64sMap | 4–12× throughput, zero-cost AES-NI |
| smap, Bloom filter, consistent hash ring | Sum64Map / Sum64sMap | In-process; stable seed unnecessary |
When to Choose yakhash
yakutil/hash is designed as a zero-external-dependency lightweight internal tool, covering in-process hashing needs across the yakutil ecosystem. When your use case exceeds the following boundaries, consider the dedicated uniyak hash library yakhash (full xxHash XXH64/XXH3 implementation with AVX2/NEON assembly acceleration):
| Scenario | yakutil/hash | yakhash |
|---|---|---|
| ≥ 1 KB large-block throughput | FNV ~1.15 GB/s (byte-loop ceiling, no vectorization) | XXH3-64: 41.6 GB/s @ 1 KB, 56.2 GB/s @ 10 MB (36–49× faster) |
| Reproducible hash (fixed seed) | maphash is process-random; not reproducible | Sum3_64Seed / Sum64Seed |
| Cross-process / distributed consistent hash | maphash seed cannot be shared cross-process | Sum3_64Secret + GenSecret |
| HashDoS active defense (controlled key) | Not available | 192-byte secret space, Sum3_64Secret / Sum3_64Seed |
| 128-bit output (content addressing / dedup) | Not available | Sum3_128 / Sum3_128Seed / Sum3_128String |
| Streaming chunked processing (io.Reader) | Not available | New3() + Write + Sum64 / Sum128 |
| State snapshot / checkpoint-resume | Not available | MarshalBinary / UnmarshalBinary |
| C xxHash 0.8.x bit-exact compatibility | Not available | ✓ All functions bit-identical to the C original |
go get github.com/uniyakcom/yakhash
// Large data: AVX2 vectorized, 56 GB/s (10 MB block, same machine)
h := yakhash.Sum3_64(data)
// HashDoS defense (cross-process consistent key)
secret := yakhash.GenSecret(loadSeedFromConfig())
h, _ := yakhash.Sum3_64Secret(key, secret[:])
// Streaming chunked
d := yakhash.New3()
io.Copy(d, reader)
h := d.Sum64()
yakutil/hash maintains zero external dependencies; yakhash is the dedicated uniyak ecosystem hash library. The two have no transitive dependency relationship.
Multi-core concurrent counting with near-zero contention.
import "github.com/uniyakcom/yakutil/percpu"
c := percpu.New(runtime.GOMAXPROCS(0))
c.Add(1) // writes distributed across different cache line slots
total := c.Load() // aggregate all slots (approximate)
c.Reset() // zero all slots

Design: maps callers to 8–256 isolated 64 B slots using goroutine stack address × Fibonacci multiplicative hash constant (0x9e3779b97f4a7c15 = 2⁶⁴/φ), right-shifted 16 bits. The write path uses only an atomic Add, with no cross-core contention.
Auto slot count: max(8, ceilPow2(procs)), capped at 256; low-core environments retain 8 slots.
Why Fibonacci hash instead of >>13?
Goroutine pool stack spacing is fixed at 8 KB (1<<13 bytes); a plain >>13 maps them to consecutive slots (hotspot clustering). Fibonacci multiplication disperses the clustered addresses uniformly across all slots. On 32-bit systems the constant auto-truncates to 0x7f4a7c15 (still odd, preserving the bijection).
Selection Guide
| Scenario | Recommendation | Reason |
|---|---|---|
| Monitoring counters (bytes, messages) | percpu.Counter | High-frequency writes, approximate reads, 2× parallel advantage |
| Connection count, error count, rate limiting logic | atomic.Int64 | Requires an exact value for control flow |
| CAS / Swap / arbitrary initial value | atomic.Int64 | percpu does not support these |
| Load requires a strictly consistent snapshot | atomic.Int64 | percpu.Load is O(slots) aggregation and may observe intermediate state |
Performance (Intel Xeon E-2186G @ 3.8GHz, 12 cores, bench_linux_6c12t.txt)
| Operation | percpu | atomic.Int64 |
|---|---|---|
| Add serial | 5.6 ns | 5.8 ns |
| Add parallel (12C contention) | 2.2 ns ✓ | 17.1 ns ✗ (×7.8 slower) |
| Load | ~10 ns | ~3 ns ✗ |
| Reset | ~139 ns | — |
Serial: comparable; high-concurrency writes: percpu ~8× advantage; Load is ~4× slower than atomic.Load.
Use percpu exclusively for "write-heavy, read-light" monitoring counters.
Diagnostic interface Stats()
st := c.Stats()
// Stats{Slots:64, Min:..., Max:..., Mean:..., Skew:1.0}
// Skew = Max/Mean; ≤2.0 is uniform, >2.0 suggests hot slot
if st.Skew > 2.0 {
log.Printf("percpu hot slot detected: skew=%.1f", st.Skew)
}

| Field | Meaning |
|---|---|
| Slots | Total slot count |
| Min / Max | Minimum / maximum slot value |
| Mean | Average slot value |
| Skew | Max/Mean; 1.0 = perfectly uniform; >2.0 suggests a hot slot; >5.0 suggests increasing procs |
Gauge supports Add + Sub (value can be negative), suitable for tracking active connections, concurrent request counts, etc.
g := percpu.NewGauge(runtime.GOMAXPROCS(0))
g.Add(1) // new connection enters
g.Sub(1) // connection closes (equivalent to Add(-1))
g.Load() // aggregate all slots (approximate snapshot)
g.Reset() // zero all slots
// Diagnostics
st := g.Stats() // GaugeStats{Slots, Min, Max, Sum, Mean, Skew}

| Counter vs Gauge | Counter | Gauge |
|---|---|---|
| Monotonic | Yes (increment only) | No (increment and decrement) |
| Add/Sub | Add only | Add + Sub |
| Load | Approximate aggregate | Approximate aggregate |
| Typical use | Byte counts, message counts | Active connections, memory usage |
Performance identical to Counter (same Fibonacci hash + 64B slot layout).
Measured Performance (Intel Xeon E-2186G @ 3.80GHz, 12 cores, bench_linux_6c12t.txt)
| Operation | ns/op | allocs | Notes |
|---|---|---|---|
| Add (serial) | 5.64 | 0 | Fibonacci hash distributes to a random slot |
| Add (12 parallel) | ~2.9 | 0 | 12 threads, no slot contention |
| Load (cross-slot aggregate) | — | 0 | Single pass over all slots |
Comparison: atomic.Int64.Add at 12 parallel ≈ 17.0 ns/op (≥6× gap).
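The padded-slot idea above can be sketched in a few lines. This is illustrative only: slot selection is simplified to a caller-provided hint instead of the library's stack-address Fibonacci hash, and the type names are hypothetical. The essential properties survive: each slot occupies its own 64-byte cache line, writers touch one slot with a plain atomic add, and Load is an O(slots) sum.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// slot pads an 8-byte atomic counter out to a full 64 B cache line so
// that two slots never share a line (no false sharing between writers).
type slot struct {
	n atomic.Int64
	_ [56]byte
}

type counter struct{ slots []slot }

func newCounter(n int) *counter { return &counter{slots: make([]slot, n)} }

// Add bumps one slot; hint stands in for the real stack-address hash.
func (c *counter) Add(hint int, d int64) {
	c.slots[hint%len(c.slots)].n.Add(d)
}

// Load sums all slots: an approximate snapshot, exactly as described above.
func (c *counter) Load() int64 {
	var sum int64
	for i := range c.slots {
		sum += c.slots[i].n.Load()
	}
	return sum
}

func main() {
	c := newCounter(8)
	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func(hint int) { // each goroutine writes its own slot
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Add(hint, 1)
			}
		}(g)
	}
	wg.Wait()
	fmt.Println(c.Load()) // 4000
}
```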
Multi-producer single-consumer queue, ideal for WAL Group Commit pattern.
import "github.com/uniyakcom/yakutil/mpsc"
r := mpsc.New[Record](4096)
// Producer (multiple goroutines)
seq, ok := r.Enqueue(record)
if ok {
err := r.Wait(seq) // blocks until consumer Commits
}
// Producer with context timeout
seq, ok := r.Enqueue(record)
if ok {
err := r.WaitContext(ctx, seq) // returns ctx.Err() on timeout
}
// Consumer (single goroutine)
start, n := r.Drain(func(rec *Record) error {
return encodeToBuf(rec) // batch encode
})
flushErr := fsync(buf)
r.Commit(start, n, flushErr) // wake all producers

State machine: free → filling → ready → drained → free

- Enqueue: CAS tail, write slot
- Wait: channel-block until the done signal (Go runtime scheduling, no busy-spin), then release the slot
- WaitContext: like Wait but respects context.Context cancel/timeout; spawns a lightweight cleanup goroutine on cancel (30 s safety timeout prevents goroutine leaks)
- Drain: batch-harvest consecutive ready slots
- Commit: sets done=1 to wake producers; can carry a batch error
Single-producer single-consumer ultra-low-latency queue.
import "github.com/uniyakcom/yakutil/spsc"
r := spsc.New[Event](1024)
// Producer (single goroutine)
r.Push(event)
// Consumer (single goroutine)
if evt, ok := r.Pop(); ok {
handle(evt)
}

- No CAS; only atomic.Store / atomic.Load (equivalent to plain MOV on x86)
- cachedHead / cachedTail eliminate most cross-core cache line reads
- Typical throughput 2–5 ns/op
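A minimal SPSC ring in this spirit can be sketched as follows: no CAS, only atomic loads/stores on head and tail, with a power-of-two capacity so the index wrap is a single bitmask. This is a sketch, not the library's Ring[T]; in particular it omits the cachedHead/cachedTail optimization mentioned above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type spscRing[T any] struct {
	buf  []T
	mask uint64
	head atomic.Uint64 // next slot to read (consumer-owned)
	tail atomic.Uint64 // next slot to write (producer-owned)
}

func newRing[T any](capPow2 int) *spscRing[T] {
	return &spscRing[T]{buf: make([]T, capPow2), mask: uint64(capPow2 - 1)}
}

// Push must be called by the single producer only.
func (r *spscRing[T]) Push(v T) bool {
	t := r.tail.Load()
	if t-r.head.Load() == uint64(len(r.buf)) {
		return false // full
	}
	r.buf[t&r.mask] = v
	r.tail.Store(t + 1) // publish the slot only after writing it
	return true
}

// Pop must be called by the single consumer only.
func (r *spscRing[T]) Pop() (T, bool) {
	var zero T
	h := r.head.Load()
	if h == r.tail.Load() {
		return zero, false // empty
	}
	v := r.buf[h&r.mask]
	r.head.Store(h + 1)
	return v, true
}

func main() {
	r := newRing[int](8)
	r.Push(1)
	r.Push(2)
	for v, ok := r.Pop(); ok; v, ok = r.Pop() {
		fmt.Println(v)
	}
}
```

Because each index is written by exactly one side, publication order (write the slot, then store the counter) is all the synchronization required.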
import "github.com/uniyakcom/yakutil/backoff"
var bo backoff.Backoff
for !condition() {
bo.Spin()
}
bo.Reset()

| Phase | Iteration Range | Behavior |
|---|---|---|
| Phase 1 | N < 64 | Tight CPU spin (zero overhead) |
| Phase 2 | N < 128 | runtime.Gosched() |
| Phase 3 | N ≥ 128 | Exponential sleep (1 µs to 1 ms) |
Customizable:
bo := backoff.Backoff{
SpinN: 32,
YieldN: 16,
MaxWait: 500 * time.Microsecond,
}

The zero value is usable directly (default parameters). Value type, no allocation.
High-concurrency bump allocator, suitable for short-lived scenarios like WAL encode buffers.
import "github.com/uniyakcom/yakutil/arena"
a := arena.New(0) // default 64KB chunk
buf := a.Alloc(128) // 8-byte aligned, CAS concurrency-safe; n≤0 returns nil
// ... use buf ...
a.Reset() // switch to new chunk; old references still valid until GC

- Fast path (CAS + add) < 5 ns/op (vs Go heap ~25 ns)
- n > chunkSize falls back to make()
- Chunk exhaustion auto-switches; old chunk references remain valid
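The bump-allocation fast path can be sketched with a single atomic offset. This is an assumption-laden miniature (type names are hypothetical; chunk switching and the make() fallback are omitted), but it shows why concurrent Alloc is safe: the atomic add hands each caller a disjoint region.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type bumpArena struct {
	chunk []byte
	off   atomic.Int64 // next free offset in chunk
}

// Alloc bumps the offset by the 8-byte-aligned size and returns the
// n-byte region. The full-capacity slice expression pins cap so append
// cannot spill into a neighbor's region.
func (a *bumpArena) Alloc(n int) []byte {
	if n <= 0 {
		return nil
	}
	pad := int64((n + 7) &^ 7) // round the bump up to 8-byte alignment
	end := a.off.Add(pad)
	if end > int64(len(a.chunk)) {
		return nil // chunk exhausted (the real arena switches chunks)
	}
	start := end - pad
	return a.chunk[start : start+int64(n) : end]
}

func main() {
	a := &bumpArena{chunk: make([]byte, 1<<17)}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				buf := a.Alloc(100) // each call gets a disjoint region
				buf[0] = 1          // race-free: no two callers overlap
			}
		}()
	}
	wg.Wait()
	fmt.Println(a.off.Load()) // 8 × 100 × 104 = 83200
}
```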
20 tiers of sync.Pool (64B to 32MB), automatically tiered by size. Requests >32MB are directly allocated, bypassing the pool.
Two APIs are available:
| API | Steady-state allocs | Best for |
|---|---|---|
| Get / Put | 0 alloc, 0 B | Single-goroutine short-lived buffers |
| GetPtr / PutPtr | 0 alloc, 0 B | Cross-goroutine async pipelines; the pointer travels with the buffer |
import "github.com/uniyakcom/yakutil/bufpool"
// Global functions
buf := bufpool.Get(4096)
defer bufpool.Put(buf)
// Independent instance
var p bufpool.Pool
buf := p.Get(1024)
p.Put(buf)

- Get(size): returns a slice with len=size, cap=2^n
- Put(b): returns the slice to the appropriate tier by cap
- Slices with cap < 64 B or cap > 32 MB are automatically discarded
- Slices with non-power-of-2 cap are automatically discarded (prevents b[:size] out-of-bounds panic)
- Internally uses unsafe.Pointer to bypass interface boxing — steady-state 0 B / 0 allocs
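The tier math implied by "20 tiers, 64 B to 32 MB" can be sketched with bits.Len. This is an assumed layout (tier 0 = 64 B = 2^6 up through tier 19 = 32 MB = 2^25); the library's internal indexing may differ, but any power-of-two size-class scheme reduces to this shape:

```go
package main

import (
	"fmt"
	"math/bits"
)

// tier returns the assumed size-class index and its capacity: sizes are
// rounded up to the next power of two, and the index is derived from the
// bit length (64 = 2^6 maps to tier 0, 32 MB = 2^25 to tier 19).
func tier(size int) (idx, capacity int) {
	if size < 64 {
		size = 64
	}
	c := 1 << uint(bits.Len(uint(size-1))) // next power of two >= size
	return bits.Len(uint(c)) - 7, c
}

func main() {
	fmt.Println(tier(64))   // 0 64
	fmt.Println(tier(4096)) // 6 4096
	fmt.Println(tier(5000)) // 7 8192  (rounded up to the next class)
}
```

Rounding to a power of two is also what makes the Put-side check cheap: a returned slice whose cap is not a power of two cannot belong to any tier and is discarded.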
Stores *[]byte (pointer type) rather than []byte (3-word value) in sync.Pool, eliminating the interface-boxing allocation entirely. The wrapper pointer and its backing buffer travel together through the full lifecycle — GetPtr → Event.bufPtr → Post steal → logEntry.bufPtr → drainRing → PutPtr → ptrTiers → GetPtr → … — with zero allocations in steady state.
// Acquire: wrapper pointer + buf together (0 alloc in steady state)
bp := bufpool.GetPtr(4096)
buf := *bp // the actual []byte, len=4096, cap=2^n
buf = buf[:0] // reset for writing
// ... write into buf ...
// Release: pointer type stored in sync.Pool, 0-alloc boxing
bufpool.PutPtr(bp)
// ⚠️ Do NOT access bp or *bp after PutPtr — use-after-free

Independent instance:
var p bufpool.Pool
bp := p.GetPtr(512)
*bp = append((*bp)[:0], data...)
p.PutPtr(bp)

Producer / consumer across goroutines (async log pipeline pattern):
type task struct {
buf []byte
bufPtr *[]byte // travels with buf; nil if from Get (boxing path)
}
// Producer goroutine
bp := bufpool.GetPtr(512)
buf := *bp
buf = encode(buf[:0])
queue <- task{buf: buf, bufPtr: bp}
// Consumer goroutine (different P)
t := <-queue
write(t.buf)
if t.bufPtr != nil {
bufpool.PutPtr(t.bufPtr) // 0-alloc boxing even across Ps
} else {
bufpool.Put(t.buf)
}Safety rules:
GetPtrandPutPtruse a separate internal pool (ptrTiers) fromGet/Put(tiers). Never mix them —Put(buf)for a buf obtained viaGetPtr, orPutPtr(bp)with a pointer not fromGetPtr, will silently discard the buffer or corrupt the pool.- After
PutPtr(bp), ownership transfers to the pool. Do not read or writebpor*bp. bpmust point to an independently heap-allocated object (e.g.Event.bufPtr), not a field embedded inside a ring-slot struct. Embedding causes the wrapper to be aliased by future slot reuse, leading to use-after-free.- Buffers with non-standard
cap(e.g. afterappendgrew the slice) are silently discarded byPutPtr; the nextGetPtrwill allocate fresh. Trackcapstability if reuse is critical.
Benchmark comparison (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1):
BenchmarkPool_Get128-12 ~14 ns/op 0 B/op 0 allocs/op
BenchmarkPool_Get128_Parallel-12 ~2 ns/op 0 B/op 0 allocs/op
BenchmarkPool_GetPtr512-12 ~15 ns/op 0 B/op 0 allocs/op
BenchmarkPool_GetPtr512_Parallel-12 ~2 ns/op 0 B/op 0 allocs/op
Generic atomic snapshot for read-heavy, write-light scenarios.
import "github.com/uniyakcom/yakutil/cow"
v := cow.New[Config](defaultConfig)
// Read (any goroutine, ~1ns)
cfg := v.Load()
// Write (single writer)
v.Store(newConfig)
// Read-modify-write
v.Update(func(old Config) Config {
old.Timeout = 5 * time.Second
return old
})

- Read path: a single atomic.Pointer.Load(), truly lock-free
- Write path: construct a new value, then atomic.Store
- Update: suitable for a single writer; use an external lock for multiple writers
- Swap: atomically replaces and returns the old value, concurrency-safe
- UpdateCAS: CAS-loop lock-free read-modify-write, safe for multiple writers; fn must be pure and cheap — each CAS retry re-invokes fn; use UpdateCASPtr for allocation-sensitive hot paths
// Multiple-writer-safe read-modify-write
v.UpdateCAS(func(old Config) Config {
old.Count++
return old
})
// Atomic replace and get old value
old := v.Swap(newConfig)
// Zero-alloc read-modify-write (with sync.Pool for old pointer recycling)
var pool sync.Pool
oldPtr := v.UpdateCASPtr(func(old *Config) *Config {
np, _ := pool.Get().(*Config)
if np == nil {
np = new(Config)
}
if old != nil {
*np = *old
}
np.Timeout = 5 * time.Second
return np
})
if oldPtr != nil {
pool.Put(oldPtr)
}

UpdateCASPtr is the zero-alloc variant of UpdateCAS: fn receives the current *T (nil for the zero value) and returns a new *T; ownership of the old *T is returned to the caller for sync.Pool recycling. Each successful write costs 0 allocs, vs 1 alloc / 8 B for UpdateCAS.
| Method | ns/op | allocs | Notes |
|---|---|---|---|
| UpdateCAS | ~19.7 | 1 / 8 B | Allocates a new object per write |
| UpdateCASPtr + pool | ~17.7 | 0 / 0 B | Pool supplies the object; old pointer returned |
| UpdateCAS_Parallel (12t) | ~159 | 6 / 50 B | Multiple allocs under CAS contention |
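The UpdateCAS retry loop can be sketched directly with atomic.Pointer. This is illustrative, not the library's Value[T] source; it shows why fn must be pure: a lost CAS race re-invokes it against the fresher snapshot.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type cowValue[T any] struct{ p atomic.Pointer[T] }

func (v *cowValue[T]) Load() T { return *v.p.Load() }

// UpdateCAS: load the current pointer, build a new value from a copy,
// and CompareAndSwap it in; retry if another writer won the race.
func (v *cowValue[T]) UpdateCAS(fn func(old T) T) {
	for {
		oldP := v.p.Load()
		newV := fn(*oldP) // fresh copy; may run more than once
		if v.p.CompareAndSwap(oldP, &newV) {
			return
		}
	}
}

func main() {
	var v cowValue[int]
	zero := 0
	v.p.Store(&zero)
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // 8 concurrent writers, all safe
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				v.UpdateCAS(func(old int) int { return old + 1 })
			}
		}()
	}
	wg.Wait()
	fmt.Println(v.Load()) // 8000
}
```

The `&newV` escape is exactly the 1 alloc / 8 B per write shown in the table above; recycling that allocation through a pool of old pointers is what UpdateCASPtr adds.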
High-performance concurrent Map, N shards + RWMutex to isolate contention. String key routing uses maphash AES acceleration (for in-process routing; seed changes on process restart).
import "github.com/uniyakcom/yakutil/smap"
// string key (maphash AES-accelerated shard hash)
m := smap.New[int](64) // 64 shards
m.Set("foo", 42)
v, ok := m.Get("foo")
m.Range(func(k string, v int) bool { return true })
// Atomic get-or-create (double-checked locking; fn executes exactly-once under write lock)
v, created := m.GetOrSet("session:42", func() int {
return expensiveInit()
})
// created=true means this call triggered creation
// uint64 key (Fibonacci hash)
m64 := smap.New64[string](32)
m64.Set(12345, "hello")
val, created := m64.GetOrSet(12345, func() string { return "world" })

- Read path: RLock on a single shard (~22 ns), no global lock
- Cache-line isolation between shards
- GetOrSet: fast RLock check first; if present, returns the value with created=false; if absent, upgrades to Lock, double-checks, then calls fn — guarantees fn executes only once under concurrency
Measured Performance (bench_linux_6c12t.txt, Go 1.26.1)
| Operation | ns/op |
|---|---|
| Get | ~22 ns |
| Set | ~35 ns |
| Parallel Get (12t) | ~45 ns |
| ReadHeavy NoAlloc (B2S) | ~9 ns |
Note: For lowest latency, set shard count to 2–4× CPU count.
SIMD-Within-A-Register: process 8 bytes simultaneously in a single integer operation.
import "github.com/uniyakcom/yakutil/swar"
idx := swar.FindByte(data, '\n') // find newline
idx := swar.FindQuote(data) // find '"'
idx := swar.FindEscape(data) // find <0x20 / '"' / '\\'

- Typical 4–8× speedup vs a single-byte loop
- Suitable for JSON parsing, HTTP header scanning
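The core SWAR trick can be shown in a few lines: XOR a 64-bit word against the target byte repeated in every lane, then apply the classic "has zero byte" bit test so all 8 lanes are checked with a couple of integer ops. This sketch is not the library's FindByte (tail handling and unaligned loads are simplified), but it is the same technique:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// findByte scans 8 bytes at a time: after w ^= pattern, a matching byte
// becomes zero, and (w - 0x01..01) &^ w & 0x80..80 sets the high bit of
// each zero lane. TrailingZeros64/8 yields the first matching index.
func findByte(data []byte, c byte) int {
	pattern := uint64(c) * 0x0101010101010101 // c broadcast to all 8 lanes
	i := 0
	for ; i+8 <= len(data); i += 8 {
		w := binary.LittleEndian.Uint64(data[i:]) ^ pattern
		if m := (w - 0x0101010101010101) &^ w & 0x8080808080808080; m != 0 {
			return i + bits.TrailingZeros64(m)/8
		}
	}
	for ; i < len(data); i++ { // scalar tail for the last <8 bytes
		if data[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(findByte([]byte("GET /index.html HTTP/1.1\r\n"), '\n')) // 25
}
```

The least-significant set flag always corresponds to the first match, so the little-endian TrailingZeros lookup is exact even when borrow propagation disturbs higher lanes.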
Based on 256-byte lookup table, zero allocation.
import "github.com/uniyakcom/yakutil/fold"
fold.Equal([]byte("Content-Type"), "content-type") // true
fold.EqualStr("HOST", "Host") // true
// In-place ASCII lowercase / uppercase (zero-alloc)
fold.ToLowerBytes(header) // e.g. header = []byte("Content-Type") → "content-type"
fold.ToUpperBytes(key) // e.g. key = []byte("accept") → "ACCEPT"
// Allocating variants (1 alloc)
lower := fold.ToLowerString("Content-Type") // "content-type"
upper := fold.ToUpperString("accept") // "ACCEPT"

- Approximately 1.84× faster than strings.EqualFold (~6.6 vs ~12.1 ns, bench_linux_6c12t.txt)
- ASCII only (A–Z and a–z); use the stdlib for Unicode
- ToLowerBytes / ToUpperBytes: in-place ASCII case conversion, zero-alloc (~26 ns for a 46-byte header on Xeon E-2186G)
- ToLowerString / ToUpperString: allocating variants, 1 alloc
2^N auto-growing ring buffer for network I/O.
import "github.com/uniyakcom/yakutil/ring"
buf := ring.New(4096)
buf.Write(data)
buf.WriteByte(0x0A) // implements io.ByteWriter, writes single byte (zero-alloc)
p := buf.Peek(10) // inspect first 10 bytes (non-consuming)
buf.Discard(10) // discard
c, err := buf.ReadByte() // implements io.ByteReader, read byte-by-byte (zero-alloc)
buf.UnreadByte() // implements io.ByteScanner, replay last ReadByte (zero-alloc)
buf.WriteTo(conn) // zero-copy output
buf.ReadFrom(conn) // read from Reader

Byte-by-byte I/O (WriteByte / ReadByte / UnreadByte trio):

- No []byte allocation needed; suitable for frame header parsing (magic, version, length fields)
- ReadByte returns (byte, error); an empty buffer returns io.EOF
- UnreadByte rolls back the last ReadByte and implements io.ByteScanner for peek-and-rollback protocol parsing; two consecutive UnreadByte calls return io.ErrNoProgress
- Can be mixed with Read / Write
Peek semantics:

- No wraparound: returns a slice pointing to the internal buffer (zero-copy); the caller must not modify it
- Crossing the boundary: allocates a new slice and copies (safe to modify)

Implementation details:

- Bitmask modulo, zero-alloc wrap
- Auto 2× expansion with data linearization
- Struct is exactly 64 B = 1 cache line
Measured Throughput (bench_linux_6c12t.txt)
| Operation | ns/op | Throughput |
|---|---|---|
| Write 64 B | 7.6 ns | 8,404 MB/s |
| WriteRead 1 KB | 31.0 ns | 33,064 MB/s |
| PeekDiscard | 9.5 ns | — |
| WriteByte | 2.75 ns | 0 allocs |
| ReadByte | 1.81 ns | 0 allocs |
| UnreadByte (Write+Read+Unread+Read round-trip) | 5.6 ns | 0 allocs |
Zero-copy I/O (ReadableSegments / CommitRead / WritableSegments / CommitWrite)
High-performance protocol parsers (e.g., RESP) can use the zero-copy interface to directly access internal buffers, eliminating intermediate Read(tmp) copies:
// Zero-copy read side
// Returns one or two segments pointing to internal buffer; two segments when wrapping
s1, s2 := buf.ReadableSegments(n) // caller must not write to these
parse(s1) // zero-copy parse
if s2 != nil { parse(s2) } // handle second segment when wrapping
buf.CommitRead(n) // advance read pointer (returns error on out-of-range)
// Zero-copy write side
s1, s2 := buf.WritableSegments(need) // get writable memory segments
copy(s1, data) // write directly into internal buffer
buf.CommitWrite(n) // confirm n bytes written (returns error on out-of-range)

Returned s1/s2 point directly into the internal buffer; no copy is needed. After consuming k bytes, you must call CommitRead(k) to advance the read pointer.
32-bit overflow auto-protection
r, w, mask fields use uint (platform-native unsigned integer):
| Platform | Field width | Overflow threshold | Conclusion |
|---|---|---|---|
| 64-bit | uint64 | ~18 EB | Practically unreachable |
| 32-bit | uint32 | 4 GB | Go slice limit 2 GB; also unreachable |
Unsigned overflow wraps naturally; Len() = w - r always correct; no need to manually call Reset() in continuous non-empty stream scenarios.
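The wraparound claim is easy to verify with deliberately tiny counters. The sketch below uses uint8 so the overflow is visible; the same modular arithmetic holds for the buffer's platform-width uint fields:

```go
package main

import "fmt"

func main() {
	// r and w model the ring's read/write counters near overflow.
	var r, w uint8 = 250, 250

	w += 10 // write 10 bytes: 250+10 wraps to 4 (mod 256)

	// Unsigned subtraction wraps the same way, so the buffered length
	// w - r is still correct even though w < r numerically.
	fmt.Println(w, w-r) // 4 10
}
```

The only precondition, as stated above, is that the buffer never holds more data than the counter's range, which the slice-size limits already guarantee.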
High-performance timer with configurable resolution.
import "github.com/uniyakcom/yakutil/wheel"
w := wheel.New[ConnID](10*time.Millisecond, 1024)
id := w.Add(5*time.Second, connID) // O(1) add
w.Cancel(id) // O(1) cancel
// Manual advance
w.Advance(func(id ConnID) { close(id) })
// Or automatic tick
w.Run(ctx, func(id ConnID) { close(id) })

- Add/Cancel O(1), Advance O(expired)
- Supports layered rounds (long delays exceeding the slot count)
- sync.Pool reuses entry nodes
Pool, Stack, and Adaptive all implement Submitter; callers need not care about the underlying implementation:
type Submitter interface {
Submit(task func()) bool // blocks until task accepted or pool stopped
TrySubmit(task func()) bool // non-blocking; returns false if cannot accept immediately
Running() int // current active worker count
Stop() // graceful shutdown, waits for all workers to finish
PanicCount() int64 // lifetime task panic count (increments regardless of handler)
}

Pool and Stack additionally implement TimedSubmitter for timeout backpressure control:
type TimedSubmitter interface {
Submitter
SubmitTimeout(task func(), timeout time.Duration) bool
// Waits at most timeout: Pool waits for queue slot, Stack waits for idle worker.
// Returns false on timeout or pool stopped.
}

Use case:
// HTTP handler: wait at most 50ms for worker, return 503 on timeout
ok := pool.SubmitTimeout(func() { handle(req) }, 50*time.Millisecond)
if !ok {
http.Error(w, "503 overloaded", http.StatusServiceUnavailable)
return
}

import "github.com/uniyakcom/yakutil/wpool"
p := wpool.NewPool(8, 5*time.Second) // 8 workers, 5s idle timeout
p.Submit(func() { handle(req) })
p.TrySubmit(func() { handle(req) }) // non-blocking
p.Resize(16) // dynamic scaling
p.Stop() // graceful shutdown

- Queue watermark >75% / idle >50% are auto-scaled by Adaptive.adjust() (Pool itself does not auto-scale)
- Adjust worker count via Resize(n); excess workers auto-exit on idle after downsizing
- Submit / TrySubmit provide blocking / non-blocking modes
- safeRun delegates to panicSafeRun (wpool/safe.go), shared by Pool and Stack
- A single task panic does not affect worker survival; each panic is counted, then the handler is called
- PanicCount() int64: cumulative panic count (increments regardless of whether a handler is set)
PanicCount usage example
p := wpool.NewPool(8, 5*time.Second, wpool.WithPanicHandler(func(r any, stack []byte) {
slog.Error("worker panic", "err", r, "stack", string(stack))
}))
// Periodically report to monitoring (Prometheus, OTEL, etc.)
go func() {
for range time.Tick(30 * time.Second) {
metrics.Set("wpool_panics_total", float64(p.PanicCount()))
}
}()
PanicCount and WithPanicHandler are independent: the count increments even without a handler; when a handler is present, the panic is counted first and the handler is then called.
s := wpool.NewStack(8, 10*time.Second) // 8 workers, 10s idle timeout
s.Submit(func() { handle(req) })
s.TrySubmit(func() { handle(req) })
s.PanicCount()
s.Stop()

FILO vs FIFO Selection:
| Scenario | Recommended |
|---|---|
| Network IO reactor dispatch, protocol parsing (short-lived, high-concurrency) | Stack (FILO) |
| General background tasks, variable task duration | Pool (FIFO) |
Stack design details:
- Per-worker chan (capacity=1) avoids Submit/workerFunc mutual blocking
- sync.Pool reuses stackWorkerChan, reducing GC allocation
- sync.Cond blocks with zero CPU when maxWorkers is reached
- A background cleaner periodically reclaims idle-timeout workers (default 10 s)
- Stop() guarantee: waits for all worker goroutines to fully exit (including currently executing tasks); Running() == 0 after return, no goroutine leak
a := wpool.NewAdaptive(4, 64, 500*time.Millisecond, 5*time.Second)
// params: min=4, max=64, sample period=500ms, idle timeout=5s
a.Submit(func() { handle(req) })
a.Stop()

- Queue watermark >75%: scale up (+25% workers, up to max)
- Idle workers >50%: scale down (−25% workers, down to min)
- PanicCount() int64 delegates to the internal Pool, sharing the same counter
func NewDispatcher(pool wpool.Submitter) *Dispatcher {
return &Dispatcher{pool: pool}
}
// Reactor IO: use FILO Stack
d := NewDispatcher(wpool.NewStack(runtime.NumCPU(), 0))
// General background tasks: use FIFO Pool
d := NewDispatcher(wpool.NewPool(8, 0))
// With explicit backpressure deadline: use TimedSubmitter
var ts wpool.TimedSubmitter = wpool.NewStack(8, 0)
ts.SubmitTimeout(task, 50*time.Millisecond)

Measured Performance (bench_linux_6c12t.txt)
| Implementation | Submit (serial) | TrySubmit | Submit (parallel) |
|---|---|---|---|
| Pool FIFO | 440 ns | 12 ns | 384 ns |
| Stack FILO | 425 ns | 40 ns | 420 ns |
| go spawn | 240 ns | — | — |

TrySubmit is fastest (12 ns), suitable for non-blocking dispatch. Stack.Submit (FILO) is CPU-cache affine, suitable for low-latency network IO dispatch.
SubmitTimeout Measured Performance (bench_linux_6c12t.txt)
| Implementation | ns/op (serial) | allocs | Notes |
|---|---|---|---|
| Pool.SubmitTimeout | 603 ns | 0 (0 B) | Timer pool optimization, zero-alloc |
| Stack.SubmitTimeout | 473 ns | 53 B / 0 alloc | FILO worker reuse, timer token pooled |
| Pool.SubmitTimeout (parallel) | 519 ns | 0 | — |
| Stack.SubmitTimeout (parallel) | 642 ns | 72 B / 1 alloc | — |

Stack.SubmitTimeout is ~23% faster than Pool.SubmitTimeout in the serial case.
Small key array direct lookup + large key sync.Map fallback.
import "github.com/uniyakcom/yakutil/itable"
tb := itable.New[Conn](0) // default 65536 fast path
tb.Set(fd, &conn) // O(1) atomic store
conn, ok := tb.Get(fd) // O(1) atomic load
tb.Del(fd)

- key < 65536: atomic.Pointer array, zero-lock, no contention
- key ≥ 65536: sync.Map fallback
- Suitable for dense integer scenarios like fds and connection IDs
Multiple shards reduce lock contention, maphash AES-accelerated shard routing, optional lazy TTL.
import "github.com/uniyakcom/yakutil/lru"
// Basic: 16 shards, 10000 entries per shard
c := lru.New[[]byte](16, 10000)
c.Set("key", data)
val, ok := c.Get("key")
c.Del("key")
// Eviction callback (triggered on LRU capacity eviction, not TTL expiry)
c = lru.New[[]byte](16, 10000, lru.WithEvict[[]byte](func(k string, v []byte) {
log.Printf("evicted %s", k)
}))
// Lazy TTL: checked on Get, no background goroutine
c = lru.New[[]byte](16, 10000, lru.WithTTL[[]byte](12*time.Hour))
// Combined: TTL + eviction callback
c = lru.New[[]byte](16, 10000,
lru.WithTTL[[]byte](5*time.Minute),
lru.WithEvict[[]byte](func(k string, v []byte) { ... }),
)
// Iterate all live entries (newest→oldest, skip TTL-expired)
c.Range(func(k string, v []byte) bool {
fmt.Println(k)
return true // return false to terminate early
})
// Clear all entries (does not trigger evict callback)
c.Purge()

- Get ~21 ns, Set ~36 ns (bench_linux_6c12t.txt, Go 1.26.1)
- Parallel Get (12t): ~46 ns; Mixed 90R10W (serial): ~100 ns
- Per-shard independent Mutex; Clock-LRU: Get skips moveToFront (sets the accessed bit only)
- Auto-evicts via the Clock algorithm — recently accessed entries get a second chance
- Range: snapshot iteration in insertion order, callback invoked outside the lock, skips TTL-expired entries
- RangeLimit(batchSize, fn): batched iteration for large caches; each batch re-locks
- Purge: O(shards) clear, does not trigger the evict callback
- WithTTL: TTL resets when the same key is overwritten; Len() includes lazily uncleaned expired entries
Go 1.26 measured (bench_linux_6c12t.txt, count=5): serial Mixed 90R10W ~100 ns; parallel Get (12t) ~46 ns. Shard isolation continues to provide a significant parallel-access advantage.
Pure-Go minimal implementation of the LZ4 block format.
No frame headers; framing is the caller's responsibility. Suitable for WAL, snapshot, and record-oriented storage.
import "github.com/uniyakcom/yakutil/lz4"
// Allocate destination (caller-provided or nil for auto-alloc)
dst := make([]byte, lz4.CompressBound(len(src)))
compressed, err := lz4.Compress(dst, src)
// err is non-nil only for ErrInputTooLarge (src > ~2 GB)
original := make([]byte, len(src))
out, err := lz4.Decompress(original, compressed)

High-compression mode (slower, same format, decompressible by Decompress):

compressed, err := lz4.CompressHC(nil, src) // auto-alloc

Errors (comparable via errors.Is):
| Constant | Meaning |
|---|---|
| ErrCorrupt | Input block data is corrupt or malformed |
| ErrOutputFull | Destination buffer capacity is insufficient |
| ErrInputTooLarge | Input exceeds the LZ4 block format limit (~2 GB) |
Performance (Intel Xeon E-2186G @ 3.80GHz, repetitive text input):
| Operation | Throughput |
|---|---|
| `Compress` | ~500 MB/s |
| `Decompress` | ~2 GB/s |
| `CompressHC` | ~60 MB/s |
Decompression fast path: when offset >= matchLen (no overlap), copy() is used and can leverage SIMD/memcpy.
Self-overlapping (RLE) matches use byte-by-byte copy to preserve correct semantics.
CAS lock-free token bucket, no background goroutines, zero allocation, suitable for per-IP / global rate limiting.
import "github.com/uniyakcom/yakutil/ratelimit"
// Global: 1000/sec, burst cap 200
rl := ratelimit.New(1000, 200)
// Non-blocking check (hot path, ~15-25 ns, 0 allocs)
if !rl.Allow() {
conn.Close()
return
}
// Batch consume (writing multiple messages at once)
if !rl.AllowN(5) {
return ErrRateLimited
}
// Runtime diagnostics
rl.Tokens() // current available token count snapshot
rl.Rate() // configured rate (tokens/sec)
rl.Burst() // bucket capacity
rl.Reset() // reset to full bucket (connection reset, test init)
Performance (Intel Xeon E-2186G @ 3.80GHz, amd64, Go 1.26.1):
| Scenario | Latency | Allocation |
|---|---|---|
| Allow (single core, tokens available) | ~12 ns | 0 allocs |
| Allow (12-thread parallel, CAS contention) | ~147 ns | 0 allocs |
| Allow (tokens exhausted, fast reject) | ~10 ns | 0 allocs |
Algorithm highlights:
- Lazy refill: computes the elapsed-time delta per call, avoiding a background timer goroutine
- Refill calculation uses integer nanoseconds (1e9 / rate), no floating-point error
- CAS loop: read snapshot → compute new state → CAS update; retries on contention (typically 1-2 rounds)
- n > burst always returns false (can short-circuit early, no waiting)
Default is already the high-performance clock
Limiter uses coarsetime.NowNano (500µs precision, ~1 ns/op) as its default clock —
no configuration needed. coarsetime starts automatically via package init().
// Default: already uses coarsetime, no extra setup required
rl := ratelimit.New(10_000, 500)
// For precise rate shaping, explicitly inject wall clock:
rl := ratelimit.New(10_000, 500,
ratelimit.WithClock(func() int64 { return time.Now().UnixNano() }),
)
When to use a precise clock: only when 500µs jitter matters (e.g., tight burst shaping). For all other uses (per-IP, global rate limiting), the default coarsetime is optimal.
IPMap combines smap.Map (maphash AES-NI sharded hash) with *Limiter, no background goroutines, zero-alloc hot path.
ipmap := ratelimit.NewIPMap(100, 20, 0) // rate=100/s, burst=20, default 64 shards
// Each new connection (OnAccept)
if !ipmap.Allow(conn.RemoteIP()) {
conn.Close()
return
}
// Batch operation
if !ipmap.AllowN(ip, 5) { return ErrRateLimited }
// Get underlying Limiter (diagnostics)
lim := ipmap.Get(ip)
log.Printf("ip=%s tokens=%d", ip, lim.Tokens())
// Periodically evict inactive IPs (prevent memory leak)
go func() {
for range time.Tick(5 * time.Minute) {
n := ipmap.Evict(10 * time.Minute)
log.Printf("evicted %d stale IP limiters", n)
}
}()
// Clear all (config change / test init)
ipmap.Purge()
// Strict mode: eliminates TOCTOU race on first Limiter creation (exactly-once semantics)
ipmap := ratelimit.NewIPMap(100, 20, 0, ratelimit.WithStrictNewIP())
// Access hotspot analysis: top N IPs by hit count
top := ipmap.TopN(5)
for _, e := range top {
log.Printf("ip=%s hits=%d recent=%d", e.IP, e.Hits, e.RecentHits)
}
// Recent hotspot analysis: sorted by ~1min sliding window activity
recent := ipmap.TopNRecent(5)
for _, e := range recent {
log.Printf("ip=%s recent_hits=%d", e.IP, e.RecentHits)
}
| Method | Description |
|---|---|
| `Allow(ip)` | Consume 1 token; false = rate limited |
| `AllowN(ip, n)` | Batch consume n tokens |
| `Get(ip)` | Get (or create) the underlying `*Limiter` |
| `Evict(ttl)` | Evict IPs inactive within ttl; returns eviction count |
| `Len()` | Current tracked IP count |
| `Purge()` | Clear all entries |
| `TopN(n)` | Top-N by cumulative Hits (descending) |
| `TopNRecent(n)` | Top-N by recent (~1min window) RecentHits (descending) |
Measured Performance (bench_linux_6c12t.txt, Go 1.26.1, rate=1M/s, burst=100K)
| Operation | ns/op | allocs | Notes |
|---|---|---|---|
| Allow (serial, hot IP) | 49 | 0 | coarsetime + maphash shard + token-bucket CAS |
| Allow (12 parallel) | 37 | 0 | Shard lock scatter, coarsetime replaces time.Now |
| Allow (with RecentHits) | 57 | 0 | addRecent adds the same overhead as Allow |
| TopN(5, 100 IPs) | 6.3 µs | — | Min-heap top-N, occasional call |
| TopNRecent(5, 100 IPs) | 6.3 µs | — | Min-heap top-N by recent window |
Channel-backed, Go runtime native scheduling, supports non-blocking / context timeout, suitable for MaxConns limiting.
import "github.com/uniyakcom/yakutil/semaphore"
sem, err := semaphore.New(opts.MaxConns) // max concurrent connections
if err != nil {
// opts.MaxConns < 1
return err
}
// New connection entry: non-blocking fast reject
if !sem.TryAcquire() {
conn.Close()
return ErrTooManyConns
}
defer func() { _ = sem.Release() }()
// Client-side dial control: context-blocked wait
if err := sem.AcquireContext(ctx); err != nil {
return err // context.DeadlineExceeded or Canceled
}
defer func() { _ = sem.Release() }()
// Runtime diagnostics (Prometheus gauge)
sem.Count() // current held permits (active connection count)
sem.Cap() // maximum capacity
sem.Available() // remaining available permits
| | Acquire | TryAcquire | AcquireContext | TryAcquireN(n) |
|---|---|---|---|---|
| Blocking | Yes (wait indefinitely) | No (return immediately) | Yes (until ctx cancelled) | No (return immediately) |
| Count | 1 | 1 | 1 | n (all-or-nothing) |
| Suitable for | Internal background tasks | New connection entry | Client dial | Batch slot pre-reservation |
Batch operations TryAcquireN / ReleaseN:
if !sem.TryAcquireN(n) {
return ErrNotEnoughSlots
}
defer sem.ReleaseN(n)
- TryAcquireN(n) bool: atomically acquire n permits, all-or-nothing; n > Cap() always returns false
- ReleaseN(n) error: batch release n permits; returns ErrReleaseNOver if n exceeds held count
Measured Performance (bench_linux_6c12t.txt)
| Operation | ns/op | allocs |
|---|---|---|
| Acquire/Release (serial) | 29.6 | 0 |
| TryAcquire_Success (serial) | 29.9 | 0 |
| TryAcquire_Full (fast fail) | 0.7 | 0 |
| TryAcquireN(1) | 16.6 | 0 |
| TryAcquireN(1) parallel | 152 | 0 |
Ordered dictionary supporting exact lookup, prefix scan, and ordered cursor iteration. vs map: keys are ordered, prefix queries in O(|prefix|+|result|), no rehash pauses.
import "github.com/uniyakcom/yakutil/art"
var tr art.Tree[int]
tr.Put("user:1001", 42)
tr.Put("user:1002", 99)
tr.Put("session:abc", 1)
v, ok := tr.Get("user:1001") // 42, true
old, ok := tr.Delete("session:abc") // 1, true
// Ordered full traversal (lexicographic)
tr.ForEach(func(key string, val int) bool {
fmt.Println(key, val)
return true // return false to terminate early
})
// Prefix scan: only traverse keys starting with "user:"
tr.ScanPrefix("user:", func(key string, val int) bool {
fmt.Println(key, val)
return true
})
// Cursor iteration: traverse all entries strictly greater than cursor
tr.Seek("user:1001", func(key string, val int) bool {
fmt.Println(key, val) // "user:1002", 99
return true
})
// Memory monitoring + compaction (reclaim arena dead bytes after bulk deletions)
fmt.Println(tr.PrefixArenaBytes())
tr.CompactPrefixArena() // call during off-peak
// Leaf slab monitoring + compaction (reclaim fragmented slab pages after bulk deletions)
fmt.Println(tr.SlabBytes())
tr.CompactSlab() // call during off-peak, O(N)
Characteristics
| Feature | Description |
|---|---|
| Key type | Binary-safe (string, may contain \x00) |
| Node adaptation | 4 / 16 / 48 / 256 children auto upgrade/downgrade |
| Path compression | Pessimistic prefix compression; prefix stored in arena |
| Ordering | ForEach / Seek / ScanPrefix all lexicographic |
| Memory | CompactPrefixArena() reclaims dead bytes after deletions; CompactSlab() reclaims fragmented leaf slab pages |
| Thread safety | Not thread-safe; wrap all ops with sync.Mutex (or sync.RWMutex if no concurrent writes) |
Time Complexity
| Operation | Complexity |
|---|---|
| `Get` / `Put` / `Delete` | O(k), k = key length |
| `ForEach` | O(N) |
| `ScanPrefix` | O(k + \|result\|) |
| `Seek` | O(k + \|result\|) |
| `CompactPrefixArena` | O(N × avg_prefix_len) |
| `CompactSlab` | O(N) |
Performance (Intel Xeon E-2186G @ 3.80GHz, amd64, go1.26.1):
| Operation | art.Tree | map (comparison) |
|---|---|---|
| Put (sequential key) | ~124 ns | ~129 ns |
| Put (random key) | ~394 ns | ~129 ns |
| Get (sequential key) | ~64 ns | ~22 ns |
| Get (random key) | ~82 ns | ~22 ns |
| ScanPrefix | ~900 ns | ~1.3 µs (full scan) |
ScanPrefix is ~1.4× faster than a map full scan when prefix hit rate is low; sequential-key art.Put is comparable to map (prefix sharing reduces internal allocations).
Random-key exact lookup is 3–4× slower than map — the cost of ordering and adaptive node traversal. Prefer the standard map unless prefix queries are frequent.
Used for CBO selectivity estimation. Splits sorted samples into equal-frequency buckets; estimates the row fraction in any range [lo, hi].
import "github.com/uniyakcom/yakutil/hist"
h, err := hist.Build(samples, 100) // 100 buckets
if err != nil { ... }
ratio := h.EstimateRange(1.0, 50.0) // fraction of rows in [lo, hi], in [0,1]
Performance (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1):
| Operation | ns/op | allocs | Notes |
|---|---|---|---|
| Build (100K samples, 100 buckets) | ~331 µs | 3 / 24 KB | Scans for NaN/Inf first; only copies if invalid values found |
Estimates the number of distinct elements (NDV) in a stream. Standard error ~0.81% (p=14, 16384 registers).
import "github.com/uniyakcom/yakutil/hll"
s := hll.New()
s.Add([]byte("hello"))
s.AddStr("world")
s.AddHash(precomputedHash) // reuse hash across HLL + CMS + smap without recomputing
n := s.Count() // estimated distinct element count
- Internally uses maphash (AES-NI, process-random seed) to prevent HashDoS attacks
- AddHash allows injecting a pre-computed hash, avoiding redundant hashing when the same key is fed to multiple data structures
- Not thread-safe; callers must synchronize
Performance (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1, 1000 elements):
| Method | ns/op | Notes |
|---|---|---|
| AddStr (maphash, 18B key) | ~7.3 µs | 1000 calls |
| AddHash (FNV-1a, ≤8B key) | ~11.9 µs | Caller pre-computes FNV; FNV is slower than maphash for ≥18B keys |
Merge(other) error merges another HLL sketch; returns ErrMergeNil if other is nil.
Estimates the approximate frequency of elements in a data stream. Used for Top-K frequency estimation and admission filter assist.
import "github.com/uniyakcom/yakutil/sketch"
cms, err := sketch.New(1000, 0.01) // estimate 1000 keys, 1% error rate
if err != nil { ... }
cms.Add("key")
freq := cms.Count("key") // approximate frequency
// Pre-hash API: reuse hash across HLL + CMS + smap
h := hash.Sum64s("key")
cms.AddHash(h, 1)
freq = cms.CountHash(h)
- Uses 4 independent hash functions; width w is rounded up to a power of 2, using bitmask instead of modulo
- sketch.NewSized(d, w) for direct control of depth and width
- AddHash(h, count) / CountHash(h): pre-hash API avoids redundant hashing when the same key is fed to multiple data structures
- Merge(other) error: merges another CMS; returns ErrMergeNil if other is nil
- Not thread-safe
Performance (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1):
| Operation | ns/op | Notes |
|---|---|---|
| Add (1000 calls) | ~15.2 µs | Bitmask replaces modulo |
Min-heap implementation, O(N log k), efficiently extracts the top-k elements from a large collection.
import "github.com/uniyakcom/yakutil/topn"
// Descending top-10
top := topn.TopN[Item](items, 10, func(a, b Item) bool {
return a.Score > b.Score
})
// By int64 key (used by IPMap.TopN internally)
top = topn.ByKey(items, 5, func(e Item) int64 { return e.Score })
Performance (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1):
| Operation | ns/op |
|---|---|
| TopN (64 elements, top 5) | ~311 ns |
| TopN (1024 elements, top 10) | ~2.65 µs |
| ByKey (64 elements, top 5) | ~578 ns |
Large collections benefit most: O(N×k) selection sort replaced with O(N log k) heap.
500µs precision, ~1 ns/op, single background goroutine per process started automatically on import.
import "github.com/uniyakcom/yakutil/coarsetime"
ns := coarsetime.NowNano() // ~1 ns/op vs time.Now() ~35-40 ns/op
t := coarsetime.Now() // time.Time (same precision)
// Graceful shutdown (test cleanup, process exit)
coarsetime.Stop() // idempotent, safe to call multiple times
- Background goroutine (time.NewTicker(500µs)) updates a global atomic.Int64
- NowNano() = single atomic read, no syscall
- Automatically used by ratelimit.Limiter (default clock) and ratelimit.IPMap
- No Start() call needed — the clock initializes in package init()
Suitable for hot paths where time.Now() syscall overhead (~35–40 ns) is unacceptable (e.g., token bucket refill, connection timestamps).
Provides the three cryptographic building blocks required by SCRAM-SHA-256 (RFC 5802 / RFC 7677), built on crypto/hmac + crypto/sha256 only.
import "github.com/uniyakcom/yakutil/scram"
// PBKDF2-HMAC-SHA-256 (RFC 2898 §5.2)
// iter ≥ 4096 for SCRAM-SHA-256; ≥ 60000 recommended for credential storage
saltedPassword := scram.PBKDF2SHA256([]byte(password), salt, 4096, 32)
// HMAC-SHA-256
clientKey := scram.HMACSHA256(saltedPassword, "Client Key")
serverKey := scram.HMACSHA256(saltedPassword, "Server Key")
// SHA-256
storedKey := scram.SHA256Sum(clientKey)
SCRAM-SHA-256 derivation chain (RFC 7677):
SaltedPassword = PBKDF2(password, salt, iter, 32)
ClientKey = HMAC(SaltedPassword, "Client Key")
StoredKey = SHA-256(ClientKey)
ServerKey = HMAC(SaltedPassword, "Server Key")
ClientSignature= HMAC(StoredKey, AuthMessage)
ServerSignature= HMAC(ServerKey, AuthMessage)
ClientProof = ClientKey XOR ClientSignature
| Function | Signature | Note |
|---|---|---|
| `PBKDF2SHA256` | `(password, salt []byte, iter, dkLen int) ([]byte, error)` | dkLen=0 → 32; returns error on invalid params |
| `HMACSHA256` | `(key []byte, data string) []byte` | string avoids []byte(s) conversion |
| `SHA256Sum` | `(b []byte) []byte` | Wraps sha256.Sum256, returns heap slice |
Performance characteristics
- PBKDF2SHA256: single hmac.New per call; U buffer reused via Sum(U[:0]) — eliminates per-iteration heap allocation
- HMACSHA256 / SHA256Sum: thin wrappers, no additional allocation beyond stdlib
| Decision | Choice | Reason |
|---|---|---|
| Root utilities | `B2S` / `S2B` | Short naming, Go community convention |
| Hash | `Sum64` / `Sum64s` | Following `hash/` stdlib naming |
| Ring | `mpsc.Ring` / `spsc.Ring` | Package-qualified semantics; type name needs no prefix |
| Counter | `percpu.Counter` | Direct semantics |
| Backoff | `backoff.Backoff` | Standard terminology |
| API methods | `Add` `Load` `Push` `Pop` `Spin` `Alloc` `Swap` | Short naming, verb-first |
| Dual-type Map | `Map[V]` / `Map64[V]` | Separated by key type, avoids interface{} overhead |
| Timing wheel | `Wheel[T]` | Generic payload, no type assertion needed |
# Single run (25 packages)
go test ./... -count=1
# With race detection
go test ./... -count=1 -race
# Single package verbose
go test ./ring/... -v -run TestUnreadByteAll
25 packages pass tests (including -race race detection).
Use the bench.sh script in the project root to run all benchmarks at once:
# Basic run (benchtime=3s, count=1)
./bench.sh
# Custom benchtime and count
./bench.sh 5s 3
# Output file: bench_${OS}_${CORES}c${THREADS}t.txt
# Example: bench_linux_6c12t.txt
# Use benchstat to compare two bench results
go install golang.org/x/perf/cmd/benchstat@latest
benchstat before.txt after.txt
Six packages have Fuzz tests: ring, hll, ratelimit, art, lru, sketch. Use fuzz.sh for local fuzzing:
# All targets, 1m each
./fuzz.sh
# Single target, custom duration
./fuzz.sh art/FuzzPutGet 2m
# Custom duration for all
./fuzz.sh 10m
CI nightly fuzzing runs automatically via .github/workflows/fuzz.yml.
MIT © 2026 uniyak.com