English | 中文
High-performance shared primitives for the uniyak ecosystem
Zero external dependencies · Generics · Lock-free · Zero-allocation optimized
go get -u github.com/uniyakcom/yakutil
Go 1.25+
| Package | Purpose | Core Types / Functions |
|---|---|---|
| yakutil | Root: utilities + constants + byte conversion + sentinel errors | B2S S2B IsPow2 Pow2Ceil Native ErrStr Pad NoCopy |
| hash | FNV-1a 64-bit hash + maphash AES acceleration | Sum64 Sum64s Sum64Map Sum64sMap |
| percpu | Per-CPU contention-free counter/gauge | Counter Gauge |
| mpsc | MPSC lock-free Ring (Group Commit) | Ring[T] |
| spsc | SPSC wait-free Ring | Ring[T] |
| backoff | Three-phase adaptive backoff | Backoff |
| arena | CAS bump allocator | Arena |
| bufpool | Tiered byte-slice pool | Pool / Get / Put / GetPtr / PutPtr |
| cow | COW atomic value (generic) | Value[T] Swap UpdateCAS |
| smap | Sharded concurrent Map | Map[V] Map64[V] |
| swar | SWAR byte-parallel scan | FindByte FindQuote FindEscape HasZero HasByte HasLess FirstByte |
| fold | Fast ASCII case-insensitive comparison | Lower Upper Equal EqualStr EqualBytes ToLowerBytes ToLowerString ToUpperBytes ToUpperString |
| ring | Auto-growing ring byte buffer | Buffer |
| wheel | Generic timing wheel (configurable resolution) | Wheel[T] |
| wpool | Worker pool + adaptive scaling | Submitter Pool Stack Adaptive |
| itable | Integer-key high-performance lookup table | Table[V] |
| lru | Sharded LRU cache | Cache[V] |
| lz4 | LZ4 block compression / decompression (pure Go, no frame headers) | Compress CompressHC Decompress CompressBound |
| ratelimit | Token-bucket rate limiter (CAS lock-free, zero-alloc) | Limiter |
| semaphore | Counting semaphore (bounded concurrency control) | Semaphore |
| art | Adaptive Radix Tree (ordered dictionary, prefix queries) | Tree[V] |
| hist | Equi-height histogram (CBO selectivity estimation) | Hist Bucket |
| hll | HyperLogLog cardinality estimator | Sketch |
| sketch | Count-Min Sketch frequency estimator | CMS |
| topn | Generic Top-N selection | TopN ByKey |
| coarsetime | Coarse-grained clock at 500µs precision (~1 ns/op) | NowNano Now Stop |
| scram | SCRAM-SHA-256 cryptographic primitives (RFC 5802), zero external deps | PBKDF2SHA256 HMACSHA256 SHA256Sum |
Zero-copy conversions and bitwise utility toolkit.
import "github.com/uniyakcom/yakutil"
s := yakutil.B2S(buf) // []byte → string (zero-copy)
b := yakutil.S2B(s) // string → []byte (zero-copy, read-only)
yakutil.IsPow2(1024) // true
n, err := yakutil.Pow2Ceil(1000) // n=1024, err=nil
// overflow: n=0, err=yakutil.ErrOverflow
// Native byte order
order := yakutil.Native // binary.BigEndian or LittleEndian
// Zero-allocation sentinel error
const ErrNotFound = yakutil.ErrStr("not found")
fmt.Println(ErrNotFound.Error()) // "not found"
// Cache-line padding
type HotStruct struct {
counter int64
_ yakutil.Pad // 64-byte padding, avoids false sharing
flag int64
}

Constants and Types

- CacheLine = 64 — x86/ARM cache line size
- Pad — [64]byte padding type
- Native — binary.ByteOrder, CPU native byte order detected at runtime
- ErrStr — zero-allocation string error type, declarable as const
- ErrOverflow — returned by Pow2Ceil on integer overflow
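For illustration, here is a minimal sketch of how zero-copy conversions like B2S/S2B are commonly implemented on Go 1.20+. The names b2s/s2b are hypothetical, and the library's actual code may differ; the usual caveats still apply either way: the string view must not observe later mutations of the byte slice, and the byte view of a string is strictly read-only.

```go
package main

import (
	"fmt"
	"unsafe"
)

// b2s reinterprets a byte slice as a string without copying.
// The caller must not mutate b while the string is in use.
func b2s(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// s2b reinterprets a string as a byte slice without copying.
// The result is read-only: Go strings are immutable.
func s2b(s string) []byte {
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	buf := []byte("hello")
	fmt.Println(b2s(buf))          // hello
	fmt.Println(len(s2b("world"))) // 5
}
```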
NoCopy
Embedding yakutil.NoCopy makes go vet copylocks detect illegal copies:
type MyMutex struct {
yakutil.NoCopy
// ...
}

import "github.com/uniyakcom/yakutil/hash"
h := hash.Sum64(key) // []byte input
h := hash.Sum64s("key") // string input (zero-alloc)

- Produces the same result as hash/fnv.New64a(); results can be compared cross-process or persisted
- Pure computation, no state allocation, inlinable
- Sum64s iterates directly over the string's underlying bytes, avoiding the []byte(s) conversion
- Measured ~2.8× faster than fnv.New64a(), zero allocations (fnv.New64a costs 1 alloc / 8 B per call)
h := hash.Sum64Map(key) // []byte input, AES-NI accelerated
h := hash.Sum64sMap("key") // string input (zero-alloc)

- Leverages Go runtime's built-in random seed + AES-NI hardware instructions
- Note: the seed is randomized on process restart; hash values cannot be persisted or compared cross-process
- Zero allocations; suitable for in-process routing, sharding, Bloom filters, etc.
Performance Comparison (Intel Xeon E-2186G @ 3.80GHz, Linux amd64, Go 1.26.1, bench_linux_6c12t.txt)
| Data Size | FNV Sum64s | maphash Sum64Map | maphash Sum64sMap | maphash/FNV ratio |
|---|---|---|---|---|
| 8 B | 1,344 MB/s | 1,751 MB/s | 1,272 MB/s | ×1.3 (small gap at 8B) |
| 32 B | 1,712 MB/s | 6,882 MB/s | 5,545 MB/s | ×3.2–4.0 |
| 128 B | 1,257 MB/s | 15,381 MB/s | 14,200 MB/s | ×11.3–12.2 |
| 256 B | 1,194 MB/s | 14,544 MB/s | 13,480 MB/s | ×11.3–12.2 |
| 1 KB | 1,151 MB/s | 13,016 MB/s | 12,781 MB/s | ×11.1–11.3 |
| 4 KB | 1,141 MB/s | 12,492 MB/s | 12,520 MB/s | ×10.9–11.0 |
All implementations: 0 B/op, 0 allocs/op.
| Implementation | ns/op | allocs | Notes |
|---|---|---|---|
| Sum64s (internal FNV) | 16.4 | 0 / 0 B | Reference key; recommended stdlib replacement |
| fnv.New64a() (stdlib) | 50.5 | 2 / 40 B | Interface boxing; ~3.1× slower, allocates |
Parallel throughput (32B key, 12 threads, -race=false)
| Implementation | ns/op | MB/s | allocs |
|---|---|---|---|
| FNV Sum64s | 2.04 | 15,707 | 0 |
| maphash Sum64sMap | 0.95 | 33,583 | 0 |
| Parallel speedup | — | ×2.1 | — |
| Scenario | Recommended | Reason |
|---|---|---|
| Cross-process, persisted, deterministic hash | Sum64 / Sum64s | FNV-1a spec compliant, stable seed |
| Short key (≤16 B), in-process | Sum64 / Sum64s | Small perf gap (≤1.3×), no seed restriction |
| Medium-to-long key (≥32 B), in-process routing | Sum64Map / Sum64sMap | 4–12× throughput, zero-cost AES-NI |
| smap, Bloom filter, consistent hash ring | Sum64Map / Sum64sMap | In-process; stable seed unnecessary |
When to Choose yakhash
yakutil/hash is designed as a zero-external-dependency lightweight internal tool, covering in-process hashing needs across the yakutil ecosystem. When your use case exceeds the following boundaries, consider the dedicated uniyak hash library yakhash (full xxHash XXH64/XXH3 implementation with AVX2/NEON assembly acceleration):
| Scenario | yakutil/hash | yakhash |
|---|---|---|
| ≥ 1 KB large-block throughput | FNV ~1.15 GB/s (byte-loop ceiling, no vectorization) | XXH3-64: 41.6 GB/s @ 1 KB, 56.2 GB/s @ 10 MB (36–49× faster) |
| Reproducible hash (fixed seed) | maphash is process-random; not reproducible | Sum3_64Seed / Sum64Seed |
| Cross-process / distributed consistent hash | maphash seed cannot be shared cross-process | Sum3_64Secret + GenSecret |
| HashDoS active defense (controlled key) | Not available | 192-byte secret space, Sum3_64Secret / Sum3_64Seed |
| 128-bit output (content addressing / dedup) | Not available | Sum3_128 / Sum3_128Seed / Sum3_128String |
| Streaming chunked processing (io.Reader) | Not available | New3() + Write + Sum64 / Sum128 |
| State snapshot / checkpoint-resume | Not available | MarshalBinary / UnmarshalBinary |
| C xxHash 0.8.x bit-exact compatibility | Not available | ✓ All functions bit-identical to the C original |
go get github.com/uniyakcom/yakhash
// Large data: AVX2 vectorized, 56 GB/s (10 MB block, same machine)
h := yakhash.Sum3_64(data)
// HashDoS defense (cross-process consistent key)
secret := yakhash.GenSecret(loadSeedFromConfig())
h, _ := yakhash.Sum3_64Secret(key, secret[:])
// Streaming chunked
d := yakhash.New3()
io.Copy(d, reader)
h := d.Sum64()
yakutil/hash maintains zero external dependencies; yakhash is the dedicated uniyak ecosystem hash library. The two have no transitive dependency relationship.
Multi-core concurrent counting with near-zero contention.
import "github.com/uniyakcom/yakutil/percpu"
c := percpu.New(runtime.GOMAXPROCS(0))
c.Add(1) // writes distributed across different cache line slots
total := c.Load() // aggregate all slots (approximate)
c.Reset() // zero all slots

Design: maps callers to 8–256 isolated 64 B slots using goroutine stack address × Fibonacci multiplicative hash constant (0x9e3779b97f4a7c15 = 2⁶⁴/φ), right-shifted 16 bits. The write path uses only an atomic Add, with no cross-core contention.
Auto slot count: max(8, ceilPow2(procs)), capped at 256; low-core environments retain 8 slots.
Why Fibonacci hash instead of >>13?
Goroutine pool stack spacing is fixed at 8 KB (1<<13 bytes); a plain >>13 maps them to consecutive slots (hotspot clustering). Fibonacci multiplication disperses the clustered addresses uniformly across all slots. On 32-bit systems the constant auto-truncates to 0x7f4a7c15 (still odd, preserving the bijection).
Selection Guide
| Scenario | Recommendation | Reason |
|---|---|---|
| Monitoring counters (bytes, messages) | percpu.Counter | High-frequency writes, approximate reads, 2× parallel advantage |
| Connection count, error count, rate limiting logic | atomic.Int64 | Requires an exact value for control flow |
| CAS / Swap / arbitrary initial value | atomic.Int64 | percpu does not support these |
| Load requires a strictly consistent snapshot | atomic.Int64 | percpu.Load is O(slots) aggregation and may observe intermediate state |
Performance (Intel Xeon E-2186G @ 3.8GHz, 12 cores, bench_linux_6c12t.txt)
| Operation | percpu | atomic.Int64 |
|---|---|---|
| Add serial | 5.6 ns | 5.8 ns |
| Add parallel (12C contention) | 2.2 ns ✓ | 17.1 ns ✗ (×7.8 slower) |
| Load | ~10 ns | ~3 ns ✗ |
| Reset | ~139 ns | — |
Serial: comparable; high-concurrency writes: percpu ~8× advantage; Load is ~4× slower than atomic.Load.
Use percpu exclusively for "write-heavy, read-light" monitoring counters.
Diagnostic interface Stats()
st := c.Stats()
// Stats{Slots:64, Min:..., Max:..., Mean:..., Skew:1.0}
// Skew = Max/Mean; ≤2.0 is uniform, >2.0 suggests hot slot
if st.Skew > 2.0 {
log.Printf("percpu hot slot detected: skew=%.1f", st.Skew)
}

| Field | Meaning |
|---|---|
| Slots | Total slot count |
| Min / Max | Minimum / maximum slot value |
| Mean | Average slot value |
| Skew | Max/Mean; 1.0 = perfectly uniform; >2.0 suggests a hot slot; >5.0 suggests increasing procs |
Gauge supports Add + Sub (value can be negative), suitable for tracking active connections, concurrent request counts, etc.
g := percpu.NewGauge(runtime.GOMAXPROCS(0))
g.Add(1) // new connection enters
g.Sub(1) // connection closes (equivalent to Add(-1))
g.Load() // aggregate all slots (approximate snapshot)
g.Reset() // zero all slots
// Diagnostics
st := g.Stats() // GaugeStats{Slots, Min, Max, Sum, Mean, Skew}

| Counter vs Gauge | Counter | Gauge |
|---|---|---|
| Monotonic | Yes (increment only) | No (increment and decrement) |
| Add/Sub | Add only | Add + Sub |
| Load | Approximate aggregate | Approximate aggregate |
| Typical use | Byte counts, message counts | Active connections, memory usage |
Performance identical to Counter (same Fibonacci hash + 64B slot layout).
Measured Performance (Intel Xeon E-2186G @ 3.80GHz, 12 cores, bench_linux_6c12t.txt)
| Operation | ns/op | allocs | Notes |
|---|---|---|---|
| Add (serial) | 5.64 | 0 | Fibonacci hash distributes to a random slot |
| Add (12 parallel) | ~2.9 | 0 | 12 threads, no slot contention |
| Load (cross-slot aggregate) | — | 0 | Single pass over all slots |
Comparison: atomic.Int64.Add at 12 parallel ≈ 17.0 ns/op (≥6× gap).
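The padded-slot idea above can be sketched in a few lines. This is illustrative only: slot selection is simplified to a caller-provided hint instead of the library's stack-address Fibonacci hash, and the type names are hypothetical. The essential properties survive: each slot occupies its own 64-byte cache line, writers touch one slot with a plain atomic add, and Load is an O(slots) sum.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// slot pads an 8-byte atomic counter out to a full 64 B cache line so
// that two slots never share a line (no false sharing between writers).
type slot struct {
	n atomic.Int64
	_ [56]byte
}

type counter struct{ slots []slot }

func newCounter(n int) *counter { return &counter{slots: make([]slot, n)} }

// Add bumps one slot; hint stands in for the real stack-address hash.
func (c *counter) Add(hint int, d int64) {
	c.slots[hint%len(c.slots)].n.Add(d)
}

// Load sums all slots: an approximate snapshot, exactly as described above.
func (c *counter) Load() int64 {
	var sum int64
	for i := range c.slots {
		sum += c.slots[i].n.Load()
	}
	return sum
}

func main() {
	c := newCounter(8)
	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func(hint int) { // each goroutine writes its own slot
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Add(hint, 1)
			}
		}(g)
	}
	wg.Wait()
	fmt.Println(c.Load()) // 4000
}
```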
Multi-producer single-consumer queue, ideal for WAL Group Commit pattern.
import "github.com/uniyakcom/yakutil/mpsc"
r := mpsc.New[Record](4096)
// Producer (multiple goroutines)
seq, ok := r.Enqueue(record)
if ok {
err := r.Wait(seq) // blocks until consumer Commits
}
// Producer with context timeout
seq, ok := r.Enqueue(record)
if ok {
err := r.WaitContext(ctx, seq) // returns ctx.Err() on timeout
}
// Consumer (single goroutine)
start, n := r.Drain(func(rec *Record) error {
return encodeToBuf(rec) // batch encode
})
flushErr := fsync(buf)
r.Commit(start, n, flushErr) // wake all producers

State machine: free → filling → ready → drained → free

- Enqueue: CAS tail, write slot
- Wait: channel-block until the done signal (Go runtime scheduling, no busy-spin), then release the slot
- WaitContext: like Wait but respects context.Context cancel/timeout; spawns a lightweight cleanup goroutine on cancel (30 s safety timeout prevents goroutine leaks)
- Drain: batch-harvest consecutive ready slots
- Commit: sets done=1 to wake producers; can carry a batch error
Single-producer single-consumer ultra-low-latency queue.
import "github.com/uniyakcom/yakutil/spsc"
r := spsc.New[Event](1024)
// Producer (single goroutine)
r.Push(event)
// Consumer (single goroutine)
if evt, ok := r.Pop(); ok {
handle(evt)
}

- No CAS; only atomic.Store / atomic.Load (equivalent to plain MOV on x86)
- cachedHead / cachedTail eliminate most cross-core cache line reads
- Typical throughput 2–5 ns/op
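A minimal SPSC ring in this spirit can be sketched as follows: no CAS, only atomic loads/stores on head and tail, with a power-of-two capacity so the index wrap is a single bitmask. This is a sketch, not the library's Ring[T]; in particular it omits the cachedHead/cachedTail optimization mentioned above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type spscRing[T any] struct {
	buf  []T
	mask uint64
	head atomic.Uint64 // next slot to read (consumer-owned)
	tail atomic.Uint64 // next slot to write (producer-owned)
}

func newRing[T any](capPow2 int) *spscRing[T] {
	return &spscRing[T]{buf: make([]T, capPow2), mask: uint64(capPow2 - 1)}
}

// Push must be called by the single producer only.
func (r *spscRing[T]) Push(v T) bool {
	t := r.tail.Load()
	if t-r.head.Load() == uint64(len(r.buf)) {
		return false // full
	}
	r.buf[t&r.mask] = v
	r.tail.Store(t + 1) // publish the slot only after writing it
	return true
}

// Pop must be called by the single consumer only.
func (r *spscRing[T]) Pop() (T, bool) {
	var zero T
	h := r.head.Load()
	if h == r.tail.Load() {
		return zero, false // empty
	}
	v := r.buf[h&r.mask]
	r.head.Store(h + 1)
	return v, true
}

func main() {
	r := newRing[int](8)
	r.Push(1)
	r.Push(2)
	for v, ok := r.Pop(); ok; v, ok = r.Pop() {
		fmt.Println(v)
	}
}
```

Because each index is written by exactly one side, publication order (write the slot, then store the counter) is all the synchronization required.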
import "github.com/uniyakcom/yakutil/backoff"
var bo backoff.Backoff
for !condition() {
bo.Spin()
}
bo.Reset()

| Phase | Iteration Range | Behavior |
|---|---|---|
| Phase 1 | N < 64 | Tight CPU spin (zero overhead) |
| Phase 2 | N < 128 | runtime.Gosched() |
| Phase 3 | N ≥ 128 | Exponential sleep (1 µs to 1 ms) |
Customizable:
bo := backoff.Backoff{
SpinN: 32,
YieldN: 16,
MaxWait: 500 * time.Microsecond,
}

The zero value is usable directly (default parameters). Value type, no allocation.
High-concurrency bump allocator, suitable for short-lived scenarios like WAL encode buffers.
import "github.com/uniyakcom/yakutil/arena"
a := arena.New(0) // default 64KB chunk
buf := a.Alloc(128) // 8-byte aligned, CAS concurrency-safe; n≤0 returns nil
// ... use buf ...
a.Reset() // switch to new chunk; old references still valid until GC

- Fast path (CAS + add) < 5 ns/op (vs Go heap ~25 ns)
- n > chunkSize falls back to make()
- Chunk exhaustion auto-switches; old chunk references remain valid
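The bump-allocation fast path can be sketched with a single atomic offset. This is an assumption-laden miniature (type names are hypothetical; chunk switching and the make() fallback are omitted), but it shows why concurrent Alloc is safe: the atomic add hands each caller a disjoint region.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type bumpArena struct {
	chunk []byte
	off   atomic.Int64 // next free offset in chunk
}

// Alloc bumps the offset by the 8-byte-aligned size and returns the
// n-byte region. The full-capacity slice expression pins cap so append
// cannot spill into a neighbor's region.
func (a *bumpArena) Alloc(n int) []byte {
	if n <= 0 {
		return nil
	}
	pad := int64((n + 7) &^ 7) // round the bump up to 8-byte alignment
	end := a.off.Add(pad)
	if end > int64(len(a.chunk)) {
		return nil // chunk exhausted (the real arena switches chunks)
	}
	start := end - pad
	return a.chunk[start : start+int64(n) : end]
}

func main() {
	a := &bumpArena{chunk: make([]byte, 1<<17)}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				buf := a.Alloc(100) // each call gets a disjoint region
				buf[0] = 1          // race-free: no two callers overlap
			}
		}()
	}
	wg.Wait()
	fmt.Println(a.off.Load()) // 8 × 100 × 104 = 83200
}
```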
20 tiers of sync.Pool (64B to 32MB), automatically tiered by size. Requests >32MB are directly allocated, bypassing the pool.
Two APIs are available:
| API | Steady-state allocs | Best for |
|---|---|---|
| Get / Put | 0 alloc, 0 B | Single-goroutine short-lived buffers |
| GetPtr / PutPtr | 0 alloc, 0 B | Cross-goroutine async pipelines; the pointer travels with the buffer |
import "github.com/uniyakcom/yakutil/bufpool"
// Global functions
buf := bufpool.Get(4096)
defer bufpool.Put(buf)
// Independent instance
var p bufpool.Pool
buf := p.Get(1024)
p.Put(buf)

- Get(size): returns a slice with len=size, cap=2^n
- Put(b): returns the slice to the appropriate tier by cap
- Slices with cap < 64 B or cap > 32 MB are automatically discarded
- Slices with non-power-of-2 cap are automatically discarded (prevents b[:size] out-of-bounds panic)
- Internally uses unsafe.Pointer to bypass interface boxing — steady-state 0 B / 0 allocs
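The tier math implied by "20 tiers, 64 B to 32 MB" can be sketched with bits.Len. This is an assumed layout (tier 0 = 64 B = 2^6 up through tier 19 = 32 MB = 2^25); the library's internal indexing may differ, but any power-of-two size-class scheme reduces to this shape:

```go
package main

import (
	"fmt"
	"math/bits"
)

// tier returns the assumed size-class index and its capacity: sizes are
// rounded up to the next power of two, and the index is derived from the
// bit length (64 = 2^6 maps to tier 0, 32 MB = 2^25 to tier 19).
func tier(size int) (idx, capacity int) {
	if size < 64 {
		size = 64
	}
	c := 1 << uint(bits.Len(uint(size-1))) // next power of two >= size
	return bits.Len(uint(c)) - 7, c
}

func main() {
	fmt.Println(tier(64))   // 0 64
	fmt.Println(tier(4096)) // 6 4096
	fmt.Println(tier(5000)) // 7 8192  (rounded up to the next class)
}
```

Rounding to a power of two is also what makes the Put-side check cheap: a returned slice whose cap is not a power of two cannot belong to any tier and is discarded.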
Stores *[]byte (pointer type) rather than []byte (3-word value) in sync.Pool, eliminating the interface-boxing allocation entirely. The wrapper pointer and its backing buffer travel together through the full lifecycle — GetPtr → Event.bufPtr → Post steal → logEntry.bufPtr → drainRing → PutPtr → ptrTiers → GetPtr → … — with zero allocations in steady state.
// Acquire: wrapper pointer + buf together (0 alloc in steady state)
bp := bufpool.GetPtr(4096)
buf := *bp // the actual []byte, len=4096, cap=2^n
buf = buf[:0] // reset for writing
// ... write into buf ...
// Release: pointer type stored in sync.Pool, 0-alloc boxing
bufpool.PutPtr(bp)
// ⚠️ Do NOT access bp or *bp after PutPtr — use-after-free

Independent instance:
var p bufpool.Pool
bp := p.GetPtr(512)
*bp = append((*bp)[:0], data...)
p.PutPtr(bp)

Producer / consumer across goroutines (async log pipeline pattern):
type task struct {
buf []byte
bufPtr *[]byte // travels with buf; nil if from Get (boxing path)
}
// Producer goroutine
bp := bufpool.GetPtr(512)
buf := *bp
buf = encode(buf[:0])
queue <- task{buf: buf, bufPtr: bp}
// Consumer goroutine (different P)
t := <-queue
write(t.buf)
if t.bufPtr != nil {
bufpool.PutPtr(t.bufPtr) // 0-alloc boxing even across Ps
} else {
bufpool.Put(t.buf)
}Safety rules:
GetPtrandPutPtruse a separate internal pool (ptrTiers) fromGet/Put(tiers). Never mix them —Put(buf)for a buf obtained viaGetPtr, orPutPtr(bp)with a pointer not fromGetPtr, will silently discard the buffer or corrupt the pool.- After
PutPtr(bp), ownership transfers to the pool. Do not read or writebpor*bp. bpmust point to an independently heap-allocated object (e.g.Event.bufPtr), not a field embedded inside a ring-slot struct. Embedding causes the wrapper to be aliased by future slot reuse, leading to use-after-free.- Buffers with non-standard
cap(e.g. afterappendgrew the slice) are silently discarded byPutPtr; the nextGetPtrwill allocate fresh. Trackcapstability if reuse is critical.
Benchmark comparison (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1):
BenchmarkPool_Get128-12 ~14 ns/op 0 B/op 0 allocs/op
BenchmarkPool_Get128_Parallel-12 ~2 ns/op 0 B/op 0 allocs/op
BenchmarkPool_GetPtr512-12 ~15 ns/op 0 B/op 0 allocs/op
BenchmarkPool_GetPtr512_Parallel-12 ~2 ns/op 0 B/op 0 allocs/op
Generic atomic snapshot for read-heavy, write-light scenarios.
import "github.com/uniyakcom/yakutil/cow"
v := cow.New[Config](defaultConfig)
// Read (any goroutine, ~1ns)
cfg := v.Load()
// Write (single writer)
v.Store(newConfig)
// Read-modify-write
v.Update(func(old Config) Config {
old.Timeout = 5 * time.Second
return old
})

- Read path: a single atomic.Pointer.Load(), truly lock-free
- Write path: construct a new value, then atomic.Store
- Update: suitable for a single writer; use an external lock for multiple writers
- Swap: atomically replaces and returns the old value, concurrency-safe
- UpdateCAS: CAS-loop lock-free read-modify-write, safe for multiple writers; fn must be pure and cheap — each CAS retry re-invokes fn; use UpdateCASPtr for allocation-sensitive hot paths
// Multiple-writer-safe read-modify-write
v.UpdateCAS(func(old Config) Config {
old.Count++
return old
})
// Atomic replace and get old value
old := v.Swap(newConfig)
// Zero-alloc read-modify-write (with sync.Pool for old pointer recycling)
var pool sync.Pool
oldPtr := v.UpdateCASPtr(func(old *Config) *Config {
np, _ := pool.Get().(*Config)
if np == nil {
np = new(Config)
}
if old != nil {
*np = *old
}
np.Timeout = 5 * time.Second
return np
})
if oldPtr != nil {
pool.Put(oldPtr)
}

UpdateCASPtr is the zero-alloc variant of UpdateCAS: fn receives the current *T (nil for the zero value) and returns a new *T; ownership of the old *T is returned to the caller for sync.Pool recycling. Each successful write costs 0 allocs, vs 1 alloc / 8 B for UpdateCAS.
| Method | ns/op | allocs | Notes |
|---|---|---|---|
| UpdateCAS | ~19.7 | 1 / 8 B | Allocates a new object per write |
| UpdateCASPtr + pool | ~17.7 | 0 / 0 B | Pool supplies the object; old pointer returned |
| UpdateCAS_Parallel (12t) | ~159 | 6 / 50 B | Multiple allocs under CAS contention |
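The UpdateCAS retry loop can be sketched directly with atomic.Pointer. This is illustrative, not the library's Value[T] source; it shows why fn must be pure: a lost CAS race re-invokes it against the fresher snapshot.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type cowValue[T any] struct{ p atomic.Pointer[T] }

func (v *cowValue[T]) Load() T { return *v.p.Load() }

// UpdateCAS: load the current pointer, build a new value from a copy,
// and CompareAndSwap it in; retry if another writer won the race.
func (v *cowValue[T]) UpdateCAS(fn func(old T) T) {
	for {
		oldP := v.p.Load()
		newV := fn(*oldP) // fresh copy; may run more than once
		if v.p.CompareAndSwap(oldP, &newV) {
			return
		}
	}
}

func main() {
	var v cowValue[int]
	zero := 0
	v.p.Store(&zero)
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // 8 concurrent writers, all safe
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				v.UpdateCAS(func(old int) int { return old + 1 })
			}
		}()
	}
	wg.Wait()
	fmt.Println(v.Load()) // 8000
}
```

The `&newV` escape is exactly the 1 alloc / 8 B per write shown in the table above; recycling that allocation through a pool of old pointers is what UpdateCASPtr adds.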
High-performance concurrent Map, N shards + RWMutex to isolate contention. String key routing uses maphash AES acceleration (for in-process routing; seed changes on process restart).
import "github.com/uniyakcom/yakutil/smap"
// string key (maphash AES-accelerated shard hash)
m := smap.New[int](64) // 64 shards
m.Set("foo", 42)
v, ok := m.Get("foo")
m.Range(func(k string, v int) bool { return true })
// Atomic get-or-create (double-checked locking; fn executes exactly-once under write lock)
v, created := m.GetOrSet("session:42", func() int {
return expensiveInit()
})
// created=true means this call triggered creation
// uint64 key (Fibonacci hash)
m64 := smap.New64[string](32)
m64.Set(12345, "hello")
val, created := m64.GetOrSet(12345, func() string { return "world" })

- Read path: RLock on a single shard (~22 ns), no global lock
- Cache-line isolation between shards
- GetOrSet: fast RLock check first; if present, returns the value with created=false; if absent, upgrades to Lock, double-checks, then calls fn — guarantees fn executes only once under concurrency
Measured Performance (bench_linux_6c12t.txt, Go 1.26.1)
| Operation | ns/op |
|---|---|
| Get | ~22 ns |
| Set | ~35 ns |
| Parallel Get (12t) | ~45 ns |
| ReadHeavy NoAlloc (B2S) | ~9 ns |
Note: For lowest latency, set shard count to 2–4× CPU count.
SIMD-Within-A-Register: process 8 bytes simultaneously in a single integer operation.
import "github.com/uniyakcom/yakutil/swar"
idx := swar.FindByte(data, '\n') // find newline
idx := swar.FindQuote(data) // find '"'
idx := swar.FindEscape(data) // find <0x20 / '"' / '\\'

- Typical 4–8× speedup vs a single-byte loop
- Suitable for JSON parsing, HTTP header scanning
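The core SWAR trick can be shown in a few lines: XOR a 64-bit word against the target byte repeated in every lane, then apply the classic "has zero byte" bit test so all 8 lanes are checked with a couple of integer ops. This sketch is not the library's FindByte (tail handling and unaligned loads are simplified), but it is the same technique:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// findByte scans 8 bytes at a time: after w ^= pattern, a matching byte
// becomes zero, and (w - 0x01..01) &^ w & 0x80..80 sets the high bit of
// each zero lane. TrailingZeros64/8 yields the first matching index.
func findByte(data []byte, c byte) int {
	pattern := uint64(c) * 0x0101010101010101 // c broadcast to all 8 lanes
	i := 0
	for ; i+8 <= len(data); i += 8 {
		w := binary.LittleEndian.Uint64(data[i:]) ^ pattern
		if m := (w - 0x0101010101010101) &^ w & 0x8080808080808080; m != 0 {
			return i + bits.TrailingZeros64(m)/8
		}
	}
	for ; i < len(data); i++ { // scalar tail for the last <8 bytes
		if data[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(findByte([]byte("GET /index.html HTTP/1.1\r\n"), '\n')) // 25
}
```

The least-significant set flag always corresponds to the first match, so the little-endian TrailingZeros lookup is exact even when borrow propagation disturbs higher lanes.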
Based on 256-byte lookup table, zero allocation.
import "github.com/uniyakcom/yakutil/fold"
fold.Equal([]byte("Content-Type"), "content-type") // true
fold.EqualStr("HOST", "Host") // true
// In-place ASCII lowercase / uppercase (zero-alloc)
fold.ToLowerBytes(header) // e.g. header = []byte("Content-Type") → "content-type"
fold.ToUpperBytes(key) // e.g. key = []byte("accept") → "ACCEPT"
// Allocating variants (1 alloc)
lower := fold.ToLowerString("Content-Type") // "content-type"
upper := fold.ToUpperString("accept") // "ACCEPT"

- Approximately 1.84× faster than strings.EqualFold (~6.6 vs ~12.1 ns, bench_linux_6c12t.txt)
- ASCII only (A–Z and a–z); use the stdlib for Unicode
- ToLowerBytes / ToUpperBytes: in-place ASCII case conversion, zero-alloc (~26 ns for a 46-byte header on Xeon E-2186G)
- ToLowerString / ToUpperString: allocating variants, 1 alloc
2^N auto-growing ring buffer for network I/O.
import "github.com/uniyakcom/yakutil/ring"
buf := ring.New(4096)
buf.Write(data)
buf.WriteByte(0x0A) // implements io.ByteWriter, writes single byte (zero-alloc)
p := buf.Peek(10) // inspect first 10 bytes (non-consuming)
buf.Discard(10) // discard
c, err := buf.ReadByte() // implements io.ByteReader, read byte-by-byte (zero-alloc)
buf.UnreadByte() // implements io.ByteScanner, replay last ReadByte (zero-alloc)
buf.WriteTo(conn) // zero-copy output
buf.ReadFrom(conn) // read from Reader

Byte-by-byte I/O (WriteByte / ReadByte / UnreadByte trio):

- No []byte allocation needed; suitable for frame header parsing (magic, version, length fields)
- ReadByte returns (byte, error); an empty buffer returns io.EOF
- UnreadByte rolls back the last ReadByte and implements io.ByteScanner for peek-and-rollback protocol parsing; two consecutive UnreadByte calls return io.ErrNoProgress
- Can be mixed with Read / Write
Peek semantics:

- No wraparound: returns a slice pointing to the internal buffer (zero-copy); the caller must not modify it
- Crossing the boundary: allocates a new slice and copies (safe to modify)

Implementation details:

- Bitmask modulo, zero-alloc wrap
- Auto 2× expansion with data linearization
- Struct is exactly 64 B = 1 cache line
Measured Throughput (bench_linux_6c12t.txt)
| Operation | ns/op | Throughput |
|---|---|---|
| Write 64 B | 7.6 ns | 8,404 MB/s |
| WriteRead 1 KB | 31.0 ns | 33,064 MB/s |
| PeekDiscard | 9.5 ns | — |
| WriteByte | 2.75 ns | 0 allocs |
| ReadByte | 1.81 ns | 0 allocs |
| UnreadByte (Write+Read+Unread+Read round-trip) | 5.6 ns | 0 allocs |
Zero-copy I/O (ReadableSegments / CommitRead / WritableSegments / CommitWrite)
High-performance protocol parsers (e.g., RESP) can use the zero-copy interface to directly access internal buffers, eliminating intermediate Read(tmp) copies:
// Zero-copy read side
// Returns one or two segments pointing to internal buffer; two segments when wrapping
s1, s2 := buf.ReadableSegments(n) // caller must not write to these
parse(s1) // zero-copy parse
if s2 != nil { parse(s2) } // handle second segment when wrapping
buf.CommitRead(n) // advance read pointer (returns error on out-of-range)
// Zero-copy write side
s1, s2 := buf.WritableSegments(need) // get writable memory segments
copy(s1, data) // write directly into internal buffer
buf.CommitWrite(n) // confirm n bytes written (returns error on out-of-range)

Returned s1/s2 point directly into the internal buffer; no copy is needed. After consuming k bytes, you must call CommitRead(k) to advance the read pointer.
32-bit overflow auto-protection
r, w, mask fields use uint (platform-native unsigned integer):
| Platform | Field width | Overflow threshold | Conclusion |
|---|---|---|---|
| 64-bit | uint64 | ~18 EB | Practically unreachable |
| 32-bit | uint32 | 4 GB | Go slice limit 2 GB; also unreachable |
Unsigned overflow wraps naturally; Len() = w - r always correct; no need to manually call Reset() in continuous non-empty stream scenarios.
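The wraparound claim is easy to verify with deliberately tiny counters. The sketch below uses uint8 so the overflow is visible; the same modular arithmetic holds for the buffer's platform-width uint fields:

```go
package main

import "fmt"

func main() {
	// r and w model the ring's read/write counters near overflow.
	var r, w uint8 = 250, 250

	w += 10 // write 10 bytes: 250+10 wraps to 4 (mod 256)

	// Unsigned subtraction wraps the same way, so the buffered length
	// w - r is still correct even though w < r numerically.
	fmt.Println(w, w-r) // 4 10
}
```

The only precondition, as stated above, is that the buffer never holds more data than the counter's range, which the slice-size limits already guarantee.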
High-performance timer with configurable resolution.
import "github.com/uniyakcom/yakutil/wheel"
w := wheel.New[ConnID](10*time.Millisecond, 1024)
id := w.Add(5*time.Second, connID) // O(1) add
w.Cancel(id) // O(1) cancel
// Manual advance
w.Advance(func(id ConnID) { close(id) })
// Or automatic tick
w.Run(ctx, func(id ConnID) { close(id) })

- Add/Cancel O(1), Advance O(expired)
- Supports layered rounds (long delays exceeding the slot count)
- sync.Pool reuses entry nodes
Pool, Stack, and Adaptive all implement Submitter; callers need not care about the underlying implementation:
type Submitter interface {
Submit(task func()) bool // blocks until task accepted or pool stopped
TrySubmit(task func()) bool // non-blocking; returns false if cannot accept immediately
Running() int // current active worker count
Stop() // graceful shutdown, waits for all workers to finish
PanicCount() int64 // lifetime task panic count (increments regardless of handler)
}

Pool and Stack additionally implement TimedSubmitter for timeout backpressure control:
type TimedSubmitter interface {
Submitter
SubmitTimeout(task func(), timeout time.Duration) bool
// Waits at most timeout: Pool waits for queue slot, Stack waits for idle worker.
// Returns false on timeout or pool stopped.
}

Use case:
// HTTP handler: wait at most 50ms for worker, return 503 on timeout
ok := pool.SubmitTimeout(func() { handle(req) }, 50*time.Millisecond)
if !ok {
http.Error(w, "503 overloaded", http.StatusServiceUnavailable)
return
}

import "github.com/uniyakcom/yakutil/wpool"
p := wpool.NewPool(8, 5*time.Second) // 8 workers, 5s idle timeout
p.Submit(func() { handle(req) })
p.TrySubmit(func() { handle(req) }) // non-blocking
p.Resize(16) // dynamic scaling
p.Stop() // graceful shutdown

- Queue watermark >75% / idle >50% are auto-scaled by Adaptive.adjust() (Pool itself does not auto-scale)
- Adjust worker count via Resize(n); excess workers auto-exit on idle after downsizing
- Submit / TrySubmit provide blocking / non-blocking modes
- safeRun delegates to panicSafeRun (wpool/safe.go), shared by Pool and Stack
- A single task panic does not affect worker survival; each panic is counted, then the handler is called
- PanicCount() int64: cumulative panic count (increments regardless of whether a handler is set)
PanicCount usage example
p := wpool.NewPool(8, 5*time.Second, wpool.WithPanicHandler(func(r any, stack []byte) {
slog.Error("worker panic", "err", r, "stack", string(stack))
}))
// Periodically report to monitoring (Prometheus, OTEL, etc.)
go func() {
for range time.Tick(30 * time.Second) {
metrics.Set("wpool_panics_total", float64(p.PanicCount()))
}
}()
PanicCount and WithPanicHandler are independent: the count increments even without a handler; when a handler is present, the panic is counted first and the handler is then called.
s := wpool.NewStack(8, 10*time.Second) // 8 workers, 10s idle timeout
s.Submit(func() { handle(req) })
s.TrySubmit(func() { handle(req) })
s.PanicCount()
s.Stop()

FILO vs FIFO Selection:
| Scenario | Recommended |
|---|---|
| Network IO reactor dispatch, protocol parsing (short-lived, high-concurrency) | Stack (FILO) |
| General background tasks, variable task duration | Pool (FIFO) |
Stack design details:
- Per-worker chan (capacity=1) avoids Submit/workerFunc mutual blocking
- sync.Pool reuses stackWorkerChan, reducing GC allocation
- sync.Cond blocks with zero CPU when maxWorkers is reached
- A background cleaner periodically reclaims idle-timeout workers (default 10 s)
- Stop() guarantee: waits for all worker goroutines to fully exit (including currently executing tasks); Running() == 0 after return, no goroutine leak
a := wpool.NewAdaptive(4, 64, 500*time.Millisecond, 5*time.Second)
// params: min=4, max=64, sample period=500ms, idle timeout=5s
a.Submit(func() { handle(req) })
a.Stop()

- Queue watermark >75%: scale up (+25% workers, up to max)
- Idle workers >50%: scale down (−25% workers, down to min)
- PanicCount() int64 delegates to the internal Pool, sharing the same counter
func NewDispatcher(pool wpool.Submitter) *Dispatcher {
return &Dispatcher{pool: pool}
}
// Reactor IO: use FILO Stack
d := NewDispatcher(wpool.NewStack(runtime.NumCPU(), 0))
// General background tasks: use FIFO Pool
d := NewDispatcher(wpool.NewPool(8, 0))
// With explicit backpressure deadline: use TimedSubmitter
var ts wpool.TimedSubmitter = wpool.NewStack(8, 0)
ts.SubmitTimeout(task, 50*time.Millisecond)

Measured Performance (bench_linux_6c12t.txt)
| Implementation | Submit (serial) | TrySubmit | Submit (parallel) |
|---|---|---|---|
| Pool FIFO | 440 ns | 12 ns | 384 ns |
| Stack FILO | 425 ns | 40 ns | 420 ns |
| go spawn | 240 ns | — | — |

TrySubmit is fastest (12 ns), suitable for non-blocking dispatch. Stack.Submit (FILO) is CPU-cache affine, suitable for low-latency network IO dispatch.
SubmitTimeout Measured Performance (bench_linux_6c12t.txt)
| Implementation | ns/op (serial) | allocs | Notes |
|---|---|---|---|
| Pool.SubmitTimeout | 603 ns | 0 (0 B) | Timer pool optimization, zero-alloc |
| Stack.SubmitTimeout | 473 ns | 53 B / 0 alloc | FILO worker reuse, timer token pooled |
| Pool.SubmitTimeout (parallel) | 519 ns | 0 | — |
| Stack.SubmitTimeout (parallel) | 642 ns | 72 B / 1 alloc | — |

Stack.SubmitTimeout is ~23% faster than Pool.SubmitTimeout in the serial case.
Small key array direct lookup + large key sync.Map fallback.
import "github.com/uniyakcom/yakutil/itable"
tb := itable.New[Conn](0) // default 65536 fast path
tb.Set(fd, &conn) // O(1) atomic store
conn, ok := tb.Get(fd) // O(1) atomic load
tb.Del(fd)

- key < 65536: atomic.Pointer array, zero-lock, no contention
- key ≥ 65536: sync.Map fallback
- Suitable for dense integer scenarios like fds and connection IDs
Multiple shards reduce lock contention, maphash AES-accelerated shard routing, optional lazy TTL.
import "github.com/uniyakcom/yakutil/lru"
// Basic: 16 shards, 10000 entries per shard
c := lru.New[[]byte](16, 10000)
c.Set("key", data)
val, ok := c.Get("key")
c.Del("key")
// Eviction callback (triggered on LRU capacity eviction, not TTL expiry)
c = lru.New[[]byte](16, 10000, lru.WithEvict[[]byte](func(k string, v []byte) {
log.Printf("evicted %s", k)
}))
// Lazy TTL: checked on Get, no background goroutine
c = lru.New[[]byte](16, 10000, lru.WithTTL[[]byte](12*time.Hour))
// Combined: TTL + eviction callback
c = lru.New[[]byte](16, 10000,
lru.WithTTL[[]byte](5*time.Minute),
lru.WithEvict[[]byte](func(k string, v []byte) { ... }),
)
// Iterate all live entries (newest→oldest, skip TTL-expired)
c.Range(func(k string, v []byte) bool {
fmt.Println(k)
return true // return false to terminate early
})
// Clear all entries (does not trigger evict callback)
c.Purge()

- Get ~21 ns, Set ~36 ns (bench_linux_6c12t.txt, Go 1.26.1)
- Parallel Get (12t): ~46 ns; Mixed 90R10W (serial): ~100 ns
- Per-shard independent Mutex; Clock-LRU: Get skips moveToFront (sets the accessed bit only)
- Auto-evicts via the Clock algorithm — recently accessed entries get a second chance
- Range: snapshot iteration in insertion order, callback invoked outside the lock, skips TTL-expired entries
- RangeLimit(batchSize, fn): batched iteration for large caches; each batch re-locks
- Purge: O(shards) clear, does not trigger the evict callback
- WithTTL: TTL resets when the same key is overwritten; Len() includes lazily uncleaned expired entries
Go 1.26 measured (bench_linux_6c12t.txt, count=5): serial Mixed 90R10W ~100 ns; parallel Get (12t) ~46 ns. Shard isolation continues to provide a significant parallel-access advantage.
Pure-Go minimal implementation of the LZ4 block format.
No frame headers; framing is the caller's responsibility. Suitable for WAL, snapshot, and record-oriented storage.
import "github.com/uniyakcom/yakutil/lz4"
// Allocate destination (caller-provided or nil for auto-alloc)
dst := make([]byte, lz4.CompressBound(len(src)))
compressed, err := lz4.Compress(dst, src)
// err is non-nil only for ErrInputTooLarge (src > ~2 GB)
original := make([]byte, len(src))
out, err := lz4.Decompress(original, compressed)

High-compression mode (slower, same format, decompressible by Decompress):

compressed, err := lz4.CompressHC(nil, src) // auto-alloc

Errors (comparable via errors.Is):
| Constant | Meaning |
|---|---|
| ErrCorrupt | Input block data is corrupt or malformed |
| ErrOutputFull | Destination buffer capacity is insufficient |
| ErrInputTooLarge | Input exceeds the LZ4 block format limit (~2 GB) |
Performance (Intel Xeon E-2186G @ 3.80GHz, repetitive text input):
| Operation | Throughput |
|---|---|
| `Compress` | ~500 MB/s |
| `Decompress` | ~2 GB/s |
| `CompressHC` | ~60 MB/s |
Decompression fast path: when offset >= matchLen (no overlap), copy() is used and can leverage SIMD/memcpy.
Self-overlapping (RLE) matches use byte-by-byte copy to preserve correct semantics.
CAS lock-free token bucket, no background goroutines, zero allocation, suitable for per-IP / global rate limiting.
import "github.com/uniyakcom/yakutil/ratelimit"
// Global: 1000/sec, burst cap 200
rl := ratelimit.New(1000, 200)
// Non-blocking check (hot path, ~15-25 ns, 0 allocs)
if !rl.Allow() {
conn.Close()
return
}
// Batch consume (writing multiple messages at once)
if !rl.AllowN(5) {
return ErrRateLimited
}
// Runtime diagnostics
rl.Tokens() // current available token count snapshot
rl.Rate() // configured rate (tokens/sec)
rl.Burst() // bucket capacity
rl.Reset() // reset to full bucket (connection reset, test init)
Performance (Intel Xeon E-2186G @ 3.80GHz, amd64, Go 1.26.1):
| Scenario | Latency | Allocation |
|---|---|---|
| Allow (single core, tokens available) | ~12 ns | 0 allocs |
| Allow (12-thread parallel, CAS contention) | ~147 ns | 0 allocs |
| Allow (tokens exhausted, fast reject) | ~10 ns | 0 allocs |
Algorithm highlights:
- Lazy refill: computes the elapsed-time delta per call, avoiding a background timer goroutine
- Refill calculation uses integer nanoseconds (1e9 / rate), no floating-point error
- CAS loop: read snapshot → compute new state → CAS update; retries on contention (typically 1-2 rounds)
- n > burst always returns false (can short-circuit early, no waiting)
Default is already the high-performance clock
Limiter uses coarsetime.NowNano (500µs precision, ~1 ns/op) as its default clock —
no configuration needed. coarsetime starts automatically via package init().
// Default: already uses coarsetime, no extra setup required
rl := ratelimit.New(10_000, 500)
// For precise rate shaping, explicitly inject wall clock:
rl := ratelimit.New(10_000, 500,
ratelimit.WithClock(func() int64 { return time.Now().UnixNano() }),
)
When to use a precise clock: only when 500µs jitter matters (e.g., tight burst shaping). For all other uses (per-IP, global rate limiting), the default coarsetime is optimal.
IPMap combines smap.Map (maphash AES-NI sharded hash) with *Limiter, no background goroutines, zero-alloc hot path.
ipmap := ratelimit.NewIPMap(100, 20, 0) // rate=100/s, burst=20, default 64 shards
// Each new connection (OnAccept)
if !ipmap.Allow(conn.RemoteIP()) {
conn.Close()
return
}
// Batch operation
if !ipmap.AllowN(ip, 5) { return ErrRateLimited }
// Get underlying Limiter (diagnostics)
lim := ipmap.Get(ip)
log.Printf("ip=%s tokens=%d", ip, lim.Tokens())
// Periodically evict inactive IPs (prevent memory leak)
go func() {
for range time.Tick(5 * time.Minute) {
n := ipmap.Evict(10 * time.Minute)
log.Printf("evicted %d stale IP limiters", n)
}
}()
// Clear all (config change / test init)
ipmap.Purge()
// Strict mode: eliminates TOCTOU race on first Limiter creation (exactly-once semantics)
ipmap := ratelimit.NewIPMap(100, 20, 0, ratelimit.WithStrictNewIP())
// Access hotspot analysis: top N IPs by hit count
top := ipmap.TopN(5)
for _, e := range top {
log.Printf("ip=%s hits=%d recent=%d", e.IP, e.Hits, e.RecentHits)
}
// Recent hotspot analysis: sorted by ~1min sliding window activity
recent := ipmap.TopNRecent(5)
for _, e := range recent {
log.Printf("ip=%s recent_hits=%d", e.IP, e.RecentHits)
}
| Method | Description |
|---|---|
| `Allow(ip)` | Consume 1 token; false = rate limited |
| `AllowN(ip, n)` | Batch consume n tokens |
| `Get(ip)` | Get (or create) the underlying `*Limiter` |
| `Evict(ttl)` | Evict IPs inactive within ttl; returns eviction count |
| `Len()` | Current tracked IP count |
| `Purge()` | Clear all entries |
| `TopN(n)` | Top-N by cumulative Hits (descending) |
| `TopNRecent(n)` | Top-N by recent (~1min window) RecentHits (descending) |
Measured Performance (bench_linux_6c12t.txt, Go 1.26.1, rate=1M/s, burst=100K)
| Operation | ns/op | allocs | Notes |
|---|---|---|---|
| Allow (serial, hot IP) | 49 | 0 | coarsetime + maphash shard + token-bucket CAS |
| Allow (12 parallel) | 37 | 0 | Shard lock scatter, coarsetime replaces time.Now |
| Allow (with RecentHits) | 57 | 0 | addRecent adds the same overhead as Allow |
| TopN(5, 100 IPs) | 6.3 µs | — | Min-heap top-N, occasional call |
| TopNRecent(5, 100 IPs) | 6.3 µs | — | Min-heap top-N by recent window |
Channel-backed, Go runtime native scheduling, supports non-blocking / context timeout, suitable for MaxConns limiting.
import "github.com/uniyakcom/yakutil/semaphore"
sem, err := semaphore.New(opts.MaxConns) // max concurrent connections
if err != nil {
// opts.MaxConns < 1
return err
}
// New connection entry: non-blocking fast reject
if !sem.TryAcquire() {
conn.Close()
return ErrTooManyConns
}
defer func() { _ = sem.Release() }()
// Client-side dial control: context-blocked wait
if err := sem.AcquireContext(ctx); err != nil {
return err // context.DeadlineExceeded or Canceled
}
defer func() { _ = sem.Release() }()
// Runtime diagnostics (Prometheus gauge)
sem.Count() // current held permits (active connection count)
sem.Cap() // maximum capacity
sem.Available() // remaining available permits
| | Acquire | TryAcquire | AcquireContext | TryAcquireN(n) |
|---|---|---|---|---|
| Blocking | Yes (wait indefinitely) | No (return immediately) | Yes (until ctx cancelled) | No (return immediately) |
| Count | 1 | 1 | 1 | n (all-or-nothing) |
| Suitable for | Internal background tasks | New connection entry | Client dial | Batch slot pre-reservation |
Batch operations TryAcquireN / ReleaseN:
if !sem.TryAcquireN(n) {
return ErrNotEnoughSlots
}
defer sem.ReleaseN(n)
- TryAcquireN(n) bool: atomically acquire n permits, all-or-nothing; n > Cap() always returns false
- ReleaseN(n) error: batch release n permits; returns ErrReleaseNOver if n exceeds held count
Measured Performance (bench_linux_6c12t.txt)
| Operation | ns/op | allocs |
|---|---|---|
| Acquire/Release (serial) | 29.6 | 0 |
| TryAcquire_Success (serial) | 29.9 | 0 |
| TryAcquire_Full (fast fail) | 0.7 | 0 |
| TryAcquireN(1) | 16.6 | 0 |
| TryAcquireN(1) parallel | 152 | 0 |
Ordered dictionary supporting exact lookup, prefix scan, and ordered cursor iteration. vs map: keys are ordered, prefix queries in O(|prefix|+|result|), no rehash pauses.
import "github.com/uniyakcom/yakutil/art"
var tr art.Tree[int]
tr.Put("user:1001", 42)
tr.Put("user:1002", 99)
tr.Put("session:abc", 1)
v, ok := tr.Get("user:1001") // 42, true
old, ok := tr.Delete("session:abc") // 1, true
// Ordered full traversal (lexicographic)
tr.ForEach(func(key string, val int) bool {
fmt.Println(key, val)
return true // return false to terminate early
})
// Prefix scan: only traverse keys starting with "user:"
tr.ScanPrefix("user:", func(key string, val int) bool {
fmt.Println(key, val)
return true
})
// Cursor iteration: traverse all entries strictly greater than cursor
tr.Seek("user:1001", func(key string, val int) bool {
fmt.Println(key, val) // "user:1002", 99
return true
})
// Memory monitoring + compaction (reclaim arena dead bytes after bulk deletions)
fmt.Println(tr.PrefixArenaBytes())
tr.CompactPrefixArena() // call during off-peak
// Leaf slab monitoring + compaction (reclaim fragmented slab pages after bulk deletions)
fmt.Println(tr.SlabBytes())
tr.CompactSlab() // call during off-peak, O(N)
Characteristics
| Feature | Description |
|---|---|
| Key type | Binary-safe (string, may contain \x00) |
| Node adaptation | 4 / 16 / 48 / 256 children auto upgrade/downgrade |
| Path compression | Pessimistic prefix compression; prefix stored in arena |
| Ordering | ForEach / Seek / ScanPrefix all lexicographic |
| Memory | CompactPrefixArena() reclaims dead bytes after deletions; CompactSlab() reclaims fragmented leaf slab pages |
| Thread safety | Not thread-safe; wrap all ops with sync.Mutex (or sync.RWMutex if no concurrent writes) |
Time Complexity
| Operation | Complexity |
|---|---|
| `Get` / `Put` / `Delete` | O(k), k = key length |
| `ForEach` | O(N) |
| `ScanPrefix` | O(k + \|result\|) |
| `Seek` | O(k + \|result\|) |
| `CompactPrefixArena` | O(N × avg_prefix_len) |
| `CompactSlab` | O(N) |
Performance (Intel Xeon E-2186G @ 3.80GHz, amd64, go1.26.1):
| Operation | art.Tree | map (comparison) |
|---|---|---|
| Put (sequential key) | ~124 ns | ~129 ns |
| Put (random key) | ~394 ns | ~129 ns |
| Get (sequential key) | ~64 ns | ~22 ns |
| Get (random key) | ~82 ns | ~22 ns |
| ScanPrefix | ~900 ns | ~1.3 µs (full scan) |
ScanPrefix is ~1.4× faster than a map full scan when prefix hit rate is low; sequential-key art.Put is comparable to map (prefix sharing reduces internal allocations).
Random-key exact lookup is 3–4× slower than map — the cost of ordering and adaptive node traversal. Prefer the standard map unless prefix queries are frequent.
Used for CBO selectivity estimation. Splits sorted samples into equal-frequency buckets; estimates the row fraction in any range [lo, hi].
import "github.com/uniyakcom/yakutil/hist"
h, err := hist.Build(samples, 100) // 100 buckets
if err != nil { ... }
ratio := h.EstimateRange(1.0, 50.0) // fraction of rows in [lo, hi], in [0,1]
Performance (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1):
| Operation | ns/op | allocs | Notes |
|---|---|---|---|
| Build (100K samples, 100 buckets) | ~331 µs | 3 / 24 KB | Scans for NaN/Inf first; only copies if invalid values found |
Estimates the number of distinct elements (NDV) in a stream. Standard error ~0.81% (p=14, 16384 registers).
import "github.com/uniyakcom/yakutil/hll"
s := hll.New()
s.Add([]byte("hello"))
s.AddStr("world")
s.AddHash(precomputedHash) // reuse hash across HLL + CMS + smap without recomputing
n := s.Count() // estimated distinct element count
- Internally uses maphash (AES-NI, process-random seed) to prevent HashDoS attacks
- AddHash allows injecting a pre-computed hash, avoiding redundant hashing when the same key is fed to multiple data structures
- Not thread-safe; callers must synchronize
Performance (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1, 1000 elements):
| Method | ns/op | Notes |
|---|---|---|
| AddStr (maphash, 18B key) | ~7.3 µs | 1000 calls |
| AddHash (FNV-1a, ≤8B key) | ~11.9 µs | Caller pre-computes FNV; FNV is slower than maphash for ≥18B keys |
Merge(other) error merges another HLL sketch; returns ErrMergeNil if other is nil.
Estimates the approximate frequency of elements in a data stream. Used for Top-K frequency estimation and admission filter assist.
import "github.com/uniyakcom/yakutil/sketch"
cms, err := sketch.New(1000, 0.01) // estimate 1000 keys, 1% error rate
if err != nil { ... }
cms.Add("key")
freq := cms.Count("key") // approximate frequency
// Pre-hash API: reuse hash across HLL + CMS + smap
h := hash.Sum64s("key")
cms.AddHash(h, 1)
freq = cms.CountHash(h)
- Uses 4 independent hash functions; width w is rounded up to a power of 2, using bitmask instead of modulo
- sketch.NewSized(d, w) for direct control of depth and width
- AddHash(h, count) / CountHash(h): pre-hash API avoids redundant hashing when the same key is fed to multiple data structures
- Merge(other) error: merges another CMS; returns ErrMergeNil if other is nil
- Not thread-safe
Performance (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1):
| Operation | ns/op | Notes |
|---|---|---|
| Add (1000 calls) | ~15.2 µs | Bitmask replaces modulo |
Min-heap implementation, O(N log k), efficiently extracts the top-k elements from a large collection.
import "github.com/uniyakcom/yakutil/topn"
// Descending top-10
top := topn.TopN[Item](items, 10, func(a, b Item) bool {
return a.Score > b.Score
})
// By int64 key (used by IPMap.TopN internally)
top = topn.ByKey(items, 5, func(e Item) int64 { return e.Score })
Performance (Intel Xeon E-2186G @ 3.80GHz, Go 1.26.1):
| Operation | ns/op |
|---|---|
| TopN (64 elements, top 5) | ~311 ns |
| TopN (1024 elements, top 10) | ~2.65 µs |
| ByKey (64 elements, top 5) | ~578 ns |
Large collections benefit most: O(N×k) selection sort replaced with O(N log k) heap.
500µs precision, ~1 ns/op, single background goroutine per process started automatically on import.
import "github.com/uniyakcom/yakutil/coarsetime"
ns := coarsetime.NowNano() // ~1 ns/op vs time.Now() ~35-40 ns/op
t := coarsetime.Now() // time.Time (same precision)
// Graceful shutdown (test cleanup, process exit)
coarsetime.Stop() // idempotent, safe to call multiple times
- Background goroutine (time.NewTicker(500µs)) updates a global atomic.Int64
- NowNano() = single atomic read, no syscall
- Automatically used by ratelimit.Limiter (default clock) and ratelimit.IPMap
- No Start() call needed — the clock initializes in package init()
Suitable for hot paths where time.Now() syscall overhead (~35–40 ns) is unacceptable (e.g., token bucket refill, connection timestamps).
Provides the three cryptographic building blocks required by SCRAM-SHA-256 (RFC 5802 / RFC 7677), built on crypto/hmac + crypto/sha256 only.
import "github.com/uniyakcom/yakutil/scram"
// PBKDF2-HMAC-SHA-256 (RFC 2898 §5.2)
// iter ≥ 4096 for SCRAM-SHA-256; ≥ 60000 recommended for credential storage
saltedPassword := scram.PBKDF2SHA256([]byte(password), salt, 4096, 32)
// HMAC-SHA-256
clientKey := scram.HMACSHA256(saltedPassword, "Client Key")
serverKey := scram.HMACSHA256(saltedPassword, "Server Key")
// SHA-256
storedKey := scram.SHA256Sum(clientKey)
SCRAM-SHA-256 derivation chain (RFC 7677):
SaltedPassword = PBKDF2(password, salt, iter, 32)
ClientKey = HMAC(SaltedPassword, "Client Key")
StoredKey = SHA-256(ClientKey)
ServerKey = HMAC(SaltedPassword, "Server Key")
ClientSignature= HMAC(StoredKey, AuthMessage)
ServerSignature= HMAC(ServerKey, AuthMessage)
ClientProof = ClientKey XOR ClientSignature
| Function | Signature | Note |
|---|---|---|
| `PBKDF2SHA256` | `(password, salt []byte, iter, dkLen int) ([]byte, error)` | dkLen=0 → 32; returns error on invalid params |
| `HMACSHA256` | `(key []byte, data string) []byte` | string avoids []byte(s) conversion |
| `SHA256Sum` | `(b []byte) []byte` | Wraps sha256.Sum256, returns heap slice |
Performance characteristics
- PBKDF2SHA256: single hmac.New per call; U buffer reused via Sum(U[:0]) — eliminates per-iteration heap allocation
- HMACSHA256 / SHA256Sum: thin wrappers, no additional allocation beyond stdlib
| Decision | Choice | Reason |
|---|---|---|
| Root utilities | `B2S` / `S2B` | Short naming, Go community convention |
| Hash | `Sum64` / `Sum64s` | Following `hash/` stdlib naming |
| Ring | `mpsc.Ring` / `spsc.Ring` | Package-qualified semantics; type name needs no prefix |
| Counter | `percpu.Counter` | Direct semantics |
| Backoff | `backoff.Backoff` | Standard terminology |
| API methods | `Add` `Load` `Push` `Pop` `Spin` `Alloc` `Swap` | Short naming, verb-first |
| Dual-type Map | `Map[V]` / `Map64[V]` | Separated by key type, avoids interface{} overhead |
| Timing wheel | `Wheel[T]` | Generic payload, no type assertion needed |
# Single run (25 packages)
go test ./... -count=1
# With race detection
go test ./... -count=1 -race
# Single package verbose
go test ./ring/... -v -run TestUnreadByteAll
25 packages pass tests (including -race race detection).
Use the bench.sh script in the project root to run all benchmarks at once:
# Basic run (benchtime=3s, count=1)
./bench.sh
# Custom benchtime and count
./bench.sh 5s 3
# Output file: bench_${OS}_${CORES}c${THREADS}t.txt
# Example: bench_linux_6c12t.txt
# Use benchstat to compare two bench results
go install golang.org/x/perf/cmd/benchstat@latest
benchstat before.txt after.txt
Six packages have Fuzz tests: ring, hll, ratelimit, art, lru, sketch. Use fuzz.sh for local fuzzing:
# All targets, 1m each
./fuzz.sh
# Single target, custom duration
./fuzz.sh art/FuzzPutGet 2m
# Custom duration for all
./fuzz.sh 10m
CI nightly fuzzing runs automatically via .github/workflows/fuzz.yml.
MIT © 2026 uniyak.com