
Overview

wpp edited this page Feb 5, 2026 · 5 revisions

EloqStore is a hybrid-tier key-value storage engine that combines NVMe SSDs with S3-compatible object storage to deliver memory-like latency at disk economics. Built in modern C++, it underpins EloqData products (EloqKV, EloqDoc, EloqSQL) by providing predictable performance, copy-on-write semantics, and cloud-native durability.

Architecture Highlights

  • Multi-tier design: hot in-memory B-tree index plus a warm local SSD tier; when object storage is configured it becomes the durable cold tier while local disks act as a cache. Non-leaf nodes remain resident to guarantee O(log n) point lookups.
  • Copy-on-write B-tree: writers update cloned tree fragments, so reads stay lock-free and tail latencies remain flat even under heavy batch ingest.
  • Async IO via io_uring: the engine targets Linux 6.8+ and relies on io_uring for low-overhead, high-parallelism disk access.
  • Coroutine scheduler: user-space coroutines multiplex read/write tasks on shard threads to avoid expensive kernel context switches while keeping the pipeline busy.
  • Object storage first: append-optimized data files stream to S3/MinIO-compatible backends with intelligent caching, archive snapshots, and tier-aware garbage collection.
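The copy-on-write behavior described above can be illustrated with a small path-copying sketch. This is not EloqStore's actual B-tree (the `Node`, `insert`, and `find` names are invented for illustration, and a real B-tree has wide fanout and page layouts); it only shows the core idea that a writer clones the root-to-leaf path while readers keep using the old root, lock-free:

```cpp
#include <memory>
#include <string>

// Illustrative path-copying (copy-on-write) search tree. Names are
// hypothetical, not EloqStore's actual types.
struct Node {
    std::string key, value;
    std::shared_ptr<const Node> left, right;
};

using NodePtr = std::shared_ptr<const Node>;

// Insert clones only the nodes on the root-to-leaf path; untouched
// subtrees are shared with the previous version, so a reader holding
// the old root still sees a consistent snapshot with no locks.
NodePtr insert(const NodePtr& n, std::string key, std::string value) {
    if (!n)
        return std::make_shared<const Node>(
            Node{std::move(key), std::move(value), nullptr, nullptr});
    if (key < n->key)
        return std::make_shared<const Node>(
            Node{n->key, n->value,
                 insert(n->left, std::move(key), std::move(value)), n->right});
    if (key > n->key)
        return std::make_shared<const Node>(
            Node{n->key, n->value, n->left,
                 insert(n->right, std::move(key), std::move(value))});
    // Overwrite: clone this node with the new value, share both subtrees.
    return std::make_shared<const Node>(Node{n->key, std::move(value), n->left, n->right});
}

const std::string* find(const NodePtr& n, const std::string& key) {
    if (!n) return nullptr;
    if (key < n->key) return find(n->left, key);
    if (key > n->key) return find(n->right, key);
    return &n->value;
}
```

Because old roots remain valid immutable snapshots, this same mechanism is what makes the zero-copy snapshot and branching features below cheap.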

Position in the Stack

EloqStore lives beneath EloqData's Data Substrate, which owns caching, concurrency control, and durability duties for EloqKV/EloqDoc/EloqSQL. Query engines interact only with Data Substrate; EloqStore is invoked exclusively on cache misses or when checkpoint writers flush accumulated changes. It therefore does not implement distributed caching or transaction coordination—its mandate is to serve low-latency miss reads and absorb high-throughput checkpoint batches without perturbing readers. When cloud/object storage is enabled it becomes the source of truth with local NVMe acting as a cache/staging tier; in local-only deployments the NVMe-resident data is the durable store.
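The division of labor above can be sketched as a minimal caller/store pair. All names here (`Store`, `Substrate`, `read`, `flush`) are hypothetical stand-ins, not EloqStore's real API; the point is only that the store layer is touched on cache misses and checkpoint flushes, never on hot reads:

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical store layer: serves point reads and absorbs batch flushes.
struct Store {
    std::map<std::string, std::string> pages;  // stands in for on-disk data

    std::optional<std::string> read(const std::string& k) const {
        auto it = pages.find(k);
        if (it == pages.end()) return std::nullopt;
        return it->second;
    }
    // Checkpoint writers flush accumulated changes as one batch.
    void flush(const std::vector<std::pair<std::string, std::string>>& batch) {
        for (const auto& kv : batch) pages[kv.first] = kv.second;
    }
};

// Hypothetical caching layer in the role of Data Substrate: it owns the
// hot tier and only falls through to the store on a miss.
struct Substrate {
    std::map<std::string, std::string> cache;
    Store& store;
    int store_reads = 0;  // counts how often the store is actually hit

    std::optional<std::string> get(const std::string& k) {
        if (auto it = cache.find(k); it != cache.end())
            return it->second;        // hot hit: store is never invoked
        ++store_reads;                // cold miss: one store read
        auto v = store.read(k);
        if (v) cache[k] = *v;
        return v;
    }
};
```

In this sketch a repeated read of the same key hits the cache and leaves the store's read counter unchanged, mirroring the claim that EloqStore only serves miss reads and checkpoint batches.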

Design Philosophy

  • NVMe-first: prioritize local NVMe bandwidth/IOPS and treat DRAM as a cache that already lives in Data Substrate. EloqStore leans on SSD-friendly B+-trees with predictable height and one I/O per lookup.
  • Predictable tail latency: a copy-on-write B-tree, cached inner nodes, and coroutine-driven io_uring minimize jitter when servicing cache-miss reads.
  • Cost efficiency: couple commodity NVMe (high performance, ephemeral) with object storage (durable, cheaper per TB) instead of relying on DRAM-heavy deployments or premium block storage.
  • Simplicity over abstraction: a single writer model, explicit batch APIs, and well-defined maintenance jobs keep the engine understandable and debuggable.
  • Cloud-native: failure is expected; local NVMe disappears with the VM. EloqStore assumes cold tiers live in object storage and that instances are reprovisioned quickly.

Core Assumptions

  • Modern servers expose high-IOPS NVMe devices and io_uring-capable kernels.
  • Data Substrate absorbs transactional writes and hot reads; EloqStore mainly handles batch flushes and cold cache misses.
  • Node crashes lose locally attached NVMe, so durability comes from object storage or replicated logs outside EloqStore.

Non-Goals

  • Deep dives into data structures—the focus here is architectural intent, not page layouts or compaction internals.
  • Benchmark tables or micro-bench numbers (they belong in dedicated perf reports).
  • API minutiae—the API Usage Guide covers request semantics separately.

Core Features

  • Batch write pipeline with strict key ordering, caller-managed timestamps, and TTL-aware entries to support high-throughput ingestion.
  • Single-I/O point reads thanks to cached upper B-tree layers and deterministic page layout.
  • Zero-copy snapshots & agent branching via copy-on-write manifests, enabling instant dataset forks for AI experimentation or multi-tenant isolation.
  • Maintenance suite (archive, compact, clean-expired) to keep append-only storage healthy in both local and cloud deployments.
  • Shard-per-core execution: partitions hash onto shards, and each shard pins to a core, eliminating thread-pool tuning. Coroutine context switches happen in user space and are significantly cheaper than thread switches, so EloqStore sustains high NVMe IOPS with fewer CPU cycles and no workload-specific tuning, even if a carefully tuned thread pool could approach similar throughput.
  • Multi-path storage layout: a single instance can stripe data across multiple local paths automatically; in cloud/object storage mode it partitions files under different prefixes, making it easy to assign ranges to distinct nodes and scale out horizontally.
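The shard-per-core routing described above boils down to a deterministic key-to-shard mapping. A minimal sketch (the `ShardRouter` name and the use of `std::hash` are illustrative assumptions, not the engine's actual hash function):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Illustrative shard routing: every key maps deterministically to one
// shard; in a shard-per-core design, each shard would be pinned to a
// core and serviced by a single thread running user-space coroutines.
struct ShardRouter {
    std::size_t num_shards;

    std::size_t shard_of(const std::string& key) const {
        // Stable key -> shard mapping; no cross-shard locking is needed
        // because a given key is always handled by the same thread.
        return std::hash<std::string>{}(key) % num_shards;
    }
};
```

Because the mapping is stable, all operations on a key serialize naturally on its shard's thread, which is what lets the design avoid locks and thread-pool tuning.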

Tooling & Ecosystem

  • APIs: Native C++ interface (see eloqstore/include/eloq_store.h and the API Usage Guide) with sync/async requests and INI-based configuration (benchmark/opts_append.ini).
  • SDKs: Language-specific SDKs are available.
  • Examples & Benchmarks: examples/basic_example.cpp, benchmark/simple_bench, and micro-bench/ provide ready-to-run workloads for validation and tuning.
  • Downstream integrations: EloqStore powers Redis/Valkey-compatible services, MongoDB-wire compatible document stores, relational SQL fronts, and native vector search modules.
