A distributed, strongly consistent, and crash-resilient Key-Value store based on the Raft Consensus Algorithm (Ongaro & Ousterhout). Designed for high availability and linearizable semantics in distributed environments.
Distributed systems require a "source of truth" that remains consistent even under partial network failures or node crashes. RaftKV exists to solve the coordination problem by providing a replicated state machine. It is intended for metadata management, configuration storage, or as a foundational component for more complex distributed databases.
The system follows a modular, layered architecture to decouple consensus logic from storage and networking.
- Raft Core (`src/raft/`): Implements State Machine Replication (SMR). Handles leader election, log replication, and safety invariants.
- Networking (`src/network/`): High-performance gRPC layer using `tonic`. Handles inter-node RPCs (`RequestVote`, `AppendEntries`) and the client API (`Execute`).
- Storage Layer (`src/storage/`) (a minimal sketch of the storage abstraction follows this list):
  - Log Store: Persistent Write-Ahead Log (WAL) backed by `RocksDB` for efficient sequential writes and crash recovery.
  - Snapshot Store: File-based state checkpointing to prevent unbounded log growth.
- State Machine (`src/statemachine/`): In-memory KV map (`HashMap` behind an `RwLock`) to which committed commands are applied.
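To illustrate how the layering keeps consensus logic independent of the storage engine, here is a minimal sketch of a log-store abstraction with an in-memory implementation. The trait name, method signatures, and `LogEntry` shape are assumptions for illustration, not the repository's actual API.

```rust
/// Illustrative log entry; the real type likely carries more metadata.
#[derive(Clone, Debug)]
pub struct LogEntry {
    pub term: u64,
    pub index: u64,
    pub command: Vec<u8>, // serialized PUT/DELETE command
}

/// Storage abstraction the Raft core writes through. A RocksDB-backed
/// implementation would persist entries before acknowledging RPCs.
pub trait LogStore: Send + Sync {
    fn append(&mut self, entry: LogEntry);
    fn get(&self, index: u64) -> Option<LogEntry>;
    fn last_index(&self) -> u64;
    fn truncate_from(&mut self, index: u64);
}

/// In-memory implementation, useful for unit tests and simulations.
#[derive(Default)]
pub struct MemLogStore {
    entries: Vec<LogEntry>,
}

impl LogStore for MemLogStore {
    fn append(&mut self, entry: LogEntry) {
        self.entries.push(entry);
    }
    fn get(&self, index: u64) -> Option<LogEntry> {
        self.entries.iter().find(|e| e.index == index).cloned()
    }
    fn last_index(&self) -> u64 {
        self.entries.last().map(|e| e.index).unwrap_or(0)
    }
    fn truncate_from(&mut self, index: u64) {
        // Drop conflicting suffixes when a leader overwrites follower entries.
        self.entries.retain(|e| e.index < index);
    }
}

fn main() {
    let mut log = MemLogStore::default();
    log.append(LogEntry { term: 1, index: 1, command: b"PUT x 1".to_vec() });
    assert_eq!(log.last_index(), 1);
}
```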
- Mutation: Client sends `PUT`/`DELETE` to `KVService`.
- Replication: The leader appends the command to its local log and broadcasts `AppendEntries` to followers.
- Consensus: Once a majority (quorum) acknowledges, the leader advances its `commit_index` (see the commit-rule sketch after this list).
- Execution: Committed entries are applied to the `KVStore` state machine.
- Response: The result is returned to the client.
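The quorum step boils down to a simple rule: the leader may advance `commit_index` to the highest log index stored on a majority of nodes. A hedged, self-contained sketch of that calculation (function name and inputs are illustrative, not the project's code):

```rust
/// Given the leader's own last log index and each follower's acknowledged
/// match index, return the highest index replicated on a strict majority.
/// Note: a full Raft implementation additionally requires the entry at that
/// index to be from the leader's current term before committing it.
fn quorum_commit_index(leader_last_index: u64, follower_match_index: &[u64]) -> u64 {
    // Pool the leader's index with the followers' acknowledgements.
    let mut indices: Vec<u64> = follower_match_index.to_vec();
    indices.push(leader_last_index);
    indices.sort_unstable_by(|a, b| b.cmp(a)); // descending
    // The (N/2)-th entry (0-based) is stored on a majority of the N nodes.
    indices[indices.len() / 2]
}

fn main() {
    // 3-node cluster: leader at index 7, followers acknowledged 7 and 5.
    assert_eq!(quorum_commit_index(7, &[7, 5]), 7);
    // 5-node cluster: two followers have caught up to 9, two lag behind.
    assert_eq!(quorum_commit_index(9, &[9, 9, 4, 3]), 9);
    println!("commit rule checks passed");
}
```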
- Rust: Memory safety without GC, critical for predictable latency in consensus.
- Tokio: Asynchronous runtime for high-concurrency I/O.
- Tonic/Prost: gRPC implementation for efficient, type-safe inter-node communication.
- RocksDB: Industrial-grade persistent key-value store used for the Raft WAL.
- Bincode/Serde: Compact, low-overhead binary serialization for log entries and snapshots (round-trip example below).
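As a hedged example of how a log entry might be encoded with Serde + Bincode (1.x API) before being written to the WAL; the `LogEntry` fields here are illustrative, not the repository's exact type:

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, PartialEq, Serialize, Deserialize)]
struct LogEntry {
    term: u64,
    index: u64,
    command: Vec<u8>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let entry = LogEntry { term: 2, index: 17, command: b"PUT k v".to_vec() };

    // Compact binary encoding suitable for storing as a RocksDB value.
    let bytes: Vec<u8> = bincode::serialize(&entry)?;

    // Decode on crash recovery; the round trip must be lossless.
    let decoded: LogEntry = bincode::deserialize(&bytes)?;
    assert_eq!(entry, decoded);
    Ok(())
}
```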
- Rust 1.70+
- Protobuf Compiler (`protoc`)
- LLVM/Clang (required by `rust-rocksdb`)
To simulate a 3-node cluster locally:

    chmod +x scripts/simulate_cluster.sh
    ./scripts/simulate_cluster.sh

The script initializes 3 nodes, triggers an election, simulates a leader crash, and verifies recovery.
- Distributed Configuration: Consistent settings across a fleet of microservices.
- Service Discovery: Reliable registry for dynamic service endpoints.
- Distributed Locking: Foundation for a distributed lock manager (DLM).
- Strong Persistence Foundation: Direct use of RocksDB for the WAL ensures that `currentTerm` and `votedFor` are flushed before RPC responses, a common pitfall in naive implementations (see the hard-state sketch after this list).
- Async-First Design: Leveraging `tokio::select!` for the main event loop cleanly handles heartbeats, election timeouts, and RPC messages without complex threading (event-loop sketch below).
- Clean Abstractions: The `LogStore` trait allows swapping storage engines (e.g., in-memory for testing, RocksDB for production).
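A minimal sketch of durably persisting the Raft hard state with a synchronous RocksDB write before replying to an RPC. The key names and function are assumptions for illustration; the project's actual schema may differ.

```rust
use rocksdb::{WriteBatch, WriteOptions, DB};

/// Persist Raft hard state (`currentTerm`, `votedFor`) with a synced write so
/// it survives a crash that happens immediately after we reply to the RPC.
fn persist_hard_state(db: &DB, current_term: u64, voted_for: Option<u64>) -> Result<(), rocksdb::Error> {
    let mut batch = WriteBatch::default();
    batch.put(b"raft/current_term", current_term.to_be_bytes());
    batch.put(
        b"raft/voted_for",
        voted_for.map(|id| id.to_be_bytes().to_vec()).unwrap_or_default(),
    );

    let mut opts = WriteOptions::default();
    opts.set_sync(true); // force the write to disk before returning
    db.write_opt(batch, &opts)
}

fn main() -> Result<(), rocksdb::Error> {
    let db = DB::open_default("/tmp/raftkv-hardstate-demo")?;
    // Grant a vote to node 2 in term 3, durably, before answering RequestVote.
    persist_hard_state(&db, 3, Some(2))
}
```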
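And a simplified sketch of the `tokio::select!`-driven event loop; the message enum, timeouts, and channel wiring are assumed names for illustration, not the repository's actual loop.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{interval, sleep_until, Instant};

#[allow(dead_code)]
#[derive(Debug)]
enum RaftMsg {
    RequestVote { term: u64 },
    AppendEntries { term: u64 },
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<RaftMsg>(64);
    let mut heartbeat = interval(Duration::from_millis(50));
    let election_timeout = Duration::from_millis(300);
    // The election deadline is pushed forward whenever we hear from a leader.
    let mut deadline = Instant::now() + election_timeout;

    // Feed one fake RPC so this self-contained example terminates quickly.
    tx.send(RaftMsg::AppendEntries { term: 1 }).await.unwrap();
    drop(tx);

    loop {
        tokio::select! {
            _ = heartbeat.tick() => {
                // Leader only: broadcast AppendEntries heartbeats to followers here.
            }
            _ = sleep_until(deadline) => {
                // Follower/candidate: no leader heard from in time; start an election.
                println!("election timeout fired");
                break;
            }
            msg = rx.recv() => match msg {
                Some(m) => {
                    println!("handling RPC: {m:?}");
                    deadline = Instant::now() + election_timeout; // reset the timer
                }
                None => break, // all senders dropped; shut the loop down
            },
        }
    }
}
```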
- Read Linearizability: The current `GET` implementation is a placeholder. Production use requires ReadIndex or lease-based reads to avoid returning stale data from followers.
- Membership Changes: The cluster configuration is static; adding or removing nodes requires a full restart (single-server membership change is not implemented).
- Message Processing: The network layer is currently a skeleton; the actual gRPC client calls to peers within the Raft loop still need to be completed to move beyond simulation.
- Linearizable Reads: Implement ReadIndex (Section 6.4 of the Raft dissertation) so clients always see the latest committed state (see the sketch after this list).
- Dynamic Membership: Implement `AddNode`/`RemoveNode` via joint consensus.
- Pipelining & Batching: Batch log entries in `AppendEntries` and pipeline RPCs to saturate network bandwidth.
- gRPC Health Probing: Integrated health checks for faster failure detection.
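For reference, a compact, synchronous sketch of the ReadIndex steps a leader would follow; the struct, stubbed leadership check, and apply logic are assumptions for illustration only, not a working implementation.

```rust
use std::collections::HashMap;

struct Node {
    commit_index: u64,
    last_applied: u64,
    store: HashMap<String, String>,
}

impl Node {
    /// Confirm leadership by exchanging one heartbeat round with a quorum.
    /// Stubbed here; a real implementation awaits AppendEntries responses.
    fn confirm_leadership(&self) -> bool {
        true // assumption: this node is still the leader
    }

    /// Apply committed-but-unapplied entries; stubbed for the sketch.
    fn apply_up_to(&mut self, index: u64) {
        self.last_applied = self.last_applied.max(index);
    }

    /// Linearizable read following the ReadIndex recipe.
    fn read_index_get(&mut self, key: &str) -> Option<String> {
        // 1. Record the current commit index as the read barrier.
        let read_index = self.commit_index;
        // 2. Confirm we are still the leader before serving the read.
        if !self.confirm_leadership() {
            return None; // in practice, redirect the client to the real leader
        }
        // 3. Wait until the state machine has applied at least up to read_index.
        if self.last_applied < read_index {
            self.apply_up_to(read_index);
        }
        // 4. Serve the read from local state; the result is now linearizable.
        self.store.get(key).cloned()
    }
}

fn main() {
    let mut node = Node { commit_index: 3, last_applied: 2, store: HashMap::new() };
    node.store.insert("config/replicas".into(), "3".into());
    println!("{:?}", node.read_index_get("config/replicas"));
}
```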