feat: Raft Consensus Implementation #2

martinffx · 2025-10-15T19:01:48Z

Feature Overview

Implements core Raft consensus components for Seshat. This PR adds the foundational pieces needed for distributed consensus but does not wire them into a working cluster yet.

What Was Built

MemStorage - Raft Log Storage

Implements the raft::Storage trait for in-memory Raft log management:

Log entry storage with append/compact operations
Hard state storage (term, vote, commit index), held in memory only
Snapshot creation and restoration
Thread-safe with RwLock
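
As a rough illustration of the shape described above, here is a minimal std-only sketch. It does not use raft-rs: `Entry` and `HardState` are hypothetical stand-ins for the raft-rs types, and the method bodies are assumptions rather than the code in this PR.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for a Raft log entry (not the raft-rs `Entry`).
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub index: u64,
    pub term: u64,
    pub data: Vec<u8>,
}

/// Hypothetical stand-in for the hard state (term, vote, commit index).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct HardState {
    pub term: u64,
    pub vote: u64,
    pub commit: u64,
}

/// In-memory storage with each field behind its own RwLock.
#[derive(Default)]
pub struct MemStorage {
    hard_state: RwLock<HardState>,
    snapshot_index: RwLock<u64>,
    entries: RwLock<Vec<Entry>>,
}

impl MemStorage {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn set_hard_state(&self, hs: HardState) {
        *self.hard_state.write().expect("hard_state lock poisoned") = hs;
    }

    /// Returns a clone so callers cannot mutate the stored state.
    pub fn hard_state(&self) -> HardState {
        self.hard_state.read().expect("hard_state lock poisoned").clone()
    }

    pub fn append(&self, new: &[Entry]) {
        self.entries.write().expect("entries lock poisoned").extend_from_slice(new);
    }

    /// Records a snapshot at `index` and drops the entries it covers.
    /// Both write guards are held together so readers never see a half update.
    pub fn apply_snapshot(&self, index: u64) {
        let mut snap = self.snapshot_index.write().expect("snapshot lock poisoned");
        let mut entries = self.entries.write().expect("entries lock poisoned");
        *snap = index;
        entries.retain(|e| e.index > index);
    }

    /// First available index: the first entry, else snapshot index + 1.
    pub fn first_index(&self) -> u64 {
        let snap = *self.snapshot_index.read().expect("snapshot lock poisoned");
        let entries = self.entries.read().expect("entries lock poisoned");
        entries.first().map_or(snap + 1, |e| e.index)
    }

    /// Last index: the last entry, else the snapshot index.
    pub fn last_index(&self) -> u64 {
        let snap = *self.snapshot_index.read().expect("snapshot lock poisoned");
        let entries = self.entries.read().expect("entries lock poisoned");
        entries.last().map_or(snap, |e| e.index)
    }
}
```

Locks are always taken in the same order (snapshot index, then entries), which is one simple way to avoid lock-order deadlocks when several fields are guarded separately.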

RaftNode - Consensus Coordinator

Orchestrates Raft consensus operations:

Ready processing loop (tick, propose, apply)
Proposal submission interface
Leader status queries
Hard state management

StateMachine - KV Operations

Deterministic state machine for KV operations:

Operation application (GET, SET, DEL, EXISTS, PING)
Snapshot generation and restoration
Bincode serialization for operations
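
A minimal sketch of what a deterministic, idempotent apply step could look like. It is an assumption-laden illustration: the commit log below mentions bincode deserialization inside apply(), which this sketch skips by taking an already-decoded `Operation`, and the `bool` return value is a hypothetical choice.

```rust
use std::collections::HashMap;

/// Mutating operations, mirroring the Set/Del variants named in the commit log.
#[derive(Clone, Debug, PartialEq)]
pub enum Operation {
    Set { key: Vec<u8>, value: Vec<u8> },
    Del { key: Vec<u8> },
}

#[derive(Default)]
pub struct StateMachine {
    data: HashMap<Vec<u8>, Vec<u8>>,
    last_applied: u64,
}

impl StateMachine {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.data.get(key)
    }

    pub fn exists(&self, key: &[u8]) -> bool {
        self.data.contains_key(key)
    }

    pub fn last_applied(&self) -> u64 {
        self.last_applied
    }

    /// Applies `op` at log position `index`.
    /// Returns false (and changes nothing) if the index was already applied,
    /// so replaying a committed entry is harmless.
    pub fn apply(&mut self, index: u64, op: Operation) -> bool {
        if index <= self.last_applied {
            return false;
        }
        match op {
            Operation::Set { key, value } => {
                self.data.insert(key, value);
            }
            Operation::Del { key } => {
                self.data.remove(&key);
            }
        }
        self.last_applied = index;
        true
    }
}
```

Because every replica applies the same operations in the same log order and skips duplicates by index, all replicas end up with the same map contents.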

gRPC Transport Layer

Custom protobuf-based transport for Raft messages:

Own proto definitions (not using raft-proto)
Bridges raft-rs (prost 0.11) with modern gRPC (tonic 0.14)
Server/client stubs for node-to-node communication

Common Types & KV Operations

Shared foundation:

Error types across crates (250 lines)
Type-safe wrappers: NodeId, LogIndex, Term (192 lines)
KV operation definitions with serialization (405 lines)
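
One common way to get type-safe identifiers is the newtype pattern, sketched below. Whether the common crate uses newtypes or plain aliases is not shown on this page, so treat the definitions and the `is_up_to_date` helper as hypothetical.

```rust
use std::fmt;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct NodeId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Term(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct LogIndex(pub u64);

impl LogIndex {
    /// The index that follows this one.
    pub fn next(self) -> LogIndex {
        LogIndex(self.0 + 1)
    }
}

impl fmt::Display for Term {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "term {}", self.0)
    }
}

/// Raft's "at least as up to date" comparison: higher last term wins,
/// ties are broken by the higher last index. Passing a `LogIndex` where a
/// `Term` is expected is a compile error, which is the point of the wrappers.
pub fn is_up_to_date(candidate: (Term, LogIndex), local: (Term, LogIndex)) -> bool {
    candidate >= local
}
```

With bare `u64` aliases the compiler would accept swapped arguments; with distinct wrapper types it rejects them.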

Implement Phase 1 (Common Types Foundation) of Raft consensus feature:
- Add type aliases: NodeId, Term, LogIndex with comprehensive docs
- Define Error enum with thiserror for ergonomic error handling
- Add 32 passing unit tests (100% Phase 1 test coverage)
- Update task tracking with executive summary and progress metrics
Phase 1 Status: 2/2 tasks complete (100%)
Overall Progress: 2/24 tasks (8%)
Test Coverage:
- crates/common/src/types.rs: 10 tests passing
- crates/common/src/errors.rs: 20 tests passing
- All doctests passing
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Implement Phase 4 Storage Layer task 1 (mem_storage_skeleton):
- Create MemStorage struct with thread-safe RwLock fields
- Add comprehensive test coverage (13 storage tests + 2 doctests)
- Switch raft-rs to prost-codec to avoid protobuf version conflicts
Implementation Details:
- MemStorage with HardState, ConfState, Vec<Entry>, Snapshot fields
- Thread-safe design (Send + Sync) using RwLock for concurrent access
- new() constructor with Default trait implementation
- Comprehensive documentation with usage examples
Dependencies:
- raft = { version = "0.7", default-features = false, features = ["prost-codec"] }
- tokio = { version = "1", features = ["full"] }
Fixes:
- Fix clippy warnings in common crate (inline format args, assign ops)
- Fix mise lint task (remove --all-features flag causing protobuf conflicts)
Test Results:
- 46 tests passing workspace-wide
- 14/14 raft crate tests passing
- 32/32 common crate tests passing
- No clippy warnings
Progress:
- Phase 4 (Storage Layer): 1/7 tasks complete (14%)
- Overall: 3/24 tasks complete (12.5%)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Implement Phase 4 Storage Layer task 2 (mem_storage_initial_state):
- Add initial_state() method returning RaftState with HardState and ConfState
- Add helper methods set_hard_state() and set_conf_state() for testing
- 11 comprehensive tests covering defaults, mutations, thread safety, and edge cases
Implementation Details:
- initial_state() acquires read locks for efficient concurrent access
- Returns cloned data to prevent mutation leaks
- Thread-safe with multiple concurrent readers
- Follows raft-rs API conventions (raft::Result<RaftState>)
Helper Methods:
- set_hard_state(hs: HardState) - Updates storage hard state
- set_conf_state(cs: ConfState) - Updates storage conf state
Test Coverage (11 new tests):
- test_initial_state_returns_defaults - Verifies term=0, vote=0, commit=0
- test_initial_state_reflects_hard_state_changes - State updates reflected
- test_initial_state_reflects_conf_state_changes - Config updates reflected
- test_initial_state_is_thread_safe - 10 concurrent threads
- test_initial_state_returns_cloned_data - Data isolation verified
- test_initial_state_multiple_calls_are_consistent - 100 consecutive calls
- test_set_hard_state_updates_storage - Direct storage verification
- test_set_conf_state_updates_storage - Direct storage verification
- test_initial_state_with_empty_conf_state - Partial state updates
- test_initial_state_with_complex_conf_state - Joint consensus scenarios
- Edge cases for configuration changes
Fixes:
- Use struct initialization syntax to satisfy clippy::field_reassign_with_default
- All 24 tests passing (13 original + 11 new)
- No clippy warnings
Progress:
- Phase 4 (Storage Layer): 2/7 tasks complete (29%)
- Overall: 4/24 tasks complete (16.7%)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Add MemStorage::entries() method with comprehensive range query support:
- Range queries [low, high) with proper bounds checking
- Size-limited queries using prost::Message::encoded_len()
- Error handling for compacted (StorageError::Compacted) and unavailable entries
- Helper methods: first_index(), last_index(), append()
- Guarantees at least one entry returned even if exceeds max_size
Test coverage (12 new tests):
- Empty and normal range queries
- Size limits and partial results
- Boundary conditions and error cases
- Thread safety with concurrent access
Dependencies: Added prost = "0.11" for message size calculation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
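
The range-query rules in this commit can be sketched without raft-rs as follows. This is an illustration under assumptions: the log is a contiguous slice, `Entry::size()` is a hypothetical stand-in for prost's encoded_len(), and an empty log is reported as Unavailable.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub index: u64,
    pub term: u64,
    pub data: Vec<u8>,
}

impl Entry {
    /// Hypothetical size estimate standing in for the encoded length.
    fn size(&self) -> u64 {
        16 + self.data.len() as u64
    }
}

#[derive(Debug, PartialEq)]
pub enum StorageError {
    Compacted,
    Unavailable,
}

/// Returns entries in [low, high), stopping once `max_size` is exceeded,
/// but always returning at least one entry for a non-empty range.
pub fn entries(
    log: &[Entry],
    low: u64,
    high: u64,
    max_size: Option<u64>,
) -> Result<Vec<Entry>, StorageError> {
    if low >= high {
        return Ok(Vec::new());
    }
    let (first, last) = match (log.first(), log.last()) {
        (Some(f), Some(l)) => (f.index, l.index),
        _ => return Err(StorageError::Unavailable),
    };
    if low < first {
        return Err(StorageError::Compacted);
    }
    if high > last + 1 {
        return Err(StorageError::Unavailable);
    }
    let start = (low - first) as usize;
    let end = (high - first) as usize;
    let mut out = Vec::new();
    let mut total = 0u64;
    for e in &log[start..end] {
        total += e.size();
        if let Some(limit) = max_size {
            // The first entry is always kept, even if it alone exceeds the limit.
            if !out.is_empty() && total > limit {
                break;
            }
        }
        out.push(e.clone());
    }
    Ok(out)
}
```

Returning at least one entry matters because a follower could otherwise never receive an entry larger than the size limit.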

Add MemStorage::term() method with comprehensive term lookup support:
- Special case: term(0) always returns 0 (Raft convention)
- Returns snapshot.metadata.term for snapshot index
- Proper bounds checking with first_index() and last_index()
- Error handling for compacted (StorageError::Compacted) entries
- Error handling for unavailable (StorageError::Unavailable) entries
- Thread-safe with RwLock read access
Test coverage (9 new tests):
- Index 0 returns 0
- Valid indices return correct terms
- Snapshot index returns snapshot term
- Compacted and unavailable error cases
- Empty storage and snapshot-only scenarios
- Thread safety with concurrent access
- Boundary conditions
Progress: 6/24 tasks (25%), Storage Layer 57% (4/7)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
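
The lookup order listed in this commit could be sketched like this. The `Log` struct and its fields are hypothetical, locking is left out, and entries are assumed to be contiguous starting at snapshot_index + 1.

```rust
#[derive(Debug, PartialEq)]
pub enum StorageError {
    Compacted,
    Unavailable,
}

/// Hypothetical, lock-free view of the log used only for this sketch.
pub struct Log {
    pub snapshot_index: u64,
    pub snapshot_term: u64,
    /// Term of each entry; position 0 holds index snapshot_index + 1.
    pub terms: Vec<u64>,
}

impl Log {
    pub fn term(&self, index: u64) -> Result<u64, StorageError> {
        // Raft convention: index 0 has term 0.
        if index == 0 {
            return Ok(0);
        }
        // The snapshot remembers the term of its last included entry.
        if index == self.snapshot_index {
            return Ok(self.snapshot_term);
        }
        // Anything older than the snapshot has been compacted away.
        if index < self.snapshot_index {
            return Err(StorageError::Compacted);
        }
        let offset = (index - self.snapshot_index - 1) as usize;
        self.terms.get(offset).copied().ok_or(StorageError::Unavailable)
    }
}
```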

Add 18 comprehensive tests for existing first_index() and last_index() methods:
first_index() tests (6 tests):
- Empty log returns 1
- After append returns correct index
- With snapshot returns snapshot.index + 1
- Snapshot with entries scenario
- After compaction with sparse entries
- Entries not starting at index 1
last_index() tests (6 tests):
- Empty log returns 0
- After append returns last entry index
- Snapshot only returns snapshot.index
- Snapshot with entries returns last entry
- Multiple appends update correctly
- Single entry edge case
Invariant & safety tests (6 tests):
- Verify first_index <= last_index + 1 always holds
- Boundary conditions (empty, single, multiple)
- Thread safety with concurrent access
- Consistency across multiple calls
- Large snapshot indices handling
- Multiple scenario lifecycle testing
All methods already implemented and working - this formalizes them with comprehensive test coverage per acceptance criteria.
Progress: 7/24 tasks (29.2%), Storage Layer 71% (5/7)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Add MemStorage::snapshot() method with Phase 1 simplified implementation:
- Always returns current snapshot (ignores request_index in Phase 1)
- Thread-safe with RwLock read access
- Returns cloned snapshot to prevent mutation leaks
- Comprehensive documentation with Phase 1 simplification note
Test coverage (7 new tests):
- Default snapshot on new storage
- Stored snapshot retrieval
- Phase 1 behavior (ignores request_index)
- Complex metadata (ConfState with voters/learners)
- Large data payloads (10KB)
- Clone independence validation
- Thread safety (10 threads × 100 iterations)
Implementation notes:
- Phase 1: Simple read-lock-clone-return pattern
- Future phases may return SnapshotTemporarilyUnavailable
- Validates snapshot data integrity (metadata + data)
- 1000 total concurrent reads tested
Progress: 8/24 tasks (33.3%), Storage Layer 86% (6/7)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Update mise check task to:
- Format code (not just check formatting)
- Include build step
- Use cleaner depends pattern
Now runs: format → lint → build → test

Implement apply_snapshot() and wl_append_entries() to complete the Storage Layer implementation. Both methods use proper Raft semantics:
- apply_snapshot(): Replaces storage state with snapshot, clears covered entries, updates hard_state and conf_state
- wl_append_entries(): Appends entries with conflict resolution (compares terms, truncates on mismatch)
Adds 16 comprehensive tests covering:
- Snapshot installation with state updates
- Entry appending with conflict resolution
- Thread safety with concurrent operations
- Edge cases (empty log, conflicting terms)
All 86 tests passing with zero clippy warnings. Storage Layer (Phase 4) now 100% complete (7/7 tasks).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
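
The conflict-resolution rule (compare terms, truncate on mismatch) can be illustrated with a small sketch. It assumes an uncompacted log whose first entry has index 1; `append_entries` and `AppendError` are hypothetical names, not the signature of wl_append_entries().

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub index: u64,
    pub term: u64,
}

#[derive(Debug, PartialEq)]
pub enum AppendError {
    /// The incoming entry would leave a hole in the log.
    Gap { expected: u64, got: u64 },
}

/// Appends `incoming` to `log`, assuming `log[0]` has index 1.
/// An entry that matches an existing one (same index and term) is skipped;
/// on a term mismatch the log is truncated from that index and the
/// incoming entry replaces the old suffix.
pub fn append_entries(log: &mut Vec<Entry>, incoming: &[Entry]) -> Result<(), AppendError> {
    for e in incoming {
        let next = log.len() as u64 + 1; // next index the log can accept
        if e.index == 0 || e.index > next {
            return Err(AppendError::Gap { expected: next, got: e.index });
        }
        let pos = (e.index - 1) as usize;
        if pos < log.len() {
            if log[pos].term == e.term {
                continue; // already stored
            }
            log.truncate(pos); // conflict: drop this entry and everything after it
        }
        log.push(e.clone());
    }
    Ok(())
}
```

Note that this sketch may have appended part of a batch before it reports a gap; a production implementation would likely validate the batch first.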

Corrected protobuf enum variant names (Normal, ConfChange, Noop) and updated all format strings to use inline variable syntax for clippy compliance.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Add Operation types and StateMachine implementation for Raft consensus:
Protocol Layer (operations.rs):
- Operation enum with Set/Del variants for key-value mutations
- Serialization/deserialization using bincode
- apply() method for executing operations on HashMap
- 17 tests covering all operation scenarios
State Machine (state_machine.rs):
- StateMachine struct with HashMap data and last_applied tracking
- Core methods: new(), get(), exists(), last_applied()
- apply() method with Operation deserialization and idempotency
- 19 tests covering all state machine operations
- Integration with protocol crate Operation types
Progress: 16/24 tasks complete (66.7%), Phase 5 at 67% (2/3)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Implement snapshot/restore functionality for log compaction:
- Add snapshot() method to serialize state machine using bincode
- Add restore() method to deserialize and replace state
- Add Serialize/Deserialize derives to StateMachine struct
- Add bincode 1.3 dependency to raft crate
Tests: 9 new unit tests + 2 doc tests covering:
- Empty snapshot creation
- Snapshot with data
- Restore from snapshot
- Roundtrip serialization
- Error handling for invalid data
- Large state (100 keys) performance
- State overwrite verification
All 147 tests passing (123 unit + 24 doc tests)
Phase 5 (State Machine) now 100% complete (3/3 tasks)
Overall progress: 70.8% (17/24 tasks)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Add RaftNode struct wrapping raft-rs RawNode
- Implement new() for node initialization
- Implement tick() for logical clock advancement
- Implement propose() for client command submission
- Implement handle_ready() for Raft state processing
- Add apply_committed_entries() helper method
- Add MemStorage::append() for entry persistence
- Add comprehensive test coverage (22 tests)
- Update progress: 83.3% complete (20/24 tasks)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Implement is_leader() to check if node is leader
- Implement leader_id() to get current leader ID
- Add 8 comprehensive tests for leader queries
- Complete Phase 6 (Raft Node) - 100% done
- Update progress: 87.5% complete (21/24 tasks)
- Ready for Phase 7 (Integration)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Address code review feedback:
- Replace .unwrap() with .expect() for descriptive error messages
- Fix TOCTOU races in entries() and term() by acquiring locks once
- Add defensive logging in apply_committed_entries()
- Document lock poisoning philosophy for Phase 1
All 199 tests passing, zero clippy warnings.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
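
To make the TOCTOU point concrete, here is a small sketch contrasting a check-then-read that takes the lock twice with one that holds a single guard. The `Log` type and both methods are hypothetical and only illustrate the pattern, not the actual entries() or term() code.

```rust
use std::sync::RwLock;

/// Hypothetical log of entry terms; position i holds the term of index i + 1.
pub struct Log {
    terms: RwLock<Vec<u64>>,
}

impl Log {
    pub fn new(terms: Vec<u64>) -> Self {
        Log { terms: RwLock::new(terms) }
    }

    /// Writer used to show how the log can shrink between two reads.
    pub fn truncate(&self, len: usize) {
        self.terms.write().expect("terms lock poisoned").truncate(len);
    }

    /// Racy shape: the bounds check and the read use separate guards.
    /// A concurrent truncate() between them can make the indexing panic.
    pub fn term_racy(&self, index: usize) -> Option<u64> {
        let len = self.terms.read().expect("terms lock poisoned").len();
        if index == 0 || index > len {
            return None;
        }
        // The first guard is already dropped here.
        Some(self.terms.read().expect("terms lock poisoned")[index - 1])
    }

    /// Fixed shape: one guard covers both the check and the read.
    pub fn term(&self, index: usize) -> Option<u64> {
        let terms = self.terms.read().expect("terms lock poisoned");
        if index == 0 {
            return None;
        }
        terms.get(index - 1).copied()
    }
}
```

Holding one guard for the whole operation means the bounds that were checked are the bounds that are read.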

Implement transport layer for Raft message communication:
- Add TransportServer/Client with gRPC (tonic 0.12, prost 0.13)
- Bridge prost 0.11 (raft-rs) ↔ 0.13 (transport) via conversion layer
- Extract KV operations to separate crate (seshat-kv)
- Rename protocol → protocol-resp as RESP placeholder
- Remove custom protobuf definitions (use raft-rs built-ins internally)
Benefits:
- Modern gRPC stack (2024/2025 versions) for transport
- No version lock on rest of service
- Clean isolation of old prost dependency
Tests: 203 passing (157 unit + 13 integration + 33 doctests)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Upgrade transport layer to latest versions:
- tonic 0.12 → 0.14
- prost 0.13 → 0.14
- Use tonic-prost-build instead of tonic-build (API change)
- Add tonic-prost runtime dependency for generated code
All 203 tests passing.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Remove placeholder add() function from lib.rs
- Add prost version bridging documentation in storage.rs
- Replace eprintln! with log::warn! for structured logging
- Document direct field access rationale in is_leader()
- Remove outdated #[allow(dead_code)] on MemStorage
- Add log dependency for proper logging infrastructure
All 156 library tests and 13 integration tests passing.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Integrate complete RESP2/3 parser and encoder from feat/resp branch:
- Full protocol support (14 data types, 487 tests passing)
- Zero-copy parsing with bytes::Bytes
- Tokio codec integration for async I/O
- Command parser for GET, SET, DEL, EXISTS, PING
- Buffer pooling for memory efficiency
Additional changes:
- Simplify CI workflow to use mise for local/CI parity
- Fix duplicate CI runs (removed push on feat/* branches)
- Remove optional raft dependency from common crate to avoid protobuf-build conflicts
- Add --all-features to mise lint task for comprehensive testing
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

The mise install task warns about protoc but doesn't install it. CI needs protoc installed before building raft-proto dependency.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Add protoc to mise.toml tools for automatic installation. This eliminates manual protoc installation steps and ensures version consistency across local development and CI environments.
Changes:
- Add protoc = "28" to [tools] in mise.toml
- Remove manual apt-get protoc installation from CI workflow
- Mise action automatically installs all tools defined in mise.toml
Benefits:
- Single source of truth for tool versions
- Automatic protoc installation in CI via mise-action
- Consistent protoc version (28.3) across all environments
- Simpler CI workflow
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

martinffx and others added 25 commits October 12, 2025 17:57

chore(mise): Improve check command to format and build

39ff488


test(raft): Add single-node bootstrap integration test

5e38892

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

feat(raft): Add propose/apply integration tests

d92d924

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Merge branch 'main' into feat/raft

948ea5b

add dataflow docs

4410707

fix(ci): Install protoc before running mise

66b292d


martinffx merged commit d094128 into main Oct 18, 2025
1 check passed

martinffx deleted the feat/raft branch October 18, 2025 17:54
