Distributed Raft Cache

A production-ready distributed cache implementation using the Raft consensus algorithm with ML-based eviction, read lease optimization, and comprehensive testing.

Project Overview

This project implements a distributed key-value cache with:

  • Raft Consensus: Ensures strong consistency across multiple nodes
  • Read Lease Optimization: 105x speedup for read operations
  • ML-Based Eviction: Intelligent cache eviction using access pattern prediction
  • Snapshot Support: Efficient state persistence and recovery
  • Multiple Read Consistency Levels: STRONG, LEASE, and EVENTUAL
  • High Availability: Automatic leader election and follower recovery

Features

Core Capabilities

Raft Consensus Protocol

  • Leader election with randomized timeouts (see the sketch after this list)
  • Log replication with majority quorum
  • Committed entry durability
  • NO-OP entries for immediate leadership confirmation
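
Randomized timeouts are what keep elections from deadlocking: if every follower waited the same fixed interval, they would all become candidates at once and split the vote. A minimal sketch of the idea in Java (the class and field names are illustrative, not the actual internals of RaftNode.java):

import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch: each follower picks a fresh random timeout in
// [base, 2*base) so two nodes rarely time out at the same moment.
final class ElectionTimer {
    private final long baseTimeoutMs;
    private volatile long deadlineNanos;

    ElectionTimer(long baseTimeoutMs) {
        this.baseTimeoutMs = baseTimeoutMs;
        reset();
    }

    // Called whenever a valid AppendEntries heartbeat arrives from the leader.
    void reset() {
        long jitterMs = ThreadLocalRandom.current().nextLong(baseTimeoutMs);
        deadlineNanos = System.nanoTime() + (baseTimeoutMs + jitterMs) * 1_000_000L;
    }

    // Polled by the follower loop; true means "become candidate and request votes".
    boolean expired() {
        return System.nanoTime() >= deadlineNanos;
    }
}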

Read Optimizations

  • ReadIndex Protocol: Linearizable reads without log appends
  • Read Lease: Time-based leases for 105x faster reads (see the sketch after this list)
  • Eventual Consistency: Fastest reads for non-critical data
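
The lease removes the network round-trip from the common read path: while the leader holds a time-bounded lease (renewed by heartbeat majorities), it can answer reads from local state; once the lease expires, reads fall back to the full ReadIndex protocol. A minimal sketch with hypothetical names, not the actual API of ReadLease.java:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: serve reads locally while the lease is valid,
// otherwise fall back to the ReadIndex heartbeat round-trip.
final class LeaseRead {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    private volatile long leaseExpiresAtNanos; // renewed on each heartbeat quorum

    String readWithLease(String key) {
        if (System.nanoTime() < leaseExpiresAtNanos) {
            return store.get(key); // local read: no network hop
        }
        return readLinearizable(key); // lease expired: full ReadIndex path
    }

    private String readLinearizable(String key) {
        // ReadIndex protocol: confirm leadership with a heartbeat majority,
        // wait for the state machine to reach that commit index, then read
        // locally (omitted here; see ReadIndexManager.java).
        return store.get(key);
    }
}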

Cache Eviction

  • ML-Based: RandomForest predictions on access patterns
  • LRU Fallback: Graceful degradation when ML unavailable (see the sketch after this list)
  • Access Tracking: Detailed statistics for prediction
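
The fallback keeps eviction non-blocking: the policy asks the ML service for a victim, and any failure or timeout degrades to plain LRU instead of stalling the cache. A hedged sketch of the pattern (the interface and names are illustrative, not the signatures in MLEvictionPolicy.java):

// Illustrative sketch: try the ML service first, fall back to LRU when
// the service is down or slow. Names are hypothetical.
final class FallbackEviction {
    interface Policy { String chooseVictim(); }

    private final Policy ml;  // asks the Python service for the least-valuable key
    private final Policy lru; // pure in-process least-recently-used

    FallbackEviction(Policy ml, Policy lru) {
        this.ml = ml;
        this.lru = lru;
    }

    String evict() {
        try {
            String victim = ml.chooseVictim(); // HTTP call to the ML service
            if (victim != null) return victim;
        } catch (RuntimeException e) {
            // ML service unavailable or timed out: degrade gracefully.
        }
        return lru.chooseVictim(); // LRU fallback
    }
}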

Persistence & Recovery

  • Snapshots: Automatic snapshot creation every 1000 entries (see the sketch after this list)
  • Log Persistence: Durable storage of Raft logs
  • Crash Recovery: Followers catch up after downtime
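
Snapshotting every 1000 entries bounds recovery time: a restarting node loads the latest snapshot and replays only the log tail after it, rather than the whole history. A minimal sketch of the trigger (names and file layout are hypothetical, not those of SnapshotManager.java):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: after every SNAPSHOT_INTERVAL applied entries,
// persist the state machine and the last applied index; log entries up
// to that index then become redundant and can be compacted.
final class SnapshotTrigger {
    static final int SNAPSHOT_INTERVAL = 1000;
    private final Path dir;

    SnapshotTrigger(Path dir) { this.dir = dir; }

    void onEntryApplied(long appliedIndex, byte[] stateMachineBytes) throws IOException {
        if (appliedIndex % SNAPSHOT_INTERVAL == 0) {
            // Durable write first; only then is the log prefix safe to truncate.
            Files.write(dir.resolve("snapshot-" + appliedIndex + ".bin"), stateMachineBytes);
        }
    }
}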

Testing

  • Follower recovery test (100% pass rate)
  • Concurrent writes test (100% pass rate)
  • Comprehensive test suite with proper timing

Architecture

┌─────────────────────────────────────────────────────────┐
│              Java Raft Cache Cluster                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Node 1     │  │   Node 2     │  │   Node 3     │  │
│  │  (Leader)    │  │  (Follower)  │  │  (Follower)  │  │
│  │              │  │              │  │              │  │
│  │ HTTP: 8081   │  │ HTTP: 8082   │  │ HTTP: 8083   │  │
│  │ Raft: 9001   │  │ Raft: 9002   │  │ Raft: 9003   │  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  │
│         │                 │                 │          │
│         └─────────────────┴─────────────────┘          │
│              Raft Protocol (Netty)                     │
└────────────────────────┬───────────────────────────────┘
                         │ HTTP REST API
                         ↓
                  ┌──────────────┐
                  │  Python ML   │
                  │  Service     │
                  │  Port: 5001  │
                  └──────────────┘
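
Clients may send requests to any node's HTTP port; a node that is not the leader forwards them to the current leader (see LeaderProxy.java in the project structure below).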

Quick Start

Prerequisites

  • Java 17 or later
  • Maven 3.6+
  • Python 3.8+ (for ML service and node manager)
  • Node.js 18+ (for demo UI)
  • jq (for test scripts)

Build the Project

# Clone the repository
git clone https://github.com/jahnavikedia/raft-cache.git
cd raft-cache

# Build with Maven
mvn clean package

# Start the ML service (optional - cache works with LRU if ML unavailable)
cd ml-service
pip install -r requirements.txt
python app.py &
cd ..

Start the Cluster with Demo UI

# Start the node manager (manages cluster lifecycle)
python node-manager.py &

# Start the frontend demo UI
cd frontend
npm install
npm run dev &
cd ..

# Open http://localhost:5173 in your browser

Alternative: Command Line Only

# Start 3-node cluster (cleans data directory for fresh start)
./scripts/start-cluster.sh

# Stop the cluster
./scripts/stop-cluster.sh

Run Tests

# Test follower recovery
./test_follower_recovery.sh

# Test concurrent writes
./test_concurrent_writes.sh

# Test read lease optimization
./scripts/test_read_lease.sh

API Usage

Write Data (PUT)

curl -X POST http://localhost:8081/cache/mykey \
  -H 'Content-Type: application/json' \
  -d '{"value":"myvalue","clientId":"client1","sequenceNumber":1}'

Read Data (GET)

# Strong consistency (linearizable via ReadIndex)
curl "http://localhost:8081/cache/mykey?consistency=strong"

# Lease-based read (105x faster when lease valid)
curl "http://localhost:8081/cache/mykey?consistency=lease"

# Eventual consistency (fastest, may be stale)
curl "http://localhost:8081/cache/mykey?consistency=eventual"

Delete Data

curl -X DELETE http://localhost:8081/cache/mykey \
  -H 'Content-Type: application/json' \
  -d '{"clientId":"client1","sequenceNumber":2}'

Check Node Status

curl http://localhost:8081/status
# Returns: {"nodeId":"node1","role":"LEADER","currentTerm":5,...}

Get Cache Statistics

curl http://localhost:8081/stats
# Returns cache size, eviction count, ML availability, etc.

Demo UI

The project includes a React-based demo UI for visualizing the Raft cluster:

  • Cluster Status: Real-time view of all nodes (Leader/Follower status, Term, Log size)
  • Raft Election: Visualize leader election by killing/starting nodes
  • Log Replication: Watch log entries replicate across the cluster
  • Cache Operations: Write, read, and delete cache entries
  • Performance: Compare read latency across consistency levels (Strong vs Lease vs Eventual)
  • ML Eviction: Demo ML-based cache eviction vs traditional LRU

Project Structure

raft-cache/
├── frontend/                              # React demo UI
│   ├── src/
│   │   ├── App.jsx                        # Main application
│   │   └── components/
│   │       ├── ClusterStatus.jsx          # Node status cards
│   │       ├── ElectionDemo.jsx           # Election visualization
│   │       ├── ReplicationDemo.jsx        # Log replication demo
│   │       ├── CacheView.jsx              # Cache operations
│   │       ├── LatencyComparison.jsx      # Performance comparison
│   │       └── EvictionDemo.jsx           # ML eviction demo
│   └── package.json
├── node-manager.py                        # Python cluster manager (Flask API)
├── src/main/java/com/distributed/cache/
│   ├── Main.java                          # Entry point
│   ├── raft/
│   │   ├── RaftNode.java                  # Core Raft logic
│   │   ├── RaftState.java                 # Leader/Follower/Candidate states
│   │   ├── LogEntry.java                  # Log entry structure
│   │   ├── ReadIndexManager.java          # ReadIndex protocol
│   │   ├── ReadLease.java                 # Lease-based reads
│   │   ├── Snapshot.java                  # Snapshot data structure
│   │   ├── SnapshotManager.java           # Snapshot creation/restoration
│   │   └── api/
│   │       ├── CacheRESTServer.java       # HTTP API
│   │       └── LeaderProxy.java           # Request forwarding
│   ├── replication/
│   │   ├── LeaderReplicator.java          # Leader-side replication
│   │   ├── FollowerReplicator.java        # Follower-side log handling
│   │   └── RaftLog.java                   # Log management
│   ├── store/
│   │   ├── KeyValueStore.java             # State machine
│   │   ├── AccessTracker.java             # Access pattern tracking
│   │   ├── AccessStats.java               # Per-key statistics
│   │   └── ReadConsistency.java           # Consistency level enum
│   ├── eviction/
│   │   ├── EvictionPolicy.java            # Eviction interface
│   │   ├── MLEvictionPolicy.java          # ML-based eviction
│   │   ├── LRUEvictionPolicy.java         # LRU fallback
│   │   ├── MLClient.java                  # ML service client
│   │   └── MLPrediction.java              # Prediction result
│   ├── network/
│   │   └── NetworkBase.java               # Netty-based communication
│   └── storage/
│       └── LogPersistence.java            # Disk persistence
├── ml-service/
│   ├── app.py                             # Flask ML service
│   ├── requirements.txt                   # Python dependencies
│   └── train_model.py                     # Model training (optional)
├── config/
│   ├── node1.yaml                         # Node 1 configuration
│   ├── node2.yaml                         # Node 2 configuration
│   └── node3.yaml                         # Node 3 configuration
├── scripts/
│   ├── start-cluster.sh                   # Start cluster helper
│   ├── stop-cluster.sh                    # Stop cluster helper
│   └── test_read_lease.sh                 # Read lease benchmark
├── test_follower_recovery.sh              # Follower recovery test
├── test_concurrent_writes.sh              # Concurrent write test
└── pom.xml                                # Maven configuration

Implementation Progress

✅ Completed Features

Core Raft (Days 1-6)

  • Leader election with randomized timeouts
  • Log replication with majority quorum
  • State machine (key-value store)
  • Log persistence and recovery
  • Network communication (Netty)
  • REST API for client operations

Advanced Features (Days 1-2)

  • ReadIndex protocol for linearizable reads
  • Read Lease optimization (105x speedup)
  • Snapshot creation and restoration
  • Access pattern tracking
  • ML-based cache eviction
  • Multiple consistency levels (STRONG/LEASE/EVENTUAL)

Testing & Quality (Day 3)

  • Follower recovery test
  • Concurrent writes test
  • Bug fix: Snapshot/log index mismatch
  • Comprehensive documentation

Performance Benchmarks

Read Latency (from test_read_lease.sh):

  • STRONG: ~50-100ms (ReadIndex protocol with heartbeat)
  • LEASE: ~0.5-1ms (lease-based, 105x faster)
  • EVENTUAL: ~0.3-0.5ms (local read, fastest)
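
The 105x figure is simply this ratio: a STRONG read in the 50-100ms range divided by a LEASE read of well under 1ms works out to roughly two orders of magnitude.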

Write Throughput:

  • 120 concurrent writes in ~8 seconds
  • 15 writes/second with 4 concurrent clients
  • 100% success rate, 0% data loss

Recovery Time:

  • Leader election: ~5 seconds
  • Follower catch-up: ~10 seconds for 50 entries
  • Snapshot restoration: <1 second

Configuration

Each node has a YAML configuration file:

# config/node1.yaml
nodeId: node1
raftPort: 9001
httpPort: 8081
dataDir: data/node1
peers:
  - nodeId: node2
    host: localhost
    port: 9002
  - nodeId: node3
    host: localhost
    port: 9003
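
A node could load this file with any YAML binder; below is a hedged sketch using SnakeYAML (the project's actual loader may differ), mapping the keys onto a plain config object:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import org.yaml.snakeyaml.Yaml;

// Illustrative sketch, not the project's actual loader: public fields
// mirror the YAML keys so SnakeYAML can bind them directly.
public class NodeConfig {
    public String nodeId;
    public int raftPort;
    public int httpPort;
    public String dataDir;
    public List<Peer> peers;

    public static class Peer {
        public String nodeId;
        public String host;
        public int port;
    }

    public static NodeConfig load(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return new Yaml().loadAs(in, NodeConfig.class);
        }
    }
}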

Testing

Test Suite

  1. Follower Recovery Test (test_follower_recovery.sh)

    • Writes 30 keys, waits for replication
    • Kills follower, writes 20 more keys
    • Restarts follower and verifies catch-up
    • Pass Criteria: ≥8/10 old keys, ≥7/10 new keys
  2. Concurrent Writes Test (test_concurrent_writes.sh)

    • Launches 4 concurrent clients
    • Each writes 30 unique keys (120 total)
    • Verifies data consistency and deduplication
    • Pass Criteria: ≥90% success, ≥90% consistency
  3. Read Lease Benchmark (scripts/test_read_lease.sh)

    • Measures read latency across consistency levels
    • Validates 105x speedup with lease-based reads

Running Tests

# Run all tests
./test_follower_recovery.sh
./test_concurrent_writes.sh
./scripts/test_read_lease.sh

# Expected output: All tests PASSED

Known Issues & Limitations

Current Limitations

  1. Snapshot/Log Reconciliation: Restarts currently rely on the start script wiping the data directory (not production-safe)
  2. No Log Rotation: Logs grow unbounded
  3. No Graceful Shutdown: Uses kill -9 in tests
  4. Hard-coded Timeouts: Should be configurable
  5. No Dynamic Membership: Cluster topology is static

Future Improvements

  • Proper snapshot/log reconciliation for restarts
  • Log rotation and compaction
  • Graceful shutdown mechanism
  • Configuration-based timeouts
  • Dynamic cluster membership (add/remove nodes)
  • Monitoring and metrics export (Prometheus)
  • Leader transfer for rolling upgrades

Technology Stack

Java Backend

  • Java 17: Core language
  • Maven: Build and dependency management
  • Netty: Asynchronous network communication
  • Javalin: Lightweight HTTP framework
  • Gson: JSON serialization
  • SLF4J + Logback: Logging

React Frontend

  • React 18: UI framework
  • Vite: Build tool
  • Axios: HTTP client
  • Lucide React: Icons
  • Tailwind CSS: Styling

Python Services

  • Flask: REST API framework (ML service + node manager)
  • scikit-learn: RandomForest for eviction prediction
  • NumPy: Numerical computations

Testing

  • Bash: Test scripts
  • jq: JSON parsing in tests
  • curl: HTTP client

Contributing

This is an academic project demonstrating distributed systems concepts. For questions or suggestions, please open an issue.

References

  • "In Search of an Understandable Consensus Algorithm" (Raft Paper) - Diego Ongaro & John Ousterhout
  • "Rethink the Linearizability Constraints of Raft" - Performance optimizations for reads
  • Raft Visualization: thesecretlivesofdata.com/raft
