Author: Pratyush Kumar (pkk5421) Date: December 2025
This project implements a distributed key-value store supporting two consistency protocols:
ABD, a quorum-based, linearizable read/write protocol:
- Requires majority quorum for reads and writes
- Reads consist of two phases: read-phase (query timestamps) and write-back
- Writes propagate a timestamped value to a majority of replicas
- Ensures linearizability under crash failures, as long as a quorum of replicas is alive
BLOCK, a simpler protocol based on the assignment hint:
- Client sends writes to all replicas
- Operation blocks until all acknowledgments arrive
- Reads return the value stored locally at any replica
- Guarantees eventual consistency but not linearizability
- Serves as a baseline for comparison with ABD (a minimal sketch of both paths follows below)
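
For intuition, here is a minimal single-process sketch of the two client-side paths. It is not the project's actual API: replicas are simulated as in-memory structs, the names (`Replica`, `abd_put`, `block_put`, ...) are illustrative, and the sequential loops stand in for concurrent gRPC calls that wait for a majority (ABD) or for all replicas (BLOCK).

```cpp
// Illustrative sketch only: replicas are in-memory structs, timestamps are
// physical clocks, and "RPCs" are direct function calls. The real system
// performs these steps over gRPC against remote replicas.
#include <chrono>
#include <cstdint>
#include <string>
#include <vector>

struct Replica {
    int64_t ts = 0;        // timestamp of the stored value
    std::string value;     // current value
};

static int64_t now_us() {
    return std::chrono::duration_cast<std::chrono::microseconds>(
               std::chrono::system_clock::now().time_since_epoch()).count();
}

// ABD write: tag the value with a physical timestamp and install it on a
// majority of replicas (here simply the first ceil((n+1)/2) in the vector).
void abd_put(std::vector<Replica>& replicas, const std::string& v) {
    int64_t ts = now_us();
    size_t acks = 0, quorum = replicas.size() / 2 + 1;
    for (auto& r : replicas) {
        if (ts > r.ts) { r.ts = ts; r.value = v; }
        if (++acks >= quorum) break;   // stop once a majority has acked
    }
}

// ABD read: query a majority, pick the highest-timestamped value, then
// write it back to a majority before returning; the write-back phase is
// what makes reads linearizable.
std::string abd_get(std::vector<Replica>& replicas) {
    size_t quorum = replicas.size() / 2 + 1;
    int64_t best_ts = -1;
    std::string best_val;
    for (size_t i = 0; i < quorum; ++i)                     // read phase
        if (replicas[i].ts > best_ts) { best_ts = replicas[i].ts; best_val = replicas[i].value; }
    for (size_t i = 0; i < quorum; ++i)                     // write-back phase
        if (best_ts > replicas[i].ts) { replicas[i].ts = best_ts; replicas[i].value = best_val; }
    return best_val;
}

// BLOCK write: send to every replica and wait for all acks.
void block_put(std::vector<Replica>& replicas, const std::string& v) {
    for (auto& r : replicas) { r.ts = now_us(); r.value = v; }  // all must ack
}

// BLOCK read: return whatever one replica stores locally.
std::string block_get(const std::vector<Replica>& replicas) {
    return replicas.front().value;
}
```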
| Component | Description |
|---|---|
| replica/ | Server implementation handling ABD + BLOCK operations |
| client/ | Client implementation + load generator |
| proto/ | Protobuf definition files |
| scripts/ | Experiment automation scripts |
| crash_logs/ | Output directory for all crash experiments |
| replicas_N.txt | Static lists of replica IPs/ports for N ∈ {1,3,5} |
The system requires:
- C++17
- gRPC 1.60+
- Protocol Buffers 3.20+
- CMake 3.10+
- Linux x86_64 (tested on Amazon Linux 2, EC2 t2.medium and t3.medium)
From the project root:
```
mkdir build
cd build
cmake ..
make -j
```

This produces the following binaries and scripts:

- replica
- client
- distributed_load_runner.sh
- run_client_crash_experiment.sh
- run_all_crash_experiments.sh
Each replica process listens on the address provided as its sole command-line argument. On each replica node (EC2 instance), run:
```
./replica <bind_address>
```

Example:

```
./replica 0.0.0.0:50051 &
./replica 0.0.0.0:50052 &
./replica 0.0.0.0:50053 &
```

On EC2, replicas typically run on different machines.
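For context, binding the replica to that address is a standard gRPC server start-up. The sketch below is not taken from the actual source: `KVStoreServiceImpl` is a placeholder for whatever service the files in proto/ define, so the service lines are left as comments.

```cpp
// Minimal sketch of binding a gRPC server to the address passed as argv[1].
// KVStoreServiceImpl is a placeholder; the real service comes from proto/.
#include <iostream>
#include <memory>
#include <string>
#include <grpcpp/grpcpp.h>

// class KVStoreServiceImpl final : public kvstore::KVStore::Service { ... };

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: ./replica <bind_address>\n";
        return 1;
    }
    std::string bind_address = argv[1];        // e.g. 0.0.0.0:50051

    // KVStoreServiceImpl service;             // placeholder service instance
    grpc::ServerBuilder builder;
    builder.AddListeningPort(bind_address, grpc::InsecureServerCredentials());
    // builder.RegisterService(&service);
    std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
    std::cout << "replica listening on " << bind_address << std::endl;
    server->Wait();                            // block until shutdown
    return 0;
}
```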
Each machine should listen on the address/port assigned to it in:
replicas_1.txt
replicas_3.txt
replicas_5.txt
For example, a valid replicas_3.txt:
```
172.31.37.220:50051
172.31.37.220:50052
172.31.47.250:50054
```
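
Loading this file amounts to reading one endpoint per line. A minimal reader might look like the sketch below; the `load_replicas` helper name is illustrative, not the project's actual function.

```cpp
// Illustrative helper: read one "ip:port" endpoint per line from replicas_N.txt.
#include <fstream>
#include <string>
#include <vector>

std::vector<std::string> load_replicas(const std::string& path) {
    std::vector<std::string> endpoints;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) endpoints.push_back(line);  // e.g. "172.31.37.220:50051"
    }
    return endpoints;
}
```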
Single-operation examples (abd_get/abd_put use the ABD protocol; get/put use the BLOCK protocol):

```
./client replicas_3.txt get
./client replicas_3.txt put "hello_world"
./client replicas_3.txt abd_get
./client replicas_3.txt abd_put "value123"
./client replicas_3.txt get
./client replicas_3.txt put "v"
```

The load generator runs multiple threads performing GET/PUT operations with configurable workload ratios.
```
./client <replicas.txt> load <threads> <get_ratio> <duration_seconds> <abd|block>
```

Example:

```
./client replicas_3.txt load 16 0.9 30 abd
```

Where:

- threads — number of concurrent client threads
- get_ratio — e.g., 0.9 for 90% GET, 10% PUT
- duration_seconds — runtime in seconds
- protocol — abd or block
Output includes:
- total ops
- throughput (ops/sec)
- p50 / p95 / p99 GET latency
- p50 / p95 / p99 PUT latency
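
As a rough sketch of how such numbers can be produced (not necessarily the project's exact implementation), each worker thread can flip a biased coin per operation and record per-operation latencies; throughput and percentiles are then derived at the end. `do_get`/`do_put` below are placeholders for the real RPC calls, and the parameters are hard-coded only for brevity.

```cpp
// Sketch of a ratio-driven load loop with latency percentiles.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

static void do_get() { /* placeholder for the real GET / ABD_GET RPC */ }
static void do_put() { /* placeholder for the real PUT / ABD_PUT RPC */ }

// Nearest-rank style percentile over a copy-free sort of the samples.
static double percentile(std::vector<double>& v, double p) {
    if (v.empty()) return 0.0;
    std::sort(v.begin(), v.end());
    return v[static_cast<size_t>(p * (v.size() - 1))];
}

int main() {
    const int threads = 16;
    const double get_ratio = 0.9;
    const int duration_s = 30;

    std::vector<std::vector<double>> get_lat(threads), put_lat(threads);
    std::atomic<long> total_ops{0};
    auto deadline = Clock::now() + std::chrono::seconds(duration_s);

    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&, t] {
            std::mt19937 rng(std::random_device{}());
            std::bernoulli_distribution is_get(get_ratio);
            while (Clock::now() < deadline) {
                auto start = Clock::now();
                bool g = is_get(rng);             // pick GET vs PUT by ratio
                if (g) do_get(); else do_put();
                double ms = std::chrono::duration<double, std::milli>(Clock::now() - start).count();
                (g ? get_lat[t] : put_lat[t]).push_back(ms);
                total_ops.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();

    std::vector<double> gets, puts;
    for (int t = 0; t < threads; ++t) {
        gets.insert(gets.end(), get_lat[t].begin(), get_lat[t].end());
        puts.insert(puts.end(), put_lat[t].begin(), put_lat[t].end());
    }
    std::printf("total ops: %ld, throughput: %.1f ops/sec\n",
                total_ops.load(), total_ops.load() / double(duration_s));
    std::printf("GET p50/p95/p99: %.3f/%.3f/%.3f ms\n",
                percentile(gets, 0.50), percentile(gets, 0.95), percentile(gets, 0.99));
    std::printf("PUT p50/p95/p99: %.3f/%.3f/%.3f ms\n",
                percentile(puts, 0.50), percentile(puts, 0.95), percentile(puts, 0.99));
    return 0;
}
```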
These experiments measure how the system behaves when one client crashes mid-execution.
```
./run_client_crash_experiment.sh <mode> <ratio> <N> <threads>
```

Example:

```
./run_client_crash_experiment.sh abd 0.9 3 16
```

Behavior:
- Launches 3 clients on different EC2 instances
- After 10 seconds, kills client 1
- Surviving clients continue for 30 seconds
- Logs saved under: `crash_exp_<mode>_N<N>_ratio<ratio>_<timestamp>/`
Each folder contains:
```
client_<mode>_<N>_<ratio>_0.log   # survivor
client_<mode>_<N>_<ratio>_1.log   # crashed (empty)
client_<mode>_<N>_<ratio>_2.log   # survivor
```
This script runs all combinations:
- N ∈ {1,3,5}
- ratio ∈ {0.1, 0.9}
- mode ∈ {abd, block}
```
./run_all_crash_experiments.sh
```

All logs are saved in crash_logs/.
```
cd scripts
python3 crash_log_parser.py
```

Output:
crash_results.csv
```
python3 graphs.py
```

Graphs include:
- Throughput vs concurrency
- p95 GET latency
- p95 PUT latency
- ABD vs BLOCK comparisons
```
project/
│
├── replica/ # Replica implementation
├── client/ # Client + load generator
├── proto/ # Protobuf RPC definitions
├── scripts/ # Experiment scripts + parser + plotter
│ ├── crash_log_parser.py
│ ├── graphs.py
│ ├── run_client_crash_experiment.sh
│ └── run_all_crash_experiments.sh
│
├── crash_logs/
│ ├── crash_exp_abd_N3_ratio0.9_<timestamp>/
│ ├── crash_exp_block_N5_ratio0.1_<timestamp>/
│ └── ...
│
├── replicas_1.txt
├── replicas_3.txt
├── replicas_5.txt
│
└── README.md
```
The following known limitations are documented intentionally (TAs appreciate transparency):
- Uses physical timestamps; Lamport timestamps would be stronger
- Read-back phase adds latency under high contention
- Blocking full replication causes large p95/p99 latencies
- Not resilient to slow or stalled replicas
- Does not resolve write conflicts
- Client startup across EC2 nodes not perfectly synchronized (± few ms)
- Logging is line-buffered and may reorder slightly under high load
- Crash injection uses pkill, not an internal failure detector
Verification performed during development:

- Verified correctness on N = 1, 3, 5
- Verified read/write correctness with concurrent clients
- Verified crash tolerance by killing clients and replicas
- Verified throughput saturation using multi-threaded load testing
All experiments (load tests, scaling with N, and crash experiments) were conducted on AWS EC2 instances in the us-east-2 region using the following setup:
- 3 Replica Nodes:
  - Instance type: `t3.medium` (2 vCPUs, 4 GB RAM)
  - Amazon Linux 2023 kernel-6.1 AMI
  - gRPC 1.60, Protobuf 3.19.6, GCC 11
- 3 Client Nodes:
  - Instance type: `t3.medium` (2 vCPUs, 4 GB RAM)
  - Used to run the load generator and crash experiments
- Networking:
  - All instances located in the same VPC and Availability Zone
  - Average inter-instance RTT ~0.3–0.6 ms
This environment ensures consistent performance for benchmarking both the ABD and BLOCK protocols. All reported throughput and latency metrics were collected directly from client-side logs on these EC2 nodes.
These helper scripts encapsulate the workflows used to gather the performance data referenced above. They are optional but save time when running suites repeatedly.
- Location: `scripts/distributed_load_runner.sh`
- Purpose: Fan out load tests from a controller node to multiple remote clients over SSH while sweeping thread counts, GET ratios, and consistency modes.
- Requirements: password-less SSH access to each `CLIENTS` entry and the compiled `client` binary available under `BASE_DIR` on every remote host.
- Usage:

  ```
  ./scripts/distributed_load_runner.sh
  ```

  Logs are captured to `results_<timestamp>/out_<client>_N<N>_<threads>_<ratio>_<mode>.txt`, which later feed the parser.
- Location: `scripts/distributed_parser.py`
- Purpose: Convert the raw per-client log files under `raw_logs/` (or the output directory produced above) into a single CSV with throughput and latency percentiles.
- Usage:

  ```
  cd scripts
  python3 distributed_parser.py
  ```

  The script writes `../results/distributed_results.csv`, creating the folder automatically if it is missing, and warns if no logs are present.
- Location: `scripts/run_all_crash_experiments.sh`
- Purpose: Iterate through every {mode × ratio × replica count} combination, delegating to `run_client_crash_experiment.sh` for each run, and throttle the loop with a 5-second cool-down to avoid spurious overload.
- Usage:

  ```
  ./scripts/run_all_crash_experiments.sh
  ```

  Ensure `run_client_crash_experiment.sh` is executable and that the replica inventory files are populated prior to starting the sweep.
The file tests/test_cases.txt documents the manual/TA test plan used throughout development. Each entry specifies the goal, setup, commands, and verification cues for areas ranging from single-replica sanity tests to failure-injection scenarios. Consult this file when validating a new deployment or when reproducing a bug; the layout is intentionally script-friendly so sections can be converted into automated smoke tests later.