From 073f75d7eed69da62ad6daf1c17396fab3934789 Mon Sep 17 00:00:00 2001 From: Leo Toff Date: Sun, 26 Nov 2023 14:54:14 -0800 Subject: [PATCH 1/7] Initialize README.md with an intro to Accord and build commands --- README.md | 18 ++++++++++++++++++ build.gradle | 1 + 2 files changed, 19 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000000..547d2d2287 --- /dev/null +++ b/README.md @@ -0,0 +1,18 @@ +Cassandra Accord +---------------- + +Distributed consensus protocol for Apache Cassandra. Accord is the first protocol to achieve the same steady-state performance as leader-based protocols under important conditions such as contention and failure, while delivering the benefits of leaderless approaches to scaling, transaction isolation and geo-distributed client latency. See [the Accord whitepaper](https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=2&modificationDate=1637000779000&api=v2). + +Build +----- +This repo is used as a git submodule in Cassandra, see [C*/CONTRIBUTING.md](https://github.com/apache/cassandra/blob/607302aaa8c1816a75a70173ae39a7d96ce1b18a/CONTRIBUTING.md#working-with-submodules) for instructions. + +To build this repo: +```bash +./gradlew dependencies +./gradlew check +``` + +Maelstrom testing +----------------- + diff --git a/build.gradle b/build.gradle index defce6f867..dd0017248e 100644 --- a/build.gradle +++ b/build.gradle @@ -31,6 +31,7 @@ rat { // List of Gradle exclude directives, defaults to ['**/.gradle/**'] excludes.add("**/build/**") excludes.add(".idea/**") + excludes.add("README.md") if (layout.projectDirectory.file(".rat-excludes.txt").asFile.exists()) { excludeFile.set(layout.projectDirectory.file(".rat-excludes.txt")) From 16e941fd40d5d1e2c1c841bc24f6b705043ae0d0 Mon Sep 17 00:00:00 2001 From: Leo Toff Date: Sun, 26 Nov 2023 16:46:57 -0800 Subject: [PATCH 2/7] Add duplicate resolution strategy to accord-maelstrom gradle config --- accord-maelstrom/build.gradle | 1 + 1 file changed, 1 insertion(+) diff --git a/accord-maelstrom/build.gradle b/accord-maelstrom/build.gradle index 2499217ab0..70ff1749db 100644 --- a/accord-maelstrom/build.gradle +++ b/accord-maelstrom/build.gradle @@ -39,6 +39,7 @@ jar { task fatJar(type: Jar) { manifest.from jar.manifest archiveClassifier = 'all' + duplicatesStrategy = DuplicatesStrategy.EXCLUDE from { configurations.runtimeClasspath.collect { it.isDirectory() ? it : zipTree(it) } } { From abc553745bae7a258f80f2bfdba56da051192ed3 Mon Sep 17 00:00:00 2001 From: Leo Toff Date: Sun, 26 Nov 2023 16:48:28 -0800 Subject: [PATCH 3/7] Add bash scripts for building and running accord-maelstrom --- accord-maelstrom/build.sh | 3 +++ accord-maelstrom/server.sh | 10 ++++++++++ 2 files changed, 13 insertions(+) create mode 100755 accord-maelstrom/build.sh create mode 100755 accord-maelstrom/server.sh diff --git a/accord-maelstrom/build.sh b/accord-maelstrom/build.sh new file mode 100755 index 0000000000..3b6b1c871d --- /dev/null +++ b/accord-maelstrom/build.sh @@ -0,0 +1,3 @@ +#!/bin/bash + +gradle fatJar \ No newline at end of file diff --git a/accord-maelstrom/server.sh b/accord-maelstrom/server.sh new file mode 100755 index 0000000000..8ff873f519 --- /dev/null +++ b/accord-maelstrom/server.sh @@ -0,0 +1,10 @@ +#!/bin/bash + +# http://mywiki.wooledge.org/BashFAQ/028 +if [[ $BASH_SOURCE = */* ]]; then + DIR=${BASH_SOURCE%/*}/ +else + DIR=./ +fi + +exec java -Xmx256M -jar "$DIR/build/libs/accord-maelstrom-1.0-SNAPSHOT-all.jar" From fd37252f1ae570f86c68a4574be76b5fc14210c5 Mon Sep 17 00:00:00 2001 From: Leo Toff Date: Sun, 26 Nov 2023 16:49:27 -0800 Subject: [PATCH 4/7] Add maelstrom instructions to README --- README.md | 28 ++++++++++++++++++++++++---- 1 file changed, 24 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 547d2d2287..04ea11946f 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,10 @@ Cassandra Accord ---------------- - Distributed consensus protocol for Apache Cassandra. Accord is the first protocol to achieve the same steady-state performance as leader-based protocols under important conditions such as contention and failure, while delivering the benefits of leaderless approaches to scaling, transaction isolation and geo-distributed client latency. See [the Accord whitepaper](https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=2&modificationDate=1637000779000&api=v2). Build ----- -This repo is used as a git submodule in Cassandra, see [C*/CONTRIBUTING.md](https://github.com/apache/cassandra/blob/607302aaa8c1816a75a70173ae39a7d96ce1b18a/CONTRIBUTING.md#working-with-submodules) for instructions. +This repo is used as a submodule for Cassandra, see [C*/CONTRIBUTING.md](https://github.com/apache/cassandra/blob/607302aaa8c1816a75a70173ae39a7d96ce1b18a/CONTRIBUTING.md#working-with-submodules) for instructions on how to include it. To build this repo: ```bash @@ -13,6 +12,27 @@ To build this repo: ./gradlew check ``` -Maelstrom testing ------------------ +Maelstrom +--------- +Jepsen Maelstrom is a workbench for writing toy implementations of distributed systems. It's used for running dtests (distributed tests) against Accord. + +First, build `accord-maelstrom`: +```bash +cd ./accord-maelstrom +./build.sh +``` +Save the path to the server script in an environment variable: +```bash +cd ./accord-maelstrom +export ACCORD_MAELSTROM_SERVER=$(pwd)/server.sh +``` +Clone Maelstrom repo or get the binary, see [Maelstrom installation](https://github.com/jepsen-io/maelstrom/blob/main/doc/01-getting-ready/index.md#installation) for more. If cloned, use `lein run` instead of `./maelstrom` below. +```bash +# Single-node KV store test +./maelstrom test -w lin-kv --bin $ACCORD_MAELSTROM_SERVER --time-limit 10 --rate 10 --node-count 1 --concurrency 2n +# Multi-node KV store test +./maelstrom test -w lin-kv --bin $ACCORD_MAELSTROM_SERVER --time-limit 10 --rate 10 --node-count 3 --concurrency 2n +# Multi-node with partitions +./maelstrom test -w lin-kv --bin $ACCORD_MAELSTROM_SERVER --time-limit 60 --rate 30 --node-count 3 --concurrency 4n --nemesis partition --nemesis-interval 10 --test-count 10 +``` From 77c3d415a798f8803849e9bc166ff68b26f2cfc8 Mon Sep 17 00:00:00 2001 From: Leo Toff Date: Sun, 26 Nov 2023 17:23:17 -0800 Subject: [PATCH 5/7] Add a summary of the Accord protocol --- README.md | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 04ea11946f..6d333b4ceb 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,23 @@ Cassandra Accord ----------------- +================ Distributed consensus protocol for Apache Cassandra. Accord is the first protocol to achieve the same steady-state performance as leader-based protocols under important conditions such as contention and failure, while delivering the benefits of leaderless approaches to scaling, transaction isolation and geo-distributed client latency. See [the Accord whitepaper](https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=2&modificationDate=1637000779000&api=v2). +Accord Protocol +--------------- +Key Features: +- Leaderless - There is no designated leader node. Any node can coordinate transactions. This removes bottlenecks and improves scalability. +- Single round-trip commits - Most transactions can commit in a single message round-trip between coordinating node and replicas. This minimizes latency. +- High contention performance - Special techniques allow it to avoid slowed performance when many transactions conflict. +- Configurable failure tolerance - The protocol can be dynamically reconfigured to maintain fast performance even after node failures. +- Strong consistency guarantees - Provides strict serializability, the strongest isolation level. Transactions behave as if they execute one at a time. +- General purpose transactions - Supports cross-shard multi-statement ACID transactions, not just single partition reads/writes. + +At a high level, it works as follows: +1. A coordinator node is chosen to handle each transaction (typically nearby the client for low latency). +2. The coordinator gets votes from replica nodes on an execution timestamp and set of conflicts for the transaction. +3. With enough votes, the transaction can commit in a single round-trip. +4. The coordinator waits for conflicting transactions, then tells the replicas to execute and persist the changes. + Build ----- This repo is used as a submodule for Cassandra, see [C*/CONTRIBUTING.md](https://github.com/apache/cassandra/blob/607302aaa8c1816a75a70173ae39a7d96ce1b18a/CONTRIBUTING.md#working-with-submodules) for instructions on how to include it. From 756cf4b2ade9a920876e42d15757e0ac767ecfcf Mon Sep 17 00:00:00 2001 From: Leo Toff Date: Sun, 26 Nov 2023 17:30:01 -0800 Subject: [PATCH 6/7] Add a short description of code structure --- README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/README.md b/README.md index 6d333b4ceb..bac2acea07 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,12 @@ At a high level, it works as follows: 3. With enough votes, the transaction can commit in a single round-trip. 4. The coordinator waits for conflicting transactions, then tells the replicas to execute and persist the changes. +Code structure +-------------- +`accord-core` is the implementation of the Accord protocol that is imported in Cassandra. See [cep-15-accord branch](https://github.com/apache/cassandra/tree/cep-15-accord) in Cassandra. + +`accord-maelstrom` is a wrapper for running Accord within [Jepsen Maelstrom](https://github.com/jepsen-io/maelstrom) which uses STDIN for ingress and STDOUT for egress. + Build ----- This repo is used as a submodule for Cassandra, see [C*/CONTRIBUTING.md](https://github.com/apache/cassandra/blob/607302aaa8c1816a75a70173ae39a7d96ce1b18a/CONTRIBUTING.md#working-with-submodules) for instructions on how to include it. From 519e69a5aa4dc3d2537aaa2049a006602834e56c Mon Sep 17 00:00:00 2001 From: Leo Toff Date: Mon, 27 Nov 2023 10:56:32 -0800 Subject: [PATCH 7/7] Refactor Accord protocal description --- README.md | 34 ++++++++++++++++++++++++---------- 1 file changed, 24 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index bac2acea07..8c33cd4cb8 100644 --- a/README.md +++ b/README.md @@ -5,18 +5,32 @@ Distributed consensus protocol for Apache Cassandra. Accord is the first protoco Accord Protocol --------------- Key Features: -- Leaderless - There is no designated leader node. Any node can coordinate transactions. This removes bottlenecks and improves scalability. -- Single round-trip commits - Most transactions can commit in a single message round-trip between coordinating node and replicas. This minimizes latency. -- High contention performance - Special techniques allow it to avoid slowed performance when many transactions conflict. -- Configurable failure tolerance - The protocol can be dynamically reconfigured to maintain fast performance even after node failures. -- Strong consistency guarantees - Provides strict serializability, the strongest isolation level. Transactions behave as if they execute one at a time. -- General purpose transactions - Supports cross-shard multi-statement ACID transactions, not just single partition reads/writes. +- Leaderless Design: Accord operates without a designated leader node, allowing any node to coordinate transactions. This approach eliminates single points of failure and enhances the system's scalability. +- Fast Consensus Mechanism: It utilizes a special initial round for faster consensus, enabling most transactions to reach agreement quickly, often within two message round-trips. +- Efficient Handling of High Contention: Accord's design reduces the incidence of contention by determining a timestamp for each transaction’s execution order. This feature is particularly beneficial in scenarios where many transactions conflict. +- Partial State-Machine Replication for Scalability: The protocol can be extended to partial state-machine replication (PSMR) scenarios, improving scalability by allowing shards to replicate only a part of the global state-machine. +- Strong Consistency and Isolation Guarantees: Accord provides optimal baseline characteristics for consistency and isolation, ensuring that conflicting transactions are applied in the same order on all participating replicas. +- Support for General-Purpose Transactions: The protocol is capable of handling general-purpose transactions that combine cross-shard state, facilitating multi-statement ACID transactions across different partitions. +- Robust Recovery Mechanism: Accord includes a comprehensive recovery protocol to handle failures effectively. This protocol ensures the continuation and completion of transactions even in the event of coordinator failure. +- Minimal Latency with Fast-Path Option: For transactions where a fast-path quorum of replicas unanimously accept a proposed timestamp, the decision is made immediately, significantly reducing latency. +- Dependency and Order Safety: The protocol ensures that all dependencies of a transaction are accounted for before its execution, maintaining strict order safety and atomicity. At a high level, it works as follows: -1. A coordinator node is chosen to handle each transaction (typically nearby the client for low latency). -2. The coordinator gets votes from replica nodes on an execution timestamp and set of conflicts for the transaction. -3. With enough votes, the transaction can commit in a single round-trip. -4. The coordinator waits for conflicting transactions, then tells the replicas to execute and persist the changes. +1. Consensus Phase: +- Accord assigns each conflicting transaction a unique execution timestamp, forming a total order. +- Timestamps are tuples of time, sequence, and identifier values. These timestamps are used to assign execution times and impose a total order on conflicting transactions. +- A transaction coordinator proposes a timestamp for execution. If a fast-path quorum of replicas unanimously accepts this timestamp, it is immediately decided. Otherwise, a slow path using classic Paxos is used to agree on one of the possible timestamps. +- Execution proceeds after all transactions with earlier timestamps have been committed and executed. +2. Execution Phase: +- Once the execution timestamp is decided and logically committed, it is disseminated to all shards. +- The coordinator sends Read messages to at least one process in each shard to gather responses. +- Execution awaits the completion of dependencies with lower execution timestamps before computing the transaction result. +- The result is then applied to all replicas. + +Recovery protocol: +- If a transaction coordinator fails, a weak failure detector invokes a recovery protocol. +- The recovery protocol contacts a recovery quorum to ensure the transaction is pre-accepted, maintaining the properties of normal execution. +- For fast-path decisions, the protocol may propose a slow-path solution based on the dependencies of superseding transactions. This ensures that all necessary properties are maintained during the recovery of a transaction. Code structure --------------