A high-performance transaction processing pipeline for DecentralChain that doubles throughput without modifying consensus rules.
Overview • Results • Optimizations • Architecture • Benchmarking • Getting Started • License
This project implements a payments-TPS "fast lane" for the DecentralChain blockchain node — a series of surgical, consensus-safe optimizations that collectively double transaction throughput from ~2,070 TPS to ~4,080+ TPS.
Every optimization operates strictly within the existing block validation and state-transition rules. No consensus changes, no fork risk, no protocol modifications. Nodes running these optimizations remain fully compatible with the existing network.
| Principle | Description |
|---|---|
| Zero Consensus Impact | All changes are internal to the node's execution pipeline — block structure, validation rules, and state hashes remain identical |
| Backward Compatible | Optimized nodes produce and validate the same blocks as unmodified nodes |
| Measured & Verified | Every optimization is benchmarked with JMH microbenchmarks on realistic workloads (5,000 transfer transactions per block) |
| Incremental & Reversible | Each optimization is an independent, isolated change that can be applied or reverted individually |
| Stage | ms/op (5K txs) | Effective TPS | Cumulative Gain |
|---|---|---|---|
| 📊 Baseline | 2,414 | ~2,071 | — |
| 🔀 + Parallel Signature Verification | 1,869 ± 374 | ~2,675 | +29% |
| 🗂️ + SortedBatch Elimination | 1,554 ± 417 | ~3,217 | +55% |
| 📦 + Pre-Serialization Pipeline | 1,226 ± 238 | ~4,078 | +97% |
| ⚡ + Batch Dedup & Async Persist | ~1,050* | ~4,760* | ~130% |
* Final stage measured as ~13.5% improvement over PR-4 in controlled same-session benchmarks. All benchmarks: JMH, 5,000 transfer txs/block, single fork, JDK 11, Apple Silicon.
Bottom line: 2× throughput improvement with zero protocol changes.
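The TPS figures above follow directly from the ms/op measurements: with 5,000 transactions per block, effective TPS is 5,000 ÷ (ms/op ÷ 1,000). A quick sanity check of the table's numbers in plain Java (illustrative only):

```java
public class TpsCheck {
    // Effective TPS for a 5,000-transaction block, given measured ms/op.
    static int tps(double msPerOp) {
        return (int) (5000.0 / (msPerOp / 1000.0));
    }

    public static void main(String[] args) {
        System.out.println(tps(2414)); // baseline            -> 2071
        System.out.println(tps(1869)); // + parallel sigs     -> 2675
        System.out.println(tps(1554)); // + no SortedBatch    -> 3217
        System.out.println(tps(1226)); // + pre-serialization -> 4078
    }
}
```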
Impact: +29% TPS · node/.../state/diffs/BlockDiffer.scala
The single largest bottleneck in block processing is Ed25519 signature verification — a pure CPU-bound operation performed sequentially for every transaction in a block.
Problem: The default BlockDiffer verifies each transaction's cryptographic signature one-by-one inside the main processing loop. With 5,000 transactions per block, this serial verification dominates wall-clock time.
Solution: Verify all transaction signatures in parallel before entering the sequential state-diff loop using a CountDownLatch + ExecutorService pattern:
```
┌─────────────────────────────────────────────────┐
│             Block Received (N txs)              │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌─────────┐  ┌─────────┐       ┌─────────┐     │
│  │ Verify  │  │ Verify  │  ...  │ Verify  │     │
│  │ Sig #1  │  │ Sig #2  │       │ Sig #N  │     │
│  └────┬────┘  └────┬────┘       └────┬────┘     │
│       │            │                 │          │
│       └────────────┴─────────────────┘          │
│                    │                            │
│           CountDownLatch.await()                │
│                    │                            │
│        ┌───────────▼───────────┐                │
│        │  Sequential Diff Loop │                │
│        │  (signatures already  │                │
│        │   verified — skip)    │                │
│        └───────────────────────┘                │
└─────────────────────────────────────────────────┘
```
Why it works: Ed25519 verification is stateless and CPU-bound — a textbook candidate for data parallelism. On a machine with C cores, N signatures verify in roughly O(N/C) wall-clock time instead of O(N).
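The CountDownLatch + ExecutorService pattern can be sketched in plain Java (the real implementation lives in BlockDiffer.scala; `verify` below is a hypothetical stand-in predicate, not actual Ed25519 verification):

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

public class ParallelSigVerify {

    // Stand-in for the stateless, CPU-bound Ed25519 check (hypothetical).
    static boolean verify(byte[] sig) {
        return sig.length > 0;
    }

    /** Verify every signature in parallel; true iff all pass. */
    static boolean verifyAll(List<byte[]> sigs) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        CountDownLatch latch = new CountDownLatch(sigs.size());
        AtomicBoolean allValid = new AtomicBoolean(true);
        try {
            for (byte[] sig : sigs) {
                pool.submit(() -> {
                    try {
                        if (!verify(sig)) allValid.set(false);
                    } finally {
                        latch.countDown(); // count down even if verify throws
                    }
                });
            }
            latch.await(); // wait for every verification task to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdown();
        }
        // The sequential diff loop can now skip per-transaction checks.
        return allValid.get();
    }
}
```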
Impact: +16.8% TPS (55% cumulative) · node/.../database/package.scala
Problem: The readWrite() database method maintained a SortedBatch structure that double-buffered all pending writes — sorting and re-grouping key-value pairs before flushing to LevelDB. Since LevelDB already performs its own internal sorting via memtables and SSTables, this intermediate sort was redundant work.
Solution: Bypass the SortedBatch accumulator and write directly to the LevelDB WriteBatch, eliminating:
- One full copy of all key-value pairs per block
- A `TreeMap` sort operation over thousands of entries
- Additional GC pressure from short-lived intermediate objects
Why it works: LevelDB's LSM-tree architecture already guarantees sorted, atomic writes through its WAL + memtable pipeline. The application-level sort was defense-in-depth that, upon profiling, proved to be pure overhead.
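To make the change concrete, here is a hedged Java sketch of the two write paths, with a plain list standing in for LevelDB's WriteBatch (the actual change is in the Scala database package; all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class DirectWrite {

    // Stand-in for one key/value put headed for LevelDB.
    static final class Put {
        final String key, value;
        Put(String key, String value) { this.key = key; this.value = value; }
    }

    // BEFORE: buffer every put in a sorted intermediate map, then flush.
    // One extra copy plus a TreeMap sort that LevelDB's memtable redoes anyway.
    static List<Put> viaSortedBatch(List<Put> pending) {
        TreeMap<String, String> sorted = new TreeMap<>();
        for (Put p : pending) sorted.put(p.key, p.value);
        List<Put> batch = new ArrayList<>();
        sorted.forEach((k, v) -> batch.add(new Put(k, v)));
        return batch;
    }

    // AFTER: append straight to the write batch in one pass.
    static List<Put> direct(List<Put> pending) {
        return new ArrayList<>(pending);
    }
}
```

Both paths hand LevelDB the same set of key/value pairs; only the redundant application-level copy and sort disappear.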
Impact: +21% TPS (97% cumulative) · node/.../database/LevelDBWriter.scala
Problem: Transaction serialization (converting in-memory transaction objects to byte arrays for storage) was happening inside the readWrite lock — the critical section that holds exclusive access to the database. Serialization is CPU-intensive but does not require database access, meaning it unnecessarily extended lock hold time.
Solution: Move all transaction serialization outside the readWrite lock into a pre-computation phase:
BEFORE (serialization inside lock):

```
┌──────────────────────────────────┐
│       readWrite Lock Held        │
│  ┌────────────────────────────┐  │
│  │   for each tx:             │  │
│  │     serialize(tx)  ← CPU   │  │
│  │     db.put(key, bytes)     │  │
│  └────────────────────────────┘  │
└──────────────────────────────────┘
```

AFTER (serialization before lock):

```
┌──────────────────────────────────┐
│  Pre-serialize all txs           │  ← No lock held
│  val entries = txs.map {         │
│    tx => (key, serialize(tx))    │
│  }                               │
└─────────────┬────────────────────┘
              │
┌─────────────▼────────────────────┐
│       readWrite Lock Held        │
│  for each (key, bytes):          │  ← Lock held for minimal time
│    db.put(key, bytes)            │
└──────────────────────────────────┘
```
Why it works: By reducing critical section duration, other threads (signature verification, network I/O, API handlers) spend less time contended on the write lock. The total work is identical — it's simply reordered to minimize lock contention.
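A minimal Java sketch of the reordering, using a ReentrantReadWriteLock and an in-memory map as stand-ins for the node's readWrite lock and LevelDB (identifiers are illustrative, not the actual LevelDBWriter API):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;

public class PreSerialize {
    static final Map<String, byte[]> db = new HashMap<>();                 // stand-in for LevelDB
    static final ReentrantReadWriteLock rw = new ReentrantReadWriteLock(); // stand-in for readWrite

    // Stand-in for CPU-heavy transaction serialization.
    static byte[] serialize(String tx) {
        return tx.getBytes(StandardCharsets.UTF_8);
    }

    static void persist(List<String> txs) {
        // Phase 1: serialize everything with NO lock held (pure CPU work).
        Map<String, byte[]> entries = txs.stream()
            .collect(Collectors.toMap(tx -> "tx:" + tx, PreSerialize::serialize));

        // Phase 2: hold the write lock only for the cheap puts.
        rw.writeLock().lock();
        try {
            db.putAll(entries);
        } finally {
            rw.writeLock().unlock();
        }
    }
}
```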
Impact: Part of ~13.5% combined improvement · BlockDiffer.scala · TransactionDiffer.scala
Problem: During block application, each transaction is checked for duplicate IDs by querying the database individually. With 5,000 transactions per block, this results in 5,000 separate LevelDB lookups — each incurring seek overhead, system call latency, and cache pollution.
Solution: Collect all transaction IDs upfront and perform a single batch existence check against the database before entering the diff loop. Pass a skipDuplicateCheck flag to TransactionDiffer so it doesn't re-check IDs already verified in bulk.
Why it works: Batch I/O operations amortize fixed costs (system calls, disk seeks, iterator initialization) across thousands of lookups. A single multi-get is dramatically faster than N individual gets.
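The shape of the change, sketched in Java with an in-memory set standing in for the stored-transaction index (the skipDuplicateCheck flag mirrors the one described above; everything else is illustrative):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class BatchDedup {

    // Stand-in for a single batched LevelDB lookup over all IDs at once.
    static Set<String> existingIds(Set<String> storedIds, List<String> ids) {
        return ids.stream().filter(storedIds::contains).collect(Collectors.toSet());
    }

    /** Returns true iff the block contains no already-known transaction IDs. */
    static boolean applyBlock(Set<String> storedIds, List<String> blockTxIds) {
        // 1. Collect every ID upfront and check them in one batch...
        if (!existingIds(storedIds, blockTxIds).isEmpty()) return false;
        // 2. ...then run the diff loop with the per-tx check skipped.
        for (String id : blockTxIds) {
            diffTransaction(id, /* skipDuplicateCheck = */ true);
        }
        return true;
    }

    static void diffTransaction(String id, boolean skipDuplicateCheck) {
        // Per-transaction state transition; no individual DB lookup
        // when the ID was already cleared by the batch check.
    }
}
```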
Impact: Part of ~13.5% combined improvement · node/.../database/Caches.scala
Problem: After computing the state diff for a block, the node synchronously writes the results to disk before starting work on the next block. This creates a pipeline stall — the CPU sits idle during disk I/O.
Solution: Pipeline block processing and persistence by submitting write operations to a background executor:
```
Block N:    [ Diff ][ Persist → background ]
Block N+1:          [ Diff ][ Persist → background ]
Block N+2:                  [ Diff ][ Persist → background ]
────────────────────────────────────────────────────────────▶ time
```
The background executor is a single-threaded ExecutorService that guarantees write ordering. A proper shutdown hook ensures all pending writes complete before the node exits.
Why it works: Block diff computation and disk persistence are largely independent — diffing is CPU-bound while persistence is I/O-bound. Pipelining overlaps these two phases, recovering what was previously dead time.
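A hedged Java sketch of the pattern: a single-threaded executor preserves write order, and a drain step plays the role of the shutdown hook (stand-in code, not the actual Caches.scala implementation):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncPersist {
    // One thread => persisted blocks retire strictly in submission order.
    static final ExecutorService persister = Executors.newSingleThreadExecutor();
    static final AtomicInteger persistedBlocks = new AtomicInteger();

    static void applyBlock(int height) {
        // Diff computation runs on the caller's thread (CPU-bound), possibly
        // while the previous block's persist is still writing to disk.
        persister.submit(() -> {
            // Stand-in for the disk write of this block's state diff.
            persistedBlocks.incrementAndGet();
        });
    }

    /** Shutdown-hook equivalent: drain all pending writes before exit. */
    static boolean drain() {
        persister.shutdown();
        try {
            return persister.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```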
Rigorous engineering means knowing what not to ship:
| Approach | Hypothesis | Result |
|---|---|---|
| Mutable HAMT Accumulators (PR-2) | Replace immutable CHAMP trie maps with mutable structures during diff | CHAMP tries already use structural sharing efficiently; mutable versions showed no measurable gain |
| Bloom Filter Pre-Screening (PR-5) | Use Bloom filters to skip unnecessary DB lookups for duplicate detection | Increased variance by 2.5× with no mean improvement — false positive overhead negated lookup savings |
```
┌──────────────────────────────────────────────────┐
│                  BLOCK RECEIVED                  │
└──────────────────┬───────────────────────────────┘
                   │
┌──────────────────▼───────────────────────────────┐
│    PR-1: PARALLEL SIGNATURE VERIFICATION         │
│     ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐       │
│     │ Core │  │ Core │  │ Core │  │ Core │  ...  │
│     └──┬───┘  └──┬───┘  └──┬───┘  └──┬───┘       │
│        └─────────┴─────────┴────────┘            │
│               CountDownLatch                     │
└──────────────────┬───────────────────────────────┘
                   │
┌──────────────────▼───────────────────────────────┐
│          PR-6: BATCH DUPLICATE-ID CHECK          │
│    Collect all tx IDs → single DB multi-get      │
└──────────────────┬───────────────────────────────┘
                   │
┌──────────────────▼───────────────────────────────┐
│           SEQUENTIAL STATE DIFF LOOP             │
│     (sigs pre-verified, dupes pre-checked)       │
└──────────────────┬───────────────────────────────┘
                   │
┌──────────────────▼───────────────────────────────┐
│        PR-4: PRE-SERIALIZE TRANSACTIONS          │
│    Serialize all tx bytes OUTSIDE write lock     │
└──────────────────┬───────────────────────────────┘
                   │
┌──────────────────▼───────────────────────────────┐
│           PR-3: DIRECT LEVELDB WRITE             │
│      Skip SortedBatch → WriteBatch directly      │
└──────────────────┬───────────────────────────────┘
                   │
┌──────────────────▼───────────────────────────────┐
│     PR-7: ASYNC PERSIST (background executor)    │
│     Pipeline: current block persists while       │
│     next block begins diff computation           │
└──────────────────────────────────────────────────┘
```
| File | Optimization | Description |
|---|---|---|
| `node/.../state/diffs/BlockDiffer.scala` | PR-1, PR-6 | Parallel sig verification, batch duplicate pre-check |
| `node/.../database/package.scala` | PR-3 | SortedBatch elimination |
| `node/.../database/LevelDBWriter.scala` | PR-4 | Pre-serialization outside write lock |
| `node/.../state/diffs/TransactionDiffer.scala` | PR-6 | `skipDuplicateCheck` parameter |
| `node/.../database/Caches.scala` | PR-7 | Async persist pipelining with background executor |
All benchmarks use JMH (Java Microbenchmark Harness) with the following configuration:
| Parameter | Value |
|---|---|
| Transactions per block | 5,000 (transfer transactions) |
| Unique accounts | 10,000 pre-funded |
| JMH warmup iterations | 3 |
| JMH measurement iterations | 5 |
| Forks | 1 |
| Threads | 1 |
| JVM | OpenJDK 11, G1GC, 4 GB heap |
```sh
# Set up environment
export JAVA_HOME="/opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk/Contents/Home"
export PATH="$JAVA_HOME/bin:$PATH"
export SBT_OPTS="-Xmx4g -XX:+UseG1GC"

# Run the transfer-only benchmark
sbt -java-home "$JAVA_HOME" \
  "benchmark/jmh:run -i 5 -wi 3 -f1 -t1 \
  com.wavesplatform.state.TransferOnlyBenchmark"
```

JMH results can vary between runs due to:
- LevelDB compaction state — background compaction affects read/write latency
- JVM JIT compilation — HotSpot optimizes differently across runs
- OS disk cache — warm vs. cold cache produces different I/O profiles
- GC pressure — G1GC pause timing affects tail latency
For reproducible results, we recommend:
- Running `@Setup(Level.Trial)` with pre-populated state
- Increasing iterations: `-wi 10 -i 20`
- Pre-warming the database with a discard run
- Pinning GC settings: `-XX:+UseG1GC -XX:MaxGCPauseMillis=50`
- JDK 11+ (OpenJDK recommended)
- SBT 1.3.8+
- protoc (Protocol Buffers compiler)
```sh
git clone https://github.com/dylanpersonguy/DCC-Tx-Speed-Upgrade.git
cd DCC-Tx-Speed-Upgrade

export JAVA_HOME="/opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk/Contents/Home"
export SBT_OPTS="-Xmx4g -XX:+UseG1GC"

# Compile the full project
sbt compile

# Run the node
java -jar node/target/decentralchain-all*.jar path/to/config.conf

# Run the test suite
sbt test

# Run the benchmarks
sbt "benchmark/jmh:run -i 5 -wi 3 -f1 -t1 com.wavesplatform.state.TransferOnlyBenchmark"
```

```
DCC-Tx-Speed-Upgrade/
├── node/                                # Core blockchain node
│   └── src/main/scala/com/wavesplatform/
│       ├── state/diffs/
│       │   ├── BlockDiffer.scala        # PR-1: Parallel sig verify
│       │   │                            # PR-6: Batch duplicate check
│       │   └── TransactionDiffer.scala  # PR-6: Skip-duplicate flag
│       └── database/
│           ├── LevelDBWriter.scala      # PR-4: Pre-serialization
│           ├── package.scala            # PR-3: SortedBatch removal
│           └── Caches.scala             # PR-7: Async persist
├── benchmark/                           # JMH performance benchmarks
├── grpc-server/                         # gRPC API server
├── lang/                                # RIDE smart contract language
└── project/                             # SBT build configuration
```
The optimizations are not independent — they form a synergistic pipeline:
- Parallel signatures free up the CPU during the most expensive per-transaction operation
- Batch dedup eliminates thousands of individual DB seeks before the diff loop
- The diff loop itself now runs faster because signatures and duplicates are pre-resolved
- Pre-serialization ensures the write lock is held for the minimum possible duration
- Direct WriteBatch removes redundant buffering in the persist path
- Async persist overlaps I/O with the next block's CPU work
Each optimization removes a different bottleneck, allowing the next bottleneck to surface and be addressed. This is why the gains compound multiplicatively rather than additively.
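The compounding is easy to verify from the results table: each stage's speedup is measured relative to the previous stage, so the per-stage factors multiply (and telescope) into the end-to-end gain. A small Java check using the measured ms/op values:

```java
public class CompoundGains {
    // Measured ms/op at each stage (from the results table above).
    static final double BASELINE = 2414, PR1 = 1869, PR3 = 1554, PR4 = 1226;

    /** Product of per-stage speedups; telescopes to BASELINE / PR4. */
    static double endToEnd() {
        double s1 = BASELINE / PR1; // parallel signatures: ~1.29x
        double s2 = PR1 / PR3;      // SortedBatch removal: ~1.20x
        double s3 = PR3 / PR4;      // pre-serialization:   ~1.27x
        return s1 * s2 * s3;        // ~1.97x overall, i.e. +97%
    }
}
```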
| Property | Guarantee |
|---|---|
| Deterministic state | State hash after applying block N is identical whether optimizations are enabled or not |
| Crash consistency | LevelDB's WAL ensures atomic writes; async persist flushes completely before shutdown |
| Network compatibility | Optimized nodes communicate identically on the wire — no protocol changes |
| Ordering | Transaction ordering within blocks is preserved exactly as received |
- Built on DecentralChain (a Waves Platform fork)
- Benchmarked with JMH, the Java Microbenchmark Harness from the OpenJDK project
- Profiling powered by YourKit Java Profiler
This project is licensed under the MIT License.