
feat: add DatumCache for memory optimization of inline datums #18

Open

awcjack wants to merge 28 commits into master from feature/datum-cache-memory-optimization

Conversation

awcjack (Owner) commented Jan 3, 2026

Summary

Reduces the memory footprint of Hydra nodes by storing datum hashes instead of full inline datums in memory, with a separate cache to restore them when they are needed for on-chain transactions.

Problem

Hydra nodes exhibit high memory usage caused by UTxOs with inline datums that remain unspent in the head. These datums consume memory even when they are not actively needed.

Solution

Implement a datum caching mechanism that:

  1. Strips inline datums from UTxOs after they enter the head, storing only the datum hash
  2. Caches the full datums in a separate DatumCache (a Map from hash to datum)
  3. Restores the datums before on-chain transactions (Close/Contest) to preserve hash consistency

Key Changes

New Module: Hydra.DatumCache

  • DatumCache type - strict Map from datum hash to full datum
  • HasDatumCache type class for abstracting datum operations
  • stripDatums / restoreDatums functions for UTxO manipulation
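For orientation, a minimal, self-contained sketch of what such a cache looks like is shown below. The key and value types are kept abstract; the real Hydra.DatumCache works with Cardano's datum hash and datum types, and its exact signatures may differ.

```haskell
-- Minimal sketch of the DatumCache idea (illustrative types, not the
-- actual Hydra.DatumCache API).
module DatumCacheSketch (DatumCache, emptyCache, insertDatum, lookupDatum) where

import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Strict map from datum hash to full datum.
newtype DatumCache hash datum = DatumCache (Map hash datum)

emptyCache :: DatumCache hash datum
emptyCache = DatumCache Map.empty

-- Record a datum so it can later be restored from its hash.
insertDatum :: Ord hash => hash -> datum -> DatumCache hash datum -> DatumCache hash datum
insertDatum h d (DatumCache m) = DatumCache (Map.insert h d m)

-- Look up a previously stripped datum by its hash.
lookupDatum :: Ord hash => hash -> DatumCache hash datum -> Maybe datum
lookupDatum h (DatumCache m) = Map.lookup h m
```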

Datum Stripping (reduces memory)

  • HeadOpened: strip datums from initialUTxO
  • TransactionAppliedToLocalUTxO: strip after tx validation
  • CommitFinalized: strip datums from deposited UTxO
  • DecommitRecorded: strip datums from remaining UTxO

Datum Restoration (preserves on-chain correctness)

  • HeadClosed: restore before storing in ClosedState
  • onOpenClientClose: restore before emitting CloseTx
  • onOpenChainCloseTx: restore before emitting ContestTx

Critical Bug Fixed

Discovered that hashUTxO produces a different hash for a UTxO with stripped datums than for the same UTxO with full inline datums. This would cause on-chain validation failures, because snapshots are signed over the full datums. Solution: restore the datums BEFORE emitting Close/Contest transactions.
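A toy round trip (not Hydra code) illustrates the point: an output whose inline datum has been stripped is a different value than the original output, so a UTxO hash computed over stripped outputs cannot match the hash signed in a snapshot over the full outputs. Restoring from the cache first makes the two coincide. All types and names below are illustrative stand-ins.

```haskell
module StripRestoreSketch where

import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type DatumHash = Int     -- stand-in for the real datum hash type
type Datum     = String  -- stand-in for the real datum type

data ToyOut = ToyOut
  { datumHash   :: DatumHash
  , inlineDatum :: Maybe Datum  -- Nothing once stripped
  } deriving (Eq, Show)

-- Strip inline datums, collecting them into a cache keyed by hash.
stripDatums :: [ToyOut] -> ([ToyOut], Map DatumHash Datum)
stripDatums = foldr go ([], Map.empty)
 where
  go o (outs, cache) = case inlineDatum o of
    Just d  -> (o { inlineDatum = Nothing } : outs, Map.insert (datumHash o) d cache)
    Nothing -> (o : outs, cache)

-- Restore inline datums from the cache before hashing or building Close/Contest.
restoreDatums :: Map DatumHash Datum -> [ToyOut] -> [ToyOut]
restoreDatums cache = map restoreOne
 where
  restoreOne o = case inlineDatum o of
    Nothing -> o { inlineDatum = Map.lookup (datumHash o) cache }
    Just _  -> o

-- Assuming each datum hash identifies a single datum, restoring what was
-- stripped yields the original outputs again, so hashing the restored UTxO
-- agrees with hashing the original one.
roundTrips :: [ToyOut] -> Bool
roundTrips outs =
  let (stripped, cache) = stripDatums outs
   in restoreDatums cache stripped == outs
```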

Testing

  • ✅ Library builds without warnings
  • ✅ HeadLogic tests pass
  • ✅ 543/547 tests pass (4 failures are unrelated network infrastructure tests)

Files Changed

Category     Files
New          Hydra/DatumCache.hs
Core Logic   HeadLogic.hs, HeadLogic/State.hs, HeadLogic/Outcome.hs
Node Layer   Node.hs, Node/Run.hs, API/Server.hs
Tests        Ledger/Simple.hs, HeadLogicSpec.hs, NodeSpec.hs, RotationSpec.hs
Schema       api.yaml, golden files

Introduce DatumCache module to reduce memory footprint of Hydra nodes
by storing datum hashes instead of full inline datums in memory.

- Add DatumCache type (strict Map from datum hash to full datum)
- Add HasDatumCache type class for abstracting datum operations
- Implement stripDatums/restoreDatums for UTxO manipulation
- Export emptyCache, insertDatum, lookupDatum utilities
- Add datumCache field to OpenState for storing stripped datums
- Add HasDatumCache constraint to StateChanged and Outcome types
- Implement no-op HasDatumCache instance for SimpleTx test type

Strip inline datums from UTxOs to reduce memory footprint:
- HeadOpened: strip datums from initialUTxO
- TransactionAppliedToLocalUTxO: strip after tx validation
- CommitFinalized: strip datums from deposited UTxO
- DecommitRecorded: strip datums from remaining UTxO

Restore datums before on-chain transactions to preserve hash consistency:
- HeadClosed: restore datums before storing in ClosedState
- onOpenClientClose: restore datums before emitting CloseTx
- onOpenChainCloseTx: restore datums before emitting ContestTx

This fixes a critical bug where stripped datums would produce different
hashes than original inline datums, causing on-chain validation failures.

Add HasDatumCache constraint to:
- HydraNode type in Node.hs
- runHydraNode and related functions in Node/Run.hs
- Server functions in API/Server.hs

- Add datumCache = emptyCache to OpenState in HeadLogicSpec
- Add HasDatumCache constraint in NodeSpec and RotationSpec
- Add DatumCache schema to api.yaml
- Update golden test files to include datumCache field in OpenState

github-actions bot commented Jan 3, 2026

Transaction cost differences

No cost or size differences found


github-actions bot commented Jan 3, 2026

Transaction costs

Sizes and execution budgets for Hydra protocol transactions. Note that unlisted parameters currently use arbitrary values, so results are not fully deterministic or directly comparable to previous runs.

Metadata
Generated at 2026-01-05 15:22:37.88171796 UTC
Max. memory units 14000000
Max. CPU units 10000000000
Max. tx size (bytes) 16384

Script summary

Name Hash Size (Bytes)
νInitial c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045 2652
νCommit 61458bc2f297fff3cc5df6ac7ab57cefd87763b0b7bd722146a1035c 685
νHead a1442faf26d4ec409e2f62a685c1d4893f8d6bcbaf7bcb59d6fa1340 14599
μHead fd173b993e12103cd734ca6710d364e17120a5eb37a224c64ab2b188* 5284
νDeposit ae01dade3a9c346d5c93ae3ce339412b90a0b8f83f94ec6baa24e30c 1102
  • The minting policy hash is only usable for comparison. As the script is parameterized, the actual script is unique per head.

Init transaction costs

Parties Tx size % max Mem % max CPU Min fee ₳
1 5836 10.64 3.38 0.52
2 6038 12.34 3.90 0.54
3 6239 14.72 4.66 0.58
5 6640 18.41 5.80 0.63
10 7646 29.00 9.14 0.79
43 14281 98.97 30.93 1.80

Commit transaction costs

This uses ada-only outputs for better comparability.

UTxO Tx size % max Mem % max CPU Min fee ₳
1 561 2.44 1.16 0.20
2 743 3.38 1.73 0.22
3 920 4.36 2.33 0.24
5 1280 6.41 3.60 0.28
10 2176 12.13 7.25 0.40
54 10068 98.61 68.52 1.88

CollectCom transaction costs

Parties UTxO (bytes) Tx size % max Mem % max CPU Min fee ₳
1 57 525 24.46 7.13 0.42
2 114 636 33.18 9.60 0.52
3 171 747 43.73 12.51 0.63
4 224 858 50.85 14.62 0.70
5 282 969 64.32 18.24 0.84
6 338 1081 65.18 18.95 0.86
7 395 1192 74.44 21.41 0.96
8 449 1303 87.15 24.94 1.09

Cost of Increment Transaction

Parties Tx size % max Mem % max CPU Min fee ₳
1 1785 24.29 7.69 0.48
2 1954 25.85 8.78 0.51
3 2068 27.40 9.88 0.53
5 2391 31.30 12.31 0.60
10 3175 41.04 18.38 0.75
40 7667 96.38 53.75 1.65

Cost of Decrement Transaction

Parties Tx size % max Mem % max CPU Min fee ₳
1 645 22.50 7.30 0.41
2 768 24.35 8.48 0.44
3 829 24.09 9.03 0.45
5 1268 30.15 12.07 0.54
10 2230 42.44 18.85 0.73
38 6082 93.84 51.77 1.55

Close transaction costs

Parties Tx size % max Mem % max CPU Min fee ₳
1 673 27.47 8.46 0.46
2 863 29.90 9.82 0.50
3 944 30.94 10.75 0.52
5 1249 35.04 13.25 0.58
10 2067 45.13 19.42 0.75
37 5891 95.71 51.56 1.55

Contest transaction costs

Parties Tx size % max Mem % max CPU Min fee ₳
1 666 33.83 10.16 0.53
2 825 35.85 11.38 0.56
3 979 38.59 12.82 0.60
5 1273 42.64 15.28 0.66
10 2127 55.97 22.38 0.86
28 4673 95.60 45.33 1.46

Abort transaction costs

There is some variation due to the random mixture of initial and already committed outputs.

Parties Tx size % max Mem % max CPU Min fee ₳
1 5795 27.13 9.11 0.69
2 5918 35.80 12.04 0.79
3 6107 44.89 15.06 0.89
4 6191 51.09 17.19 0.96
5 6604 66.70 22.65 1.14
6 6569 72.80 24.57 1.20
7 6689 79.54 26.73 1.28
8 6771 88.22 29.60 1.37

FanOut transaction costs

Involves spending head output and burning head tokens. Uses ada-only UTXO for better comparability.

Parties UTxO UTxO (bytes) Tx size % max Mem % max CPU Min fee ₳
10 0 0 5834 18.30 6.11 0.60
10 1 57 5869 21.41 7.28 0.63
10 10 569 6173 38.18 14.00 0.83
10 20 1138 6512 58.66 22.07 1.07
10 30 1704 6851 80.92 30.76 1.33
10 40 2277 7193 99.84 38.30 1.55
10 39 2221 7161 99.12 37.95 1.54

End-to-end benchmark results

This page is intended to collect the latest end-to-end benchmark results produced by Hydra's continuous integration (CI) system from the latest master code.

Please note that these results are approximate as they are currently produced from limited cloud VMs and not controlled hardware. Rather than focusing on the absolute results, the emphasis should be on relative results, such as how the timings for a scenario evolve as the code changes.

Generated at 2026-01-05 15:25:36.626686903 UTC

Baseline Scenario

Number of nodes 1
Number of txs 300
Avg. Confirmation Time (ms) 5.510206623
P99 9.466960909999965ms
P95 6.93772895ms
P50 5.2393695000000005ms
Number of Invalid txs 0

Three local nodes

Number of nodes 3
Number of txs 900
Avg. Confirmation Time (ms) 32.543640974
P99 50.21502475999999ms
P95 42.639221299999996ms
P50 31.453317ms
Number of Invalid txs 0

awcjack added 20 commits January 3, 2026 17:29
- Remove unused Monoid instance for DatumCache
- Remove unused functions: insertDatum, deleteDatum, cacheSize
- Remove unused extractDatumsFromUTxO and extractInlineDatum helpers
- Clean up redundant imports in Simple.hs and Node.hs

GitHub Actions runners have limited disk space (~14GB available).
When building uncached Nix derivations (like our modified hydra-node),
the build can exhaust disk space during compilation.

This adds a cleanup step that removes unused tools before the build:
- .NET SDK (~1.8GB)
- Android SDK (~9GB)
- GHC (~5GB)
- CodeQL (~2.5GB)
- Unused Docker images

This frees up ~20GB of disk space, ensuring builds complete successfully.

- Add pull_request trigger for PRs targeting master branch
- Tag PR builds as pr-<number> for easy identification
- Use PR head SHA as version for traceability

The datum cache feature strips inline datums from UTxO sets to save memory,
storing them in a separate cache. However, two StateChanged event handlers
were missing the stripDatums call, causing inconsistency between localUTxO
and datumCache:

- SnapshotRequested: Was assigning newLocalUTxO directly without stripping
- LocalStateCleared (ConfirmedSnapshot case): Was assigning snapshot.utxo
  directly without stripping

Both handlers now:
1. Call stripDatums on the UTxO to extract inline datums
2. Merge the extracted datums with the existing datumCache
3. Store only the stripped UTxO in localUTxO

This fixes the 'chain out of sync' runtime error that occurred after the
datum cache memory optimization was implemented.
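The strip-and-merge pattern described in this commit could look roughly like the generic sketch below. The stripping function and cache type are taken as parameters because the real ones live in Hydra.DatumCache; stripAndMerge is a name invented here for illustration.

```haskell
module StripAndMergeSketch where

import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Strip an incoming UTxO and merge the extracted datums into the existing
-- cache; only the stripped UTxO is kept in local state. 'Map.union' is
-- left-biased, so freshly extracted datums win on hash collisions.
stripAndMerge
  :: Ord hash
  => (utxo -> (utxo, Map hash datum))  -- a stripDatums-like function (assumed)
  -> Map hash datum                    -- existing datum cache
  -> utxo                              -- incoming UTxO (e.g. snapshot.utxo)
  -> (utxo, Map hash datum)            -- (stripped UTxO, merged cache)
stripAndMerge stripFn cache u =
  let (stripped, extracted) = stripFn u
   in (stripped, Map.union extracted cache)
```
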
When processing ReqSn (snapshot requests), the confirmedUTxO from the
confirmed snapshot has inline datums stripped (due to DatumCache optimization).
Before applying transactions via ledger validation, we must restore the
datums so that:

1. Script validation works correctly (scripts need inline datums)
2. The resulting UTxO hash matches what other parties compute
3. Subsequent SnapshotConfirmed events are emitted correctly

This fixes an issue where only the first SnapshotConfirmed event was
being emitted because transaction application was failing silently due
to missing datums in the UTxO set passed to applyTransactions.

The fix:
- Added HasDatumCache constraint to onOpenNetworkReqSn
- Extract datumCache from OpenState
- Create restoredConfirmedUTxO using restoreDatums before passing to
  requireApplicableDecommitTx and subsequently requireApplyTxs

L2 transactions don't require L1 chain awareness, so they should be
processed even when the node is temporarily behind on observing L1 blocks.
Other L1-dependent operations (Init, Close, Contest, etc.) remain blocked.

This fixes the 'chain out of sync' error that was rejecting all client
inputs when the node was briefly behind on L1, even though L2 transactions
operate independently of L1 state.

The handleSubmitL2Tx function was returning a plain JSON string for
request parsing errors, which caused clients (like Tonic/Go) to fail
parsing the response with 'cannot unmarshal string into Go value'.

Now returns SubmitTxRejectedResponse object with proper 'tag' and
'reason' fields, consistent with other error responses.

This prevents client-side parsing failures that led to transaction
retries, which in turn caused BadInputsUTxO errors when the original
transaction had already been confirmed in a snapshot.

Under high load with concurrent TX submissions, the HTTP handler for
POST /transaction was matching ANY RejectedInput for NewTx, not checking
if it was the specific transaction submitted. This caused false-negative
responses where a successful TX was reported as rejected because another
concurrent TX was rejected.

Now the handler checks txId transaction == txid before returning
SubmitTxRejected, preventing the race condition.
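The guarded match can be illustrated with a small self-contained example: while waiting for the outcome of one submitted transaction, rejections that carry a different transaction id are simply skipped. The Outcome type and waitForOutcome are hypothetical; the real handler works on Hydra's server output events.

```haskell
module SubmitGuardSketch where

import Data.List (find)

-- Possible outcomes observed on the event stream (illustrative).
data Outcome txid = Confirmed txid | Rejected txid String
  deriving (Eq, Show)

-- Return the first outcome that refers to the transaction we submitted,
-- ignoring confirmations/rejections for other concurrent transactions.
waitForOutcome :: Eq txid => txid -> [Outcome txid] -> Maybe (Outcome txid)
waitForOutcome wanted = find relevant
 where
  relevant (Confirmed t)  = t == wanted
  relevant (Rejected t _) = t == wanted
```
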
Adds a test that verifies the HTTP handler correctly ignores RejectedInput
events for different transactions. This ensures that when TX_A is submitted
and a RejectedInput for TX_B appears, the handler for TX_A continues waiting
and correctly returns success when TX_A is confirmed.

Add --datum-hot-cache-size CLI option to control datum cache memory usage.
This threads the configuration from RunOptions through Environment to
HeadLogic, where pruneCacheWithLimit applies size-based eviction after
each snapshot confirmation.

- Add datumHotCacheSize field to RunOptions (default: 100)
- Add datumHotCacheSize field to Environment
- Add pruneCacheWithLimit function to DatumCache module
- Update aggregate functions in HeadLogic to accept cache size config
- Pass cache size through processNextInput and aggregateState
- Update all test fixtures with datumHotCacheSize = 0 (unlimited)

Behavior:
- 0 = unlimited (UTxO-aligned pruning only)
- N > 0 = evict oldest entries (by hash order) when cache exceeds N
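A sketch of the size-bounded eviction described here is shown below (note that a later commit in this PR removes this eviction again, because it can drop datums still referenced by the current UTxO set). Data.Map keys are ordered, so "oldest by hash order" is modelled by dropping the smallest keys first; the function name is illustrative.

```haskell
module PruneLimitSketch where

import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- 0 means unlimited; otherwise keep at most n entries, evicting entries
-- with the smallest hashes first.
pruneWithLimit :: Int -> Map hash datum -> Map hash datum
pruneWithLimit 0 cache = cache
pruneWithLimit n cache
  | Map.size cache <= n = cache
  | otherwise           = Map.drop (Map.size cache - n) cache
```
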
- Fuse mapMaybe/map in DatumCache.hs (HLint warning)
- Add missing datumHotCacheSize field in hydra-cluster HydraNode.hs
- Reorder import in HeadLogic.hs (move Numeric.Natural after Hydra.Tx.Snapshot)
- Add datumHotCacheSize to Greetings/Greetings.json golden file
- Add datumHotCacheSize property to Environment schema in api.yaml

- Increase persistent broadcast queue capacity from 100 to 1000
- Increase gRPC put message timeout from 3s to 10s

These changes help prevent snapshot confirmation failures under high
transaction load by allowing more messages to queue and giving more
time for gRPC operations to complete.

- Add logCritical helper that always logs to stderr regardless of verbosity
- Log QueueNearCapacity when broadcast queue reaches 80% capacity
- Log ConsecutiveBroadcastFailures after 5+ consecutive failures
- Track consecutive broadcast failures with counter reset on success
- Add withCriticalTracer function for future use

This helps diagnose snapshot confirmation issues under high load
where one node may not be sending AckSn due to network problems.

…oss under high load

Under high transaction load, snapshot signatures were being lost because:
- All messages shared a single FIFO queue
- ReqSn/AckSn protocol messages got buried behind ReqTx transactions
- When AckSn arrived before local ReqSn was processed, it got re-enqueued
  to the back of the queue, causing signature collection to fail

Solution: Dual-queue system that processes protocol messages before transactions
- HighPriority: ReqSn, AckSn, ChainInput, ClientInput, ConnectivityEvent
- LowPriority: ReqTx, ReqDec

This ensures protocol state machine messages are never starved by transaction load.
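The dual-queue idea can be sketched with STM: drain the high-priority queue whenever it has something, and only fall back to the low-priority queue otherwise. The types and function names below are illustrative, not Hydra's actual network layer.

```haskell
module PriorityQueueSketch where

import Control.Concurrent.STM
  (STM, TQueue, orElse, readTQueue, writeTQueue)

-- Two queues: protocol messages vs. bulk transactions.
data PrioQueues a = PrioQueues
  { highQ :: TQueue a  -- e.g. ReqSn, AckSn, chain/client inputs
  , lowQ  :: TQueue a  -- e.g. ReqTx, ReqDec
  }

-- Enqueue according to a caller-supplied priority predicate.
enqueue :: (a -> Bool) -> PrioQueues a -> a -> STM ()
enqueue isHighPriority qs msg
  | isHighPriority msg = writeTQueue (highQ qs) msg
  | otherwise          = writeTQueue (lowQ qs) msg

-- Block until a message is available, always preferring the high-priority
-- queue so protocol messages are never starved by transaction load.
dequeue :: PrioQueues a -> STM a
dequeue qs = readTQueue (highQ qs) `orElse` readTQueue (lowQ qs)
```
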
- Remove unused ToJSON/FromJSON instances from MessagePriority
- Remove unused withCriticalTracer function from Logging module
- Remove unused Natural import from HeadLogic
- Add type signature for local binding to satisfy -Wmissing-local-signatures

Previously, pruneCacheWithLimit could evict datums that are still
referenced by the current UTxO set when the cache size exceeded
datumHotCacheSize. This caused MissingRequiredDatums errors when
validating transactions that consume UTxOs with inline datums.

The fix removes the evictToLimit logic entirely because after
pruneCache restricts the cache to only datums in the current UTxO
set, all remaining datums are required for transaction validation.
Evicting any of them would break the system.

The datumHotCacheSize parameter is kept for API compatibility but
now only serves as a monitoring hint - the actual cache size will
be equal to the number of UTxOs with inline datums in the current
state.
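The resulting behaviour, UTxO-aligned pruning only, can be sketched as restricting the cache to the datum hashes still referenced by the current UTxO set; every surviving entry is then guaranteed to be restorable when a transaction needs it. Names below are illustrative.

```haskell
module PruneAlignedSketch where

import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import Data.Set (Set)

-- Keep only cache entries whose hash is still referenced by the current
-- UTxO set, so no datum needed for transaction validation can be evicted.
pruneToReferenced :: Ord hash => Set hash -> Map hash datum -> Map hash datum
pruneToReferenced referenced cache = Map.restrictKeys cache referenced
```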