From 751222e562943152724a39c55710b82839756c06 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A5rten=20Blankfors?= Date: Fri, 13 Mar 2026 12:53:40 +0100 Subject: [PATCH 1/9] wip: Design doc --- docs/tdx-config-updates-design.md | 301 ++++++++++++++++++++++++++++++ 1 file changed, 301 insertions(+) create mode 100644 docs/tdx-config-updates-design.md diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md new file mode 100644 index 000000000..0b7a66a55 --- /dev/null +++ b/docs/tdx-config-updates-design.md @@ -0,0 +1,301 @@ +# Dynamic Configuration Updates in TDX Deployments + +**Status:** WIP / Design +**Issue:** #2420 +**Authors:** TBD +**Date:** 2026-03-13 + +## Problem Statement + +Today, MPC node configuration in TDX deployments is static: environment variables are passed via the dstack `user-config.conf` at CVM creation time, and the `config.yaml` file is generated once at first boot. There is **no mechanism to update either file while the node is running**. + +This is tolerable for settings that rarely change (account IDs, contract IDs, protocol timeouts). It becomes a serious operational bottleneck for **foreign chain validation**, where we expect frequent updates to: + +- Add new chains and RPC providers +- Rotate or add API keys for RPC providers +- Update RPC URLs when providers change endpoints + +Each such change currently requires a **full CVM restart** (stop CVM, update `user-config.conf`, start CVM), causing downtime for the node and potentially missing signature requests during the restart window. + +### Specific Pain Points + +1. **`config.yaml` is read once at startup.** The `ConfigFile::from_file()` call in `config.rs` reads the YAML from `$MPC_HOME_DIR/config.yaml` at boot. There is no file watcher or reload mechanism. Changes require a process restart. + +2. **`foreign_chains` config lives in `config.yaml`.** Chain definitions, provider URLs, and auth configuration are all part of the static `ConfigFile` struct. 
They cannot be updated without rewriting `config.yaml` and restarting the node. + +3. **API keys are resolved from environment variables at signing time** (via `TokenConfig::Env`), but the environment itself is fixed at container creation. New env vars cannot be injected into a running container. + +4. **`user-config.conf` changes require CVM restart.** Dstack's `update-user-config` command followed by stop/start is the only supported mechanism. The launcher re-reads the file only on boot. + +5. **No confidential channel for secrets.** The `user-config.conf` is stored in `/tapp/user_config` which is an unmeasured dstack input. While convenient, there is no encrypted or attestation-protected path for delivering API keys to the node. + +## Current Architecture + +### Configuration Flow + +``` +Operator writes user-config.conf + | + v + dstack VMM deploys CVM + | + v + Launcher (launcher.py) reads /tapp/user_config + | + +--> Launcher-only vars: MPC_IMAGE_NAME, MPC_REGISTRY, MPC_IMAGE_TAGS + +--> Passthrough vars: MPC_* env vars, RUST_LOG, NEAR_BOOT_NODES + | + v + Launcher starts MPC container with env vars + | + v + MPC node startup (cli.rs -> run.rs): + +--> Reads $MPC_HOME_DIR/config.yaml (or generates default) + +--> Reads secrets.json (p2p key, signer key) + +--> Starts indexer, coordinator, web server + +--> Starts allowed_image_hashes_watcher (writes to /mnt/shared/) +``` + +### What CAN Change at Runtime Today + +- **MPC Docker image hash**: The `allowed_image_hashes_watcher` monitors the contract for approved image hashes, writes them to `/mnt/shared/image-digest.bin`, and the launcher reads this on next CVM boot. This is the only existing "dynamic update" pattern. 
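The crash-safe write discipline behind `image-digest.bin` (write to a temporary file in the same directory, then atomically rename it over the destination) is worth spelling out, since any config watcher would reuse it. Below is a minimal sketch in Python; the node implements this in Rust, and the helper name and demo path here are ours, not from the codebase:

```python
import os
import tempfile

def write_atomically(dest: str, data: bytes) -> None:
    """Write `data` to `dest` so readers never observe a partial file.

    The bytes go to a temp file in the same directory (same filesystem),
    are flushed and fsynced, then the temp file is renamed over the
    destination. On POSIX, rename is atomic, so a crash mid-write leaves
    either the old file or the new one -- never a torn mix.
    """
    directory = os.path.dirname(dest) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, dest)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

# Demo with a stand-in digest file.
dest = os.path.join(tempfile.mkdtemp(), "image-digest.bin")
write_atomically(dest, b"sha256:0123abcd")
with open(dest, "rb") as f:
    content = f.read()
print(content.decode())
```

The temp file must live in the same directory as the destination: `os.replace` (like `rename(2)`) is only atomic within a single filesystem.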
+ +### What CANNOT Change at Runtime + +- `config.yaml` contents (all protocol parameters, foreign chains config) +- Environment variables in the MPC container +- API keys for RPC providers +- The set of supported foreign chains + +### TEE/Attestation Constraints + +Understanding what is and isn't measured is critical for designing a solution: + +| Component | Measured in | Can change without breaking attestation? | +|-----------|-------------|------------------------------------------| +| Launcher docker image | RTMR3 (extended by launcher) | No | +| Launcher docker-compose | RTMR3 (extended by launcher) | No | +| MPC docker image hash | RTMR3 (extended by launcher) | No (but approved list is dynamic) | +| vCPU, Memory | RTMR2 | No | +| Guest OS / dstack version | MRTD, RTMR0-2 | No | +| `user-config.conf` | **Unmeasured** | Yes | +| `/mnt/shared/` contents | **Unmeasured** (encrypted at rest) | Yes | +| `config.yaml` | **Unmeasured** (inside encrypted CVM disk) | Yes (if a mechanism exists) | + +Key insight: **`config.yaml` and environment variables live inside the CVM's encrypted filesystem, which is not individually measured.** The attestation guarantees that the correct code is running, but the config data itself is not part of the measurement. This means we can update configuration without breaking attestation -- the challenge is getting the data into the running process. + +## Proposed Solutions + +### Option A: File-Based Config Hot-Reload (Recommended) + +Add a file watcher to the MPC node that monitors `config.yaml` for changes, and provide a mechanism to update the file from outside the CVM. + +#### Design + +1. 
**Config file watcher in the MPC node:**
+   - Add a file watcher on `config.yaml` (the `notify` crate or a simple polling loop; tokio itself provides no filesystem-watch API)
+   - On file change, re-parse and validate the new config
+   - Apply changes to `foreign_chains` (and potentially other safe-to-reload fields) without restart
+   - Reject invalid configs with a warning log, keeping the old config active
+   - Follow the existing pattern from `allowed_image_hashes_watcher.rs` for crash-safe atomic file writes
+
+2. **Config update delivery via shared volume:**
+   - Extend the `/mnt/shared/` volume pattern already used for `image-digest.bin`
+   - The launcher (or a new sidecar) watches for config update files on the shared volume
+   - The launcher writes the validated config into the MPC container's data volume
+   - Alternatively, the node itself watches a "config overlay" file on `/mnt/shared/`
+
+3. **Operator workflow:**
+   - Operator updates `user-config.conf` with new foreign chain settings
+   - Operator calls `vmm-cli.py update-user-config` (no CVM restart needed)
+   - Launcher detects the change and writes the new config to the shared volume
+   - MPC node picks up the change via file watcher
+
+#### What Changes Can Be Hot-Reloaded
+
+| Config field | Hot-reloadable? 
| Notes |
+|-------------|-----------------|-------|
+| `foreign_chains` | Yes | Primary use case |
+| `triple`, `presignature`, `signature` timeouts | Potentially | Low risk, but needs careful handling of in-flight protocols |
+| `my_near_account_id` | No | Fundamental identity, requires restart |
+| `indexer` settings | No | Requires indexer restart |
+| `web_ui`, `pprof_bind_address` | No | Bound at startup |
+
+#### API Key Delivery
+
+For API keys referenced via `TokenConfig::Env`:
+- New env var values can be written to a `.env` file on the shared volume
+- The node reads this file when resolving `TokenConfig::Env` tokens at signing time
+- This avoids the need to inject env vars into a running container
+
+#### Pros
+- Follows existing patterns (`allowed_image_hashes_watcher`)
+- No new infrastructure (dstack, contract changes)
+- Operator workflow is simple (update file, no restart)
+- Gradual rollout: start with `foreign_chains` only, expand later
+- API keys stay within the CVM's encrypted filesystem
+
+#### Cons
+- Requires MPC node code changes (file watcher, partial config reload)
+- Need to carefully define which fields are safe to hot-reload
+- Launcher changes needed to relay config updates
+- No consensus mechanism: each operator updates independently (but foreign chain policy voting already handles consensus for the chain list)
+
+#### Implementation Sketch
+
+```rust
+// New module: crates/node/src/config/watcher.rs
+pub async fn watch_config_file(
+    config_path: PathBuf,
+    foreign_chains_sender: watch::Sender<ForeignChainsConfig>,
+    cancellation_token: CancellationToken,
+) -> Result<(), ConfigWatchError> {
+    // Use notify crate or poll-based approach
+    // On change: parse, validate, send update via channel
+    // Consumers (coordinator, providers) receive via watch::Receiver
+}
+```
+
+```rust
+// In coordinator.rs, replace static config read with watch channel
+let foreign_chains_config = foreign_chains_receiver.borrow().clone();
+```
+
+For API key delivery, a 
separate `.env` file approach:
+
+```rust
+// In auth.rs, modify token resolution
+impl TokenConfig {
+    pub fn resolve(&self) -> Result<String, AuthError> {
+        match self {
+            TokenConfig::Val { val } => Ok(val.clone()),
+            TokenConfig::Env { env } => {
+                // First check override file, then fall back to process env
+                if let Some(val) = read_env_override_file(env)? {
+                    Ok(val)
+                } else {
+                    std::env::var(env).map_err(|_| AuthError::MissingEnvVar(env.clone()))
+                }
+            }
+        }
+    }
+}
+```
+
+---
+
+### Option B: Contract-Driven Configuration
+
+Store configuration (chain definitions, RPC URLs) on the contract and have nodes read it via the indexer, similar to how `foreign_chain_policy` already works.
+
+#### Design
+
+1. **Extend the contract** with a new `node_config` or `foreign_chains_config` field
+2. **Operators vote** on config changes (similar to `vote_foreign_chain_policy`)
+3. **Nodes read** the config from the contract via the indexer
+4. **API keys** still need a local mechanism (they cannot go on-chain)
+
+#### Pros
+- Consensus built-in: all operators must agree on config changes
+- Single source of truth for chain definitions
+- Already partially implemented: `vote_foreign_chain_policy` exists
+
+#### Cons
+- API keys cannot be stored on-chain (secrets must remain local)
+- Slow iteration: every config change requires a voting round
+- RPC URLs are somewhat operator-specific (different providers, different API tiers)
+- Over-engineers the problem: not all config should require consensus
+- The existing `foreign_chain_policy` already handles the consensus part (which chains/URLs are accepted); the local config is for operator-specific settings like auth
+
+---
+
+### Option C: Sidecar Config Service
+
+Run a lightweight sidecar container alongside the MPC node that exposes an HTTP API for config updates, protected by mutual TLS or the dstack attestation mechanism.
+
+#### Design
+
+1. **Config sidecar** runs in the same CVM, shares the data volume
+2. 
**Exposes an API** (e.g., `POST /config/foreign_chains`) for config updates +3. **Writes config** atomically to the shared volume +4. **MPC node** watches the config file (same as Option A) +5. **Authentication** via the CVM's TLS certificate or a shared secret + +#### Pros +- Clean API for config updates (could integrate with CI/CD) +- Sidecar can validate and merge configs before writing +- Could support encrypted config delivery via TLS + +#### Cons +- Additional container to maintain and measure +- New attack surface (API endpoint inside the CVM) +- Launcher compose changes = new attestation measurements +- Over-engineered for the current needs + +--- + +### Option D: Periodic Config Polling from External Source + +The MPC node periodically fetches configuration from an external source (S3 bucket, HTTP endpoint, etc.). + +#### Design + +1. **Node polls** an operator-defined URL for config updates +2. **Config is signed** by the operator's key to prevent tampering +3. **Node applies** validated config changes + +#### Pros +- No CVM restart or dstack interaction needed +- Works with existing CI/CD and secret management tools + +#### Cons +- Introduces external dependency (what if the config server is down?) +- Needs a signing/verification scheme for config integrity +- Network access from within CVM may be restricted +- Significant new code for a simple problem + +## Recommendation + +**Option A (File-Based Config Hot-Reload)** is recommended as the primary approach, with the existing contract-based foreign chain policy voting (which is already implemented) providing consensus for the chain/provider list. + +### Rationale + +1. **Follows existing patterns**: The `allowed_image_hashes_watcher` already demonstrates the file-watch + shared-volume pattern. We'd be extending a proven approach. + +2. **Minimal infrastructure changes**: No new containers, no contract changes, no external services. The main work is in the MPC node code. + +3. 
**Separation of concerns**: The contract handles consensus (which chains are accepted), while local config handles operator-specific details (API keys, provider preferences, timeouts). + +4. **Incremental delivery**: Start with `foreign_chains` hot-reload only. Expand to other config fields later if needed. + +5. **API key handling**: The `.env` override file approach is simple and keeps secrets within the CVM's encrypted filesystem. + +### Proposed Implementation Plan + +#### Phase 1: Config Hot-Reload in MPC Node +- Add a config file watcher for `config.yaml` (or a dedicated `foreign_chains.yaml`) +- Implement partial config reload for `foreign_chains` section +- Add an env-override file mechanism for API key updates +- Add metrics/logging for config reload events + +#### Phase 2: Launcher Support for Config Updates +- Extend the launcher to relay `user-config.conf` changes to the node's config files +- Support writing env override files from `user-config.conf` entries +- Document the operator workflow for config updates without CVM restart + +#### Phase 3: Operator Tooling +- Update `deploy-launcher.sh` and `vmm-cli.py` workflows +- Add a dedicated config update command/script +- Update the external operator guide with the new workflow + +### Open Questions + +1. **Should we use a separate file for hot-reloadable config?** Using a dedicated `foreign_chains.yaml` (or `dynamic_config.yaml`) would make it clearer which fields support hot-reload and avoid the risk of operators editing non-reloadable fields expecting them to take effect. + +2. **How should the launcher relay config changes?** The launcher currently only runs at boot. Should it run a background loop watching for `user-config.conf` changes, or should we use a different mechanism (e.g., the node watches a config file on `/mnt/shared/` directly)? + +3. 
**Do we need config change auditing?** Should config changes be logged to an append-only file or reported via the `/public_data` endpoint for observability? + +4. **What is the interaction with `vote_foreign_chain_policy`?** Currently, the node votes its local `foreign_chains` config as the foreign chain policy. If the config is hot-reloaded, should the node automatically re-vote? This seems desirable but needs careful handling to avoid vote spam. + +5. **Should API keys be deliverable via dstack encrypted env vars?** Dstack supports encrypted environment variables via KMS, but we currently don't use KMS. If we adopt KMS in the future, this could be a cleaner path for secret delivery. From b491579bc89e7f0222dfeec179b93ba75f70d52b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A5rten=20Blankfors?= Date: Fri, 13 Mar 2026 13:07:22 +0100 Subject: [PATCH 2/9] Fix: Correcting problem statement --- docs/tdx-config-updates-design.md | 582 +++++++++++++++++++++--------- 1 file changed, 413 insertions(+), 169 deletions(-) diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md index 0b7a66a55..be40ce657 100644 --- a/docs/tdx-config-updates-design.md +++ b/docs/tdx-config-updates-design.md @@ -7,171 +7,346 @@ ## Problem Statement -Today, MPC node configuration in TDX deployments is static: environment variables are passed via the dstack `user-config.conf` at CVM creation time, and the `config.yaml` file is generated once at first boot. There is **no mechanism to update either file while the node is running**. +MPC nodes running in TDX have **no way to receive arbitrary configuration**, most critically the `foreign_chains` section needed for foreign transaction validation. This is not merely a "hot-reload" problem -- the configuration literally cannot be delivered into the CVM at all with the current architecture. 
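For concreteness, the only operator-facing input today is `user-config.conf`, a flat `.env`-style file; the values below are illustrative, not taken from a real deployment:

```
# user-config.conf -- flat KEY=VALUE pairs only
MPC_ACCOUNT_ID=mpc-node-0.testnet
MPC_CONTRACT_ID=v1.signer.testnet
MPC_IMAGE_TAGS=latest
RUST_LOG=info
```

Nested structures (a per-chain provider list, auth configuration, API key references) simply have no encoding in this format.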
+
+### The Core Gap
+
+The `config.yaml` file that the MPC node reads at startup is generated by `deployment/start.sh` using a **hardcoded template** that only includes a fixed set of fields derived from environment variables:
+
+```bash
+# From deployment/start.sh:initialize_mpc_config()
+cat <<EOF >"$1"
+my_near_account_id: $MPC_ACCOUNT_ID
+near_responder_account_id: $responder_id
+...
+indexer:
+  ...
+  mpc_contract_id: $MPC_CONTRACT_ID
+  ...
+EOF
+```
 
-This is tolerable for settings that rarely change (account IDs, contract IDs, protocol timeouts). It becomes a serious operational bottleneck for **foreign chain validation**, where we expect frequent updates to:
+This template has **no `foreign_chains` section**. There is no mechanism to inject one, because:
 
-- Add new chains and RPC providers
-- Rotate or add API keys for RPC providers
-- Update RPC URLs when providers change endpoints
+1. **`start.sh` cannot template arbitrary YAML.** It uses simple `cat` heredocs and `sed` substitutions. The `foreign_chains` config is deeply nested YAML with per-chain provider lists, auth configs, and API key references -- it cannot be expressed as flat `KEY=VALUE` environment variables.
 
-Each such change currently requires a **full CVM restart** (stop CVM, update `user-config.conf`, start CVM), causing downtime for the node and potentially missing signature requests during the restart window.
+2. **The launcher only passes flat env vars.** The launcher (`launcher.py`) reads `user-config.conf` (a flat `.env` file) and passes matching `MPC_*` keys as `--env` flags to `docker run`. There is no mechanism for structured config data.
 
-### Specific Pain Points
+3. **`config.yaml` is only generated once.** On first boot, `start.sh` generates the file. On subsequent boots, `update_mpc_config()` only updates `my_near_account_id`, `mpc_contract_id`, and `near_responder_account_id` via `sed`. 
Any manual edits to `config.yaml` inside the CVM persist across restarts, but there is no external interface to make such edits. -1. **`config.yaml` is read once at startup.** The `ConfigFile::from_file()` call in `config.rs` reads the YAML from `$MPC_HOME_DIR/config.yaml` at boot. There is no file watcher or reload mechanism. Changes require a process restart. +4. **No volume path for config files.** The CVM has two relevant volumes: `mpc-data:/data` (the MPC home dir, containing `config.yaml`) and `shared-volume:/mnt/shared` (shared between launcher and node). Neither is exposed to the operator for file injection after deployment. -2. **`foreign_chains` config lives in `config.yaml`.** Chain definitions, provider URLs, and auth configuration are all part of the static `ConfigFile` struct. They cannot be updated without rewriting `config.yaml` and restarting the node. +### Immediate Consequences -3. **API keys are resolved from environment variables at signing time** (via `TokenConfig::Env`), but the environment itself is fixed at container creation. New env vars cannot be injected into a running container. +- **Foreign transaction validation is blocked on TDX.** We cannot deploy nodes with `foreign_chains` config, which means TDX nodes cannot participate in foreign tx validation. This blocks testnet migration to TDX. -4. **`user-config.conf` changes require CVM restart.** Dstack's `update-user-config` command followed by stop/start is the only supported mechanism. The launcher re-reads the file only on boot. +- **API keys cannot be delivered.** Foreign chain providers requiring authentication (Alchemy, QuickNode, etc.) need API keys, delivered either as `TokenConfig::Env` (env var reference) or `TokenConfig::Val` (inline value). Neither path works: env vars are fixed at container creation, and inline values would need to be in `config.yaml` which can't be populated. -5. 
**No confidential channel for secrets.** The `user-config.conf` is stored in `/tapp/user_config` which is an unmeasured dstack input. While convenient, there is no encrypted or attestation-protected path for delivering API keys to the node. +- **No config updates of any kind.** Even non-foreign-chain config changes (e.g., adjusting triple/presignature concurrency, changing boot nodes) require stopping the CVM, updating `user-config.conf`, and restarting -- which causes downtime. -## Current Architecture +## Current Architecture in Detail -### Configuration Flow +### Deployment Flow ``` -Operator writes user-config.conf +Operator writes user-config.conf (flat KEY=VALUE) | v - dstack VMM deploys CVM + dstack VMM deploys CVM with launcher docker-compose | v - Launcher (launcher.py) reads /tapp/user_config + Launcher container starts (launcher.py) | - +--> Launcher-only vars: MPC_IMAGE_NAME, MPC_REGISTRY, MPC_IMAGE_TAGS - +--> Passthrough vars: MPC_* env vars, RUST_LOG, NEAR_BOOT_NODES + +--> Reads /tapp/user_config (the user-config.conf file) + +--> Selects & validates MPC docker image hash + +--> Extends RTMR3 with image hash (TEE attestation) + +--> Builds `docker run` command: + | --env MPC_ACCOUNT_ID=... + | --env MPC_CONTRACT_ID=... + | --env MPC_HOME_DIR=/data + | --env MPC_IMAGE_HASH=... + | --env DSTACK_ENDPOINT=... 
+ | -v mpc-data:/data + | -v shared-volume:/mnt/shared + | (image digest) | v - Launcher starts MPC container with env vars + MPC node container starts → /app/start.sh runs + | + +--> First boot: generates config.yaml from env vars (hardcoded template) + | (NO foreign_chains, NO custom fields) + +--> Subsequent boots: sed-updates account_id/contract_id only + +--> Generates secrets.json if missing (p2p key, signer key) + +--> Runs: /app/mpc-node start [local|dstack] | v - MPC node startup (cli.rs -> run.rs): - +--> Reads $MPC_HOME_DIR/config.yaml (or generates default) - +--> Reads secrets.json (p2p key, signer key) + mpc-node process (cli.rs → run.rs): + +--> Reads config.yaml from $MPC_HOME_DIR/config.yaml + +--> Reads secrets.json +--> Starts indexer, coordinator, web server +--> Starts allowed_image_hashes_watcher (writes to /mnt/shared/) ``` -### What CAN Change at Runtime Today +### Volume Layout Inside the CVM + +``` +/data/ (mpc-data volume, persistent across restarts) +├── config.yaml (generated by start.sh, read by mpc-node) +├── config.json (near node config) +├── secrets.json (p2p key, signer key -- generated inside CVM) +├── data/ (near indexer state) +└── backup_encryption_key.hex + +/mnt/shared/ (shared-volume, shared between launcher and node) +└── image-digest.bin (written by node, read by launcher -- approved image hashes) + +/tapp/ (dstack app config, read-only in launcher container) +└── user_config (the user-config.conf file) +``` -- **MPC Docker image hash**: The `allowed_image_hashes_watcher` monitors the contract for approved image hashes, writes them to `/mnt/shared/image-digest.bin`, and the launcher reads this on next CVM boot. This is the only existing "dynamic update" pattern. 
+### What start.sh Actually Generates + +The `initialize_mpc_config()` function in `start.sh` produces: + +```yaml +my_near_account_id: +near_responder_account_id: +number_of_responder_keys: 50 +web_ui: 0.0.0.0:8080 +migration_web_ui: 0.0.0.0:8079 +pprof_bind_address: 0.0.0.0:34001 +triple: + concurrency: 2 + desired_triples_to_buffer: 1000000 + timeout_sec: 60 + parallel_triple_generation_stagger_time_sec: 1 +presignature: + concurrency: 16 + desired_presignatures_to_buffer: 8192 + timeout_sec: 60 +signature: + timeout_sec: 60 +ckd: + timeout_sec: 60 +indexer: + validate_genesis: false + sync_mode: Latest + concurrency: 1 + mpc_contract_id: + finality: optimistic + port_override: 80 # added via sed for non-localnet +cores: 12 +# NOTE: NO foreign_chains section +``` -### What CANNOT Change at Runtime +On subsequent boots, `update_mpc_config()` runs: +```bash +sed -i "s/my_near_account_id:.*/my_near_account_id: $MPC_ACCOUNT_ID/" "$1" +sed -i "s/mpc_contract_id:.*/mpc_contract_id: $MPC_CONTRACT_ID/" "$1" +sed -i "s/near_responder_account_id:.*/near_responder_account_id: $responder_id/" "$1" +``` -- `config.yaml` contents (all protocol parameters, foreign chains config) -- Environment variables in the MPC container -- API keys for RPC providers -- The set of supported foreign chains +Nothing else is updated. The `foreign_chains` section, if somehow manually added, would persist -- but there is no way to add it from outside the CVM. -### TEE/Attestation Constraints +### What the Launcher Allows Through -Understanding what is and isn't measured is critical for designing a solution: +The launcher (`launcher.py`) passes env vars to the MPC container with strict filtering: -| Component | Measured in | Can change without breaking attestation? 
| +- **Allowed keys:** `MPC_*` matching regex `^MPC_[A-Z0-9_]{1,64}$`, plus `RUST_LOG`, `RUST_BACKTRACE`, `NEAR_BOOT_NODES` +- **Denied keys:** `MPC_P2P_PRIVATE_KEY`, `MPC_ACCOUNT_SK` +- **Launcher-only keys (not passed through):** `MPC_IMAGE_TAGS`, `MPC_IMAGE_NAME`, `MPC_REGISTRY`, `MPC_HASH_OVERRIDE`, `RPC_*` +- **Special handling:** `PORTS` → `-p` flags, `EXTRA_HOSTS` → `--add-host` flags +- **Limits:** max 64 vars, max 1024 bytes per value, max 32KB total + +This is all flat key-value. No structured data can pass through. + +### TEE/Attestation Constraints + +| Component | Measured in | Changeable without breaking attestation? | |-----------|-------------|------------------------------------------| -| Launcher docker image | RTMR3 (extended by launcher) | No | -| Launcher docker-compose | RTMR3 (extended by launcher) | No | -| MPC docker image hash | RTMR3 (extended by launcher) | No (but approved list is dynamic) | +| Launcher docker image | RTMR3 | No | +| Launcher docker-compose | RTMR3 | No | +| MPC docker image hash | RTMR3 | No (but approved list is dynamic) | | vCPU, Memory | RTMR2 | No | -| Guest OS / dstack version | MRTD, RTMR0-2 | No | -| `user-config.conf` | **Unmeasured** | Yes | -| `/mnt/shared/` contents | **Unmeasured** (encrypted at rest) | Yes | -| `config.yaml` | **Unmeasured** (inside encrypted CVM disk) | Yes (if a mechanism exists) | +| Guest OS / dstack | MRTD, RTMR0-2 | No | +| `user-config.conf` | **Not measured** | Yes | +| `/mnt/shared/` contents | **Not measured** (encrypted at rest) | Yes | +| `config.yaml` | **Not measured** (inside encrypted CVM disk) | Yes (if we can get data in) | + +Key insight: **`config.yaml` lives on the encrypted CVM disk and is not individually measured.** The attestation verifies that the correct *code* is running, not the *config data*. So updating config does not break attestation -- the challenge is purely mechanical: getting structured config data into the running node. 
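Whatever delivery mechanism is chosen, the receiving side reduces to the same validate-then-swap loop: notice a change, parse and validate it, and only then publish it, so an invalid update can never evict the last good config. Below is a poll-based sketch in Python; the node would implement this in Rust, all helper names are ours, and JSON stands in for YAML validation only to keep the example dependency-free:

```python
import hashlib
import json
import os
import tempfile
import time

def watch_for_changes(path, validate, apply, poll_interval=0.01, max_rounds=None):
    """Poll `path`; when its bytes change, validate and apply the new config.

    Invalid updates are logged and skipped, so the last good config stays
    active. `max_rounds` bounds the loop for the demo; a real watcher runs
    until cancelled.
    """
    last_digest = None
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        try:
            with open(path, "rb") as f:
                data = f.read()
        except FileNotFoundError:
            time.sleep(poll_interval)
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest != last_digest:
            last_digest = digest
            try:
                apply(validate(data))  # only published after validation succeeds
            except ValueError as err:
                print(f"invalid config, keeping current: {err}")
        time.sleep(poll_interval)

# Demo: one valid update is applied exactly once across several polls.
applied = []
overlay = os.path.join(tempfile.mkdtemp(), "config-overlay.json")
with open(overlay, "w") as f:
    f.write('{"foreign_chains": ["bitcoin"]}')
watch_for_changes(overlay, validate=json.loads, apply=applied.append, max_rounds=3)
print(applied)
```

Hashing the file contents (rather than trusting mtime) makes the loop robust to editors and atomic-rename writers that preserve timestamps.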
+ +### The Existing Dynamic Update Pattern + +The `allowed_image_hashes_watcher` (`crates/node/src/tee/allowed_image_hashes_watcher.rs`) provides a working pattern for runtime data updates: -Key insight: **`config.yaml` and environment variables live inside the CVM's encrypted filesystem, which is not individually measured.** The attestation guarantees that the correct code is running, but the config data itself is not part of the measurement. This means we can update configuration without breaking attestation -- the challenge is getting the data into the running process. +1. **Source:** The indexer monitors the contract for approved image hash changes +2. **Delivery:** Changes arrive via a `watch::Receiver>` channel +3. **Storage:** The watcher writes the hash list atomically to `/mnt/shared/image-digest.bin` (write to `.tmp`, then `rename`) +4. **Consumer:** The launcher reads this file on next boot to select which image to run + +This works well for data that originates from the contract. For operator-specific configuration like foreign chain providers and API keys, we need a different delivery mechanism. ## Proposed Solutions -### Option A: File-Based Config Hot-Reload (Recommended) +### Option A: Extend start.sh + Shared Volume Config File (Recommended) -Add a file watcher to the MPC node that monitors `config.yaml` for changes, and provide a mechanism to update the file from outside the CVM. +Solve both the "initial delivery" and "runtime update" problems by: +1. Extending `start.sh` to support a config overlay file +2. Having the node watch that file for changes #### Design -1. 
**Config file watcher in the MPC node:** - - Add a `tokio::fs::watch` (or `notify` crate) watcher on `config.yaml` - - On file change, re-parse and validate the new config - - Apply changes to `foreign_chains` (and potentially other safe-to-reload fields) without restart - - Reject invalid configs with a warning log, keeping the old config active - - Follow the existing pattern from `allowed_image_hashes_watcher.rs` for crash-safe atomic file writes - -2. **Config update delivery via shared volume:** - - Extend the `/mnt/shared/` volume pattern already used for `image-digest.bin` - - The launcher (or a new sidecar) watches for config update files on the shared volume - - The launcher writes the validated config into the MPC container's data volume - - Alternatively, the node itself watches a "config overlay" file on `/mnt/shared/` - -3. **Operator workflow:** - - Operator updates `user-config.conf` with new foreign chain settings - - Operator calls `vmm-cli.py update-user-config` (no CVM restart needed) - - Launcher detects the change and writes the new config to the shared volume - - MPC node picks up the change via file watcher - -#### What Changes Can Be Hot-Reloaded - -| Config field | Hot-reloadable? 
| Notes | -|-------------|-----------------|-------| -| `foreign_chains` | Yes | Primary use case | -| `triple`, `presignature`, `signature` timeouts | Potentially | Low risk, but needs careful handling of in-flight protocols | -| `my_near_account_id` | No | Fundamental identity, requires restart | -| `indexer` settings | No | Requires indexer restart | -| `web_ui`, `pprof_bind_address` | No | Bound at startup | - -#### API Key Delivery - -For API keys referenced via `TokenConfig::Env`: -- New env var values can be written to a `.env` file on the shared volume -- The node reads this file when resolving `TokenConfig::Env` tokens at signing time -- This avoids the need to inject env vars into a running container +**Part 1: Initial config delivery via `start.sh`** + +Modify `start.sh` to merge an optional config overlay file into `config.yaml` after generating the base template: + +```bash +# In start.sh, after initialize_mpc_config / update_mpc_config: +MPC_CONFIG_OVERLAY="/mnt/shared/config-overlay.yaml" +if [ -f "$MPC_CONFIG_OVERLAY" ]; then + echo "Merging config overlay from $MPC_CONFIG_OVERLAY" + # Use python3 (already available in the image) to deep-merge YAML + python3 -c " +import yaml, sys +base = yaml.safe_load(open('$MPC_NODE_CONFIG_FILE')) +overlay = yaml.safe_load(open('$MPC_CONFIG_OVERLAY')) +# Deep merge: overlay wins for conflicting keys +def merge(b, o): + for k, v in o.items(): + if k in b and isinstance(b[k], dict) and isinstance(v, dict): + merge(b[k], v) + else: + b[k] = v +merge(base, overlay) +yaml.dump(base, open('$MPC_NODE_CONFIG_FILE', 'w'), default_flow_style=False) +" +fi +``` -#### Pros -- Follows existing patterns (`allowed_image_hashes_watcher`) -- No new infrastructure (dstack, contract changes) -- Operator workflow is simple (update file, no restart) -- Gradual rollout: start with `foreign_chains` only, expand later -- API keys stay within the CVM's encrypted filesystem +The operator places `config-overlay.yaml` on the shared volume 
via an update to `user-config.conf` or a direct file write. Example overlay: + +```yaml +foreign_chains: + bitcoin: + timeout_sec: 30 + max_retries: 3 + providers: + public: + api_variant: esplora + rpc_url: "https://blockstream.info/api" + auth: + kind: none + ethereum: + timeout_sec: 30 + max_retries: 3 + providers: + alchemy: + api_variant: alchemy + rpc_url: "https://eth-mainnet.g.alchemy.com/v2/" + auth: + kind: header + name: Authorization + scheme: Bearer + token: + env: ALCHEMY_API_KEY +``` -#### Cons -- Requires MPC node code changes (file watcher, partial config reload) -- Need to carefully define which fields are safe to hot-reload -- Launcher changes needed to relay config updates -- No consensus mechanism: each operator updates independently (but foreign chain policy voting already handles consensus for the chain list) +**Part 2: Delivery mechanism for the overlay file** + +The overlay file lives on `/mnt/shared/`, which is a Docker volume shared between the launcher container and the MPC node container. We need a way to write to it: + +**Option A1: Launcher writes overlay from user-config.conf** + +Add a convention: any line in `user-config.conf` starting with `CONFIG_OVERLAY_BASE64=` contains a base64-encoded YAML overlay. The launcher decodes it and writes it to `/mnt/shared/config-overlay.yaml`. 
+
+```python
+# In launcher.py, before launching the MPC container:
+overlay_b64 = dstack_config.get("CONFIG_OVERLAY_BASE64")
+if overlay_b64:
+    import base64
+    overlay_yaml = base64.b64decode(overlay_b64).decode("utf-8")
+    with open("/mnt/shared/config-overlay.yaml", "w") as f:
+        f.write(overlay_yaml)
+```
+
+Operator workflow:
+```bash
+# Encode the overlay
+CONFIG_OVERLAY_BASE64=$(base64 -w0 < my-foreign-chains.yaml)
+# Add to user-config.conf
+echo "CONFIG_OVERLAY_BASE64=$CONFIG_OVERLAY_BASE64" >> user-config.conf
+# Update the CVM config (requires launcher to mount shared-volume as rw)
+vmm-cli.py update-user-config user-config.conf
+```
+
+**Option A2: Direct file write via dstack's file injection**
+
+If dstack supports writing files to CVM volumes directly (e.g., via the VMM API), the operator could write the overlay file without going through the launcher. This needs investigation into dstack capabilities.
 
-#### Implementation Sketch
+**Part 3: Runtime config reload in the MPC node**
+
+Add a file watcher to the MPC node that monitors the overlay file for changes:
 
 ```rust
-// New module: crates/node/src/config/watcher.rs
-pub async fn watch_config_file(
-    config_path: PathBuf,
+// New: crates/node/src/config/watcher.rs
+// Watches /mnt/shared/config-overlay.yaml for changes
+// On change: re-reads, validates, and updates ForeignChainsConfig via watch channel
+pub async fn watch_config_overlay(
+    overlay_path: PathBuf,
+    home_dir: PathBuf,
     foreign_chains_sender: watch::Sender<ForeignChainsConfig>,
     cancellation_token: CancellationToken,
 ) -> Result<(), ConfigWatchError> {
-    // Use notify crate or poll-based approach
-    // On change: parse, validate, send update via channel
-    // Consumers (coordinator, providers) receive via watch::Receiver
+    loop {
+        select!
{
+            _ = cancellation_token.cancelled() => break Ok(()),
+            _ = wait_for_file_change(&overlay_path) => {
+                match load_and_validate_overlay(&overlay_path) {
+                    Ok(overlay) => {
+                        // Merge overlay into config.yaml on disk
+                        merge_into_config_file(&home_dir, &overlay)?;
+                        // Update in-memory config
+                        foreign_chains_sender.send_replace(overlay.foreign_chains);
+                    }
+                    Err(e) => {
+                        tracing::warn!("Invalid config overlay, keeping current config: {e}");
+                    }
+                }
+            }
+        }
+    }
 }
 ```
 
-```rust
-// In coordinator.rs, replace static config read with watch channel
-let foreign_chains_config = foreign_chains_receiver.borrow().clone();
-```
+Consumers in the coordinator and signature providers would receive updates via `watch::Receiver`, similar to how the image hash watcher works.
+
+**Part 4: API key delivery**
 
-For API key delivery, a separate `.env` file approach:
+API keys present a special challenge because they are secrets. Two sub-options:
+
+**A-keys-1: Env vars via user-config.conf (simple, current model)**
+
+API keys are passed as env vars in `user-config.conf` (e.g., `MPC_ALCHEMY_API_KEY=...`). The launcher passes them through as `MPC_*` env vars. The `foreign_chains` config references them via `TokenConfig::Env { env: "MPC_ALCHEMY_API_KEY" }`.
+
+Limitation: changing API keys requires CVM restart (env vars are fixed at container start). But API key rotation is infrequent enough that this may be acceptable initially.
+
+**A-keys-2: Secrets file on shared volume (supports hot-reload)**
+
+Write API keys to a file on `/mnt/shared/secrets.env`:
+```
+ALCHEMY_API_KEY=abc123
+QUICKNODE_API_KEY=xyz789
+```
+Modify `TokenConfig::Env` resolution in `auth.rs` to check this file first:
 
 ```rust
-// In auth.rs, modify token resolution
 impl TokenConfig {
     pub fn resolve(&self) -> Result<String, AuthError> {
         match self {
             TokenConfig::Val { val } => Ok(val.clone()),
             TokenConfig::Env { env } => {
-                // First check override file, then fall back to process env
-                if let Some(val) = read_env_override_file(env)?
{
+                // Check secrets file first, then process env
+                if let Some(val) = read_secrets_file("/mnt/shared/secrets.env", env)? {
                    Ok(val)
                } else {
                    std::env::var(env).map_err(|_| AuthError::MissingEnvVar(env.clone()))
@@ -182,120 +357,189 @@ impl TokenConfig {
 }
 ```
 
+The secrets file is re-read on each token resolution (at signing time), so key rotations take effect without restart.
+
+#### Launcher Volume Mount Change
+
+Currently the launcher mounts `shared-volume:/mnt/shared:ro` (read-only). For the launcher to write `config-overlay.yaml` to it, this needs to change to `:rw`. The MPC node container already mounts it as `rw`.
+
+**Important:** This change means updating the `launcher_docker_compose.yaml`, which is **measured** and affects attestation. This is a one-time change that needs to be voted in by all operators. After this change, the overlay mechanism works without further compose changes.
+
+Alternatively, the launcher could write the overlay before starting the MPC container, but a `:ro` mount blocks writes from inside the launcher container regardless of timing -- the write would have to bypass the mount entirely (e.g., via the Docker API or a host-side volume path), which is fragile and should be verified.
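To make the required change concrete, here is a sketch of the relevant `launcher_docker_compose.yaml` fragment -- the service and volume names are illustrative, not the actual measured compose file:

```yaml
services:
  launcher:
    volumes:
      # Current (read-only): the launcher cannot create files on the volume.
      # - shared-volume:/mnt/shared:ro
      # Proposed (read-write): lets the launcher write config-overlay.yaml.
      - shared-volume:/mnt/shared:rw
```

Because the compose file is measured, even this one-line change alters the attestation measurements and must go through the operator voting flow.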
+ +#### Pros +- Solves both initial delivery and runtime updates +- Minimal changes to existing architecture +- `start.sh` changes are small and backwards-compatible (overlay is optional) +- Follows the existing `/mnt/shared/` pattern +- Config overlay is human-readable YAML +- API keys can be delivered via existing env var passthrough (phase 1) or secrets file (phase 2) + +#### Cons +- Requires a one-time launcher compose update (measured, needs voting) +- Base64 encoding in `user-config.conf` is not very ergonomic for large configs +- The overlay merge in `start.sh` requires `pyyaml` (need to verify it's in the image; `python3` is available but the yaml module may not be) +- Hot-reload requires node code changes (file watcher, watch channels) + --- -### Option B: Contract-Driven Configuration +### Option B: `StartWithConfigFile` with TOML Config on Shared Volume -Store configuration (chain definitions, RPC URLs) on the contract and have nodes read it via the indexer, similar to how `foreign_chain_policy` already works. +The node already supports a `StartWithConfigFile` command that reads the entire config from a TOML file (see `cli.rs:CliCommand::StartWithConfigFile` and `start.rs:StartConfig::from_toml_file`). Instead of modifying `start.sh`, we could switch TDX deployments to use this path. #### Design -1. **Extend the contract** with a new `node_config` or `foreign_chains_config` field -2. **Operators vote** on config changes (similar to `vote_foreign_chain_policy`) -3. **Nodes read** the config from the contract via the indexer -4. **API keys** still need a local mechanism (they cannot go on-chain) +1. **Replace `start.sh` with a new entrypoint** that reads config from `/mnt/shared/mpc-config.toml` +2. **The launcher writes this TOML file** from a base64-encoded config in `user-config.conf` +3. **Config includes everything**: node settings, TEE settings, secrets config, and `foreign_chains` +4. 
**Hot-reload**: same file watcher approach as Option A, but watching the TOML file + +Example `mpc-config.toml`: +```toml +home_dir = "/data" + +[secrets] +secret_store_key_hex = "..." + +[tee] +[tee.authority] +type = "dstack" +dstack_endpoint = "/var/run/dstack.sock" + +[node] +my_near_account_id = "my-account.testnet" +near_responder_account_id = "my-account.testnet" +number_of_responder_keys = 50 +web_ui = "0.0.0.0:8080" +# ... other fields ... + +[node.foreign_chains.bitcoin] +timeout_sec = 30 +max_retries = 3 +[node.foreign_chains.bitcoin.providers.public] +api_variant = "esplora" +rpc_url = "https://blockstream.info/api" +[node.foreign_chains.bitcoin.providers.public.auth] +kind = "none" +``` #### Pros -- Consensus built-in: all operators must agree on config changes -- Single source of truth for chain definitions -- Already partially implemented: `vote_foreign_chain_policy` exists +- Uses an existing, already-implemented code path (`StartWithConfigFile`) +- Cleaner than YAML overlay merging -- single source of truth +- TOML is well-supported in the Rust ecosystem +- No `start.sh` modifications needed (replace it entirely) +- Full config is in one place #### Cons -- API keys cannot be stored on-chain (secrets must remain local) -- Slow iteration: every config change requires a voting round -- RPC URLs are somewhat operator-specific (different providers, different API tiers) -- Over-engineers the problem: not all config should require consensus -- The existing `foreign_chain_policy` already handles the consensus part (which chains/URLs are accepted); the local config is for operator-specific settings like auth +- **Breaking change**: requires new entrypoint, new Docker image, or at least a new `start.sh` +- Operator must provide the full config, not just overrides +- Secrets (like `secret_store_key_hex`) end up in the config file on the shared volume +- Still needs the base64-in-user-config or direct file injection mechanism +- Hot-reload still needs the same 
file watcher work --- -### Option C: Sidecar Config Service +### Option C: Contract-Driven Configuration -Run a lightweight sidecar container alongside the MPC node that exposes an HTTP API for config updates, protected by mutual TLS or the dstack attestation mechanism. +Store the `foreign_chains` configuration on the contract and have nodes read it via the indexer. #### Design -1. **Config sidecar** runs in the same CVM, shares the data volume -2. **Exposes an API** (e.g., `POST /config/foreign_chains`) for config updates -3. **Writes config** atomically to the shared volume -4. **MPC node** watches the config file (same as Option A) -5. **Authentication** via the CVM's TLS certificate or a shared secret +The existing `vote_foreign_chain_policy` mechanism already stores chain/provider URLs on-chain. Extend it to store the full provider config (including `api_variant`, timeouts, retries) so nodes can reconstruct their `foreign_chains` config from contract state. + +API keys still need a local mechanism since they cannot go on-chain. But the chain definitions, provider URLs, and API variants could all come from the contract. 
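To make the on-chain/local split concrete, here is a sketch in Rust -- all type and field names are illustrative, not the actual contract schema or node types:

```rust
use std::collections::HashMap;

// Hypothetical shape of what the contract could store after a schema
// extension. Secrets never appear here.
#[derive(Clone, Debug, PartialEq)]
struct OnChainProvider {
    api_variant: String, // e.g. "esplora", "alchemy"
    rpc_url: String,
    timeout_sec: u64,
    max_retries: u32,
}

// Hypothetical shape of what stays in local, operator-specific config.
#[derive(Clone, Debug, PartialEq)]
enum LocalAuth {
    None,
    // The API key itself is resolved locally from an env var; only the
    // env var *name* appears in config.
    HeaderToken { header: String, env_var: String },
}

/// A node would reconstruct its effective provider list by joining the
/// voted on-chain definitions with its local auth entries, defaulting to
/// no auth when the operator has not configured any.
fn effective_providers(
    on_chain: &[(String, OnChainProvider)],
    local_auth: &HashMap<String, LocalAuth>,
) -> Vec<(String, OnChainProvider, LocalAuth)> {
    on_chain
        .iter()
        .map(|(name, provider)| {
            let auth = local_auth.get(name).cloned().unwrap_or(LocalAuth::None);
            (name.clone(), provider.clone(), auth)
        })
        .collect()
}
```

The point of the sketch is the boundary: consensus covers `rpc_url`, `api_variant`, and retry policy, while anything that can hold a secret stays on the operator's side.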
#### Pros -- Clean API for config updates (could integrate with CI/CD) -- Sidecar can validate and merge configs before writing -- Could support encrypted config delivery via TLS +- Consensus built-in: all operators agree on config +- No file delivery mechanism needed for chain definitions +- Already partially implemented #### Cons -- Additional container to maintain and measure -- New attack surface (API endpoint inside the CVM) -- Launcher compose changes = new attestation measurements -- Over-engineered for the current needs +- API keys still need a local solution (we're back to the same problem for secrets) +- Slow iteration: every config change requires a voting round across all operators +- The contract would need schema changes to store `api_variant`, `timeout_sec`, `max_retries` +- RPC provider preferences are somewhat operator-specific (different API tiers, different providers) +- Does not solve the general config update problem (only foreign chains) --- -### Option D: Periodic Config Polling from External Source +### Option D: Node HTTP API for Config Updates -The MPC node periodically fetches configuration from an external source (S3 bucket, HTTP endpoint, etc.). +Add an HTTP endpoint to the MPC node's existing web server (port 8080) for receiving config updates. #### Design -1. **Node polls** an operator-defined URL for config updates -2. **Config is signed** by the operator's key to prevent tampering -3. **Node applies** validated config changes +1. Add `POST /config/foreign_chains` endpoint to the web server +2. Operator sends YAML/JSON config via `curl` from outside the CVM +3. Node validates, persists to disk, and applies in-memory +4. 
Authentication via the CVM's TLS certificate or a shared token #### Pros -- No CVM restart or dstack interaction needed -- Works with existing CI/CD and secret management tools +- Most ergonomic for operators (`curl -X POST ...`) +- No file delivery complexity +- Can support both config and API key updates +- Works with existing port forwarding (8080 is already exposed) #### Cons -- Introduces external dependency (what if the config server is down?) -- Needs a signing/verification scheme for config integrity -- Network access from within CVM may be restricted -- Significant new code for a simple problem +- New attack surface: anyone who can reach port 8080 can push config +- Authentication mechanism needs design (TLS mutual auth, bearer token?) +- Port 8080 is already public (used for `/public_data` and telemetry) +- Secrets transmitted over the network need encryption +- More code to write and maintain vs file-based approach ## Recommendation -**Option A (File-Based Config Hot-Reload)** is recommended as the primary approach, with the existing contract-based foreign chain policy voting (which is already implemented) providing consensus for the chain/provider list. +**Option A (Extend start.sh + Shared Volume Config Overlay)** is recommended for the following reasons: ### Rationale -1. **Follows existing patterns**: The `allowed_image_hashes_watcher` already demonstrates the file-watch + shared-volume pattern. We'd be extending a proven approach. +1. **Solves the immediate blocker.** The `config-overlay.yaml` on `/mnt/shared/` gives us a way to deliver `foreign_chains` config to TDX nodes, unblocking testnet migration. -2. **Minimal infrastructure changes**: No new containers, no contract changes, no external services. The main work is in the MPC node code. +2. **Minimal blast radius.** Changes are additive: `start.sh` gets a small merge step, the launcher gets an optional base64 decode step. 
Existing deployments without an overlay file continue to work identically. -3. **Separation of concerns**: The contract handles consensus (which chains are accepted), while local config handles operator-specific details (API keys, provider preferences, timeouts). +3. **Follows proven patterns.** The `/mnt/shared/` volume and the file-watcher pattern (`allowed_image_hashes_watcher`) are already battle-tested in this codebase. -4. **Incremental delivery**: Start with `foreign_chains` hot-reload only. Expand to other config fields later if needed. +4. **Separation of concerns is preserved.** The contract handles consensus (which chains/URLs are accepted via `vote_foreign_chain_policy`), while the local overlay handles operator-specific details (auth config, API keys, provider preferences, timeouts). -5. **API key handling**: The `.env` override file approach is simple and keeps secrets within the CVM's encrypted filesystem. +5. **Incremental delivery.** Phase 1 (initial delivery) unblocks TDX migration immediately. Phase 2 (hot-reload) and Phase 3 (ergonomic tooling) can follow independently. 
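One operational detail worth fixing early: whichever component writes the overlay (operator script or launcher) should write it atomically, so the node's future file watcher never observes a half-written YAML file. A minimal sketch -- the default path and function name are illustrative:

```python
import os
import tempfile


def write_overlay_atomic(overlay_yaml: str, dest: str = "/mnt/shared/config-overlay.yaml") -> None:
    """Write via temp file + rename so readers never see partial content."""
    # The temp file must live on the same filesystem as the destination:
    # os.replace() is only atomic within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(overlay_yaml)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, dest)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The same pattern applies to the `secrets.env` file if the hot-reloadable API key option is adopted.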
### Proposed Implementation Plan -#### Phase 1: Config Hot-Reload in MPC Node -- Add a config file watcher for `config.yaml` (or a dedicated `foreign_chains.yaml`) -- Implement partial config reload for `foreign_chains` section -- Add an env-override file mechanism for API key updates -- Add metrics/logging for config reload events - -#### Phase 2: Launcher Support for Config Updates -- Extend the launcher to relay `user-config.conf` changes to the node's config files -- Support writing env override files from `user-config.conf` entries -- Document the operator workflow for config updates without CVM restart - -#### Phase 3: Operator Tooling -- Update `deploy-launcher.sh` and `vmm-cli.py` workflows -- Add a dedicated config update command/script -- Update the external operator guide with the new workflow +#### Phase 0: Unblock TDX Migration (Minimal) +- Modify `start.sh` to merge `config-overlay.yaml` from `/mnt/shared/` if present +- Add `pyyaml` to the node Docker image (or use a simpler JSON merge if yaml is unavailable) +- Operator manually writes overlay file to the shared volume +- API keys passed via env vars in `user-config.conf` (existing mechanism) +- **No launcher changes, no node code changes, no compose changes** + +#### Phase 1: Launcher Config Overlay Support +- Add `CONFIG_OVERLAY_BASE64` support to the launcher +- Launcher decodes and writes `/mnt/shared/config-overlay.yaml` +- Update `launcher_docker_compose.yaml` to mount shared-volume as rw (requires voting) +- Document operator workflow for `user-config.conf`-based config updates + +#### Phase 2: Runtime Config Reload +- Add config file watcher in the MPC node for `/mnt/shared/config-overlay.yaml` +- Implement `watch::channel`-based config propagation for `ForeignChainsConfig` +- Update coordinator to re-vote `foreign_chain_policy` on config change +- Add secrets file support (`/mnt/shared/secrets.env`) for hot-reloadable API keys + +#### Phase 3: Operator Tooling & Ergonomics +- Add a helper 
script to generate overlay files from a more ergonomic format +- Update `deploy-launcher.sh` and `deploy-launcher-guide.md` +- Update `running-an-mpc-node-in-tdx-external-guide.md` with foreign chain config instructions +- Consider switching to the `StartWithConfigFile` TOML path long-term (Option B) for cleaner architecture ### Open Questions -1. **Should we use a separate file for hot-reloadable config?** Using a dedicated `foreign_chains.yaml` (or `dynamic_config.yaml`) would make it clearer which fields support hot-reload and avoid the risk of operators editing non-reloadable fields expecting them to take effect. +1. **Is `pyyaml` available in the node Docker image?** The image is based on `debian:bookworm-slim` with `python3` installed, but not necessarily `pyyaml`. If not, we could use a JSON-based overlay instead, or add the dependency. Alternatively, the merge logic could be implemented in a small Rust helper or directly in the node binary (e.g., `mpc-node merge-config --overlay /mnt/shared/config-overlay.yaml`). + +2. **Launcher compose rw mount timing.** Changing `shared-volume:/mnt/shared:ro` to `rw` in the launcher compose requires a voting round. Can this be bundled with the next launcher image upgrade, or does it need to happen independently? Note: The launcher currently exits after starting the MPC container, so even with rw access, there's no persistent process that could tamper with the shared volume. -2. **How should the launcher relay config changes?** The launcher currently only runs at boot. Should it run a background loop watching for `user-config.conf` changes, or should we use a different mechanism (e.g., the node watches a config file on `/mnt/shared/` directly)? +3. **What happens if the overlay file is invalid?** `start.sh` should fail loudly (exit 1) if the overlay YAML is malformed, preventing the node from starting with a broken config. The runtime watcher should log a warning and keep the current config. -3. 
**Do we need config change auditing?** Should config changes be logged to an append-only file or reported via the `/public_data` endpoint for observability? +4. **Should the overlay support all config fields or just `foreign_chains`?** Starting with `foreign_chains` only is simpler and safer. But operators may also want to tune `triple.concurrency`, `presignature.desired_presignatures_to_buffer`, etc. A full overlay merge is more flexible at the cost of complexity. -4. **What is the interaction with `vote_foreign_chain_policy`?** Currently, the node votes its local `foreign_chains` config as the foreign chain policy. If the config is hot-reloaded, should the node automatically re-vote? This seems desirable but needs careful handling to avoid vote spam. +5. **Re-voting on config change.** When `foreign_chains` config changes at runtime, the node should automatically call `vote_foreign_chain_policy` with the new policy. This needs rate limiting to avoid vote spam if the operator is iterating on config. -5. **Should API keys be deliverable via dstack encrypted env vars?** Dstack supports encrypted environment variables via KMS, but we currently don't use KMS. If we adopt KMS in the future, this could be a cleaner path for secret delivery. +6. **How to deliver the overlay file in Phase 0 (before launcher support)?** The operator could SSH into the TDX host, use `vmm-cli.py` to access the VM, and write the file directly to the shared volume. This is manual but unblocks us immediately. Exact steps need documentation. 
From 4738ee706a14525d1c168f02281ca0149a36fdc1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A5rten=20Blankfors?= Date: Fri, 13 Mar 2026 13:15:32 +0100 Subject: [PATCH 3/9] Update --- docs/tdx-config-updates-design.md | 445 +++++++++++++++--------------- 1 file changed, 229 insertions(+), 216 deletions(-) diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md index be40ce657..a6f06f8f6 100644 --- a/docs/tdx-config-updates-design.md +++ b/docs/tdx-config-updates-design.md @@ -44,9 +44,55 @@ This template has **no `foreign_chains` section**. There is no mechanism to inje - **No config updates of any kind.** Even non-foreign-chain config changes (e.g., adjusting triple/presignature concurrency, changing boot nodes) require stopping the CVM, updating `user-config.conf`, and restarting -- which causes downtime. +### Recent Development: `start-with-config-file` (TOML) + +Commit `78d3e767` ("feat: allow configuration files for full config of the mpc node", PR #2332) introduced a new `start-with-config-file` CLI command that reads the **entire** node configuration from a single TOML file: + +``` +mpc-node start-with-config-file /path/to/mpc-config.toml +``` + +The `StartConfig` TOML struct includes all configuration in one file: + +```toml +home_dir = "/data" + +[secrets] +secret_store_key_hex = "..." + +[tee] +image_hash = "..." +latest_allowed_hash_file = "/mnt/shared/image-digest.bin" +[tee.authority] +type = "dstack" +dstack_endpoint = "/var/run/dstack.sock" + +[node] +my_near_account_id = "my-account.testnet" +# ... all node config fields ... 
+ +[node.foreign_chains.bitcoin] +timeout_sec = 30 +max_retries = 3 +[node.foreign_chains.bitcoin.providers.public] +api_variant = "esplora" +rpc_url = "https://blockstream.info/api" +[node.foreign_chains.bitcoin.providers.public.auth] +kind = "none" +``` + +Key facts about this feature: +- **All pytests already use this path** -- tests write `start_config.toml` and launch via `start-with-config-file` +- **A TOML template exists** at `docs/localnet/mpc-config.template.toml` with `foreign_chains` included +- **The old `start` command is marked for deprecation** (`cli.rs:226`: `TODO(#2334): deprecate this`) +- **`StartConfig::from_toml_file()` validates the config** including `foreign_chains` +- **The `node` section is a `ConfigFile`** -- the exact same struct used by `config.yaml`, so feature parity is guaranteed + +This feature **fundamentally changes the design space**: instead of working around the limitations of `start.sh` with overlay hacks, we can switch TDX deployments to the TOML path and get `foreign_chains` support natively. + ## Current Architecture in Detail -### Deployment Flow +### Deployment Flow (Current -- `start.sh` Path) ``` Operator writes user-config.conf (flat KEY=VALUE) @@ -171,9 +217,9 @@ This is all flat key-value. No structured data can pass through. | Guest OS / dstack | MRTD, RTMR0-2 | No | | `user-config.conf` | **Not measured** | Yes | | `/mnt/shared/` contents | **Not measured** (encrypted at rest) | Yes | -| `config.yaml` | **Not measured** (inside encrypted CVM disk) | Yes (if we can get data in) | +| `config.yaml` / TOML | **Not measured** (inside encrypted CVM disk) | Yes (if we can get data in) | -Key insight: **`config.yaml` lives on the encrypted CVM disk and is not individually measured.** The attestation verifies that the correct *code* is running, not the *config data*. So updating config does not break attestation -- the challenge is purely mechanical: getting structured config data into the running node. 
+Key insight: **Config files live on the encrypted CVM disk and are not individually measured.** The attestation verifies that the correct *code* is running, not the *config data*. So updating config does not break attestation -- the challenge is purely mechanical: getting structured config data into the running node. ### The Existing Dynamic Update Pattern @@ -188,128 +234,118 @@ This works well for data that originates from the contract. For operator-specifi ## Proposed Solutions -### Option A: Extend start.sh + Shared Volume Config File (Recommended) +### Option A: Switch to TOML Config Path with Launcher-Generated Config (Recommended) -Solve both the "initial delivery" and "runtime update" problems by: -1. Extending `start.sh` to support a config overlay file -2. Having the node watch that file for changes +Switch TDX deployments from the legacy `start.sh` + `config.yaml` path to the new `start-with-config-file` TOML path. The launcher generates the TOML config file from operator-provided data and writes it to the shared volume. #### Design -**Part 1: Initial config delivery via `start.sh`** +**Part 1: Launcher generates TOML config** -Modify `start.sh` to merge an optional config overlay file into `config.yaml` after generating the base template: +The launcher already reads `user-config.conf` and builds a `docker run` command. 
Instead of passing individual `--env` flags, the launcher would: -```bash -# In start.sh, after initialize_mpc_config / update_mpc_config: -MPC_CONFIG_OVERLAY="/mnt/shared/config-overlay.yaml" -if [ -f "$MPC_CONFIG_OVERLAY" ]; then - echo "Merging config overlay from $MPC_CONFIG_OVERLAY" - # Use python3 (already available in the image) to deep-merge YAML - python3 -c " -import yaml, sys -base = yaml.safe_load(open('$MPC_NODE_CONFIG_FILE')) -overlay = yaml.safe_load(open('$MPC_CONFIG_OVERLAY')) -# Deep merge: overlay wins for conflicting keys -def merge(b, o): - for k, v in o.items(): - if k in b and isinstance(b[k], dict) and isinstance(v, dict): - merge(b[k], v) - else: - b[k] = v -merge(base, overlay) -yaml.dump(base, open('$MPC_NODE_CONFIG_FILE', 'w'), default_flow_style=False) -" -fi -``` +1. Read config from `user-config.conf` (existing flat key-value vars) **and** a base64-encoded TOML config +2. Write a complete `mpc-config.toml` to `/mnt/shared/mpc-config.toml` +3. Launch the MPC container with a modified entrypoint that uses `start-with-config-file` -The operator places `config-overlay.yaml` on the shared volume via an update to `user-config.conf` or a direct file write. 
Example overlay: +```python +# In launcher.py: +CONFIG_TOML_BASE64_KEY = "MPC_CONFIG_TOML_BASE64" -```yaml -foreign_chains: - bitcoin: - timeout_sec: 30 - max_retries: 3 - providers: - public: - api_variant: esplora - rpc_url: "https://blockstream.info/api" - auth: - kind: none - ethereum: - timeout_sec: 30 - max_retries: 3 - providers: - alchemy: - api_variant: alchemy - rpc_url: "https://eth-mainnet.g.alchemy.com/v2/" - auth: - kind: header - name: Authorization - scheme: Bearer - token: - env: ALCHEMY_API_KEY +def write_config_toml(dstack_config: dict, image_hash: str, platform: Platform): + """Generate mpc-config.toml on the shared volume.""" + config_b64 = dstack_config.get(CONFIG_TOML_BASE64_KEY) + if not config_b64: + return # Fall back to legacy start.sh path + + import base64 + config_toml = base64.b64decode(config_b64).decode("utf-8") + + # Write atomically + tmp_path = "/mnt/shared/mpc-config.toml.tmp" + final_path = "/mnt/shared/mpc-config.toml" + with open(tmp_path, "w") as f: + f.write(config_toml) + os.rename(tmp_path, final_path) ``` -**Part 2: Delivery mechanism for the overlay file** +**Part 2: Modified start.sh (or new entrypoint)** -The overlay file lives on `/mnt/shared/`, which is a Docker volume shared between the launcher container and the MPC node container. We need a way to write to it: +Modify `start.sh` to detect the TOML config and use it instead of generating `config.yaml`: -**Option A1: Launcher writes overlay from user-config.conf** +```bash +# At the top of start.sh: +MPC_CONFIG_TOML="/mnt/shared/mpc-config.toml" -Add a convention: any line in `user-config.conf` starting with `CONFIG_OVERLAY_BASE64=` contains a base64-encoded YAML overlay. The launcher decodes it and writes it to `/mnt/shared/config-overlay.yaml`. 
+if [ -f "$MPC_CONFIG_TOML" ]; then + echo "Found TOML config at $MPC_CONFIG_TOML, using start-with-config-file" -```python -# In launcher.py, before launching the MPC container: -overlay_b64 = dstack_config.get("CONFIG_OVERLAY_BASE64") -if overlay_b64: - import base64 - overlay_yaml = base64.b64decode(overlay_b64).decode("utf-8") - with open("/mnt/shared/config-overlay.yaml", "w") as f: - f.write(overlay_yaml) + # Still need to initialize the near node (genesis, config.json) + if [ ! -r "$NEAR_NODE_CONFIG_FILE" ]; then + initialize_near_node "$MPC_HOME_DIR" + fi + update_near_node_config + + echo "Starting mpc node with TOML config..." + /app/mpc-node start-with-config-file "$MPC_CONFIG_TOML" + exit $? +fi + +# ... existing start.sh logic for legacy path ... ``` -Operator workflow: +This is backwards-compatible: nodes without a TOML file continue using the legacy path. + +**Part 3: Operator workflow** + +The operator creates a TOML config file locally (using `mpc-config.template.toml` as a starting point), base64-encodes it, and includes it in `user-config.conf`: + ```bash -# Encode the overlay -CONFIG_OVERLAY_BASE64=$(base64 -w0 < my-foreign-chains.yaml) -# Add to user-config.conf -echo "CONFIG_OVERLAY_BASE64=$CONFIG_OVERLAY_BASE64" >> user-config.conf -# Update the CVM config (requires launcher to mount shared-volume as rw) +# 1. Create config from template (or manually) +envsubst < docs/localnet/mpc-config.template.toml > mpc-config.toml +# Edit to add foreign_chains, adjust settings, etc. + +# 2. Base64-encode and add to user-config.conf +MPC_CONFIG_TOML_BASE64=$(base64 -w0 < mpc-config.toml) + +# 3. user-config.conf now contains: +cat > user-config.conf << EOF +MPC_IMAGE_NAME=nearone/mpc-node +MPC_IMAGE_TAGS=latest +MPC_REGISTRY=registry.hub.docker.com +MPC_CONFIG_TOML_BASE64=$MPC_CONFIG_TOML_BASE64 +PORTS=8080:8080,3030:3030,80:80,24567:24567 +EOF + +# 4. 
Deploy or update
+vmm-cli.py update-user-config user-config.conf
+```
 
-**Option A2: Direct file write via dstack's file injection**
-
-If dstack supports writing files to CVM volumes directly (e.g., via the VMM API), the operator could write the overlay file without going through the launcher. This needs investigation into dstack capabilities.
+To update config (e.g., add a new foreign chain), the operator edits the TOML file, re-encodes, and updates `user-config.conf`. With a CVM restart, the node picks up the new config.
 
-**Part 3: Runtime config reload in the MPC node**
+**Part 4: Runtime config hot-reload (future phase)**
 
-Add a file watcher to the MPC node that monitors the overlay file for changes:
+Add a file watcher to the MPC node that monitors `/mnt/shared/mpc-config.toml` for changes:
 
 ```rust
 // New: crates/node/src/config/watcher.rs
-// Watches /mnt/shared/config-overlay.yaml for changes
-// On change: re-reads, validates, and updates ForeignChainsConfig via watch channel
-pub async fn watch_config_overlay(
-    overlay_path: PathBuf,
-    home_dir: PathBuf,
+pub async fn watch_config_file(
+    config_path: PathBuf,
    foreign_chains_sender: watch::Sender<ForeignChainsConfig>,
    cancellation_token: CancellationToken,
 ) -> Result<(), ConfigWatchError> {
     loop {
         select!
{ _ = cancellation_token.cancelled() => break Ok(()), - _ = wait_for_file_change(&overlay_path) => { - match load_and_validate_overlay(&overlay_path) { - Ok(overlay) => { - // Merge overlay into config.yaml on disk - merge_into_config_file(&home_dir, &overlay)?; - // Update in-memory config - foreign_chains_sender.send_replace(overlay.foreign_chains); + _ = wait_for_file_change(&config_path) => { + match StartConfig::from_toml_file(&config_path) { + Ok(new_config) => { + // Only hot-reload safe fields + foreign_chains_sender.send_replace(new_config.node.foreign_chains); + tracing::info!("Config reloaded successfully"); } Err(e) => { - tracing::warn!("Invalid config overlay, keeping current config: {e}"); + tracing::warn!("Invalid config file, keeping current config: {e}"); } } } @@ -318,124 +354,96 @@ pub async fn watch_config_overlay( } ``` -Consumers in the coordinator and signature providers would receive updates via `watch::Receiver`, similar to how the image hash watcher works. - -**Part 4: API key delivery** - -API keys present a special challenge because they are secrets. Two sub-options: +To trigger a hot-reload, the operator updates `user-config.conf` with a new `MPC_CONFIG_TOML_BASE64`, then restarts just the launcher (not the CVM). The launcher writes the new TOML file to the shared volume, and the node's file watcher picks it up. -**A-keys-1: Env vars via user-config.conf (simple, current model)** +**Part 5: API key delivery** -API keys are passed as env vars in `user-config.conf` (e.g., `MPC_ALCHEMY_API_KEY=...`). The launcher passes them through as `MPC_*` env vars. The `foreign_chains` config references them via `TokenConfig::Env { env: "MPC_ALCHEMY_API_KEY" }`. +API keys can be handled in two ways: -Limitation: changing API keys requires CVM restart (env vars are fixed at container start). But API key rotation is infrequent enough that this may be acceptable initially. 
+**Inline in TOML (simplest):** +```toml +[node.foreign_chains.ethereum.providers.alchemy.auth] +kind = "header" +name = "Authorization" +scheme = "Bearer" +[node.foreign_chains.ethereum.providers.alchemy.auth.token] +val = "my-api-key-here" +``` -**A-keys-2: Secrets file on shared volume (supports hot-reload)** +The key is embedded in the TOML config on the encrypted CVM disk. It's delivered via the base64-encoded config in `user-config.conf`. -Write API keys to a file on `/mnt/shared/secrets.env`: -``` -ALCHEMY_API_KEY=abc123 -QUICKNODE_API_KEY=xyz789 +**Via env var reference:** +```toml +[node.foreign_chains.ethereum.providers.alchemy.auth.token] +env = "MPC_ALCHEMY_API_KEY" ``` -Modify `TokenConfig::Env` resolution in `auth.rs` to check this file first: -```rust -impl TokenConfig { - pub fn resolve(&self) -> Result { - match self { - TokenConfig::Val { val } => Ok(val.clone()), - TokenConfig::Env { env } => { - // Check secrets file first, then process env - if let Some(val) = read_secrets_file("/mnt/shared/secrets.env", env)? { - Ok(val) - } else { - std::env::var(env).map_err(|_| AuthError::MissingEnvVar(env.clone())) - } - } - } - } -} -``` +The API key is passed as a separate env var in `user-config.conf` (`MPC_ALCHEMY_API_KEY=...`), which the launcher passes through to the container. The TOML config references it by name. -The secrets file is re-read on each token resolution (at signing time), so key rotations take effect without restart. +Both approaches work today. The inline approach is simpler and avoids the env-var-is-fixed-at-container-start limitation for hot-reload scenarios. #### Launcher Volume Mount Change -Currently the launcher mounts `shared-volume:/mnt/shared:ro` (read-only). For the launcher to write `config-overlay.yaml` to it, this needs to change to `:rw`. The MPC node container already mounts it as `rw`. +Currently the launcher mounts `shared-volume:/mnt/shared:ro` (read-only). 
For the launcher to write `mpc-config.toml`, this needs to change to `:rw`. -**Important:** This change means updating the `launcher_docker_compose.yaml`, which is **measured** and affects attestation. This is a one-time change that needs to be voted in by all operators. After this change, the overlay mechanism works without further compose changes. +**Important:** This change means updating the `launcher_docker_compose.yaml`, which is **measured** and affects attestation. This is a one-time change that needs to be voted in by all operators. After this change, the TOML mechanism works without further compose changes. -Alternatively, the launcher could write the overlay before starting the MPC container, which works with the current read-only mount since Docker volumes are shared at the storage level regardless of mount flags -- but this is fragile and should be verified. +Note: the launcher exits after starting the MPC container and does not run persistently, so rw access to the shared volume does not introduce a persistent tampering risk. #### Pros -- Solves both initial delivery and runtime updates -- Minimal changes to existing architecture -- `start.sh` changes are small and backwards-compatible (overlay is optional) -- Follows the existing `/mnt/shared/` pattern -- Config overlay is human-readable YAML -- API keys can be delivered via existing env var passthrough (phase 1) or secrets file (phase 2) +- **Uses an already-implemented, tested code path.** `start-with-config-file` is used by all pytests and localnet. It's not new code. +- **Natively supports `foreign_chains`.** The TOML `StartConfig` includes `ConfigFile` which has `foreign_chains`. No overlay merging, no YAML hacks. +- **Single source of truth.** One TOML file contains the entire config. No split between env vars, `config.yaml`, and overlay files. +- **Aligns with the deprecation direction.** The old `start` CLI command is already marked `TODO(#2334): deprecate this`. 
This moves TDX to the intended future path. +- **Backwards-compatible.** The `start.sh` change is a conditional branch: TOML file present → new path, absent → legacy path. +- **Operator has full control.** Any config field can be set, not just a predefined set of env vars. +- **Template already exists.** `docs/localnet/mpc-config.template.toml` provides a working starting point. #### Cons -- Requires a one-time launcher compose update (measured, needs voting) +- Requires a one-time launcher compose update (measured, needs voting) for the shared-volume rw mount - Base64 encoding in `user-config.conf` is not very ergonomic for large configs -- The overlay merge in `start.sh` requires `pyyaml` (need to verify it's in the image; `python3` is available but the yaml module may not be) -- Hot-reload requires node code changes (file watcher, watch channels) +- Operator must provide the full config, not just overrides (but the template makes this straightforward) +- Secrets (`secret_store_key_hex`) end up in the TOML file on the shared volume (encrypted at rest by the CVM, but visible to processes inside the CVM -- same security model as the current `config.yaml` + env vars) --- -### Option B: `StartWithConfigFile` with TOML Config on Shared Volume +### Option B: Extend start.sh with YAML Config Overlay -The node already supports a `StartWithConfigFile` command that reads the entire config from a TOML file (see `cli.rs:CliCommand::StartWithConfigFile` and `start.rs:StartConfig::from_toml_file`). Instead of modifying `start.sh`, we could switch TDX deployments to use this path. +Instead of switching to the TOML path, extend the existing `start.sh` to merge an overlay file into the generated `config.yaml`. #### Design -1. **Replace `start.sh` with a new entrypoint** that reads config from `/mnt/shared/mpc-config.toml` -2. **The launcher writes this TOML file** from a base64-encoded config in `user-config.conf` -3. 
**Config includes everything**: node settings, TEE settings, secrets config, and `foreign_chains` -4. **Hot-reload**: same file watcher approach as Option A, but watching the TOML file - -Example `mpc-config.toml`: -```toml -home_dir = "/data" - -[secrets] -secret_store_key_hex = "..." - -[tee] -[tee.authority] -type = "dstack" -dstack_endpoint = "/var/run/dstack.sock" - -[node] -my_near_account_id = "my-account.testnet" -near_responder_account_id = "my-account.testnet" -number_of_responder_keys = 50 -web_ui = "0.0.0.0:8080" -# ... other fields ... +Modify `start.sh` to merge an optional config overlay file into `config.yaml` after generating the base template: -[node.foreign_chains.bitcoin] -timeout_sec = 30 -max_retries = 3 -[node.foreign_chains.bitcoin.providers.public] -api_variant = "esplora" -rpc_url = "https://blockstream.info/api" -[node.foreign_chains.bitcoin.providers.public.auth] -kind = "none" +```bash +MPC_CONFIG_OVERLAY="/mnt/shared/config-overlay.yaml" +if [ -f "$MPC_CONFIG_OVERLAY" ]; then + python3 -c " +import yaml +base = yaml.safe_load(open('$MPC_NODE_CONFIG_FILE')) +overlay = yaml.safe_load(open('$MPC_CONFIG_OVERLAY')) +def merge(b, o): + for k, v in o.items(): + if k in b and isinstance(b[k], dict) and isinstance(v, dict): + merge(b[k], v) + else: + b[k] = v +merge(base, overlay) +yaml.dump(base, open('$MPC_NODE_CONFIG_FILE', 'w'), default_flow_style=False) +" +fi ``` #### Pros -- Uses an existing, already-implemented code path (`StartWithConfigFile`) -- Cleaner than YAML overlay merging -- single source of truth -- TOML is well-supported in the Rust ecosystem -- No `start.sh` modifications needed (replace it entirely) -- Full config is in one place +- Operators only need to provide the delta (e.g., just `foreign_chains`), not the full config +- Smaller base64 payload in `user-config.conf` #### Cons -- **Breaking change**: requires new entrypoint, new Docker image, or at least a new `start.sh` -- Operator must provide the full config, not just 
overrides -- Secrets (like `secret_store_key_hex`) end up in the config file on the shared volume -- Still needs the base64-in-user-config or direct file injection mechanism -- Hot-reload still needs the same file watcher work +- **Builds on the legacy path** that is being deprecated (`TODO(#2334)`) +- Requires `pyyaml` in the Docker image (not currently installed) +- YAML deep-merge is fragile and error-prone +- Two config formats to maintain (YAML for overlay, env vars for the rest) +- Still limited by what `start.sh` generates -- the overlay can add fields but can't cleanly modify nested structures generated by the template --- @@ -447,8 +455,6 @@ Store the `foreign_chains` configuration on the contract and have nodes read it The existing `vote_foreign_chain_policy` mechanism already stores chain/provider URLs on-chain. Extend it to store the full provider config (including `api_variant`, timeouts, retries) so nodes can reconstruct their `foreign_chains` config from contract state. -API keys still need a local mechanism since they cannot go on-chain. But the chain definitions, provider URLs, and API variants could all come from the contract. - #### Pros - Consensus built-in: all operators agree on config - No file delivery mechanism needed for chain definitions @@ -489,57 +495,64 @@ Add an HTTP endpoint to the MPC node's existing web server (port 8080) for recei ## Recommendation -**Option A (Extend start.sh + Shared Volume Config Overlay)** is recommended for the following reasons: +**Option A (Switch to TOML Config Path)** is the clear recommendation. ### Rationale -1. **Solves the immediate blocker.** The `config-overlay.yaml` on `/mnt/shared/` gives us a way to deliver `foreign_chains` config to TDX nodes, unblocking testnet migration. +1. **The code already exists.** `start-with-config-file` is implemented, tested (all pytests use it), and validated. 
We are not writing new config-parsing code -- we are routing TDX deployments to an existing, battle-tested code path. -2. **Minimal blast radius.** Changes are additive: `start.sh` gets a small merge step, the launcher gets an optional base64 decode step. Existing deployments without an overlay file continue to work identically. +2. **Aligns with the codebase direction.** The old `start` command is explicitly marked for deprecation (`TODO(#2334)`). Option B (YAML overlay) would invest in extending a path that we intend to remove. -3. **Follows proven patterns.** The `/mnt/shared/` volume and the file-watcher pattern (`allowed_image_hashes_watcher`) are already battle-tested in this codebase. +3. **Solves the full problem, not just foreign chains.** With TOML config, operators have full control over all configuration fields. This eliminates the entire class of "I need to change X but start.sh doesn't support it" problems. -4. **Separation of concerns is preserved.** The contract handles consensus (which chains/URLs are accepted via `vote_foreign_chain_policy`), while the local overlay handles operator-specific details (auth config, API keys, provider preferences, timeouts). +4. **No new dependencies or fragile merge logic.** Option B requires `pyyaml` and YAML deep-merge. Option A uses `toml` deserialization that's already in the binary. -5. **Incremental delivery.** Phase 1 (initial delivery) unblocks TDX migration immediately. Phase 2 (hot-reload) and Phase 3 (ergonomic tooling) can follow independently. +5. **Single source of truth.** One TOML file replaces the split between env vars, `config.yaml`, and potential overlays. This is easier to reason about, debug, and version-control. 
### Proposed Implementation Plan -#### Phase 0: Unblock TDX Migration (Minimal) -- Modify `start.sh` to merge `config-overlay.yaml` from `/mnt/shared/` if present -- Add `pyyaml` to the node Docker image (or use a simpler JSON merge if yaml is unavailable) -- Operator manually writes overlay file to the shared volume -- API keys passed via env vars in `user-config.conf` (existing mechanism) -- **No launcher changes, no node code changes, no compose changes** - -#### Phase 1: Launcher Config Overlay Support -- Add `CONFIG_OVERLAY_BASE64` support to the launcher -- Launcher decodes and writes `/mnt/shared/config-overlay.yaml` -- Update `launcher_docker_compose.yaml` to mount shared-volume as rw (requires voting) -- Document operator workflow for `user-config.conf`-based config updates - -#### Phase 2: Runtime Config Reload -- Add config file watcher in the MPC node for `/mnt/shared/config-overlay.yaml` -- Implement `watch::channel`-based config propagation for `ForeignChainsConfig` -- Update coordinator to re-vote `foreign_chain_policy` on config change -- Add secrets file support (`/mnt/shared/secrets.env`) for hot-reloadable API keys - -#### Phase 3: Operator Tooling & Ergonomics -- Add a helper script to generate overlay files from a more ergonomic format -- Update `deploy-launcher.sh` and `deploy-launcher-guide.md` -- Update `running-an-mpc-node-in-tdx-external-guide.md` with foreign chain config instructions -- Consider switching to the `StartWithConfigFile` TOML path long-term (Option B) for cleaner architecture +#### Phase 1: TOML Config Delivery (Unblocks TDX Migration) + +**start.sh changes:** +- Add a conditional at the top: if `/mnt/shared/mpc-config.toml` exists, skip the legacy config generation and run `mpc-node start-with-config-file /mnt/shared/mpc-config.toml` instead +- Keep the Near node initialization (`initialize_near_node`, `update_near_node_config`) since the TOML path doesn't handle that + +**Launcher changes:** +- Add `MPC_CONFIG_TOML_BASE64` 
support: decode and write to `/mnt/shared/mpc-config.toml` +- Change `shared-volume` mount from `:ro` to `:rw` in `launcher_docker_compose.yaml` (requires voting) +- When TOML config is present, skip passing most `--env` flags (they're in the TOML). Still pass `NEAR_BOOT_NODES` for near node init. + +**Operator workflow:** +- Create TOML config from template, including `foreign_chains` +- Base64-encode, add to `user-config.conf` +- Deploy/update CVM + +**What this unblocks:** TDX nodes with `foreign_chains` config, testnet migration, arbitrary config customization. + +#### Phase 2: Runtime Config Hot-Reload + +- Add file watcher on `/mnt/shared/mpc-config.toml` in the MPC node +- `watch::channel`-based propagation of `ForeignChainsConfig` changes to coordinator and providers +- Coordinator re-votes `foreign_chain_policy` when config changes (with rate limiting) +- Operator updates config via `vmm-cli.py update-user-config` → launcher rewrites TOML → node picks up change + +#### Phase 3: Ergonomics and Tooling + +- Create a CLI tool or script to generate TOML configs from a simpler input format +- Support partial config updates (tool merges changes into existing TOML) +- Update operator guides (`running-an-mpc-node-in-tdx-external-guide.md`, `deploy-launcher-guide.md`) +- Consider moving Near node init into the TOML path to fully eliminate `start.sh` ### Open Questions -1. **Is `pyyaml` available in the node Docker image?** The image is based on `debian:bookworm-slim` with `python3` installed, but not necessarily `pyyaml`. If not, we could use a JSON-based overlay instead, or add the dependency. Alternatively, the merge logic could be implemented in a small Rust helper or directly in the node binary (e.g., `mpc-node merge-config --overlay /mnt/shared/config-overlay.yaml`). +1. **Launcher compose voting timeline.** Changing `shared-volume:/mnt/shared:ro` to `:rw` requires a compose update and voting round. Can this be bundled with the next planned launcher upgrade? 
-2. **Launcher compose rw mount timing.** Changing `shared-volume:/mnt/shared:ro` to `rw` in the launcher compose requires a voting round. Can this be bundled with the next launcher image upgrade, or does it need to happen independently? Note: The launcher currently exits after starting the MPC container, so even with rw access, there's no persistent process that could tamper with the shared volume. +2. **Near node initialization.** `start.sh` currently handles Near node initialization (`mpc-node init`, genesis download, `config.json` updates). The TOML path only covers MPC node config, not Near node setup. We should keep this part of `start.sh` for now and eventually fold it into the TOML path or a separate init command. -3. **What happens if the overlay file is invalid?** `start.sh` should fail loudly (exit 1) if the overlay YAML is malformed, preventing the node from starting with a broken config. The runtime watcher should log a warning and keep the current config. +3. **Secret placement.** The TOML config will contain `secret_store_key_hex` on the shared volume. This is encrypted at rest by the CVM, but it means the secret is in a file rather than a transient env var. Is this acceptable? (Note: the current architecture already has `MPC_SECRET_STORE_KEY` as a Docker env var, which is visible in `docker inspect` and persists in the container metadata -- arguably the TOML file is no worse.) -4. **Should the overlay support all config fields or just `foreign_chains`?** Starting with `foreign_chains` only is simpler and safer. But operators may also want to tune `triple.concurrency`, `presignature.desired_presignatures_to_buffer`, etc. A full overlay merge is more flexible at the cost of complexity. +4. **Base64 ergonomics.** For large configs, base64 encoding in `user-config.conf` is unwieldy. A future improvement could support direct file placement via dstack APIs, or a reference to a file path instead of inline base64. 5. 
**Re-voting on config change.** When `foreign_chains` config changes at runtime, the node should automatically call `vote_foreign_chain_policy` with the new policy. This needs rate limiting to avoid vote spam if the operator is iterating on config. -6. **How to deliver the overlay file in Phase 0 (before launcher support)?** The operator could SSH into the TDX host, use `vmm-cli.py` to access the VM, and write the file directly to the shared volume. This is manual but unblocks us immediately. Exact steps need documentation. +6. **Migration path for existing deployments.** Nodes already deployed with the legacy `start.sh` path can be migrated by adding `MPC_CONFIG_TOML_BASE64` to their `user-config.conf` and restarting. The TOML config takes precedence, and the legacy `config.yaml` is no longer read. From cab864c068013e4ba046c2cf30fceb8d10c03c35 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A5rten=20Blankfors?= Date: Fri, 13 Mar 2026 13:15:55 +0100 Subject: [PATCH 4/9] ... --- docs/tdx-config-updates-design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md index a6f06f8f6..0d81bd2fa 100644 --- a/docs/tdx-config-updates-design.md +++ b/docs/tdx-config-updates-design.md @@ -2,7 +2,7 @@ **Status:** WIP / Design **Issue:** #2420 -**Authors:** TBD +**Authors:** Claude + Mårten **Date:** 2026-03-13 ## Problem Statement From b061f219f9b50c5c508d3ea701f210357cfa3e3f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A5rten=20Blankfors?= Date: Fri, 13 Mar 2026 13:24:00 +0100 Subject: [PATCH 5/9] wip: Simplification --- docs/tdx-config-updates-design.md | 24 ++++++------------------ 1 file changed, 6 insertions(+), 18 deletions(-) diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md index 0d81bd2fa..d84c8968b 100644 --- a/docs/tdx-config-updates-design.md +++ b/docs/tdx-config-updates-design.md @@ -294,8 +294,6 @@ fi # ... 
existing start.sh logic for legacy path ... ``` -This is backwards-compatible: nodes without a TOML file continue using the legacy path. - **Part 3: Operator workflow** The operator creates a TOML config file locally (using `mpc-config.template.toml` as a starting point), base64-encodes it, and includes it in `user-config.conf`: @@ -384,23 +382,17 @@ Both approaches work today. The inline approach is simpler and avoids the env-va #### Launcher Volume Mount Change -Currently the launcher mounts `shared-volume:/mnt/shared:ro` (read-only). For the launcher to write `mpc-config.toml`, this needs to change to `:rw`. - -**Important:** This change means updating the `launcher_docker_compose.yaml`, which is **measured** and affects attestation. This is a one-time change that needs to be voted in by all operators. After this change, the TOML mechanism works without further compose changes. - -Note: the launcher exits after starting the MPC container and does not run persistently, so rw access to the shared volume does not introduce a persistent tampering risk. +Currently the launcher mounts `shared-volume:/mnt/shared:ro` (read-only). For the launcher to write `mpc-config.toml`, this needs to change to `:rw` in `launcher_docker_compose.yaml`. Since no nodes are running in TDX today, this is a straightforward change with no migration concerns. #### Pros - **Uses an already-implemented, tested code path.** `start-with-config-file` is used by all pytests and localnet. It's not new code. - **Natively supports `foreign_chains`.** The TOML `StartConfig` includes `ConfigFile` which has `foreign_chains`. No overlay merging, no YAML hacks. - **Single source of truth.** One TOML file contains the entire config. No split between env vars, `config.yaml`, and overlay files. - **Aligns with the deprecation direction.** The old `start` CLI command is already marked `TODO(#2334): deprecate this`. This moves TDX to the intended future path. 
-- **Backwards-compatible.** The `start.sh` change is a conditional branch: TOML file present → new path, absent → legacy path. - **Operator has full control.** Any config field can be set, not just a predefined set of env vars. - **Template already exists.** `docs/localnet/mpc-config.template.toml` provides a working starting point. #### Cons -- Requires a one-time launcher compose update (measured, needs voting) for the shared-volume rw mount - Base64 encoding in `user-config.conf` is not very ergonomic for large configs - Operator must provide the full config, not just overrides (but the template makes this straightforward) - Secrets (`secret_store_key_hex`) end up in the TOML file on the shared volume (encrypted at rest by the CVM, but visible to processes inside the CVM -- same security model as the current `config.yaml` + env vars) @@ -519,7 +511,7 @@ Add an HTTP endpoint to the MPC node's existing web server (port 8080) for recei **Launcher changes:** - Add `MPC_CONFIG_TOML_BASE64` support: decode and write to `/mnt/shared/mpc-config.toml` -- Change `shared-volume` mount from `:ro` to `:rw` in `launcher_docker_compose.yaml` (requires voting) +- Change `shared-volume` mount from `:ro` to `:rw` in `launcher_docker_compose.yaml` - When TOML config is present, skip passing most `--env` flags (they're in the TOML). Still pass `NEAR_BOOT_NODES` for near node init. **Operator workflow:** @@ -545,14 +537,10 @@ Add an HTTP endpoint to the MPC node's existing web server (port 8080) for recei ### Open Questions -1. **Launcher compose voting timeline.** Changing `shared-volume:/mnt/shared:ro` to `:rw` requires a compose update and voting round. Can this be bundled with the next planned launcher upgrade? - -2. **Near node initialization.** `start.sh` currently handles Near node initialization (`mpc-node init`, genesis download, `config.json` updates). The TOML path only covers MPC node config, not Near node setup. 
We should keep this part of `start.sh` for now and eventually fold it into the TOML path or a separate init command. - -3. **Secret placement.** The TOML config will contain `secret_store_key_hex` on the shared volume. This is encrypted at rest by the CVM, but it means the secret is in a file rather than a transient env var. Is this acceptable? (Note: the current architecture already has `MPC_SECRET_STORE_KEY` as a Docker env var, which is visible in `docker inspect` and persists in the container metadata -- arguably the TOML file is no worse.) +1. **Near node initialization.** `start.sh` currently handles Near node initialization (`mpc-node init`, genesis download, `config.json` updates). The TOML path only covers MPC node config, not Near node setup. We should keep this part of `start.sh` for now and eventually fold it into the TOML path or a separate init command. -4. **Base64 ergonomics.** For large configs, base64 encoding in `user-config.conf` is unwieldy. A future improvement could support direct file placement via dstack APIs, or a reference to a file path instead of inline base64. +2. **Secret placement.** The TOML config will contain `secret_store_key_hex` on the shared volume. This is encrypted at rest by the CVM, but it means the secret is in a file rather than a transient env var. Is this acceptable? (Note: the current architecture already has `MPC_SECRET_STORE_KEY` as a Docker env var, which is visible in `docker inspect` and persists in the container metadata -- arguably the TOML file is no worse.) -5. **Re-voting on config change.** When `foreign_chains` config changes at runtime, the node should automatically call `vote_foreign_chain_policy` with the new policy. This needs rate limiting to avoid vote spam if the operator is iterating on config. +3. **Base64 ergonomics.** For large configs, base64 encoding in `user-config.conf` is unwieldy. 
A future improvement could support direct file placement via dstack APIs, or a reference to a file path instead of inline base64. -6. **Migration path for existing deployments.** Nodes already deployed with the legacy `start.sh` path can be migrated by adding `MPC_CONFIG_TOML_BASE64` to their `user-config.conf` and restarting. The TOML config takes precedence, and the legacy `config.yaml` is no longer read. +4. **Re-voting on config change.** When `foreign_chains` config changes at runtime, the node should automatically call `vote_foreign_chain_policy` with the new policy. This needs rate limiting to avoid vote spam if the operator is iterating on config. From ede8b5eee2115670ad0f6fe818f25e0e11352948 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A5rten=20Blankfors?= Date: Fri, 13 Mar 2026 13:27:46 +0100 Subject: [PATCH 6/9] fix: Simplify design doc --- docs/tdx-config-updates-design.md | 85 +++++++------------------------ 1 file changed, 17 insertions(+), 68 deletions(-) diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md index d84c8968b..17bf6f93c 100644 --- a/docs/tdx-config-updates-design.md +++ b/docs/tdx-config-updates-design.md @@ -240,42 +240,17 @@ Switch TDX deployments from the legacy `start.sh` + `config.yaml` path to the ne #### Design -**Part 1: Launcher generates TOML config** +**Part 1: Operator places TOML config in `/tapp/`** -The launcher already reads `user-config.conf` and builds a `docker run` command. Instead of passing individual `--env` flags, the launcher would: +The `/tapp/` directory is a dstack-managed directory where operators can place files. It is already mounted read-only into both the launcher container and the MPC node container (`/tapp:/tapp:ro`). The operator places a complete `mpc-config.toml` alongside `user-config.conf` in this directory. -1. Read config from `user-config.conf` (existing flat key-value vars) **and** a base64-encoded TOML config -2. 
Write a complete `mpc-config.toml` to `/mnt/shared/mpc-config.toml` -3. Launch the MPC container with a modified entrypoint that uses `start-with-config-file` - -```python -# In launcher.py: -CONFIG_TOML_BASE64_KEY = "MPC_CONFIG_TOML_BASE64" - -def write_config_toml(dstack_config: dict, image_hash: str, platform: Platform): - """Generate mpc-config.toml on the shared volume.""" - config_b64 = dstack_config.get(CONFIG_TOML_BASE64_KEY) - if not config_b64: - return # Fall back to legacy start.sh path - - import base64 - config_toml = base64.b64decode(config_b64).decode("utf-8") - - # Write atomically - tmp_path = "/mnt/shared/mpc-config.toml.tmp" - final_path = "/mnt/shared/mpc-config.toml" - with open(tmp_path, "w") as f: - f.write(config_toml) - os.rename(tmp_path, final_path) -``` - -**Part 2: Modified start.sh (or new entrypoint)** +**Part 2: Modified start.sh** Modify `start.sh` to detect the TOML config and use it instead of generating `config.yaml`: ```bash # At the top of start.sh: -MPC_CONFIG_TOML="/mnt/shared/mpc-config.toml" +MPC_CONFIG_TOML="/tapp/mpc-config.toml" if [ -f "$MPC_CONFIG_TOML" ]; then echo "Found TOML config at $MPC_CONFIG_TOML, using start-with-config-file" @@ -296,34 +271,22 @@ fi **Part 3: Operator workflow** -The operator creates a TOML config file locally (using `mpc-config.template.toml` as a starting point), base64-encodes it, and includes it in `user-config.conf`: +The operator creates a TOML config file locally (using `mpc-config.template.toml` as a starting point) and places it in the dstack app directory: ```bash # 1. Create config from template (or manually) envsubst < docs/localnet/mpc-config.template.toml > mpc-config.toml # Edit to add foreign_chains, adjust settings, etc. -# 2. Base64-encode and add to user-config.conf -MPC_CONFIG_TOML_BASE64=$(base64 -w0 < mpc-config.toml) - -# 3. 
user-config.conf now contains: -cat > user-config.conf << EOF -MPC_IMAGE_NAME=nearone/mpc-node -MPC_IMAGE_TAGS=latest -MPC_REGISTRY=registry.hub.docker.com -MPC_CONFIG_TOML_BASE64=$MPC_CONFIG_TOML_BASE64 -PORTS=8080:8080,3030:3030,80:80,24567:24567 -EOF - -# 4. Deploy or update -vmm-cli.py update-user-config user-config.conf +# 2. Place in the dstack app directory alongside user-config.conf +# The file will be available at /tapp/mpc-config.toml inside the CVM ``` -To update config (e.g., add a new foreign chain), the operator edits the TOML file, re-encodes, and updates `user-config.conf`. With a CVM restart, the node picks up the new config. +To update config (e.g., add a new foreign chain), the operator edits the TOML file and restarts the CVM. The node picks up the new config on boot. **Part 4: Runtime config hot-reload (future phase)** -Add a file watcher to the MPC node that monitors `/mnt/shared/mpc-config.toml` for changes: +Add a file watcher to the MPC node that monitors `/tapp/mpc-config.toml` for changes: ```rust // New: crates/node/src/config/watcher.rs @@ -352,7 +315,7 @@ pub async fn watch_config_file( } ``` -To trigger a hot-reload, the operator updates `user-config.conf` with a new `MPC_CONFIG_TOML_BASE64`, then restarts just the launcher (not the CVM). The launcher writes the new TOML file to the shared volume, and the node's file watcher picks it up. +The operator updates the TOML file in the dstack app directory and restarts the CVM (or, once hot-reload is implemented, the node picks up the change automatically). **Part 5: API key delivery** @@ -368,7 +331,7 @@ scheme = "Bearer" val = "my-api-key-here" ``` -The key is embedded in the TOML config on the encrypted CVM disk. It's delivered via the base64-encoded config in `user-config.conf`. +The key is embedded in the TOML config on the encrypted CVM disk. 
**Via env var reference:** ```toml @@ -380,10 +343,6 @@ The API key is passed as a separate env var in `user-config.conf` (`MPC_ALCHEMY_ Both approaches work today. The inline approach is simpler and avoids the env-var-is-fixed-at-container-start limitation for hot-reload scenarios. -#### Launcher Volume Mount Change - -Currently the launcher mounts `shared-volume:/mnt/shared:ro` (read-only). For the launcher to write `mpc-config.toml`, this needs to change to `:rw` in `launcher_docker_compose.yaml`. Since no nodes are running in TDX today, this is a straightforward change with no migration concerns. - #### Pros - **Uses an already-implemented, tested code path.** `start-with-config-file` is used by all pytests and localnet. It's not new code. - **Natively supports `foreign_chains`.** The TOML `StartConfig` includes `ConfigFile` which has `foreign_chains`. No overlay merging, no YAML hacks. @@ -393,9 +352,8 @@ Currently the launcher mounts `shared-volume:/mnt/shared:ro` (read-only). For th - **Template already exists.** `docs/localnet/mpc-config.template.toml` provides a working starting point. 
#### Cons -- Base64 encoding in `user-config.conf` is not very ergonomic for large configs - Operator must provide the full config, not just overrides (but the template makes this straightforward) -- Secrets (`secret_store_key_hex`) end up in the TOML file on the shared volume (encrypted at rest by the CVM, but visible to processes inside the CVM -- same security model as the current `config.yaml` + env vars) +- Secrets (`secret_store_key_hex`) end up in the TOML file in `/tapp/` (encrypted at rest by the CVM, but visible to processes inside the CVM -- same security model as the current `config.yaml` + env vars) --- @@ -428,7 +386,6 @@ fi #### Pros - Operators only need to provide the delta (e.g., just `foreign_chains`), not the full config -- Smaller base64 payload in `user-config.conf` #### Cons - **Builds on the legacy path** that is being deprecated (`TODO(#2334)`) @@ -506,27 +463,21 @@ Add an HTTP endpoint to the MPC node's existing web server (port 8080) for recei #### Phase 1: TOML Config Delivery (Unblocks TDX Migration) **start.sh changes:** -- Add a conditional at the top: if `/mnt/shared/mpc-config.toml` exists, skip the legacy config generation and run `mpc-node start-with-config-file /mnt/shared/mpc-config.toml` instead +- Add a conditional at the top: if `/tapp/mpc-config.toml` exists, skip the legacy config generation and run `mpc-node start-with-config-file /tapp/mpc-config.toml` instead - Keep the Near node initialization (`initialize_near_node`, `update_near_node_config`) since the TOML path doesn't handle that -**Launcher changes:** -- Add `MPC_CONFIG_TOML_BASE64` support: decode and write to `/mnt/shared/mpc-config.toml` -- Change `shared-volume` mount from `:ro` to `:rw` in `launcher_docker_compose.yaml` -- When TOML config is present, skip passing most `--env` flags (they're in the TOML). Still pass `NEAR_BOOT_NODES` for near node init. 
- **Operator workflow:** - Create TOML config from template, including `foreign_chains` -- Base64-encode, add to `user-config.conf` +- Place `mpc-config.toml` in the dstack app directory - Deploy/update CVM **What this unblocks:** TDX nodes with `foreign_chains` config, testnet migration, arbitrary config customization. #### Phase 2: Runtime Config Hot-Reload -- Add file watcher on `/mnt/shared/mpc-config.toml` in the MPC node +- Add file watcher on `/tapp/mpc-config.toml` in the MPC node - `watch::channel`-based propagation of `ForeignChainsConfig` changes to coordinator and providers - Coordinator re-votes `foreign_chain_policy` when config changes (with rate limiting) -- Operator updates config via `vmm-cli.py update-user-config` → launcher rewrites TOML → node picks up change #### Phase 3: Ergonomics and Tooling @@ -539,8 +490,6 @@ Add an HTTP endpoint to the MPC node's existing web server (port 8080) for recei 1. **Near node initialization.** `start.sh` currently handles Near node initialization (`mpc-node init`, genesis download, `config.json` updates). The TOML path only covers MPC node config, not Near node setup. We should keep this part of `start.sh` for now and eventually fold it into the TOML path or a separate init command. -2. **Secret placement.** The TOML config will contain `secret_store_key_hex` on the shared volume. This is encrypted at rest by the CVM, but it means the secret is in a file rather than a transient env var. Is this acceptable? (Note: the current architecture already has `MPC_SECRET_STORE_KEY` as a Docker env var, which is visible in `docker inspect` and persists in the container metadata -- arguably the TOML file is no worse.) - -3. **Base64 ergonomics.** For large configs, base64 encoding in `user-config.conf` is unwieldy. A future improvement could support direct file placement via dstack APIs, or a reference to a file path instead of inline base64. +2. 
**Secret placement.** The TOML config will contain `secret_store_key_hex` in `/tapp/`. This is encrypted at rest by the CVM, but it means the secret is in a file rather than a transient env var. Is this acceptable? (Note: the current architecture already has `MPC_SECRET_STORE_KEY` as a Docker env var, which is visible in `docker inspect` and persists in the container metadata -- arguably the TOML file is no worse.) -4. **Re-voting on config change.** When `foreign_chains` config changes at runtime, the node should automatically call `vote_foreign_chain_policy` with the new policy. This needs rate limiting to avoid vote spam if the operator is iterating on config. +3. **Re-voting on config change.** When `foreign_chains` config changes at runtime, the node should automatically call `vote_foreign_chain_policy` with the new policy. This needs rate limiting to avoid vote spam if the operator is iterating on config. From d69a290183f29472698f9b60a73d85c7116c8ed4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A5rten=20Blankfors?= Date: Fri, 13 Mar 2026 14:50:59 +0100 Subject: [PATCH 7/9] fix: Revert back to using user-config.conf as that's the only file we can pass to the CVM --- docs/tdx-config-updates-design.md | 145 ++++++++++-------------------- 1 file changed, 49 insertions(+), 96 deletions(-) diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md index 17bf6f93c..02df9d283 100644 --- a/docs/tdx-config-updates-design.md +++ b/docs/tdx-config-updates-design.md @@ -234,23 +234,29 @@ This works well for data that originates from the contract. For operator-specifi ## Proposed Solutions -### Option A: Switch to TOML Config Path with Launcher-Generated Config (Recommended) +### Option A: Switch to TOML Config Path (Recommended) -Switch TDX deployments from the legacy `start.sh` + `config.yaml` path to the new `start-with-config-file` TOML path. 
The launcher generates the TOML config file from operator-provided data and writes it to the shared volume. +Switch TDX deployments from the legacy `start.sh` + `config.yaml` path to the new `start-with-config-file` TOML path. The TOML config is delivered via `user-config.conf` (the only file delivery mechanism dstack provides) as a base64-encoded env var, and `start.sh` decodes it. + +#### Dstack constraint + +Dstack only allows operators to deliver a single `user_config` file (flat KEY=VALUE format) to the CVM via `vmm-cli.py update-user-config`. There is no mechanism to place arbitrary files in `/tapp/`. This means structured config must be embedded within `user-config.conf`. #### Design -**Part 1: Operator places TOML config in `/tapp/`** +**Part 1: Config delivery via `user-config.conf`** -The `/tapp/` directory is a dstack-managed directory where operators can place files. It is already mounted read-only into both the launcher container and the MPC node container (`/tapp:/tapp:ro`). The operator places a complete `mpc-config.toml` alongside `user-config.conf` in this directory. +The operator base64-encodes a TOML config file and includes it as `MPC_CONFIG_TOML_BASE64` in `user-config.conf`. The launcher passes this through as an env var (it matches the `MPC_*` pattern). `start.sh` decodes it and writes the TOML file to the persistent `/data` volume. **Part 2: Modified start.sh** -Modify `start.sh` to detect the TOML config and use it instead of generating `config.yaml`: - ```bash -# At the top of start.sh: -MPC_CONFIG_TOML="/tapp/mpc-config.toml" +MPC_CONFIG_TOML="$MPC_HOME_DIR/mpc-config.toml" + +if [ -n "$MPC_CONFIG_TOML_BASE64" ]; then + echo "Decoding TOML config from MPC_CONFIG_TOML_BASE64" + echo "$MPC_CONFIG_TOML_BASE64" | base64 -d > "$MPC_CONFIG_TOML" +fi if [ -f "$MPC_CONFIG_TOML" ]; then echo "Found TOML config at $MPC_CONFIG_TOML, using start-with-config-file" @@ -269,55 +275,34 @@ fi # ... existing start.sh logic for legacy path ... 
``` -**Part 3: Operator workflow** +Note: the TOML is decoded and overwritten on every boot, so config changes in `user-config.conf` take effect on CVM restart. -The operator creates a TOML config file locally (using `mpc-config.template.toml` as a starting point) and places it in the dstack app directory: +**Part 3: Operator workflow** ```bash -# 1. Create config from template (or manually) +# 1. Create config from template envsubst < docs/localnet/mpc-config.template.toml > mpc-config.toml # Edit to add foreign_chains, adjust settings, etc. -# 2. Place in the dstack app directory alongside user-config.conf -# The file will be available at /tapp/mpc-config.toml inside the CVM -``` +# 2. Base64-encode and add to user-config.conf +MPC_CONFIG_TOML_BASE64=$(base64 -w0 < mpc-config.toml) + +# 3. user-config.conf: +cat > user-config.conf << EOF +MPC_IMAGE_NAME=nearone/mpc-node +MPC_IMAGE_TAGS=latest +MPC_REGISTRY=registry.hub.docker.com +MPC_CONFIG_TOML_BASE64=$MPC_CONFIG_TOML_BASE64 +PORTS=8080:8080,3030:3030,80:80,24567:24567 +EOF -To update config (e.g., add a new foreign chain), the operator edits the TOML file and restarts the CVM. The node picks up the new config on boot. - -**Part 4: Runtime config hot-reload (future phase)** - -Add a file watcher to the MPC node that monitors `/tapp/mpc-config.toml` for changes: - -```rust -// New: crates/node/src/config/watcher.rs -pub async fn watch_config_file( - config_path: PathBuf, - foreign_chains_sender: watch::Sender, - cancellation_token: CancellationToken, -) -> Result<(), ConfigWatchError> { - loop { - select! 
{ - _ = cancellation_token.cancelled() => break Ok(()), - _ = wait_for_file_change(&config_path) => { - match StartConfig::from_toml_file(&config_path) { - Ok(new_config) => { - // Only hot-reload safe fields - foreign_chains_sender.send_replace(new_config.node.foreign_chains); - tracing::info!("Config reloaded successfully"); - } - Err(e) => { - tracing::warn!("Invalid config file, keeping current config: {e}"); - } - } - } - } - } -} +# 4. Deploy or update +vmm-cli.py update-user-config user-config.conf ``` -The operator updates the TOML file in the dstack app directory and restarts the CVM (or, once hot-reload is implemented, the node picks up the change automatically). +To update config, the operator edits the TOML file, re-encodes, updates `user-config.conf`, and restarts the CVM. -**Part 5: API key delivery** +**Part 4: API key delivery** API keys can be handled in two ways: @@ -350,53 +335,17 @@ Both approaches work today. The inline approach is simpler and avoids the env-va - **Aligns with the deprecation direction.** The old `start` CLI command is already marked `TODO(#2334): deprecate this`. This moves TDX to the intended future path. - **Operator has full control.** Any config field can be set, not just a predefined set of env vars. - **Template already exists.** `docs/localnet/mpc-config.template.toml` provides a working starting point. +- **No launcher changes required.** Only `start.sh` needs modification; the launcher already passes `MPC_*` env vars through. 
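The encode/decode round-trip behind this delivery path can be sketched end-to-end. This is illustrative Python mirroring the `base64 -w0` / `base64 -d` shell steps above; the TOML field names are placeholders, not the real `StartConfig` schema:

```python
import base64

# Operator side: the TOML config the operator would author from the template.
# Field names below are placeholders for illustration only.
toml_config = """[node]
home_dir = "/data"

[[node.foreign_chains]]
chain_id = "eth-mainnet"
"""

# Equivalent of `base64 -w0 < mpc-config.toml`: single-line base64,
# safe to embed in the flat KEY=VALUE format of user-config.conf.
encoded = base64.b64encode(toml_config.encode()).decode()
user_config_line = f"MPC_CONFIG_TOML_BASE64={encoded}"

# CVM side: start.sh splits KEY=VALUE and runs the equivalent of
# `echo "$MPC_CONFIG_TOML_BASE64" | base64 -d > "$MPC_CONFIG_TOML"`.
_, value = user_config_line.split("=", 1)
decoded = base64.b64decode(value).decode()

assert decoded == toml_config  # lossless round-trip
assert "\n" not in encoded     # fits on one line of user-config.conf
```

Note that splitting on the first `=` is what keeps base64's trailing `=` padding intact, which is also why the flat KEY=VALUE format tolerates this payload.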
#### Cons +- Base64 encoding in `user-config.conf` is not very ergonomic for large configs - Operator must provide the full config, not just overrides (but the template makes this straightforward) -- Secrets (`secret_store_key_hex`) end up in the TOML file in `/tapp/` (encrypted at rest by the CVM, but visible to processes inside the CVM -- same security model as the current `config.yaml` + env vars) - ---- - -### Option B: Extend start.sh with YAML Config Overlay - -Instead of switching to the TOML path, extend the existing `start.sh` to merge an overlay file into the generated `config.yaml`. - -#### Design - -Modify `start.sh` to merge an optional config overlay file into `config.yaml` after generating the base template: - -```bash -MPC_CONFIG_OVERLAY="/mnt/shared/config-overlay.yaml" -if [ -f "$MPC_CONFIG_OVERLAY" ]; then - python3 -c " -import yaml -base = yaml.safe_load(open('$MPC_NODE_CONFIG_FILE')) -overlay = yaml.safe_load(open('$MPC_CONFIG_OVERLAY')) -def merge(b, o): - for k, v in o.items(): - if k in b and isinstance(b[k], dict) and isinstance(v, dict): - merge(b[k], v) - else: - b[k] = v -merge(base, overlay) -yaml.dump(base, open('$MPC_NODE_CONFIG_FILE', 'w'), default_flow_style=False) -" -fi -``` - -#### Pros -- Operators only need to provide the delta (e.g., just `foreign_chains`), not the full config - -#### Cons -- **Builds on the legacy path** that is being deprecated (`TODO(#2334)`) -- Requires `pyyaml` in the Docker image (not currently installed) -- YAML deep-merge is fragile and error-prone -- Two config formats to maintain (YAML for overlay, env vars for the rest) -- Still limited by what `start.sh` generates -- the overlay can add fields but can't cleanly modify nested structures generated by the template +- Secrets (`secret_store_key_hex`) end up in the TOML file on disk (encrypted at rest by the CVM -- same security model as the current `config.yaml` + env vars) +- The 1024-byte per-value limit in the launcher may need to be raised for 
large TOML configs --- -### Option C: Contract-Driven Configuration +### Option B: Contract-Driven Configuration Store the `foreign_chains` configuration on the contract and have nodes read it via the indexer. @@ -418,7 +367,7 @@ The existing `vote_foreign_chain_policy` mechanism already stores chain/provider --- -### Option D: Node HTTP API for Config Updates +### Option C: Node HTTP API for Config Updates Add an HTTP endpoint to the MPC node's existing web server (port 8080) for receiving config updates. @@ -450,32 +399,36 @@ Add an HTTP endpoint to the MPC node's existing web server (port 8080) for recei 1. **The code already exists.** `start-with-config-file` is implemented, tested (all pytests use it), and validated. We are not writing new config-parsing code -- we are routing TDX deployments to an existing, battle-tested code path. -2. **Aligns with the codebase direction.** The old `start` command is explicitly marked for deprecation (`TODO(#2334)`). Option B (YAML overlay) would invest in extending a path that we intend to remove. +2. **Aligns with the codebase direction.** The old `start` command is explicitly marked for deprecation (`TODO(#2334)`). 3. **Solves the full problem, not just foreign chains.** With TOML config, operators have full control over all configuration fields. This eliminates the entire class of "I need to change X but start.sh doesn't support it" problems. -4. **No new dependencies or fragile merge logic.** Option B requires `pyyaml` and YAML deep-merge. Option A uses `toml` deserialization that's already in the binary. +4. **No new dependencies or fragile merge logic.** Uses `toml` deserialization that's already in the binary. -5. **Single source of truth.** One TOML file replaces the split between env vars, `config.yaml`, and potential overlays. This is easier to reason about, debug, and version-control. +5. **Single source of truth.** One TOML file replaces the split between env vars and `config.yaml`. 
Easier to reason about, debug, and version-control. ### Proposed Implementation Plan #### Phase 1: TOML Config Delivery (Unblocks TDX Migration) **start.sh changes:** -- Add a conditional at the top: if `/tapp/mpc-config.toml` exists, skip the legacy config generation and run `mpc-node start-with-config-file /tapp/mpc-config.toml` instead +- If `MPC_CONFIG_TOML_BASE64` env var is set, decode it and write to `$MPC_HOME_DIR/mpc-config.toml` +- If `mpc-config.toml` exists, skip the legacy config generation and run `mpc-node start-with-config-file` instead - Keep the Near node initialization (`initialize_near_node`, `update_near_node_config`) since the TOML path doesn't handle that +**Launcher changes:** +- Raise the 1024-byte per-value limit (or exempt `MPC_CONFIG_TOML_BASE64` from it), since a base64-encoded TOML config will exceed this + **Operator workflow:** - Create TOML config from template, including `foreign_chains` -- Place `mpc-config.toml` in the dstack app directory +- Base64-encode and add as `MPC_CONFIG_TOML_BASE64` in `user-config.conf` - Deploy/update CVM **What this unblocks:** TDX nodes with `foreign_chains` config, testnet migration, arbitrary config customization. #### Phase 2: Runtime Config Hot-Reload -- Add file watcher on `/tapp/mpc-config.toml` in the MPC node +- Add file watcher on `$MPC_HOME_DIR/mpc-config.toml` in the MPC node - `watch::channel`-based propagation of `ForeignChainsConfig` changes to coordinator and providers - Coordinator re-votes `foreign_chain_policy` when config changes (with rate limiting) @@ -490,6 +443,6 @@ Add an HTTP endpoint to the MPC node's existing web server (port 8080) for recei 1. **Near node initialization.** `start.sh` currently handles Near node initialization (`mpc-node init`, genesis download, `config.json` updates). The TOML path only covers MPC node config, not Near node setup. We should keep this part of `start.sh` for now and eventually fold it into the TOML path or a separate init command. -2. 
**Secret placement.** The TOML config will contain `secret_store_key_hex` in `/tapp/`. This is encrypted at rest by the CVM, but it means the secret is in a file rather than a transient env var. Is this acceptable? (Note: the current architecture already has `MPC_SECRET_STORE_KEY` as a Docker env var, which is visible in `docker inspect` and persists in the container metadata -- arguably the TOML file is no worse.)

+2. **Launcher env var size limit.** The launcher enforces a 1024-byte per-value limit and a 32KB total limit. A base64-encoded TOML config with `foreign_chains` will likely exceed 1024 bytes. We need to either raise this limit for `MPC_CONFIG_TOML_BASE64` or remove the per-value cap entirely (the total payload cap is sufficient protection).

3. **Re-voting on config change.** When `foreign_chains` config changes at runtime, the node should automatically call `vote_foreign_chain_policy` with the new policy. This needs rate limiting to avoid vote spam if the operator is iterating on config.

From 03cbfe74c08ca211cfe40002019a9f1889f27e0b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?M=C3=A5rten=20Blankfors?=
Date: Fri, 13 Mar 2026 14:54:30 +0100
Subject: [PATCH 8/9] fix: add purpose

---
 docs/tdx-config-updates-design.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md
index 02df9d283..9ccf177d4 100644
--- a/docs/tdx-config-updates-design.md
+++ b/docs/tdx-config-updates-design.md
@@ -1,9 +1,10 @@
 # Dynamic Configuration Updates in TDX Deployments

-**Status:** WIP / Design
-**Issue:** #2420
-**Authors:** Claude + Mårten
-**Date:** 2026-03-13
+## Purpose
+This is a mostly Claude-generated document that outlines how we can support MPC configuration updates
+in our TDX deployments.
+
+It's intended as a temporary design document and should be removed once we implement the feature.
## Problem Statement From 1ec00a6a1cd8866468df37860eacc446f92fa7ea Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A5rten=20Blankfors?= Date: Fri, 13 Mar 2026 15:19:55 +0100 Subject: [PATCH 9/9] close some open discussions --- docs/tdx-config-updates-design.md | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md index 9ccf177d4..1e5230377 100644 --- a/docs/tdx-config-updates-design.md +++ b/docs/tdx-config-updates-design.md @@ -342,7 +342,6 @@ Both approaches work today. The inline approach is simpler and avoids the env-va - Base64 encoding in `user-config.conf` is not very ergonomic for large configs - Operator must provide the full config, not just overrides (but the template makes this straightforward) - Secrets (`secret_store_key_hex`) end up in the TOML file on disk (encrypted at rest by the CVM -- same security model as the current `config.yaml` + env vars) -- The 1024-byte per-value limit in the launcher may need to be raised for large TOML configs --- @@ -418,7 +417,7 @@ Add an HTTP endpoint to the MPC node's existing web server (port 8080) for recei - Keep the Near node initialization (`initialize_near_node`, `update_near_node_config`) since the TOML path doesn't handle that **Launcher changes:** -- Raise the 1024-byte per-value limit (or exempt `MPC_CONFIG_TOML_BASE64` from it), since a base64-encoded TOML config will exceed this +- Raise the per-value size limit for `MPC_CONFIG_TOML_BASE64` (currently 1024 bytes, our own guardrail in `launcher.py`) **Operator workflow:** - Create TOML config from template, including `foreign_chains` @@ -440,10 +439,3 @@ Add an HTTP endpoint to the MPC node's existing web server (port 8080) for recei - Update operator guides (`running-an-mpc-node-in-tdx-external-guide.md`, `deploy-launcher-guide.md`) - Consider moving Near node init into the TOML path to fully eliminate `start.sh` -### Open Questions - -1. 
**Near node initialization.** `start.sh` currently handles Near node initialization (`mpc-node init`, genesis download, `config.json` updates). The TOML path only covers MPC node config, not Near node setup. We should keep this part of `start.sh` for now and eventually fold it into the TOML path or a separate init command. - -2. **Launcher env var size limit.** The launcher enforces a 1024-byte per-value limit and 32KB total limit. A base64-encoded TOML config with `foreign_chains` will likely exceed 1024 bytes. We need to either raise this limit for `MPC_CONFIG_TOML_BASE64` or remove the per-value cap entirely (the total payload cap is sufficient protection). - -3. **Re-voting on config change.** When `foreign_chains` config changes at runtime, the node should automatically call `vote_foreign_chain_policy` with the new policy. This needs rate limiting to avoid vote spam if the operator is iterating on config.
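The launcher guardrail change above (raising or exempting the per-value cap for `MPC_CONFIG_TOML_BASE64` while keeping the total cap as the real protection) could look like this sketch. The names and structure are hypothetical; the actual check lives in `launcher.py` and may be organized differently:

```python
# Hypothetical sketch of the launcher's user-config validation with an
# exemption for the large base64 TOML payload. Constants mirror the limits
# described in the design doc; function and variable names are illustrative.
MAX_VALUE_BYTES = 1024        # current per-value guardrail
MAX_TOTAL_BYTES = 32 * 1024   # overall user-config.conf cap

EXEMPT_KEYS = {"MPC_CONFIG_TOML_BASE64"}  # base64-encoded TOML exceeds 1024 bytes

def validate_user_config(entries: dict) -> None:
    """Reject configs that exceed the total cap, or any non-exempt oversized value."""
    total = sum(len(k) + len(v) for k, v in entries.items())
    if total > MAX_TOTAL_BYTES:
        raise ValueError(f"user config exceeds {MAX_TOTAL_BYTES} bytes total")
    for key, value in entries.items():
        if key in EXEMPT_KEYS:
            continue  # still bounded by the total cap above
        if len(value.encode()) > MAX_VALUE_BYTES:
            raise ValueError(f"value for {key} exceeds {MAX_VALUE_BYTES} bytes")

# A 4KB TOML payload passes because the key is exempt and the total cap holds.
validate_user_config({"MPC_CONFIG_TOML_BASE64": "x" * 4096, "MPC_IMAGE_TAGS": "latest"})
```

This keeps the spirit of the guardrail: no individual value can balloon unnoticed, yet the one key that legitimately carries a multi-kilobyte payload is allowed through under the total-size ceiling.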