diff --git a/docs/tdx-config-updates-design.md b/docs/tdx-config-updates-design.md
new file mode 100644
index 000000000..1e5230377
--- /dev/null
+++ b/docs/tdx-config-updates-design.md
@@ -0,0 +1,441 @@
+# Dynamic Configuration Updates in TDX Deployments
+
+## Purpose
+This is a mostly Claude-generated document that outlines how we can implement support for MPC configuration updates
+in our TDX deployments.
+
+It's intended as a temporary design document and should be removed once we implement the feature.
+
+## Problem Statement
+
+MPC nodes running in TDX have **no way to receive arbitrary configuration**, most critically the `foreign_chains` section needed for foreign transaction validation. This is not merely a "hot-reload" problem -- the configuration literally cannot be delivered into the CVM at all with the current architecture.
+
+### The Core Gap
+
+The `config.yaml` file that the MPC node reads at startup is generated by `deployment/start.sh` using a **hardcoded template** that only includes a fixed set of fields derived from environment variables:
+
+```bash
+# From deployment/start.sh:initialize_mpc_config()
+cat <<EOF >"$1"
+my_near_account_id: $MPC_ACCOUNT_ID
+near_responder_account_id: $responder_id
+...
+indexer:
+  ...
+  mpc_contract_id: $MPC_CONTRACT_ID
+  ...
+EOF
+```
+
+This template has **no `foreign_chains` section**. There is no mechanism to inject one, because:
+
+1. **`start.sh` cannot template arbitrary YAML.** It uses simple `cat` heredocs and `sed` substitutions. The `foreign_chains` config is deeply nested YAML with per-chain provider lists, auth configs, and API key references -- it cannot be expressed as flat `KEY=VALUE` environment variables.
+
+2. **The launcher only passes flat env vars.** The launcher (`launcher.py`) reads `user-config.conf` (a flat `.env` file) and passes matching `MPC_*` keys as `--env` flags to `docker run`. There is no mechanism for structured config data.
+
+3. **`config.yaml` is only generated once.** On first boot, `start.sh` generates the file. On subsequent boots, `update_mpc_config()` only updates `my_near_account_id`, `mpc_contract_id`, and `near_responder_account_id` via `sed`. Any manual edits to `config.yaml` inside the CVM persist across restarts, but there is no external interface to make such edits.
+
+4. **No volume path for config files.** The CVM has two relevant volumes: `mpc-data:/data` (the MPC home dir, containing `config.yaml`) and `shared-volume:/mnt/shared` (shared between launcher and node). Neither is exposed to the operator for file injection after deployment.
+
+### Immediate Consequences
+
+- **Foreign transaction validation is blocked on TDX.** We cannot deploy nodes with `foreign_chains` config, which means TDX nodes cannot participate in foreign tx validation. This blocks testnet migration to TDX.
+
+- **API keys cannot be delivered.** Foreign chain providers requiring authentication (Alchemy, QuickNode, etc.) need API keys, delivered either as `TokenConfig::Env` (env var reference) or `TokenConfig::Val` (inline value). Neither path works: env vars are fixed at container creation, and inline values would need to be in `config.yaml`, which can't be populated.
+
+- **No config updates of any kind.** Even non-foreign-chain config changes (e.g., adjusting triple/presignature concurrency, changing boot nodes) require stopping the CVM, updating `user-config.conf`, and restarting -- which causes downtime.
+
+### Recent Development: `start-with-config-file` (TOML)
+
+Commit `78d3e767` ("feat: allow configuration files for full config of the mpc node", PR #2332) introduced a new `start-with-config-file` CLI command that reads the **entire** node configuration from a single TOML file:
+
+```
+mpc-node start-with-config-file /path/to/mpc-config.toml
+```
+
+The `StartConfig` TOML struct includes all configuration in one file:
+
+```toml
+home_dir = "/data"
+
+[secrets]
+secret_store_key_hex = "..."
+
+[tee]
+image_hash = "..."
+latest_allowed_hash_file = "/mnt/shared/image-digest.bin"
+[tee.authority]
+type = "dstack"
+dstack_endpoint = "/var/run/dstack.sock"
+
+[node]
+my_near_account_id = "my-account.testnet"
+# ... all node config fields ...
+
+[node.foreign_chains.bitcoin]
+timeout_sec = 30
+max_retries = 3
+[node.foreign_chains.bitcoin.providers.public]
+api_variant = "esplora"
+rpc_url = "https://blockstream.info/api"
+[node.foreign_chains.bitcoin.providers.public.auth]
+kind = "none"
+```
+
+Key facts about this feature:
+- **All pytests already use this path** -- tests write `start_config.toml` and launch via `start-with-config-file`
+- **A TOML template exists** at `docs/localnet/mpc-config.template.toml` with `foreign_chains` included
+- **The old `start` command is marked for deprecation** (`cli.rs:226`: `TODO(#2334): deprecate this`)
+- **`StartConfig::from_toml_file()` validates the config** including `foreign_chains`
+- **The `node` section is a `ConfigFile`** -- the exact same struct used by `config.yaml`, so feature parity is guaranteed
+
+This feature **fundamentally changes the design space**: instead of working around the limitations of `start.sh` with overlay hacks, we can switch TDX deployments to the TOML path and get `foreign_chains` support natively.
+
+## Current Architecture in Detail
+
+### Deployment Flow (Current -- `start.sh` Path)
+
+```
+Operator writes user-config.conf (flat KEY=VALUE)
+    |
+    v
+dstack VMM deploys CVM with launcher docker-compose
+    |
+    v
+Launcher container starts (launcher.py)
+    |
+    +--> Reads /tapp/user_config (the user-config.conf file)
+    +--> Selects & validates MPC docker image hash
+    +--> Extends RTMR3 with image hash (TEE attestation)
+    +--> Builds `docker run` command:
+    |      --env MPC_ACCOUNT_ID=...
+    |      --env MPC_CONTRACT_ID=...
+    |      --env MPC_HOME_DIR=/data
+    |      --env MPC_IMAGE_HASH=...
+    |      --env DSTACK_ENDPOINT=...
+    |      -v mpc-data:/data
+    |      -v shared-volume:/mnt/shared
+    |      (image digest)
+    |
+    v
+MPC node container starts → /app/start.sh runs
+    |
+    +--> First boot: generates config.yaml from env vars (hardcoded template)
+    |    (NO foreign_chains, NO custom fields)
+    +--> Subsequent boots: sed-updates account_id/contract_id only
+    +--> Generates secrets.json if missing (p2p key, signer key)
+    +--> Runs: /app/mpc-node start [local|dstack]
+    |
+    v
+mpc-node process (cli.rs → run.rs):
+    +--> Reads config.yaml from $MPC_HOME_DIR/config.yaml
+    +--> Reads secrets.json
+    +--> Starts indexer, coordinator, web server
+    +--> Starts allowed_image_hashes_watcher (writes to /mnt/shared/)
+```
+
+### Volume Layout Inside the CVM
+
+```
+/data/                          (mpc-data volume, persistent across restarts)
+├── config.yaml                 (generated by start.sh, read by mpc-node)
+├── config.json                 (near node config)
+├── secrets.json                (p2p key, signer key -- generated inside CVM)
+├── data/                       (near indexer state)
+└── backup_encryption_key.hex
+
+/mnt/shared/                    (shared-volume, shared between launcher and node)
+└── image-digest.bin            (written by node, read by launcher -- approved image hashes)
+
+/tapp/                          (dstack app config, read-only in launcher container)
+└── user_config                 (the user-config.conf file)
+```
+
+### What start.sh Actually Generates
+
+The `initialize_mpc_config()` function in `start.sh` produces:
+
+```yaml
+my_near_account_id:
+near_responder_account_id:
+number_of_responder_keys: 50
+web_ui: 0.0.0.0:8080
+migration_web_ui: 0.0.0.0:8079
+pprof_bind_address: 0.0.0.0:34001
+triple:
+  concurrency: 2
+  desired_triples_to_buffer: 1000000
+  timeout_sec: 60
+  parallel_triple_generation_stagger_time_sec: 1
+presignature:
+  concurrency: 16
+  desired_presignatures_to_buffer: 8192
+  timeout_sec: 60
+signature:
+  timeout_sec: 60
+ckd:
+  timeout_sec: 60
+indexer:
+  validate_genesis: false
+  sync_mode: Latest
+  concurrency: 1
+  mpc_contract_id:
+  finality: optimistic
+  port_override: 80  # added via sed for non-localnet
+cores: 12
+# NOTE: NO foreign_chains section
+```
+
+On subsequent boots, `update_mpc_config()` runs:
+```bash
+sed -i "s/my_near_account_id:.*/my_near_account_id: $MPC_ACCOUNT_ID/" "$1"
+sed -i "s/mpc_contract_id:.*/mpc_contract_id: $MPC_CONTRACT_ID/" "$1"
+sed -i "s/near_responder_account_id:.*/near_responder_account_id: $responder_id/" "$1"
+```
+
+Nothing else is updated. The `foreign_chains` section, if somehow manually added, would persist -- but there is no way to add it from outside the CVM.
+
+### What the Launcher Allows Through
+
+The launcher (`launcher.py`) passes env vars to the MPC container with strict filtering:
+
+- **Allowed keys:** `MPC_*` matching regex `^MPC_[A-Z0-9_]{1,64}$`, plus `RUST_LOG`, `RUST_BACKTRACE`, `NEAR_BOOT_NODES`
+- **Denied keys:** `MPC_P2P_PRIVATE_KEY`, `MPC_ACCOUNT_SK`
+- **Launcher-only keys (not passed through):** `MPC_IMAGE_TAGS`, `MPC_IMAGE_NAME`, `MPC_REGISTRY`, `MPC_HASH_OVERRIDE`, `RPC_*`
+- **Special handling:** `PORTS` → `-p` flags, `EXTRA_HOSTS` → `--add-host` flags
+- **Limits:** max 64 vars, max 1024 bytes per value, max 32KB total
+
+This is all flat key-value. No structured data can pass through.
+
+### TEE/Attestation Constraints
+
+| Component | Measured in | Changeable without breaking attestation? |
+|-----------|-------------|------------------------------------------|
+| Launcher docker image | RTMR3 | No |
+| Launcher docker-compose | RTMR3 | No |
+| MPC docker image hash | RTMR3 | No (but approved list is dynamic) |
+| vCPU, Memory | RTMR2 | No |
+| Guest OS / dstack | MRTD, RTMR0-2 | No |
+| `user-config.conf` | **Not measured** | Yes |
+| `/mnt/shared/` contents | **Not measured** (encrypted at rest) | Yes |
+| `config.yaml` / TOML | **Not measured** (inside encrypted CVM disk) | Yes (if we can get data in) |
+
+Key insight: **Config files live on the encrypted CVM disk and are not individually measured.** The attestation verifies that the correct *code* is running, not the *config data*. So updating config does not break attestation -- the challenge is purely mechanical: getting structured config data into the running node.
+
+### The Existing Dynamic Update Pattern
+
+The `allowed_image_hashes_watcher` (`crates/node/src/tee/allowed_image_hashes_watcher.rs`) provides a working pattern for runtime data updates:
+
+1. **Source:** The indexer monitors the contract for approved image hash changes
+2. **Delivery:** Changes arrive via a `watch::Receiver<...>` channel
+3. **Storage:** The watcher writes the hash list atomically to `/mnt/shared/image-digest.bin` (write to `.tmp`, then `rename`)
+4. **Consumer:** The launcher reads this file on next boot to select which image to run
+
+This works well for data that originates from the contract. For operator-specific configuration like foreign chain providers and API keys, we need a different delivery mechanism.
+
+## Proposed Solutions
+
+### Option A: Switch to TOML Config Path (Recommended)
+
+Switch TDX deployments from the legacy `start.sh` + `config.yaml` path to the new `start-with-config-file` TOML path. The TOML config is delivered via `user-config.conf` (the only file delivery mechanism dstack provides) as a base64-encoded env var, and `start.sh` decodes it.
+
+#### Dstack constraint
+
+Dstack only allows operators to deliver a single `user_config` file (flat KEY=VALUE format) to the CVM via `vmm-cli.py update-user-config`. There is no mechanism to place arbitrary files in `/tapp/`. This means structured config must be embedded within `user-config.conf`.
+
+#### Design
+
+**Part 1: Config delivery via `user-config.conf`**
+
+The operator base64-encodes a TOML config file and includes it as `MPC_CONFIG_TOML_BASE64` in `user-config.conf`. The launcher passes this through as an env var (it matches the `MPC_*` pattern), provided its 1024-byte per-value limit is raised to fit the encoded config. `start.sh` decodes it and writes the TOML file to the persistent `/data` volume.
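
To gauge whether an encoded config fits through the launcher's env-var filter, the size arithmetic can be sketched as below. Standard base64 turns every 3 input bytes into 4 output characters, so an n-byte TOML file becomes 4 * ceil(n / 3) characters. The two limits are the ones this document quotes for `launcher.py`; the function names are hypothetical and are not part of the launcher.

```python
import base64
import math
from typing import Tuple

# Limits quoted in this document for launcher.py; treated as assumptions here.
MAX_VALUE_BYTES = 1024
MAX_TOTAL_BYTES = 32 * 1024


def encoded_size(raw_len: int) -> int:
    """Length of standard padded base64 output for raw_len input bytes."""
    return 4 * math.ceil(raw_len / 3)


def encode_config(toml_text: str) -> str:
    """Encode TOML text as single-line base64 (the same output as `base64 -w0`)."""
    return base64.b64encode(toml_text.encode("utf-8")).decode("ascii")


def fits_launcher_limits(value: str,
                         per_value: int = MAX_VALUE_BYTES,
                         total: int = MAX_TOTAL_BYTES) -> Tuple[bool, bool]:
    """Return (fits_per_value_limit, fits_total_limit) for a single env value."""
    size = len(value.encode("ascii"))
    return size <= per_value, size <= total
```

Under these numbers, only a TOML file of at most 768 bytes fits the per-value limit, and about 24 KB would fit the total limit even if `MPC_CONFIG_TOML_BASE64` were the only variable. A config with several `foreign_chains` providers is likely to exceed the first figure.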
+
+**Part 2: Modified start.sh**
+
+```bash
+MPC_CONFIG_TOML="$MPC_HOME_DIR/mpc-config.toml"
+
+if [ -n "$MPC_CONFIG_TOML_BASE64" ]; then
+    echo "Decoding TOML config from MPC_CONFIG_TOML_BASE64"
+    # Decode to a temp file first so a failed decode never leaves a truncated config behind
+    if ! printf '%s' "$MPC_CONFIG_TOML_BASE64" | base64 -d > "$MPC_CONFIG_TOML.tmp"; then
+        echo "Failed to decode MPC_CONFIG_TOML_BASE64" >&2
+        exit 1
+    fi
+    mv "$MPC_CONFIG_TOML.tmp" "$MPC_CONFIG_TOML"
+fi
+
+if [ -f "$MPC_CONFIG_TOML" ]; then
+    echo "Found TOML config at $MPC_CONFIG_TOML, using start-with-config-file"
+
+    # Still need to initialize the near node (genesis, config.json)
+    if [ ! -r "$NEAR_NODE_CONFIG_FILE" ]; then
+        initialize_near_node "$MPC_HOME_DIR"
+    fi
+    update_near_node_config
+
+    echo "Starting mpc node with TOML config..."
+    /app/mpc-node start-with-config-file "$MPC_CONFIG_TOML"
+    exit $?
+fi
+
+# ... existing start.sh logic for legacy path ...
+```
+
+Note: whenever `MPC_CONFIG_TOML_BASE64` is set, the TOML is decoded and overwritten on every boot, so config changes in `user-config.conf` take effect on CVM restart. If the variable is later removed, the previously written `mpc-config.toml` stays on the `/data` volume and is still used.
+
+**Part 3: Operator workflow**
+
+```bash
+# 1. Create config from template
+envsubst < docs/localnet/mpc-config.template.toml > mpc-config.toml
+# Edit to add foreign_chains, adjust settings, etc.
+
+# 2. Base64-encode (-w0 disables line wrapping in GNU coreutils)
+MPC_CONFIG_TOML_BASE64=$(base64 -w0 < mpc-config.toml)
+
+# 3. Write user-config.conf, embedding the encoded config
+cat > user-config.conf << EOF
+MPC_IMAGE_NAME=nearone/mpc-node
+MPC_IMAGE_TAGS=latest
+MPC_REGISTRY=registry.hub.docker.com
+MPC_CONFIG_TOML_BASE64=$MPC_CONFIG_TOML_BASE64
+PORTS=8080:8080,3030:3030,80:80,24567:24567
+EOF
+
+# 4. Deploy or update
+vmm-cli.py update-user-config user-config.conf
+```
+
+To update config, the operator edits the TOML file, re-encodes, updates `user-config.conf`, and restarts the CVM.
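
The round trip in Parts 2 and 3 (embed the TOML as one base64 value, then recover it inside the container) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `build_user_config`, `parse_user_config`, and `recover_toml` are hypothetical helpers, and the flat-file parsing shown here is a guess at `.env` semantics rather than the launcher's actual parser.

```python
import base64


def build_user_config(toml_text: str, settings: dict) -> str:
    """Render flat KEY=VALUE lines, embedding the TOML as one base64 value."""
    encoded = base64.b64encode(toml_text.encode("utf-8")).decode("ascii")
    entries = dict(settings)
    entries["MPC_CONFIG_TOML_BASE64"] = encoded
    for key, value in entries.items():
        # A flat KEY=VALUE file cannot carry multi-line keys or values.
        if "\n" in key or "\n" in value:
            raise ValueError(f"{key} must be a single line")
    return "".join(f"{key}={value}\n" for key, value in entries.items())


def parse_user_config(text: str) -> dict:
    """Parse KEY=VALUE lines, splitting on the first '=' only.

    Base64 padding also uses '=', so splitting on every '=' would corrupt the value.
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        result[key] = value
    return result


def recover_toml(user_config_text: str) -> str:
    """Return the text that the `base64 -d` step in start.sh would write to disk."""
    encoded = parse_user_config(user_config_text)["MPC_CONFIG_TOML_BASE64"]
    return base64.b64decode(encoded, validate=True).decode("utf-8")
```

Calling `recover_toml(build_user_config(toml_text, {...}))` should return `toml_text` unchanged, which makes a cheap pre-deployment check that the encoded value survived templating.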
+
+**Part 4: API key delivery**
+
+API keys can be handled in two ways:
+
+**Inline in TOML (simplest):**
+```toml
+[node.foreign_chains.ethereum.providers.alchemy.auth]
+kind = "header"
+name = "Authorization"
+scheme = "Bearer"
+[node.foreign_chains.ethereum.providers.alchemy.auth.token]
+val = "my-api-key-here"
+```
+
+The key is embedded in the TOML config on the encrypted CVM disk.
+
+**Via env var reference:**
+```toml
+[node.foreign_chains.ethereum.providers.alchemy.auth.token]
+env = "MPC_ALCHEMY_API_KEY"
+```
+
+The API key is passed as a separate env var in `user-config.conf` (`MPC_ALCHEMY_API_KEY=...`), which the launcher passes through to the container. The TOML config references it by name.
+
+The node already supports both forms. The inline approach is simpler and avoids the env-var-is-fixed-at-container-start limitation for hot-reload scenarios.
+
+#### Pros
+- **Uses an already-implemented, tested code path.** `start-with-config-file` is used by all pytests and localnet. It's not new code.
+- **Natively supports `foreign_chains`.** The TOML `StartConfig` includes `ConfigFile` which has `foreign_chains`. No overlay merging, no YAML hacks.
+- **Single source of truth.** One TOML file contains the entire config. No split between env vars, `config.yaml`, and overlay files.
+- **Aligns with the deprecation direction.** The old `start` CLI command is already marked `TODO(#2334): deprecate this`. This moves TDX to the intended future path.
+- **Operator has full control.** Any config field can be set, not just a predefined set of env vars.
+- **Template already exists.** `docs/localnet/mpc-config.template.toml` provides a working starting point.
+- **Minimal launcher changes.** The launcher already passes `MPC_*` env vars through; the only change needed there is raising the 1024-byte per-value limit for `MPC_CONFIG_TOML_BASE64`. The rest of the work is in `start.sh`.
+
+#### Cons
+- Base64 encoding in `user-config.conf` is not very ergonomic for large configs
+- Operator must provide the full config, not just overrides (but the template makes this straightforward)
+- Secrets (`secret_store_key_hex`) end up in the TOML file on disk (encrypted at rest by the CVM -- same security model as the current `config.yaml` + env vars)
+
+---
+
+### Option B: Contract-Driven Configuration
+
+Store the `foreign_chains` configuration on the contract and have nodes read it via the indexer.
+
+#### Design
+
+The existing `vote_foreign_chain_policy` mechanism already stores chain/provider URLs on-chain. Extend it to store the full provider config (including `api_variant`, timeouts, retries) so nodes can reconstruct their `foreign_chains` config from contract state.
+
+#### Pros
+- Consensus built-in: all operators agree on config
+- No file delivery mechanism needed for chain definitions
+- Already partially implemented
+
+#### Cons
+- API keys still need a local solution (we're back to the same problem for secrets)
+- Slow iteration: every config change requires a voting round across all operators
+- The contract would need schema changes to store `api_variant`, `timeout_sec`, `max_retries`
+- RPC provider preferences are somewhat operator-specific (different API tiers, different providers)
+- Does not solve the general config update problem (only foreign chains)
+
+---
+
+### Option C: Node HTTP API for Config Updates
+
+Add an HTTP endpoint to the MPC node's existing web server (port 8080) for receiving config updates.
+
+#### Design
+
+1. Add `POST /config/foreign_chains` endpoint to the web server
+2. Operator sends YAML/JSON config via `curl` from outside the CVM
+3. Node validates, persists to disk, and applies in-memory
+4. Authentication via the CVM's TLS certificate or a shared token
+
+#### Pros
+- Most ergonomic for operators (`curl -X POST ...`)
+- No file delivery complexity
+- Can support both config and API key updates
+- Works with existing port forwarding (8080 is already exposed)
+
+#### Cons
+- New attack surface: anyone who can reach port 8080 can push config
+- Authentication mechanism needs design (TLS mutual auth, bearer token?)
+- Port 8080 is already public (used for `/public_data` and telemetry)
+- Secrets transmitted over the network need encryption
+- More code to write and maintain vs file-based approach
+
+## Recommendation
+
+**Option A (Switch to TOML Config Path)** is the clear recommendation.
+
+### Rationale
+
+1. **The code already exists.** `start-with-config-file` is implemented, tested (all pytests use it), and validated. We are not writing new config-parsing code -- we are routing TDX deployments to an existing, battle-tested code path.
+
+2. **Aligns with the codebase direction.** The old `start` command is explicitly marked for deprecation (`TODO(#2334)`).
+
+3. **Solves the full problem, not just foreign chains.** With TOML config, operators have full control over all configuration fields. This eliminates the entire class of "I need to change X but start.sh doesn't support it" problems.
+
+4. **No new dependencies or fragile merge logic.** Uses `toml` deserialization that's already in the binary.
+
+5. **Single source of truth.** One TOML file replaces the split between env vars and `config.yaml`. Easier to reason about, debug, and version-control.
+
+### Proposed Implementation Plan
+
+#### Phase 1: TOML Config Delivery (Unblocks TDX Migration)
+
+**start.sh changes:**
+- If `MPC_CONFIG_TOML_BASE64` env var is set, decode it and write to `$MPC_HOME_DIR/mpc-config.toml`
+- If `mpc-config.toml` exists, skip the legacy config generation and run `mpc-node start-with-config-file` instead
+- Keep the Near node initialization (`initialize_near_node`, `update_near_node_config`) since the TOML path doesn't handle that
+
+**Launcher changes:**
+- Raise the per-value size limit for `MPC_CONFIG_TOML_BASE64` (currently 1024 bytes, our own guardrail in `launcher.py`)
+
+**Operator workflow:**
+- Create TOML config from template, including `foreign_chains`
+- Base64-encode and add as `MPC_CONFIG_TOML_BASE64` in `user-config.conf`
+- Deploy/update CVM
+
+**What this unblocks:** TDX nodes with `foreign_chains` config, testnet migration, arbitrary config customization.
+
+#### Phase 2: Runtime Config Hot-Reload
+
+- Add file watcher on `$MPC_HOME_DIR/mpc-config.toml` in the MPC node
+- `watch::channel`-based propagation of `ForeignChainsConfig` changes to coordinator and providers
+- Coordinator re-votes `foreign_chain_policy` when config changes (with rate limiting)
+
+#### Phase 3: Ergonomics and Tooling
+
+- Create a CLI tool or script to generate TOML configs from a simpler input format
+- Support partial config updates (tool merges changes into existing TOML)
+- Update operator guides (`running-an-mpc-node-in-tdx-external-guide.md`, `deploy-launcher-guide.md`)
+- Consider moving Near node init into the TOML path to fully eliminate `start.sh`
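
The partial-update item in Phase 3 amounts to a recursive merge of an override table into the existing config. A minimal sketch, assuming the TOML has already been parsed into nested dictionaries (the `merge_config` helper is hypothetical, and TOML parsing and serialization are left out):

```python
import copy


def merge_config(base: dict, override: dict) -> dict:
    """Return a new dict with override merged into base.

    Nested tables are merged key by key; any other value, including a list,
    replaces the base value wholesale. Neither input is mutated.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    base = {"node": {"triple": {"concurrency": 2, "timeout_sec": 60}}}
    override = {
        "node": {
            "triple": {"concurrency": 4},
            "foreign_chains": {"bitcoin": {"timeout_sec": 30, "max_retries": 3}},
        }
    }
    print(merge_config(base, override))
```

Replacing lists wholesale rather than concatenating them is a deliberate choice in this sketch: it keeps the result predictable when an operator wants to remove an entry. The merged result would still need to pass the same validation that `StartConfig::from_toml_file()` performs before being deployed.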