Port node launcher to Rust: review plan and follow-ups #2598

@barakeinav1

Review Plan

This is the plan for reviewing and merging PR #2326 (port node launcher from Python to Rust).

The key principle: this is a code port/refactor, not new logic. Splitting into smaller PRs would create overhead and lose focus on that fact.

Steps

  1. Make sure basic flows are working (partially done)
  2. Update scripting and testing infra (partially done)
  3. Do a relatively quick review to locate the main logical changes and areas of interest (done via AI-assisted analysis; see PR Analysis below)
  4. Address any critical issues and merge the code as-is
    • Priority: verify EmitEvent payload format parity (Finding 3 in analysis)
  5. Create follow-up issues to address relevant pieces (listed below)
  6. Address follow-up issues
  7. After all is done, remove old Python launcher

PR Analysis

Summary

PR #2326 replaces the Python-based TEE launcher (tee_launcher/) with a Rust implementation (crates/tee-launcher/). The launcher is responsible for securely pulling, validating, and starting the MPC node Docker container, optionally inside a TEE (Intel TDX via dstack).

62 files changed (+2514 / -2134)

Key Architectural Changes

  1. Config format: .env key=value pairs -> TOML with [launcher_config] and [mpc_node_config] sections
  2. MPC config delivery: docker run --env MPC_* flags -> TOML file at /mnt/shared/mpc-config.toml via start-with-config-file
  3. Container launch: docker run with CLI flags -> docker compose up -d with rendered templates
  4. dstack interaction: curl to unix socket (GetQuote + EmitEvent) -> dstack-sdk Rust crate (emit_event())
  5. TEE config to node: MPC_IMAGE_HASH + MPC_LATEST_ALLOWED_HASH_FILE env vars -> Injected [tee] TOML section with authority type, quote_upload_url, image_hash, file path
  6. Python launcher deleted: All files under tee_launcher/ removed
  7. Deployment configs migrated: .conf files replaced with .toml, files moved to deployment/cvm-deployment/
  8. CI updated: New jobs for launcher non-TEE check and reproducible Docker image build/verification

Functionality Comparison: Equivalent (no gaps)

| Area | Details |
|---|---|
| Image hash selection logic | Both: override > approved file (newest first) > default |
| Multi-platform manifest resolution | Both use deque/VecDeque, filter amd64/linux, queue child digests |
| Image validation flow | Both: registry token -> manifest fetch -> docker pull by digest -> docker inspect -> compare config digest |
| Retry/backoff | Both: exponential, factor 1.5, cap 60s, configurable max attempts + timeout + interval |
| Container security | Both: no-new-privileges:true, /tapp:ro, same named volumes |
| Approved hashes JSON format | Both use {"approved_hashes": [...]} key, NonEmpty constraint, newest-first ordering |
| RTMR3 extension | Both emit event "mpc-image-digest" with image hash. Skipped in non-TEE |
| Container removal | Both: docker rm -f mpc-node before launch, errors ignored |
| DOCKER_CONTENT_TRUST | Both require =1 |
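The selection precedence in the first row (override > approved file, newest first > default) can be sketched as below. This is a minimal illustration, not the launcher's actual select_image_hash(): the signature is hypothetical, hashes are plain strings, and it assumes the approved list is already ordered newest-first and that the override is taken as-is.

```rust
/// Hypothetical sketch of the precedence: override > approved list > default.
/// `approved` is assumed to be ordered newest-first, as in the approved-hashes file.
pub fn select_image_hash<'a>(
    override_hash: Option<&'a str>,
    approved: Option<&'a [String]>,
    default_hash: &'a str,
) -> &'a str {
    // 1. An explicit override wins (assumption: no further checks applied here).
    if let Some(hash) = override_hash {
        return hash;
    }
    // 2. Otherwise take the newest entry of the approved list, if one exists.
    if let Some(newest) = approved.and_then(|list| list.first()) {
        return newest.as_str();
    }
    // 3. Fall back to the default baked into the launcher config.
    default_hash
}
```

A missing approved file is modelled as `None`; an empty list also falls through to the default, even though the real format carries a NonEmpty constraint.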

Intentional Differences (by design)

| Area | Python | Rust | Risk |
|---|---|---|---|
| Config format | .env key=value | TOML with sections | None -- better structure |
| MPC config delivery | --env MPC_* flags on docker run | TOML file at /mnt/shared/mpc-config.toml | None -- eliminates env var attack surface |
| Container launch | docker run with CLI flags | docker compose up -d with template | None -- equivalent |
| dstack client | curl to unix socket (GetQuote + EmitEvent) | dstack-sdk Rust crate (emit_event only) | See findings |
| TEE config to node | MPC_IMAGE_HASH + MPC_LATEST_ALLOWED_HASH_FILE as env vars | Injects [tee] section into TOML | Improvement -- richer metadata |

Findings: Potential Gaps and Behavioral Differences

Finding 1: GetQuote call removed

  • Python calls GetQuote before EmitEvent; Rust only calls emit_event() via dstack SDK
  • Risk: Low. GetQuote was likely a sanity check. emit_event() will fail if socket is unreachable.

Finding 2: dstack socket existence check removed

  • Python explicitly checks is_unix_socket() before TEE operations; Rust has no pre-check
  • Risk: Low -- error still surfaces, just with less descriptive message.

Finding 3: EmitEvent payload format difference ⚠️

  • Python sends bare hex string (without sha256: prefix); Rust sends image_hash.as_ref().to_vec() as bytes
  • Risk: Medium. RTMR3 measurements may differ between launchers for the same image. Could break attestation verification.
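To make the concern concrete, the sketch below contrasts the two candidate encodings of the same digest: the bare hex text as UTF-8 bytes versus the decoded digest bytes. It does not claim which one image_hash.as_ref() actually yields in the Rust launcher; that is exactly what needs to be verified. All function names here are hypothetical.

```rust
/// Strip an optional "sha256:" prefix, i.e. what this issue calls the "bare" digest.
pub fn bare_digest(digest: &str) -> &str {
    digest.strip_prefix("sha256:").unwrap_or(digest)
}

/// Candidate A: the hex text itself as UTF-8 bytes (64 bytes for SHA-256).
pub fn payload_as_hex_text(digest: &str) -> Vec<u8> {
    bare_digest(digest).as_bytes().to_vec()
}

/// Candidate B: the decoded digest (32 bytes for SHA-256).
/// Returns None if the input is not an even-length hex string.
pub fn payload_as_raw_bytes(digest: &str) -> Option<Vec<u8>> {
    let hex = bare_digest(digest);
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}
```

If the two launchers use different candidates, the event payloads differ in both length and content, so the resulting RTMR3 values would not match for the same image.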

Finding 4: Env var passthrough completely removed ⚠️

  • Python passes RUST_LOG, RUST_BACKTRACE, NEAR_BOOT_NODES, and MPC_* as --env flags
  • Rust passes zero env vars (except DSTACK_ENDPOINT in TEE via compose template)
  • Risk: High if not handled. Must confirm node reads these from TOML config or they're set another way.

Finding 5: MPC_IMAGE_HASH and MPC_LATEST_ALLOWED_HASH_FILE env vars removed

  • Python sets these as container env vars; Rust injects via [tee] TOML section
  • Risk: Low if node reads from TOML. Verify no code still reads from env.

Finding 6: No env var sanitization -- different threat model

  • Python had 14+ validation steps (control chars, LD_PRELOAD, length limits, denied keys)
  • Rust only validates reserved [tee] key. Arbitrary TOML passed through.
  • Risk: Low. TOML file on volume has no shell interpretation. MPC node never accepted p2p_private_key or account_sk from config anyway (keys always generated internally).
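The reserved-key check mentioned above can be illustrated roughly as follows. This is a simplified sketch using only the standard library: the node config is modelled as a map from top-level section name to its rendered body, and inject_tee_section and ConfigError are hypothetical names, not the signatures of the real insert_reserved() or intercept_node_config().

```rust
use std::collections::BTreeMap;

/// Top-level key the launcher reserves for itself.
const RESERVED_TEE_KEY: &str = "tee";

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    /// The user-supplied config already defines a launcher-owned section.
    ReservedKey(&'static str),
}

/// Reject user configs that define `[tee]`, then add the launcher's own section.
/// Every other section is passed through untouched.
pub fn inject_tee_section(
    mut node_config: BTreeMap<String, String>,
    tee_section: String,
) -> Result<BTreeMap<String, String>, ConfigError> {
    if node_config.contains_key(RESERVED_TEE_KEY) {
        return Err(ConfigError::ReservedKey(RESERVED_TEE_KEY));
    }
    node_config.insert(RESERVED_TEE_KEY.to_string(), tee_section);
    Ok(node_config)
}
```

Failing on a user-provided `[tee]` section, rather than silently overwriting it, keeps a misconfigured or malicious config from going unnoticed.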

Finding 7: Auth token URL hardcoded to Docker Hub

  • Both Python and Rust have same limitation. No gap.

Test Coverage Comparison

| Category | Python | Rust | Gap? |
|---|---|---|---|
| Config parsing | 5 | 11 | Rust has more |
| Port/host validation | 3 | 4 | Rust has more |
| Env var security | 14 | 0 | N/A -- no env passthrough in Rust |
| LD_PRELOAD prevention | 5 | 1 | N/A -- no shell passthrough |
| Image hash selection | 5 | 6 | Rust has more |
| Platform parsing | 3 | 0 | N/A -- clap validates |
| RTMR3 / dstack | 4 | 0 | Gap |
| Docker command/compose | 2 | 8 | Rust has more |
| Config interception | 0 | 9 | Rust-only, well tested |
| Manifest parsing | 0 | 5 | Rust has more |
| Registry resolution | 1 | 6 | Rust has more |
| Full E2E flow | 2 | 1 | Gap |
| Total | ~61 | 46 | |

Meaningful test gaps in Rust:

  • No dstack/RTMR3 failure tests (Medium)
  • No full E2E flow test (Medium)
  • No retry/backoff behavior test (Low)

Follow-Up Issues to Create

After merging PR #2326, the following issues should be created and addressed:

Attestation

  • Verify EmitEvent payload parity between Python and Rust launchers -- Confirm image_hash.as_ref().to_vec() produces same bytes as Python's get_bare_digest(). If different, RTMR3 measurements won't match, breaking attestation continuity. (Finding 3)
  • Add dstack/RTMR3 failure path tests -- Test that emit_event() errors propagate as DstackEmitEventFailed. Cover: socket missing, emit failure, NONTEE skip. (Test gap)
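One possible shape for these failure-path tests is to put the dstack call behind a small trait and swap in a fake. The sketch below is only an assumption about structure: EventEmitter, Platform, extend_rtmr3, and FakeEmitter are hypothetical names, and the String payload on DstackEmitEventFailed is a guess at the variant's shape. Only the event name "mpc-image-digest" and the variant name come from this issue.

```rust
#[derive(Debug, PartialEq)]
pub enum LauncherError {
    /// Assumed shape; the real variant may carry a different error type.
    DstackEmitEventFailed(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Platform {
    Tee,
    NonTee,
}

/// Hypothetical seam around the dstack SDK so tests can inject failures.
pub trait EventEmitter {
    fn emit_event(&mut self, name: &str, payload: &[u8]) -> Result<(), String>;
}

/// Extend RTMR3 with the image digest; skipped entirely outside a TEE.
pub fn extend_rtmr3<E: EventEmitter>(
    platform: Platform,
    emitter: &mut E,
    image_hash: &[u8],
) -> Result<(), LauncherError> {
    if platform == Platform::NonTee {
        return Ok(());
    }
    emitter
        .emit_event("mpc-image-digest", image_hash)
        .map_err(LauncherError::DstackEmitEventFailed)
}

/// Test double: fails with `fail_with` if set, and counts calls.
pub struct FakeEmitter {
    pub fail_with: Option<String>,
    pub calls: usize,
}

impl EventEmitter for FakeEmitter {
    fn emit_event(&mut self, _name: &str, _payload: &[u8]) -> Result<(), String> {
        self.calls += 1;
        match &self.fail_with {
            Some(message) => Err(message.clone()),
            None => Ok(()),
        }
    }
}
```

With this seam, the three cases listed above (socket missing, emit failure, NONTEE skip) become plain unit tests that need neither a socket nor a TEE.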

Testing

  • Add E2E flow test for Rust launcher -- Mock filesystem + registry + docker commands and run through run() to verify complete orchestration. The Python suite had an equivalent in test_main_nontee_builds_expected_mpc_docker_cmd. (Test gap)
  • Add retry/backoff behavior tests -- Verify exponential backoff sequence and max attempt enforcement in registry manifest resolution. (Test gap, low priority)
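For the backoff test, it may help to separate the delay schedule from the sleeping so the sequence can be asserted directly. The sketch below assumes the parameters stated in the comparison table (factor 1.5, cap 60s); backoff_schedule is a hypothetical helper, and whether the real code waits after the final attempt is not known from this issue.

```rust
use std::time::Duration;

/// Upper bound on a single delay, per the comparison table.
const BACKOFF_CAP: Duration = Duration::from_secs(60);

/// Delays to wait between attempts: exponential with factor 1.5, capped at 60s.
/// Assumption: no wait after the last attempt, so there are `max_attempts - 1` delays.
pub fn backoff_schedule(initial: Duration, max_attempts: u32) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut current = initial.min(BACKOFF_CAP);
    for _ in 1..max_attempts {
        delays.push(current);
        // Multiply by 1.5 using integer milliseconds to keep the sequence exact.
        let next_ms = current.as_millis().saturating_mul(3) / 2;
        current = Duration::from_millis(next_ms.min(BACKOFF_CAP.as_millis()) as u64);
    }
    delays
}
```

A test can then compare the returned vector against the expected sequence and check that its length enforces the attempt limit, without any real sleeping.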

Config & Env Vars

  • Confirm RUST_LOG / RUST_BACKTRACE passthrough works -- These were previously passed as container env vars by Python launcher. Verify MPC node reads them from TOML config file or they're set another way. (Finding 4)
  • Verify no MPC node code reads MPC_IMAGE_HASH from env -- Grep for std::env::var("MPC_IMAGE_HASH") and MPC_LATEST_ALLOWED_HASH_FILE env reads. These are now in [tee] config section. (Finding 5)

Code Structure & Quality

  • Refactor main.rs into smaller logical modules -- The file is ~1,279 lines mixing orchestration, registry interaction, validation, compose rendering, and config injection. Recommended split:
    • registry.rs -- RegistryInfo trait, DockerRegistry, get_manifest_digest() (~450 lines)
    • validation.rs -- validate_image_hash(), docker pull/inspect logic (~150 lines)
    • compose.rs -- render_compose_file(), launch_mpc_container() (~150 lines)
    • selection.rs -- select_image_hash() (~100 lines)
    • config.rs -- intercept_node_config(), insert_reserved() (~100 lines)
    • main.rs -- entry points main() and run() only (~100-150 lines)
  • Replace expect()/unwrap() calls in production code with proper error handling -- 6 instances in non-test code that could panic:
    • main.rs:307 -- Bearer token header parse (panics on malformed registry token)
    • main.rs:475 -- Docker inspect output digest parse (panics on unexpected output)
    • main.rs:173 -- TeeConfig TOML serialization
    • main.rs:144 -- mpc_node_config TOML serialization
    • main.rs:504 -- Port list JSON serialization
    • main.rs:138 -- IMAGE_DIGEST_FILE path parse
  • Fix duplicate error messages in LauncherError -- DockerRunFailed and DockerRunFailedExitStatus (error.rs lines 18 & 24) have identical #[error()] messages, making it impossible to distinguish them in logs.
  • Use DockerSha256Digest type for ManifestEntry::digest -- Currently String (docker_types.rs line 33), inconsistent with ManifestConfig::digest which correctly uses the typed wrapper. Could allow malformed digests to propagate.
  • Add validation for LauncherConfig RPC parameters -- rpc_request_timeout_secs, rpc_request_interval_secs, rpc_max_attempts (types.rs lines 62-66) have no bounds checking. Zero values cause silent failures. Consider NonZeroU64/NonZeroU32 or a validate() method.
  • Add manifest traversal depth limit -- get_manifest_digest() (main.rs line 302) processes tags from a deque that multi-platform manifests add to. Theoretically unbounded. Add a MAX_MANIFEST_DEPTH constant.
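The depth limit from the last item could look roughly like this. It is a self-contained model, not the real get_manifest_digest(): the registry is an in-memory map, Manifest, Child, and ResolveError are hypothetical types, and the value of MAX_MANIFEST_DEPTH is an arbitrary placeholder.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical limit; the right value is a design decision for the follow-up.
pub const MAX_MANIFEST_DEPTH: usize = 4;

pub struct Child {
    pub arch: String,
    pub os: String,
    pub digest: String,
}

pub enum Manifest {
    /// Multi-platform index pointing at child manifests.
    Index(Vec<Child>),
    /// Single-platform image manifest.
    Image { config_digest: String },
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    DepthExceeded,
    NotFound,
}

/// Breadth-first walk from `root`, following only amd64/linux children, until an
/// image manifest with the wanted config digest is found.
pub fn resolve_manifest(
    registry: &HashMap<String, Manifest>,
    root: &str,
    wanted_config: &str,
) -> Result<String, ResolveError> {
    let mut queue = VecDeque::from([(root.to_string(), 0usize)]);
    while let Some((reference, depth)) = queue.pop_front() {
        if depth > MAX_MANIFEST_DEPTH {
            return Err(ResolveError::DepthExceeded);
        }
        match registry.get(&reference) {
            Some(Manifest::Image { config_digest }) if config_digest.as_str() == wanted_config => {
                return Ok(reference);
            }
            Some(Manifest::Index(children)) => {
                for child in children.iter().filter(|c| c.arch == "amd64" && c.os == "linux") {
                    queue.push_back((child.digest.clone(), depth + 1));
                }
            }
            // Unknown reference or non-matching image: keep draining the queue.
            _ => {}
        }
    }
    Err(ResolveError::NotFound)
}
```

Tracking depth per queue entry also turns a self-referencing index into a clean error instead of an endless loop.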

Security

  • Consider adding dstack socket pre-check in TEE mode -- Python explicitly verified socket existence before TEE operations for a clear error message. Rust could add a similar check for better UX. (Finding 2)
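A pre-check along these lines needs only the standard library on Unix. The sketch below is an assumption about how it might look; ensure_unix_socket is a hypothetical name, and the socket path to check would come from the launcher's existing dstack endpoint configuration.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::FileTypeExt;
use std::path::Path;

/// Return Ok(()) only if `path` exists and is a Unix domain socket.
/// The error message names the path so TEE misconfiguration is easy to spot.
pub fn ensure_unix_socket(path: &Path) -> io::Result<()> {
    let metadata = fs::metadata(path).map_err(|e| {
        io::Error::new(
            e.kind(),
            format!("dstack socket {} is not accessible: {e}", path.display()),
        )
    })?;
    if metadata.file_type().is_socket() {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("{} exists but is not a unix socket", path.display()),
        ))
    }
}
```

Calling this once before the first emit_event() in TEE mode would restore the clearer error message the Python launcher gave, without changing behavior on the success path.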

Cleanup

  • Remove old Python launcher -- Delete tee_launcher/ directory, remove Python test dependencies, clean up flake.nix PYTHONPATH reference. (Step 7 of plan)
  • Address TODO(#2479) registry auth endpoint (tee-launcher: remove hard-coded Docker Hub registry URL from auth flow) -- Currently hardcoded to Docker Hub (auth.docker.io). If non-Docker-Hub registries are needed, the RegistryInfo trait needs a configurable auth endpoint. (main.rs line 252)
  • Remove or document unused HostEntry struct -- (types.rs line 74) is defined and tested but doesn't appear in any function signatures or runtime code paths.

Created Follow-Up Issues

Attestation

Testing

Config & Env Vars

Code Structure & Quality

Security

Cleanup
