fix(resolve-compose): skip bad manifests instead of exiting#367
fix(resolve-compose): skip bad manifests instead of exiting#367bugman-007 wants to merge 2 commits intoLight-Heart-Labs:mainfrom
Conversation
6610cd7 to
aa92e97
Compare
aa92e97 to
b4d996f
Compare
Lightheartdevs
left a comment
There was a problem hiding this comment.
Review: fix(resolve-compose): skip bad manifests instead of exiting
CLAUDE.md Violations
1. Broad catch violates Error Handling Rule #1 (resolve-compose-stack.sh, line 154)
The except Exception as e: ... continue pattern is exactly what CLAUDE.md prohibits:
"No broad or silent catches. Never
except Exception: passorexcept Exception: return None."
The PR changes sys.exit(1) to continue, converting a crash into a swallowed error — making the broad catch worse, not better. The existing except Exception was already a violation, but at least it crashed (consistent with Let It Crash). Now it silently skips.
Fix: If the goal is resilience for YAML parse errors specifically, narrow the catch to the actual expected failure modes:
except (yaml.YAMLError, json.JSONDecodeError, KeyError, TypeError) as e:This satisfies CLAUDE.md Rule #2: "Narrow exceptions at I/O boundaries are fine... catch specific exception types when each maps to a distinct, meaningful status."
2. Violates "Let It Crash" (the #1 design principle)
CLAUDE.md priority order: Let It Crash > KISS > Pure Functions > SOLID.
A malformed manifest is a data integrity problem. The current crash behavior forces the operator to fix the manifest before the stack starts — which is the correct signal. Silently dropping a broken extension means the user may not notice a service is missing until something downstream fails in a harder-to-debug way.
If this is truly needed for operational resilience, it should be an opt-in flag (e.g., --skip-broken) rather than the default behavior, preserving fail-fast for normal operation.
Conflict with Open PRs
PR #321 (manifest enforcement) is tightening the manifest schema — adding additionalProperties: false, more required fields, stricter validation. That PR's philosophy is "reject bad manifests early." This PR's philosophy is "tolerate bad manifests at runtime." They directly conflict. These should be coordinated: if #321 lands first and enforces strict schemas, the need for this PR diminishes significantly.
PR #327 (extension catalog) touches the extension/audit surface area. While not a direct merge conflict, both PRs change how extensions are discovered and validated, so they should be aware of each other.
Test Quality
The test file (tests/test-resolve-compose-resilient.sh) only greps for string patterns in the source file — it never actually runs resolve-compose-stack.sh with a broken manifest to verify behavior. This means it tests the presence of code, not its correctness. A functional test that creates a temp directory with a malformed manifest.yaml and verifies the script exits 0 with the bad extension excluded from output would be far more valuable.
Summary
- Narrow the
except Exceptionto specific exception types (YAML/JSON parse errors). - Preserve fail-fast as default; consider
--skip-brokenflag for resilience mode. - Coordinate with PR #321 — strict schema enforcement upstream reduces the need for runtime tolerance.
- Replace grep-based tests with functional tests that exercise real behavior.
|
What's needed to get this merged:
The grep-based tests need to become behavioral — run the script with an actual broken manifest fixture. |
|
I addressed review feedback for resolve-compose resilient parsing. |
Summary
Fixes critical reliability issue where a single malformed extension manifest causes entire compose stack resolution to fail, preventing all services from starting.
Problem
Current behavior in
scripts/resolve-compose-stack.shline 158:Impact:
Solution
Replace
sys.exit(1)withcontinueto skip the bad manifest and process remaining extensions:Behavior Change
Before:
$ docker compose up ERROR: Failed to parse manifest for broken-ext: invalid YAML Manifest path: extensions/services/broken-ext/manifest.yaml This service will be skipped. Fix the manifest or disable the service. [exits with code 1 - no services start]After:
$ docker compose up ERROR: Failed to parse manifest for broken-ext: invalid YAML Manifest path: extensions/services/broken-ext/manifest.yaml This service will be skipped. Fix the manifest or disable the service. [continues processing - other services start normally]Testing
Impact
Related Work
Part of extension operability improvements. Complements PRs #355, #357, #360 by improving resilience when extensions have configuration issues.