docs: add ADR-005 for overlay refactoring to reduce duplication #439
yuanchen8911 wants to merge 14 commits into NVIDIA:main
Proposes a phased approach to reduce overlay duplication (~350+ redundant lines across 38 files) that grows with each new accelerator and service:
- Phase 1: Reorder inheritance tree (no code changes, ~40% dedup)
- Phase 2 (Reorder + Mixins): Add OS/platform mixins (~75% dedup, ~80 lines)
- Phase 3 (Reorder + Deep Mixins): Add validation mixins after merge upgrade (~90% dedup)

Flat Mixins and Auto-Compose options documented as alternatives if the team accepts abandoning the inheritance model.

Refs: NVIDIA#305
Signed-off-by: $(git config user.name) <${SIGN_EMAIL}>
Good ADR. 3 areas that need clarification before this is ready to merge:
- P0: Constraint evaluator change: Moving from per-overlay to post-merge evaluation is a significant behavioral change that affects ExcludedOverlays/ConstraintWarnings semantics. This ADR should address what happens when mixin-contributed constraints fail against a snapshot: does the whole recipe fail, or is there a graceful degradation path?
- Mixin-vs-inherited conflict policy: CI lint covers mixin-vs-mixin conflicts but not mixin-vs-inherited conflicts. A mixin could silently weaken or override a constraint from the inheritance chain. We need an explicit policy for that.
- Decision section: The analysis clearly supports Reorder + Mixins. Either commit to it as "Proposed" or explicitly frame the TBD. An ADR is meant to force a team vote between specific options.
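To make the second point concrete, here is a minimal sketch of a lint that flags a mixin constraint whose name already appears in the leaf's inheritance chain, in addition to mixin-vs-mixin duplicates. It is a Python model, not the project's implementation: the `Overlay` class, the function names, and the `weak-k8s` mixin are hypothetical, and constraints are reduced to plain name/value pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Overlay:
    """Hypothetical, simplified overlay: name, optional parent, constraints, mixin refs."""
    name: str
    constraints: Dict[str, str] = field(default_factory=dict)
    base: Optional[str] = None
    mixins: List[str] = field(default_factory=list)


def inherited_constraints(leaf, overlays):
    """Map each constraint name in the chain (leaf included) to the overlay that declares it."""
    owner = {}
    node = leaf
    while node is not None:
        for key in node.constraints:
            owner.setdefault(key, node.name)
        node = overlays.get(node.base) if node.base else None
    return owner


def lint_mixin_conflicts(leaf, overlays, mixins):
    """Report every mixin constraint that duplicates an inherited or earlier-mixin constraint."""
    errors = []
    owner = inherited_constraints(leaf, overlays)
    for mixin_name in leaf.mixins:
        for key in mixins[mixin_name]:
            if key in owner:
                errors.append(
                    f"{leaf.name}: constraint {key!r} from mixin {mixin_name!r} "
                    f"is already declared by {owner[key]!r}"
                )
            else:
                owner[key] = mixin_name
    return errors
```

Treating any duplicate name as an error is the strictest possible policy; it sidesteps the question of whether a mixin value is "weaker" than the inherited one, which would need a real constraint parser.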
mchmarny left a comment:
Only the constraint evaluator change is a blocker.
Re: P0 (Constraint evaluator change)

Good catch. The intent is not to introduce partial mixin degradation. In Phase 2, constraint evaluation would run against the fully composed leaf candidate: inheritance chain + mixins + leaf-local content. If any constraint contributed by that composed candidate fails, that candidate is excluded as a unit. Recipe generation still succeeds by falling back to the remaining matching overlays, consistent with today's behavior.

To keep this debuggable, I'll add a "Constraint Failure Semantics" section to the ADR documenting this explicitly.
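The "excluded as a unit" semantics described above can be sketched as follows. This is a hedged Python model under simplifying assumptions: constraints are `(operator, value)` pairs, only `==` and `>=` on dotted numeric versions are supported, and `compose`, `satisfied`, and `evaluate_candidates` are hypothetical names rather than functions from the repository.

```python
def version_tuple(text):
    """Parse a dotted numeric version such as '1.32.4' (suffixes like '-generic' are not handled)."""
    return tuple(int(part) for part in text.split("."))


def satisfied(requirement, actual):
    """Check one (operator, expected) requirement against a snapshot value."""
    op, expected = requirement
    if actual is None:
        return False
    if op == "==":
        return actual == expected
    if op == ">=":
        return version_tuple(actual) >= version_tuple(expected)
    raise ValueError(f"unsupported operator: {op}")


def compose(chain_layers, mixin_layers, leaf_layer):
    """Merge constraints in order: inheritance chain, then mixins, then leaf-local content."""
    merged = {}
    for layer in (*chain_layers, *mixin_layers, leaf_layer):
        merged.update(layer)
    return merged


def evaluate_candidates(candidates, snapshot):
    """Accept a candidate only if every composed constraint passes; otherwise exclude it whole."""
    accepted, excluded = [], {}
    for name, constraints in candidates.items():
        failed = sorted(
            key for key, req in constraints.items()
            if not satisfied(req, snapshot.get(key))
        )
        if failed:
            excluded[name] = failed  # recorded so the exclusion stays debuggable
        else:
            accepted.append(name)
    return accepted, excluded
```

Recording the failed constraint names per excluded candidate is one way to keep the behavior debuggable; the actual reporting format is up to the ADR.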
- Add "Candidate selection" section defining maximal leaf candidate collapsing to prevent silent degradation to generic ancestors when a leaf candidate's constraints fail
- Update code change scope to include candidate collapsing in BuildRecipeResultWithEvaluator()
- Fix Phase 2 exit criteria to match the stated conflict policy: cover mixin-vs-inheritance duplicates, not just mixin-vs-mixin

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
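One possible reading of "maximal leaf candidate collapsing" is sketched below: among all overlays that match a query, keep only those that are not an ancestor of another matching overlay, so that a generic ancestor is never evaluated as a separate fallback for its own descendant. This is an interpretation, not the ADR's definition; `ancestors`, `maximal_leaf_candidates`, and the `parent_of` map are hypothetical.

```python
def ancestors(name, parent_of):
    """Return the chain of ancestors of an overlay, nearest first."""
    chain = []
    node = parent_of.get(name)
    while node is not None:
        chain.append(node)
        node = parent_of.get(node)
    return chain


def maximal_leaf_candidates(matched, parent_of):
    """Drop every matched overlay that is an ancestor of another matched overlay."""
    covered = set()
    for name in matched:
        covered.update(ancestors(name, parent_of))
    return [name for name in matched if name not in covered]
```

With this collapsing, when the constraints of `h100-eks-ubuntu-training` fail, its ancestor `h100-eks-training` is not waiting in the candidate list to be picked up silently.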
…R-005 Expand Phase 2 risk entry to include mixin-vs-inheritance duplicate name conflicts, matching the conflict policy and exit criteria. Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
The growth comparison incorrectly stated "5 files vs 6 currently." Current H100-EKS has 7 files (2 intent + 2 ubuntu + 3 platform). Reorder adds 1 shared intermediate but doesn't reduce file count — the benefit is ~40% less duplicated content per file. Updated both the current-state count and the Reorder comparison to be accurate. Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Phase 1 is a pure data/layout refactor with no code changes: we insert an additional Phase 2 introduces In the proposal, only leaf overlays in a chain can reference mixins via
cc @mchmarny
1. `b200-eks.yaml`: GPU operator config (1 file)
2. `b200-eks-training.yaml`: intent + validation (1 file)
3. `b200-eks-inference.yaml`: intent + validation (1 file)
4. `b200-eks-ubuntu-training.yaml`: leaf overlay with `mixins: [os-ubuntu]` (1 file)
If the only thing mixins are doing at the leaf is adding constraints, is there even a point in putting these down as files? The recipe engine could add any mixins from the mixin directory that match the criteria. A mixin leaf would only make sense if it is overriding or adding components. If done auto-mixin style, that would likely even mean that the k8s components of training/inference become a mixin that just gets added by the recipe engine. If the recipe engine also treated components as templatable, skyhook customization could fit there as well, since the service, intent, and accelerator could be injected at recipe render time.
Interesting direction. Auto-applying mixins based on criteria would remove some boilerplate, but it also makes composition implicit: part of a leaf overlay's behavior would come from whatever mixins happen to match in the directory, not just what the leaf declares. That makes review, debugging, and provenance harder.
For Phase 2, explicit spec.mixins keeps composition local and visible in the overlay itself. If that turns out to be too verbose in practice, criteria-driven mixin selection could be a future direction, but I'd treat that as a separate step toward a more implicit composition model rather than fold it into this phase.
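The difference between the two composition styles can be illustrated with a small sketch. It is a Python model with hypothetical names (`explicit_mixins`, `auto_mixins`, the `registry` dict, and the `platform-eks` mixin), and the auto variant assumes mixins would carry their own `criteria`, which the explicit proposal does not require.

```python
def explicit_mixins(leaf, registry):
    """Resolve exactly the mixins the leaf declares in spec.mixins; unknown names are an error."""
    names = list(leaf["spec"]["mixins"])
    for name in names:
        if name not in registry:
            raise KeyError(f"unknown mixin: {name}")
    return names


def auto_mixins(leaf, registry):
    """Select every mixin whose criteria are all satisfied by the leaf's criteria."""
    criteria = leaf["criteria"]
    return sorted(
        name
        for name, mixin in registry.items()
        if all(criteria.get(key) == value for key, value in mixin["criteria"].items())
    )
```

In the explicit style, adding a new file to the mixin directory changes nothing until a leaf references it. In the criteria-driven style, the same file addition can change the composition of every leaf it matches, which is the review and provenance concern raised above.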
Another consideration: Remove inheritance chains completely and make it all leaves. This introduces a large amount of duplication, but it is entirely copy/paste and is actually a reduction in total files. For example: With a folder structure of
│ │ │ │ └── h100-eks-ubuntu-training-kubeflow
│ │ │ └── (future: h100-eks-ubuntu-training-dynamo, etc.)
│ │ └── gb200-eks-training
│ │ └── gb200-eks-ubuntu-training
The Ubuntu leaf overlays (e.g., h100-eks-ubuntu-training) add no components, no helm values, and no OS-specific validation over their parent. The recipe output is identical regardless of OS.
The one meaningful constraint — kernel >= 6.8 — is a GPU driver requirement, not an OS requirement. It belongs on the accelerator layer alongside other H100-specific config.
Removing the OS layer eliminates 12 files and the main justification for mixins. Before building an abstraction to deduplicate this layer, should we ask whether it needs to exist?
Good observation. Looking at the current leaf Ubuntu overlays, I think this does weaken the case for keeping -ubuntu-* as an inheritance layer.
Across the six non-platform Ubuntu overlays, the only additive content beyond the parent is the same three OS constraints: OS.release.ID=ubuntu, OS.release.VERSION_ID=24.04, and OS.sysctl./proc/sys/kernel/osrelease >= 6.8. They do not add components. The training variants also currently carry validation blocks, but those checks are not OS-specific; they sit there today because that is where the current tree bottoms out before platform-specific overlays.
So I think the implication for the ADR is that Phase 1 should move the validation blocks up to the intent layer, and likely move kernel >= 6.8 up with the accelerator or driver requirements as well. That would leave the Ubuntu dimension carrying only criteria.os=ubuntu plus the two Ubuntu release constraints. At that point, keeping -ubuntu-* as an intermediate inheritance node becomes hard to justify.
That does not eliminate the mixin case entirely, though. The platform overlays still duplicate real component definitions across accelerator+service combinations, which remains an independent justification for mixins. If Phase 2 still goes forward, os-ubuntu becomes a very thin additive mixin, while criteria.os=ubuntu remains on the leaf overlay.
@ayuskauskas This is a valid alternative, but it is different from the ADR's Flat Mixins option. What you're describing is a fully flat leaf-only model: no inheritance, no mixins, and straightforward copy/paste duplication. The tradeoff is clear: the structure becomes easier to navigate and reason about locally, but broad changes like version bumps or shared constraint updates become N-file edits unless we introduce some separate templating or defaults mechanism. The folder layout idea (
Rename 005-overlay-refactoring-adr.md to 005-overlay-refactoring.md for consistency with other ADRs in docs/design/. Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Consider New Phase 1.5: Push validation up + delete redundant constraints

No code changes. Only overlay YAML restructuring.

The problem in one picture

Here's the inheritance chain for h100-eks-ubuntu-training today:

The leaf file (h100-eks-ubuntu-training.yaml) is 73 lines. Here's what's in it:

Two problems:

This same pattern repeats across ~16 leaf overlays.

The fix (two moves, zero code changes)

Move 1: Push validation up to the intent overlay. The validation block describes what to check for a given {accelerator, service, intent} combination. Move it from the -ubuntu- leaf to its parent. This works because Merge() already handles validation inheritance: if the parent has a validation block, children inherit it. Children only need to re-declare validation if they want to override a phase.

Move 2: Delete redundant constraint re-declarations. If h100-eks-training already says K8s >= 1.32.4, every child inherits it automatically. Delete the duplicate line from every leaf that just re-states what the parent already says.

After both moves: leaf becomes trivial

h100-eks-ubuntu-training.yaml, AFTER (was 73 lines, now ~25 with license header)

That's it. Only the things that are genuinely unique to "this is Ubuntu" remain.

Why this is safe

What's left after Phase 1 + 1.5

Leaf overlays are now ~10-15 lines of actual content. The remaining duplication:

The Ubuntu constraints are only 3 lines per file, arguably not worth any machinery to deduplicate. The kubeflow block could be moved to {accel}-{service}-training-kubeflow parent overlays. The dynamo block is the only one where a sharing

Summary

Phase 1.5 captures most of the mixin benefit with zero code changes and zero new concepts.
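Move 2 can be checked mechanically. The sketch below is a Python model under stated assumptions: overlays are plain dicts with `base` and `constraints` keys, constraints are name/value strings, and `effective_constraints` and `redundant_constraints` are hypothetical helpers, not part of the repository. It lists the leaf constraints that merely restate what the parent chain already resolves to.

```python
def effective_constraints(name, overlays):
    """Resolve constraints for an overlay by merging its chain from root to the overlay itself."""
    chain = []
    node = name
    while node is not None:
        chain.append(node)
        node = overlays[node].get("base")
    merged = {}
    for overlay_name in reversed(chain):  # root first, so children override parents
        merged.update(overlays[overlay_name].get("constraints", {}))
    return merged


def redundant_constraints(name, overlays):
    """Return constraint names the overlay re-declares with the same value it would inherit."""
    base = overlays[name].get("base")
    if base is None:
        return []
    inherited = effective_constraints(base, overlays)
    own = overlays[name].get("constraints", {})
    return sorted(key for key, value in own.items() if inherited.get(key) == value)
```

Deleting exactly the names this returns leaves the overlay's effective constraints unchanged, which is the property that makes the cleanup safe to do without code changes.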
@lockwobr Good point. I agree this identifies a meaningful no-code cleanup step: push validation up to the

I don't think it fully removes the case for mixins, because the platform overlays still duplicate substantial component blocks (dynamo in particular at ~25 lines × 4 files). But it does strengthen the case that the Ubuntu layer itself should become much thinner before introducing any new mechanism.

I'll fold this into the ADR framing, either as part of Phase 1 or as an explicit refinement step between Phase 1 and Phase 2.
- Add implementation note that spec.mixins must be stripped before materializing the recipe result (Phase 2 code change section)
- Update Phase 1 exit criteria to specify golden-file verification via aicr query --format json across all leaf overlay combinations

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Replace "recipe generation commands produce identical output" with "recipe resolutions produce identical hydrated output" to match the actual verification mechanism (aicr query hydration, not all CLI paths). Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
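The golden-file check named in these commits amounts to comparing hydrated output before and after the refactor. A minimal sketch follows, assuming the JSON text for each leaf overlay combination has already been captured (for example from `aicr query --format json`; how it is captured is outside this sketch). `canonical` and `diff_golden` are hypothetical helper names.

```python
import json


def canonical(doc_text):
    """Normalize JSON text so that key order and whitespace do not cause false diffs."""
    return json.dumps(json.loads(doc_text), sort_keys=True, indent=2)


def diff_golden(before, after):
    """Return the combination names whose hydrated output is missing or differs."""
    changed = []
    for name in sorted(set(before) | set(after)):
        if name not in before or name not in after:
            changed.append(name)
        elif canonical(before[name]) != canonical(after[name]):
            changed.append(name)
    return changed
```

Note that list order is still treated as significant here; whether component or constraint ordering should count as a difference is a decision for the exit criteria.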
PR #439 Review Summary

Mark (mchmarny): 8 inline comments
- P0: Constraint evaluator change: Moving to post-merge evaluation changes
- Mixin-vs-inherited conflict policy: Mixins could silently weaken inherited constraints.
- Decision section TBD: Analysis clearly supports Reorder + Mixins, commit to it.
- Mixin field stripping: Use
- Nits: ADR-004→005 in PR description; rename file to drop

Jason (xdu31): 1 inline comment
- Does the OS layer need to exist? Ubuntu overlays add no components, no validation. Kernel >= 6.8 is a GPU driver requirement, not OS. Removing the OS layer eliminates 12 files and the main mixin justification.

Brian (lockwobr): 1 conversation comment
- Phase 1.5: Push validation up + delete redundant constraints. No code changes, captures ~70% dedup by moving validation to

Alex (ayuskauskas): 2 comments
- Auto-mixin: If mixins only add constraints, the engine could auto-apply them by criteria match.
- Flat leaf-only structure: Remove inheritance entirely, one file per combo.
Recent Updates

Replies posted:

ADR content updates pushed:
…ADR-005

Based on review feedback from Jason (OS layer justification) and Brian (Phase 1.5 proposal):
- Phase 1 now includes validation lift-up to intent layer, kernel constraint move to accelerator layer, and redundant constraint cleanup
- Updated dedup claims from ~40% to ~40%+ / "materially more than ~40%"
- Removed validation from "What's NOT eliminated" since Phase 1 now addresses it
- Updated os-ubuntu mixin description to Ubuntu release constraints only (kernel moves to accelerator layer in Phase 1)
- Updated all os-ubuntu references across Flat Mixins and Auto-Compose sections for consistency
- Added Phase 1 exit criteria for validation placement and constraint inheritance

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
- Update Reorder example to show validation on h100-eks-training, matching the rest of the ADR after Phase 1 validation lift-up
- Align os-ubuntu terminology to "release constraints" in Phase 2 mixin list

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Summary of Review Feedback and ADR Updates
Align Phase 2 exit criteria with Phase 1 by specifying aicr query --format json as the verification mechanism for hydrated output diffs. Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Mixins **cannot** contain `criteria`, `base`, `mixins`, or `validation`.

**Why not validation in mixins (Reorder + Mixins):** Current validation merge in `RecipeMetadataSpec.Merge()` is phase-replacement, not deep merge. If two
Current validation merge in RecipeMetadataSpec.Merge() is phase-replacement, not deep merge.
We can track and fix it separately.
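The distinction between phase-replacement and deep merge can be shown with a small model. This is a Python sketch, not the Go code in `RecipeMetadataSpec.Merge()`: validation is reduced to a dict of phase name to a list of check names, and the phase and check names used with it are hypothetical.

```python
import copy


def merge_phase_replacement(parent, child):
    """A phase declared by the child replaces the parent's phase wholesale."""
    merged = copy.deepcopy(parent)
    for phase, checks in child.items():
        merged[phase] = list(checks)
    return merged


def merge_deep(parent, child):
    """A phase declared by the child is combined with the parent's, keeping order, no duplicates."""
    merged = {phase: list(checks) for phase, checks in parent.items()}
    for phase, checks in child.items():
        existing = merged.setdefault(phase, [])
        for check in checks:
            if check not in existing:
                existing.append(check)
    return merged
```

Under replacement, two contributors to the same phase cannot both survive: whichever is merged last wins. That is why validation in mixins would need the deep-merge behavior first.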
Here's the plan for Phase 1. I hope it addresses the comments. If we agree, I'll start an implementation. Phase 1 (no code changes):
Summary
Add ADR-005 proposing a phased refactoring of the overlay system to reduce duplication that grows with each new accelerator and service.
Motivation / Context
The overlay system has 38 files with ~350+ redundant lines. Adding new accelerators (B200, GB300) and services (OKE) would grow it to 96-120 files under the current structure. See issue #305 for detailed analysis.
Fixes: N/A
Related: #305
Type of Change
Component(s) Affected
- (`cmd/aicr`, `pkg/cli`)
- (`cmd/aicrd`, `pkg/api`, `pkg/server`)
- (`pkg/recipe`)
- (`pkg/bundler`, `pkg/component/*`)
- (`pkg/collector`, `pkg/snapshotter`)
- (`pkg/validator`)
- (`pkg/errors`, `pkg/k8s`)
- (`docs/`, `examples/`)

Implementation Notes
The ADR evaluates five options (A, B-lite, B-full, C, D) with tradeoff analysis and recommends a phased approach:
- `{accelerator}-{service}` intermediates to share GPU config across training/inference. No code changes. ~40% dedup.
- `RecipeMixin` kind for OS and platform concerns. ~80 lines of code. ~75% dedup.

Each phase has exit criteria, risk table, and rollback strategy.
Testing
N/A — documentation only.
Risk Assessment
Rollout notes: N/A — ADR document only, no code changes.
Checklist
- (`make test` with `-race`)
- (`make lint`)
- (`git commit -S`), GPG signing info