docs: add ADR-005 for overlay refactoring to reduce duplication #439

Open
yuanchen8911 wants to merge 14 commits into NVIDIA:main from yuanchen8911:docs/adr-004-overlay-refactoring

Conversation

@yuanchen8911 (Contributor) commented Mar 19, 2026

Summary

Add ADR-005 proposing a phased refactoring of the overlay system to reduce duplication that grows with each new accelerator and service.

Motivation / Context

The overlay system has 38 files with ~350+ redundant lines. Adding new accelerators (B200, GB300) and services (OKE) would grow it to 96-120 files under the current structure. See issue #305 for detailed analysis.

Fixes: N/A
Related: #305

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

The ADR evaluates five options (A, B-lite, B-full, C, D) with tradeoff analysis and recommends a phased approach:

  • Phase 1 (Reorder): Reorder inheritance tree — insert {accelerator}-{service} intermediates to share GPU config across training/inference. No code changes. ~40% dedup.
  • Phase 2 (Mixins): Add RecipeMixin kind for OS and platform concerns. ~80 lines of code. ~75% dedup.
  • Phase 3 (Deep Mixins): Add validation mixins after upgrading merge semantics to deep-merge. ~90% dedup.

Each phase has exit criteria, risk table, and rollback strategy.

Testing

N/A — documentation only.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — ADR document only, no code changes.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 requested a review from a team as a code owner March 19, 2026 18:23
@yuanchen8911 added the documentation and area/recipes labels Mar 19, 2026
@yuanchen8911 force-pushed the docs/adr-004-overlay-refactoring branch 2 times, most recently from 52ad12b to 6cb02dc March 19, 2026 18:35
@yuanchen8911 force-pushed the docs/adr-004-overlay-refactoring branch from 6cb02dc to 0f27c00 March 19, 2026 18:38
@yuanchen8911 force-pushed the docs/adr-004-overlay-refactoring branch from 0f27c00 to e2ef22f March 19, 2026 21:14
@yuanchen8911 changed the title from "docs: add ADR-004 for overlay refactoring to reduce duplication" to "docs: add ADR-005 for overlay refactoring to reduce duplication" Mar 19, 2026

Proposes a phased approach to reduce overlay duplication (~350+ redundant
lines across 38 files) that grows with each new accelerator and service:

- Phase 1: Reorder inheritance tree (no code changes, ~40% dedup)
- Phase 2 (Reorder + Mixins): Add OS/platform mixins (~75% dedup, ~80 lines)
- Phase 3 (Reorder + Deep Mixins): Add validation mixins after merge upgrade (~90% dedup)

Flat Mixins and Auto-Compose options documented as alternatives if the
team accepts abandoning the inheritance model.

Refs: NVIDIA#305

Signed-off-by: $(git config user.name) <${SIGN_EMAIL}>

@mchmarny (Member) left a comment

Good ADR. 3 areas that need clarification before this is ready to merge:

  1. P0: Constraint evaluator change: Moving from per-overlay to post-merge evaluation is a significant behavioral change that affects ExcludedOverlays/ConstraintWarnings semantics. This ADR should address what happens when mixin-contributed constraints fail against a snapshot — does the whole recipe fail, or is there a graceful degradation path?

  2. Mixin-vs-inherited conflict policy: CI lint covers mixin-vs-mixin conflicts but not mixin-vs-inherited conflicts. A mixin could silently weaken or override a constraint from the inheritance chain. We need explicit policy for that.

  3. Decision section: The analysis clearly supports Reorder + Mixins. Either commit to it as "Proposed" or explicitly frame the TBD. ADR is meant to force a team vote between specific options.

@mchmarny (Member) left a comment

Only the constraint evaluator change is a blocker

@yuanchen8911 (Contributor, Author) commented

Re: P0 — Constraint evaluator change

Good catch. The intent is not to introduce partial mixin degradation. In Phase 2, constraint evaluation would run against the fully composed leaf candidate: inheritance chain + mixins + leaf-local content. If any constraint contributed by that composed candidate fails, that candidate is excluded as a unit. Recipe generation still succeeds by falling back to the remaining matching overlays, consistent with today's ExcludedOverlays behavior; only if every matching overlay is excluded do we end up with base-only output.

To keep this debuggable, ConstraintWarnings will preserve provenance for the failing constraint via a Source field (e.g., "inheritance/h100-eks", "mixin/os-ubuntu", "leaf/h100-eks-ubuntu-training") — not just richer warning text. This makes it clear whether the failure came from the inheritance chain, a mixin, or the leaf overlay itself.

I'll add a "Constraint Failure Semantics" section to the ADR documenting this explicitly.
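
The exclusion rule described in this reply can be sketched as follows. This is a minimal Python illustration, not the project's Go code: the `Constraint`, `Candidate`, and `Result` types, the `evaluate` function, and the overlay name `other-matching-overlay` are all hypothetical, and the equality-only `passes` check stands in for the real constraint evaluator. Only the `Source` strings and the "excluded as a unit" behavior come from the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    name: str
    value: str
    source: str  # provenance, e.g. "mixin/os-ubuntu" (format taken from the reply above)


@dataclass
class Candidate:
    name: str
    constraints: list  # constraints of the fully composed candidate


@dataclass
class Result:
    applied: list = field(default_factory=list)
    excluded_overlays: list = field(default_factory=list)
    constraint_warnings: list = field(default_factory=list)


def evaluate(candidates, passes):
    """Apply each composed candidate as a unit: one failing constraint excludes it."""
    result = Result()
    for cand in candidates:
        failed = [c for c in cand.constraints if not passes(c)]
        if failed:
            result.excluded_overlays.append(cand.name)
            for c in failed:
                result.constraint_warnings.append(
                    {"overlay": cand.name, "constraint": c.name, "source": c.source}
                )
        else:
            result.applied.append(cand.name)
    # An empty `applied` list corresponds to base-only output.
    return result


# Hypothetical snapshot: Ubuntu, but not the 24.04 release the mixin requires.
snapshot = {"OS.release.ID": "ubuntu", "OS.release.VERSION_ID": "22.04"}
leaf = Candidate(
    "h100-eks-ubuntu-training",
    [
        Constraint("OS.release.ID", "ubuntu", "mixin/os-ubuntu"),
        Constraint("OS.release.VERSION_ID", "24.04", "mixin/os-ubuntu"),
    ],
)
other = Candidate("other-matching-overlay", [])  # hypothetical second match

result = evaluate([other, leaf], lambda c: snapshot.get(c.name) == c.value)
print(result.excluded_overlays)    # ['h100-eks-ubuntu-training']
print(result.constraint_warnings)  # one warning, source 'mixin/os-ubuntu'
```

The point of carrying `source` on each constraint is that the warning can say which layer contributed the failing rule even though the candidate is rejected as a whole.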

mchmarny previously approved these changes Apr 2, 2026

@mchmarny (Member) left a comment

/lgtm

@yuanchen8911 requested a review from mchmarny April 2, 2026 19:28

- Add "Candidate selection" section defining maximal leaf candidate
  collapsing to prevent silent degradation to generic ancestors when
  a leaf candidate's constraints fail
- Update code change scope to include candidate collapsing in
  BuildRecipeResultWithEvaluator()
- Fix Phase 2 exit criteria to match the stated conflict policy:
  cover mixin-vs-inheritance duplicates, not just mixin-vs-mixin

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
…R-005

Expand Phase 2 risk entry to include mixin-vs-inheritance duplicate
name conflicts, matching the conflict policy and exit criteria.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
The growth comparison incorrectly stated "5 files vs 6 currently."
Current H100-EKS has 7 files (2 intent + 2 ubuntu + 3 platform).
Reorder adds 1 shared intermediate but doesn't reduce file count —
the benefit is ~40% less duplicated content per file. Updated both
the current-state count and the Reorder comparison to be accurate.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>

nvidiajeff previously approved these changes Apr 2, 2026

@nvidiajeff (Contributor) left a comment

In support of this.

@yuanchen8911 (Contributor, Author) commented

Phase 1 is a pure data/layout refactor with no code changes: we insert an additional {accelerator}-{service} layer into the inheritance tree and reorder the overlay hierarchy so shared GPU/service configuration moves up the chain. That reduces the current training/inference duplication without changing resolver behavior.

Phase 2 introduces RecipeMixin as a constrained composition mechanism for orthogonal concerns like OS and platform. A leaf overlay keeps its single spec.base inheritance chain, but can also include mixins for additive constraints and componentRefs. Constraint evaluation moves to the fully composed leaf candidate and failure excludes that candidate as a unit. Validation stays out of mixins in this phase, and CI enforces a strict conflict policy: no duplicate constraint or component names across mixins or between any mixin and the inheritance chain.

In the proposal, only leaf overlays in a chain can reference mixins via spec.mixins. Intermediate overlays in the inheritance chain cannot.

  • Mixins can't chain — no mixin referencing other mixins
  • Intermediates don't use mixins — they share config through the inheritance hierarchy
  • Leaf overlays compose both — single parent via spec.base + orthogonal fragments via spec.mixins
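
The strict conflict policy above (no duplicate constraint or component names across mixins, or between any mixin and the inheritance chain) could be linted roughly like this. It is a Python sketch under assumptions: `find_name_conflicts` and the mixin name `platform-x` are hypothetical, and the real CI lint is not shown in this thread.

```python
def find_name_conflicts(inherited, mixins):
    """Return (name, first_source, second_source) for every duplicated name.

    inherited: names contributed by the spec.base inheritance chain
    mixins:    mapping of mixin name -> list of names that mixin contributes
    """
    seen = {name: "inheritance" for name in inherited}
    conflicts = []
    for mixin, names in mixins.items():
        for name in names:
            if name in seen:
                conflicts.append((name, seen[name], f"mixin/{mixin}"))
            else:
                seen[name] = f"mixin/{mixin}"
    return conflicts


inherited = ["K8s.server.version"]
mixins = {
    "os-ubuntu": ["OS.release.ID", "OS.release.VERSION_ID"],
    # Hypothetical mixin that collides with both the chain and os-ubuntu.
    "platform-x": ["K8s.server.version", "OS.release.ID"],
}
conflicts = find_name_conflicts(inherited, mixins)
for name, first, second in conflicts:
    print(f"{name}: declared by {first} and {second}")
```

Because any duplicate is an error rather than an override, a mixin can never silently weaken a constraint inherited from the chain, which is the case raised in the review.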

cc @mchmarny

1. `b200-eks.yaml` — GPU operator config (1 file)
2. `b200-eks-training.yaml` — intent + validation (1 file)
3. `b200-eks-inference.yaml` — intent + validation (1 file)
4. `b200-eks-ubuntu-training.yaml` — leaf overlay with `mixins: [os-ubuntu]` (1 file)

Contributor

If the only thing mixins do at the leaf is add constraints, is there even a point in putting these down as files? The recipe engine could add any mixins from the mixin directory that match the criteria. A mixin leaf would only make sense if it overrides or adds components. Done auto-mixin style, that would likely even mean that the k8s components of training/inference become a mixin that just gets added by the recipe engine. If the recipe engine also treated components as templatable, skyhook customization could fit there as well, as the service, intent, and accelerator could be injected at recipe render time.

Contributor Author

Interesting direction. Auto-applying mixins based on criteria would remove some boilerplate, but it also makes composition implicit: part of a leaf overlay's behavior would come from whatever mixins happen to match in the directory, not just what the leaf declares. That makes review, debugging, and provenance harder.

For Phase 2, explicit spec.mixins keeps composition local and visible in the overlay itself. If that turns out to be too verbose in practice, criteria-driven mixin selection could be a future direction, but I'd treat that as a separate step toward a more implicit composition model rather than fold it into this phase.

@ayuskauskas (Contributor) commented

Another consideration: remove inheritance chains completely and make everything a leaf. This introduces a large amount of duplication, but it is entirely copy/paste and is actually a reduction in total files. For example:

 * aws-h100-training-ubuntu.yaml
 * aws-gb200-...
 * gke-h100-training.yaml
 * aks-h100-training-ubuntu.yaml

With a folder structure of {service}/{intent}/{os} it becomes fairly well nested and easy to find where to add files. And you don't have to think about the inheritance chain: whatever is in the file is all there is. Making broad updates like changing versions definitely becomes a problem, though, unless these support some sort of templating-with-defaults behavior.

│ │ │ │ └── h100-eks-ubuntu-training-kubeflow
│ │ │ └── (future: h100-eks-ubuntu-training-dynamo, etc.)
│ │ └── gb200-eks-training
│ │ └── gb200-eks-ubuntu-training

Contributor

The Ubuntu leaf overlays (e.g., h100-eks-ubuntu-training) add no components, no helm values, and no OS-specific validation over their parent. The recipe output is identical regardless of OS.

The one meaningful constraint — kernel >= 6.8 — is a GPU driver requirement, not an OS requirement. It belongs on the accelerator layer alongside other H100-specific config.

Removing the OS layer eliminates 12 files and the main justification for mixins. Before building an abstraction to deduplicate this layer, should we ask whether it needs to exist?

Contributor Author

Good observation. Looking at the current leaf Ubuntu overlays, I think this does weaken the case for keeping -ubuntu-* as an inheritance layer.

Across the six non-platform Ubuntu overlays, the only additive content beyond the parent is the same three OS constraints: OS.release.ID=ubuntu, OS.release.VERSION_ID=24.04, and OS.sysctl./proc/sys/kernel/osrelease >= 6.8. They do not add components. The training variants also currently carry validation blocks, but those checks are not OS-specific; they sit there today because that is where the current tree bottoms out before platform-specific overlays.

So I think the implication for the ADR is that Phase 1 should move the validation blocks up to the intent layer, and likely move kernel >= 6.8 up with the accelerator or driver requirements as well. That would leave the Ubuntu dimension carrying only criteria.os=ubuntu plus the two Ubuntu release constraints. At that point, keeping -ubuntu-* as an intermediate inheritance node becomes hard to justify.

That does not eliminate the mixin case entirely, though. The platform overlays still duplicate real component definitions across accelerator+service combinations, which remains an independent justification for mixins. If Phase 2 still goes forward, os-ubuntu becomes a very thin additive mixin, while criteria.os=ubuntu remains on the leaf overlay.

@yuanchen8911 (Contributor, Author) commented

@ayuskauskas This is a valid alternative, but it is different from the ADR's Flat Mixins option. What you're describing is a fully flat leaf-only model: no inheritance, no mixins, and straightforward copy/paste duplication.

The tradeoff is clear: the structure becomes easier to navigate and reason about locally, but broad changes like version bumps or shared constraint updates become N-file edits unless we introduce some separate templating or defaults mechanism. The folder layout idea ({service}/{intent}/{os}) is useful regardless of the composition model and is worth considering independently.

Rename 005-overlay-refactoring-adr.md to 005-overlay-refactoring.md
for consistency with other ADRs in docs/design/.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>

@lockwobr (Contributor) commented Apr 3, 2026

Consider New Phase 1.5: Push validation up + delete redundant constraints

No code changes. Only overlay YAML restructuring.

The problem in one picture

Here's the inheritance chain for h100-eks-ubuntu-training today:

base                          K8s >= 1.28 (from eks)
└── eks                       K8s >= 1.28, validation: {conformance: [5 checks]}
    └── eks-training          K8s >= 1.30
        └── h100-eks-training K8s >= 1.32.4, gpu-operator overrides, skyhook
            └── h100-eks-ubuntu-training   ← LEAF

The leaf file (h100-eks-ubuntu-training.yaml) is 73 lines. Here's what's in it:

spec:
  base: h100-eks-training
  criteria: { service: eks, accelerator: h100, os: ubuntu, intent: training }

  constraints:
    - name: K8s.server.version          # ← REDUNDANT: parent already says >= 1.32.4
      value: ">= 1.32.4"
    - name: OS.release.ID               # ← UNIQUE to this leaf (ubuntu-specific)
      value: ubuntu
    - name: OS.release.VERSION_ID       # ← UNIQUE to this leaf
      value: "24.04"
    - name: OS.sysctl.../osrelease      # ← UNIQUE to this leaf
      value: ">= 6.8"

  componentRefs: []                     # ← EMPTY, adds nothing

  validation:                           # ← 28 lines of validation config
    deployment:                         #    that could live in the parent
      checks: [operator-health, expected-resources, gpu-operator-version, check-nvidia-smi]
      constraints: [{ name: Deployment.gpu-operator.version, value: ">= v24.6.0" }]
    performance:
      checks: [nccl-all-reduce-bw]
      constraints: [{ name: nccl-all-reduce-bw, value: ">= 300" }]
    conformance:
      checks: [platform-health, gpu-operator-health, dra-support, ...]

Two problems:

  1. K8s >= 1.32.4 is already in the parent (h100-eks-training). Re-declaring it here does nothing — the merge engine already inherits it by constraint name. It's copy-paste noise.
  2. The validation block has nothing Ubuntu-specific. It's about H100 + EKS + training — it checks GPU operator health, NCCL bandwidth, conformance. It belongs in h100-eks-training, not in the ubuntu leaf.

This same pattern repeats across ~16 leaf overlays.

The fix (two moves, zero code changes)

Move 1: Push validation up to the intent overlay. The validation block describes what to check for a given {accelerator, service, intent} combination. Move it from the -ubuntu- leaf to its parent.

Before:
h100-eks-training.yaml        → has GPU config, NO validation
h100-eks-ubuntu-training.yaml → has OS constraints + validation (28 lines)

After:
h100-eks-training.yaml        → has GPU config + validation (moved here)
h100-eks-ubuntu-training.yaml → has OS constraints only

This works because Merge() already handles validation inheritance — if the parent has a validation block, children inherit it. Children only need to re-declare validation if they want to override a phase.

Move 2: Delete redundant constraint re-declarations. If h100-eks-training already says K8s >= 1.32.4, every child inherits it automatically. Delete the duplicate line from every leaf that just re-states what the parent already says.
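
Finding the lines that Move 2 deletes can be automated. A minimal Python sketch, assuming constraints have already been loaded into plain name-to-value dicts; `redundant_constraints` is a hypothetical helper, not part of aicr.

```python
def redundant_constraints(parent_merged, leaf):
    """Names the leaf re-declares with exactly the value it would inherit anyway."""
    return [name for name, value in leaf.items() if parent_merged.get(name) == value]


# Values taken from the h100-eks-ubuntu-training example above.
parent_merged = {"K8s.server.version": ">= 1.32.4"}
leaf = {
    "K8s.server.version": ">= 1.32.4",
    "OS.release.ID": "ubuntu",
    "OS.release.VERSION_ID": "24.04",
}
redundant = redundant_constraints(parent_merged, leaf)
print(redundant)  # ['K8s.server.version']
```

A leaf that declares the same name with a different value is an override, not a duplicate, so it is deliberately not reported.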

After both moves: leaf becomes trivial

h100-eks-ubuntu-training.yaml — AFTER (was 73 lines, now ~25 with license header)

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: h100-eks-ubuntu-training
spec:
  base: h100-eks-training
  criteria:
    service: eks
    accelerator: h100
    os: ubuntu
    intent: training
  constraints:
    - name: OS.release.ID
      value: ubuntu
    - name: OS.release.VERSION_ID
      value: "24.04"
    - name: OS.sysctl./proc/sys/kernel/osrelease
      value: ">= 6.8"

That's it. Only the things that are genuinely unique to "this is Ubuntu" remain.

Why this is safe

  • Constraint inheritance already works this way. Merge() merges constraints by name: the child overrides the parent for the same name, and parent values carry through for names the child doesn't mention. Deleting a redundant re-declaration changes nothing about the merged output.
  • Validation inheritance already works this way. Merge() does phase-level replacement: if the child doesn't declare a validation phase, it inherits the parent's. Moving validation to the parent and removing it from the child produces the same merged result.
  • Verifiable with a golden-file test. Generate recipe output for every leaf overlay before and after. Diff must be empty. If it's not empty, something was moved wrong.
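
The first two bullets can be checked on a toy model. The `merge` function below is a Python stand-in that implements only the semantics stated here (constraints merged by name, validation replaced per phase); it is an assumption-laden sketch, not the actual `Merge()` implementation.

```python
def merge(parent, child):
    """Merge two overlay specs using the semantics described in this comment.

    Constraints merge by name (the child wins on a shared name); validation is
    replaced per phase (a phase the child omits is inherited unchanged).
    """
    return {
        "constraints": {**parent.get("constraints", {}), **child.get("constraints", {})},
        "validation": {**parent.get("validation", {}), **child.get("validation", {})},
    }


validation = {"performance": {"checks": ["nccl-all-reduce-bw"]}}
k8s = {"K8s.server.version": ">= 1.32.4"}
ubuntu = {"OS.release.ID": "ubuntu", "OS.release.VERSION_ID": "24.04"}

# Before: the leaf re-declares the K8s constraint and carries the validation block.
before = merge(
    {"constraints": k8s},
    {"constraints": {**k8s, **ubuntu}, "validation": validation},
)
# After: validation lives on the parent; the leaf keeps only the Ubuntu constraints.
after = merge(
    {"constraints": k8s, "validation": validation},
    {"constraints": ubuntu},
)

print(before == after)  # True
```

Under these two rules the merged result is the same in both layouts, which is what the golden-file diff is meant to confirm against the real engine.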

What's left after Phase 1 + 1.5

Leaf overlays are now ~10-15 lines of actual content. The remaining duplication:

| Still duplicated | Lines | Files | Notes |
|---|---|---|---|
| Ubuntu OS constraints (3 lines) | 3 × 12 = 36 | 12 | Genuinely per-leaf (criteria-specific) |
| Kubeflow component block | ~8 × 4 = 32 | 4 | Could go in a parent overlay |
| Dynamo component block | ~25 × 4 = 100 | 4 | Strongest case for sharing |

The Ubuntu constraints are only 3 lines per file, arguably not worth any machinery to deduplicate. The kubeflow block could be moved to {accel}-{service}-training-kubeflow parent overlays. The dynamo block is the only one where a sharing mechanism (mixins or otherwise) has a real payoff, and even that could be handled with a {service}-inference-dynamo parent overlay instead.

Summary

| Phase | Dedup | Code changes | Risk |
|---|---|---|---|
| 1: Reorder tree | ~40% | None | Low (overlay restructure) |
| 1.5: Push validation up + delete redundant constraints | ~70% | None | Low (verify with golden-file diff) |
| 2: Mixins (if still needed) | ~75-80% | ~80 lines Go | Medium |

Phase 1.5 captures most of the mixin benefit with zero code changes and zero new concepts.

@yuanchen8911 (Contributor, Author) commented

@lockwobr Good point. I agree this identifies a meaningful no-code cleanup step: push validation up to the {accelerator}-{service}-{intent} layer and delete leaf-level re-statements of constraints that are already inherited from the parent. That should be safe as long as the hydrated recipe output stays identical before and after — verifiable with aicr query golden-file diffs.

I don't think it fully removes the case for mixins, because the platform overlays still duplicate substantial component blocks (dynamo in particular at ~25 lines × 4 files). But it does strengthen the case that the Ubuntu layer itself should become much thinner before introducing any new mechanism. I'll fold this into the ADR framing, either as part of Phase 1 or as an explicit refinement step between Phase 1 and Phase 2.
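
The golden-file comparison mentioned above could look like the following Python sketch. It assumes the hydrated output for each leaf overlay has already been captured to one JSON file per leaf, before and after the refactor (for example from `aicr query --format json`, the command named in this thread; the capture step is not shown because its arguments are not given here). The `golden_diff` helper and the recipe shape in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def golden_diff(before_dir, after_dir):
    """Return the file names whose parsed JSON differs or exists on one side only."""
    before = {p.name: json.loads(p.read_text()) for p in Path(before_dir).glob("*.json")}
    after = {p.name: json.loads(p.read_text()) for p in Path(after_dir).glob("*.json")}
    return sorted(
        name
        for name in before.keys() | after.keys()
        if name not in before or name not in after or before[name] != after[name]
    )


# Demo with made-up recipe content; real files would come from the CLI.
with tempfile.TemporaryDirectory() as tmp:
    before_dir, after_dir = Path(tmp, "before"), Path(tmp, "after")
    before_dir.mkdir()
    after_dir.mkdir()
    recipe = {"constraints": [{"name": "OS.release.ID", "value": "ubuntu"}]}

    # Same content, different whitespace: not reported.
    (before_dir / "h100-eks-ubuntu-training.json").write_text(json.dumps(recipe))
    (after_dir / "h100-eks-ubuntu-training.json").write_text(json.dumps(recipe, indent=2))

    # Content changed by the refactor: reported.
    (before_dir / "gb200-eks-ubuntu-training.json").write_text(json.dumps(recipe))
    (after_dir / "gb200-eks-ubuntu-training.json").write_text(json.dumps({"constraints": []}))

    changed = golden_diff(before_dir, after_dir)

print(changed)  # ['gb200-eks-ubuntu-training.json']
```

Comparing parsed JSON rather than raw text keeps formatting and key-order noise out of the diff, so a non-empty result points at a real change in hydrated output.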

- Add implementation note that spec.mixins must be stripped before
  materializing the recipe result (Phase 2 code change section)
- Update Phase 1 exit criteria to specify golden-file verification
  via aicr query --format json across all leaf overlay combinations

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Replace "recipe generation commands produce identical output" with
"recipe resolutions produce identical hydrated output" to match the
actual verification mechanism (aicr query hydration, not all CLI paths).

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 (Contributor, Author) commented

PR #439 Review Summary

Mark (mchmarny) — 8 inline comments

P0: Constraint evaluator change — Moving to post-merge evaluation changes ExcludedOverlays semantics. What happens when mixin constraints fail?
→ Composed leaf candidate is excluded as a unit. Added candidate selection and constraint failure semantics sections to ADR.

Mixin-vs-inherited conflict policy — Mixins could silently weaken inherited constraints.
→ Strict policy: duplicate constraint/component names between mixins and inheritance chain are forbidden. CI lint enforces.

Decision section TBD — Analysis clearly supports Reorder + Mixins, commit to it.
→ Updated to "Proposed: Reorder + Mixins." Validation excluded from Phase 2 mixins as a hard boundary.

Mixin field stripping: spec.mixins could leak into recipe output.
→ Added implementation note: merge a copy with mixins cleared.

Use aicr query for golden-file testing — Hydrated output verification without running bundle.
→ Added to Phase 1 exit criteria.

Nits — ADR-004→005 in PR description; rename file to drop -adr suffix.
→ Both fixed.

Jason (xdu31) — 1 inline comment

Does the OS layer need to exist? — Ubuntu overlays add no components, no validation. Kernel >= 6.8 is a GPU driver requirement, not OS. Removing the OS layer eliminates 12 files and the main mixin justification.
→ Agreed the Ubuntu layer is thinner than framed. Phase 1 should move validation up to intent layer and kernel constraint to accelerator layer. Platform overlays (dynamo, kubeflow) remain independent justification for mixins.

Brian (lockwobr) — 1 conversation comment

Phase 1.5: Push validation up + delete redundant constraints — No code changes, captures ~70% dedup by moving validation to {accel}-{service}-{intent} and removing re-stated constraints from leaves.
→ Agreed. Will fold into ADR as part of Phase 1 or explicit refinement step. Doesn't fully remove mixin case (platform overlays still duplicate component blocks).

Alex (ayuskauskas) — 2 comments

Auto-mixin — If mixins only add constraints, the engine could auto-apply them by criteria match.
→ Explicit spec.mixins preferred for Phase 2 — keeps composition visible and debuggable. Criteria-driven selection could be a future direction.

Flat leaf-only structure — Remove inheritance entirely, one file per combo.
→ Valid alternative but different from ADR's Flat Mixins option. Tradeoff: simpler locally, but version bumps become N-file edits.

@yuanchen8911 (Contributor, Author) commented Apr 3, 2026

Recent Updates

Replies posted:

  1. Reply to @xdu31 — OS layer comment (inline reply)
  2. Reply to @lockwobr — Phase 1.5 comment (conversation comment)
  3. Reply to @mchmarny — mixin field stripping (inline reply)
  4. Reply to @mchmarny: aicr query for golden-file testing (inline reply)
  5. Reply to @ayuskauskas — auto-mixin (inline reply)
  6. Reply to @ayuskauskas — flat leaf-only structure (conversation comment)

ADR content updates pushed:
7. Mixin stripping note — added implementation note that spec.mixins must be cleared before materializing the recipe result
8. Phase 1 exit criteria — added aicr query --format json golden-file verification across all leaf overlay combinations
9. Exit criteria wording — tightened "recipe generation commands" → "recipe resolutions produce identical hydrated output" in both Phase 1 and Phase 2
10. Phase 1 now includes validation lift-up and constraint cleanup (based on feedback from @xdu31 and @lockwobr):
- Move validation blocks up to {accelerator}-{service}-{intent} layer — these checks are not OS-specific
- Move kernel >= 6.8 out of Ubuntu leaf to the highest shared layer where it is a driver requirement
- Delete redundant constraint re-declarations from leaf overlays
11. Dedup estimates updated from ~40% to ~40%+ across all references
12. os-ubuntu mixin now described as Ubuntu release constraints only (kernel moves to accelerator layer in Phase 1)
13. New Phase 1 exit criteria: no leaf carries validation duplicated across siblings; no leaf re-declares inherited constraints
14. Full consistency pass across options summary, tradeoffs, Flat Mixins, Auto-Compose, and Recommendation sections

…ADR-005

Based on review feedback from Jason (OS layer justification) and Brian
(Phase 1.5 proposal):

- Phase 1 now includes validation lift-up to intent layer, kernel
  constraint move to accelerator layer, and redundant constraint cleanup
- Updated dedup claims from ~40% to ~40%+ / "materially more than ~40%"
- Removed validation from "What's NOT eliminated" since Phase 1 now
  addresses it
- Updated os-ubuntu mixin description to Ubuntu release constraints
  only (kernel moves to accelerator layer in Phase 1)
- Updated all os-ubuntu references across Flat Mixins and Auto-Compose
  sections for consistency
- Added Phase 1 exit criteria for validation placement and constraint
  inheritance

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
- Update Reorder example to show validation on h100-eks-training,
  matching the rest of the ADR after Phase 1 validation lift-up
- Align os-ubuntu terminology to "release constraints" in Phase 2
  mixin list

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 (Contributor, Author) commented

Summary of Review Feedback and ADR Updates

| # | Concern | From | ADR Update |
|---|---|---|---|
| 1 | Constraint failure semantics (P0) | Mark | Added candidate selection + constraint failure sections: composed leaf candidate excluded as a unit |
| 2 | Mixin conflict policy | Mark | Strict duplicate-name prohibition for constraints and components, CI lint enforces |
| 3 | Mixin field stripping | Mark | Implementation note: merge a copy with mixins cleared before materializing recipe output |
| 4 | Golden-file testing | Mark | Added aicr query --format json as verification path in Phase 1 exit criteria |
| 5 | OS layer justification | Jason | Phase 1 now moves validation up to intent layer and kernel constraint to accelerator layer |
| 6 | Phase 1.5: validation lift-up | Brian | Folded into Phase 1: validation lift-up + redundant constraint cleanup, no code changes |
| 7 | Auto-mixin / flat structure | Alex | Explicit spec.mixins preferred; flat leaf-only acknowledged as valid alternative |
| 8 | ADR-004 → ADR-005 naming | Mark | Fixed PR description; renamed file to 005-overlay-refactoring.md |
| 9 | Dedup estimates | | Updated from ~40% to ~40%+ across all references |
| 10 | os-ubuntu mixin scope | | Now Ubuntu release constraints only; kernel moves to accelerator layer in Phase 1 |
| 11 | Consistency pass | | Updated options summary, tradeoffs, Flat Mixins, Auto-Compose, example trees, and Recommendation sections |
| 12 | Phase 1 exit criteria | | Added: no duplicated validation across siblings; no re-declared inherited constraints |

Align Phase 2 exit criteria with Phase 1 by specifying aicr query
--format json as the verification mechanism for hydrated output diffs.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Mixins **cannot** contain `criteria`, `base`, `mixins`, or `validation`.

**Why not validation in mixins (Reorder + Mixins):** Current validation merge in
`RecipeMetadataSpec.Merge()` is phase-replacement, not deep merge. If two

Contributor

Current validation merge in RecipeMetadataSpec.Merge() is phase-replacement, not deep merge.

We can track and fix it separately
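
To make the distinction concrete, here is a small Python sketch contrasting phase-replacement with one possible deep-merge rule. Both functions are hypothetical illustrations: the thread does not specify what deep-merge semantics Phase 3 would adopt, and list union is only one candidate.

```python
def merge_phase_replace(base, overlay):
    """Each phase present in overlay replaces the same phase in base wholesale."""
    return {**base, **overlay}


def merge_deep(base, overlay):
    """Union the check lists per phase (one possible deep-merge rule, assumed here)."""
    merged = {phase: {"checks": list(cfg.get("checks", []))} for phase, cfg in base.items()}
    for phase, cfg in overlay.items():
        checks = merged.setdefault(phase, {"checks": []})["checks"]
        for check in cfg.get("checks", []):
            if check not in checks:
                checks.append(check)
    return merged


# Two fragments that both contribute to the same validation phase.
first = {"deployment": {"checks": ["operator-health"]}}
second = {"deployment": {"checks": ["check-nvidia-smi"]}}

replaced = merge_phase_replace(first, second)
deep = merge_deep(first, second)
print(replaced)  # {'deployment': {'checks': ['check-nvidia-smi']}}
print(deep)      # {'deployment': {'checks': ['operator-health', 'check-nvidia-smi']}}
```

With phase-replacement, whichever fragment is merged last wins the whole phase and the other fragment's checks disappear, which is why contributions from more than one source to the same phase are unsafe until the merge is upgraded.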

@yuanchen8911 (Contributor, Author) commented

Here's the plan for Phase 1. I hope it addresses the comments. If we agree, I'll start an implementation.

Phase 1 (no code changes):

  1. Create {accelerator}-{service} intermediate overlays (h100-eks, gb200-eks, etc.)
  2. Move shared GPU operator overrides, skyhook config, and K8s constraints into those intermediates
  3. Re-parent intent overlays to inherit from the new intermediates
  4. Move validation blocks up to the {accelerator}-{service}-{intent} layer
  5. Move kernel >= 6.8 out of the Ubuntu leaf to the highest shared layer where it is actually a driver requirement
  6. Delete redundant constraint re-declarations from leaf overlays

Labels

area/docs documentation Improvements or additions to documentation size/XL

6 participants