This file provides guidance to Codex and other coding agents when working with code in this repository.
If present, also read AGENTS.local.md at the repo root. The file is gitignored repo-wide so personal overlays stay local — agents must check the exact path directly (e.g., Read or cat), not rely on ignore-respecting discovery tools such as rg, fd, or git ls-files. Treat it as a local overlay for this working copy: follow it when it does not conflict with higher-priority instructions or this shared AGENTS.md.
Act as a Principal Distributed Systems Architect with deep expertise in Go and cloud-native architectures. Focus on correctness, resiliency, and operational simplicity. All code must be production-grade, not illustrative pseudo-code.
NVIDIA AI Cluster Runtime (AICR) generates validated GPU-accelerated Kubernetes configurations.
Workflow: Snapshot → Recipe → Validate → Bundle
┌─────────┐ ┌────────┐ ┌──────────┐ ┌────────┐
│Snapshot │───▶│ Recipe │───▶│ Validate │───▶│ Bundle │
└─────────┘ └────────┘ └──────────┘ └────────┘
│ │ │ │
▼ ▼ ▼ ▼
Capture Generate Check Create
cluster optimized constraints Helm values,
state config vs actual manifests
Tech Stack: Go 1.26, Kubernetes 1.33+, golangci-lint v2.10.1, Ko for images
# IMPORTANT: goreleaser (used by make build, make qualify, e2e) fails if
# GITLAB_TOKEN is set alongside GITHUB_TOKEN. Always unset it first:
unset GITLAB_TOKEN
# Development workflow
make qualify # Full check: test + lint + e2e + scan (run before PR)
make test # Unit tests with -race
make lint # golangci-lint + yamllint
make scan # Grype vulnerability scan
make build # Build binaries
make tidy # Format + update deps
# Run single test
go test -v ./pkg/recipe/... -run TestSpecificFunction
# Run tests with race detector for specific package
go test -race -v ./pkg/collector/...
# Local development
make server # Start API server locally (debug mode)
make dev-env # Create Kind cluster + start Tilt
make dev-env-clean # Stop Tilt + delete cluster
# KWOK simulated cluster tests (no GPU hardware required)
make kwok-test-all # All recipes
make kwok-e2e RECIPE=eks-training # Single recipe
# E2E tests (unset GITLAB_TOKEN to avoid goreleaser conflicts)
unset GITLAB_TOKEN && ./tools/e2e
# Tools management
make tools-setup # Install all required tools
make tools-check # Verify versions match .settings.yaml
# Local health check validation
make check-health COMPONENT=nvsentinel # Direct chainsaw against Kind
make check-health-all # All components
make validate-local RECIPE=recipe.yaml # Full pipeline in Kind- Read before writing — Never modify code you haven't read
- Tests must pass —
make testwith race detector; never skip tests - Run
make qualifyoften — Run at every stopping point (after completing a phase, before commits, before moving on). Fix ALL lint/test failures before proceeding. Do not treat pre-existing failures as acceptable. - Use project patterns — Learn existing code before inventing new approaches
- 3-strike rule — After 3 failed fix attempts, stop and reassess
- Structured errors — Use
pkg/errorswith error codes (neverfmt.Errorf) - Context timeouts — All I/O operations need context with timeout
- Check context in loops — Always check
ctx.Done()in long-running operations
When providing review findings, use global GitHub file links by default
(https://github.com/<org>/<repo>/blob/<sha>/<path>#L<line>) instead of local
workspace paths. Use local file paths only when explicitly requested.
- Commit to
mainbranch (notmaster) - Do use
-Sto cryptographically sign the commit - Do NOT add
Co-Authored-Bylines (organization policy) - Do not sign-off commits (no
-sflag); cryptographic signing (-S) satisfies DCO for AI-authored commits
| Package | Purpose | Business Logic? |
|---|---|---|
pkg/cli |
User interaction, input validation, output formatting | No |
pkg/api |
REST API handlers | No |
pkg/recipe |
Recipe resolution, overlay system, component registry | Yes |
pkg/bundler |
Per-component Helm bundle generation from recipes | Yes |
pkg/component |
Bundler utilities and test helpers | Yes |
pkg/collector |
System state collection | Yes |
pkg/validator |
Constraint evaluation | Yes |
pkg/errors |
Structured error handling with codes | Yes |
pkg/manifest |
Shared Helm-compatible manifest rendering | Yes |
pkg/evidence |
Conformance evidence capture and formatting | Yes |
pkg/collector/topology |
Cluster-wide node taint/label topology collection | Yes |
pkg/snapshotter |
System state snapshot orchestration | Yes |
pkg/k8s/client |
Singleton Kubernetes client | Yes |
pkg/k8s/pod |
Shared K8s Job/Pod utilities (wait, logs, ConfigMap URIs) | Yes |
pkg/validator/helper |
Shared validator helpers (PodLifecycle, test context) | Yes |
pkg/defaults |
Centralized timeout and configuration constants | Yes |
Critical Architecture Principle:
pkg/cliandpkg/api= user interaction only, no business logic- Business logic lives in functional packages so CLI and API can both use it
Errors (always use pkg/errors):
import "github.com/NVIDIA/aicr/pkg/errors"
// Simple error
return errors.New(errors.ErrCodeNotFound, "GPU not found")
// Wrap existing error
return errors.Wrap(errors.ErrCodeInternal, "collection failed", err)
// With context
return errors.WrapWithContext(errors.ErrCodeTimeout, "operation timed out", ctx.Err(),
map[string]interface{}{"component": "gpu-collector", "timeout": "10s"})Error Codes: ErrCodeNotFound, ErrCodeUnauthorized, ErrCodeTimeout, ErrCodeInternal, ErrCodeInvalidRequest, ErrCodeUnavailable
Context with timeout (always):
// Collectors: 10s timeout
func (c *Collector) Collect(ctx context.Context) (*measurement.Measurement, error) {
ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
// ...
}
// HTTP handlers: 30s timeout
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
defer cancel()
// ...
}Table-driven tests (required for multiple cases):
func TestFunction(t *testing.T) {
tests := []struct {
name string
input string
expected string
wantErr bool
}{
{"valid input", "test", "test", false},
{"empty input", "", "", true},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
result, err := Function(tt.input)
if (err != nil) != tt.wantErr {
t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
}
if result != tt.expected {
t.Errorf("got %v, want %v", result, tt.expected)
}
})
}
}Functional options (configuration):
builder := recipe.NewBuilder(
recipe.WithVersion(version),
)
server := server.New(
server.WithName("aicrd"),
server.WithVersion(version),
)Concurrency (errgroup):
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error { return collector1.Collect(ctx) })
g.Go(func() error { return collector2.Collect(ctx) })
if err := g.Wait(); err != nil {
return errors.Wrap(errors.ErrCodeInternal, "collection failed", err)
}Structured logging (slog):
slog.Debug("request started", "requestID", requestID, "method", r.Method)
slog.Error("operation failed", "error", err, "component", "gpu-collector")| Task | Location | Key Points |
|---|---|---|
| New Helm component | recipes/registry.yaml |
Add entry with name, displayName, helm settings, nodeScheduling |
| New Kustomize component | recipes/registry.yaml |
Add entry with name, displayName, kustomize settings |
| Component values | recipes/components/<name>/ |
Create values.yaml with Helm chart configuration |
| New collector | pkg/collector/<type>/ |
Implement Collector interface, add to factory |
| New API endpoint | pkg/api/ |
Handler + middleware chain + OpenAPI spec update |
| Fix test failures | Run make test |
Check race conditions (-race), verify context handling |
| New health check | recipes/checks/<name>/ |
Create health-check.yaml, register in registry.yaml, test with make check-health |
Adding a Helm component (declarative - no Go code needed):
# recipes/registry.yaml
- name: my-operator
displayName: My Operator
valueOverrideKeys: [myoperator]
helm:
defaultRepository: https://charts.example.com
defaultChart: example/my-operator
nodeScheduling:
system:
nodeSelectorPaths: [operator.nodeSelector]Adding a Kustomize component (declarative - no Go code needed):
# recipes/registry.yaml
- name: my-kustomize-app
displayName: My Kustomize App
valueOverrideKeys: [mykustomize]
kustomize:
defaultSource: https://github.com/example/my-app
defaultPath: deploy/production
defaultTag: v1.0.0Note: A component must have either helm OR kustomize configuration, not both.
Using mixins for shared OS/platform content:
# Leaf overlay referencing mixins instead of duplicating content
spec:
base: h100-eks-ubuntu-training
mixins:
- os-ubuntu # Ubuntu constraints (defined once in recipes/mixins/)
- platform-kubeflow # kubeflow-trainer component (defined once in recipes/mixins/)
criteria:
service: eks
accelerator: h100
os: ubuntu
intent: training
platform: kubeflow
constraints:
- name: K8s.server.version
value: ">= 1.32.4"Mixins carry only constraints and componentRefs — no criteria, base, mixins, or validation. They live in recipes/mixins/ with kind: RecipeMixin.
Never return bare errors. Every return err must wrap with context:
// BAD - bare return loses context
if err := doSomething(); err != nil {
return err
}
// GOOD - wrapped with context
if err := doSomething(); err != nil {
return errors.Wrap(errors.ErrCodeInternal, "failed to do something", err)
}Don't double-wrap errors that already have proper codes. If a called function already returns a pkg/errors StructuredError with the right code, don't re-wrap and change its code:
// BAD - overwrites inner ErrCodeNotFound with ErrCodeInternal
content, err := readTemplateContent(ctx, path) // returns ErrCodeNotFound
return errors.Wrap(errors.ErrCodeInternal, "read failed", err)
// GOOD - propagate as-is when inner error already has correct code
content, err := readTemplateContent(ctx, path)
return errException: Wrapping is unnecessary for read-only Close() returns and K8s helpers like k8s.IgnoreNotFound(err).
Always use errors.Is() for sentinel error checks. golangci-lint enforces the errorlint rule — comparing errors with == fails on wrapped errors and will be rejected by CI:
// BAD - fails errorlint, breaks on wrapped errors
if err == io.EOF {
// GOOD - works with wrapped errors, passes linter
if errors.Is(err, io.EOF) {Note: in files that import pkg/errors, the standard library errors package is aliased as stderrors, so use stderrors.Is(...) there.
Writable file handles must check Close() errors. If a file handle is writable (e.g., from os.Create or os.OpenFile), closing it may flush buffered data; always capture and check the error:
// BAD - writable Close() error ignored
defer f.Close()
// GOOD - writable Close() error checked
closeErr := f.Close()
if err == nil {
err = closeErr
}Never use context.Background() in I/O methods. Use a timeout-bounded context:
// BAD - unbounded context
func (r *Reader) Read(url string) ([]byte, error) {
return r.ReadWithContext(context.Background(), url)
}
// GOOD - timeout-bounded
func (r *Reader) Read(url string) ([]byte, error) {
ctx, cancel := context.WithTimeout(context.Background(), r.TotalTimeout)
defer cancel()
return r.ReadWithContext(ctx, url)
}context.Background() is acceptable ONLY for: cleanup in deferred functions (when parent context is canceled), graceful shutdown, and test setup.
Never use http.DefaultClient. It has zero timeout. Always use a custom client with an explicit timeout:
// BAD - no timeout, can hang indefinitely
resp, err := http.DefaultClient.Do(req)
// GOOD - bounded timeout from pkg/defaults
client := &http.Client{Timeout: defaults.HTTPClientTimeout}
resp, err := client.Do(req)Always use slog for output in production code. Never use fmt.Println, fmt.Printf, or fmt.Fprintln for logging or streaming output:
// BAD
fmt.Println(scanner.Text())
// GOOD
slog.Info(scanner.Text())Exception: fmt.Fprintln(logWriter(), ...) for agent log output to stderr is acceptable when structured logging would add noise to raw log streaming.
Use named constants from pkg/defaults instead of magic literals. If a timeout, limit, or configuration value is used anywhere, it should be a named constant:
// BAD - magic literal
ExpectContinueTimeout: 1 * time.Second,
// GOOD - named constant
ExpectContinueTimeout: defaults.HTTPExpectContinueTimeout,Use watch API instead of polling for efficiency and reduced API server load:
// BAD - polling with sleep
ticker := time.NewTicker(500 * time.Millisecond)
for {
select {
case <-ticker.C:
pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
if pod.Status.Phase == v1.PodSucceeded {
return nil
}
}
}
// GOOD - watch API
watcher, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
FieldSelector: "metadata.name=" + name,
})
defer watcher.Stop()
for event := range watcher.ResultChan() {
pod := event.Object.(*v1.Pod)
if pod.Status.Phase == v1.PodSucceeded {
return nil
}
}Use create-or-update semantics for mutable K8s resources instead of IgnoreAlreadyExists:
// BAD - stale resource silently kept from prior run
_, err = clientset.RbacV1().Roles(ns).Create(ctx, role, metav1.CreateOptions{})
if apierrors.IsAlreadyExists(err) {
return nil // stale rules persist!
}
// GOOD - create, then update if exists
_, err = clientset.RbacV1().Roles(ns).Create(ctx, role, metav1.CreateOptions{})
if apierrors.IsAlreadyExists(err) {
_, err = clientset.RbacV1().Roles(ns).Update(ctx, role, metav1.UpdateOptions{})
if err != nil {
return errors.Wrap(errors.ErrCodeInternal, "failed to update Role", err)
}
return nil
}IgnoreAlreadyExists is acceptable ONLY for: immutable resources (ServiceAccounts, Namespaces) where updates are not needed.
Use shared utilities from pkg/k8s/pod instead of reimplementing:
// Use for Job completion
err := pod.WaitForJobCompletion(ctx, client, namespace, jobName, timeout)
// Use for pod logs
logs, err := pod.GetPodLogs(ctx, client, namespace, podName)
// Use for streaming logs
err := pod.StreamLogs(ctx, client, namespace, podName, os.Stdout)
// Use for ConfigMap URI parsing
namespace, name, err := pod.ParseConfigMapURI("cm://gpu-operator/aicr-snapshot")Always use --no-cluster flag in tests to prevent production cluster access:
// Unit tests: Use WithNoCluster(true)
v := validator.New(
validator.WithNoCluster(true),
validator.WithVersion(version),
)
// E2E tests: Use --no-cluster flag
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --no-cluster
// Chainsaw tests: Always include --no-cluster
${AICR_BIN} validate -r recipe.yaml -s snapshot.yaml --no-clusterTest mode behavior: When NoCluster is true:
- Validator skips RBAC creation (ServiceAccount, Role, ClusterRole)
- Validator skips Job deployment for checks
- All checks report status as "skipped - no-cluster mode (test mode)"
- Constraints are still evaluated inline (no cluster access needed)
Auto-anchors, no TOCs. Both GitHub and the Fern-rendered docs site
auto-generate anchor IDs from heading text (lowercase, spaces → hyphens).
Do not add ## Table of Contents blocks or explicit <a name="..."> /
{#slug} markup — they drift out of sync and duplicate what the platforms
already provide on hover.
Promote **Bold Label:** paragraphs to real headings sparingly. A bold
label becomes a heading only when it names a topic (feature, subsystem,
algorithm, pattern, named behavior) with substantial content beneath it
(≥ ~8 content lines is a useful rule of thumb). Leave as bold paragraphs:
- Scaffolding that recurs per section:
Synopsis,Flags,Examples,Example,Behavior,Usage,Parameters,Returns. - Generic structural labels that just describe what's in the next
block:
Output,Input Sources,Benefits,Responsibilities,Key Features,Key Points,Installation. - Thin sections (< 8 lines) even if the label is a named topic — a 2-sentence intro that mostly delegates to children isn't itself a topic.
- FAQ-style entries under a collection heading (e.g.
### Common Issueswith entries like**"Connection refused" error:**+ 2-line fix) — promoting each fragments navigation without adding substance. - Paired short subsections — if two thin labels are conceptual siblings
(e.g.
**Updating versions:**+**Adding components:**), promote both or neither.
Slug gotchas when promoting. GitHub preserves hyphens literally but strips most other punctuation:
- Trailing
(`--flag`)→ triple-hyphen slug (…values---dynamic). Drop the parenthetical if the flag name is already in the first paragraph. +,&,/between words → double-hyphen slugs (Base + Overlay Merging→base--overlay-merging). Rewrite withand/or.
Anchor link hygiene. Broken anchor links are caught in CI by
lychee on any PR that touches
docs/** (see .github/workflows/fern-docs-ci.yaml, config in
.lychee.toml) — make qualify does NOT run it, so CI is the safety
net. When renaming or removing a heading:
- Grep for
<filename>.md#<old-slug>across the repo first — other docs, Helm templates, andSECURITY.mdlink into user-facing anchors, and those inbound links won't be in the same file you're editing. - If intentionally removing a heading an external doc linked to, update the inbound link in the same PR.
| Anti-Pattern | Correct Approach |
|---|---|
| Modify code without reading it first | Always Read files before Edit |
| Skip or disable tests to make CI pass | Fix the actual issue |
| Invent new patterns | Study existing code in same package first |
Use fmt.Errorf for errors |
Use pkg/errors with error codes |
Return bare err without wrapping |
Always errors.Wrap() with context message |
Use context.Background() in I/O methods |
Use context.WithTimeout() with bounded deadline |
Use fmt.Println for logging |
Use slog.Info/Debug/Warn/Error |
| Hardcode timeout/limit values | Define in pkg/defaults and reference by name |
| Re-wrap errors that already have correct codes | Return as-is to preserve error code |
| Ignore context cancellation | Always check ctx.Done() in loops/operations |
| Add features not requested | Implement exactly what was asked |
| Create new files when editing suffices | Prefer Edit over Write |
| Guess at missing parameters | Ask for clarification |
| Continue after 3 failed fix attempts | Stop, reassess approach, explain blockers |
| Use polling loops for K8s operations | Use watch API for efficiency |
Compare errors with == (e.g., err == io.EOF) |
Use errors.Is(err, io.EOF) (stderrors.Is in files that alias stdlib errors) — errorlint enforced by CI |
| Duplicate K8s utilities across packages | Use shared utilities from pkg/k8s/pod |
| Run tests that connect to live clusters | Always use --no-cluster flag in tests |
| Use boolean flags to track options | Use pointer pattern (nil = not set, &value = set) |
Use http.DefaultClient |
Use custom &http.Client{Timeout: defaults.HTTPClientTimeout} |
Use IgnoreAlreadyExists for mutable K8s resources |
Use create-or-update semantics (Create, then Update if exists) |
Ignore Close() error on writable file handles |
Capture and check closeErr := f.Close() |
| Hardcode resource names from templates | Extract to named constants to keep code and templates in sync |
Pre-push checklist: Always run make qualify before pushing. This is the CI-equivalent gate that covers tests, linting (golangci-lint + yamllint), e2e, vulnerability scan, and repo-specific checks (docs sidebar, agents sync). Do not substitute a subset of commands — if make qualify passes locally, CI will pass.
Mandatory lint gate for Go changes: If your PR changes any .go files, you MUST run golangci-lint run -c .golangci.yaml on each affected package path (e.g., ./pkg/recipe/..., ./cmd/aicr/..., ./tests/chainsaw/...) and confirm zero issues before creating or pushing the PR. For a full module scan, use ./.... Do not rely on CI to catch lint failures — fix them locally first. This applies even to PRs labeled as "documentation only" if they include Go code changes.
Branch hygiene:
- Always rebase onto the target branch before pushing:
git fetch origin main && git rebase origin/main - Squash commits into a single commit before push
- Cryptographically sign commits (
git commit -S)
Documentation updates: When a PR adds or changes user-visible behavior (new CLI flag, API endpoint, component, recipe field, deployment pattern, environment variable, error code), update the relevant page in docs/ in the same PR — don't defer to a follow-up. Common targets by kind of change:
- CLI flag / subcommand →
docs/user/cli-reference.md - API endpoint / query parameter →
docs/user/api-reference.md - Registry component →
docs/user/component-catalog.md - Recipe / overlay / mixin structure →
docs/integrator/recipe-development.mdanddocs/contributor/data.md - Internal package or architecture →
docs/contributor/<area>.md - Enum/constant value added (e.g., new accelerator, service, OS, intent, platform, error code) → the value is usually enumerated in many files, not one, and grepping for the new value returns nothing. Start from the authoritative Go type (e.g.,
pkg/recipe/criteria.goforCriteriaAccelerator*), list every current value, and verify each appears wherever the enum is documented. Audit targets typically include: the OpenAPI contract atapi/aicr/v1/server.yaml(everyenum:block); doc pagesdocs/README.md(glossary),docs/user/cli-reference.md,docs/user/api-reference.md,docs/contributor/api-server.md,docs/contributor/cli.md,docs/contributor/data.md,docs/contributor/validations.md, and the site-docs mirror undersite/docs/(e.g.,site/docs/getting-started/index.md); Go-visible surfaces in the package that defines the type (package godoc inpkg/<area>/doc.go, field/type comments on the Go struct, and any urfave/cliDescription/Usagestrings that enumerate values, e.g.,pkg/cli/recipe.go); and issue templates that surface the enum in dropdowns (.github/ISSUE_TEMPLATE/*.yml). Greppingdocs/for an already-documented sibling value (e.g.,gb200) catches forward additions but misses pre-existing drift — check against the Go type, not a known-good sibling.
Follow the heading conventions in the ## Documentation Style section above. Doc-only PRs (label documentation) are still subject to the full make qualify gate.
PR description: Use the template from .github/PULL_REQUEST_TEMPLATE.md exactly as defined there. Do not inline a modified copy — read and fill in the canonical template. The template covers: Summary, Motivation/Context (with Fixes/Related), Type of Change, Components Affected, Implementation Notes, Testing, Risk Assessment, and Checklist.
Test coverage gate (Go packages only):
Before pushing a PR that changes Go source files, check test coverage on affected packages. Set pkg to the narrowest directory root you want to measure — $pkg/... intentionally includes descendant packages. Prefer the narrowest changed root (e.g., if only pkg/collector/topology changed, use pkg=pkg/collector/topology, not pkg=pkg/collector). Use a broader root only when you intentionally want one combined delta across related subpackages.
- Run
GOFLAGS="-mod=vendor" go test -coverprofile=cover.out ./$pkg/...on each changed package - Get the baseline using a clean worktree (changes must be committed first):
(git worktree add $TMPDIR/baseline origin/main && (cd $TMPDIR/baseline && GOFLAGS="-mod=vendor" go test -coverprofile=$TMPDIR/base.out ./$pkg/...); rc=$?; git worktree remove --force $TMPDIR/baseline; return $rc 2>/dev/null || (exit $rc)). This preserves the test exit status through cleanup. Write the profile to$TMPDIR/base.out(outside the worktree) so it survives cleanup. Compare withgo tool cover -funcon both profiles. Skip this step for entirely new packages. - Block if
make test-coveragefails — this enforces the project-wide 70% floor (from.settings.yaml). Do not use per-package profiles for this check. - Flag any package with per-package coverage decrease > 0.5% (comparing step 1 vs step 2)
- Block if any new exported function or method (identified via
git diff origin/main -- $pkg/— look for addedfunclines with uppercase names) has 0% coverage — add tests before pushing - Report the delta in the PR description's Testing section (e.g.,
pkg/recipe: 90.4% → 90.3% (-0.1%)) This rule does not apply to non-Go changes (YAML, docs, CI workflows). Note: CI also posts per-package coverage deltas post-push viago-coverage-reportinon-push-comment.yaml; this gate catches regressions before push.
PR policy:
- Do NOT add
Co-Authored-Bylines (organization policy) - Do NOT add "Generated with Claude Code", "Created by Codex", or similar attribution
- Add appropriate type labels:
enhancement,bug,documentation - Area labels are auto-assigned by
.github/labeler.ymlbased on changed file paths (e.g.,area/recipes,area/ci,area/api,area/cli,area/bundler,area/collector,area/validator,area/docs,area/infra,area/tests). You may also add them manually when the auto-labeler wouldn't match (e.g., issue-only PRs or cross-cutting changes). - Do NOT add
size/*labels (auto-assigned by bot) - Keep the PR title under 70 characters; use the description for details
| File | Purpose |
|---|---|
CONTRIBUTING.md |
Contribution guidelines, PR process, DCO |
DEVELOPMENT.md |
Development setup, architecture, Make targets |
RELEASING.md |
Release process for maintainers |
.settings.yaml |
Project settings: tool versions, quality thresholds, build/test config (single source of truth) |
recipes/registry.yaml |
Declarative component configuration |
recipes/overlays/*.yaml |
Recipe overlay definitions |
recipes/mixins/*.yaml |
Composable mixin fragments (OS constraints, platform components) |
recipes/components/*/values.yaml |
Component Helm values |
api/aicr/v1/server.yaml |
OpenAPI spec |
.goreleaser.yaml |
Release configuration |
| Issue | Check |
|---|---|
| K8s connection fails | ~/.kube/config or KUBECONFIG env |
| GPU not detected | nvidia-smi in PATH |
| Linter errors | Use errors.Is() not ==; add return after t.Fatal() |
| Race conditions | Run with -race flag |
| Build failures | Run make tidy |
Operational:
- Partial failure is the steady state — design for partitions, timeouts, bounded retries
- Boring first — default to proven, simple technologies
- Observability is mandatory — structured logging, metrics, tracing
Foundational:
- Local development equals CI —
.settings.yamlis single source of truth - Correctness must be reproducible — same inputs → same outputs, always
- Metadata is separate from consumption — recipes define what, bundlers determine how
- Recipe specialization requires explicit intent — never silently upgrade to specialized configs
- Trust requires verifiable provenance — SLSA, SBOM, Sigstore
When choosing between approaches, prioritize in this order:
- Testability — Can it be unit tested without external dependencies?
- Readability — Can another engineer understand it quickly?
- Consistency — Does it match existing patterns in the codebase?
- Simplicity — Is it the simplest solution that works?
- Reversibility — Can it be easily changed later?
# Capture system state
aicr snapshot --output snapshot.yaml
# Generate recipe from snapshot
aicr recipe --snapshot snapshot.yaml --intent training --output recipe.yaml
# Generate recipe from query parameters
aicr recipe --service eks --accelerator h100 --intent training --os ubuntu --platform kubeflow
# Create deployment bundle
aicr bundle --recipe recipe.yaml --output ./bundles
# Query a specific hydrated value from a recipe
aicr query --service eks --accelerator h100 --intent training \
--selector components.gpu-operator.values.driver.version
# Validate recipe against snapshot
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml
# Bundle with value overrides
aicr bundle -r recipe.yaml \
--set gpuoperator:driver.version=570.86.16 \
--deployer argocd \
-o ./bundlesSee CONTRIBUTING.md, DEVELOPMENT.md, RELEASING.md, and .github/copilot-instructions.md for extended documentation including:
- Detailed code examples for collectors, bundlers, API endpoints
- GitHub Actions architecture (three-layer composite actions)
- CI/CD workflows, supply chain security (SLSA, SBOM, Cosign)
- E2E testing patterns and KWOK simulated cluster testing