[AIM] Consistent conditions for all resources #411

@salexo

Status & Conditions Model for AIM Resources

In order to provide a consistent, predictable, and machine-readable status surface across all AIM resources, we need a clear definition of:

  • How phases (status.status) behave
  • What condition types exist
  • How child resource health is surfaced
  • How various classes of issues (spec, auth, infra, capacity) are exposed

This issue proposes a unified model.


1. Phases (status.status)

All AIM resources standardize on constants.AIMStatus as their phase:

  • Pending
  • Starting
  • Ready / Running
  • Degraded
  • Failed
  • NotAvailable

Rules:

  • Phases are computed centrally by a State Engine.
  • Domains should not set this by hand except in advanced/override scenarios.
  • If a resource uses only a subset (e.g., no Running), we constrain via Kubebuilder enums.
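As a rough illustration of the subset rule above, a phase type and a Kubebuilder enum marker might look like the sketch below. The constant names and the `ExampleResourceStatus` struct are assumptions made for illustration; only the phase values and the `constants.AIMStatus` name come from this issue.

```go
package main

import "fmt"

// AIMStatus stands in for the type this issue calls constants.AIMStatus.
// The declaration itself is an assumption; only the phase values are from the proposal.
type AIMStatus string

const (
	AIMStatusPending      AIMStatus = "Pending"
	AIMStatusStarting     AIMStatus = "Starting"
	AIMStatusReady        AIMStatus = "Ready"
	AIMStatusRunning      AIMStatus = "Running"
	AIMStatusDegraded     AIMStatus = "Degraded"
	AIMStatusFailed       AIMStatus = "Failed"
	AIMStatusNotAvailable AIMStatus = "NotAvailable"
)

// ExampleResourceStatus is a hypothetical status struct for a resource
// that never reports Running, so the enum marker leaves that value out.
type ExampleResourceStatus struct {
	// +kubebuilder:validation:Enum=Pending;Starting;Ready;Degraded;Failed;NotAvailable
	Status AIMStatus `json:"status,omitempty"`
}

func main() {
	s := ExampleResourceStatus{Status: AIMStatusStarting}
	fmt.Println(s.Status)
}
```

The marker only constrains what the API server accepts; the State Engine would still be the component that picks the value.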

2. Conditions Model

AIM resources expose two categories of conditions:

  1. Per-child conditions
  2. Parent-level conditions

The model is designed to:

  • Give users detailed breadcrumbs when debugging
  • Provide machine-readable signals for automation
  • Keep naming consistent across controllers

2.1 Per-Child Conditions {ChildName}Ready

For every child resource a parent tracks (e.g. Model, Template, InferenceService, Cache), we expose:

{ChildName}Ready

Behavior

  • Status=True
    → The child is fully ready.

  • Status=False
    → The child is progressing, degraded, failed, or its health cannot currently be determined.

Reason

The reason field reflects the child's current health state. When everything is going as expected it should be the generic phase (e.g. Starting, Ready); when an issue has been identified it should be more specific (e.g. ImagePullError).

Message

  • The message is taken directly from the child resource's own conditions/messages wherever possible.
  • This allows users to trace issues without hunting through pod logs with kubectl.
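The Behavior, Reason, and Message rules above could be combined along these lines. This is a minimal sketch: `Condition` is a local stand-in for `metav1.Condition` so the example has no dependencies, and `ChildState` and `childReadyCondition` are hypothetical names. Treating Running as ready and falling back to Reason=Unknown are assumptions as well.

```go
package main

import "fmt"

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// ChildState is a hypothetical summary of what the parent observed on a child.
type ChildState struct {
	Name    string // e.g. "Model", "Template"
	Phase   string // generic phase, e.g. "Starting", "Ready"
	Issue   string // specific reason once a problem is identified, e.g. "ImagePullError"
	Message string // copied from the child's own conditions where possible
}

// childReadyCondition builds the {ChildName}Ready condition for one child.
func childReadyCondition(c ChildState) Condition {
	cond := Condition{
		Type:    c.Name + "Ready",
		Status:  "False",
		Reason:  c.Phase,
		Message: c.Message,
	}
	if c.Issue != "" {
		// An identified issue replaces the generic phase as the reason.
		cond.Reason = c.Issue
	} else if c.Phase == "Ready" || c.Phase == "Running" {
		// Assumption: Running counts as fully ready for children that use it.
		cond.Status = "True"
	}
	if cond.Reason == "" {
		// Health could not be determined.
		cond.Reason = "Unknown"
	}
	return cond
}

func main() {
	fmt.Printf("%+v\n", childReadyCondition(ChildState{
		Name: "Model", Phase: "Starting", Issue: "ImagePullError", Message: "pull failed",
	}))
	fmt.Printf("%+v\n", childReadyCondition(ChildState{Name: "Cache", Phase: "Ready"}))
}
```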

2.2 Parent-Level Conditions

Every AIM resource also publishes a small, standardized set of top-level conditions:

1) Ready

Represents overall readiness of the resource.

Ready=True
→ The resource is usable.
Ready=False
→ Look at reason + other conditions for details.

This is the primary high-level signal.


2) ConfigValid

Represents validity of the resource's spec.

  • ConfigValid=False → Invalid spec fields, incompatible combinations, etc.
    The phase will be Failed.

This isolates user errors from runtime/system errors.


3) AuthValid

Represents validity of authentication/authorization, including:

  • Missing or malformed secrets
  • Invalid registry credentials
  • Unauthorized remote API calls
  • Image pull authentication failures

AuthValid=False → The phase becomes Degraded until fixed.


4) DependenciesReachable

Represents whether the controller can reach its required external dependencies.

This includes:

  • Remote APIs (model registry, inference APIs, template providers)
  • Storage endpoints
  • Registry endpoints (excluding pure auth failures)
  • Any external service the controller must fetch state from

Interpretation:

  • DependenciesReachable=True
    → All required external dependencies responded normally.

  • DependenciesReachable=False
    → Connectivity-like issues (timeouts, DNS failures, 5xx errors, refused connections).
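One possible way to separate connectivity-like failures from auth failures is sketched below, using only standard-library error types. The function name `isConnectivityIssue` and the sample errors are hypothetical; which errors should count toward DependenciesReachable is still open in this proposal.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// Sample errors for the demonstration in main. The host name is a placeholder.
var (
	errDNS     error = &net.DNSError{Err: "no such host", Name: "registry.example.invalid", IsNotFound: true}
	errRefused error = fmt.Errorf("dial tcp: %w", syscall.ECONNREFUSED)
	errAuth    error = errors.New("401 unauthorized")
)

// isConnectivityIssue reports whether an error or HTTP status code looks like
// a timeout, DNS failure, 5xx response, or refused connection.
// Pass httpStatus=0 when no HTTP response was received.
func isConnectivityIssue(err error, httpStatus int) bool {
	if httpStatus >= 500 && httpStatus <= 599 {
		return true
	}
	if err == nil {
		return false
	}
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return true
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return errors.Is(err, syscall.ECONNREFUSED)
}

func main() {
	fmt.Println(isConnectivityIssue(errDNS, 0))
	fmt.Println(isConnectivityIssue(errRefused, 0))
	fmt.Println(isConnectivityIssue(nil, 503))
	fmt.Println(isConnectivityIssue(errAuth, 401)) // auth failures belong under AuthValid instead
}
```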

Degradation logic for this condition

Transient issues do not immediately degrade the resource:

  • If the resource is in Pending, this stays informational.
  • Once in Starting/Ready, DependenciesReachable=False must persist for a timer-based threshold (e.g. 10 seconds) before the phase transitions to Degraded.

This prevents short-lived network hiccups from causing noisy readiness flapping.
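The threshold rule above can be sketched as a small pure function. `nextPhase` and `degradeThreshold` are hypothetical names, and the 10-second value is just the example from the text. A real controller would presumably derive `unreachableFor` from the condition's last transition time. Recovery from Degraded is not modelled here.

```go
package main

import (
	"fmt"
	"time"
)

type Phase string

const (
	PhasePending  Phase = "Pending"
	PhaseStarting Phase = "Starting"
	PhaseReady    Phase = "Ready"
	PhaseDegraded Phase = "Degraded"
)

// degradeThreshold uses the 10-second example value; the real value is undecided.
const degradeThreshold = 10 * time.Second

// nextPhase applies the DependenciesReachable degradation rule.
// unreachableFor is how long the condition has been continuously False.
func nextPhase(current Phase, reachable bool, unreachableFor time.Duration) Phase {
	if reachable {
		return current
	}
	switch current {
	case PhaseStarting, PhaseReady:
		if unreachableFor >= degradeThreshold {
			return PhaseDegraded
		}
	}
	// Pending stays informational; short outages leave Starting/Ready untouched.
	return current
}

func main() {
	fmt.Println(nextPhase(PhasePending, false, time.Minute))
	fmt.Println(nextPhase(PhaseReady, false, 3*time.Second))
	fmt.Println(nextPhase(PhaseReady, false, 15*time.Second))
}
```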


3. How Issue Classes Surface Across Conditions

This is the core of the model:

3.1 Spec errors

  • ConfigValid=False
  • Ready=False
  • Phase=Failed
  • Relevant child conditions (if applicable): {ChildName}Ready=False, Reason=InvalidSpec

3.2 Auth errors

  • AuthValid=False
  • Ready=False, Reason=AuthFailed
  • Phase=Degraded
  • Child: {ChildName}Ready=False, Reason=AuthFailed

3.3 External connectivity errors

  • DependenciesReachable=False
  • Ready=False, Reason=DependenciesUnreachable (or similar)
  • Phase=Starting or Ready, then Degraded once the threshold from section 2.2 has elapsed
  • Child health:
    • For components whose health could not be evaluated:
      {ChildName}Ready=False, Reason=Unknown

3.4 Capacity/pending issues (e.g., no GPUs)

  • Not an error; a form of progressing.
  • Ready=False, Reason=PendingResources or InsufficientCapacity
  • Phase=Starting (or Degraded if persistently blocked)
  • Child shows:
    • {ChildName}Ready=False, Reason=InsufficientCapacity
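The four issue classes in this section could be encoded as a lookup like the one below. `IssueClass`, `Surface`, and `surfaceFor` are hypothetical names. The `persistent` flag is an assumption: it covers both "threshold elapsed" for connectivity and "persistently blocked" for capacity. No Ready reason is given for spec errors in this proposal, so that field is left empty.

```go
package main

import "fmt"

type IssueClass int

const (
	SpecError IssueClass = iota
	AuthError
	ConnectivityError
	CapacityPending
)

// Surface describes how one issue class shows up on the parent resource.
// Ready is False in every case.
type Surface struct {
	Phase       string
	Failing     string // parent condition set to False besides Ready ("" if none)
	ReadyReason string
	ChildReason string
}

func surfaceFor(class IssueClass, persistent bool) Surface {
	phase := "Starting"
	if persistent {
		phase = "Degraded"
	}
	switch class {
	case SpecError:
		return Surface{Phase: "Failed", Failing: "ConfigValid", ChildReason: "InvalidSpec"}
	case AuthError:
		return Surface{Phase: "Degraded", Failing: "AuthValid", ReadyReason: "AuthFailed", ChildReason: "AuthFailed"}
	case ConnectivityError:
		return Surface{Phase: phase, Failing: "DependenciesReachable", ReadyReason: "DependenciesUnreachable", ChildReason: "Unknown"}
	case CapacityPending:
		// Not an error: no dedicated parent condition. PendingResources would also fit as the reason.
		return Surface{Phase: phase, ReadyReason: "InsufficientCapacity", ChildReason: "InsufficientCapacity"}
	}
	return Surface{}
}

func main() {
	fmt.Printf("%+v\n", surfaceFor(AuthError, false))
	fmt.Printf("%+v\n", surfaceFor(ConnectivityError, true))
}
```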

4. Summary of Final Conditions Set

Child-level

{ChildName}Ready   (True/False)

Parent-level

Ready
ConfigValid
AuthValid
DependenciesReachable

Everything else (e.g. workload behavior, missing dependencies, infra health) is expressed through:

  • Phase transitions (Pending → Starting → Ready → Degraded → Failed)
  • Reason / Message fields of these standard conditions
  • {ChildName}Ready conditions reflecting detailed component-level health
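For automation that consumes this condition set, a triage helper might look like the following sketch. `Condition` is again a local stand-in for `metav1.Condition`, and `firstProblem` is a hypothetical name. The lookup order (parent conditions first, then child conditions) is an assumption, not something this proposal prescribes.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// firstProblem returns the condition that best explains why a resource is not
// ready, or nil when Ready=True.
func firstProblem(conds []Condition) *Condition {
	byType := map[string]*Condition{}
	for i := range conds {
		byType[conds[i].Type] = &conds[i]
	}
	if r, ok := byType["Ready"]; ok && r.Status == "True" {
		return nil
	}
	// Parent-level conditions isolate spec, auth, and connectivity problems.
	for _, t := range []string{"ConfigValid", "AuthValid", "DependenciesReachable"} {
		if c, ok := byType[t]; ok && c.Status == "False" {
			return c
		}
	}
	// Otherwise fall back to the first failing {ChildName}Ready condition.
	for i := range conds {
		c := &conds[i]
		if c.Type != "Ready" && strings.HasSuffix(c.Type, "Ready") && c.Status == "False" {
			return c
		}
	}
	return byType["Ready"] // nil if no Ready condition was published
}

func main() {
	conds := []Condition{
		{Type: "Ready", Status: "False", Reason: "AuthFailed"},
		{Type: "ConfigValid", Status: "True"},
		{Type: "AuthValid", Status: "False", Reason: "AuthFailed", Message: "secret missing"},
	}
	if p := firstProblem(conds); p != nil {
		fmt.Printf("%s: %s (%s)\n", p.Type, p.Reason, p.Message)
	}
}
```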

5. Benefits of This Model

  • Fully machine-parseable structure
  • Minimal cognitive overhead for clients (just 4 parent-level conditions)
  • Consistent diagnostics across all AIM resources
  • Clear separation between:
    • user mistakes (ConfigValid)
    • auth issues (AuthValid)
    • remote problems (DependenciesReachable)
    • capacity/progress issues (child Ready conditions)
  • Less phase flapping from transient infra issues, because DependenciesReachable only degrades the resource after a threshold
  • Fits a centralized State Engine design
