Status & Conditions Model for AIM Resources
In order to provide a consistent, predictable, and machine-readable status surface across all AIM resources, we need a clear definition of:
- How phases (status.status) behave
- What condition types exist
- How child resource health is surfaced
- How various classes of issues (spec, auth, infra, capacity) are exposed
This issue proposes a unified model.
1. Phases (status.status)
All AIM resources standardize on constants.AIMStatus as their phase:
- Pending
- Starting
- Ready/Running
- Degraded
- Failed
- NotAvailable
Rules:
- Phases are computed centrally by a State Engine.
- Domains should not set this by hand except in advanced/override scenarios.
- If a resource uses only a subset (e.g., no Running), we constrain it via Kubebuilder enums.
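As a rough sketch of the subset rule, a shared phase type and a per-resource enum constraint could look like the following. The type name follows constants.AIMStatus from the text; the constant names, the ExampleTemplateStatus struct, and the treatment of Ready and Running as two separate values are assumptions made for illustration. The +kubebuilder:validation:Enum marker is the standard Kubebuilder way to restrict a field to a fixed set of values.

```go
package main

// AIMStatus is the shared phase type (referred to as constants.AIMStatus in the text).
type AIMStatus string

// Constant names are hypothetical; the string values come from the phase list above.
const (
	AIMStatusPending      AIMStatus = "Pending"
	AIMStatusStarting     AIMStatus = "Starting"
	AIMStatusReady        AIMStatus = "Ready"
	AIMStatusRunning      AIMStatus = "Running"
	AIMStatusDegraded     AIMStatus = "Degraded"
	AIMStatusFailed       AIMStatus = "Failed"
	AIMStatusNotAvailable AIMStatus = "NotAvailable"
)

// ExampleTemplateStatus is a hypothetical status struct for a resource
// that never reports Running, so Running is left out of its enum.
type ExampleTemplateStatus struct {
	// +kubebuilder:validation:Enum=Pending;Starting;Ready;Degraded;Failed;NotAvailable
	Status AIMStatus `json:"status,omitempty"`
}
```

With a marker like this, the generated CRD schema rejects any status value outside the listed subset, so the restriction lives in the API definition rather than in controller code.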
2. Conditions Model
AIM resources expose two categories of conditions:
- Per-child conditions
- Parent-level conditions
The model is designed to:
- Give users detailed breadcrumbs when debugging
- Provide machine-readable signals for automation
- Keep naming consistent across controllers
2.1 Per-Child Conditions {ChildName}Ready
For every child resource a parent tracks (e.g. Model, Template, InferenceService, Cache), we expose:
{ChildName}Ready
Behavior
- Status=True → The child is fully ready.
- Status=False → The child is progressing, degraded, failed, or its health cannot currently be determined.
Reason
The reason field reflects the child's current health state. It should carry the generic phase when everything is going as expected (e.g. Starting, Ready) and a more specific value when an issue has been identified (e.g. ImagePullError).
Message
- The message is taken directly from the child resource's own conditions/messages wherever possible.
- This allows users to trace issues without kubectl-hunting pod logs.
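The per-child rules above could be sketched roughly as follows. The Condition struct is a local stand-in with the same field names as the Kubernetes metav1.Condition type, defined here only to keep the example dependency-free. ChildHealth and ChildReadyCondition are hypothetical names, and falling back to Reason=Unknown when no reason is available is an assumption based on the connectivity case described later in this issue.

```go
package main

// Condition is a local stand-in mirroring the metav1.Condition fields used here.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// ChildHealth is a hypothetical summary of what the parent observed on one child.
type ChildHealth struct {
	Ready   bool
	Reason  string // e.g. "Starting", "Ready", "ImagePullError"
	Message string // copied from the child's own conditions where possible
}

// ChildReadyCondition builds the {ChildName}Ready condition for a single child.
func ChildReadyCondition(childName string, h ChildHealth) Condition {
	status := "False"
	if h.Ready {
		status = "True"
	}
	reason := h.Reason
	if reason == "" {
		// No reason reported: use the generic phase for a ready child,
		// otherwise mark health as undetermined (assumed convention).
		if h.Ready {
			reason = "Ready"
		} else {
			reason = "Unknown"
		}
	}
	return Condition{
		Type:    childName + "Ready",
		Status:  status,
		Reason:  reason,
		Message: h.Message,
	}
}
```

For example, calling ChildReadyCondition("Model", ...) with an ImagePullError reason yields a ModelReady condition with Status=False and the child's own message passed through unchanged.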
2.2 Parent-Level Conditions
Every AIM resource also publishes a small, standardized set of top-level conditions:
1) Ready
Represents overall readiness of the resource.
Ready=True → The resource is usable.
Ready=False → Look at reason + other conditions for details.
This is the primary high-level signal.
2) ConfigValid
Represents validity of the resource's spec.
ConfigValid=False → Invalid spec fields, incompatible combinations, etc.
The phase will be Failed.
This isolates user errors from runtime/system errors.
3) AuthValid
Represents validity of authentication/authorization, including:
- Missing or malformed secrets
- Invalid registry credentials
- Unauthorized remote API calls
- Image pull authentication failures
AuthValid=False → The phase becomes Degraded until fixed.
4) DependenciesReachable
Represents whether the controller can reach its required external dependencies.
This includes:
- Remote APIs (model registry, inference APIs, template providers)
- Storage endpoints
- Registry endpoints (excluding pure auth failures)
- Any external service the controller must fetch state from
Interpretation:
- DependenciesReachable=True → All required external dependencies responded normally.
- DependenciesReachable=False → Connectivity-like issues (timeouts, DNS failures, 5xx errors, refused connections).
Degradation logic for this condition
Transient issues do not immediately degrade the resource:
- If the resource is in Pending, this stays informational.
- Once in Starting/Ready, a timer-based threshold (e.g. 10 seconds) must pass with DependenciesReachable=False before the phase transitions into Degraded.
This prevents short-lived network hiccups from causing noisy readiness flapping.
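A minimal sketch of this threshold gate, assuming the State Engine tracks how long DependenciesReachable has been False. The function name, the Phase type, and the constant names are hypothetical, and the 10-second default is only the example value from the text.

```go
package main

import "time"

// Phase is a local stand-in for the shared phase type.
type Phase string

const (
	PhasePending  Phase = "Pending"
	PhaseStarting Phase = "Starting"
	PhaseReady    Phase = "Ready"
	PhaseDegraded Phase = "Degraded"
)

// DefaultUnreachableThreshold uses the 10-second example from the text;
// the real value is a design choice.
const DefaultUnreachableThreshold = 10 * time.Second

// PhaseAfterUnreachable returns the phase to report, given the current phase
// and how long DependenciesReachable has been False.
func PhaseAfterUnreachable(current Phase, unreachableFor, threshold time.Duration) Phase {
	switch current {
	case PhaseStarting, PhaseReady:
		// Degrade only once the outage has lasted at least the threshold.
		if unreachableFor >= threshold {
			return PhaseDegraded
		}
	}
	// Pending (and any other phase) is left unchanged: the condition
	// stays informational there.
	return current
}
```

Passing the elapsed duration in, rather than reading the clock inside the function, keeps the rule deterministic and easy to unit test.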
3. How Issue Classes Surface Across Conditions
This is the core of the model:
3.1 Spec errors
- ConfigValid=False
- Ready=False
- Phase=Failed
- Relevant child conditions (if applicable): {ChildName}Ready=False, Reason=InvalidSpec
3.2 Auth errors
- AuthValid=False
- Ready=False, Reason=AuthFailed
- Phase=Degraded
- Child: {ChildName}Ready=False, Reason=AuthFailed
3.3 External connectivity errors
- DependenciesReachable=False
- Ready=False, Reason=DependenciesUnreachable (or similar)
- Phase=Starting → Degraded after threshold
- Child health: State=Unknown for components whose health could not be evaluated → {ChildName}Ready=False, Reason=Unknown
3.4 Capacity/pending issues (e.g., no GPUs)
- Not an error; a form of progressing.
- Ready=False, Reason=PendingResources or InsufficientCapacity
- Phase=Starting (or Degraded if persistently blocked)
- Child shows: {ChildName}Ready=False, Reason=InsufficientCapacity
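The four issue classes above could be folded into a single phase and Ready reason along these lines. This is only a sketch: the Signals and Summary types, the Summarize function, the precedence order (spec, then auth, then connectivity, then capacity), and the InvalidSpec reason on the parent are all assumptions, since the proposal does not fix them.

```go
package main

// Signals is a hypothetical input to the State Engine for one resource.
type Signals struct {
	ConfigValid              bool
	AuthValid                bool
	DependenciesReachable    bool
	UnreachablePastThreshold bool // outage has exceeded the degradation threshold
	CapacityBlocked          bool // e.g. no GPUs available yet
	ChildrenReady            bool
}

// Summary is the resulting phase plus the top-level Ready condition.
type Summary struct {
	Phase  string
	Ready  bool
	Reason string
}

// Summarize applies the issue classes in an assumed precedence order.
func Summarize(s Signals) Summary {
	switch {
	case !s.ConfigValid:
		return Summary{Phase: "Failed", Ready: false, Reason: "InvalidSpec"}
	case !s.AuthValid:
		return Summary{Phase: "Degraded", Ready: false, Reason: "AuthFailed"}
	case !s.DependenciesReachable:
		phase := "Starting"
		if s.UnreachablePastThreshold {
			phase = "Degraded"
		}
		return Summary{Phase: phase, Ready: false, Reason: "DependenciesUnreachable"}
	case s.CapacityBlocked:
		// Not an error: a form of progressing.
		return Summary{Phase: "Starting", Ready: false, Reason: "InsufficientCapacity"}
	case !s.ChildrenReady:
		return Summary{Phase: "Starting", Ready: false, Reason: "Starting"}
	default:
		return Summary{Phase: "Ready", Ready: true, Reason: "Ready"}
	}
}
```

Putting spec errors first reflects the idea that an invalid spec cannot be fixed by waiting, whereas the other classes may resolve on their own or through credential and infrastructure changes.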
4. Summary of Final Conditions Set
Child-level
{ChildName}Ready (True/False)
Parent-level
Ready
ConfigValid
AuthValid
DependenciesReachable
Everything else (e.g. workload behavior, missing dependencies, infra health) is expressed through:
- Phase transitions (Pending → Starting → Ready → Degraded → Failed)
- Reason/Message fields of these standard conditions
- {ChildName}Ready conditions reflecting detailed component-level health
5. Benefits of This Model
- Fully machine-parseable structure
- Minimal cognitive overhead for clients (just 4 parent-level conditions)
- Consistent diagnostics across all AIM resources
- Clear separation between:
- user mistakes (ConfigValid)
- auth issues (AuthValid)
- remote problems (DependenciesReachable)
- capacity/progress issues (child Ready conditions)
- No condition flapping from short-lived infra issues, because degradation is gated by a time threshold
- Fits naturally with a centralized State Engine design