Error Categorization #4713
Description
Motivation
Armada has multiple mechanisms for handling job failures (executor pod checks, lease loss retries), but they don't share a consistent classification of why a job failed. The raw signals (exit code 74, termination message "CUDA error: GPU has fallen off the bus") flow through the system but carry no semantic meaning.
We want a single classification that serves observability. Instead of every consumer needing to understand raw k8s signals, the executor classifies failures into named categories that flow through the whole system.
Design
Classify failures at the executor
The executor already has the richest failure data: full pod status, Kubernetes events, container states. Instead of passing all k8s events and conditions up to the scheduler for it to make sense of, the executor classifies the failure into a named category and attaches it to the job run as a label.
Matching rules (exit codes, termination messages, conditions) run at the executor to set a "termination reason" category on the job. This can then flow through the whole system without every component needing to understand raw k8s signals.
Categories can live locally in the executor cluster. A GPU cluster defines cuda_error and infiniband_error; a CPU cluster doesn't need them.
Proto schema
A FailureInfo message on the Error proto carries structured failure metadata from the executor through Pulsar to Lookout:
- `exit_code` - exit code of the first failed container
- `termination_message` - termination message from the first failed container
- `categories` - executor-assigned category labels matched from config
- `container_name` - which container the exit code and termination message were extracted from
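A minimal sketch of what this message could look like. Field names and numbers here are illustrative assumptions, not the actual Armada proto definition:

```proto
// Hypothetical sketch of the FailureInfo message carried on the Error proto.
message FailureInfo {
  int32 exit_code = 1;             // exit code of the first failed container
  string termination_message = 2;  // termination message from that container
  repeated string categories = 3;  // executor-assigned category labels
  string container_name = 4;       // container the data was extracted from
}
```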
Categories
Categories are configured in the executor; a category is just a name plus matching rules:
```yaml
errorCategories:
  - name: cuda_error
    rules:
      - onTerminationMessage: { pattern: ".*(CUDA error|Xid error).*" }
      - onExitCodes: { operator: In, values: [74, 75] }
  - name: infiniband_error
    rules:
      - onExitCodes: { operator: In, values: [42, 43] }
      - onTerminationMessage: { pattern: ".*InfiniBand.*link down.*" }
  - name: oom
    rules:
      - onConditions: [OOMKilled]
      - onExitCodes: { operator: In, values: [137] }
  - name: sidecar_crash
    rules:
      - containerName: istio-proxy
        onExitCodes: { operator: NotIn, values: [0] }
```

Rules within a category are OR'd: if any rule matches (the exit code is in the listed set, or the termination message matches the pattern), the category matches.
Each rule supports an optional containerName field. When set, the rule only matches failures from that specific container. When empty (the default), failures from any container can match. This is useful for multi-container pods where sidecars and init containers may fail independently and you want distinct categories for them.
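The OR semantics and the `containerName` gate above can be sketched in Go. The types and function names here are illustrative, not Armada's actual implementation, and only exit-code membership (the `In` operator) and termination-message patterns are modeled:

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative config types; Armada's real structs differ.
type Rule struct {
	ContainerName      string // empty matches failures from any container
	ExitCodes          []int  // matches if the exit code is in this set
	TerminationPattern string // regex matched against the termination message
}

type Category struct {
	Name  string
	Rules []Rule
}

type ContainerFailure struct {
	Name               string
	ExitCode           int
	TerminationMessage string
}

// classify returns the names of all categories with at least one matching
// rule: rules within a category are OR'd.
func classify(categories []Category, f ContainerFailure) []string {
	var matched []string
	for _, c := range categories {
		for _, r := range c.Rules {
			if ruleMatches(r, f) {
				matched = append(matched, c.Name)
				break
			}
		}
	}
	return matched
}

func ruleMatches(r Rule, f ContainerFailure) bool {
	// A containerName scopes the rule to one container.
	if r.ContainerName != "" && r.ContainerName != f.Name {
		return false
	}
	for _, code := range r.ExitCodes {
		if code == f.ExitCode {
			return true
		}
	}
	if r.TerminationPattern != "" {
		if ok, _ := regexp.MatchString(r.TerminationPattern, f.TerminationMessage); ok {
			return true
		}
	}
	return false
}

func main() {
	cats := []Category{
		{Name: "cuda_error", Rules: []Rule{
			{TerminationPattern: `.*(CUDA error|Xid error).*`},
			{ExitCodes: []int{74, 75}},
		}},
	}
	fmt.Println(classify(cats, ContainerFailure{Name: "main", ExitCode: 74})) // prints [cuda_error]
}
```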
Data flow
- Executor: on pod failure, the classifier evaluates all configured category rules against the pod status. Matched category names plus container failure details are packed into a `FailureInfo` and attached to the error event sent via Pulsar.
- Lookout ingester: extracts `FailureInfo` from the error event and stores it as a JSONB column (`failure_info`) on the `job_run` table.
- Lookout API: returns `failureInfo` on each run in the jobs response, including exit code, termination message, categories, and container name.
- Lookout UI: displays error categories in the jobs table as a column with client-side filtering, and shows full failure details (container name, categories, termination message) in the run details sidebar.
Enhanced failure debug info
Beyond categorization, capture richer structured data on failures to aid debugging and observability:
- Failing container name - included in `ContainerError.objectMeta`; identifies which container failed.
- Termination reason - included in `ContainerError.reason`; carries the `terminated.reason` from the container status.
- Did user code run - distinguish failures in user-submitted containers vs platform-injected containers (preflight controllers, sidecars, etc.). User code is whatever the user defined in their job submission (init containers and main containers); anything injected after submission is non-user code. The approach for detecting this needs further design: mutations can happen at multiple points in the pipeline (Armada-level tools, Kubernetes webhooks), so the solution needs to work regardless of where injection occurs.
- Node conditions - when a pod fails, fetch the node via `GetNode(nodeName)` and capture `node.Status.Conditions` (Ready, DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable). If a job fails and the node has `MemoryPressure=True`, that's infra, not user fault. Store on `FailureInfo` so the data flows through Pulsar to Lookout alongside categories.
- Warning events (structured) - pod events are already fetched but dumped as a text blob via `describe.DescribeEvents()`. Capture them as structured data instead: event reason, message, type (Warning/Normal), timestamp, source component. This lets categorization rules match on events, and Lookout can display them as a table. Cap at 10 events (Warning-first) to bound proto message size.
- Pod status snapshot - capture key fields from `pod.Status` structurally: conditions, container statuses (all containers, including injected ones), phase, reason, message, start time. Specific fields in the proto, not a raw JSON blob.
- Stack trace summary - if the termination message is longer than ~10 lines, extract the last 10 lines as a summary field. Most stack traces have the actual error at the bottom.
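The stack trace summary step is simple enough to sketch directly. This is a minimal illustration of the "last ~10 lines" idea, not Armada's code; the function name is made up:

```go
package main

import (
	"fmt"
	"strings"
)

// summarise returns msg unchanged when it has at most n lines, otherwise
// only the last n lines, where a stack trace usually carries the actual error.
func summarise(msg string, n int) string {
	lines := strings.Split(strings.TrimRight(msg, "\n"), "\n")
	if len(lines) <= n {
		return msg
	}
	return strings.Join(lines[len(lines)-n:], "\n")
}

func main() {
	trace := strings.Repeat("  at some.Frame()\n", 30) + "RuntimeError: CUDA error"
	// Prints 9 frame lines followed by the final error line.
	fmt.Println(summarise(trace, 10))
}
```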
Lookout UI - job insights and observability
Categories and debug info flow through to Lookout UI and parquet data, giving users a way to understand failure patterns across jobs:
- Filter and search jobs by error category in Lookout
- See category labels on job run details (e.g. "cuda_error" instead of raw exit code 74)
- See whether user code ran, node conditions at time of failure, structured warning events
- Query parquet data for analytics, e.g. "what percentage of failures this week were infra vs user errors?"
Related PRs
- PR Add error categorization proto schema and executor classifier #4741 - Proto schema, shared matching primitives, executor classifier
- PR Wire error categories into executor event reporting #4745 - Wire classifier into executor event reporting
- PR Store FailureInfo in lookout DB and expose via API #4755 - Lookout ingester, API, and UI
- PR Add e2e test cases for error categorization #4760 - E2e test infrastructure for asserting categories
Future work (out of scope)
- Retry policies using categories - tracked in Native support for retry policies #4683. Once categories are in place, the retry policy engine can reference them via `onFailureCategory` rules.