
Error Categorization #4713

@dejanzele

Description


Motivation

Armada has multiple mechanisms for handling job failures (executor pod checks, lease loss retries), but they don't share a consistent classification of why a job failed. The raw signals (exit code 74, termination message "CUDA error: GPU has fallen off the bus") flow through the system but carry no semantic meaning.

We want a single classification that serves observability. Instead of every consumer needing to understand raw k8s signals, the executor classifies failures into named categories that flow through the whole system.

Design

Classify failures at the executor

The executor already has the richest failure data: full pod status, Kubernetes events, container states. Instead of passing all k8s events and conditions up to the scheduler for it to make sense of, the executor classifies the failure into a named category and attaches it to the job run as a label.

Matching rules (exit codes, termination messages, conditions) run at the executor to set a "termination reason" category on the job. This can then flow through the whole system without every component needing to understand raw k8s signals.

Categories can live locally in the executor cluster. A GPU cluster defines cuda_error and infiniband_error; a CPU cluster doesn't need them.

Proto schema

A FailureInfo message on the Error proto carries structured failure metadata from the executor through Pulsar to Lookout:

  • exit_code - exit code of the first failed container
  • termination_message - termination message from the first failed container
  • categories - executor-assigned category labels matched from config
  • container_name - which container the exit code and termination message were extracted from
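Put together, the fields above might look roughly like this in proto. This is a sketch only; field numbers, naming, and message placement are assumptions, not the actual Armada schema:

```proto
// Sketch: attached to the Error proto and carried from the executor
// through Pulsar to Lookout. Field numbers are illustrative.
message FailureInfo {
  // Exit code of the first failed container.
  int32 exit_code = 1;
  // Termination message from the first failed container.
  string termination_message = 2;
  // Executor-assigned category labels matched from config.
  repeated string categories = 3;
  // Container the exit code and termination message were extracted from.
  string container_name = 4;
}
```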

Categories

Categories are configured in the executor; a category is just a name plus matching rules:

errorCategories:
  - name: cuda_error
    rules:
      - onTerminationMessage: { pattern: ".*(CUDA error|Xid error).*" }
      - onExitCodes: { operator: In, values: [74, 75] }

  - name: infiniband_error
    rules:
      - onExitCodes: { operator: In, values: [42, 43] }
      - onTerminationMessage: { pattern: ".*InfiniBand.*link down.*" }

  - name: oom
    rules:
      - onConditions: [OOMKilled]
      - onExitCodes: { operator: In, values: [137] }

  - name: sidecar_crash
    rules:
      - containerName: istio-proxy
        onExitCodes: { operator: NotIn, values: [0] }

Rules within a category are OR'd: if any single rule matches (say, the exit code is in one rule's values, or the termination message matches another rule's pattern), the category matches.

Each rule supports an optional containerName field. When set, the rule only matches failures from that specific container. When empty (the default), failures from any container can match. This is useful for multi-container pods where sidecars and init containers may fail independently and you want distinct categories for them.
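As a minimal sketch of these matching semantics, assuming only the `In` operator and treating `containerName` as a gate on the other matchers (type and function names here are illustrative, not Armada's actual implementation):

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule is one matching clause. Names are illustrative, not Armada's types.
type Rule struct {
	ContainerName      string // if set, the rule only applies to this container
	ExitCodes          []int  // matches if the exit code is in this set ("In")
	TerminationPattern string // matches if the termination message matches
}

// Failure is the raw signal extracted from a failed container.
type Failure struct {
	ContainerName      string
	ExitCode           int
	TerminationMessage string
}

func (r Rule) matches(f Failure) bool {
	// containerName gates the rule: wrong container, no match.
	if r.ContainerName != "" && r.ContainerName != f.ContainerName {
		return false
	}
	for _, c := range r.ExitCodes {
		if c == f.ExitCode {
			return true
		}
	}
	if r.TerminationPattern != "" {
		if ok, _ := regexp.MatchString(r.TerminationPattern, f.TerminationMessage); ok {
			return true
		}
	}
	return false
}

// Category matches if ANY of its rules match (rules are OR'd).
type Category struct {
	Name  string
	Rules []Rule
}

// classify returns the names of all categories whose rules match.
func classify(cats []Category, f Failure) []string {
	var matched []string
	for _, c := range cats {
		for _, r := range c.Rules {
			if r.matches(f) {
				matched = append(matched, c.Name)
				break
			}
		}
	}
	return matched
}

func main() {
	cats := []Category{
		{Name: "cuda_error", Rules: []Rule{
			{TerminationPattern: `(CUDA error|Xid error)`},
			{ExitCodes: []int{74, 75}},
		}},
		{Name: "sidecar_crash", Rules: []Rule{
			{ContainerName: "istio-proxy", ExitCodes: []int{1, 137}},
		}},
	}
	f := Failure{ContainerName: "main", ExitCode: 74}
	fmt.Println(classify(cats, f)) // the cuda_error exit-code rule matches
}
```

Note that a job can match several categories at once; the design stores `categories` as a list rather than picking a single winner.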

Data flow

  1. Executor: On pod failure, the classifier evaluates all configured category rules against the pod status. Matched category names plus container failure details are packed into a FailureInfo and attached to the error event sent via Pulsar.
  2. Lookout ingester: Extracts FailureInfo from the error event and stores it as a JSONB column (failure_info) on the job_run table.
  3. Lookout API: Returns failureInfo on each run in the jobs response, including exit code, termination message, categories, and container name.
  4. Lookout UI: Displays error categories in the jobs table as a column with client-side filtering, and shows full failure details (container name, categories, termination message) in the run details sidebar.

Enhanced failure debug info

Beyond categorization, capture richer structured data on failures to aid debugging and observability:

  • Failing container name - included in ContainerError.objectMeta, identifies which container failed.

  • Termination reason - included in ContainerError.reason, carries the terminated.reason from the container status.

  • Did user code run - distinguish failures in user-submitted containers vs platform-injected containers (preflight controllers, sidecars, etc.). User code is whatever the user defined in their job submission (init containers and main containers). Anything injected after submission is non-user code. The approach for detecting this needs further design, as mutations can happen at multiple points in the pipeline (Armada-level tools, Kubernetes webhooks), so the solution needs to work regardless of where injection occurs.

  • Node conditions - when a pod fails, fetch the node via GetNode(nodeName) and capture node.Status.Conditions (Ready, DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable). If a job fails and the node has MemoryPressure=True, that's infra, not user fault. Store on FailureInfo so the data flows through Pulsar to Lookout alongside categories.

  • Warning events (structured) - pod events are already fetched but dumped as a text blob via describe.DescribeEvents(). Capture them as structured data instead: event reason, message, type (Warning/Normal), timestamp, source component. This lets categorization rules match on events and Lookout can display them as a table. Cap at 10 events (Warning-first) to bound proto message size.

  • Pod status snapshot - capture key fields from pod.Status structurally: conditions, container statuses (all containers, including injected ones), phase, reason, message, start time. Not a raw JSON blob, specific fields in the proto.

  • Stack trace summary - if the termination message is longer than ~10 lines, extract the last 10 lines as a summary field. Most stack traces have the actual error at the bottom.

Lookout UI - job insights and observability

Categories and debug info flow through to Lookout UI and parquet data, giving users a way to understand failure patterns across jobs:

  • Filter and search jobs by error category in Lookout
  • See category labels on job run details (e.g. "cuda_error" instead of raw exit code 74)
  • See whether user code ran, node conditions at time of failure, structured warning events
  • Query parquet data for analytics, e.g. "what percentage of failures this week were infra vs user errors?"

Related PRs

Future work (out of scope)
