
Error Categorization #4713

@dejanzele

Description


Motivation

Armada has multiple mechanisms for handling job failures (executor pod checks, lease loss retries), but they don't share a consistent classification of why a job failed. The raw signals (exit code 74, termination message "CUDA error: GPU has fallen off the bus") flow through the system but carry no semantic meaning.

We want a single classification that serves observability. Instead of every consumer needing to understand raw k8s signals, the executor classifies failures into named categories that flow through the whole system.

Design

Classify failures at the executor

The executor already has the richest failure data: full pod status, Kubernetes events, container states. Instead of passing all k8s events and conditions up to the scheduler for it to make sense of, the executor classifies the failure into a named category and attaches it to the job run as a label.

Matching rules (exit codes, termination messages, conditions) run at the executor to set a "termination reason" category on the job. This can then flow through the whole system without every component needing to understand raw k8s signals.

Categories can live locally in the executor cluster. A GPU cluster defines cuda_error and infiniband_error; a CPU cluster doesn't need them.

Proto schema

A FailureInfo message on the Error proto carries structured failure metadata from the executor through Pulsar to Lookout:

  • exit_code - exit code of the first failed container
  • termination_message - termination message from the first failed container
  • categories - executor-assigned category labels matched from config
  • container_name - which container the exit code and termination message were extracted from
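Put together, the fields above might look roughly like this in proto. This is a sketch only; field numbers, naming, and message placement are assumptions, not the actual Armada schema:

```proto
// Sketch: attached to the Error proto and carried from the executor
// through Pulsar to Lookout. Field numbers are illustrative.
message FailureInfo {
  // Exit code of the first failed container.
  int32 exit_code = 1;
  // Termination message from the first failed container.
  string termination_message = 2;
  // Executor-assigned category labels matched from config.
  repeated string categories = 3;
  // Container the exit code and termination message were extracted from.
  string container_name = 4;
}
```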

Categories

Categories are configured in the executor; a category is just a name plus matching rules:

errorCategories:
  - name: cuda_error
    rules:
      - onTerminationMessage: { pattern: ".*(CUDA error|Xid error).*" }
      - onExitCodes: { operator: In, values: [74, 75] }

  - name: infiniband_error
    rules:
      - onExitCodes: { operator: In, values: [42, 43] }
      - onTerminationMessage: { pattern: ".*InfiniBand.*link down.*" }

  - name: oom
    rules:
      - onConditions: [OOMKilled]
      - onExitCodes: { operator: In, values: [137] }

  - name: sidecar_crash
    rules:
      - containerName: istio-proxy
        onExitCodes: { operator: NotIn, values: [0] }

Rules within a category are OR'd: if any single rule matches (say, the exit code is in one rule's values, or the termination message matches another rule's pattern), the category matches.

Each rule supports an optional containerName field. When set, the rule only matches failures from that specific container. When empty (the default), failures from any container can match. This is useful for multi-container pods where sidecars and init containers may fail independently and you want distinct categories for them.
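As a minimal sketch of these matching semantics, assuming only the `In` operator and treating `containerName` as a gate on the other matchers (type and function names here are illustrative, not Armada's actual implementation):

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule is one matching clause. Names are illustrative, not Armada's types.
type Rule struct {
	ContainerName      string // if set, the rule only applies to this container
	ExitCodes          []int  // matches if the exit code is in this set ("In")
	TerminationPattern string // matches if the termination message matches
}

// Failure is the raw signal extracted from a failed container.
type Failure struct {
	ContainerName      string
	ExitCode           int
	TerminationMessage string
}

func (r Rule) matches(f Failure) bool {
	// containerName gates the rule: wrong container, no match.
	if r.ContainerName != "" && r.ContainerName != f.ContainerName {
		return false
	}
	for _, c := range r.ExitCodes {
		if c == f.ExitCode {
			return true
		}
	}
	if r.TerminationPattern != "" {
		if ok, _ := regexp.MatchString(r.TerminationPattern, f.TerminationMessage); ok {
			return true
		}
	}
	return false
}

// Category matches if ANY of its rules match (rules are OR'd).
type Category struct {
	Name  string
	Rules []Rule
}

// classify returns the names of all categories whose rules match.
func classify(cats []Category, f Failure) []string {
	var matched []string
	for _, c := range cats {
		for _, r := range c.Rules {
			if r.matches(f) {
				matched = append(matched, c.Name)
				break
			}
		}
	}
	return matched
}

func main() {
	cats := []Category{
		{Name: "cuda_error", Rules: []Rule{
			{TerminationPattern: `(CUDA error|Xid error)`},
			{ExitCodes: []int{74, 75}},
		}},
		{Name: "sidecar_crash", Rules: []Rule{
			{ContainerName: "istio-proxy", ExitCodes: []int{1, 137}},
		}},
	}
	f := Failure{ContainerName: "main", ExitCode: 74}
	fmt.Println(classify(cats, f)) // the cuda_error exit-code rule matches
}
```

Note that a job can match several categories at once; the design stores `categories` as a list rather than picking a single winner.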

Data flow

  1. Executor: On pod failure, the classifier evaluates all configured category rules against the pod status. Matched category names plus container failure details are packed into a FailureInfo and attached to the error event sent via Pulsar.
  2. Lookout ingester: Extracts FailureInfo from the error event and stores it as a JSONB column (failure_info) on the job_run table.
  3. Lookout API: Returns failureInfo on each run in the jobs response, including exit code, termination message, categories, and container name.
  4. Lookout UI: Displays error categories in the jobs table as a column with client-side filtering, and shows full failure details (container name, categories, termination message) in the run details sidebar.

Enhanced failure debug info

Beyond categorization, capture richer structured data on failures to aid debugging and observability:

  • Failing container name - included in ContainerError.objectMeta, identifies which container failed.

  • Termination reason - included in ContainerError.reason, carries the terminated.reason from the container status.

  • Did user code run - distinguish failures in user-submitted containers vs platform-injected containers (preflight controllers, sidecars, etc.). User code is whatever the user defined in their job submission (init containers and main containers). Anything injected after submission is non-user code. The approach for detecting this needs further design, as mutations can happen at multiple points in the pipeline (Armada-level tools, Kubernetes webhooks), so the solution needs to work regardless of where injection occurs.

  • Node conditions - when a pod fails, fetch the node via GetNode(nodeName) and capture node.Status.Conditions (Ready, DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable). If a job fails and the node has MemoryPressure=True, that's infra, not user fault. Store on FailureInfo so the data flows through Pulsar to Lookout alongside categories.

  • Warning events (structured) - pod events are already fetched but dumped as a text blob via describe.DescribeEvents(). Capture them as structured data instead: event reason, message, type (Warning/Normal), timestamp, source component. This lets categorization rules match on events and Lookout can display them as a table. Cap at 10 events (Warning-first) to bound proto message size.

  • Pod status snapshot - capture key fields from pod.Status structurally: conditions, container statuses (all containers, including injected ones), phase, reason, message, start time. Not a raw JSON blob, specific fields in the proto.

  • Stack trace summary - if the termination message is longer than ~10 lines, extract the last 10 lines as a summary field. Most stack traces have the actual error at the bottom.

Lookout UI - job insights and observability

Categories and debug info flow through to Lookout UI and parquet data, giving users a way to understand failure patterns across jobs:

  • Filter and search jobs by error category in Lookout
  • See category labels on job run details (e.g. "cuda_error" instead of raw exit code 74)
  • See whether user code ran, node conditions at time of failure, structured warning events
  • Query parquet data for analytics, e.g. "what percentage of failures this week were infra vs user errors?"

Related PRs

Future work (out of scope)
