
Add error categorization proto schema and executor classifier #4741

Merged
dejanzele merged 3 commits into armadaproject:master from dejanzele:error-categorization-proto
Apr 1, 2026

Conversation

@dejanzele
Member

@dejanzele dejanzele commented Mar 5, 2026

What type of PR is this?

Feature (PR 1 of 4)

What this PR does / why we need it

Adds the proto schema and shared building blocks for error categorization in Armada.

  • Adds FailureInfo message to the Error proto with fields: exit_code, termination_message, categories, container_name
  • Adds errormatch package (internal/common/errormatch/) with shared matching primitives: ExitCodeMatcher (In/NotIn operators), RegexMatcher, and Kubernetes condition constants (OOMKilled, Evicted, DeadlineExceeded)
  • Adds categorizer package (internal/executor/categorizer/) - configurable classifier that matches pod failures against rules (exit codes, termination messages, Kubernetes conditions) and assigns named categories, with optional containerName scoping per rule
  • Adds ExtractFailureInfo() in pod_status.go to extract exit code, termination message, and container name from Kubernetes pod status into the FailureInfo proto
  • Adds ErrorCategories config field under ApplicationConfiguration for defining category rules

Nothing is wired into the event reporting path yet - this PR provides the building blocks that PR #4745 connects.

Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

Special notes for your reviewer

@dejanzele dejanzele changed the title from "Add FailureCondition enum and FailureInfo message to Error proto" to "Add error categorization proto schema and executor classifier" Mar 5, 2026
@dejanzele dejanzele force-pushed the error-categorization-proto branch 4 times, most recently from 0303967 to 8d209d5 Compare March 6, 2026 13:43
@dejanzele dejanzele force-pushed the error-categorization-proto branch 2 times, most recently from 371113a to da1d029 Compare March 9, 2026 10:21
@dejanzele dejanzele force-pushed the error-categorization-proto branch from da1d029 to 052ced0 Compare March 9, 2026 13:03
@greptile-apps
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR lays the foundational building blocks for error categorization in Armada: a FailureInfo proto message, a shared errormatch package of matching primitives, a configurable categorizer classifier, and ExtractFailureInfo() in pod_status.go. Nothing is wired into the event-reporting path yet; that comes in PR #4745.

The implementation is well-structured with thorough tests and good validation at construction time. Several issues flagged in earlier review rounds have been addressed (exit-code-0 filtering in failedContainers, empty-rules validation in NewClassifier, nil guard in MatchExitCode).

Findings:

  • AppError in categorizer/doc.go: The package doc lists AppError as a valid OnConditions value, but it is not present in errormatch.KnownConditions. Any operator who follows this documentation and adds "AppError" to their config will get a runtime error from NewClassifier, blocking executor startup.
  • containerName silently ignored for pod-level conditions — When a CategoryRule specifies both containerName and an onConditions of Evicted or DeadlineExceeded, the container filter has no effect because pod-level conditions are matched against pod.Status.Reason, not the container list. A user expecting to scope an eviction/deadline rule to a specific container will get unexpected false positives with no error or warning.
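
To illustrate the first finding, here is a minimal sketch of construction-time validation against a set of known conditions. The condition names come from the PR description; the map layout and the validateConditions helper are assumptions for illustration and may differ from errormatch.KnownConditions and NewClassifier.

```go
package main

import "fmt"

// KnownConditions is a hypothetical stand-in for errormatch.KnownConditions.
var KnownConditions = map[string]bool{
	"OOMKilled":        true,
	"Evicted":          true,
	"DeadlineExceeded": true,
}

// validateConditions returns an error for the first condition that is not
// in KnownConditions. A constructor could call it and refuse to start.
func validateConditions(conditions []string) error {
	for _, c := range conditions {
		if !KnownConditions[c] {
			return fmt.Errorf("unknown condition %q", c)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateConditions([]string{"OOMKilled"}))
	fmt.Println(validateConditions([]string{"AppError"}))
}
```

Under this kind of check, a config copied from documentation that lists AppError fails at startup rather than being silently ignored, which is why the documentation error matters.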

Confidence Score: 4/5

Safe to merge after fixing the AppError documentation error in categorizer/doc.go — it will cause executor startup failures for any operator who follows the documented config.

One P1 documentation bug (AppError listed as valid condition) would directly cause executor startup failures for users following the godoc. The rest of the implementation is solid: previously-flagged bugs around exit-code-0 filtering, empty-rules validation, and the nil guard are all fixed. One P2 behavioral gap (containerName silently ignored for pod-level conditions) is noted for awareness.

internal/executor/categorizer/doc.go — remove AppError from the valid conditions list. internal/executor/categorizer/classifier.go — consider validating or documenting the containerName + pod-level condition interaction.
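
One way the suggested validation for the containerName and pod-level condition interaction could look is sketched below. The CategoryRule fields and the validateRule helper are hypothetical; whether the project prefers a hard error, a warning, or documentation only is not stated on this page.

```go
package main

import "fmt"

// CategoryRule is a simplified, hypothetical rule shape.
type CategoryRule struct {
	ContainerName string
	OnConditions  []string
}

// podLevelConditions are matched against the pod, not a container,
// so a container filter cannot narrow them.
var podLevelConditions = map[string]bool{
	"Evicted":          true,
	"DeadlineExceeded": true,
}

// validateRule rejects rules that combine containerName with a
// pod-level condition, since the filter would have no effect.
func validateRule(r CategoryRule) error {
	if r.ContainerName == "" {
		return nil
	}
	for _, c := range r.OnConditions {
		if podLevelConditions[c] {
			return fmt.Errorf("containerName %q has no effect with pod-level condition %q",
				r.ContainerName, c)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateRule(CategoryRule{ContainerName: "main", OnConditions: []string{"OOMKilled"}}))
	fmt.Println(validateRule(CategoryRule{ContainerName: "main", OnConditions: []string{"Evicted"}}))
}
```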

Important Files Changed

| Filename | Overview |
|---|---|
| internal/executor/categorizer/doc.go | Package documentation incorrectly lists AppError as a valid OnConditions value; it is not in KnownConditions and will be rejected at runtime by NewClassifier. |
| internal/executor/categorizer/classifier.go | Core classifier implementation; exit-code-0 filtering and empty-rules validation are correctly implemented. Minor behavioral gap: containerName filter is silently ignored when combined with pod-level conditions (Evicted, DeadlineExceeded). |
| internal/executor/categorizer/classifier_test.go | Thorough test coverage for all matchers, multi-category matches, container scoping, init containers, and validation errors. |
| internal/common/errormatch/match.go | Clean implementation of MatchExitCode (with nil guard and exit-code-0 exclusion) and MatchPattern; no issues found. |
| internal/common/errormatch/types.go | Defines ExitCodeMatcher, RegexMatcher, condition constants, and KnownConditions map; well-structured with correct set of pod-observable conditions. |
| internal/executor/util/pod_status.go | Adds ExtractFailureInfo() and replaces local reason-string constants with errormatch package constants; logic is correct. |
| pkg/armadaevents/events.proto | Adds FailureInfo message and failure_info field (tag 15) to Error; generated .pb.go is in sync with the proto source. |
| internal/executor/configuration/types.go | Adds ErrorCategories []categorizer.CategoryConfig field to ApplicationConfiguration with appropriate YAML tag and documentation. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pod fails] --> B[ExtractFailureInfo pod_status.go]
    A --> C[Classifier.Classify]

    B --> D[Iterate GetPodContainerStatuses]
    D --> E{First container\nwith exitCode != 0?}
    E -->|Yes| F[Set ExitCode, TerminationMessage, ContainerName in FailureInfo]
    E -->|No| G[ExitCode stays 0]

    C --> H[failedContainers\nfilter out exitCode == 0]
    H --> I[For each CategoryConfig]
    I --> J[categoryMatches: OR rules]
    J --> K{Rule matcher type?}
    K -->|onConditions OOMKilled| L[Check container.reason]
    K -->|onConditions Evicted/Deadline| M[Check pod.Status.Reason]
    K -->|onExitCodes| N[MatchExitCode]
    K -->|onTerminationMessage| O[MatchPattern regex]
    L & M & N & O --> P{Matched?}
    P -->|Yes| Q[Append category name]
    P -->|No| R[Next rule]

    F --> S[FailureInfo proto\nCategories + ExitCode + TerminationMessage + ContainerName]
    Q --> S
    G --> S
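
The classification half of the flowchart can be sketched in plain Go as follows. Pod, Container, Rule, and Category are simplified stand-ins for the Kubernetes and categorizer types, and the sketch assumes each rule sets a single matcher and that exit-code rules use a plain list; none of this is the actual classifier.go code. The pod-level branch deliberately ignores ContainerName, which reproduces the behavioural gap noted in the findings.

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-ins for container and pod status.
type Container struct {
	Name, Reason, Message string
	ExitCode              int32
}

type Pod struct {
	Reason     string
	Containers []Container
}

// Rule is assumed to set exactly one matcher.
type Rule struct {
	ContainerName string
	OnConditions  []string
	OnExitCodes   []int32
	OnMessage     *regexp.Regexp
}

type Category struct {
	Name  string
	Rules []Rule
}

// failedContainers drops containers that exited with code 0.
func failedContainers(p Pod) []Container {
	var out []Container
	for _, c := range p.Containers {
		if c.ExitCode != 0 {
			out = append(out, c)
		}
	}
	return out
}

func ruleMatches(r Rule, p Pod, failed []Container) bool {
	// Pod-level conditions are checked against the pod reason only.
	for _, cond := range r.OnConditions {
		if (cond == "Evicted" || cond == "DeadlineExceeded") && p.Reason == cond {
			return true
		}
	}
	for _, c := range failed {
		if r.ContainerName != "" && r.ContainerName != c.Name {
			continue
		}
		for _, cond := range r.OnConditions {
			if cond == "OOMKilled" && c.Reason == cond {
				return true
			}
		}
		for _, code := range r.OnExitCodes {
			if code == c.ExitCode {
				return true
			}
		}
		if r.OnMessage != nil && r.OnMessage.MatchString(c.Message) {
			return true
		}
	}
	return false
}

// Classify returns the names of all categories with at least one matching rule.
func Classify(cats []Category, p Pod) []string {
	failed := failedContainers(p)
	names := []string{}
	for _, cat := range cats {
		for _, r := range cat.Rules {
			if ruleMatches(r, p, failed) {
				names = append(names, cat.Name)
				break // rules within a category are ORed
			}
		}
	}
	return names
}

func main() {
	cats := []Category{
		{Name: "oom", Rules: []Rule{{OnConditions: []string{"OOMKilled"}}}},
		{Name: "user_error", Rules: []Rule{{OnExitCodes: []int32{1}}}},
	}
	pod := Pod{Containers: []Container{{Name: "main", ExitCode: 137, Reason: "OOMKilled"}}}
	fmt.Println(Classify(cats, pod))
}
```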

Reviews (25): Last reviewed commit: "Merge branch 'master' into error-categor..."

@dejanzele dejanzele force-pushed the error-categorization-proto branch from 052ced0 to 7af38d0 Compare March 9, 2026 13:19
@dejanzele dejanzele force-pushed the error-categorization-proto branch 2 times, most recently from 22d10e0 to f8a4e3e Compare March 10, 2026 12:34
@dejanzele
Member Author

@greptile

@dejanzele dejanzele force-pushed the error-categorization-proto branch from f8a4e3e to def537a Compare March 10, 2026 16:50
@dejanzele
Member Author

@greptile

@dejanzele dejanzele force-pushed the error-categorization-proto branch 2 times, most recently from 67989ad to fddf4eb Compare March 12, 2026 12:31
@dejanzele
Member Author

@greptile

@dejanzele dejanzele force-pushed the error-categorization-proto branch from fddf4eb to 1cc4fc4 Compare March 12, 2026 14:14
@dejanzele
Member Author

Changed the default from FAILURE_CONDITION_USER_ERROR to FAILURE_CONDITION_UNSPECIFIED - at large scale various infra/hardware/k8s errors appear that aren't the user's fault, so the catch-all should be honest about not knowing. The USER_ERROR enum value stays in the proto for operators who want to explicitly classify via errorCategories config.

@dejanzele dejanzele force-pushed the error-categorization-proto branch 3 times, most recently from 1e4008e to b1fe76c Compare March 13, 2026 12:19
@dejanzele dejanzele force-pushed the error-categorization-proto branch 3 times, most recently from dd8eb9e to a7d60fe Compare March 17, 2026 14:39
@dejanzele dejanzele force-pushed the error-categorization-proto branch from a7d60fe to 5f8edfb Compare March 30, 2026 08:26
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the error-categorization-proto branch 3 times, most recently from babf18c to 32d79be Compare March 30, 2026 15:57
@dejanzele
Member Author

@greptileai

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the error-categorization-proto branch from 32d79be to e3c73a7 Compare March 31, 2026 08:19
@dejanzele dejanzele enabled auto-merge (squash) April 1, 2026 15:37
@dejanzele dejanzele merged commit ccd1cb3 into armadaproject:master Apr 1, 2026
17 checks passed
dejanzele added a commit that referenced this pull request Apr 2, 2026
## What type of PR is this?

Feature (PR 2 of 4)

## What this PR does / why we need it

Wires the classifier and FailureInfo into the executor's event reporting
path.

- Constructs the `Classifier` from config at executor startup
unconditionally (with no rules configured, it simply returns empty
categories)
- Passes classifier to the pod issue handler and job state reporter
- Calls `ExtractFailureInfo()` + `classifier.Classify()` on every pod
failure, attaching structured `FailureInfo` (exit code, termination
message, categories, container name) to the `Error` events sent through
Pulsar
- After this PR, every failed pod event flowing into Pulsar carries exit
code, termination message, container name, and matched category names
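
A minimal sketch of the "always construct the classifier" idea described in the bullets above, using hypothetical types (the real `Classifier` takes pod status and richer rules). The point it illustrates is that an empty configuration yields a usable, non-nil classifier that returns no categories, so call sites need no nil checks.

```go
package main

import "fmt"

// CategoryConfig is a hypothetical, exit-code-only category definition.
type CategoryConfig struct {
	Name      string
	ExitCodes []int32
}

// Classifier is a simplified stand-in for the executor classifier.
type Classifier struct {
	categories []CategoryConfig
}

// NewClassifier validates each category but accepts an empty config.
func NewClassifier(cfg []CategoryConfig) (*Classifier, error) {
	for _, c := range cfg {
		if c.Name == "" {
			return nil, fmt.Errorf("category name must not be empty")
		}
		if len(c.ExitCodes) == 0 {
			return nil, fmt.Errorf("category %q has no rules", c.Name)
		}
	}
	return &Classifier{categories: cfg}, nil
}

// Classify returns matching category names; it is never nil.
func (c *Classifier) Classify(exitCode int32) []string {
	names := []string{}
	for _, cat := range c.categories {
		for _, code := range cat.ExitCodes {
			if code == exitCode {
				names = append(names, cat.Name)
				break
			}
		}
	}
	return names
}

func main() {
	// No rules configured: construction still succeeds.
	classifier, err := NewClassifier(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(classifier.Classify(1)))
}
```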

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 2 of 4: Proto + classifier (#4741) -> Wire into executor
(this) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- Depends on #4741
- The classifier is always instantiated at startup, even with no rules
configured - no conditional nil-checking needed at call sites
- Changes are in `application.go` (startup wiring),
`pod_issue_handler.go`, `job_state_reporter.go`, `reporter/event.go`,
`cluster_allocation.go`, and `preempt_runs.go`

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele added a commit that referenced this pull request Apr 2, 2026
## What type of PR is this?

Feature (PR 3 of 4)

## What this PR does / why we need it

Stores FailureInfo in the lookout database, exposes it via the Lookout
API and public event API, and displays it in the Lookout UI.

- Adds DB migration (031) to add a `failure_info` JSONB column to
`job_run`
- Extracts `FailureInfo` from Error events in the lookout ingester and
stores as JSONB (exit code, termination message, categories, container
name)
- Adds `failureInfo` to the Lookout v2 Swagger spec and threads it
through the query builder, model, and conversions layer
- Displays FailureInfo in the Lookout UI sidebar on failed job runs:
container name, exit code, termination message, and matched categories
- Adds Error Categories column to the jobs table (display-only,
filtering is not yet supported since categories are stored in JSONB and
not a server-side filterable field)
- Propagates error categories in the public `JobFailedEvent` proto, so
categories are available in the event stream (used by the e2e testsuite
in PR #4760)
- Adds executor error category config to `_local/executor/config.yaml`
for local development

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 3 of 4: Proto + classifier (#4741) -> Wire into executor
(#4745) -> Store in lookout DB + UI + public event (this) -> e2e tests
(#4760)
- Depends on #4745
- JSONB round-trip means `exitCode` (int32 in proto) arrives as float64
after PostgreSQL JSON deserialization - `failureInfoToSwagger()` handles
this via type assertion
- `failureInfoToMap()` only emits non-zero fields to keep the JSONB
clean
- Uses `coalesce(tmp.failure_info, job_run.failure_info)` in the batch
upsert so earlier partial updates aren't overwritten
- Only terminal errors have their FailureInfo persisted (`if e.Terminal`
guard in the ingester) to prevent transient failures from overwriting
the final classification
- Categories on `JobFailedEvent` are set in `FromInternalJobErrors()`
conversion from `FailureInfo.Categories`
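
The JSONB round-trip note above follows from how Go's `encoding/json` decodes numbers into `interface{}` values. The sketch below shows the effect and a type-assertion-based recovery. The function bodies and every map key other than `exitCode` are assumptions for illustration, not the actual `failureInfoToMap()` or `failureInfoToSwagger()` code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// failureInfoToMap is a hypothetical version that emits only non-zero fields.
func failureInfoToMap(exitCode int32, message, container string, categories []string) map[string]interface{} {
	m := map[string]interface{}{}
	if exitCode != 0 {
		m["exitCode"] = exitCode
	}
	if message != "" {
		m["terminationMessage"] = message
	}
	if container != "" {
		m["containerName"] = container
	}
	if len(categories) > 0 {
		m["categories"] = categories
	}
	return m
}

// exitCodeFromJSON recovers the exit code from a decoded JSON object,
// where numbers arrive as float64.
func exitCodeFromJSON(m map[string]interface{}) (int32, bool) {
	f, ok := m["exitCode"].(float64)
	if !ok {
		return 0, false
	}
	return int32(f), true
}

func main() {
	raw, err := json.Marshal(failureInfoToMap(137, "", "main", []string{"oom"}))
	if err != nil {
		panic(err)
	}
	var decoded map[string]interface{}
	if err := json.Unmarshal(raw, &decoded); err != nil {
		panic(err)
	}
	fmt.Printf("%T\n", decoded["exitCode"]) // float64
	code, ok := exitCodeFromJSON(decoded)
	fmt.Println(code, ok)
}
```

A direct assertion to `int32` on the decoded value would fail, which is why the conversion goes through `float64` first.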

<img width="523" height="310" alt="image"
src="https://github.com/user-attachments/assets/8e018b92-6b90-46a1-8b1f-81c44ffed4c3"
/>
<img width="1728" height="410" alt="image"
src="https://github.com/user-attachments/assets/b79a4b28-b713-44ba-99a8-5297104a9edc"
/>

---------

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele added a commit that referenced this pull request Apr 3, 2026
## What type of PR is this?

Feature (PR 4 of 4)

## What this PR does / why we need it

Adds end-to-end test cases that verify error categories flow through the
full Armada pipeline, asserting directly from the public event stream.

- Extends `assertEventFailed` in the testsuite event watcher to compare
expected vs actual categories on `JobFailedEvent` (sorted exact match)
- Adds 2 categorization test cases: OOM kill (`dd` fills tmpfs to
trigger OOMKilled) and user_error (exit code 1)
- Test YAML declares expected categories inline on the `failed` event:
  ```yaml
  expectedEvents:
    - submitted:
    - failed:
        categories: ["oom"]
  ```
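
A "sorted exact match" comparison of this kind could be sketched as below. The helper name `categoriesEqual` is hypothetical and this is not the testsuite's `assertEventFailed` code; it only shows the order-insensitive, exact comparison the bullet describes.

```go
package main

import (
	"fmt"
	"sort"
)

// categoriesEqual reports whether expected and actual contain the same
// category names, ignoring order. Inputs are copied so callers' slices
// are not reordered.
func categoriesEqual(expected, actual []string) bool {
	if len(expected) != len(actual) {
		return false
	}
	e := append([]string(nil), expected...)
	a := append([]string(nil), actual...)
	sort.Strings(e)
	sort.Strings(a)
	for i := range e {
		if e[i] != a[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(categoriesEqual([]string{"oom", "infra"}, []string{"infra", "oom"}))
	fmt.Println(categoriesEqual([]string{"oom"}, []string{"oom", "user_error"}))
}
```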

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 4 of 4: Proto + classifier (#4741) -> Wire into executor
(#4745) -> Store in lookout DB + UI + public event (#4755) -> e2e tests
(this)
- Depends on #4755 (which added `categories` field to `JobFailedEvent`)
- Categories are asserted from the event stream, not from the Lookout
API - no HTTP polling or additional infrastructure needed
- CI executor config needs `errorCategories` rules added separately
(included in `e2e/config/executor_config.yaml`)
- Both tests validated locally against a full local dev stack

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

2 participants