Wire error categories into executor event reporting #4745

Merged
dejanzele merged 1 commit into armadaproject:master from dejanzele:wire-error-categories-executor
Apr 2, 2026

@dejanzele dejanzele commented Mar 6, 2026

What type of PR is this?

Feature (PR 2 of 4)

What this PR does / why we need it

Wires the classifier and FailureInfo into the executor's event reporting path.

  • Constructs the Classifier from config at executor startup unconditionally (with no rules configured, it simply returns empty categories)
  • Passes classifier to the pod issue handler and job state reporter
  • Calls ExtractFailureInfo() + classifier.Classify() on every pod failure, attaching structured FailureInfo (exit code, termination message, categories, container name) to the Error events sent through Pulsar
  • After this PR, every failed pod event flowing into Pulsar carries exit code, termination message, container name, and matched category names
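The wiring described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Armada implementation: `containerExit`, the exit-code-to-category map, and `failureInfoForPod` are hypothetical stand-ins for the Kubernetes pod status, the configured rules, and the call site in the reporter. Classification runs first and its result is passed into the extraction step, which is the order shown in the review's sequence diagram.

```go
package main

import "fmt"

// FailureInfo is a simplified stand-in for the FailureInfo proto message.
type FailureInfo struct {
	ExitCode           int32
	TerminationMessage string
	ContainerName      string
	Categories         []string
}

// containerExit is a hypothetical, flattened view of one terminated container.
type containerExit struct {
	Name     string
	ExitCode int32
	Message  string
}

// Classifier maps exit codes to category names (hypothetical rule format).
type Classifier struct {
	categoryByExitCode map[int32]string
}

// Classify returns the categories matched by any failed container.
// Containers that exited with code 0 are skipped.
func (c *Classifier) Classify(exits []containerExit) []string {
	var categories []string
	for _, e := range exits {
		if e.ExitCode == 0 {
			continue
		}
		if name, ok := c.categoryByExitCode[e.ExitCode]; ok {
			categories = append(categories, name)
		}
	}
	return categories
}

// extractFailureInfo copies the details of the first failed container
// and attaches the already-computed categories.
func extractFailureInfo(exits []containerExit, categories []string) *FailureInfo {
	info := &FailureInfo{Categories: categories}
	for _, e := range exits {
		if e.ExitCode != 0 {
			info.ExitCode = e.ExitCode
			info.TerminationMessage = e.Message
			info.ContainerName = e.Name
			break
		}
	}
	return info
}

// failureInfoForPod is the wiring step: classify, then build the FailureInfo.
func failureInfoForPod(c *Classifier, exits []containerExit) *FailureInfo {
	return extractFailureInfo(exits, c.Classify(exits))
}

func main() {
	clf := &Classifier{categoryByExitCode: map[int32]string{137: "oom"}}
	exits := []containerExit{
		{Name: "sidecar", ExitCode: 0},
		{Name: "main", ExitCode: 137, Message: "OOMKilled"},
	}
	info := failureInfoForPod(clf, exits)
	fmt.Printf("%s exited %d, categories=%v\n", info.ContainerName, info.ExitCode, info.Categories)
}
```

A classifier built with no rules (`&Classifier{}` in this sketch) still goes through the same path and simply yields an empty category list.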

Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

Special notes for your reviewer

@dejanzele changed the title from "Wire error categories executor" to "Wire error categories into executor event reporting" on Mar 6, 2026
@dejanzele force-pushed the wire-error-categories-executor branch 2 times, most recently from 5a399ca to 8a49600, on March 9, 2026 10:11
@dejanzele force-pushed the wire-error-categories-executor branch from 8a49600 to b828caf on March 9, 2026 10:21
@dejanzele force-pushed the wire-error-categories-executor branch from b828caf to 30aa1ca on March 9, 2026 13:03

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR (2 of 4 in the error categorization series) wires the Classifier and FailureInfo into the executor's event reporting path, so every failed pod event flowing into Pulsar now carries structured failure data (exit code, termination message, container name, and matched category names).

Key changes:

  • Classifier is constructed unconditionally at executor startup from config.Application.ErrorCategories; with no rules configured it returns empty categories, so no nil-checks are required at call sites.
  • CreateEventForCurrentState, CreateJobFailedEvent, and CreateSimpleJobFailedEvent gain new classifier/failureInfo parameters — all nil-safe via the Classify nil-receiver guard.
  • ExtractFailureInfo + classifier.Classify are called in two places: CreateEventForCurrentState (for the PodFailed phase path) and handleNonRetryableJobIssue (for stuck/issue-handler path).
  • Preempted and submission-failure events deliberately pass nil for FailureInfo.
  • Tests cover nil classifier, classifier-with-rules, and the end-to-end issue-handler integration path.
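The nil-receiver guard mentioned in the key changes is a standard Go idiom: a method with a pointer receiver can be called on a nil pointer as long as it checks the receiver before dereferencing it. The sketch below illustrates the idiom only; the rule representation (a map from exit code to category name) is an assumption and not the actual `Classifier` layout.

```go
package main

import "fmt"

// Classifier is a simplified stand-in; the real type holds configured rules.
type Classifier struct {
	categoryByExitCode map[int32]string
}

// Classify is safe to call on a nil *Classifier: the guard returns early,
// so call sites do not need their own nil checks.
func (c *Classifier) Classify(exitCode int32) []string {
	if c == nil {
		return nil
	}
	if name, ok := c.categoryByExitCode[exitCode]; ok {
		return []string{name}
	}
	return nil
}

func main() {
	var missing *Classifier // nil pointer, e.g. in a test that passes no classifier
	fmt.Println(len(missing.Classify(137)))

	configured := &Classifier{categoryByExitCode: map[int32]string{137: "oom"}}
	fmt.Println(configured.Classify(137))
}
```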

Two minor concerns worth tracking toward the final PR in the series:

  • ExtractFailureInfo always returns a non-nil proto even when no container exit data exists (e.g., eviction, deadline-exceeded), leaving ExitCode=0 and empty ContainerName — downstream consumers cannot distinguish "genuinely zero" from "not found."
  • handleNonRetryableJobIssue attaches FailureInfo for every non-retryable issue type including ExternallyDeleted and ErrorDuringIssueHandling, where container-level data is typically absent.
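To make the first concern concrete, the sketch below contrasts an extractor that always returns a non-nil value with a hypothetical variant that also reports whether any container exit data was found. Neither function is taken from the PR: `terminated`, `extractAlways`, and `extractWithFound` are illustrative names, and the second variant is only one possible way a later change could remove the ambiguity.

```go
package main

import "fmt"

// FailureInfo is a reduced stand-in for the proto message.
type FailureInfo struct {
	ExitCode      int32
	ContainerName string
}

// terminated is a hypothetical record of a container that has exited.
type terminated struct {
	Container string
	ExitCode  int32
}

// extractAlways mirrors the behaviour described in the review: the result is
// never nil, so "no data" and "exit code 0" both surface as ExitCode == 0.
func extractAlways(ts []terminated) *FailureInfo {
	info := &FailureInfo{}
	for _, t := range ts {
		if t.ExitCode != 0 {
			info.ExitCode = t.ExitCode
			info.ContainerName = t.Container
			break
		}
	}
	return info
}

// extractWithFound is a hypothetical alternative that tells the caller
// whether a failed container was actually found.
func extractWithFound(ts []terminated) (*FailureInfo, bool) {
	for _, t := range ts {
		if t.ExitCode != 0 {
			return &FailureInfo{ExitCode: t.ExitCode, ContainerName: t.Container}, true
		}
	}
	return nil, false
}

func main() {
	// An evicted pod may have no terminated containers at all.
	evicted := []terminated{}
	fmt.Println(extractAlways(evicted).ExitCode)

	_, found := extractWithFound(evicted)
	fmt.Println(found)
}
```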

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style/semantic concerns with no current runtime breakage.

The core wiring is correct and nil-safe throughout. The two findings are forward-looking quality concerns (zero-ExitCode ambiguity for eviction events, and FailureInfo attached to all non-retryable issue types regardless of relevance), neither of which causes broken behaviour in the current PR. Previous substantive concerns about PodCheckRetryable, preemption field divergence, and multi-container ExitCode correlation were already flagged in prior review threads. Test coverage for the new paths is solid.

internal/executor/reporter/event.go and internal/executor/service/pod_issue_handler.go carry the two P2 notes, but neither blocks merge.

Important Files Changed

| Filename | Overview |
|---|---|
| internal/executor/application.go | Adds Classifier construction from config at startup and threads it into PodIssueHandler and JobStateReporter; straightforward wiring with fatal-on-config-error guard. |
| internal/executor/reporter/event.go | Adds classifier param to CreateEventForCurrentState and failureInfo param to CreateJobFailedEvent/CreateSimpleJobFailedEvent; nil-safe via classifier's nil-receiver guard, but always attaches non-nil FailureInfo even for eviction/deadline pods where no container exit code exists. |
| internal/executor/service/pod_issue_handler.go | Adds classifier field and wires ExtractFailureInfo+Classify into handleNonRetryableJobIssue; nil-safe but ExtractFailureInfo runs for all non-retryable issue types including eviction/external-deletion where container data may not be present. |
| internal/executor/service/job_state_reporter.go | Adds classifier field, passes it to CreateEventForCurrentState; straightforward and nil-safe delegation. |
| internal/executor/service/cluster_allocation.go | Updates call to CreateSimpleJobFailedEvent with explicit nil failureInfo for submission failures; deliberate, as these are not runtime pod failures. |
| internal/executor/job/processors/preempt_runs.go | Updates CreateSimpleJobFailedEvent call with explicit nil failureInfo for preempted runs; deliberate choice to not attach container failure detail for preemption events. |
| internal/executor/reporter/event_test.go | Adds coverage for nil classifier, non-nil classifier with exit-code matching, and asserts FailureInfo is always present on PodFailed events; good test hygiene. |
| internal/executor/service/pod_issue_handler_test.go | New integration-style test verifies classifier-matched categories appear on FailureInfo for stuck-terminating OOMKilled pods; well-structured and tests end-to-end path. |

Sequence Diagram

sequenceDiagram
    participant App as application.go
    participant Clf as Classifier
    participant JSR as JobStateReporter
    participant PIH as PodIssueHandler
    participant Evt as reporter/event.go
    participant Util as util.ExtractFailureInfo
    participant Pulsar as Pulsar (EventSender)

    App->>Clf: NewClassifier(config.ErrorCategories)
    App->>PIH: NewPodIssuerHandler(..., classifier)
    App->>JSR: NewJobStateReporter(..., classifier)

    Note over JSR: Pod phase changes (Kubernetes watch)
    JSR->>Evt: CreateEventForCurrentState(pod, clusterId, classifier)
    Evt->>Clf: classifier.Classify(pod)
    Clf-->>Evt: []string categories
    Evt->>Util: ExtractFailureInfo(pod, categories)
    Util-->>Evt: *FailureInfo
    Evt-->>JSR: EventSequence (with FailureInfo on PodFailed)
    JSR->>Pulsar: Report(event)

    Note over PIH: Non-retryable issue detected
    PIH->>Clf: classifier.Classify(originalPodState)
    Clf-->>PIH: []string categories
    PIH->>Util: ExtractFailureInfo(originalPodState, categories)
    Util-->>PIH: *FailureInfo
    PIH->>Evt: CreateSimpleJobFailedEvent(..., failureInfo)
    Evt-->>PIH: EventSequence (with FailureInfo)
    PIH->>Pulsar: Report(event)

Reviews (25): Last reviewed commit: "Wire error categories into executor even..."

@dejanzele force-pushed the wire-error-categories-executor branch from 30aa1ca to ef4e22d on March 9, 2026 13:20
@dejanzele force-pushed the wire-error-categories-executor branch 2 times, most recently from 3d469d0 to 20f2fd3, on March 10, 2026 12:34
@dejanzele
Member Author

@greptile

@dejanzele force-pushed the wire-error-categories-executor branch from 20f2fd3 to 71d2cd3 on March 10, 2026 16:51
@dejanzele force-pushed the wire-error-categories-executor branch 2 times, most recently from 5ab3dda to 87974a6, on March 12, 2026 12:31
@dejanzele
Member Author

@greptile

@dejanzele force-pushed the wire-error-categories-executor branch from 87974a6 to e1d4957 on March 12, 2026 14:14
@dejanzele
Member Author

Updated the default FailureCondition from USER_ERROR to UNSPECIFIED in the upstream PR, so this concern is addressed now.

@dejanzele force-pushed the wire-error-categories-executor branch 8 times, most recently from c20362e to 113b764, on March 17, 2026 14:40
@dejanzele force-pushed the wire-error-categories-executor branch 4 times, most recently from aaa3f3a to 8aa5ea8, on March 30, 2026 15:58
@dejanzele
Member Author

@greptileai

@dejanzele force-pushed the wire-error-categories-executor branch from 8aa5ea8 to 15998e3 on March 31, 2026 08:20
dejanzele added a commit that referenced this pull request Apr 1, 2026
## What type of PR is this?

Feature (PR 1 of 4)

## What this PR does / why we need it

Adds the proto schema and shared building blocks for error
categorization in Armada.

- Adds `FailureInfo` message to the `Error` proto with fields:
`exit_code`, `termination_message`, `categories`, `container_name`
- Adds `errormatch` package (`internal/common/errormatch/`) with shared
matching primitives: `ExitCodeMatcher` (In/NotIn operators),
`RegexMatcher`, and Kubernetes condition constants (OOMKilled, Evicted,
DeadlineExceeded)
- Adds `categorizer` package (`internal/executor/categorizer/`) -
configurable classifier that matches pod failures against rules (exit
codes, termination messages, Kubernetes conditions) and assigns named
categories, with optional `containerName` scoping per rule
- Adds `ExtractFailureInfo()` in `pod_status.go` to extract exit code,
termination message, and container name from Kubernetes pod status into
the `FailureInfo` proto
- Adds `ErrorCategories` config field under `ApplicationConfiguration`
for defining category rules

Nothing is wired into the event reporting path yet - this PR provides
the building blocks that PR #4745 connects.

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 1 of 4: Proto + classifier (this) -> Wire into executor
(#4745) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- The `errormatch` package is in `internal/common/` since it will be
reused by the lookout ingester
- The categorizer has thorough `doc.go` explaining config format and
validation
- Exit code 0 containers are skipped during classification (only
failures are categorized)
- Rules within a category are OR'd; categories are evaluated
independently
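
As a rough illustration of the matching primitives and the notes above, the sketch below implements an exit-code matcher with `In`/`NotIn` operators, a regex rule on the termination message, OR semantics across a category's rules, and the skip for exit code 0. The type layout, field names, and the `newExitCodeRule`/`newRegexRule` helpers are assumptions made for this example; the actual `errormatch` and `categorizer` packages may be structured differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// Operator selects how an ExitCodeMatcher interprets its values.
type Operator string

const (
	OpIn    Operator = "In"
	OpNotIn Operator = "NotIn"
)

// ExitCodeMatcher matches an exit code against a set of values.
type ExitCodeMatcher struct {
	Operator Operator
	Values   []int32
}

// Matches reports whether code satisfies the matcher's operator.
func (m ExitCodeMatcher) Matches(code int32) bool {
	found := false
	for _, v := range m.Values {
		if v == code {
			found = true
			break
		}
	}
	switch m.Operator {
	case OpIn:
		return found
	case OpNotIn:
		return !found
	default:
		return false // unknown operators never match in this sketch
	}
}

// Rule carries exactly one matcher in this sketch (hypothetical layout).
type Rule struct {
	ExitCodes          *ExitCodeMatcher
	TerminationMessage *regexp.Regexp
}

func newExitCodeRule(op Operator, values ...int32) Rule {
	return Rule{ExitCodes: &ExitCodeMatcher{Operator: op, Values: values}}
}

func newRegexRule(pattern string) Rule {
	return Rule{TerminationMessage: regexp.MustCompile(pattern)}
}

func (r Rule) matches(code int32, message string) bool {
	if r.ExitCodes != nil {
		return r.ExitCodes.Matches(code)
	}
	if r.TerminationMessage != nil {
		return r.TerminationMessage.MatchString(message)
	}
	return false
}

// categoryMatches ORs the rules of one category and skips exit code 0.
func categoryMatches(rules []Rule, code int32, message string) bool {
	if code == 0 {
		return false
	}
	for _, r := range rules {
		if r.matches(code, message) {
			return true
		}
	}
	return false
}

func main() {
	oom := []Rule{newExitCodeRule(OpIn, 137), newRegexRule(`(?i)out of memory`)}
	fmt.Println(categoryMatches(oom, 137, ""))
	fmt.Println(categoryMatches(oom, 1, "process ran Out Of Memory"))
	fmt.Println(categoryMatches(oom, 0, "out of memory"))
}
```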

---------

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele force-pushed the wire-error-categories-executor branch from 15998e3 to ba6f876 on April 1, 2026 16:27
@dejanzele dejanzele merged commit e72c90e into armadaproject:master Apr 2, 2026
18 checks passed
dejanzele added a commit that referenced this pull request Apr 2, 2026
## What type of PR is this?

Feature (PR 3 of 4)

## What this PR does / why we need it

Stores FailureInfo in the lookout database, exposes it via the Lookout
API and public event API, and displays it in the Lookout UI.

- Adds DB migration (031) to add a `failure_info` JSONB column to
`job_run`
- Extracts `FailureInfo` from Error events in the lookout ingester and
stores as JSONB (exit code, termination message, categories, container
name)
- Adds `failureInfo` to the Lookout v2 Swagger spec and threads it
through the query builder, model, and conversions layer
- Displays FailureInfo in the Lookout UI sidebar on failed job runs:
container name, exit code, termination message, and matched categories
- Adds Error Categories column to the jobs table (display-only,
filtering is not yet supported since categories are stored in JSONB and
not a server-side filterable field)
- Propagates error categories in the public `JobFailedEvent` proto, so
categories are available in the event stream (used by the e2e testsuite
in PR #4760)
- Adds executor error category config to `_local/executor/config.yaml`
for local development

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 3 of 4: Proto + classifier (#4741) -> Wire into executor
(#4745) -> Store in lookout DB + UI + public event (this) -> e2e tests
(#4760)
- Depends on #4745
- JSONB round-trip means `exitCode` (int32 in proto) arrives as float64
after PostgreSQL JSON deserialization - `failureInfoToSwagger()` handles
this via type assertion
- `failureInfoToMap()` only emits non-zero fields to keep the JSONB
clean
- Uses `coalesce(tmp.failure_info, job_run.failure_info)` in the batch
upsert so earlier partial updates aren't overwritten
- Only terminal errors have their FailureInfo persisted (`if e.Terminal`
guard in the ingester) to prevent transient failures from overwriting
the final classification
- Categories on `JobFailedEvent` are set in `FromInternalJobErrors()`
conversion from `FailureInfo.Categories`
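
The JSONB round-trip note can be reproduced with the standard library alone: when JSON is decoded into a `map[string]interface{}`, `encoding/json` represents every number as `float64`, so an exit code has to be recovered with a type assertion and converted back to `int32`. The sketch below is a stand-alone illustration under assumptions; `sparseFailureInfo`, `exitCodeFromJSON`, and the JSON key names are hypothetical and are not the `failureInfoToMap` or `failureInfoToSwagger` functions from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sparseFailureInfo builds a map holding only non-zero fields,
// so the stored JSON stays small (key names are assumptions).
func sparseFailureInfo(exitCode int32, message, container string, categories []string) map[string]interface{} {
	m := map[string]interface{}{}
	if exitCode != 0 {
		m["exitCode"] = exitCode
	}
	if message != "" {
		m["terminationMessage"] = message
	}
	if container != "" {
		m["containerName"] = container
	}
	if len(categories) > 0 {
		m["categories"] = categories
	}
	return m
}

// exitCodeFromJSON decodes stored JSON and recovers the exit code.
// JSON numbers decode to float64 in an interface{} value, hence the assertion.
func exitCodeFromJSON(raw []byte) (int32, bool) {
	var decoded map[string]interface{}
	if err := json.Unmarshal(raw, &decoded); err != nil {
		return 0, false
	}
	f, ok := decoded["exitCode"].(float64)
	if !ok {
		return 0, false
	}
	return int32(f), true
}

func main() {
	raw, err := json.Marshal(sparseFailureInfo(137, "OOMKilled", "main", []string{"oom"}))
	if err != nil {
		panic(err)
	}
	code, ok := exitCodeFromJSON(raw)
	fmt.Println(code, ok)
}
```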

<img width="523" height="310" alt="image"
src="https://github.com/user-attachments/assets/8e018b92-6b90-46a1-8b1f-81c44ffed4c3"
/>
<img width="1728" height="410" alt="image"
src="https://github.com/user-attachments/assets/b79a4b28-b713-44ba-99a8-5297104a9edc"
/>

---------

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele added a commit that referenced this pull request Apr 3, 2026
## What type of PR is this?

Feature (PR 4 of 4)

## What this PR does / why we need it

Adds end-to-end test cases that verify error categories flow through the
full Armada pipeline, asserting directly from the public event stream.

- Extends `assertEventFailed` in the testsuite event watcher to compare
expected vs actual categories on `JobFailedEvent` (sorted exact match)
- Adds 2 categorization test cases: OOM kill (`dd` fills tmpfs to
trigger OOMKilled) and user_error (exit code 1)
- Test YAML declares expected categories inline on the `failed` event:
  ```yaml
  expectedEvents:
    - submitted:
    - failed:
        categories: ["oom"]
  ```
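
A "sorted exact match" comparison of expected and actual categories could look like the sketch below. It is an assumption-laden illustration rather than the testsuite code: `categoriesEqual` is a hypothetical helper, and it copies both slices before sorting so the caller's data is left untouched.

```go
package main

import (
	"fmt"
	"sort"
)

// categoriesEqual reports whether both slices hold the same category names,
// ignoring order. Inputs are copied so sorting does not mutate them.
func categoriesEqual(expected, actual []string) bool {
	if len(expected) != len(actual) {
		return false
	}
	e := append([]string(nil), expected...)
	a := append([]string(nil), actual...)
	sort.Strings(e)
	sort.Strings(a)
	for i := range e {
		if e[i] != a[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(categoriesEqual([]string{"oom", "user_error"}, []string{"user_error", "oom"}))
	fmt.Println(categoriesEqual([]string{"oom"}, []string{"oom", "user_error"}))
}
```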

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 4 of 4: Proto + classifier (#4741) -> Wire into executor
(#4745) -> Store in lookout DB + UI + public event (#4755) -> e2e tests
(this)
- Depends on #4755 (which added `categories` field to `JobFailedEvent`)
- Categories are asserted from the event stream, not from the Lookout
API - no HTTP polling or additional infrastructure needed
- CI executor config needs `errorCategories` rules added separately
(included in `e2e/config/executor_config.yaml`)
- Both tests validated locally against a full local dev stack

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

2 participants