Wire error categories into executor event reporting #4745
dejanzele merged 1 commit into armadaproject:master
Greptile Summary
This PR (2 of 4 in the error categorization series) wires the Key changes:
Two minor concerns worth tracking toward the final PR in the series:
Confidence Score: 5/5
Safe to merge: all remaining findings are P2 style/semantic concerns with no current runtime breakage.
The core wiring is correct and nil-safe throughout. The two findings are forward-looking quality concerns (zero-ExitCode ambiguity for eviction events, and FailureInfo attached to all non-retryable issue types regardless of relevance), neither of which causes broken behaviour in the current PR. Previous substantive concerns about PodCheckRetryable, preemption field divergence, and multi-container ExitCode correlation were already flagged in prior review threads. Test coverage for the new paths is solid.
internal/executor/reporter/event.go and internal/executor/service/pod_issue_handler.go carry the two P2 notes, but neither blocks merge.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant App as application.go
participant Clf as Classifier
participant JSR as JobStateReporter
participant PIH as PodIssueHandler
participant Evt as reporter/event.go
participant Util as util.ExtractFailureInfo
participant Pulsar as Pulsar (EventSender)
App->>Clf: NewClassifier(config.ErrorCategories)
App->>PIH: NewPodIssuerHandler(..., classifier)
App->>JSR: NewJobStateReporter(..., classifier)
Note over JSR: Pod phase changes (Kubernetes watch)
JSR->>Evt: CreateEventForCurrentState(pod, clusterId, classifier)
Evt->>Clf: classifier.Classify(pod)
Clf-->>Evt: []string categories
Evt->>Util: ExtractFailureInfo(pod, categories)
Util-->>Evt: *FailureInfo
Evt-->>JSR: EventSequence (with FailureInfo on PodFailed)
JSR->>Pulsar: Report(event)
Note over PIH: Non-retryable issue detected
PIH->>Clf: classifier.Classify(originalPodState)
Clf-->>PIH: []string categories
PIH->>Util: ExtractFailureInfo(originalPodState, categories)
Util-->>PIH: *FailureInfo
PIH->>Evt: CreateSimpleJobFailedEvent(..., failureInfo)
Evt-->>PIH: EventSequence (with FailureInfo)
PIH->>Pulsar: Report(event)
Updated the default FailureCondition from USER_ERROR to UNSPECIFIED in the upstream PR, so this concern is addressed now.
## What type of PR is this?
Feature (PR 1 of 4)
## What this PR does / why we need it
Adds the proto schema and shared building blocks for error categorization in Armada.
- Adds `FailureInfo` message to the `Error` proto with fields: `exit_code`, `termination_message`, `categories`, `container_name`
- Adds `errormatch` package (`internal/common/errormatch/`) with shared matching primitives: `ExitCodeMatcher` (In/NotIn operators), `RegexMatcher`, and Kubernetes condition constants (OOMKilled, Evicted, DeadlineExceeded)
- Adds `categorizer` package (`internal/executor/categorizer/`) - configurable classifier that matches pod failures against rules (exit codes, termination messages, Kubernetes conditions) and assigns named categories, with optional `containerName` scoping per rule
- Adds `ExtractFailureInfo()` in `pod_status.go` to extract exit code, termination message, and container name from Kubernetes pod status into the `FailureInfo` proto
- Adds `ErrorCategories` config field under `ApplicationConfiguration` for defining category rules
Nothing is wired into the event reporting path yet - this PR provides the building blocks that PR #4745 connects.
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- This is PR 1 of 4: Proto + classifier (this) -> Wire into executor (#4745) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- The `errormatch` package is in `internal/common/` since it will be reused by the lookout ingester
- The categorizer has thorough `doc.go` explaining config format and validation
- Exit code 0 containers are skipped during classification (only failures are categorized)
- Rules within a category are OR'd; categories are evaluated independently
---------
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
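The matching rules described in the PR 1 notes (In/NotIn exit-code operators, exit code 0 skipped, rules within a category OR'd, categories evaluated independently) can be sketched roughly as follows. This is a minimal illustration, not the code in `errormatch` or `categorizer`: the type names `ExitCodeRule` and `Category`, the function `Classify`, and the category name `unexpected` are hypothetical stand-ins.

```go
package main

import "fmt"

// Operator mirrors the In/NotIn operators named in the PR text.
// The type itself is a hypothetical stand-in.
type Operator string

const (
	OpIn    Operator = "In"
	OpNotIn Operator = "NotIn"
)

// ExitCodeRule is a hypothetical rule: an operator plus a set of exit codes.
type ExitCodeRule struct {
	Operator Operator
	Values   []int32
}

// Matches reports whether exitCode satisfies the rule.
// Exit code 0 never matches, since only failures are categorized.
func (r ExitCodeRule) Matches(exitCode int32) bool {
	if exitCode == 0 {
		return false
	}
	found := false
	for _, v := range r.Values {
		if v == exitCode {
			found = true
			break
		}
	}
	switch r.Operator {
	case OpIn:
		return found
	case OpNotIn:
		return !found
	default:
		return false // unknown operators match nothing
	}
}

// Category is a hypothetical named group of rules.
type Category struct {
	Name  string
	Rules []ExitCodeRule
}

// Classify returns the name of every category with at least one matching
// rule (rules are OR'd; each category is evaluated independently).
func Classify(categories []Category, exitCode int32) []string {
	matched := []string{}
	for _, c := range categories {
		for _, r := range c.Rules {
			if r.Matches(exitCode) {
				matched = append(matched, c.Name)
				break
			}
		}
	}
	return matched
}

func main() {
	cats := []Category{
		{Name: "user_error", Rules: []ExitCodeRule{{Operator: OpIn, Values: []int32{1, 2}}}},
		{Name: "unexpected", Rules: []ExitCodeRule{{Operator: OpNotIn, Values: []int32{1, 2, 137}}}},
	}
	fmt.Println(Classify(cats, 1))  // matches user_error only
	fmt.Println(Classify(cats, 42)) // matches unexpected only
	fmt.Println(Classify(cats, 0))  // successful exit: no categories
}
```

Returning an empty slice rather than nil for "no match" keeps callers from needing a nil check before attaching the result to an event.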
## What type of PR is this?
Feature (PR 3 of 4)
## What this PR does / why we need it
Stores FailureInfo in the lookout database, exposes it via the Lookout API and public event API, and displays it in the Lookout UI.
- Adds DB migration (031) to add a `failure_info` JSONB column to `job_run`
- Extracts `FailureInfo` from Error events in the lookout ingester and stores as JSONB (exit code, termination message, categories, container name)
- Adds `failureInfo` to the Lookout v2 Swagger spec and threads it through the query builder, model, and conversions layer
- Displays FailureInfo in the Lookout UI sidebar on failed job runs: container name, exit code, termination message, and matched categories
- Adds Error Categories column to the jobs table (display-only, filtering is not yet supported since categories are stored in JSONB and not a server-side filterable field)
- Propagates error categories in the public `JobFailedEvent` proto, so categories are available in the event stream (used by the e2e testsuite in PR #4760)
- Adds executor error category config to `_local/executor/config.yaml` for local development
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- This is PR 3 of 4: Proto + classifier (#4741) -> Wire into executor (#4745) -> Store in lookout DB + UI + public event (this) -> e2e tests (#4760)
- Depends on #4745
- JSONB round-trip means `exitCode` (int32 in proto) arrives as float64 after PostgreSQL JSON deserialization - `failureInfoToSwagger()` handles this via type assertion
- `failureInfoToMap()` only emits non-zero fields to keep the JSONB clean
- Uses `coalesce(tmp.failure_info, job_run.failure_info)` in the batch upsert so earlier partial updates aren't overwritten
- Only terminal errors have their FailureInfo persisted (`if e.Terminal` guard in the ingester) to prevent transient failures from overwriting the final classification
- Categories on `JobFailedEvent` are set in `FromInternalJobErrors()` conversion from `FailureInfo.Categories`
<img width="523" height="310" alt="image" src="https://github.com/user-attachments/assets/8e018b92-6b90-46a1-8b1f-81c44ffed4c3" />
<img width="1728" height="410" alt="image" src="https://github.com/user-attachments/assets/b79a4b28-b713-44ba-99a8-5297104a9edc" />
---------
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
## What type of PR is this?
Feature (PR 4 of 4)
## What this PR does / why we need it
Adds end-to-end test cases that verify error categories flow through the
full Armada pipeline, asserting directly from the public event stream.
- Extends `assertEventFailed` in the testsuite event watcher to compare
expected vs actual categories on `JobFailedEvent` (sorted exact match)
- Adds 2 categorization test cases: OOM kill (`dd` fills tmpfs to
trigger OOMKilled) and user_error (exit code 1)
- Test YAML declares expected categories inline on the `failed` event:
```yaml
# Expected categories are declared inline on the `failed` event.
expectedEvents:
  - submitted:
  - failed:
      categories: ["oom"]
```
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- This is PR 4 of 4: Proto + classifier (#4741) -> Wire into executor
(#4745) -> Store in lookout DB + UI + public event (#4755) -> e2e tests
(this)
- Depends on #4755 (which added `categories` field to `JobFailedEvent`)
- Categories are asserted from the event stream, not from the Lookout
API - no HTTP polling or additional infrastructure needed
- CI executor config needs `errorCategories` rules added separately
(included in `e2e/config/executor_config.yaml`)
- Both tests validated locally against a full local dev stack
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
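The "sorted exact match" comparison mentioned for `assertEventFailed` in the PR 4 notes can be illustrated with a small helper. This is a sketch under assumptions: `categoriesEqual` is a hypothetical name, and the real testsuite code may differ in how it reports mismatches.

```go
package main

import (
	"fmt"
	"sort"
)

// categoriesEqual reports whether two category lists contain exactly the
// same elements, ignoring order. Both inputs are copied before sorting so
// the caller's slices are left untouched.
func categoriesEqual(expected, actual []string) bool {
	if len(expected) != len(actual) {
		return false
	}
	e := append([]string(nil), expected...)
	a := append([]string(nil), actual...)
	sort.Strings(e)
	sort.Strings(a)
	for i := range e {
		if e[i] != a[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(categoriesEqual([]string{"oom", "user_error"}, []string{"user_error", "oom"}))
	fmt.Println(categoriesEqual([]string{"oom"}, []string{"oom", "user_error"}))
}
```

Sorting copies rather than the originals avoids reordering the categories on the event being inspected, which matters if the same event is logged or asserted on again later.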
What type of PR is this?
Feature (PR 2 of 4)
What this PR does / why we need it
Wires the classifier and FailureInfo into the executor's event reporting path.
- Builds the `Classifier` from config at executor startup unconditionally (with no rules configured, it simply returns empty categories)
- Runs `ExtractFailureInfo()` + `classifier.Classify()` on every pod failure, attaching structured `FailureInfo` (exit code, termination message, categories, container name) to the `Error` events sent through Pulsar
Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
Special notes for your reviewer
`application.go` (startup wiring), `pod_issue_handler.go`, `job_state_reporter.go`, `reporter/event.go`, `cluster_allocation.go`, and `preempt_runs.go`
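The classify-then-extract flow described in this PR can be sketched as below, including the case where no rules are configured and the classifier returns empty categories. All types here (`Pod`, `ContainerStatus`, `Classifier`, `FailureInfo`) are simplified hypothetical stand-ins rather than the Kubernetes or Armada types, and `extractFailureInfo` only imitates the shape of `ExtractFailureInfo()`; the rule table keyed by exit code is a toy assumption.

```go
package main

import "fmt"

// ContainerStatus and Pod are simplified stand-ins for Kubernetes pod status.
type ContainerStatus struct {
	Name     string
	ExitCode int32
	Message  string
}

type Pod struct {
	Containers []ContainerStatus
}

// FailureInfo carries the four fields listed in the PR description.
type FailureInfo struct {
	ExitCode           int32
	TerminationMessage string
	Categories         []string
	ContainerName      string
}

// Classifier is a toy classifier: a map from exit code to category name.
type Classifier struct {
	rules map[int32]string
}

// Classify is safe to call on a nil receiver or with no rules configured;
// in both cases it returns an empty, non-nil slice.
func (c *Classifier) Classify(pod Pod) []string {
	categories := []string{}
	if c == nil {
		return categories
	}
	for _, cs := range pod.Containers {
		if cs.ExitCode == 0 {
			continue // only failures are categorized
		}
		if name, ok := c.rules[cs.ExitCode]; ok {
			categories = append(categories, name)
		}
	}
	return categories
}

// extractFailureInfo records the first failed container and attaches the
// categories that were computed separately.
func extractFailureInfo(pod Pod, categories []string) *FailureInfo {
	info := &FailureInfo{Categories: categories}
	for _, cs := range pod.Containers {
		if cs.ExitCode != 0 {
			info.ExitCode = cs.ExitCode
			info.TerminationMessage = cs.Message
			info.ContainerName = cs.Name
			break
		}
	}
	return info
}

func main() {
	pod := Pod{Containers: []ContainerStatus{{Name: "main", ExitCode: 137, Message: "OOMKilled"}}}

	noRules := &Classifier{}
	fmt.Printf("%+v\n", *extractFailureInfo(pod, noRules.Classify(pod)))

	withRules := &Classifier{rules: map[int32]string{137: "oom"}}
	fmt.Printf("%+v\n", *extractFailureInfo(pod, withRules.Classify(pod)))
}
```

Constructing the classifier unconditionally, as the description says, means the reporting path has a single code shape: the failure details are always attached and the categories list is simply empty when nothing is configured.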