Add error categorization proto schema and executor classifier #4741
dejanzele merged 3 commits into armadaproject:master
Greptile Summary
This PR lays the foundational building blocks for error categorization in Armada: a
The implementation is well-structured with thorough tests and good validation at construction time. Several issues flagged in earlier review rounds have been addressed (exit-code-0 filtering in
Findings:
Confidence Score: 4/5
Safe to merge after fixing the
One P1 documentation bug (
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Pod fails] --> B[ExtractFailureInfo pod_status.go]
A --> C[Classifier.Classify]
B --> D[Iterate GetPodContainerStatuses]
D --> E{First container\nwith exitCode != 0?}
E -->|Yes| F[Set ExitCode, TerminationMessage, ContainerName in FailureInfo]
E -->|No| G[ExitCode stays 0]
C --> H[failedContainers\nfilter exitCode == 0]
H --> I[For each CategoryConfig]
I --> J[categoryMatches: OR rules]
J --> K{Rule matcher type?}
K -->|onConditions OOMKilled| L[Check container.reason]
K -->|onConditions Evicted/Deadline| M[Check pod.Status.Reason]
K -->|onExitCodes| N[MatchExitCode]
K -->|onTerminationMessage| O[MatchPattern regex]
L & M & N & O --> P{Matched?}
P -->|Yes| Q[Append category name]
P -->|No| R[Next rule]
F --> S[FailureInfo proto\nCategories + ExitCode + TerminationMessage + ContainerName]
Q --> S
G --> S
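The classification branch of the flowchart can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ContainerStatus`, `Rule`, `Category`, and the `Classify` signature are hypothetical simplifications, and the pod-level conditions (Evicted, DeadlineExceeded) are left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// ContainerStatus is a simplified, hypothetical stand-in for the Kubernetes
// container status fields that the flowchart refers to.
type ContainerStatus struct {
	Name     string
	ExitCode int32
	Reason   string // for example "OOMKilled"
	Message  string // termination message
}

// Rule holds one matcher. This sketch assumes exactly one field is set.
type Rule struct {
	ContainerReason string         // compared against ContainerStatus.Reason
	ExitCodes       []int32        // matches if the exit code is in this list
	MessagePattern  *regexp.Regexp // matched against the termination message
}

// Category groups rules under a name; the rules are OR'd together.
type Category struct {
	Name  string
	Rules []Rule
}

func (r Rule) matches(c ContainerStatus) bool {
	switch {
	case r.ContainerReason != "":
		return c.Reason == r.ContainerReason
	case len(r.ExitCodes) > 0:
		for _, code := range r.ExitCodes {
			if code == c.ExitCode {
				return true
			}
		}
		return false
	case r.MessagePattern != nil:
		return r.MessagePattern.MatchString(c.Message)
	}
	return false
}

// categoryMatches reports whether any rule matches any failed container.
func categoryMatches(cat Category, containers []ContainerStatus) bool {
	for _, c := range containers {
		if c.ExitCode == 0 {
			continue // successful containers never contribute a match
		}
		for _, r := range cat.Rules {
			if r.matches(c) {
				return true
			}
		}
	}
	return false
}

// Classify returns the names of all matching categories, in config order.
func Classify(categories []Category, containers []ContainerStatus) []string {
	matched := []string{}
	for _, cat := range categories {
		if categoryMatches(cat, containers) {
			matched = append(matched, cat.Name)
		}
	}
	return matched
}

func main() {
	categories := []Category{
		{Name: "oom", Rules: []Rule{{ContainerReason: "OOMKilled"}}},
		{Name: "user_error", Rules: []Rule{{ExitCodes: []int32{1}}}},
	}
	containers := []ContainerStatus{
		{Name: "sidecar", ExitCode: 0},
		{Name: "main", ExitCode: 137, Reason: "OOMKilled"},
	}
	fmt.Println(Classify(categories, containers)) // prints [oom]
}
```

A pod can match several categories at once, which is why the sketch returns a slice rather than a single name.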
Changed the default from FAILURE_CONDITION_USER_ERROR to FAILURE_CONDITION_UNSPECIFIED - at large scale various infra/hardware/k8s errors appear that aren't the user's fault, so the catch-all should be honest about not knowing. The USER_ERROR enum value stays in the proto for operators who want to explicitly classify via errorCategories config.
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
## What type of PR is this?
Feature (PR 2 of 4)
## What this PR does / why we need it
Wires the classifier and FailureInfo into the executor's event reporting path.
- Constructs the `Classifier` from config at executor startup unconditionally (with no rules configured, it simply returns empty categories)
- Passes classifier to the pod issue handler and job state reporter
- Calls `ExtractFailureInfo()` + `classifier.Classify()` on every pod failure, attaching structured `FailureInfo` (exit code, termination message, categories, container name) to the `Error` events sent through Pulsar
- After this PR, every failed pod event flowing into Pulsar carries exit code, termination message, container name, and matched category names
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- This is PR 2 of 4: Proto + classifier (#4741) -> Wire into executor (this) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- Depends on #4741
- The classifier is always instantiated at startup, even with no rules configured - no conditional nil-checking needed at call sites
- Changes are in `application.go` (startup wiring), `pod_issue_handler.go`, `job_state_reporter.go`, `reporter/event.go`, `cluster_allocation.go`, and `preempt_runs.go`
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
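The extract-then-classify step described above could look something like the sketch below. All types here are hypothetical simplifications (the real code works on Kubernetes pod objects and the `FailureInfo` proto), and the exit-code-only `Classifier` is just a placeholder for the configurable one. It shows why a classifier that is always constructed needs no nil checks at call sites.

```go
package main

import "fmt"

// ContainerStatus is a hypothetical, trimmed-down container status.
type ContainerStatus struct {
	Name     string
	ExitCode int32
	Message  string
}

// FailureInfo mirrors the fields named in the PR description; the Go shape
// is an assumption made for this sketch.
type FailureInfo struct {
	ExitCode           int32
	TerminationMessage string
	ContainerName      string
	Categories         []string
}

// Classifier is a placeholder that maps exit codes to category names.
// With no rules configured it returns an empty slice.
type Classifier struct {
	byExitCode map[int32]string
}

func NewClassifier(rules map[int32]string) *Classifier {
	return &Classifier{byExitCode: rules}
}

func (c *Classifier) Classify(containers []ContainerStatus) []string {
	categories := []string{}
	seen := map[string]bool{}
	for _, cs := range containers {
		if name, ok := c.byExitCode[cs.ExitCode]; ok && !seen[name] {
			seen[name] = true
			categories = append(categories, name)
		}
	}
	return categories
}

// extractFailureInfo copies details from the first container that exited
// with a non-zero code. If none did, ExitCode stays 0.
func extractFailureInfo(containers []ContainerStatus) FailureInfo {
	for _, cs := range containers {
		if cs.ExitCode != 0 {
			return FailureInfo{
				ExitCode:           cs.ExitCode,
				TerminationMessage: cs.Message,
				ContainerName:      cs.Name,
			}
		}
	}
	return FailureInfo{}
}

// buildFailureInfo is what a reporter might call on every pod failure.
func buildFailureInfo(c *Classifier, containers []ContainerStatus) FailureInfo {
	info := extractFailureInfo(containers)
	info.Categories = c.Classify(containers)
	return info
}

func main() {
	containers := []ContainerStatus{
		{Name: "init", ExitCode: 0},
		{Name: "main", ExitCode: 1, Message: "config file not found"},
	}
	withRules := NewClassifier(map[int32]string{1: "user_error"})
	noRules := NewClassifier(nil) // still safe to call
	fmt.Printf("%+v\n", buildFailureInfo(withRules, containers))
	fmt.Printf("%+v\n", buildFailureInfo(noRules, containers))
}
```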
## What type of PR is this?
Feature (PR 3 of 4)
## What this PR does / why we need it
Stores FailureInfo in the lookout database, exposes it via the Lookout API and public event API, and displays it in the Lookout UI.
- Adds DB migration (031) to add a `failure_info` JSONB column to `job_run`
- Extracts `FailureInfo` from Error events in the lookout ingester and stores as JSONB (exit code, termination message, categories, container name)
- Adds `failureInfo` to the Lookout v2 Swagger spec and threads it through the query builder, model, and conversions layer
- Displays FailureInfo in the Lookout UI sidebar on failed job runs: container name, exit code, termination message, and matched categories
- Adds Error Categories column to the jobs table (display-only, filtering is not yet supported since categories are stored in JSONB and not a server-side filterable field)
- Propagates error categories in the public `JobFailedEvent` proto, so categories are available in the event stream (used by the e2e testsuite in PR #4760)
- Adds executor error category config to `_local/executor/config.yaml` for local development
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- This is PR 3 of 4: Proto + classifier (#4741) -> Wire into executor (#4745) -> Store in lookout DB + UI + public event (this) -> e2e tests (#4760)
- Depends on #4745
- JSONB round-trip means `exitCode` (int32 in proto) arrives as float64 after PostgreSQL JSON deserialization - `failureInfoToSwagger()` handles this via type assertion
- `failureInfoToMap()` only emits non-zero fields to keep the JSONB clean
- Uses `coalesce(tmp.failure_info, job_run.failure_info)` in the batch upsert so earlier partial updates aren't overwritten
- Only terminal errors have their FailureInfo persisted (`if e.Terminal` guard in the ingester) to prevent transient failures from overwriting the final classification
- Categories on `JobFailedEvent` are set in `FromInternalJobErrors()` conversion from `FailureInfo.Categories`

<img width="523" height="310" alt="image" src="https://github.com/user-attachments/assets/8e018b92-6b90-46a1-8b1f-81c44ffed4c3" />
<img width="1728" height="410" alt="image" src="https://github.com/user-attachments/assets/b79a4b28-b713-44ba-99a8-5297104a9edc" />

---------
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
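The JSONB round-trip note is easy to reproduce with Go's `encoding/json`, which decodes every JSON number into `float64` when the target is `interface{}`. The sketch below is an illustration only: the `FailureInfo` struct, the JSON key names, and `exitCodeFromJSON` are assumptions, and `failureInfoToMap` here is a simplified stand-in for the function of the same name in the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FailureInfo is a hypothetical Go shape for the fields named in the PR.
type FailureInfo struct {
	ExitCode           int32
	TerminationMessage string
	Categories         []string
	ContainerName      string
}

// failureInfoToMap emits only non-zero fields so the stored JSON stays small.
// The key names are assumptions for this sketch.
func failureInfoToMap(fi FailureInfo) map[string]interface{} {
	m := map[string]interface{}{}
	if fi.ExitCode != 0 {
		m["exitCode"] = fi.ExitCode
	}
	if fi.TerminationMessage != "" {
		m["terminationMessage"] = fi.TerminationMessage
	}
	if len(fi.Categories) > 0 {
		m["categories"] = fi.Categories
	}
	if fi.ContainerName != "" {
		m["containerName"] = fi.ContainerName
	}
	return m
}

// exitCodeFromJSON reads the exit code back from stored JSON. Numbers decoded
// into interface{} are float64, so a type assertion and a conversion are
// needed to recover the int32. A missing key yields 0.
func exitCodeFromJSON(raw []byte) (int32, error) {
	var m map[string]interface{}
	if err := json.Unmarshal(raw, &m); err != nil {
		return 0, err
	}
	if v, ok := m["exitCode"].(float64); ok {
		return int32(v), nil
	}
	return 0, nil
}

func main() {
	raw, err := json.Marshal(failureInfoToMap(FailureInfo{ExitCode: 137, ContainerName: "main"}))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))

	code, err := exitCodeFromJSON(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(code)
}
```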
## What type of PR is this?
Feature (PR 4 of 4)
## What this PR does / why we need it
Adds end-to-end test cases that verify error categories flow through the
full Armada pipeline, asserting directly from the public event stream.
- Extends `assertEventFailed` in the testsuite event watcher to compare
expected vs actual categories on `JobFailedEvent` (sorted exact match)
- Adds 2 categorization test cases: OOM kill (`dd` fills tmpfs to
trigger OOMKilled) and user_error (exit code 1)
- Test YAML declares expected categories inline on the `failed` event:
```yaml
expectedEvents:
  - submitted:
  - failed:
      categories: ["oom"]
```
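A "sorted exact match" comparison like the one described for `assertEventFailed` could be written as below. This is only a sketch of the idea; `categoriesMatch` is a hypothetical helper name, not the function used in the testsuite.

```go
package main

import (
	"fmt"
	"sort"
)

// categoriesMatch reports whether expected and actual contain exactly the
// same category names, ignoring order. Inputs are copied before sorting so
// the caller's slices are left untouched.
func categoriesMatch(expected, actual []string) bool {
	if len(expected) != len(actual) {
		return false
	}
	e := append([]string(nil), expected...)
	a := append([]string(nil), actual...)
	sort.Strings(e)
	sort.Strings(a)
	for i := range e {
		if e[i] != a[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(categoriesMatch([]string{"oom", "infra"}, []string{"infra", "oom"})) // prints true
	fmt.Println(categoriesMatch([]string{"oom"}, []string{"oom", "user_error"}))     // prints false
}
```

An exact match (rather than a subset check) means a test also fails when an unexpected extra category is attached to the event.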
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- This is PR 4 of 4: Proto + classifier (#4741) -> Wire into executor
(#4745) -> Store in lookout DB + UI + public event (#4755) -> e2e tests
(this)
- Depends on #4755 (which added `categories` field to `JobFailedEvent`)
- Categories are asserted from the event stream, not from the Lookout
API - no HTTP polling or additional infrastructure needed
- CI executor config needs `errorCategories` rules added separately
(included in `e2e/config/executor_config.yaml`)
- Both tests validated locally against a full local dev stack
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
## What type of PR is this?
Feature (PR 1 of 4)
## What this PR does / why we need it
Adds the proto schema and shared building blocks for error categorization in Armada.
- `FailureInfo` message to the `Error` proto with fields: `exit_code`, `termination_message`, `categories`, `container_name`
- `errormatch` package (`internal/common/errormatch/`) with shared matching primitives: `ExitCodeMatcher` (In/NotIn operators), `RegexMatcher`, and Kubernetes condition constants (OOMKilled, Evicted, DeadlineExceeded)
- `categorizer` package (`internal/executor/categorizer/`) - configurable classifier that matches pod failures against rules (exit codes, termination messages, Kubernetes conditions) and assigns named categories, with optional `containerName` scoping per rule
- `ExtractFailureInfo()` in `pod_status.go` to extract exit code, termination message, and container name from Kubernetes pod status into the `FailureInfo` proto
- `ErrorCategories` config field under `ApplicationConfiguration` for defining category rules

Nothing is wired into the event reporting path yet - this PR provides the building blocks that PR #4745 connects.
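To make the In/NotIn operators concrete, here is a minimal sketch of an exit-code matcher. The struct fields, the operator constants, and the `matchExitCode` signature are assumptions for illustration and may differ from the `errormatch` package. The sketch also assumes that exit code 0 never matches, since a successful container is not a failure.

```go
package main

import "fmt"

// Operator selects how Values is interpreted. The string values are
// assumptions for this sketch.
type Operator string

const (
	OperatorIn    Operator = "In"
	OperatorNotIn Operator = "NotIn"
)

// ExitCodeMatcher is a hypothetical shape for an exit-code rule.
type ExitCodeMatcher struct {
	Operator Operator
	Values   []int32
}

// matchExitCode reports whether exitCode satisfies the matcher.
// Assumption: exit code 0 never matches, even under NotIn.
func matchExitCode(m ExitCodeMatcher, exitCode int32) bool {
	if exitCode == 0 {
		return false
	}
	found := false
	for _, v := range m.Values {
		if v == exitCode {
			found = true
			break
		}
	}
	switch m.Operator {
	case OperatorIn:
		return found
	case OperatorNotIn:
		return !found
	default:
		return false // unknown operators match nothing
	}
}

func main() {
	in := ExitCodeMatcher{Operator: OperatorIn, Values: []int32{1, 2}}
	notIn := ExitCodeMatcher{Operator: OperatorNotIn, Values: []int32{137}}
	fmt.Println(matchExitCode(in, 1))      // prints true
	fmt.Println(matchExitCode(in, 137))    // prints false
	fmt.Println(matchExitCode(notIn, 1))   // prints true
	fmt.Println(matchExitCode(notIn, 137)) // prints false
}
```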
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- `errormatch` package is in `internal/common/` since it will be reused by the lookout ingester
- `doc.go` explaining config format and validation