Add e2e test cases for error categorization #4760
dejanzele merged 1 commit into armadaproject:master
Greptile Summary
This PR adds end-to-end test cases for the error categorization feature (the final PR in a 4-part series), verifying that OOM kills and user errors flow through the full Armada pipeline and appear with the correct categories in the public event stream.
Confidence Score: 3/5
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Test YAML\nexpectedEvents.failed.categories] --> B{assertEventFailed}
B --> C{reason != empty?}
C -- yes --> D[compile regex\ncheck match]
C -- no --> E{categories\nlen > 0?}
D --> E
E -- yes --> F[assertCategories\nexact sorted match\nslices.Equal]
E -- no --> G[skip check\nreturn nil]
F --> H{match?}
H -- yes --> I[return nil ✓]
H -- no --> J[return error\nexpected categories X but got Y]
G --> I
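The assertion flow in the flowchart can be sketched in Go roughly as follows. This is a minimal illustration rather than the testsuite's actual code: the `expectedFailed` and `failedEvent` types are hypothetical stand-ins for the test YAML block and the public `JobFailedEvent`, and returning an error on a reason mismatch is an assumption, since the flowchart does not show that branch.

```go
package main

import (
	"fmt"
	"regexp"
	"slices"
)

// failedEvent is a hypothetical stand-in for the public JobFailedEvent;
// only the fields needed for the assertion are modelled.
type failedEvent struct {
	Reason     string
	Categories []string
}

// expectedFailed is a hypothetical mirror of expectedEvents.failed in a test YAML.
type expectedFailed struct {
	Reason     string   // optional regex; empty means "do not check"
	Categories []string // optional; empty means "do not check"
}

// assertCategories compares the two lists as sorted, exact matches.
// Inputs are cloned so the caller's slices are not reordered.
func assertCategories(expected, actual []string) error {
	want := slices.Clone(expected)
	got := slices.Clone(actual)
	slices.Sort(want)
	slices.Sort(got)
	if !slices.Equal(want, got) {
		return fmt.Errorf("expected categories %v but got %v", want, got)
	}
	return nil
}

// assertEventFailed checks the reason regex first (if set), then the
// categories (if any are expected). With nothing expected it returns nil.
func assertEventFailed(expected expectedFailed, actual failedEvent) error {
	if expected.Reason != "" {
		re, err := regexp.Compile(expected.Reason)
		if err != nil {
			return fmt.Errorf("invalid reason regex %q: %w", expected.Reason, err)
		}
		// Assumption: a reason mismatch fails the assertion immediately.
		if !re.MatchString(actual.Reason) {
			return fmt.Errorf("reason %q does not match %q", actual.Reason, expected.Reason)
		}
	}
	if len(expected.Categories) > 0 {
		return assertCategories(expected.Categories, actual.Categories)
	}
	return nil
}

func main() {
	expected := expectedFailed{Categories: []string{"user_error", "oom"}}
	actual := failedEvent{Reason: "OOMKilled", Categories: []string{"oom", "user_error"}}
	fmt.Println(assertEventFailed(expected, actual))
}
```

Sorting copies before comparing makes the check order-insensitive while still rejecting extra or missing categories, which is what "exact sorted match" implies.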
## What type of PR is this?

Feature (PR 1 of 4)

## What this PR does / why we need it

Adds the proto schema and shared building blocks for error categorization in Armada.

- Adds `FailureInfo` message to the `Error` proto with fields: `exit_code`, `termination_message`, `categories`, `container_name`
- Adds `errormatch` package (`internal/common/errormatch/`) with shared matching primitives: `ExitCodeMatcher` (In/NotIn operators), `RegexMatcher`, and Kubernetes condition constants (OOMKilled, Evicted, DeadlineExceeded)
- Adds `categorizer` package (`internal/executor/categorizer/`) - configurable classifier that matches pod failures against rules (exit codes, termination messages, Kubernetes conditions) and assigns named categories, with optional `containerName` scoping per rule
- Adds `ExtractFailureInfo()` in `pod_status.go` to extract exit code, termination message, and container name from Kubernetes pod status into the `FailureInfo` proto
- Adds `ErrorCategories` config field under `ApplicationConfiguration` for defining category rules

Nothing is wired into the event reporting path yet - this PR provides the building blocks that PR #4745 connects.

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 1 of 4: Proto + classifier (this) -> Wire into executor (#4745) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- The `errormatch` package is in `internal/common/` since it will be reused by the lookout ingester
- The categorizer has thorough `doc.go` explaining config format and validation
- Exit code 0 containers are skipped during classification (only failures are categorized)
- Rules within a category are OR'd; categories are evaluated independently

---------

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
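To make the matching semantics described in the PR 1 of 4 notes concrete (In/NotIn exit-code matching, rules OR'd within a category, categories evaluated independently, exit code 0 skipped), here is a small Go sketch. All type and function names are hypothetical and do not reproduce the real `errormatch` or `categorizer` APIs. Kubernetes-condition rules and `containerName` scoping are left out, and the category names are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"slices"
)

// operator is a hypothetical enum for exit-code matching.
type operator string

const (
	opIn    operator = "In"
	opNotIn operator = "NotIn"
)

// exitCodeMatcher is a hypothetical analogue of an In/NotIn exit-code matcher.
type exitCodeMatcher struct {
	Operator operator
	Values   []int32
}

func (m exitCodeMatcher) matches(code int32) bool {
	found := slices.Contains(m.Values, code)
	switch m.Operator {
	case opIn:
		return found
	case opNotIn:
		return !found
	default:
		return false // unknown operators never match in this sketch
	}
}

// rule is a single predicate over the failure details.
type rule func(exitCode int32, terminationMessage string) bool

func exitCodeRule(m exitCodeMatcher) rule {
	return func(code int32, _ string) bool { return m.matches(code) }
}

func messageRule(pattern string) rule {
	re := regexp.MustCompile(pattern)
	return func(_ int32, msg string) bool { return re.MatchString(msg) }
}

// category groups rules under a name; any matching rule assigns the category.
type category struct {
	Name  string
	Rules []rule
}

// classify returns every category with at least one matching rule.
// Exit code 0 is skipped, since only failures are categorized.
func classify(categories []category, exitCode int32, terminationMessage string) []string {
	if exitCode == 0 {
		return nil
	}
	var matched []string
	for _, c := range categories {
		for _, r := range c.Rules {
			if r(exitCode, terminationMessage) {
				matched = append(matched, c.Name)
				break // rules are OR'd: one match is enough
			}
		}
	}
	return matched
}

func main() {
	categories := []category{
		{Name: "user_error", Rules: []rule{exitCodeRule(exitCodeMatcher{Operator: opIn, Values: []int32{1, 2}})}},
		{Name: "example_message_match", Rules: []rule{messageRule("disk quota")}},
	}
	fmt.Println(classify(categories, 1, "disk quota exceeded"))
}
```

Because each category is checked on its own, a single failure can carry several category names at once.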
## What type of PR is this?

Feature (PR 2 of 4)

## What this PR does / why we need it

Wires the classifier and FailureInfo into the executor's event reporting path.

- Constructs the `Classifier` from config at executor startup unconditionally (with no rules configured, it simply returns empty categories)
- Passes classifier to the pod issue handler and job state reporter
- Calls `ExtractFailureInfo()` + `classifier.Classify()` on every pod failure, attaching structured `FailureInfo` (exit code, termination message, categories, container name) to the `Error` events sent through Pulsar
- After this PR, every failed pod event flowing into Pulsar carries exit code, termination message, container name, and matched category names

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 2 of 4: Proto + classifier (#4741) -> Wire into executor (this) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- Depends on #4741
- The classifier is always instantiated at startup, even with no rules configured - no conditional nil-checking needed at call sites
- Changes are in `application.go` (startup wiring), `pod_issue_handler.go`, `job_state_reporter.go`, `reporter/event.go`, `cluster_allocation.go`, and `preempt_runs.go`

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
## What type of PR is this?

Feature (PR 3 of 4)

## What this PR does / why we need it

Stores FailureInfo in the lookout database, exposes it via the Lookout API and public event API, and displays it in the Lookout UI.

- Adds DB migration (031) to add a `failure_info` JSONB column to `job_run`
- Extracts `FailureInfo` from Error events in the lookout ingester and stores as JSONB (exit code, termination message, categories, container name)
- Adds `failureInfo` to the Lookout v2 Swagger spec and threads it through the query builder, model, and conversions layer
- Displays FailureInfo in the Lookout UI sidebar on failed job runs: container name, exit code, termination message, and matched categories
- Adds Error Categories column to the jobs table (display-only, filtering is not yet supported since categories are stored in JSONB and not a server-side filterable field)
- Propagates error categories in the public `JobFailedEvent` proto, so categories are available in the event stream (used by the e2e testsuite in PR #4760)
- Adds executor error category config to `_local/executor/config.yaml` for local development

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 3 of 4: Proto + classifier (#4741) -> Wire into executor (#4745) -> Store in lookout DB + UI + public event (this) -> e2e tests (#4760)
- Depends on #4745
- JSONB round-trip means `exitCode` (int32 in proto) arrives as float64 after PostgreSQL JSON deserialization - `failureInfoToSwagger()` handles this via type assertion
- `failureInfoToMap()` only emits non-zero fields to keep the JSONB clean
- Uses `coalesce(tmp.failure_info, job_run.failure_info)` in the batch upsert so earlier partial updates aren't overwritten
- Only terminal errors have their FailureInfo persisted (`if e.Terminal` guard in the ingester) to prevent transient failures from overwriting the final classification
- Categories on `JobFailedEvent` are set in `FromInternalJobErrors()` conversion from `FailureInfo.Categories`

<img width="523" height="310" alt="image" src="https://github.com/user-attachments/assets/8e018b92-6b90-46a1-8b1f-81c44ffed4c3" />
<img width="1728" height="410" alt="image" src="https://github.com/user-attachments/assets/b79a4b28-b713-44ba-99a8-5297104a9edc" />

---------

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
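The JSONB round-trip note in the PR 3 of 4 description can be reproduced with plain `encoding/json`: a number decoded into `map[string]any` comes back as `float64`, so reading the exit code needs a type assertion. The sketch below is illustrative only. `toSparseMap`, `roundTrip`, and `exitCodeOf` are hypothetical helpers rather than the PR's `failureInfoToMap()` or `failureInfoToSwagger()`, and the JSON key names are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toSparseMap builds a map containing only the non-zero fields, which keeps
// the stored JSON small. Key names are assumed for illustration.
func toSparseMap(exitCode int32, terminationMessage string, categories []string, containerName string) map[string]any {
	m := map[string]any{}
	if exitCode != 0 {
		m["exitCode"] = exitCode
	}
	if terminationMessage != "" {
		m["terminationMessage"] = terminationMessage
	}
	if len(categories) > 0 {
		m["categories"] = categories
	}
	if containerName != "" {
		m["containerName"] = containerName
	}
	return m
}

// roundTrip simulates storing the map as JSON and reading it back generically.
func roundTrip(m map[string]any) (map[string]any, error) {
	raw, err := json.Marshal(m)
	if err != nil {
		return nil, err
	}
	var out map[string]any
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	return out, nil
}

// exitCodeOf reads the exit code whether it is still an int32 (never
// serialized) or a float64 (decoded from JSON into an untyped map).
func exitCodeOf(m map[string]any) (int32, bool) {
	switch v := m["exitCode"].(type) {
	case float64:
		return int32(v), true
	case int32:
		return v, true
	default:
		return 0, false
	}
}

func main() {
	stored, err := roundTrip(toSparseMap(137, "", []string{"oom"}, "main"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%T\n", stored["exitCode"])
	code, ok := exitCodeOf(stored)
	fmt.Println(code, ok)
}
```

Handling both `float64` and `int32` in one switch lets the same reader work on freshly built maps and on values that have passed through the database.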
Assert error categories directly from the public JobFailedEvent in the
event stream, rather than polling the Lookout API. Categories now flow
through the event API (added in parent commit), so the testsuite can
validate them inline with existing event assertions.
Test YAMLs declare expected categories on the failed event:

    expectedEvents:
      - failed:
          categories: ["oom"]

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
What type of PR is this?
Feature (PR 4 of 4)
What this PR does / why we need it
Adds end-to-end test cases that verify error categories flow through the full Armada pipeline, asserting directly from the public event stream.
- `assertEventFailed` in the testsuite event watcher compares expected vs actual categories on `JobFailedEvent` (sorted exact match)
- Test cases cover oom (`dd` fills tmpfs to trigger OOMKilled) and user_error (exit code 1)
- Test YAMLs declare the expected categories on the `failed` event
Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
Special notes for your reviewer
- Depends on the parent PR (which adds the `categories` field to `JobFailedEvent`)
- `errorCategories` rules added separately (included in `e2e/config/executor_config.yaml`)