Store FailureInfo in lookout DB and expose via API #4755
dejanzele merged 3 commits into armadaproject:master
@dejanzele let's see if greptile works. hey @greptile please review this PR
Greptile Summary
This PR (3 of 4 in the error-categorisation series) wires Key observations:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Executor
participant Pulsar
participant LookoutIngester
participant PostgreSQL
participant LookoutAPI
participant LookoutUI
participant PublicEventAPI
Executor->>Pulsar: JobRunErrors (Error{Terminal=true, FailureInfo{exitCode, terminationMessage, categories, containerName}})
Pulsar->>LookoutIngester: consume event
LookoutIngester->>LookoutIngester: failureInfoToMap(fi) → map[string]any
LookoutIngester->>PostgreSQL: UPDATE job_run SET failure_info = COALESCE($11, failure_info)
LookoutIngester->>PostgreSQL: (batch path) COPY to tmp table → UPDATE with COALESCE
LookoutUI->>LookoutAPI: GET /api/v1/jobs
LookoutAPI->>PostgreSQL: json_build_object(..., 'failureInfo', failure_info) json_agg
PostgreSQL-->>LookoutAPI: runs JSONB (exitCode as float64, categories as []interface{})
LookoutAPI->>LookoutAPI: failureInfoToSwagger(map) → *RunFailureInfo
LookoutAPI-->>LookoutUI: Job{runs[{failureInfo{exitCode, terminationMessage, categories, containerName}}]}
LookoutUI->>LookoutUI: Display sidebar + ErrorCategories column
Executor->>PublicEventAPI: JobErrors → FromInternalJobErrors
PublicEventAPI->>PublicEventAPI: failed.Categories = msgErr.GetFailureInfo().GetCategories()
PublicEventAPI-->>PublicEventAPI: JobFailedEvent{categories: [...]}
@greptile
@greptile
@greptile
@greptile
## What type of PR is this?
Feature (PR 1 of 4)
## What this PR does / why we need it
Adds the proto schema and shared building blocks for error categorization in Armada.
- Adds `FailureInfo` message to the `Error` proto with fields: `exit_code`, `termination_message`, `categories`, `container_name`
- Adds `errormatch` package (`internal/common/errormatch/`) with shared matching primitives: `ExitCodeMatcher` (In/NotIn operators), `RegexMatcher`, and Kubernetes condition constants (OOMKilled, Evicted, DeadlineExceeded)
- Adds `categorizer` package (`internal/executor/categorizer/`) - configurable classifier that matches pod failures against rules (exit codes, termination messages, Kubernetes conditions) and assigns named categories, with optional `containerName` scoping per rule
- Adds `ExtractFailureInfo()` in `pod_status.go` to extract exit code, termination message, and container name from Kubernetes pod status into the `FailureInfo` proto
- Adds `ErrorCategories` config field under `ApplicationConfiguration` for defining category rules

Nothing is wired into the event reporting path yet - this PR provides the building blocks that PR #4745 connects.
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- This is PR 1 of 4: Proto + classifier (this) -> Wire into executor (#4745) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- The `errormatch` package is in `internal/common/` since it will be reused by the lookout ingester
- The categorizer has thorough `doc.go` explaining config format and validation
- Exit code 0 containers are skipped during classification (only failures are categorized)
- Rules within a category are OR'd; categories are evaluated independently

---------
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
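The matching rules described above (In/NotIn exit-code matching, regex matching on the termination message, rules OR'd within a category, exit code 0 skipped) can be sketched roughly as follows. This is a minimal illustration, not the code from `errormatch` or `categorizer`: the `Operator`, `Rule`, `Category`, and `Classify` shapes are assumptions, and Kubernetes conditions and `containerName` scoping are left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// Operator is an assumed representation of the In/NotIn operators.
type Operator string

const (
	OpIn    Operator = "In"
	OpNotIn Operator = "NotIn"
)

// ExitCodeMatcher is a hypothetical stand-in for the errormatch type of the same name.
type ExitCodeMatcher struct {
	Operator Operator
	Values   []int32
}

// Matches reports whether code satisfies the operator against Values.
func (m ExitCodeMatcher) Matches(code int32) bool {
	found := false
	for _, v := range m.Values {
		if v == code {
			found = true
			break
		}
	}
	switch m.Operator {
	case OpIn:
		return found
	case OpNotIn:
		return !found
	default:
		return false // unknown operators never match in this sketch
	}
}

// Rule holds one matcher; in this sketch a rule sets exactly one field.
type Rule struct {
	ExitCodes          *ExitCodeMatcher
	TerminationMessage *regexp.Regexp
}

// Category is a named group of rules that are OR'd together.
type Category struct {
	Name  string
	Rules []Rule
}

// Classify returns the names of all categories with at least one matching rule.
// Categories are evaluated independently, so several may match.
func Classify(cats []Category, exitCode int32, message string) []string {
	if exitCode == 0 {
		return nil // only failures are categorized
	}
	var out []string
	for _, c := range cats {
		for _, r := range c.Rules {
			byCode := r.ExitCodes != nil && r.ExitCodes.Matches(exitCode)
			byMsg := r.TerminationMessage != nil && r.TerminationMessage.MatchString(message)
			if byCode || byMsg {
				out = append(out, c.Name)
				break // one matching rule is enough for this category
			}
		}
	}
	return out
}

func main() {
	cats := []Category{
		{Name: "user_error", Rules: []Rule{{ExitCodes: &ExitCodeMatcher{Operator: OpIn, Values: []int32{1, 2}}}}},
		{Name: "cuda", Rules: []Rule{{TerminationMessage: regexp.MustCompile(`(?i)cuda error`)}}},
	}
	fmt.Println(Classify(cats, 1, "CUDA error: out of memory"))
	fmt.Println(Classify(cats, 0, "CUDA error: out of memory"))
}
```

The category names `user_error` and `cuda` are illustrative only; the real names come from the `ErrorCategories` config.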
## What type of PR is this?
Feature (PR 2 of 4)
## What this PR does / why we need it
Wires the classifier and FailureInfo into the executor's event reporting path.
- Constructs the `Classifier` from config at executor startup unconditionally (with no rules configured, it simply returns empty categories)
- Passes classifier to the pod issue handler and job state reporter
- Calls `ExtractFailureInfo()` + `classifier.Classify()` on every pod failure, attaching structured `FailureInfo` (exit code, termination message, categories, container name) to the `Error` events sent through Pulsar
- After this PR, every failed pod event flowing into Pulsar carries exit code, termination message, container name, and matched category names
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- This is PR 2 of 4: Proto + classifier (#4741) -> Wire into executor (this) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- Depends on #4741
- The classifier is always instantiated at startup, even with no rules configured - no conditional nil-checking needed at call sites
- Changes are in `application.go` (startup wiring), `pod_issue_handler.go`, `job_state_reporter.go`, `reporter/event.go`, `cluster_allocation.go`, and `preempt_runs.go`

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
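To illustrate why an always-constructed classifier removes nil checks at call sites, here is a rough sketch. All types are hypothetical stand-ins (the real `Classifier` and `FailureInfo` live in the executor and the generated proto code), and the rule format is reduced to "category name -> exit codes" purely for brevity.

```go
package main

import "fmt"

// FailureInfo mirrors the four fields listed in the PR description;
// this struct is a stand-in for the generated proto type.
type FailureInfo struct {
	ExitCode           int32
	TerminationMessage string
	Categories         []string
	ContainerName      string
}

// Classifier is a deliberately tiny, assumed classifier keyed by exit code.
type Classifier struct {
	byExitCode map[int32][]string
}

// NewClassifier always returns a usable classifier, even for nil or empty rules.
func NewClassifier(rules map[string][]int32) *Classifier {
	c := &Classifier{byExitCode: map[int32][]string{}}
	for name, codes := range rules {
		for _, code := range codes {
			c.byExitCode[code] = append(c.byExitCode[code], name)
		}
	}
	return c
}

// Classify returns no categories when nothing matches or no rules exist.
func (c *Classifier) Classify(exitCode int32) []string {
	return c.byExitCode[exitCode]
}

// buildFailureInfo shows a call site: no "if classifier != nil" branch is needed.
func buildFailureInfo(c *Classifier, container string, exitCode int32, msg string) FailureInfo {
	return FailureInfo{
		ExitCode:           exitCode,
		TerminationMessage: msg,
		ContainerName:      container,
		Categories:         c.Classify(exitCode),
	}
}

func main() {
	noRules := NewClassifier(nil)
	withRules := NewClassifier(map[string][]int32{"user_error": {1}})
	fmt.Printf("%+v\n", buildFailureInfo(noRules, "main", 1, "boom"))
	fmt.Printf("%+v\n", buildFailureInfo(withRules, "main", 1, "boom"))
}
```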
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Show failure condition, categories, and termination message in the job run details sidebar. Condition enums are mapped to human-friendly labels and categories are title-cased after stripping the ERROR_CATEGORY_ prefix.
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
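The label transformation in this commit (strip the `ERROR_CATEGORY_` prefix, then title-case) could look roughly like the sketch below. It is written in Go only to illustrate the string handling; the Lookout UI code itself is not shown in this conversation, and the function name `categoryLabel` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// categoryLabel turns an enum-style name such as ERROR_CATEGORY_USER_ERROR
// into a human-friendly label. It assumes ASCII names separated by underscores.
func categoryLabel(raw string) string {
	trimmed := strings.TrimPrefix(raw, "ERROR_CATEGORY_")
	words := strings.Split(strings.ToLower(trimmed), "_")
	for i, w := range words {
		if w == "" {
			continue // tolerate doubled or trailing underscores
		}
		words[i] = strings.ToUpper(w[:1]) + w[1:]
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(categoryLabel("ERROR_CATEGORY_USER_ERROR"))
	fmt.Println(categoryLabel("ERROR_CATEGORY_OOM"))
}
```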
Add categories field to JobFailedEvent proto so error categories flow through the public event API, not just Lookout. The conversion layer extracts categories from FailureInfo on terminal errors and attaches them to the public event.
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
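A rough sketch of that conversion step follows. The structs are simplified stand-ins for the generated proto types, and `toJobFailedEvent` is a hypothetical name (the PR describes the real logic as living in `FromInternalJobErrors()`). The nil-safe getters mimic the accessor style that Go protobuf code generation produces, which is what makes the chained `GetFailureInfo().GetCategories()` call safe when no `FailureInfo` is attached.

```go
package main

import "fmt"

// Simplified stand-ins for the generated proto messages.
type FailureInfo struct{ Categories []string }

type Error struct {
	Terminal    bool
	FailureInfo *FailureInfo
}

type JobFailedEvent struct{ Categories []string }

// GetFailureInfo is nil-safe, like generated protobuf getters.
func (e *Error) GetFailureInfo() *FailureInfo {
	if e == nil {
		return nil
	}
	return e.FailureInfo
}

// GetCategories is nil-safe, so it can be chained on a missing FailureInfo.
func (f *FailureInfo) GetCategories() []string {
	if f == nil {
		return nil
	}
	return f.Categories
}

// toJobFailedEvent builds the public event from the first terminal error.
// Non-terminal errors are ignored; nil is returned if none is terminal.
func toJobFailedEvent(errs []*Error) *JobFailedEvent {
	for _, e := range errs {
		if e != nil && e.Terminal {
			return &JobFailedEvent{Categories: e.GetFailureInfo().GetCategories()}
		}
	}
	return nil
}

func main() {
	errs := []*Error{
		{Terminal: false, FailureInfo: &FailureInfo{Categories: []string{"transient"}}},
		{Terminal: true, FailureInfo: &FailureInfo{Categories: []string{"oom"}}},
	}
	fmt.Println(toJobFailedEvent(errs).Categories)
}
```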
## What type of PR is this?
Feature (PR 4 of 4)
## What this PR does / why we need it
Adds end-to-end test cases that verify error categories flow through the
full Armada pipeline, asserting directly from the public event stream.
- Extends `assertEventFailed` in the testsuite event watcher to compare
expected vs actual categories on `JobFailedEvent` (sorted exact match)
- Adds 2 categorization test cases: OOM kill (`dd` fills tmpfs to
trigger OOMKilled) and user_error (exit code 1)
- Test YAML declares expected categories inline on the `failed` event:
```yaml
expectedEvents:
  - submitted:
  - failed:
      categories: ["oom"]
```
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- This is PR 4 of 4: Proto + classifier (#4741) -> Wire into executor
(#4745) -> Store in lookout DB + UI + public event (#4755) -> e2e tests
(this)
- Depends on #4755 (which added `categories` field to `JobFailedEvent`)
- Categories are asserted from the event stream, not from the Lookout
API - no HTTP polling or additional infrastructure needed
- CI executor config needs `errorCategories` rules added separately
(included in `e2e/config/executor_config.yaml`)
- Both tests validated locally against a full local dev stack
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
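The "sorted exact match" comparison mentioned above can be sketched as follows. This is an illustration only: `categoriesMatch` is a hypothetical helper name, not the function added to `assertEventFailed`.

```go
package main

import (
	"fmt"
	"sort"
)

// categoriesMatch reports whether both slices contain exactly the same
// category names, ignoring order. Inputs are copied so callers' slices
// are not reordered.
func categoriesMatch(expected, actual []string) bool {
	if len(expected) != len(actual) {
		return false
	}
	e := append([]string(nil), expected...)
	a := append([]string(nil), actual...)
	sort.Strings(e)
	sort.Strings(a)
	for i := range e {
		if e[i] != a[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(categoriesMatch([]string{"oom", "user_error"}, []string{"user_error", "oom"}))
	fmt.Println(categoriesMatch([]string{"oom"}, []string{"oom", "user_error"}))
}
```

An exact match (rather than a subset check) means a test also fails when the executor reports an unexpected extra category.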
## What type of PR is this?
Feature (PR 3 of 4)
## What this PR does / why we need it
Stores FailureInfo in the lookout database, exposes it via the Lookout API and public event API, and displays it in the Lookout UI.
- Adds a `failure_info` JSONB column to `job_run`
- Extracts `FailureInfo` from Error events in the lookout ingester and stores it as JSONB (exit code, termination message, categories, container name)
- Adds `failureInfo` to the Lookout v2 Swagger spec and threads it through the query builder, model, and conversions layer
- Adds a `categories` field to the `JobFailedEvent` proto, so categories are available in the event stream (used by the e2e testsuite in PR #4760)
- Updates `_local/executor/config.yaml` for local development
## Which issue(s) this PR fixes
Part of #4713 (Error Categorization)
## Special notes for your reviewer
- `exitCode` (int32 in proto) arrives as float64 after PostgreSQL JSON deserialization - `failureInfoToSwagger()` handles this via type assertion
- `failureInfoToMap()` only emits non-zero fields to keep the JSONB clean
- Uses `coalesce(tmp.failure_info, job_run.failure_info)` in the batch upsert so earlier partial updates aren't overwritten
- FailureInfo is only stored for terminal errors (`if e.Terminal` guard in the ingester) to prevent transient failures from overwriting the final classification
- Categories on `JobFailedEvent` are set in the `FromInternalJobErrors()` conversion from `FailureInfo.Categories`
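The float64 note is a general property of Go's `encoding/json`: numbers decoded into `map[string]any` become `float64`, and arrays become `[]any`. The sketch below shows a non-zero-only encoder and a decoder that handles this with type assertions. It is an assumption-laden illustration, not the PR's code: `failureInfoToMap` here is a re-imagined version of the function named above, `decodeFailureInfo` and `roundTrip` are hypothetical helpers, and the JSON keys follow the field names shown in the sequence diagram.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FailureInfo is a stand-in for the proto message.
type FailureInfo struct {
	ExitCode           int32
	TerminationMessage string
	Categories         []string
	ContainerName      string
}

// failureInfoToMap emits only non-zero fields so the stored JSON stays small.
func failureInfoToMap(fi FailureInfo) map[string]any {
	m := map[string]any{}
	if fi.ExitCode != 0 {
		m["exitCode"] = fi.ExitCode
	}
	if fi.TerminationMessage != "" {
		m["terminationMessage"] = fi.TerminationMessage
	}
	if len(fi.Categories) > 0 {
		m["categories"] = fi.Categories
	}
	if fi.ContainerName != "" {
		m["containerName"] = fi.ContainerName
	}
	return m
}

// decodeFailureInfo reads the JSON back. Numbers arrive as float64 and
// arrays as []any, so each field is recovered with a type assertion.
func decodeFailureInfo(raw []byte) (FailureInfo, error) {
	var m map[string]any
	if err := json.Unmarshal(raw, &m); err != nil {
		return FailureInfo{}, err
	}
	var fi FailureInfo
	if v, ok := m["exitCode"].(float64); ok {
		fi.ExitCode = int32(v)
	}
	if v, ok := m["terminationMessage"].(string); ok {
		fi.TerminationMessage = v
	}
	if v, ok := m["containerName"].(string); ok {
		fi.ContainerName = v
	}
	if vs, ok := m["categories"].([]any); ok {
		for _, v := range vs {
			if s, ok := v.(string); ok {
				fi.Categories = append(fi.Categories, s)
			}
		}
	}
	return fi, nil
}

// roundTrip encodes and decodes, standing in for a write to and read from the JSONB column.
func roundTrip(fi FailureInfo) (FailureInfo, error) {
	raw, err := json.Marshal(failureInfoToMap(fi))
	if err != nil {
		return FailureInfo{}, err
	}
	return decodeFailureInfo(raw)
}

func main() {
	got, err := roundTrip(FailureInfo{ExitCode: 137, Categories: []string{"oom"}, ContainerName: "main"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", got)
}
```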