Skip to content

Store FailureInfo in lookout DB and expose via API#4755

Merged
dejanzele merged 3 commits intoarmadaproject:masterfrom
dejanzele:lookout-stores-failure-info
Apr 2, 2026
Merged

Store FailureInfo in lookout DB and expose via API#4755
dejanzele merged 3 commits intoarmadaproject:masterfrom
dejanzele:lookout-stores-failure-info

Conversation

@dejanzele
Copy link
Copy Markdown
Member

@dejanzele dejanzele commented Mar 9, 2026

What type of PR is this?

Feature (PR 3 of 4)

What this PR does / why we need it

Stores FailureInfo in the lookout database, exposes it via the Lookout API and public event API, and displays it in the Lookout UI.

  • Adds DB migration (031) to add a failure_info JSONB column to job_run
  • Extracts FailureInfo from Error events in the lookout ingester and stores as JSONB (exit code, termination message, categories, container name)
  • Adds failureInfo to the Lookout v2 Swagger spec and threads it through the query builder, model, and conversions layer
  • Displays FailureInfo in the Lookout UI sidebar on failed job runs: container name, exit code, termination message, and matched categories
  • Adds Error Categories column to the jobs table (display-only, filtering is not yet supported since categories are stored in JSONB and not a server-side filterable field)
  • Propagates error categories in the public JobFailedEvent proto, so categories are available in the event stream (used by the e2e testsuite in PR Add e2e test cases for error categorization #4760)
  • Adds executor error category config to _local/executor/config.yaml for local development

Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

Special notes for your reviewer

image image

@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch from 2ff8836 to 7a1b2c0 Compare March 9, 2026 10:21
@pavlovic-ivan
Copy link
Copy Markdown
Collaborator

@dejanzele let's see if greptile works. hey @greptile please review this PR

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR (3 of 4 in the error-categorisation series) wires FailureInfo end-to-end: a DB migration adds a failure_info JSONB column to job_run, the lookout ingester extracts and persists it for terminal errors only, the Lookout v2 query layer surfaces it through the Swagger API, the UI sidebar shows container name / termination message / categories, and a new read-only "Error Categories" column is added to the jobs table. Categories are also propagated onto the public JobFailedEvent proto for use by downstream consumers and e2e tests.

Key observations:

  • The write path (failureInfoToMap) correctly skips zero-value fields and returns nil for an all-zero FailureInfo, preventing empty {} JSONB from being written and inadvertently overwriting existing data via COALESCE.
  • The e.Terminal guard ensures only final, unrecoverable errors persist FailureInfo, keeping transient failures from polluting the stored classification.
  • The read path (failureInfoToSwagger) handles the float64 type produced by PostgreSQL JSON round-trip for exitCode, but lacks an int32 fallback — the same gap that was already noted for categories ([]string vs []interface{}). The function also has no unit tests in convert_test.go.
  • JobsTableContainer.tsx fixes a stale useMemo dependency array (columnFilterState/allColumnscolumnMatches) as a drive-by improvement, but introduces a minor reference-stability issue with data ?? [].
  • A stale comment in pkg/armadaevents/events.proto references a condition field that does not exist in FailureInfo.

Confidence Score: 4/5

  • Safe to merge with minor follow-up: the exitCode int32 type-assertion gap is a latent bug but does not affect the production read path through PostgreSQL.
  • The core write path (ingester → DB) and read path (query builder → Swagger) are correct and well-tested at the integration level. The one P1 issue (exitCode missing int32 branch in failureInfoToSwagger) only manifests outside the production PostgreSQL round-trip, so it carries no immediate runtime risk. All other findings are style/documentation level. The DB migration is safe (nullable column, COALESCE-guarded updates). Proto changes are backward-compatible (new optional field with next available number).
  • internal/lookout/conversions/convert.gofailureInfoToSwagger needs an int32 fallback for exitCode and unit test coverage.

Important Files Changed

Filename Overview
internal/lookoutingester/instructions/instructions.go Adds failureInfoToMap helper and correctly guards FailureInfo persistence behind e.Terminal; includes truncation for terminationMessage and nil-return for empty maps.
internal/lookout/conversions/convert.go Adds failureInfoToSwagger for reading FailureInfo from the JSONB map; exitCode only type-asserts to float64, silently dropping int32 values (same gap as the already-flagged categories issue), and the function has no unit tests.
internal/lookoutingester/lookoutdb/insertion.go Correctly threads failure_info JSONB through temp-table batch upsert and scalar UPDATE, both guarded by COALESCE to avoid overwriting existing data.
internal/lookout/repository/querybuilder.go Adds failure_info to the json_build_object call in the lateral subquery so it is included in the aggregated runs JSONB returned to callers.
internal/lookoutui/src/pages/jobs/components/JobsTableContainer.tsx Fixes the useMemo dependency array for selectedItemsFilters; filteredData = data ?? [] introduces a new array reference each render when data is null, causing unnecessary memo invalidation.
pkg/armadaevents/events.proto Inline comment for the failure_info field references a non-existent condition field ("exit code, condition, categories") — stale from an earlier design iteration.

Sequence Diagram

sequenceDiagram
    participant Executor
    participant Pulsar
    participant LookoutIngester
    participant PostgreSQL
    participant LookoutAPI
    participant LookoutUI
    participant PublicEventAPI

    Executor->>Pulsar: JobRunErrors (Error{Terminal=true, FailureInfo{exitCode, terminationMessage, categories, containerName}})
    Pulsar->>LookoutIngester: consume event
    LookoutIngester->>LookoutIngester: failureInfoToMap(fi) → map[string]any
    LookoutIngester->>PostgreSQL: UPDATE job_run SET failure_info = COALESCE($11, failure_info)
    LookoutIngester->>PostgreSQL: (batch path) COPY to tmp table → UPDATE with COALESCE

    LookoutUI->>LookoutAPI: GET /api/v1/jobs
    LookoutAPI->>PostgreSQL: json_build_object(..., 'failureInfo', failure_info) json_agg
    PostgreSQL-->>LookoutAPI: runs JSONB (exitCode as float64, categories as []interface{})
    LookoutAPI->>LookoutAPI: failureInfoToSwagger(map) → *RunFailureInfo
    LookoutAPI-->>LookoutUI: Job{runs[{failureInfo{exitCode, terminationMessage, categories, containerName}}]}
    LookoutUI->>LookoutUI: Display sidebar + ErrorCategories column

    Executor->>PublicEventAPI: JobErrors → FromInternalJobErrors
    PublicEventAPI->>PublicEventAPI: failed.Categories = msgErr.GetFailureInfo().GetCategories()
    PublicEventAPI-->>PublicEventAPI: JobFailedEvent{categories: [...]}
Loading

Comments Outside Diff (3)

  1. internal/lookout/conversions/convert.go, line 133-138 (link)

    exitCode silently dropped for int32 values

    failureInfoToSwagger only type-asserts exitCode to float64, but failureInfoToMap stores it as int32 (the native Go proto type). The production path is safe because data always flows through PostgreSQL JSON round-trip (int32 → JSON number → float64), but any call to failureInfoToSwagger with a map that was not round-tripped (e.g., a future unit test or in-process conversion) will silently produce ExitCode = 0 with no error.

    This is the same pattern as the already-flagged categories issue. Notably, the previous review thread claimed exitCode "correctly handles both int32 and float64" — looking at the actual code, only float64 is handled.

    Adding an int32 fallback makes the contract explicit and consistent with how failureInfoToMap serialises the value:

    if v, ok := failureInfo["exitCode"].(float64); ok {
        result.ExitCode = int32(v)
        populated = true
    } else if v, ok := failureInfo["exitCode"].(int32); ok {
        result.ExitCode = v
        populated = true
    }

    Also, convert_test.go currently has no test coverage for failureInfoToSwagger. A table-driven test covering both float64 and int32 exit codes (and the []string vs []interface{} categories path) would prevent regressions here.

  2. pkg/armadaevents/events.proto, line 439 (link)

    Stale inline comment references a non-existent condition field

    The comment reads // Structured failure metadata (exit code, condition, categories) but the FailureInfo message has no condition field — its fields are exit_code, termination_message, categories, and container_name. The word "condition" appears to be a leftover from an earlier design that was revised. This can mislead readers into looking for a condition field that does not exist.

  3. internal/lookoutui/src/pages/jobs/components/JobsTableContainer.tsx, line 673-678 (link)

    filteredData = data ?? [] creates a new array reference on every render when data is null/undefined

    const filteredData = data ?? [] is declared inline in the component body. When data is null or undefined, the [] fallback is a new array literal on every render, so filteredData gets a new reference each render, which invalidates the useMemo cache on every render during loading states.

    This is the same behaviour that already existed in the previous data ?? [] expression passed directly to useReactTable, so this PR doesn't regress anything. But now that filteredData is shared across both useMemo and useReactTable, the stable-reference problem is more visible. Hoisting the fallback to a stable constant would fix both call sites:

    const EMPTY_DATA: JobTableRow[] = []
    
    // ...inside the component:
    const filteredData = data ?? EMPTY_DATA

    Or useMemo can be used to keep the stable reference inside the component without a module-level constant.

Reviews (41): Last reviewed commit: "Propagate error categories in public Job..." | Re-trigger Greptile

@dejanzele
Copy link
Copy Markdown
Member Author

@greptile

@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch from da5394c to ba36271 Compare March 9, 2026 13:04
@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch from ba36271 to 3d2e605 Compare March 9, 2026 13:21
@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch from 3d2e605 to 4b6bc89 Compare March 9, 2026 13:39
@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch 3 times, most recently from 655f8bc to 6012479 Compare March 10, 2026 09:42
@dejanzele
Copy link
Copy Markdown
Member Author

@greptile

@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch from 6012479 to fcfbfcf Compare March 10, 2026 12:36
@dejanzele
Copy link
Copy Markdown
Member Author

@greptile

@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch 8 times, most recently from 23eb171 to c233be8 Compare March 10, 2026 16:51
@dejanzele
Copy link
Copy Markdown
Member Author

@greptile

@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch 9 times, most recently from 682b171 to 9212cbd Compare March 19, 2026 14:43
@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch 5 times, most recently from a108da6 to 81e3c5f Compare March 30, 2026 16:04
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch 3 times, most recently from 516d7db to 509638b Compare March 31, 2026 08:20
dejanzele added a commit that referenced this pull request Apr 1, 2026
## What type of PR is this?

Feature (PR 1 of 4)

## What this PR does / why we need it

Adds the proto schema and shared building blocks for error
categorization in Armada.

- Adds `FailureInfo` message to the `Error` proto with fields:
`exit_code`, `termination_message`, `categories`, `container_name`
- Adds `errormatch` package (`internal/common/errormatch/`) with shared
matching primitives: `ExitCodeMatcher` (In/NotIn operators),
`RegexMatcher`, and Kubernetes condition constants (OOMKilled, Evicted,
DeadlineExceeded)
- Adds `categorizer` package (`internal/executor/categorizer/`) -
configurable classifier that matches pod failures against rules (exit
codes, termination messages, Kubernetes conditions) and assigns named
categories, with optional `containerName` scoping per rule
- Adds `ExtractFailureInfo()` in `pod_status.go` to extract exit code,
termination message, and container name from Kubernetes pod status into
the `FailureInfo` proto
- Adds `ErrorCategories` config field under `ApplicationConfiguration`
for defining category rules

Nothing is wired into the event reporting path yet - this PR provides
the building blocks that PR #4745 connects.

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 1 of 4: Proto + classifier (this) -> Wire into executor
(#4745) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- The `errormatch` package is in `internal/common/` since it will be
reused by the lookout ingester
- The categorizer has thorough `doc.go` explaining config format and
validation
- Exit code 0 containers are skipped during classification (only
failures are categorized)
- Rules within a category are OR'd; categories are evaluated
independently

---------

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele added a commit that referenced this pull request Apr 2, 2026
## What type of PR is this?

Feature (PR 2 of 4)

## What this PR does / why we need it

Wires the classifier and FailureInfo into the executor's event reporting
path.

- Constructs the `Classifier` from config at executor startup
unconditionally (with no rules configured, it simply returns empty
categories)
- Passes classifier to the pod issue handler and job state reporter
- Calls `ExtractFailureInfo()` + `classifier.Classify()` on every pod
failure, attaching structured `FailureInfo` (exit code, termination
message, categories, container name) to the `Error` events sent through
Pulsar
- After this PR, every failed pod event flowing into Pulsar carries exit
code, termination message, container name, and matched category names

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 2 of 4: Proto + classifier (#4741) -> Wire into executor
(this) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- Depends on #4741
- The classifier is always instantiated at startup, even with no rules
configured - no conditional nil-checking needed at call sites
- Changes are in `application.go` (startup wiring),
`pod_issue_handler.go`, `job_state_reporter.go`, `reporter/event.go`,
`cluster_allocation.go`, and `preempt_runs.go`

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Show failure condition, categories, and termination message in the job
run details sidebar. Condition enums are mapped to human-friendly labels
and categories are title-cased after stripping the ERROR_CATEGORY_ prefix.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch 2 times, most recently from 96f39a3 to fd0957d Compare April 2, 2026 08:59
Add categories field to JobFailedEvent proto so error categories
flow through the public event API, not just Lookout. The conversion
layer extracts categories from FailureInfo on terminal errors and
attaches them to the public event.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the lookout-stores-failure-info branch from fd0957d to 88d0393 Compare April 2, 2026 11:23
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele merged commit 777e103 into armadaproject:master Apr 2, 2026
32 checks passed
dejanzele added a commit that referenced this pull request Apr 3, 2026
## What type of PR is this?

Feature (PR 4 of 4)

## What this PR does / why we need it

Adds end-to-end test cases that verify error categories flow through the
full Armada pipeline, asserting directly from the public event stream.

- Extends `assertEventFailed` in the testsuite event watcher to compare
expected vs actual categories on `JobFailedEvent` (sorted exact match)
- Adds 2 categorization test cases: OOM kill (`dd` fills tmpfs to
trigger OOMKilled) and user_error (exit code 1)
- Test YAML declares expected categories inline on the `failed` event:
  ```yaml
  expectedEvents:
    - submitted:
    - failed:
        categories: ["oom"]
  ```

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 4 of 4: Proto + classifier (#4741) -> Wire into executor
(#4745) -> Store in lookout DB + UI + public event (#4755) -> e2e tests
(this)
- Depends on #4755 (which added `categories` field to `JobFailedEvent`)
- Categories are asserted from the event stream, not from the Lookout
API - no HTTP polling or additional infrastructure needed
- CI executor config needs `errorCategories` rules added separately
(included in `e2e/config/executor_config.yaml`)
- Both tests validated locally against a full local dev stack

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants