Skip to content

Add e2e test cases for error categorization#4760

Merged
dejanzele merged 1 commit intoarmadaproject:masterfrom
dejanzele:testsuite-error-categories
Apr 3, 2026
Merged

Add e2e test cases for error categorization#4760
dejanzele merged 1 commit intoarmadaproject:masterfrom
dejanzele:testsuite-error-categories

Conversation

@dejanzele
Copy link
Copy Markdown
Member

@dejanzele dejanzele commented Mar 11, 2026

What type of PR is this?

Feature (PR 4 of 4)

What this PR does / why we need it

Adds end-to-end test cases that verify error categories flow through the full Armada pipeline, asserting directly from the public event stream.

  • Extends assertEventFailed in the testsuite event watcher to compare expected vs actual categories on JobFailedEvent (sorted exact match)
  • Adds 2 categorization test cases: OOM kill (dd fills tmpfs to trigger OOMKilled) and user_error (exit code 1)
  • Test YAML declares expected categories inline on the failed event:
    expectedEvents:
      - submitted:
      - failed:
          categories: ["oom"]

Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

Special notes for your reviewer

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds end-to-end test cases for the error categorization feature (the final PR in a 4-part series), verifying that OOM kills and user errors flow through the full Armada pipeline and appear as the correct categories on the public JobFailedEvent stream. It also fixes a pre-existing logic inversion bug in assertEventFailed (the old code returned an error when the failure reason did match the regex, i.e. if re.MatchString(...) without !; the new code correctly errors when it does not match).

Key changes:

  • validators.go: Refactors assertEventFailed to support category assertions with exact sorted-match semantics; fixes the long-standing regex logic inversion.
  • app.go: Extends TestSpecsFromPattern to accept comma-separated glob patterns and updates the Prometheus failure-detection heuristic to recognise "expected categories" errors.
  • magefiles/ci.go: Adds categorization/* to the CI test sweep.
  • docker-compose.yaml + e2e/config/executor_config.yaml: Mounts a new executor config that activates the oom and user_error categorisation rules in docker-compose environments.
  • Two new YAML test specs (oom.yaml, user_error.yaml) define the concrete scenarios.

Notable issues found:

  • The assertCategories guard in validators.go silently skips the check when no expected categories are provided — there is no way to assert "this job should have zero categories" via the event stream path, leaving a coverage gap for future regression tests.
  • TestSpecsFromPattern does not deduplicate file paths, so overlapping comma-separated patterns would run the same spec twice.
  • The docker-compose.yaml change permanently enables the e2e errorCategories rules for all developers using the default dev stack, without a comment explaining the dependency.

Confidence Score: 3/5

  • Safe to merge for CI coverage, but the validators.go category guard leaves a silent coverage gap that may weaken future regression tests.
  • The two new test cases are correct and well-formed. The logic inversion fix in assertEventFailed is a genuine improvement. The main concern is the assertCategories guard that skips the check when expected categories are empty — this makes it impossible to assert zero categories via the event stream, and means existing basic/* tests provide no protection against future category-pollution bugs. The docker-compose change also broadly affects all developer environments without documentation. These are not blocking merge-safety issues, but they reduce the long-term reliability of the test suite.
  • internal/testsuite/eventwatcher/validators.go — the assertCategories guard (line 41) and its interaction with tests that omit categories deserves attention before this pattern is extended further.

Important Files Changed

Filename Overview
internal/testsuite/eventwatcher/validators.go Refactors assertEventFailed to also check categories; fixes a pre-existing logic inversion bug (! was missing on re.MatchString); the new assertCategories helper does exact sorted match, but the guard that invokes it silently skips the check for empty expected categories, making it impossible to assert "zero categories" via the event stream.
internal/testsuite/app.go Extends TestSpecsFromPattern to accept comma-separated glob patterns; adds "expected categories" substring to the Prometheus failure-detection heuristic. No deduplication when patterns produce overlapping paths.
docker-compose.yaml Mounts the e2e executor config and passes --config to the executor, activating errorCategories rules in all docker-compose environments (not just CI e2e). Correct for the feature but adds a permanent dev-stack dependency on the e2e config file.
e2e/config/executor_config.yaml New executor config file defining two errorCategories rules: OOMKilled condition → oom, exit codes {1,2,126,127} → user_error. Rules align correctly with the two test cases.
magefiles/ci.go Extends the --tests flag to include testsuite/testcases/categorization/* alongside basic/*; relies on the new comma-separated glob splitting in TestSpecsFromPattern.
testsuite/testcases/categorization/oom.yaml New e2e test: writes 256 MB to /dev/shm with a 25 Mi memory limit to trigger OOMKilled; asserts the failed event carries categories: ["oom"]. Well-formed for cgroup v2 CI environments.
testsuite/testcases/categorization/user_error.yaml New e2e test: exit 1 triggers the user_error rule (exit codes In [1,2,126,127]); asserts the failed event carries categories: ["user_error"]. Straightforward and correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Test YAML\nexpectedEvents.failed.categories] --> B{assertEventFailed}
    B --> C{reason != empty?}
    C -- yes --> D[compile regex\ncheck match]
    C -- no --> E{categories\nlen > 0?}
    D --> E
    E -- yes --> F[assertCategories\nexact sorted match\nslices.Equal]
    E -- no --> G[skip check\nreturn nil]
    F --> H{match?}
    H -- yes --> I[return nil ✓]
    H -- no --> J[return error\nexpected categories X but got Y]
    G --> I
Loading

Comments Outside Diff (3)

  1. internal/testsuite/eventwatcher/validators.go, line 41-45 (link)

    Silent skip when no categories expected — cannot assert zero categories

    assertCategories is only invoked when the expected list is non-empty (line 41). When a test YAML omits categories: (or sets it to an empty list), the entire category check is skipped — the actual event's categories are never inspected.

    This means:

    • There is no way to write a test that asserts "this job should have received zero categories" via the event stream path.
    • If a misconfigured or accidentally-broad category rule is added to executor_config.yaml in the future, all existing basic test specs (which don't specify categories:) will continue to pass silently, regardless of what categories the job actually received.

    The two new test cases (oom.yaml and user_error.yaml) are safe because they do supply expected categories and the exact-match check will catch unexpected additions. But for other tests in basic/ the guard means category pollution is completely invisible.

    Consider inverting the guard so the check always runs — an empty expected list asserts an empty actual list:

    // Always check categories (empty expected ⟹ actual must also be empty)
    if err := assertCategories(expected.Failed.GetCategories(), actual.Failed.GetCategories()); err != nil {
        return err
    }

    If exact-match-on-empty is undesirable for backward compatibility with existing basic/ specs, you could instead document the limitation with a comment so future test authors know they cannot express "no categories" via this path.

  2. internal/testsuite/app.go, line 90-103 (link)

    No deduplication when comma-separated glob patterns overlap

    TestSpecsFromPattern now splits the input on , and appends all results from each glob into allPaths without deduplication. If two patterns ever match the same file (e.g. testsuite/testcases/**/*,testsuite/testcases/categorization/*), the same test spec will be loaded and executed twice with no warning, potentially doubling the test run time or producing misleading duplicate results in the JUnit report.

    For the current patterns (basic/*,categorization/*) the directories are disjoint, so this is benign today — but adding a third, broader glob in the future could cause silent duplication.

    Consider deduplicating allPaths before returning:

    seen := make(map[string]struct{})
    for _, path := range allPaths {
        if _, ok := seen[path]; !ok {
            seen[path] = struct{}{}
            uniquePaths = append(uniquePaths, path)
        }
    }
    return TestSpecsFromFilePaths(uniquePaths)
  3. docker-compose.yaml, line 133-141 (link)

    e2e executor config permanently changes the default developer stack

    Mounting ./e2e/config/executor_config.yaml and passing --config /config/executor_config.yaml to the executor affects every developer who runs docker-compose up, not just the CI e2e run. The errorCategories rules are now always active in local dev environments regardless of whether the developer is running the e2e suite.

    Two concrete consequences:

    1. Tight coupling to an e2e artefact — if e2e/config/executor_config.yaml is deleted or moved (e.g. during a reorganisation of the e2e/ directory), the docker-compose executor silently fails to start because the volume mount path no longer exists on the host.

    2. Undocumented dev-stack behaviour change — developers debugging failed jobs in the local stack will now see categories populated on failure events, which could be surprising and confusing if they are not aware of this PR. A comment in docker-compose.yaml near the new volume/command lines would help:

    # Mounts the e2e executor config to enable errorCategories rules
    # required by testsuite/testcases/categorization/* test cases.
    - "./e2e/config/executor_config.yaml:/config/executor_config.yaml"

Reviews (34): Last reviewed commit: "Add testsuite support for asserting erro..." | Re-trigger Greptile

Comment thread internal/testsuite/lookoutclient/lookoutclient.go Outdated
Comment thread internal/testsuite/testrunner.go Outdated
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from e496a6b to ad68ab3 Compare March 11, 2026 14:40
Comment thread internal/executor/util/pod_status.go Outdated
Comment thread internal/testsuite/lookoutclient/lookoutclient.go Outdated
Comment thread internal/lookoutingester/instructions/instructions.go
@dejanzele dejanzele force-pushed the testsuite-error-categories branch 2 times, most recently from 063ca98 to dd853af Compare March 11, 2026 15:08
Comment thread internal/testsuite/lookoutclient/assert.go Outdated
Comment thread internal/testsuite/lookoutclient/lookoutclient.go Outdated
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from dd853af to e628632 Compare March 11, 2026 15:32
Comment thread internal/testsuite/lookoutclient/lookoutclient.go Outdated
Comment thread internal/testsuite/testrunner.go
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from e628632 to f509f03 Compare March 11, 2026 16:35
Comment thread magefiles/ci.go Outdated
Comment thread internal/executor/util/pod_status.go Outdated
Comment thread internal/testsuite/lookoutclient/assert.go Outdated
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from f509f03 to 334ece4 Compare March 11, 2026 18:26
Comment thread internal/testsuite/lookoutclient/lookoutclient.go Outdated
Comment thread testsuite/testcases/categorization/oom.yaml
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from 334ece4 to e10920f Compare March 12, 2026 12:10
Comment thread internal/executor/categorizer/classifier.go
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from e10920f to f813bb0 Compare March 12, 2026 12:31
Comment thread internal/testsuite/lookoutclient/lookoutclient.go Outdated
@dejanzele
Copy link
Copy Markdown
Member Author

@greptile

Comment thread pkg/api/testspec.proto Outdated
Comment thread internal/testsuite/lookoutclient/lookoutclient.go Outdated
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from f813bb0 to 79e8d4d Compare March 12, 2026 14:14
Comment thread internal/lookoutingester/instructions/instructions.go
Comment thread internal/lookoutingester/instructions/instructions.go
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from ce20b48 to 47cf523 Compare March 13, 2026 01:55
Comment thread internal/common/errormatch/match.go
Comment thread internal/lookoutingester/instructions/instructions.go
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from 47cf523 to 413f8ec Compare March 13, 2026 10:29
Comment thread internal/testsuite/testrunner.go
Comment thread internal/executor/util/pod_status.go Outdated
@dejanzele dejanzele force-pushed the testsuite-error-categories branch 4 times, most recently from be68a53 to 88aa7a0 Compare March 19, 2026 14:44
@dejanzele dejanzele force-pushed the testsuite-error-categories branch 5 times, most recently from d18879d to fa29480 Compare March 30, 2026 16:13
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the testsuite-error-categories branch 6 times, most recently from 622465e to 0e25ca8 Compare March 31, 2026 13:43
dejanzele added a commit that referenced this pull request Apr 1, 2026
## What type of PR is this?

Feature (PR 1 of 4)

## What this PR does / why we need it

Adds the proto schema and shared building blocks for error
categorization in Armada.

- Adds `FailureInfo` message to the `Error` proto with fields:
`exit_code`, `termination_message`, `categories`, `container_name`
- Adds `errormatch` package (`internal/common/errormatch/`) with shared
matching primitives: `ExitCodeMatcher` (In/NotIn operators),
`RegexMatcher`, and Kubernetes condition constants (OOMKilled, Evicted,
DeadlineExceeded)
- Adds `categorizer` package (`internal/executor/categorizer/`) -
configurable classifier that matches pod failures against rules (exit
codes, termination messages, Kubernetes conditions) and assigns named
categories, with optional `containerName` scoping per rule
- Adds `ExtractFailureInfo()` in `pod_status.go` to extract exit code,
termination message, and container name from Kubernetes pod status into
the `FailureInfo` proto
- Adds `ErrorCategories` config field under `ApplicationConfiguration`
for defining category rules

Nothing is wired into the event reporting path yet - this PR provides
the building blocks that PR #4745 connects.

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 1 of 4: Proto + classifier (this) -> Wire into executor
(#4745) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- The `errormatch` package is in `internal/common/` since it will be
reused by the lookout ingester
- The categorizer has thorough `doc.go` explaining config format and
validation
- Exit code 0 containers are skipped during classification (only
failures are categorized)
- Rules within a category are OR'd; categories are evaluated
independently

---------

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele added a commit that referenced this pull request Apr 2, 2026
## What type of PR is this?

Feature (PR 2 of 4)

## What this PR does / why we need it

Wires the classifier and FailureInfo into the executor's event reporting
path.

- Constructs the `Classifier` from config at executor startup
unconditionally (with no rules configured, it simply returns empty
categories)
- Passes classifier to the pod issue handler and job state reporter
- Calls `ExtractFailureInfo()` + `classifier.Classify()` on every pod
failure, attaching structured `FailureInfo` (exit code, termination
message, categories, container name) to the `Error` events sent through
Pulsar
- After this PR, every failed pod event flowing into Pulsar carries exit
code, termination message, container name, and matched category names

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 2 of 4: Proto + classifier (#4741) -> Wire into executor
(this) -> Store in lookout DB + UI (#4755) -> e2e tests (#4760)
- Depends on #4741
- The classifier is always instantiated at startup, even with no rules
configured - no conditional nil-checking needed at call sites
- Changes are in `application.go` (startup wiring),
`pod_issue_handler.go`, `job_state_reporter.go`, `reporter/event.go`,
`cluster_allocation.go`, and `preempt_runs.go`

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele added a commit that referenced this pull request Apr 2, 2026
## What type of PR is this?

Feature (PR 3 of 4)

## What this PR does / why we need it

Stores FailureInfo in the lookout database, exposes it via the Lookout
API and public event API, and displays it in the Lookout UI.

- Adds DB migration (031) to add a `failure_info` JSONB column to
`job_run`
- Extracts `FailureInfo` from Error events in the lookout ingester and
stores as JSONB (exit code, termination message, categories, container
name)
- Adds `failureInfo` to the Lookout v2 Swagger spec and threads it
through the query builder, model, and conversions layer
- Displays FailureInfo in the Lookout UI sidebar on failed job runs:
container name, exit code, termination message, and matched categories
- Adds Error Categories column to the jobs table (display-only,
filtering is not yet supported since categories are stored in JSONB and
not a server-side filterable field)
- Propagates error categories in the public `JobFailedEvent` proto, so
categories are available in the event stream (used by the e2e testsuite
in PR #4760)
- Adds executor error category config to `_local/executor/config.yaml`
for local development

## Which issue(s) this PR fixes

Part of #4713 (Error Categorization)

## Special notes for your reviewer

- This is PR 3 of 4: Proto + classifier (#4741) -> Wire into executor
(#4745) -> Store in lookout DB + UI + public event (this) -> e2e tests
(#4760)
- Depends on #4745
- JSONB round-trip means `exitCode` (int32 in proto) arrives as float64
after PostgreSQL JSON deserialization - `failureInfoToSwagger()` handles
this via type assertion
- `failureInfoToMap()` only emits non-zero fields to keep the JSONB
clean
- Uses `coalesce(tmp.failure_info, job_run.failure_info)` in the batch
upsert so earlier partial updates aren't overwritten
- Only terminal errors have their FailureInfo persisted (`if e.Terminal`
guard in the ingester) to prevent transient failures from overwriting
the final classification
- Categories on `JobFailedEvent` are set in `FromInternalJobErrors()`
conversion from `FailureInfo.Categories`

<img width="523" height="310" alt="image"
src="https://github.com/user-attachments/assets/8e018b92-6b90-46a1-8b1f-81c44ffed4c3"
/>
<img width="1728" height="410" alt="image"
src="https://github.com/user-attachments/assets/b79a4b28-b713-44ba-99a8-5297104a9edc"
/>

---------

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from 0e25ca8 to 0c82847 Compare April 2, 2026 14:22
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the testsuite-error-categories branch 3 times, most recently from 7f4ddf9 to 47732e7 Compare April 2, 2026 15:36
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

Assert error categories directly from the public JobFailedEvent in the
event stream, rather than polling the Lookout API. Categories now flow
through the event API (added in parent commit), so the testsuite can
validate them inline with existing event assertions.

Test YAMLs declare expected categories on the failed event:
  expectedEvents:
    - failed:
        categories: ["oom"]

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the testsuite-error-categories branch from 47732e7 to 7e71ae0 Compare April 2, 2026 15:58
@dejanzele dejanzele merged commit 9f3c095 into armadaproject:master Apr 3, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants