Add retry policy engine and matching primitives by dejanzele · Pull Request #4804 · armadaproject/armada

dejanzele · 2026-03-30T13:15:59Z

What type of PR is this?

Feature (retry policy PR 1 of 4)

What this PR does / why we need it

Adds the retry policy evaluation engine as self-contained, dead code. Nothing calls it yet - this PR provides the building blocks that the scheduler wiring PR connects.

Adds internal/scheduler/retry/ package with the core engine:
- types.go - Policy, Rule, Result, Action types with CompileRules() for pre-compiling regex patterns
- extract.go - Four extraction functions that pull condition, exit code, termination message, and categories from *armadaevents.Error, preferring FailureInfo with fallback to ContainerError
- matcher.go - matchRule (AND logic across all non-empty fields) and matchRules (first-match-wins iteration)
- engine.go - Engine.Evaluate() with global cap check, rule matching, default action fallback, and per-policy retry limit
Adds RetryPolicyConfig to scheduler configuration (feature flag + global max retries, disabled by default)
Reuses errormatch.ExitCodeMatcher, errormatch.RegexMatcher, and errormatch.MatchExitCode/MatchPattern from internal/common/errormatch/
Condition extraction derives failure conditions directly from the Error proto oneof + KubernetesReason, no proto changes needed

Which issue(s) this PR fixes

Part of #4683 (Retry Policy)

Special notes for your reviewer

This is retry policy PR 1 of 4: Engine + config (this) and CRUD + armadactl (independent) -> Scheduler wiring -> Backoff + pod naming
Depends on Add error categorization proto schema and executor classifier #4741 (error categorization proto + errormatch package)
This is entirely dead code with no behavior change. Focus review on matching correctness and edge case handling.
Exit code 0 never matches (proto3 default = "not set")
RetryLimit: 0 means unlimited retries (subject to global cap)
Rule matching uses AND logic: all non-empty fields must match. Rules are evaluated in order, first match wins.
41 test cases covering all match types, edge cases, nil handling, and regex compilation

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

greptile-apps · 2026-03-30T13:20:47Z

Greptile Summary

This PR introduces the retry policy evaluation engine as self-contained dead code — nothing calls it yet. It adds four new packages: internal/common/errormatch/ (shared matching primitives), internal/executor/categorizer/ (pod-failure classifier), internal/scheduler/retry/ (the engine itself), and a RetryPolicyConfig feature-flag entry in the scheduler configuration.

What was addressed from prior review rounds:

globalMaxRetries = 0 now correctly means "unlimited" via the e.globalMaxRetries > 0 && guard, backed by a dedicated test case.
validateRule now rejects empty OnTerminationMessage.Pattern, invalid ExitCodeOperator, and empty Values lists, and requires at least one match field per rule.
ValidatePolicy validates DefaultAction before delegating to CompileRules.

Outstanding items to watch before the scheduler-wiring PR:

matchRule guards the termination-message check on compiledPattern != nil rather than OnTerminationMessage != nil. A Policy evaluated without CompileRules being called first will silently skip that check.
validateRule does not validate per-rule Action fields. A zero-value or misspelled Action (e.g. \"fail\" vs \"Fail\") silently falls through to the retry branch with an empty Reason string.
OnConditions values are not validated for empty strings — unlike unknown strings (which silently never match), an empty string \"\" silently matches all unrecognised error types.

Overall: The core extraction, matching, and engine logic are sound, tests are thorough (41 cases), and the errormatch/categorizer primitives are clean. The remaining gaps centre on the incompleteness of validateRule and should be closed before this code is wired into the scheduler.

Confidence Score: 4/5

Safe to merge as dead code; the open validateRule gaps should be addressed before wiring into the scheduler.

The remaining open items (compiledPattern guard, per-rule Action validation, OnConditions empty-string) are latent bugs that only manifest when callers skip CompileRules or use malformed configs. Since this is deliberately dead code in this PR they cause no current regression, but they are real correctness issues that will surface in the scheduler-wiring PR (3/4). Score 4 rather than 5 to flag them for resolution before wiring.

internal/scheduler/retry/types.go (validateRule completeness) and internal/scheduler/retry/matcher.go (compiledPattern vs OnTerminationMessage guard)

Important Files Changed

Filename	Overview
internal/scheduler/retry/engine.go	Core evaluation loop: global cap check, rule matching, default fallback, per-policy limit. The globalMaxRetries=0 unlimited semantics are now guarded correctly. Per-rule Action validation is still absent from validateRule, meaning a zero-value or misspelled action silently falls through to retry.
internal/scheduler/retry/types.go	Policy/Rule/Result types with CompileRules and validateRule. DefaultAction is now validated, empty regex patterns are rejected, and exit-code operator/values are validated. validateRule still does not validate rule.Action, and OnConditions values are not checked for empty strings.
internal/scheduler/retry/matcher.go	AND-across-fields, first-match-wins logic. Correct for all four fields. Still guards termination-message check on compiledPattern != nil rather than OnTerminationMessage != nil, so a policy evaluated without calling CompileRules first would silently skip that check.
internal/scheduler/retry/extract.go	Four extraction helpers pulling condition, exit code, termination message, and categories from the Error proto with correct FailureInfo-first, ContainerError-fallback logic. All nil paths are guarded.
internal/common/errormatch/match.go	Clean MatchExitCode (nil-safe, exit-code-0 guard, In/NotIn) and MatchPattern (empty-string guard) helpers. Logic is correct and well-tested.
internal/common/errormatch/types.go	Defines shared ExitCodeMatcher, RegexMatcher, condition constants, and KnownConditions map. Clean separation of pod-level vs non-pod conditions. No issues.
internal/executor/categorizer/classifier.go	New pod-failure categoriser with strict construction-time validation (exactly one matcher per rule, known conditions, valid operator, non-empty values, unique names). Classify logic is correct.
internal/scheduler/configuration/retry.go	Minimal config struct with Enabled feature flag and GlobalMaxRetries. Clean; GlobalMaxRetries=0 is now safe (treated as unlimited by the engine).
internal/executor/util/pod_status.go	Refactored to use errormatch constants instead of bare string literals for Evicted and DeadlineExceeded. Safe mechanical refactoring.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Engine.Evaluate called] --> B{runError == nil?}
    B -- yes --> C[Return ShouldRetry=false
'no error information']
    B -- no --> D{globalMaxRetries > 0
AND totalRuns >= cap?}
    D -- yes --> E[Return ShouldRetry=false
'global max retries exceeded']
    D -- no --> F[Extract condition / exitCode /
termMsg / categories from Error]
    F --> G[matchRules: iterate rules
first-match-wins]
    G --> H{Rule matched?}
    H -- no --> I[action = DefaultAction
reason = 'no rule matched']
    H -- yes --> J[action = rule.Action
reason = reasonForAction map]
    I --> K{action == ActionFail?}
    J --> K
    K -- yes --> L[Return ShouldRetry=false]
    K -- no --> M{RetryLimit > 0
AND failureCount >= limit?}
    M -- yes --> N[Return ShouldRetry=false
'policy retry limit exceeded']
    M -- no --> O[Return ShouldRetry=true]

_{Reviews (9): Last reviewed commit: "Add retry policy engine and matching pri..." | Re-trigger Greptile}

internal/scheduler/retry/engine.go

internal/scheduler/retry/types.go

internal/scheduler/retry/matcher.go

internal/scheduler/retry/engine.go

dejanzele · 2026-03-30T15:02:55Z

@greptileai

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

dejanzele · 2026-03-30T16:16:07Z

@greptileai

Introduces a policy-based retry engine that evaluates Error protos against configurable rules to decide whether a failed job run should be retried. This is dead code - nothing calls the engine yet. - types.go: Policy, Rule, Result, Action types with regex pre-compilation - extract.go: extract condition, exit code, termination message, categories from armadaevents.Error (FailureInfo preferred, ContainerError fallback) - matcher.go: AND-logic rule matching, first-match-wins rule list evaluation - engine.go: Evaluate() with global cap, policy retry limit, and rule matching - configuration: RetryPolicyConfig type, wired into SchedulingConfig - config.yaml: default retryPolicy (disabled, globalMaxRetries=20) Signed-off-by: Dejan Zele Pejchev <dejan.pejchev@gmail.com> Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

Add FailureCondition enum and FailureInfo message to Error proto

ea6cd57

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

greptile-apps bot reviewed Mar 30, 2026

View reviewed changes

internal/scheduler/retry/engine.go Show resolved Hide resolved

internal/scheduler/retry/types.go Show resolved Hide resolved

internal/scheduler/retry/types.go Show resolved Hide resolved

dejanzele force-pushed the retry-engine-config branch from bca4896 to 1ed96c9 Compare March 30, 2026 13:27

greptile-apps bot reviewed Mar 30, 2026

View reviewed changes

internal/scheduler/retry/matcher.go Show resolved Hide resolved

dejanzele force-pushed the retry-engine-config branch from 1ed96c9 to ec73ad8 Compare March 30, 2026 14:21

greptile-apps bot reviewed Mar 30, 2026

View reviewed changes

internal/scheduler/retry/engine.go Show resolved Hide resolved

dejanzele force-pushed the retry-engine-config branch from ec73ad8 to a3d9e04 Compare March 30, 2026 14:48

Add executor error category classifier and ExtractFailureInfo

babf18c

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

dejanzele force-pushed the retry-engine-config branch 2 times, most recently from e39ec2f to 33b0802 Compare March 30, 2026 16:14

dejanzele force-pushed the retry-engine-config branch from 33b0802 to 995056d Compare March 30, 2026 16:34

dejanzele force-pushed the retry-engine-config branch from 995056d to cd5ea06 Compare March 31, 2026 10:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add retry policy engine and matching primitives#4804

Add retry policy engine and matching primitives#4804
dejanzele wants to merge 3 commits intoarmadaproject:masterfrom
dejanzele:retry-engine-config

dejanzele commented Mar 30, 2026

Uh oh!

greptile-apps bot commented Mar 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dejanzele commented Mar 30, 2026

Uh oh!

dejanzele commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dejanzele commented Mar 30, 2026

What type of PR is this?

What this PR does / why we need it

Which issue(s) this PR fixes

Special notes for your reviewer

Uh oh!

greptile-apps bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dejanzele commented Mar 30, 2026

Uh oh!

dejanzele commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps bot commented Mar 30, 2026 •

edited

Loading