Skip to content

Add retry policy engine and matching primitives#4804

Open
dejanzele wants to merge 3 commits intoarmadaproject:masterfrom
dejanzele:retry-engine-config
Open

Add retry policy engine and matching primitives#4804
dejanzele wants to merge 3 commits intoarmadaproject:masterfrom
dejanzele:retry-engine-config

Conversation

@dejanzele
Copy link
Copy Markdown
Member

What type of PR is this?

Feature (retry policy PR 1 of 4)

What this PR does / why we need it

Adds the retry policy evaluation engine as self-contained, dead code. Nothing calls it yet - this PR provides the building blocks that the scheduler wiring PR connects.

  • Adds internal/scheduler/retry/ package with the core engine:
    • types.go - Policy, Rule, Result, Action types with CompileRules() for pre-compiling regex patterns
    • extract.go - Four extraction functions that pull condition, exit code, termination message, and categories from *armadaevents.Error, preferring FailureInfo with fallback to ContainerError
    • matcher.go - matchRule (AND logic across all non-empty fields) and matchRules (first-match-wins iteration)
    • engine.go - Engine.Evaluate() with global cap check, rule matching, default action fallback, and per-policy retry limit
  • Adds RetryPolicyConfig to scheduler configuration (feature flag + global max retries, disabled by default)
  • Reuses errormatch.ExitCodeMatcher, errormatch.RegexMatcher, and errormatch.MatchExitCode/MatchPattern from internal/common/errormatch/
  • Condition extraction derives failure conditions directly from the Error proto oneof + KubernetesReason, no proto changes needed

Which issue(s) this PR fixes

Part of #4683 (Retry Policy)

Special notes for your reviewer

  • This is retry policy PR 1 of 4: Engine + config (this) and CRUD + armadactl (independent) -> Scheduler wiring -> Backoff + pod naming
  • Depends on Add error categorization proto schema and executor classifier #4741 (error categorization proto + errormatch package)
  • This is entirely dead code with no behavior change. Focus review on matching correctness and edge case handling.
  • Exit code 0 never matches (proto3 default = "not set")
  • RetryLimit: 0 means unlimited retries (subject to global cap)
  • Rule matching uses AND logic: all non-empty fields must match. Rules are evaluated in order, first match wins.
  • 41 test cases covering all match types, edge cases, nil handling, and regex compilation

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 30, 2026

Greptile Summary

This PR introduces the retry policy evaluation engine as self-contained dead code — nothing calls it yet. It adds four new packages: internal/common/errormatch/ (shared matching primitives), internal/executor/categorizer/ (pod-failure classifier), internal/scheduler/retry/ (the engine itself), and a RetryPolicyConfig feature-flag entry in the scheduler configuration.

What was addressed from prior review rounds:

  • globalMaxRetries = 0 now correctly means "unlimited" via the e.globalMaxRetries > 0 && guard, backed by a dedicated test case.
  • validateRule now rejects empty OnTerminationMessage.Pattern, invalid ExitCodeOperator, and empty Values lists, and requires at least one match field per rule.
  • ValidatePolicy validates DefaultAction before delegating to CompileRules.

Outstanding items to watch before the scheduler-wiring PR:

  • matchRule guards the termination-message check on compiledPattern != nil rather than OnTerminationMessage != nil. A Policy evaluated without CompileRules being called first will silently skip that check.
  • validateRule does not validate per-rule Action fields. A zero-value or misspelled Action (e.g. \"fail\" vs \"Fail\") silently falls through to the retry branch with an empty Reason string.
  • OnConditions values are not validated for empty strings — unlike unknown strings (which silently never match), an empty string \"\" silently matches all unrecognised error types.

Overall: The core extraction, matching, and engine logic are sound, tests are thorough (41 cases), and the errormatch/categorizer primitives are clean. The remaining gaps centre on the incompleteness of validateRule and should be closed before this code is wired into the scheduler.

Confidence Score: 4/5

Safe to merge as dead code; the open validateRule gaps should be addressed before wiring into the scheduler.

The remaining open items (compiledPattern guard, per-rule Action validation, OnConditions empty-string) are latent bugs that only manifest when callers skip CompileRules or use malformed configs. Since this is deliberately dead code in this PR they cause no current regression, but they are real correctness issues that will surface in the scheduler-wiring PR (3/4). Score 4 rather than 5 to flag them for resolution before wiring.

internal/scheduler/retry/types.go (validateRule completeness) and internal/scheduler/retry/matcher.go (compiledPattern vs OnTerminationMessage guard)

Important Files Changed

Filename Overview
internal/scheduler/retry/engine.go Core evaluation loop: global cap check, rule matching, default fallback, per-policy limit. The globalMaxRetries=0 unlimited semantics are now guarded correctly. Per-rule Action validation is still absent from validateRule, meaning a zero-value or misspelled action silently falls through to retry.
internal/scheduler/retry/types.go Policy/Rule/Result types with CompileRules and validateRule. DefaultAction is now validated, empty regex patterns are rejected, and exit-code operator/values are validated. validateRule still does not validate rule.Action, and OnConditions values are not checked for empty strings.
internal/scheduler/retry/matcher.go AND-across-fields, first-match-wins logic. Correct for all four fields. Still guards termination-message check on compiledPattern != nil rather than OnTerminationMessage != nil, so a policy evaluated without calling CompileRules first would silently skip that check.
internal/scheduler/retry/extract.go Four extraction helpers pulling condition, exit code, termination message, and categories from the Error proto with correct FailureInfo-first, ContainerError-fallback logic. All nil paths are guarded.
internal/common/errormatch/match.go Clean MatchExitCode (nil-safe, exit-code-0 guard, In/NotIn) and MatchPattern (empty-string guard) helpers. Logic is correct and well-tested.
internal/common/errormatch/types.go Defines shared ExitCodeMatcher, RegexMatcher, condition constants, and KnownConditions map. Clean separation of pod-level vs non-pod conditions. No issues.
internal/executor/categorizer/classifier.go New pod-failure categoriser with strict construction-time validation (exactly one matcher per rule, known conditions, valid operator, non-empty values, unique names). Classify logic is correct.
internal/scheduler/configuration/retry.go Minimal config struct with Enabled feature flag and GlobalMaxRetries. Clean; GlobalMaxRetries=0 is now safe (treated as unlimited by the engine).
internal/executor/util/pod_status.go Refactored to use errormatch constants instead of bare string literals for Evicted and DeadlineExceeded. Safe mechanical refactoring.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Engine.Evaluate called] --> B{runError == nil?}
    B -- yes --> C[Return ShouldRetry=false
'no error information']
    B -- no --> D{globalMaxRetries > 0
AND totalRuns >= cap?}
    D -- yes --> E[Return ShouldRetry=false
'global max retries exceeded']
    D -- no --> F[Extract condition / exitCode /
termMsg / categories from Error]
    F --> G[matchRules: iterate rules
first-match-wins]
    G --> H{Rule matched?}
    H -- no --> I[action = DefaultAction
reason = 'no rule matched']
    H -- yes --> J[action = rule.Action
reason = reasonForAction map]
    I --> K{action == ActionFail?}
    J --> K
    K -- yes --> L[Return ShouldRetry=false]
    K -- no --> M{RetryLimit > 0
AND failureCount >= limit?}
    M -- yes --> N[Return ShouldRetry=false
'policy retry limit exceeded']
    M -- no --> O[Return ShouldRetry=true]
Loading

Reviews (9): Last reviewed commit: "Add retry policy engine and matching pri..." | Re-trigger Greptile

@dejanzele dejanzele force-pushed the retry-engine-config branch from bca4896 to 1ed96c9 Compare March 30, 2026 13:27
@dejanzele dejanzele force-pushed the retry-engine-config branch from 1ed96c9 to ec73ad8 Compare March 30, 2026 14:21
@dejanzele dejanzele force-pushed the retry-engine-config branch from ec73ad8 to a3d9e04 Compare March 30, 2026 14:48
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the retry-engine-config branch 2 times, most recently from e39ec2f to 33b0802 Compare March 30, 2026 16:14
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the retry-engine-config branch from 33b0802 to 995056d Compare March 30, 2026 16:34
Introduces a policy-based retry engine that evaluates Error protos against
configurable rules to decide whether a failed job run should be retried.
This is dead code - nothing calls the engine yet.

- types.go: Policy, Rule, Result, Action types with regex pre-compilation
- extract.go: extract condition, exit code, termination message, categories
  from armadaevents.Error (FailureInfo preferred, ContainerError fallback)
- matcher.go: AND-logic rule matching, first-match-wins rule list evaluation
- engine.go: Evaluate() with global cap, policy retry limit, and rule matching
- configuration: RetryPolicyConfig type, wired into SchedulingConfig
- config.yaml: default retryPolicy (disabled, globalMaxRetries=20)

Signed-off-by: Dejan Zele Pejchev <dejan.pejchev@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the retry-engine-config branch from 995056d to cd5ea06 Compare March 31, 2026 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant