Add retry policy engine and matching primitives #4804
dejanzele wants to merge 3 commits into armadaproject:master
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Greptile Summary
This PR introduces the retry policy evaluation engine as self-contained dead code - nothing calls it yet. It adds four new packages:
What was addressed from prior review rounds:
Outstanding items to watch before the scheduler-wiring PR:
Overall: The core extraction, matching, and engine logic are sound, tests are thorough (41 cases), and the
Confidence Score: 4/5
Safe to merge as dead code; the open validateRule gaps should be addressed before wiring into the scheduler.
The remaining open items (compiledPattern guard, per-rule Action validation, OnConditions empty-string) are latent bugs that only manifest when callers skip CompileRules or use malformed configs. Since this is deliberately dead code in this PR, they cause no current regression, but they are real correctness issues that will surface in the scheduler-wiring PR (3/4). Score 4 rather than 5 to flag them for resolution before wiring.
internal/scheduler/retry/types.go (validateRule completeness) and internal/scheduler/retry/matcher.go (compiledPattern vs OnTerminationMessage guard)
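The three open items named in the summary could be covered by per-rule validation along the following lines. This is only a sketch under stated assumptions: the simplified `Rule` struct, the `ActionRetry` constant, the error wording, and the `Preempted` condition string are hypothetical and are not taken from `types.go`.

```go
package main

import (
	"fmt"
	"regexp"
)

// Action is a simplified stand-in for the PR's Action type.
type Action string

const (
	ActionFail  Action = "Fail"
	ActionRetry Action = "Retry" // assumed name; only ActionFail appears in the review
)

// Rule is a hypothetical, reduced version of the PR's Rule type.
type Rule struct {
	Action               Action
	OnConditions         []string
	OnTerminationMessage string
}

// validateRule rejects the malformed configurations listed in the review:
// an unknown per-rule action, an empty condition string, and a
// termination-message pattern that does not compile.
func validateRule(r Rule) error {
	switch r.Action {
	case ActionFail, ActionRetry:
		// known action
	default:
		return fmt.Errorf("invalid action %q", r.Action)
	}
	for i, c := range r.OnConditions {
		if c == "" {
			return fmt.Errorf("onConditions[%d] is empty", i)
		}
	}
	if r.OnTerminationMessage != "" {
		if _, err := regexp.Compile(r.OnTerminationMessage); err != nil {
			return fmt.Errorf("invalid onTerminationMessage pattern: %w", err)
		}
	}
	return nil
}

func main() {
	rules := []Rule{
		{Action: ActionRetry, OnConditions: []string{"Preempted"}},
		{Action: "Rety"},                                 // misspelled action
		{Action: ActionFail, OnConditions: []string{""}}, // empty condition
		{Action: ActionFail, OnTerminationMessage: "("},  // unbalanced regex
	}
	for i, r := range rules {
		fmt.Printf("rule %d: %v\n", i, validateRule(r))
	}
}
```

Rejecting these cases at configuration load time would mean the matcher never sees a rule whose pattern is set but not compiled.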
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Engine.Evaluate called] --> B{runError == nil?}
    B -- yes --> C[Return ShouldRetry=false<br/>'no error information']
    B -- no --> D{globalMaxRetries > 0<br/>AND totalRuns >= cap?}
    D -- yes --> E[Return ShouldRetry=false<br/>'global max retries exceeded']
    D -- no --> F[Extract condition / exitCode /<br/>termMsg / categories from Error]
    F --> G[matchRules: iterate rules<br/>first-match-wins]
    G --> H{Rule matched?}
    H -- no --> I[action = DefaultAction<br/>reason = 'no rule matched']
    H -- yes --> J[action = rule.Action<br/>reason = reasonForAction map]
    I --> K{action == ActionFail?}
    J --> K
    K -- yes --> L[Return ShouldRetry=false]
    K -- no --> M{RetryLimit > 0<br/>AND failureCount >= limit?}
    M -- yes --> N[Return ShouldRetry=false<br/>'policy retry limit exceeded']
    M -- no --> O[Return ShouldRetry=true]
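The decision order in the flowchart can be traced with a small Go sketch. It is not the PR's `engine.go`: the `Input` struct, its field names, the `ActionRetry` constant, and the placeholder reason `"rule matched"` (standing in for the `reasonForAction` lookup) are assumptions made for illustration, and rule matching is reduced to a pre-computed `MatchedAction`.

```go
package main

import "fmt"

// Action is a simplified stand-in for the PR's Action type.
type Action string

const (
	ActionFail  Action = "Fail"
	ActionRetry Action = "Retry" // assumed name
)

// Result carries the two outputs shown in the flowchart.
type Result struct {
	ShouldRetry bool
	Reason      string
}

// Input gathers everything the flowchart consults (hypothetical shape).
type Input struct {
	HasError         bool   // false models runError == nil
	GlobalMaxRetries uint   // 0 disables the global cap
	TotalRuns        uint
	RetryLimit       uint   // 0 means unlimited, subject to the global cap
	FailureCount     uint
	MatchedAction    Action // "" models "no rule matched"
	DefaultAction    Action
}

// evaluate follows the flowchart top to bottom; the first check that
// applies decides the result.
func evaluate(in Input) Result {
	if !in.HasError {
		return Result{false, "no error information"}
	}
	if in.GlobalMaxRetries > 0 && in.TotalRuns >= in.GlobalMaxRetries {
		return Result{false, "global max retries exceeded"}
	}
	action, reason := in.MatchedAction, "rule matched"
	if action == "" {
		action, reason = in.DefaultAction, "no rule matched"
	}
	if action == ActionFail {
		return Result{false, reason}
	}
	if in.RetryLimit > 0 && in.FailureCount >= in.RetryLimit {
		return Result{false, "policy retry limit exceeded"}
	}
	return Result{true, reason}
}

func main() {
	in := Input{
		HasError:         true,
		GlobalMaxRetries: 20,
		TotalRuns:        3,
		RetryLimit:       5,
		FailureCount:     2,
		MatchedAction:    ActionRetry,
		DefaultAction:    ActionFail,
	}
	fmt.Printf("%+v\n", evaluate(in))

	in.FailureCount = 5 // reaches the per-policy limit
	fmt.Printf("%+v\n", evaluate(in))
}
```

Note that the global cap is checked before any rule is consulted, so a matching retry rule cannot override it.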
Reviews (9): Last reviewed commit: "Add retry policy engine and matching pri..."
Introduces a policy-based retry engine that evaluates Error protos against configurable rules to decide whether a failed job run should be retried. This is dead code - nothing calls the engine yet.

- types.go: Policy, Rule, Result, Action types with regex pre-compilation
- extract.go: extract condition, exit code, termination message, categories from armadaevents.Error (FailureInfo preferred, ContainerError fallback)
- matcher.go: AND-logic rule matching, first-match-wins rule list evaluation
- engine.go: Evaluate() with global cap, policy retry limit, and rule matching
- configuration: RetryPolicyConfig type, wired into SchedulingConfig
- config.yaml: default retryPolicy (disabled, globalMaxRetries=20)

Signed-off-by: Dejan Zele Pejchev <dejan.pejchev@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
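The matching behaviour summarised in the commit message (patterns compiled once, AND logic within a rule, first match wins across the list) might look roughly like this. The sketch uses simplified, hypothetical types: `OnExitCodes`, the `failure` struct, and the helper names are assumptions, and the real code relies on the `errormatch` matchers rather than plain slices. It also assumes a rule with no matchers matches everything, which the PR's validation may not allow.

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule is a reduced, hypothetical version of the PR's Rule type.
// Empty fields are ignored; every non-empty field must match.
type Rule struct {
	Action               string
	OnConditions         []string
	OnExitCodes          []int32 // assumed field, for illustration only
	OnTerminationMessage string
	compiledPattern      *regexp.Regexp
}

// failure holds the values extracted from a failed run.
type failure struct {
	condition string
	exitCode  int32
	termMsg   string
}

// compileRules compiles each termination-message pattern once, up front.
func compileRules(rules []Rule) error {
	for i := range rules {
		if rules[i].OnTerminationMessage == "" {
			continue
		}
		re, err := regexp.Compile(rules[i].OnTerminationMessage)
		if err != nil {
			return fmt.Errorf("rule %d: %w", i, err)
		}
		rules[i].compiledPattern = re
	}
	return nil
}

func contains[T comparable](xs []T, x T) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// matchRule applies AND logic across all non-empty fields.
func matchRule(r *Rule, f failure) bool {
	if len(r.OnConditions) > 0 && !contains(r.OnConditions, f.condition) {
		return false
	}
	if len(r.OnExitCodes) > 0 && !contains(r.OnExitCodes, f.exitCode) {
		return false
	}
	if r.OnTerminationMessage != "" {
		// Guard on the configured pattern: if it was never compiled,
		// the rule must not match by accident.
		if r.compiledPattern == nil || !r.compiledPattern.MatchString(f.termMsg) {
			return false
		}
	}
	return true
}

// matchRules returns the first matching rule, or nil if none matches.
func matchRules(rules []Rule, f failure) *Rule {
	for i := range rules {
		if matchRule(&rules[i], f) {
			return &rules[i]
		}
	}
	return nil
}

func main() {
	rules := []Rule{
		{Action: "Fail", OnExitCodes: []int32{1}, OnTerminationMessage: "OOM"},
		{Action: "Retry", OnExitCodes: []int32{1}},
	}
	if err := compileRules(rules); err != nil {
		panic(err)
	}
	if r := matchRules(rules, failure{exitCode: 1, termMsg: "killed"}); r != nil {
		fmt.Println(r.Action)
	}
}
```

Because evaluation stops at the first match, more specific rules need to appear before broader ones in the list.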
What type of PR is this?
Feature (retry policy PR 1 of 4)
What this PR does / why we need it
Adds the retry policy evaluation engine as self-contained, dead code. Nothing calls it yet - this PR provides the building blocks that the scheduler wiring PR connects.
- `internal/scheduler/retry/` package with the core engine:
  - `types.go` - `Policy`, `Rule`, `Result`, `Action` types with `CompileRules()` for pre-compiling regex patterns
  - `extract.go` - Four extraction functions that pull condition, exit code, termination message, and categories from `*armadaevents.Error`, preferring `FailureInfo` with fallback to `ContainerError`
  - `matcher.go` - `matchRule` (AND logic across all non-empty fields) and `matchRules` (first-match-wins iteration)
  - `engine.go` - `Engine.Evaluate()` with global cap check, rule matching, default action fallback, and per-policy retry limit
- `RetryPolicyConfig` to scheduler configuration (feature flag + global max retries, disabled by default)
- `errormatch.ExitCodeMatcher`, `errormatch.RegexMatcher`, and `errormatch.MatchExitCode`/`MatchPattern` from `internal/common/errormatch/`
Which issue(s) this PR fixes
Part of #4683 (Retry Policy)
Special notes for your reviewer
- `RetryLimit: 0` means unlimited retries (subject to global cap)