Skip to content

Add RetryPolicy CRUD API and armadactl support#4805

Open
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:retry-policy-crud
Open

Add RetryPolicy CRUD API and armadactl support#4805
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:retry-policy-crud

Conversation

@dejanzele
Copy link
Copy Markdown
Member

What type of PR is this?

Feature (retry policy PR 2 of 4)

What this PR does / why we need it

Adds RetryPolicy as a first-class API resource with full CRUD operations, following the same patterns as Queue. Operators can create retry policies and assign them to queues by name.

  • Adds RetryPolicy, RetryRule, RetryExitCodeMatcher proto messages and RetryAction enum to submit.proto
  • Adds RetryPolicyService gRPC service with Create/Update/Delete/Get/List RPCs
  • Adds REST gateway bindings for all operations (/v1/retry-policy/...)
  • Adds retry_policy field (string, references policy by name) to the Queue proto message
  • Adds internal/server/retrypolicy/ - repository (PostgreSQL) and gRPC handler, mirroring internal/server/queue/
  • Adds pkg/client/retrypolicy/ - client library for all CRUD operations
  • Adds cmd/armadactl/cmd/retrypolicy.go - CLI commands for managing retry policies
  • Adds --retry-policy flag to queue create/update commands
  • Adds RetryPolicy resource kind for file-based creation (armadactl create -f policy.yaml)

Usage:

armadactl create retry-policy -f policy.yaml
armadactl get retry-policy ml-training
armadactl get retry-policies
armadactl delete retry-policy ml-training
armadactl create queue ml-queue --retry-policy ml-training

Which issue(s) this PR fixes

Part of #4683 (Retry Policy)

Special notes for your reviewer

  • This is retry policy PR 2 of 4: Engine + config (independent) and CRUD + armadactl (this) -> Scheduler wiring -> Backoff + pod naming
  • No dependency on error categorization PRs - this is pure CRUD infrastructure
  • Independent of the engine PR (PR 1) - these can be reviewed in parallel
  • Follows the queue CRUD pattern exactly: proto -> server handler -> PostgreSQL repository -> client library -> armadactl
  • Repository uses INSERT with unique constraint check (not read-then-write) for concurrent-safe creates
  • Update uses single UPDATE with RowsAffected check instead of read-then-upsert
  • The retry_policy table (name text PK, definition bytea) stores serialized proto, same as the queue table
  • File-based create/update because rules are too complex for CLI flags
  • No scheduling behavior change in this PR

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 30, 2026

Greptile Summary

This PR introduces RetryPolicy as a first-class API resource with full CRUD operations (Create/Update/Delete/Get/List), closely mirroring the existing Queue pattern. It adds proto messages and a RetryPolicyService gRPC service, a PostgreSQL repository backed by a new retry_policy table, a dedicated gRPC handler with per-operation permission checks, a client library, and armadactl CLI commands. The REST gateway is wired through the existing Submit server via delegation, consistent with how queue operations are exposed.

Key highlights:

  • Proto design: RetryAction and ExitCodeOperator enums prevent typo-silent failures for the two most critical matching fields; the remaining on_conditions repeated string is the one field still open to typos (see inline comment)
  • Auth model: Each CRUD operation has its own dedicated permission constant (create_retry_policy, update_retry_policy, delete_retry_policy); read endpoints are intentionally unauthenticated with an explanatory comment, consistent with GetQueue
  • Idempotent deletes: documented by comment, consistent with DeleteQueue
  • Migration gap: 032_create_retry_policy.sql skips number 031 — if any concurrent in-flight PR claims that number, it would be permanently skipped after this PR is deployed first (see inline comment on the migration file)
  • nil slice on empty list: GetAllRetryPolicies returns a nil Go slice when no policies exist, which JSON-marshals as null instead of [] — minor but can surprise REST clients
  • Test quality: unit tests cover all five service methods across auth, validation, and repository error paths; RequireGrpcCode is correctly extracted into a shared servertest package

Confidence Score: 4/5

Safe to merge after confirming the migration numbering gap will not conflict with any other in-flight PR

The implementation is well-structured and cleanly mirrors the queue CRUD pattern. All concerns from the prior review iteration have been addressed. One P1 remains: the migration numbering gap (030→032) could cause migration 031 to be permanently skipped if another concurrent PR claims that number and is merged after this one. Once that ordering is confirmed or the numbering is coordinated, the remaining findings are P2 style/quality suggestions that do not block merge.

internal/lookout/schema/migrations/032_create_retry_policy.sql — migration numbering gap needs confirmation before merging

Important Files Changed

Filename Overview
internal/lookout/schema/migrations/032_create_retry_policy.sql Adds retry_policy table migration; migration number 031 is missing — if another in-flight PR claims 031, it will be permanently skipped after this PR deploys
internal/server/retrypolicy/repository.go PostgreSQL repository for retry policies; correct concurrent-safe create/update/delete, but GetAllRetryPolicies returns nil slice instead of empty slice on empty table
pkg/api/submit.proto Adds RetryPolicy messages, ExitCodeOperator enum, RetryPolicyService gRPC service, and REST gateway bindings on Submit; on_conditions is still an untyped string list
internal/server/retrypolicy/service.go New gRPC service handler for RetryPolicy CRUD; auth, input validation, and error mapping are all correct after prior review iteration
internal/server/retrypolicy/service_test.go Comprehensive unit tests covering auth, validation, and error-path for all five service methods using mocks
internal/server/submit/submit.go Adds retryPolicyService field and five delegation methods so Submit server satisfies the REST gateway bindings; mirrors existing queue delegation pattern exactly
internal/server/permissions/permissions.go Adds CreateRetryPolicy, UpdateRetryPolicy, DeleteRetryPolicy permission constants; each operation has its own dedicated permission
cmd/armadactl/cmd/retrypolicy.go CLI commands for create/update/get/list/delete retry policies; correctly requires --file flag for create/update and single name arg for get/delete
internal/armadactl/retrypolicy.go App-layer wiring for retry policy CLI operations; YAML marshaling for output is consistent with queue pattern
internal/server/servertest/grpc.go Extracts shared RequireGrpcCode test helper, replacing duplicate local implementations in queue and retry policy test packages
pkg/client/retrypolicy/create.go Client function for CreateRetryPolicy RPC; follows identical connection/context pattern as queue client

Sequence Diagram

sequenceDiagram
    participant CLI as armadactl
    participant Submit as Submit Server (REST gateway)
    participant RPS as RetryPolicyService Server
    participant Repo as PostgresRetryPolicyRepository
    participant DB as PostgreSQL (retry_policy table)

    Note over CLI,DB: Create / Update / Delete (auth required)
    CLI->>Submit: POST /v1/retry-policy (CreateRetryPolicy)
    Submit->>RPS: delegate CreateRetryPolicy(policy)
    RPS->>RPS: AuthorizeAction(CreateRetryPolicy)
    RPS->>RPS: validate name != ""
    RPS->>Repo: CreateRetryPolicy(ctx, policy)
    Repo->>DB: INSERT INTO retry_policy ON CONFLICT DO NOTHING
    DB-->>Repo: RowsAffected
    Repo-->>RPS: nil or ErrAlreadyExists
    RPS-->>Submit: Empty or gRPC status error
    Submit-->>CLI: HTTP 200 or 4xx/5xx

    Note over CLI,DB: Read (unauthenticated, consistent with GetQueue)
    CLI->>Submit: GET /v1/retry-policy/{name}
    Submit->>RPS: delegate GetRetryPolicy(req)
    RPS->>RPS: validate name != ""
    RPS->>Repo: GetRetryPolicy(ctx, name)
    Repo->>DB: SELECT definition FROM retry_policy WHERE name=$1
    DB-->>Repo: definition bytes or ErrNoRows
    Repo-->>RPS: *RetryPolicy or ErrNotFound
    RPS-->>Submit: RetryPolicy or NotFound status
    Submit-->>CLI: HTTP 200 + JSON or 404
Loading

Reviews (9): Last reviewed commit: "Add RetryPolicy CRUD API and armadactl s..." | Re-trigger Greptile

@dejanzele dejanzele force-pushed the retry-policy-crud branch 2 times, most recently from 1d4f603 to 45023c8 Compare March 30, 2026 14:06
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the retry-policy-crud branch 3 times, most recently from a797d8b to 63f1c89 Compare March 30, 2026 16:15
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

Introduce RetryPolicy as a first-class API resource with full CRUD
operations. This is pure infrastructure with no scheduling behavior
changes.

Proto: Add RetryPolicy, RetryRule, RetryExitCodeMatcher messages and
RetryPolicyService gRPC service with REST gateway bindings on the
Submit service. Add retry_policy field to Queue message.

Server: Add retrypolicy package with PostgresRetryPolicyRepository
(stores serialized proto in retry_policy table) and Server handler
with authorization checks. Wire into server startup and register
the gRPC service. Add CreateRetryPolicy/DeleteRetryPolicy permissions.

Client: Add pkg/client/retrypolicy with Create/Update/Delete/Get/GetAll
functions matching the queue client pattern.

CLI: Add armadactl commands for create/update/delete/get retry-policy
and get retry-policies, all using file-based input for create/update.
Add --retry-policy flag to queue create and update commands.
Add RetryPolicy as a valid ResourceKind for file-based creation.

Signed-off-by: Dejan Zele Pejchev <pejchev@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant