Design same-session polecat recovery after refinery failure #22

@julianknutsen

Problem

Today Gastown's polecat/refinery flow preserves work identity across a failed merge, but not polecat session identity.

Current behavior:

  • a polecat finishes work and hands the bead to refinery
  • refinery merges or rejects
  • on rejection, the work returns to the pool
  • a new polecat session may pick it up and resume from metadata

That works mechanically, but it loses the live context of the original polecat session. For merge failures and test failures, the same polecat session should be able to pick the work back up when possible, instead of always relying on a fresh pool claim.

Goal

Design and implement a same-session refinery feedback flow for Gastown:

  • a polecat can submit work to refinery
  • if refinery needs fixes, the same polecat session can resume the work
  • if refinery succeeds, the polecat can clean up normally

This is a feature request, not just an upstream parity port. The design should match Gas City's wait/session/pool model.

Current Limitation

We have already identified one important blocker in the current architecture:

  • the new wait subsystem is the right foundation for "polecat waits for refinery"
  • but pooled polecats can still fall out of desired state and be drained when pool sizing shrinks
  • so a waiting polecat is not yet a first-class pinned pool member

That means the feature needs real lifecycle design, not just prompt tweaks.

Desired Outcome

Allow a polecat to remain the owner of its in-flight work across refinery verdicts when feasible.

In particular:

  • failed merges should be resumable by the same polecat session
  • successful merges should let the polecat finish cleanly without leaking sessions or worktrees
  • if the original polecat is gone, unhealthy, or no longer recoverable, the system should still fall back safely to the existing pool-based recovery path

Design Areas

1. Verdict / handoff model

  • How does refinery communicate FIX_NEEDED vs success?
  • Is the wake trigger the original work bead, a verdict bead, a wait object, or some other durable receipt?
  • What is the durable source of truth for the refinery verdict?
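To make the verdict questions concrete, here is a minimal sketch of one possible durable verdict record. Everything in it is an assumption for discussion: the `Verdict` type, its field names, the `MERGED` outcome name, and the JSON encoding are hypothetical, and only `FIX_NEEDED` comes from this issue. It does not pick between a verdict bead, a wait object, or another receipt; it only shows what such a record might need to carry.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class Outcome(Enum):
    MERGED = "MERGED"          # hypothetical name for the success verdict
    FIX_NEEDED = "FIX_NEEDED"  # verdict name used in this issue


@dataclass(frozen=True)
class Verdict:
    bead_id: str      # work bead the verdict applies to
    session_id: str   # polecat session that submitted the work
    outcome: Outcome
    reason: str = ""  # e.g. a short failing-test summary for FIX_NEEDED


def encode(v: Verdict) -> str:
    """Serialize a verdict to a stable JSON string for durable storage."""
    data = asdict(v)
    data["outcome"] = v.outcome.value
    return json.dumps(data, sort_keys=True)


def decode(raw: str) -> Verdict:
    """Rebuild a verdict from its stored JSON form."""
    data = json.loads(raw)
    data["outcome"] = Outcome(data["outcome"])
    return Verdict(**data)
```

Recording the submitting session id next to the bead id is what would let a wake-up be routed back to the original polecat, while the bead id alone remains enough for the pool-based fallback.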

2. Polecat session lifecycle

  • Does the polecat sleep and wait for refinery instead of drain-ack?
  • How is the same session woken back up?
  • What is the timeout / abandonment story if refinery never replies?
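The lifecycle questions above can be read as a small state machine. The sketch below is illustrative only: the state names, event strings, and the one-hour default timeout are assumptions, not existing Gas City behavior, and the real wait primitive may look quite different.

```python
from enum import Enum


class State(Enum):
    WORKING = "working"
    WAITING = "waiting"      # submitted, parked until refinery replies
    DONE = "done"            # merge succeeded, session may clean up
    ABANDONED = "abandoned"  # wait timed out, work returns to the pool


def next_state(state, event, waited_s=0.0, timeout_s=3600.0):
    """Return the next state for one event; unhandled events keep the state."""
    if state is State.WORKING and event == "submitted":
        return State.WAITING
    if state is State.WAITING:
        if event == "FIX_NEEDED":
            return State.WORKING  # same session resumes the work
        if event == "MERGED":
            return State.DONE
        if event == "tick" and waited_s >= timeout_s:
            return State.ABANDONED
    return state
```

The `ABANDONED` transition is where the timeout story would hand the work back to the existing pool-based recovery path.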

3. Pool slot semantics

  • How do waiting polecats interact with pool desired-state reconciliation?
  • Do waiting polecats pin a pool slot?
  • How do we avoid slot leakage or dead capacity when many polecats are waiting on refinery?
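One way to frame the slot questions is a drain planner that treats waiting polecats as pinned, up to a cap. This is a sketch under stated assumptions: `plan_drain`, the status strings, and the `max_pinned` cap are all hypothetical, and the policy (drain idle members first, then waiting members beyond the cap, never working ones) is just one option to discuss.

```python
def plan_drain(members, desired, max_pinned):
    """Pick member ids to drain so the pool shrinks toward `desired`.

    members: list of (member_id, status) pairs, where status is one of
    "idle", "working", or "waiting" (waiting = parked on refinery).
    """
    waiting = [m for m, status in members if status == "waiting"]
    pinned = set(waiting[:max_pinned])  # these survive reconciliation

    excess = len(members) - desired
    if excess <= 0:
        return []

    # Idle members go first; waiting members are only drainable once the
    # pin cap is exceeded. Working members are never drained here.
    candidates = [m for m, status in members if status == "idle"]
    candidates += [m for m in waiting if m not in pinned]
    return candidates[:excess]
```

The cap bounds dead capacity: with many polecats parked on refinery, only `max_pinned` of them hold slots, and the rest fall back to metadata-based recovery.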

4. Recovery / fallback

  • If the original polecat session dies, can another polecat still resume the work from metadata?
  • What metadata must still be recorded for crash recovery?
  • What happens across controller restart, compaction, or workspace cleanup?
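The fallback rule in the Desired Outcome section can be sketched as a single routing decision. The function name, the session record shape, and the `healthy` / `worktree` flags below are hypothetical; which health signals actually exist is one of the things this design needs to settle.

```python
def route_fix(session_id, sessions):
    """Decide who resumes work after a FIX_NEEDED verdict.

    sessions maps session id -> {"healthy": bool, "worktree": bool}.
    Returns ("same-session", id) when the original polecat can resume,
    otherwise ("pool", None) to use the existing metadata-based path.
    """
    session = sessions.get(session_id)
    if session and session["healthy"] and session["worktree"]:
        return ("same-session", session_id)
    return ("pool", None)
```

For example, a session that survived a controller restart but lost its worktree to cleanup would be routed to the pool rather than woken.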

5. Cleanup rules

  • When is the worktree kept vs deleted?
  • When does a successful merge retire the polecat?
  • How do we avoid leaving stale waiting sessions behind?

Non-Goals

  • Do not require ACP propulsion work for this feature.
  • Do not force the full upstream event model if a simpler Gas City-native contract is better.
  • Do not break the current pool-based fallback path.

Minimum Acceptance Criteria

  • A failed refinery merge can be resumed by the same polecat session when that session is still healthy.
  • Pool reconciliation does not accidentally kill a polecat just because it is waiting on refinery.
  • If the original polecat is unavailable, the existing metadata-based fallback path still works.
  • Successful merges still result in correct cleanup.
  • The lifecycle contract is documented and covered by tests.

Open Questions

  • Should this be implemented directly with gc session wait, or does it need a narrower polecat/refinery-specific primitive?
  • Should wait-held pooled agents count as desired capacity automatically?
  • What is the right durable verdict artifact?
  • Should the original polecat remain assigned to the work, or should assignment move while identity is tracked separately?

Reference

  • docs/archive/analysis/gastown-upstream-audit.md
    Delta 4: event-driven polecat/refinery lifecycle

Labels

  • kind/design (Design / Product questions we need to resolve)
  • kind/enhancement (New capability)
  • priority/p2 (Medium: real problem, workaround exists)
  • status/needs-triage (Inbox: we haven't looked at it yet)
