[awf] cli/docker-manager: SIGTERM from GH Actions step timeout does not kill the agent container fast enough #1590

## Problem

When a GitHub Actions workflow sets `timeout-minutes` on a step that runs `awf`, the agent container is not reliably terminated when the timeout fires. The agent process inside the Docker container continues running past the step timeout, consuming runner time until the job-level (6-hour) or workflow-level (72-hour) timeout is hit.

GH Actions enforces `timeout-minutes` by sending **SIGTERM** to the step process (`awf`), followed by **SIGKILL** after a short grace period (~10 s). The `awf` Node.js process has a SIGTERM handler (`src/cli.ts:1895–1898`) that calls `performCleanup()` → `stopContainers()` → `docker compose down -v`. However:

1. `docker compose down -v` is slow: it gracefully stops services and tears down volumes, which can take 10–30 seconds.
2. If GH Actions sends SIGKILL to `awf` before `docker compose down` completes, `awf` is killed immediately while the Docker container (`awf-agent`) keeps running as an orphan.
3. Even when `awf` is not SIGKILLed, the container keeps running after the step timeout fires until `docker compose down` actually stops it.

The root cause is that the SIGTERM handler does not immediately kill the container before embarking on the slower graceful cleanup path.

## Context

- Original issue: https://github.com/github/gh-aw/issues/23965
- AWF already has `--agent-timeout <minutes>` (`src/cli.ts:1402`, `src/docker-manager.ts:1996–2022`) which uses `docker stop -t 10 awf-agent` when the internal timer fires. But this is a separate mechanism from GH Actions step-level `timeout-minutes`, which signals the `awf` host process directly.

## Root Cause

**`src/cli.ts:1895–1898`** — the SIGTERM handler calls `await performCleanup('SIGTERM')` which calls `stopContainers()` → `docker compose down -v`. This is too slow to reliably complete before GH Actions sends SIGKILL.

**`src/docker-manager.ts:2089`** (`stopContainers`) — uses `docker compose down -v` with default stop timeouts. No fast-path kill of `awf-agent` when called under signal pressure.

## Proposed Solution

### 1. Fast-kill the container at the top of the SIGTERM/SIGINT handlers

In `src/cli.ts`, before calling the slow `performCleanup()`, immediately stop the container so the agent can't outlive the `awf` process. The SIGTERM handler is shown here; the SIGINT handler would get the same `docker stop` call:

```typescript
process.on('SIGTERM', async () => {
  // Fast-kill the container immediately so it cannot outlive this process.
  // docker compose down (called in performCleanup) is too slow and may be
  // interrupted by a follow-up SIGKILL from the GH Actions runner.
  // `reject: false` keeps a non-zero exit (e.g. the container is already gone)
  // from throwing, so the stop is best-effort and cleanup always proceeds.
  try {
    await execa('docker', ['stop', '-t', '3', 'awf-agent'], { reject: false });
  } catch { /* best-effort */ }
  await performCleanup('SIGTERM');
  process.exit(143); // 128 + 15 (SIGTERM)
});
```

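A 3-second graceful window for the container gives the agent a chance to flush logs, while still completing well within the GH Actions grace period before SIGKILL. Even if SIGKILL then interrupts `performCleanup()`, the agent container has already been stopped, so it cannot keep running as an orphan.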
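Since the heading covers both SIGTERM and SIGINT, the same fast-stop logic could be shared between the two handlers. The following is only a minimal sketch under stated assumptions: `createSignalHandler`, `installSignalHandlers`, `exitCodeForSignal`, `fastStopArgs`, and `runBestEffort` are hypothetical names that do not exist in `src/cli.ts`, the `cleanup` parameter stands in for `performCleanup`, and `node:child_process` is used instead of `execa` purely to keep the example self-contained.

```typescript
import { spawn } from 'node:child_process';

type Signal = 'SIGTERM' | 'SIGINT';

// Conventional shell exit codes are 128 + signal number (SIGINT = 2, SIGTERM = 15).
const SIGNAL_NUMBERS: Record<Signal, number> = { SIGINT: 2, SIGTERM: 15 };

export function exitCodeForSignal(signal: Signal): number {
  return 128 + SIGNAL_NUMBERS[signal];
}

// Arguments for a bounded `docker stop`: the container gets `graceSeconds`
// to exit on its own before Docker kills it.
export function fastStopArgs(container: string, graceSeconds: number): string[] {
  return ['stop', '-t', String(graceSeconds), container];
}

// Runs a command and resolves once it exits. Never rejects: a failed stop
// (e.g. the container is already gone) must not block the rest of cleanup.
function runBestEffort(cmd: string, args: string[]): Promise<void> {
  return new Promise((resolve) => {
    const child = spawn(cmd, args, { stdio: 'ignore' });
    child.on('error', () => resolve());
    child.on('close', () => resolve());
  });
}

// Builds one handler usable for both signals. `cleanup` stands in for
// performCleanup; `run` and `exit` are injectable so the flow can be tested.
export function createSignalHandler(
  cleanup: (signal: Signal) => Promise<void>,
  run: (cmd: string, args: string[]) => Promise<void> = runBestEffort,
  exit: (code: number) => void = (code) => process.exit(code),
): (signal: Signal) => Promise<void> {
  let handling = false;
  return async (signal: Signal): Promise<void> => {
    if (handling) return; // ignore repeated signals while shutting down
    handling = true;
    // Step 1: stop the agent container first, with a short grace period.
    await run('docker', fastStopArgs('awf-agent', 3));
    // Step 2: the slower cleanup; it may be cut short by a later SIGKILL.
    try {
      await cleanup(signal);
    } catch {
      /* best-effort */
    }
    exit(exitCodeForSignal(signal));
  };
}

// Registers the shared handler for both signals.
export function installSignalHandlers(
  cleanup: (signal: Signal) => Promise<void>,
): void {
  const handle = createSignalHandler(cleanup);
  process.on('SIGTERM', () => void handle('SIGTERM'));
  process.on('SIGINT', () => void handle('SIGINT'));
}
```

In this sketch, `installSignalHandlers(performCleanup)` would replace the two separate `process.on` registrations, assuming `performCleanup` accepts the signal name and returns a promise. The `handling` guard is an addition not present in the issue: it keeps a second signal from starting a second, overlapping cleanup.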

### 2. Document `--agent-timeout` as the preferred workaround

Until a fix ships, users can set `--agent-timeout <minutes>` in their `awf` invocation to cap agent execution at the AWF level, which already does `docker stop -t 10 awf-agent` correctly. The compiled GH Actions workflow could accept a `timeout-minutes` input and pass it as `--agent-timeout`. This is a `gh-aw` CLI concern but the AWF documentation should surface the option.

### 3. (Optional) Add a stop-timeout parameter to `stopContainers`

Expose a `stopTimeoutSeconds` parameter in `stopContainers()` so callers from signal handlers can request a faster teardown (e.g., `docker compose down --timeout 5`) instead of the default 10-second container stop grace period.
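One way the parameter could be threaded through is sketched below. This is an illustration, not the current `stopContainers()` code: `StopOptions` and `composeDownArgs` are hypothetical names, and the real function may assemble its `docker compose` arguments differently. The only Compose behavior relied on is the `--timeout` flag of `docker compose down`.

```typescript
export interface StopOptions {
  // Hypothetical option: seconds Compose waits before killing containers.
  // When omitted, Compose applies its own default stop grace period.
  stopTimeoutSeconds?: number;
}

// Builds the argument list for `docker compose down -v`, adding `--timeout`
// only when the caller (e.g. a signal handler) asks for a faster teardown.
export function composeDownArgs(options: StopOptions = {}): string[] {
  const args = ['compose', 'down', '-v'];
  const { stopTimeoutSeconds } = options;
  if (stopTimeoutSeconds !== undefined) {
    if (!Number.isInteger(stopTimeoutSeconds) || stopTimeoutSeconds < 0) {
      throw new RangeError(
        `stopTimeoutSeconds must be a non-negative integer, got ${stopTimeoutSeconds}`,
      );
    }
    args.push('--timeout', String(stopTimeoutSeconds));
  }
  return args;
}
```

A signal handler could then request `composeDownArgs({ stopTimeoutSeconds: 5 })`, while the normal exit path keeps calling `composeDownArgs()` and retains the default behavior.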




> Generated by [Firewall Issue Dispatcher](https://github.com/github/gh-aw-firewall/actions/runs/23878309621) · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw-firewall+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw-firewall%2Ffirewall-issue-dispatcher%22&type=issues)
