
[awf] cli/docker-manager: SIGTERM from GH Actions step timeout does not kill the agent container fast enough #1590

@lpcox

Problem

When a GitHub Actions workflow sets timeout-minutes on a step that runs awf, the agent container is not reliably terminated when the timeout fires. The agent process inside the Docker container continues running past the step timeout, consuming runner time until the job-level (6-hour) or workflow-level (72-hour) timeout is hit.

GH Actions enforces timeout-minutes by sending SIGTERM to the step process (awf), followed by SIGKILL after a short grace period (~10 s). The awf Node.js process has a SIGTERM handler (src/cli.ts:1895–1898) that calls performCleanup() → stopContainers() → docker compose down -v. However:

  1. docker compose down -v is slow — it gracefully stops services and tears down volumes, which can take 10–30 seconds.
  2. If GH Actions sends SIGKILL to awf before docker compose down completes, awf is killed immediately while the Docker container (awf-agent) keeps running as an orphan.
  3. Even in the non-SIGKILL path, there is a window where the container is still running after the step timeout fires.

The root cause is that the SIGTERM handler does not immediately kill the container before embarking on the slower graceful cleanup path.
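
The race above can be made concrete with a small sketch. The function and type names are hypothetical, and the durations are the approximate figures quoted in this issue rather than measurements.

```typescript
// Hypothetical model of the race described above; not part of the awf codebase.
type Outcome = 'stopped' | 'orphaned';

// Approximate delay between SIGTERM and SIGKILL, as quoted in this issue.
const GRACE_SECONDS = 10;

// Returns what happens to awf-agent if stopping it takes `stopSeconds`
// and the runner sends SIGKILL `graceSeconds` after SIGTERM.
function containerOutcome(stopSeconds: number, graceSeconds: number): Outcome {
  // The stop must finish strictly before SIGKILL lands; otherwise awf dies
  // mid-teardown and the container is left running.
  return stopSeconds < graceSeconds ? 'stopped' : 'orphaned';
}

// Quoted range for `docker compose down -v` (10 to 30 s) versus a 3 s `docker stop`.
const composeDownBest: Outcome = containerOutcome(10, GRACE_SECONDS);
const composeDownWorst: Outcome = containerOutcome(30, GRACE_SECONDS);
const fastStop: Outcome = containerOutcome(3, GRACE_SECONDS);
```

Under these assumed numbers, even the fastest quoted docker compose down -v does not beat the grace period, while a 3-second stop does.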

Context

  • Original issue: timeout-minutes on agent step not enforced inside AWF container gh-aw#23965
  • AWF already has --agent-timeout <minutes> (src/cli.ts:1402, src/docker-manager.ts:1996–2022) which uses docker stop -t 10 awf-agent when the internal timer fires. But this is a separate mechanism from GH Actions step-level timeout-minutes, which signals the awf host process directly.

Root Cause

src/cli.ts:1895–1898: the SIGTERM handler calls await performCleanup('SIGTERM'), which calls stopContainers() → docker compose down -v. This is too slow to reliably complete before GH Actions sends SIGKILL.

src/docker-manager.ts:2089 (stopContainers): uses docker compose down -v with default stop timeouts. There is no fast-path kill of awf-agent when it is called under signal pressure.

Proposed Solution

1. Fast-kill the container at the top of the SIGTERM/SIGINT handlers

In src/cli.ts, before calling the slow performCleanup(), immediately stop the container so the agent can't outlive the awf process:

process.on('SIGTERM', async () => {
  // Fast-kill the container immediately so it cannot outlive this process.
  // docker compose down (called in performCleanup) is too slow and may be
  // interrupted by a follow-up SIGKILL from the GH Actions runner.
  try {
    await execa('docker', ['stop', '-t', '3', 'awf-agent'], { reject: false });
  } catch { /* best-effort */ }
  await performCleanup('SIGTERM');
  process.exit(143);
});

A 3-second graceful window for the container gives the agent a chance to flush logs, while still completing well within the GH Actions grace period before SIGKILL.
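
Since the heading mentions both SIGTERM and SIGINT, one way to share the fast-kill logic is sketched below. This is a minimal sketch, not the actual cli.ts code: handleSignal, Run, and EXIT_CODES are hypothetical names, and the command runner and cleanup function are injected so the ordering can be exercised without Docker. The exit codes follow the usual 128 + signal number convention.

```typescript
// Hypothetical refactor; names other than performCleanup, execa and awf-agent
// are illustrative and do not come from the awf codebase.
type Run = (cmd: string, args: string[]) => Promise<unknown>;

// Conventional shell exit codes: 128 + signal number (SIGINT = 2, SIGTERM = 15).
const EXIT_CODES = { SIGINT: 130, SIGTERM: 143 } as const;
type HandledSignal = keyof typeof EXIT_CODES;

async function handleSignal(
  signal: HandledSignal,
  run: Run,
  cleanup: (signal: HandledSignal) => Promise<void>,
): Promise<number> {
  try {
    // Fast path: stop the agent first so it cannot outlive this process.
    await run('docker', ['stop', '-t', '3', 'awf-agent']);
  } catch {
    // Best-effort: fall through to the slower cleanup even if the stop fails.
  }
  await cleanup(signal);
  return EXIT_CODES[signal];
}

// Possible wiring in cli.ts (assumes execa and performCleanup as shown above):
//
//   for (const sig of ['SIGTERM', 'SIGINT'] as const) {
//     process.once(sig, () => {
//       void handleSignal(sig, (c, a) => execa(c, a, { reject: false }), performCleanup)
//         .then((code) => process.exit(code));
//     });
//   }
```

Returning the exit code instead of calling process.exit() inside the helper keeps the ordering (stop first, cleanup second) testable in isolation.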

2. Document --agent-timeout as the preferred workaround

Until a fix ships, users can set --agent-timeout <minutes> in their awf invocation to cap agent execution at the AWF level, which already does docker stop -t 10 awf-agent correctly. The compiled GH Actions workflow could accept a timeout-minutes input and pass it as --agent-timeout. This is a gh-aw CLI concern but the AWF documentation should surface the option.
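
A minimal sketch of how a step-level timeout might be mapped onto the existing flag follows. The helper name agentTimeoutArgs and the one-minute safety margin are assumptions for illustration; only --agent-timeout <minutes> comes from this issue, and whether the gh-aw compiler would do the mapping this way is not established here.

```typescript
// Hypothetical helper: derive awf arguments from a step's timeout-minutes.
// The margin is an assumption, chosen so the internal AWF timer fires
// before the GH Actions step timeout does.
function agentTimeoutArgs(stepTimeoutMinutes: number, marginMinutes = 1): string[] {
  if (!Number.isInteger(stepTimeoutMinutes) || stepTimeoutMinutes <= 0) {
    throw new RangeError(
      `timeout-minutes must be a positive integer, got ${stepTimeoutMinutes}`,
    );
  }
  // Never go below one minute, even for very short step timeouts.
  const agentMinutes = Math.max(1, stepTimeoutMinutes - marginMinutes);
  return ['--agent-timeout', String(agentMinutes)];
}

// Example: a step with timeout-minutes: 30 would pass a 29-minute agent timeout.
const exampleArgs: string[] = agentTimeoutArgs(30);
```

Keeping the agent timeout below the step timeout means the docker stop path runs on the internal timer, so the runner's SIGTERM and SIGKILL sequence is less likely to be reached at all.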

3. (Optional) Add --stop-timeout to stopContainers

Expose a stopTimeoutSeconds parameter in stopContainers() so callers from signal handlers can request a faster teardown (e.g., docker compose down --timeout 5) instead of the default 10-second container stop grace period.
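
A sketch of how the optional parameter could translate into docker compose arguments is shown below. composeDownArgs is a hypothetical helper, and the real stopContainers() in src/docker-manager.ts may be structured differently; the --timeout flag is the existing docker compose down option for the shutdown timeout in seconds.

```typescript
// Hypothetical helper; not the actual stopContainers() implementation.
// Builds the argument list for `docker compose down -v`, optionally with
// a shorter shutdown timeout for callers running inside a signal handler.
function composeDownArgs(stopTimeoutSeconds?: number): string[] {
  const args = ['compose', 'down', '-v'];
  if (stopTimeoutSeconds !== undefined) {
    if (!Number.isInteger(stopTimeoutSeconds) || stopTimeoutSeconds < 0) {
      throw new RangeError(
        `stopTimeoutSeconds must be a non-negative integer, got ${stopTimeoutSeconds}`,
      );
    }
    args.push('--timeout', String(stopTimeoutSeconds));
  }
  return args;
}

// Normal exit path keeps the default grace period.
const defaultArgs: string[] = composeDownArgs();
// Signal-handler path asks for a 5-second teardown.
const signalArgs: string[] = composeDownArgs(5);
```

Leaving the parameter optional keeps existing callers unchanged while letting signal handlers opt in to the faster teardown.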

Generated by Firewall Issue Dispatcher
