[CI/CD Assessment] CI/CD Pipelines and Integration Tests Gap Assessment #1505

2026-03-30T22:25:42Z

github-actions[bot]
bot Mar 30, 2026

📊 Current CI/CD Pipeline Status

The repository has a mature, multi-layered CI/CD system combining traditional GitHub Actions YAML workflows with agentic (AI-driven) workflows. As of March 2026:

- ~18 traditional YAML workflows covering build, test, lint, security, docs, and release
- ~22 agentic workflows (`.md` compiled to `.lock.yml`) covering AI-assisted code review, security auditing, smoke tests with real agents, and multi-ecosystem build validation
- 3-tier integration test suite: 34 integration test files (~265 tests), 4-job chroot suite, and 4 smoke test workflows
- All workflows use pinned SHA action references for supply chain security (except `performance-monitor.yml`)

Overall health is good: the pipeline covers a wide range of checks, but coverage thresholds are permissive and 9 of 34 integration test files have no CI workflow running them.

✅ Existing Quality Gates

On Every PR (against `main`)

| Check | Workflow | Details |
| --- | --- | --- |
| TypeScript Build | `build.yml` | Node 20 & 22 matrix, `tsc`, dist verification |
| ESLint | `build.yml` + `lint.yml` | `eslint-plugin-security`, custom no-unsafe-execa rule |
| Markdown Lint | `lint.yml` | `markdownlint-cli2` on all `.md` files |
| TypeScript Type Check | `test-integration.yml` (named "TypeScript Type Check") | `tsc --noEmit` strict mode |
| Unit Tests + Coverage | `test-coverage.yml` | Jest + coverage comparison vs base branch, PR comment |
| Integration Tests | `test-integration-suite.yml` | 5 parallel jobs: Domain, Network, Protocol/Security, Container/Ops, API Proxy |
| Chroot Integration Tests | `test-chroot.yml` | 4 parallel jobs: Languages, Package Managers, /proc, Edge Cases |
| Examples Validation | `test-examples.yml` | 4 example shell scripts run end-to-end |
| Setup Action Tests | `test-action.yml` | Action install, versioning, image pull, invalid version rejection |
| Dependency Audit | `dependency-audit.yml` | `npm audit --audit-level=high`, SARIF upload to Security tab |
| CodeQL | `codeql.yml` | `javascript-typescript` + `actions` languages, security-extended queries |
| PR Title | `pr-title.yml` | Conventional commits enforcement |
| Security Guard (agentic) | `security-guard.lock.yml` | Claude reviews for iptables, Squid config, container security regressions |
| Build Test Suite (agentic) | `build-test.lock.yml` | 8 language ecosystems (Bun, C++, Deno, .NET, Go, Java, Node.js, Rust) |
| Smoke Tests | `smoke-claude/codex/copilot/chroot.lock.yml` | Real AI agent runs inside AWF; reaction-gated |

Scheduled / Non-PR

- Weekly performance benchmarks with regression issue creation
- Daily dependency security monitoring (agentic)
- Weekly link checking on documentation
- Weekly CodeQL scan
- Daily/weekly security review (agentic)
- Documentation build preview on docs-related PRs

🔍 Identified Gaps

🔴 High Priority

1. Very Low Coverage Thresholds

Current thresholds are 38% statements, 30% branches, 35% functions, and 38% lines, and the two most critical files have near-zero coverage:

- `cli.ts`: 0% coverage (0/69 statements)
- `docker-manager.ts`: 18% coverage (45/250 statements, 4% function coverage)

These files orchestrate the entire AWF lifecycle. A PR could eliminate key functionality in these files and pass the coverage gate.
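As a rough illustration of the per-file thresholds suggested for these two modules, the sketch below shows the shape of a Jest `coverageThreshold` setting. The global numbers are the ones quoted above. The `./src/` paths and the per-file floors are assumptions for illustration, not values taken from the repository: the `docker-manager.ts` floor locks in the currently reported level, and the `cli.ts` floor is a hypothetical target that would only pass once the first tests for that file exist.

```typescript
// Shape of one threshold entry; every metric is a minimum percentage.
type Threshold = {
  statements?: number;
  branches?: number;
  functions?: number;
  lines?: number;
};

// Keys are either the literal "global" or a path/glob relative to the project root.
const coverageThreshold: Record<string, Threshold> = {
  // Global thresholds as reported in this assessment.
  global: { statements: 38, branches: 30, functions: 35, lines: 38 },
  // Assumed path; floor pins the coverage reported today so it cannot drop further.
  './src/docker-manager.ts': { statements: 18, functions: 4 },
  // Assumed path and hypothetical target; fails until tests for cli.ts are added.
  './src/cli.ts': { statements: 20 },
};

// In a Jest config file this object would sit under the `coverageThreshold` key.
const jestConfigFragment = { coverageThreshold };
```

One design point to check before adopting this: per the Jest documentation, files matched by a path entry are subtracted from the global calculation, so the global percentage will shift once these two low-coverage files get their own entries.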

2. Nine Integration Test Files Have Zero CI Coverage

The following test files exist in tests/integration/ but are not run by any workflow job:

| Test File | What It Tests |
| --- | --- |
| `gh-host-injection.test.ts` | GH_HOST env var injection security |
| `ghes-auto-populate.test.ts` | GHES domain auto-population |
| `host-tcp-services.test.ts` | Host TCP service blocking |
| `workdir-tmpfs-hiding.test.ts` | Work directory tmpfs security |
| `chroot-capsh-chain.test.ts` | Capability dropping chain |
| `chroot-copilot-home.test.ts` | Copilot home directory isolation |
| `api-proxy-observability.test.ts` | API proxy logging/monitoring |
| `api-proxy-rate-limit.test.ts` | Rate limiting behavior |
| `api-target-allowlist.test.ts` | API proxy domain allowlist |

Several of these test security-critical behaviors (credential hiding, capability drop, host service isolation).
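A gap like this tends to reappear whenever a new test file is added without touching a workflow. One way to guard against it is a small check that compares the test files on disk with the patterns the CI jobs select. The sketch below assumes the jobs choose tests with regular-expression path patterns; the function name, the pattern, and the `example-covered.test.ts` file are hypothetical.

```typescript
// Returns the test files that none of the CI job patterns would select.
// Each pattern is interpreted as a regular expression matched against the path.
function findUncoveredTests(testFiles: string[], jobPatterns: string[]): string[] {
  const regexes = jobPatterns.map((pattern) => new RegExp(pattern));
  return testFiles.filter((file) => !regexes.some((re) => re.test(file)));
}

// Two file names from the table above plus one made-up file that a
// made-up job pattern does select.
const uncoveredExample = findUncoveredTests(
  [
    'tests/integration/gh-host-injection.test.ts',
    'tests/integration/api-proxy-rate-limit.test.ts',
    'tests/integration/example-covered.test.ts',
  ],
  ['example-covered'],
);
// uncoveredExample holds the first two paths; a CI step could fail when it is non-empty.
```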

3. No Container Image Vulnerability Scanning

The three Docker images (squid, agent, api-proxy) are built from Ubuntu base images and installed packages. There is no Trivy, Grype, or similar scanner integrated into the PR workflow. A dependency update in a Dockerfile could introduce a high-severity CVE undetected.

4. Performance Monitor Not Gated on PRs

performance-monitor.yml runs weekly only. A PR that increases container startup time from 3s to 15s would be merged before the regression is detected. The benchmark infrastructure already exists (scripts/ci/benchmark-performance.ts) but isn't used in PR checks.
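A PR-level gate does not need the full weekly benchmark; comparing a handful of startup samples against a budget may be enough. The sketch below is a hypothetical helper, not the interface of `benchmark-performance.ts`, and the 5000 ms budget is an assumed value. It uses the median so that a single slow run on a noisy runner does not fail the check.

```typescript
// Median of a non-empty list of numeric samples.
function median(samples: number[]): number {
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// True when the median startup time is above the budget (both in milliseconds).
function exceedsStartupBudget(samplesMs: number[], budgetMs: number): boolean {
  if (samplesMs.length === 0) {
    throw new Error('at least one sample is required');
  }
  return median(samplesMs) > budgetMs;
}

// One outlier among otherwise ~3 s startups stays within an assumed 5 s budget,
// while a consistent ~15 s startup does not.
const noisyRunFails = exceedsStartupBudget([3000, 3200, 15000], 5000); // false
const regressionFails = exceedsStartupBudget([14000, 15000, 16000], 5000); // true
```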

🟡 Medium Priority

5. Coverage Regression Check Uses `continue-on-error: true`

In `test-coverage.yml` (line 85), the coverage comparison step sets `continue-on-error: true`. The final failure step (lines 196-203) checks `steps.compare.outcome == 'failure'`, so a crash in the compare script is indistinguishable from a detected regression: both produce the outcome `failure`. More critically, if base branch coverage isn't available (the line 79 condition), no regression check runs at all, and a PR can drop coverage from 80% to 10% without failing.
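The underlying issue is that three different situations collapse into one signal. A minimal sketch of comparison logic that keeps them apart is shown below; the function and type names are hypothetical, and the 2-point tolerance is the figure proposed in the recommendations rather than an existing setting. A missing baseline is reported as its own outcome so the workflow can decide explicitly what to do instead of skipping the check silently.

```typescript
type CoverageResult = 'pass' | 'regression' | 'no-baseline';

// Compares head coverage with base coverage (both in percent).
// `basePct` is null when the base branch report could not be obtained.
function classifyCoverage(
  basePct: number | null,
  headPct: number,
  maxDropPct: number = 2,
): CoverageResult {
  if (basePct === null) {
    return 'no-baseline';
  }
  return basePct - headPct > maxDropPct ? 'regression' : 'pass';
}

const largeDrop = classifyCoverage(80, 10); // 'regression'
const missingBase = classifyCoverage(null, 10); // 'no-baseline'
const smallDrop = classifyCoverage(38.39, 37); // 'pass'
```

A script built on this could exit with distinct codes per outcome, which would let the final workflow step tell a real regression apart from a script error.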

6. No SBOM (Software Bill of Materials) Generation

No workflow generates or publishes a Software Bill of Materials for the npm packages or container images. This is increasingly expected for security-conscious tooling, especially for a product that wraps AI agents.

7. Smoke Tests Are Reaction-Gated (Not Automatic)

The smoke tests (smoke-claude, smoke-codex, smoke-copilot) require specific emoji reactions to trigger on PRs (❤️, 🎉, 👀 respectively). They also run on a 12h schedule. This means a PR that breaks the Claude smoke test path may not be caught until after merge unless a reviewer manually triggers it.

smoke-chroot does trigger automatically on relevant path changes (src/**, containers/**), which is better practice.

8. `performance-monitor.yml` Uses Unpinned Action SHAs

Unlike all other workflows which use pinned SHA references (e.g., actions/checkout@de0fac2e...), performance-monitor.yml uses floating tags:

uses: actions/checkout@v4
uses: actions/setup-node@v4
uses: actions/upload-artifact@v4
uses: actions/github-script@v7

This is a supply chain security inconsistency.
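Since only one workflow has drifted, a lightweight check could keep the rest from following. The sketch below scans workflow text for `uses:` references whose ref is not a full 40-character commit SHA. It is a hypothetical helper built on a simple regular expression, not a YAML parser, and it does not treat `docker://` references specially.

```typescript
interface UnpinnedRef {
  action: string;
  ref: string;
}

// Finds `uses: owner/repo@ref` entries whose ref is not a 40-character hex SHA.
// Local actions such as `uses: ./path` contain no "@" and are ignored.
function findUnpinnedActions(workflowText: string): UnpinnedRef[] {
  const usesLine = /^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)/gm;
  const fullSha = /^[0-9a-f]{40}$/;
  const found: UnpinnedRef[] = [];
  let match: RegExpExecArray | null;
  while ((match = usesLine.exec(workflowText)) !== null) {
    if (!fullSha.test(match[2])) {
      found.push({ action: match[1], ref: match[2] });
    }
  }
  return found;
}

// The four floating references listed above are all reported.
const floatingRefs = findUnpinnedActions(
  [
    'uses: actions/checkout@v4',
    'uses: actions/setup-node@v4',
    'uses: actions/upload-artifact@v4',
    'uses: actions/github-script@v7',
  ].join('\n'),
);
```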

9. No User-Mode Integration Tests in CI

tests/user-mode.test.sh exists but there is no CI workflow that runs it. The user-mode path (non-sudo execution) may regress silently.

10. Integration Tests Run on All PRs Without Path Filtering

`test-integration-suite.yml` runs 5 parallel jobs (each with a 45 min timeout) on every PR, including docs-only changes. Path filtering (similar to how `test-examples.yml` ignores `*.md` files) would reduce unnecessary CI load and give faster feedback on docs PRs.

🟢 Low Priority

11. No Prettier/Formatting Check

Only ESLint is enforced; no code formatter (Prettier) is configured. TypeScript code style varies across files. This is not a functional gap but increases review friction.

12. No Mutation Testing

With ~38% unit test coverage and low thresholds, the test suite may have low "kill rate" against mutations. Tools like Stryker could identify tests that pass regardless of code changes.

13. No macOS/Windows Build Verification

build.yml only targets ubuntu-latest. While AWF is Linux-focused (requires Docker + iptables), the CLI itself could theoretically surface installation issues on macOS (a common developer platform). Even a simple npm ci && npm run build && npm run type-check on macOS would catch platform-specific issues.

14. No Test Retry Logic for Flaky Integration Tests

Network-dependent integration tests (Docker container startup, Squid proxy health checks) can flake under load. No retry mechanism is configured, either at the test level (for example `jest.retryTimes()`) or at the workflow step level, so transient failures surface as spurious red builds in CI.

15. Documentation Preview Not Deployed (Artifact Only)

docs-preview.yml builds the Astro Starlight docs and uploads an artifact — reviewers must download and unzip to preview. A deployment to GitHub Pages or a preview service (e.g., Netlify/Cloudflare Pages) would significantly improve docs PR review experience.

📋 Actionable Recommendations

High Priority

| Gap | Recommendation | Complexity | Impact |
| --- | --- | --- | --- |
| Low coverage thresholds | Raise to 60% statements, 50% branches incrementally; add per-file thresholds for `cli.ts` and `docker-manager.ts` | Low | High |
| 9 uncovered integration tests | Add a new CI job or extend existing workflow jobs to include missing test patterns | Low | High |
| No container image scanning | Add `aquasecurity/trivy-action` to `build.yml` for all three container images; upload SARIF to Security tab | Low | High |
| No PR performance gate | Add a lightweight benchmark step to `build.yml` (container startup time < N seconds) using existing `benchmark-performance.ts` | Medium | Medium |

Medium Priority

| Gap | Recommendation | Complexity | Impact |
| --- | --- | --- | --- |
| Coverage regression `continue-on-error` | Restructure comparison logic: fail hard when coverage drops > 2%; allow script errors without blocking | Low | Medium |
| Unpinned actions in `performance-monitor.yml` | Pin all 4 actions to SHA digests (consistent with rest of repo) | Low | Medium |
| Reaction-gated smoke tests | Make `smoke-claude`/`smoke-codex`/`smoke-copilot` auto-trigger on `src/` and `containers/` path changes (like `smoke-chroot`) | Low | Medium |
| User-mode tests not in CI | Add a `test-user-mode` job to `test-integration-suite.yml` running `tests/user-mode.test.sh` | Low | Medium |
| Integration tests on docs PRs | Add `paths-ignore: ['**/*.md', 'docs/**', 'docs-site/**']` to `test-integration-suite.yml` | Low | Low |
| SBOM generation | Add `anchore/sbom-action` to `release.yml`; publish as release asset | Medium | Medium |

Low Priority

| Gap | Recommendation | Complexity | Impact |
| --- | --- | --- | --- |
| No Prettier | Add Prettier config + `lint:format` script; enforce in `lint.yml` | Low | Low |
| No mutation testing | Evaluate Stryker for unit tests once coverage > 60% | High | Medium |
| macOS build | Add `macos-latest` to `build.yml` matrix for `npm ci && tsc` only | Low | Low |
| No test retry | Call `jest.retryTimes(2)` in the integration test setup (Jest has no `--retries` CLI flag), or wrap the test step in a retry action, since GitHub Actions steps have no built-in retry option | Low | Low |
| Docs deployment preview | Configure Cloudflare Pages or Netlify preview deployments for `docs-site/**` PRs | Medium | Low |

📈 Metrics Summary

| Metric | Value |
| --- | --- |
| Total YAML workflows | 18 |
| Total agentic workflows (.md) | 22 |
| Workflows triggering on PRs | 15+ |
| Integration test files | 34 |
| Integration test files with CI coverage | 25 / 34 (74%) |
| Unit test files | 14 |
| Unit test coverage (statements) | 38.39% |
| Coverage threshold (statements) | 38% (barely passing) |
| `cli.ts` coverage | 0% |
| `docker-manager.ts` coverage | 18% (4% function coverage) |
| Performance benchmarking | Weekly only (not on PRs) |
| Container image scanning | None |
| Action SHA pinning compliance | 17/18 YAML workflows ✅ (performance-monitor.yml ❌) |

The most impactful improvements, in order: (1) add missing integration tests to CI, (2) add container image scanning, (3) raise coverage thresholds for the two core modules. These are all low-to-medium complexity changes that would meaningfully raise the quality bar for every PR.

AI generated by CI/CD Pipelines and Integration Tests Gap Assessment

expires on Apr 6, 2026, 10:25 PM UTC

2026-03-30T23:57:48Z

github-actions[bot]
bot Mar 30, 2026
Author

🔮 The ancient spirits stir over this repository.
The smoke-test oracle has passed through these halls,
marking the run with signs of build, browser, and file rites.
May the wards hold, and the workflows remain true.

🔮 The oracle has spoken through Smoke Codex
