Product Roadmap — Implementation Order
This issue defines the execution order for all open work across Project Ouroboros. It reflects the current state of the codebase after the #49/#50 integration gap fixes, and prioritizes foundation integrity before new features.
Guiding Principle
Don't build on top of broken state tracking and false-positive tests. Make the system honest about its own state first, then make it runnable, then build new capabilities.
The system currently has a deceptive surface: tests pass, lint passes, workflows compile — but some of those passes are vacuous. Tautological tests, unimplemented lint rules, and missing status updates mean the safety net has holes. Fixing these holes first ensures that every future change gets real feedback, not green checkmarks that mask silent failures.
Dependency Map
#46 ROADMAP (GCP, secrets, local config)
└── everything depends on this for real agent runs
#51 System integrity gaps (parent)
├── #52 Workflow state consistency
├── #53 Lint rules + test coverage
├── #54 Docs & architecture alignment
└── #55 Test quality
#47 Docker observability in CI
#48 Agent-written code must produce queryable logs
#39-#44 Feature issues (test quality framework)
├── #39 AST-based test quality gate
├── #40 Behavioral contracts
├── #41 Adversarial test writer agent
├── #42 Mutation sampling
├── #43 Human test anchors
└── #44 Reviewer test quality assessment
Phase 1: Fix the Foundation (~3 PRs)
Goal: Make every test, lint check, and workflow state transition trustworthy.
#52 Workflow state consistency: the core workflow loop has silent bugs (phantom "merged" status, 4 nodes missing status updates on success, empty-dict returns losing observability). Every agent run is affected. If you build on top of broken state tracking, new features inherit the bugs.
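The empty-dict failure mode and its fix can be sketched as follows; the function names and state keys are illustrative, not the actual workflow API:

```python
def do_work(state: dict) -> str:
    """Stand-in for the node's real work."""
    return f"processed {state.get('task', 'nothing')}"

def run_node_buggy(state: dict) -> dict:
    """Silent-failure pattern: the node succeeds but returns an empty
    update, so the workflow never records that it completed."""
    do_work(state)
    return {}  # status update lost; observability gone

def run_node_fixed(state: dict) -> dict:
    """Honest pattern: every successful exit emits an explicit status."""
    result = do_work(state)
    return {"status": "completed", "output": result}
```

The fixed version makes every state transition visible to whatever reads the workflow state downstream, which is the property Phase 1 is restoring.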
#53 Lint rules + test coverage: WF-004 and WF-006 are supposed to protect you from guard bypass and tool-budget errors. Without them, there's no safety net as you modify workflows. The GP-011–014 tests ensure your lint checks don't silently regress. The PerfComparison → PerfComparisonResult catalog mismatch also lives here.
#55 Test quality: tautological tests give false confidence. 4 guard boundary tests verify arithmetic (0 + 50 > 50 == False) instead of calling the actual guard functions. Fix these before relying on "all tests pass" as a signal that changes are safe. A quick fix: 4 tests rewritten plus 2 lint runner integration tests added.
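A sketch of the difference, using a hypothetical guard function with the tool-call limit quoted in this roadmap:

```python
MAX_TOOL_CALLS_PER_NODE = 50  # limit taken from this roadmap

def guard_tool_budget(calls_made: int, requested: int) -> bool:
    """Return True if the requested calls still fit within the budget.
    Illustrative stand-in, not the project's actual guard."""
    return calls_made + requested <= MAX_TOOL_CALLS_PER_NODE

def test_budget_tautological():
    # What the old tests did: verify Python arithmetic. This passes
    # no matter what the guard code does, or whether it exists at all.
    assert (0 + 50 > 50) == False

def test_budget_boundary():
    # What the rewritten tests do: exercise the real guard at its edges.
    assert guard_tool_budget(0, 50) is True   # exactly at the limit
    assert guard_tool_budget(1, 50) is False  # one over the limit
```

The tautological version keeps passing even if `guard_tool_budget` is deleted; the boundary version fails the moment the guard's behavior changes.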
After Phase 1: The test suite, lint rules, and workflow state transitions are all trustworthy. "All tests pass" actually means something.
Phase 2: Make It Runnable (~4 PRs)
Goal: Get agents executing end-to-end — locally and in CI — with full observability.
#46 ROADMAP: GCP project setup, Vertex AI API, service account, GitHub Actions secrets, local .env config. Nothing runs for real until this is done. It's mostly configuration, not code.
#54 Docs & architecture alignment: align ARCHITECTURE.md with what arch_lint.py actually enforces, and make guard limits configurable via env vars. Do this alongside #46, since you'll be touching config anyway. Documents the typed state model as a future design goal.
#47 Docker observability in CI: once secrets are configured (#46), agents can run in CI. But without the observability stack running, query_logs/query_metrics tools waste budget on connection-refused errors. Each failed tool call burns against MAX_TOOL_CALLS_PER_NODE=50. This issue makes CI runs functional.
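One cheap way to avoid burning budget on a dead stack is to probe the backend once before enabling the query tools. A sketch, where the host and port (9428, the VictoriaLogs default) are assumptions:

```python
import socket

def observability_reachable(host: str = "localhost", port: int = 9428,
                            timeout: float = 0.5) -> bool:
    """Return True if the logs backend accepts TCP connections.
    Host and port are placeholders for the real stack's endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A tool wrapper could call this once per node and disable query_logs/query_metrics when it returns False, spending one cheap check instead of up to 50 failed tool calls.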
#48 Agent-written code must produce queryable logs: completes the observability loop. After #47 the pipeline exists in CI; after #48, agents actually produce structured logs into it. This is the last piece before agents can self-debug by querying their own application's runtime behavior.
After Phase 2: Agents can run end-to-end in CI and locally, with full observability. The query_logs → VictoriaLogs → Vector → app pipeline is functional. The system is production-ready for autonomous operation.
Phase 3: Build New Capabilities (~6 PRs)
Goal: Advanced test quality framework — agents that can verify, challenge, and improve their own test suites.
#43 Human test anchors: protected invariant test files that agents cannot modify. Establishes a baseline of human-verified test coverage that agent-generated tests build on top of.
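As a rough sketch of how such protection could be enforced in CI (the file paths and function name are hypothetical, not the project's actual layout):

```python
# Assumed anchor list; the real one would live in project config.
PROTECTED_ANCHORS = {
    "tests/anchors/test_guards_human.py",
    "tests/anchors/test_workflow_state_human.py",
}

def anchor_violations(changed_files) -> list:
    """Return any protected anchor files an agent change set touches."""
    return sorted(set(changed_files) & PROTECTED_ANCHORS)
```

A CI step would compute the changed files from the diff and fail the run whenever the returned list is non-empty, so the human-verified baseline can never be quietly rewritten by an agent.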
#42 Mutation sampling: verifies test effectiveness by introducing small mutations into source code and checking that tests catch them. Proves tests aren't tautological at the source level.
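A minimal illustration of the idea using the stdlib ast module on a toy function, not any project code:

```python
import ast

class FlipGt(ast.NodeTransformer):
    """Mutant operator: turn every `>` into `>=`, a classic boundary bug."""
    def visit_Compare(self, node):
        node.ops = [ast.GtE() if isinstance(op, ast.Gt) else op
                    for op in node.ops]
        return node

def mutant_survives(source: str, test) -> bool:
    """Compile a mutated copy of `source` and report whether `test` still passes."""
    tree = FlipGt().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    ns = {}
    exec(compile(tree, "<mutant>", "exec"), ns)
    try:
        test(ns)
        return True    # test passed on the mutant: it missed the injected bug
    except AssertionError:
        return False   # test failed on the mutant: the mutant was killed

SOURCE = "def over_budget(n):\n    return n > 50\n"

def lazy_test(ns):
    assert ns["over_budget"](100) is True   # far from the boundary

def boundary_test(ns):
    assert ns["over_budget"](50) is False   # exactly at the boundary
```

A surviving mutant is evidence the test is weak at that boundary: here the lazy test passes on the `>=` mutant while the boundary test kills it.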
#40 Behavioral contracts: planner-emitted, deterministically verified contracts. The capstone — the planner specifies expected behaviors as formal contracts, and the system verifies them automatically.
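One way such a contract could look, with an assumed dict-based format rather than the project's actual schema:

```python
def over_budget(n: int) -> bool:
    """Toy implementation under contract."""
    return n > 50

# Hypothetical planner-emitted contract: expected behavior as plain data.
CONTRACT = {
    "function": "over_budget",
    "cases": [
        {"args": [0],  "expect": False},
        {"args": [50], "expect": False},  # boundary: exactly at the limit
        {"args": [51], "expect": True},
    ],
}

def verify_contract(fn, contract) -> list:
    """Deterministically check `fn` against the contract; return violations."""
    failures = []
    for case in contract["cases"]:
        got = fn(*case["args"])
        if got != case["expect"]:
            failures.append(
                f"{contract['function']}({case['args']}): "
                f"expected {case['expect']}, got {got}")
    return failures
```

Because the contract is data, the checker needs no LLM in the loop: verification is a deterministic pass/fail, which is what makes it usable as a CI gate.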
Dependency chain: #39 → #44 → #43 → #42 → #41 → #40. Each issue builds on the previous one: the AST gate (#39) is the foundation; behavioral contracts (#40) are the capstone.
After Phase 3: The system has a complete test quality framework. Agents can write tests, verify their quality, challenge them with mutations, and express behavioral expectations as formal contracts.
Execution Summary
| Phase | Issue order | Size |
| --- | --- | --- |
| Phase 1: Fix the foundation | #52 → #53 → #55 | ~3 PRs |
| Phase 2: Make it runnable | #46 → #54 → #47 → #48 | ~4 PRs |
| Phase 3: Build new capabilities | #39 → #44 → #43 → #42 → #41 → #40 | ~6 PRs |
Total: ~13 PRs across 14 issues (some issues may be combined into a single PR where changes overlap).
How to Use This Roadmap
- Work top-to-bottom within each phase.
- Don't start Phase 2 until Phase 1 is merged; the foundation must be solid.