Product Roadmap — Implementation Order
This issue defines the execution order for all open work across Project Ouroboros. It reflects the current state of the codebase after the #49/#50 integration gap fixes, and prioritizes foundation integrity before new features.
Guiding Principle
Don't build on top of broken state tracking and false-positive tests. Make the system honest about its own state first, then make it runnable, then build new capabilities.
The system currently has a deceptive surface: tests pass, lint passes, workflows compile — but some of those passes are vacuous. Tautological tests, unimplemented lint rules, and missing status updates mean the safety net has holes. Fixing these holes first ensures that every future change gets real feedback, not green checkmarks that mask silent failures.
Dependency Map
#46 ROADMAP (GCP, secrets, local config)
└── everything depends on this for real agent runs
#51 System integrity gaps (parent)
├── #52 Workflow state consistency
├── #53 Lint rules + test coverage
├── #54 Docs & architecture alignment
└── #55 Test quality
#47 Docker observability in CI
#48 Agent-written code must produce queryable logs
#39-#44 Feature issues (test quality framework)
├── #39 AST-based test quality gate
├── #40 Behavioral contracts
├── #41 Adversarial test writer agent
├── #42 Mutation sampling
├── #43 Human test anchors
└── #44 Reviewer test quality assessment
Phase 1: Fix the Foundation (~3 PRs)
Goal: Make every test, lint check, and workflow state transition trustworthy.
#52 Workflow state consistency: the core workflow loop has silent bugs (phantom "merged" status, 4 nodes missing status updates on success, empty-dict returns losing observability). Every agent run is affected. If you build on top of broken state tracking, new features inherit the bugs.
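The empty-dict failure mode and its fix can be sketched as follows; the function names and state keys are illustrative, not the actual workflow API:

```python
def do_work(state: dict) -> str:
    """Stand-in for the node's real work."""
    return f"processed {state.get('task', 'nothing')}"

def run_node_buggy(state: dict) -> dict:
    """Silent-failure pattern: the node succeeds but returns an empty
    update, so the workflow never records that it completed."""
    do_work(state)
    return {}  # status update lost; observability gone

def run_node_fixed(state: dict) -> dict:
    """Honest pattern: every successful exit emits an explicit status."""
    result = do_work(state)
    return {"status": "completed", "output": result}
```

The fixed version makes every state transition visible to whatever reads the workflow state downstream, which is the property Phase 1 is restoring.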
#53 Lint rules + test coverage: WF-004 and WF-006 are supposed to protect you from guard bypass and tool-budget errors. Without them, there's no safety net as you modify workflows. The GP-011–014 tests ensure your lint checks don't silently regress. The PerfComparison → PerfComparisonResult catalog mismatch also lives here.
#55 Test quality: tautological tests give false confidence. 4 guard boundary tests verify arithmetic (0 + 50 > 50 == False) instead of calling the actual guard functions. Fix these before relying on "all tests pass" as a signal that changes are safe. A quick fix: 4 tests rewritten plus 2 lint runner integration tests added.
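A sketch of the difference, using a hypothetical guard function with the tool-call limit quoted in this roadmap:

```python
MAX_TOOL_CALLS_PER_NODE = 50  # limit taken from this roadmap

def guard_tool_budget(calls_made: int, requested: int) -> bool:
    """Return True if the requested calls still fit within the budget.
    Illustrative stand-in, not the project's actual guard."""
    return calls_made + requested <= MAX_TOOL_CALLS_PER_NODE

def test_budget_tautological():
    # What the old tests did: verify Python arithmetic. This passes
    # no matter what the guard code does, or whether it exists at all.
    assert (0 + 50 > 50) == False

def test_budget_boundary():
    # What the rewritten tests do: exercise the real guard at its edges.
    assert guard_tool_budget(0, 50) is True   # exactly at the limit
    assert guard_tool_budget(1, 50) is False  # one over the limit
```

The tautological version keeps passing even if `guard_tool_budget` is deleted; the boundary version fails the moment the guard's behavior changes.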
After Phase 1: The test suite, lint rules, and workflow state transitions are all trustworthy. "All tests pass" actually means something.
Phase 2: Make It Runnable (~4 PRs)
Goal: Get agents executing end-to-end — locally and in CI — with full observability.
#46 ROADMAP: GCP project setup, Vertex AI API, service account, GitHub Actions secrets, local .env config. Nothing runs for real until this is done. It's mostly configuration, not code.
#54 Docs & architecture alignment: align ARCHITECTURE.md with what arch_lint.py actually enforces, and make guard limits configurable via env vars. Do this alongside #46, since you'll be touching config anyway. Documents the typed state model as a future design goal.
#47 Docker observability in CI: once secrets are configured (#46), agents can run in CI. But without the observability stack running, query_logs/query_metrics tools waste budget on connection-refused errors. Each failed tool call burns against MAX_TOOL_CALLS_PER_NODE=50. This issue makes CI runs functional.
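One cheap way to avoid burning budget on a dead stack is to probe the backend once before enabling the query tools. A sketch, where the host and port (9428, the VictoriaLogs default) are assumptions:

```python
import socket

def observability_reachable(host: str = "localhost", port: int = 9428,
                            timeout: float = 0.5) -> bool:
    """Return True if the logs backend accepts TCP connections.
    Host and port are placeholders for the real stack's endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A tool wrapper could call this once per node and disable query_logs/query_metrics when it returns False, spending one cheap check instead of up to 50 failed tool calls.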
#48 Agent-written code must produce queryable logs: completes the observability loop. After #47 the pipeline exists in CI; after #48, agents actually produce structured logs into it. This is the last piece before agents can self-debug by querying their own application's runtime behavior.
After Phase 2: Agents can run end-to-end in CI and locally, with full observability. The query_logs → VictoriaLogs → Vector → app pipeline is functional. The system is production-ready for autonomous operation.
Phase 3: Build New Capabilities (~6 PRs)
Goal: Advanced test quality framework — agents that can verify, challenge, and improve their own test suites.
#43 Human test anchors: protected invariant test files that agents cannot modify. Establishes a baseline of human-verified test coverage that agent-generated tests build on top of.
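As a rough sketch of how such protection could be enforced in CI (the file paths and function name are hypothetical, not the project's actual layout):

```python
# Assumed anchor list; the real one would live in project config.
PROTECTED_ANCHORS = {
    "tests/anchors/test_guards_human.py",
    "tests/anchors/test_workflow_state_human.py",
}

def anchor_violations(changed_files) -> list:
    """Return any protected anchor files an agent change set touches."""
    return sorted(set(changed_files) & PROTECTED_ANCHORS)
```

A CI step would compute the changed files from the diff and fail the run whenever the returned list is non-empty, so the human-verified baseline can never be quietly rewritten by an agent.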
#42 Mutation sampling: verifies test effectiveness by introducing small mutations into source code and checking that tests catch them. Proves tests aren't tautological at the source level.
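A minimal illustration of the idea using the stdlib ast module on a toy function, not any project code:

```python
import ast

class FlipGt(ast.NodeTransformer):
    """Mutant operator: turn every `>` into `>=`, a classic boundary bug."""
    def visit_Compare(self, node):
        node.ops = [ast.GtE() if isinstance(op, ast.Gt) else op
                    for op in node.ops]
        return node

def mutant_survives(source: str, test) -> bool:
    """Compile a mutated copy of `source` and report whether `test` still passes."""
    tree = FlipGt().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    ns = {}
    exec(compile(tree, "<mutant>", "exec"), ns)
    try:
        test(ns)
        return True    # test passed on the mutant: it missed the injected bug
    except AssertionError:
        return False   # test failed on the mutant: the mutant was killed

SOURCE = "def over_budget(n):\n    return n > 50\n"

def lazy_test(ns):
    assert ns["over_budget"](100) is True   # far from the boundary

def boundary_test(ns):
    assert ns["over_budget"](50) is False   # exactly at the boundary
```

A surviving mutant is evidence the test is weak at that boundary: here the lazy test passes on the `>=` mutant while the boundary test kills it.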
#40 Behavioral contracts: planner-emitted, deterministically verified contracts. The capstone — the planner specifies expected behaviors as formal contracts, and the system verifies them automatically.
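One way such a contract could look, with an assumed dict-based format rather than the project's actual schema:

```python
def over_budget(n: int) -> bool:
    """Toy implementation under contract."""
    return n > 50

# Hypothetical planner-emitted contract: expected behavior as plain data.
CONTRACT = {
    "function": "over_budget",
    "cases": [
        {"args": [0],  "expect": False},
        {"args": [50], "expect": False},  # boundary: exactly at the limit
        {"args": [51], "expect": True},
    ],
}

def verify_contract(fn, contract) -> list:
    """Deterministically check `fn` against the contract; return violations."""
    failures = []
    for case in contract["cases"]:
        got = fn(*case["args"])
        if got != case["expect"]:
            failures.append(
                f"{contract['function']}({case['args']}): "
                f"expected {case['expect']}, got {got}")
    return failures
```

Because the contract is data, the checker needs no LLM in the loop: verification is a deterministic pass/fail, which is what makes it usable as a CI gate.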
Dependency chain: #39 → #44 → #43 → #42 → #41 → #40. Each issue builds on the previous one: the AST gate (#39) is the foundation; behavioral contracts (#40) are the capstone.
After Phase 3: The system has a complete test quality framework. Agents can write tests, verify their quality, challenge them with mutations, and express behavioral expectations as formal contracts.
Execution Summary
| Phase | Issue order | Size |
| --- | --- | --- |
| Phase 1: Fix the foundation | #52 → #53 → #55 | ~3 PRs |
| Phase 2: Make it runnable | #46 → #54 → #47 → #48 | ~4 PRs |
| Phase 3: Build new capabilities | #39 → #44 → #43 → #42 → #41 → #40 | ~6 PRs |
Total: ~13 PRs across 14 issues (some issues may be combined into a single PR where changes overlap).
How to Use This Roadmap
- Work top-to-bottom within each phase.
- Don't start Phase 2 until Phase 1 is merged; the foundation must be solid.