Research-backed proposals: deeper front-loaded discussion, smarter context injection, and codebase intelligence for complex projects #3333
Replies: 8 comments 5 replies
Very much looking forward to this if you decide to proceed. Can't emphasize enough how much I miss the old interaction- and input-heavy discussion format in particular. Today I played around with implementing a system using Beans CLI as the base, and the create slash command skill specifically, to get more manual control: select items I wanted to work on from Beans, discuss and plan them out, link them together, then route them into the queue system to create a milestone that /auto would work through. Definitely an improvement, but it still feels like a step backwards from the old workflow with GSD 1.
@TigerTugger Appreciate the validation, especially interesting that you're also working around the gap with external tooling. I've already submitted PRs for proposals 3 and 4 (pre-execution plan verification and cross-task consistency checking) at #3467 and #3468. The deep discussion system (proposal 1) is next on my list. Will share progress here as it develops.
This sounds like exactly what we need - well done! Discussions aren't nearly as deep in V2 as I remember them being in V1. I have to keep prompting to go deeper on research discussions; it currently makes way too many assumptions.
Quick update: The enhanced verification system from proposals 3 and 4 was merged in PR #3468. The deep evidence-backed discussion system from proposal 1 is now up for review in PR #3602 — it adds a mechanical preparation engine and a structured 4-layer discussion protocol with mandatory rounds. Would appreciate feedback from anyone who's felt the discussion depth gap.
Update: Both the enhanced verification system (#3468) and the deep evidence-backed discussion system (#3602) are now merged! Proposals 1, 3, and 4 from the original post are shipped. Next up: tiered context injection (proposal 2) to address the monolithic context inlining problem the ETH Zurich research identified.
Update: Tiered context injection merged in PR #3766 (65.7% context
For the remaining two proposals (progressive planning + mid-execution
Update: Layered depth enforcement merged in PR #4079. That's 5 contributions shipped.
Open: RFC #3789 (progressive planning + mid-execution escalation)
PR #4079 merged. Marking this as answered — the layered discussion rounds described here are now live.
I've been using GSD-2 on complex projects and ran into the same wall that @turtlemath, @stoersignal, and @benjaminalgreen described in #2468 — auto-mode is impressive engineering, but for projects with real architectural complexity, I need more control at decision boundaries, not less.
Rather than just voicing frustration, I dug into the recent research on AI coding agent failure modes to understand why this happens and what specific changes would fix it. Sharing the findings here because I think they're directly actionable and I'd like maintainer feedback before writing any code.
The core problem
The discussion flow in GSD v2 is too detached from the main pipeline to be useful on complex projects.
- /gsd discuss runs in a second terminal as a sidebar while auto-mode races ahead — in practice, auto-mode moves too fast for the discussion to meaningfully shape anything.
- require_slice_discussion exists, but even when enabled the discussion lacks depth: it doesn't read the codebase, doesn't review prior decisions, doesn't research the ecosystem. It just asks questions.

What actually worked in GSD v1 was that discuss-phase was front-loaded, mandatory, and contextually adaptive — it analyzed the work type, surfaced gray areas, and produced a CONTEXT.md that directly constrained the planner. The discussion happened before any planning, not alongside it.

The users in #2468 nailed it: the system should collaborate with you at decision boundaries and execute autonomously between them. The ideal isn't more discussion points or a second terminal — it's one deep, evidence-backed discussion front-loaded into the milestone setup (during the /gsd new-milestone flow or the initial /gsd step-mode entry) that covers everything before the first slice ever plans.

What the research says
I reviewed ~30 papers and analyses from 2025-2026. Here are the findings most relevant to GSD:
Context rot is worse than "big context window" assumes
Chroma Research (July 2025) tested 18 frontier models including Claude Opus 4 and GPT-4.1. Every model degrades at every input length increment — even well below the context window limit. A 200K-token model shows significant degradation at 50K tokens. This isn't a capability gap; it's an architectural property of transformer attention.
GSD impact: ADR-003's merged plan-milestone session (research + planning in one) creates the heaviest session in the pipeline — and the most vulnerable to quality degradation. The session where all architectural decisions get made is now the session most likely to suffer from context rot.
Monolithic context files hurt more than they help
ETH Zurich (February 2026) — the first rigorous study of AGENTS.md-style context files:
GSD impact: PROJECT.md, DECISIONS.md, KNOWLEDGE.md, and REQUIREMENTS.md are inlined into nearly every session. This is exactly the anti-pattern the research identifies. ADR-003's Phase 4 moves toward file-path references, which is the right direction — but the ETH Zurich data suggests going further with scoped, task-relevant injection.
Importantly, this also means a static markdown codebase map (file listing with descriptions) would actively hurt — it's the kind of redundant overview the ETH Zurich study found adds noise without helping navigation. What does help is structural/graph-based codebase understanding (call graphs, dependency chains, impact analysis) — tools like GrapeRoot, Axon, or codebase-memory-mcp that let the agent query relationships ("what calls this function?" "what would break if I change this interface?") rather than reading a flat file. GSD doesn't currently have any codebase intelligence integration — this seems like a significant gap, especially for brownfield projects.
Compound failure rates kill long pipelines
Zylos Research (February 2026): At 95% per-step reliability over a 20-step workflow, the combined success rate drops to 36%. A 2% misalignment early on compounds into a 40% failure rate by the end. 65% of enterprise AI failures were attributed to context drift — not context exhaustion.
GSD impact: The quality-profile pipeline runs 30 sessions for a 4-slice milestone. Even at 95% per-session reliability, that's a ~21% compound success rate. ADR-003 reduces this to 16 sessions, which helps — but the research suggests the real fix is human checkpoints at decision boundaries combined with fewer sessions, not just fewer sessions alone.
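The compound-failure arithmetic above is easy to check. A minimal TypeScript sketch (numbers rounded to one decimal place):

```typescript
// Compound success rate for an n-step pipeline with per-step reliability p.
// At 95% per step, long pipelines degrade fast.
function compoundSuccess(p: number, steps: number): number {
  return Math.pow(p, steps);
}

// Zylos figure: 20-step workflow at 95% per-step reliability.
console.log((compoundSuccess(0.95, 20) * 100).toFixed(1)); // 35.8

// GSD quality-profile pipeline: 30 sessions for a 4-slice milestone.
console.log((compoundSuccess(0.95, 30) * 100).toFixed(1)); // 21.5

// ADR-003's reduced 16-session pipeline.
console.log((compoundSuccess(0.95, 16) * 100).toFixed(1)); // 44.0
```

Even the reduced 16-session pipeline only clears 44%, which is why checkpoints at decision boundaries matter as much as session count.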
Agents hallucinate rather than use tools
OpenAI, "Why Language Models Hallucinate" (September 2025): Training objectives reward confident guessing over calibrated uncertainty. Models are literally trained to produce answers, not to say "I don't know" or use retrieval tools. This means any knowledge system that relies on the agent choosing to search will fail silently — the agent will just confidently make something up instead. Knowledge retrieval has to be structural (the pipeline handles it before dispatch), not optional (hoping the agent decides to search).
The perception gap is real
The METR study found experienced open-source maintainers were 19% slower with AI tools while believing they were 20% faster — a 39-point perception gap. Google's 2025 DORA report found that a 90% increase in AI adoption correlates with 9% more bugs and 91% more code review time.
What I'm proposing (and willing to build)
None of this requires fighting GSD's architecture. It's about making existing features deeper and smarter.
1. Deep, evidence-backed discussion front-loaded into milestone setup
This is the single highest-impact change. The discussion that happens when you start a new milestone (during the /gsd step-mode entry or new-milestone flow) should be dramatically deeper, with the agent doing real preparation (reading the codebase, reviewing prior decisions, researching the ecosystem) before asking any questions.

The discussion proceeds in layers: what (scope/features) → how (architecture/patterns) → what-if (error states/edge cases) → quality bar (testing/performance expectations). Each layer gets explicit sign-off before moving to the next. The output is a structured CONTEXT.md with decisions, rationale, interface contracts, and acceptance criteria that the planner treats as a hard constraint.
This isn't a sidebar. It's part of the main pipeline, front-loaded before any roadmap gets created. Auto-mode should not be able to proceed past milestone setup without a signed-off design document.
Scope: Enhancement to the milestone setup / discussion flow within the GSD extension. No changes to the auto-mode dispatch table or pipeline structure.
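The layer-gating rule above can be stated precisely. A minimal sketch, assuming nothing about GSD's internals (the type names here are illustrative, not its real API):

```typescript
// The four discussion layers, in mandatory order.
type Layer = "what" | "how" | "what-if" | "quality-bar";

interface LayerResult {
  layer: Layer;
  decisions: string[]; // decisions + rationale captured in this round
  signedOff: boolean;  // explicit user sign-off for this layer
}

const LAYER_ORDER: Layer[] = ["what", "how", "what-if", "quality-bar"];

// Planning may only begin once every layer has explicit sign-off;
// the signed-off results are what get serialized into CONTEXT.md.
function canStartPlanning(results: LayerResult[]): boolean {
  return LAYER_ORDER.every((layer) =>
    results.some((r) => r.layer === layer && r.signedOff)
  );
}
```

The point of making the gate a boolean over all four layers is that auto-mode cannot skip ahead: a missing sign-off anywhere blocks roadmap creation.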
2. Tiered context injection (supporting ADR-003 Phase 4)
Replace monolithic inlining with scoped, task-relevant injection, following the ETH Zurich evidence. A concrete enforcement mechanism is verifying that the agent actually read the references: plan output must cite the ROADMAP's slice description by title and risk level.
Scope: Refactor of auto-prompts.ts context inlining. Aligns directly with ADR-003 Phase 4.
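The core of scoped injection is a filter over tagged context sections rather than wholesale inlining. A minimal sketch, assuming a hypothetical tagged-section shape (not GSD's actual file format):

```typescript
// A context document (PROJECT.md, DECISIONS.md, ...) split into tagged sections.
interface ContextSection {
  source: string;  // e.g. "DECISIONS.md"
  tags: string[];  // topics this section covers
  body: string;
}

// Inject only the sections whose tags overlap the current slice's topics,
// instead of inlining every context file into every session.
function injectScoped(sections: ContextSection[], sliceTags: string[]): string {
  return sections
    .filter((s) => s.tags.some((t) => sliceTags.includes(t)))
    .map((s) => `<!-- from ${s.source} -->\n${s.body}`)
    .join("\n\n");
}
```

Sections the slice never touches contribute zero tokens, which is exactly the scoping the ETH Zurich data argues for.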
3. Pre-execution plan verification
Mechanical checks (no LLM cost) before tasks execute, such as confirming that referenced packages actually exist (via npm info / pip show).

Catching a bad plan before 3 tasks execute against it saves far more tokens than catching bad implementation after.
Scope: Extension to the verification gate. No pipeline changes.
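To make "mechanical checks" concrete: a minimal sketch of a zero-LLM plan gate, assuming a hypothetical plan shape (in a real gate the package check would shell out to npm info / pip show; here it checks a pre-fetched snapshot):

```typescript
import { existsSync } from "node:fs";

// Illustrative plan shape: just the files and packages the plan references.
interface Plan {
  filesTouched: string[];
  packages: string[];
}

// Purely mechanical verification: no model calls, just filesystem and
// registry-snapshot lookups. Returns a list of problems (empty = plan passes).
function verifyPlan(plan: Plan, knownPackages: Set<string>): string[] {
  const problems: string[] = [];
  for (const f of plan.filesTouched) {
    if (!existsSync(f)) problems.push(`referenced file does not exist: ${f}`);
  }
  for (const p of plan.packages) {
    if (!knownPackages.has(p)) problems.push(`unknown package: ${p}`);
  }
  return problems;
}
```

A plan that references a file the codebase doesn't have, or a package the registry doesn't know, gets bounced before any task burns tokens on it.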
4. Cross-task consistency checking + hallucination cascade prevention
After each task, mechanically verify:
This catches hallucination cascades — where task 1 introduces a subtle error that tasks 2-3 build on. Surge AI's SWE-bench case studies showed SOTA models inventing entire classes and methods without realizing they'd lost touch with the real codebase. Mechanical cross-task verification catches this.
Scope: Extension to post-unit verification. No pipeline changes.
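One mechanical form of the cascade check is symbol accounting: every symbol a task's new code references must either predate the milestone or have been introduced by an earlier task. A minimal sketch under that assumption (how symbols are extracted is left open):

```typescript
// Symbols referenced by task N's diff that exist neither in the pre-milestone
// codebase nor in any earlier task's output are phantoms — likely hallucinated.
function findPhantomSymbols(
  referenced: string[],      // symbols the new code calls or imports
  preExisting: Set<string>,  // symbols in the codebase before the milestone
  introduced: Set<string>    // symbols earlier tasks actually created
): string[] {
  return referenced.filter((s) => !preExisting.has(s) && !introduced.has(s));
}
```

A non-empty result after task 1 stops tasks 2-3 from building on an invented class or method, which is exactly the Surge AI failure mode.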
5. Structured findings for cross-task knowledge transfer
Task executors write T##-FINDINGS.json with tagged discoveries (API behavior surprises, library quirks, pattern decisions). The orchestrator queries relevant findings before dispatching the next task and injects them — structurally, not optionally.

This addresses the tool-use reliability problem directly: the agent doesn't need to "remember" to search because the pipeline does the retrieval before dispatch. The agent never has to decide whether to look something up.
Scope: New artifact type + dispatch prompt enhancement. Moderate change to auto-prompts.ts.
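A minimal sketch of the findings shape and the pre-dispatch retrieval step (field names are assumptions, not a spec):

```typescript
// One tagged discovery from a task executor's T##-FINDINGS.json.
interface Finding {
  tag: "api-surprise" | "library-quirk" | "pattern-decision";
  topic: string; // what area of the codebase/ecosystem it concerns
  note: string;  // the discovery itself
}

// The orchestrator, not the agent, selects the findings relevant to the next
// task and injects them into its prompt. Retrieval is structural: it happens
// before dispatch, never contingent on the model choosing to search.
function findingsForTask(all: Finding[], taskTopics: string[]): Finding[] {
  return all.filter((f) => taskTopics.includes(f.topic));
}
```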
6. Codebase intelligence integration
GSD currently has no structural understanding of the codebase beyond what the agent discovers per-session via grep/file reads. For complex projects, this means every planning session re-discovers the same architecture from scratch. Integrating with a graph-based codebase tool (GrapeRoot, Axon, or codebase-memory-mcp via MCP) would let the planning session query "what modules depend on the auth system?" or "what would be affected by changing the User type?" without burning tokens on manual exploration.
This could be as simple as: if a codebase intelligence MCP server is available, the discussion and planning flows query it for structural context. If not, fall back to the existing grep/read approach. No hard dependency.
Scope: Optional MCP integration in the discussion and planning prompts. Extension-level.
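The soft-dependency behavior can be sketched in a few lines. Here `GraphQuery` stands in for a codebase-intelligence MCP call; nothing in this sketch is GSD's or any MCP server's real API:

```typescript
// A structural query ("what calls login()?") answered by a graph tool,
// or null when the tool has no answer.
type GraphQuery = (question: string) => string[] | null;

// Prefer the graph tool when one is configured; otherwise fall back to the
// existing grep/read discovery path. No hard dependency on any MCP server.
function structuralContext(
  question: string,
  queryGraph: GraphQuery | undefined,
  grepFallback: (q: string) => string[]
): string[] {
  if (queryGraph) {
    const hit = queryGraph(question);
    if (hit !== null) return hit; // graph tool answered
  }
  return grepFallback(question);
}
```

The fallback keeps greenfield and tool-less setups working unchanged while letting brownfield projects with a graph server skip the per-session re-discovery cost.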
What I'm asking for
I'm happy to write ADRs for anything that needs architectural review, and I'll follow the contribution guide (extension-first, simplicity over abstraction, tests for everything). I want to make sure I'm building what the project actually wants before I start writing code.
I have a detailed research synthesis document covering all the papers cited here, the specific GSD impact analysis, risk assessment with mitigations, and a phased implementation plan. Happy to share as a Gist if anyone wants the full picture.