Research-backed proposals: deeper front-loaded discussion, smarter context injection, and codebase intelligence for complex projects #3333
Replies: 8 comments 5 replies
Very much looking forward to this if you decide to proceed. Can't emphasize enough how much I miss the old interaction- and input-heavy discussion format in particular. Today I played around with implementing a system using Beans CLI as the base, and the create slash command skill specifically, to get more manual control: select items I wanted to work on from Beans, discuss and plan them out, link them together, then route them into the queue system to create a milestone that /auto would work through. Definitely an improvement, but it still feels like a step backwards from the old workflow with GSD 1.
@TigerTugger Appreciate the validation, especially interesting that you're also working around the gap with external tooling. I've already submitted PRs for proposals 3 and 4 (pre-execution plan verification and cross-task consistency checking) at #3467 and #3468. The deep discussion system (proposal 1) is next on my list. Will share progress here as it develops.
This sounds like exactly what we need - well done! Discussions aren't nearly as deep in V2 as I remember them being in V1. I have to keep prompting to go deeper on research discussions; it currently makes way too many assumptions.
Quick update: The enhanced verification system from proposals 3 and 4 was merged in PR #3468. The deep evidence-backed discussion system from proposal 1 is now up for review in PR #3602 — it adds a mechanical preparation engine and a structured 4-layer discussion protocol with mandatory rounds. Would appreciate feedback from anyone who's felt the discussion depth gap.
Update: Both the enhanced verification system (#3468) and the deep evidence-backed discussion system (#3602) are now merged! Proposals 1, 3, and 4 from the original post are shipped. Next up: tiered context injection (proposal 2) to address the monolithic context inlining problem the ETH Zurich research identified.
Update: Tiered context injection merged in PR #3766 (65.7% context
For the remaining two proposals (progressive planning + mid-execution
Update: Layered depth enforcement merged in PR #4079. That's 5 contributions shipped.
Open: RFC #3789 (progressive planning + mid-execution escalation)
PR #4079 merged. Marking this as answered — the layered discussion rounds described here are now live.
I've been using GSD-2 on complex projects and ran into the same wall that @turtlemath, @stoersignal, and @benjaminalgreen described in #2468 — auto-mode is impressive engineering, but for projects with real architectural complexity, I need more control at decision boundaries, not less.
Rather than just voicing frustration, I dug into the recent research on AI coding agent failure modes to understand why this happens and what specific changes would fix it. Sharing the findings here because I think they're directly actionable and I'd like maintainer feedback before writing any code.
The core problem
The discussion flow in GSD v2 is too detached from the main pipeline to be useful on complex projects.
- /gsd discuss runs in a second terminal as a sidebar while auto-mode races ahead — in practice, auto-mode moves too fast for the discussion to meaningfully shape anything.
- require_slice_discussion exists, but even when enabled the discussion lacks depth: it doesn't read the codebase, doesn't review prior decisions, doesn't research the ecosystem. It just asks questions.

What actually worked in GSD v1 was that discuss-phase was front-loaded, mandatory, and contextually adaptive — it analyzed the work type, surfaced gray areas, and produced a CONTEXT.md that directly constrained the planner. The discussion happened before any planning, not alongside it.

The users in #2468 nailed it: the system should collaborate with you at decision boundaries and execute autonomously between them. The ideal isn't more discussion points or a second terminal — it's one deep, evidence-backed discussion front-loaded into the milestone setup (during the /gsd new-milestone flow or the initial /gsd step-mode entry) that covers everything before the first slice ever plans.

What the research says
I reviewed ~30 papers and analyses from 2025-2026. Here are the findings most relevant to GSD:
Context rot is worse than "big context window" assumes
Chroma Research (July 2025) tested 18 frontier models including Claude Opus 4 and GPT-4.1. Every model degrades at every input length increment — even well below the context window limit. A 200K-token model shows significant degradation at 50K tokens. This isn't a capability gap; it's an architectural property of transformer attention.
GSD impact: ADR-003's merged plan-milestone session (research + planning in one) creates the heaviest session in the pipeline — and the most vulnerable to quality degradation. The session where all architectural decisions get made is now the session most likely to suffer from context rot.
Monolithic context files hurt more than they help
ETH Zurich (February 2026) — the first rigorous study of AGENTS.md-style context files:
GSD impact: PROJECT.md, DECISIONS.md, KNOWLEDGE.md, and REQUIREMENTS.md are inlined into nearly every session. This is exactly the anti-pattern the research identifies. ADR-003's Phase 4 moves toward file-path references, which is the right direction — but the ETH Zurich data suggests going further with scoped, task-relevant injection.
Importantly, this also means a static markdown codebase map (file listing with descriptions) would actively hurt — it's the kind of redundant overview the ETH Zurich study found adds noise without helping navigation. What does help is structural/graph-based codebase understanding (call graphs, dependency chains, impact analysis) — tools like GrapeRoot, Axon, or codebase-memory-mcp that let the agent query relationships ("what calls this function?" "what would break if I change this interface?") rather than reading a flat file. GSD doesn't currently have any codebase intelligence integration — this seems like a significant gap, especially for brownfield projects.
Compound failure rates kill long pipelines
Zylos Research (February 2026): At 95% per-step reliability over a 20-step workflow, the combined success rate drops to 36%. A 2% misalignment early on compounds into a 40% failure rate by the end. 65% of enterprise AI failures were attributed to context drift — not context exhaustion.
GSD impact: The quality-profile pipeline runs 30 sessions for a 4-slice milestone. Even at 95% per-session reliability, that's a ~21% compound success rate. ADR-003 reduces this to 16 sessions, which helps — but the research suggests the real fix is human checkpoints at decision boundaries combined with fewer sessions, not just fewer sessions alone.
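The compound-failure arithmetic above is easy to check. A minimal TypeScript sketch (numbers rounded to one decimal place):

```typescript
// Compound success rate for an n-step pipeline with per-step reliability p.
// At 95% per step, long pipelines degrade fast.
function compoundSuccess(p: number, steps: number): number {
  return Math.pow(p, steps);
}

// Zylos figure: 20-step workflow at 95% per-step reliability.
console.log((compoundSuccess(0.95, 20) * 100).toFixed(1)); // 35.8

// GSD quality-profile pipeline: 30 sessions for a 4-slice milestone.
console.log((compoundSuccess(0.95, 30) * 100).toFixed(1)); // 21.5

// ADR-003's reduced 16-session pipeline.
console.log((compoundSuccess(0.95, 16) * 100).toFixed(1)); // 44.0
```

Even the reduced 16-session pipeline only clears 44%, which is why checkpoints at decision boundaries matter as much as session count.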
Agents hallucinate rather than use tools
OpenAI, "Why Language Models Hallucinate" (September 2025): Training objectives reward confident guessing over calibrated uncertainty. Models are literally trained to produce answers, not to say "I don't know" or use retrieval tools. This means any knowledge system that relies on the agent choosing to search will fail silently — the agent will just confidently make something up instead. Knowledge retrieval has to be structural (the pipeline handles it before dispatch), not optional (hoping the agent decides to search).
The perception gap is real
The METR study found experienced open-source maintainers were 19% slower with AI tools while believing they were 20% faster — a 39-point perception gap. Google's 2025 DORA report found that a 90% increase in AI adoption correlates with 9% more bugs and 91% more code review time.
What I'm proposing (and willing to build)
None of this requires fighting GSD's architecture. It's about making existing features deeper and smarter.
1. Deep, evidence-backed discussion front-loaded into milestone setup
This is the single highest-impact change. The discussion that happens when you start a new milestone (during the /gsd step-mode entry or new-milestone flow) should be dramatically deeper, with the agent doing real preparation (reading the codebase, reviewing prior decisions, researching the ecosystem) before asking any questions.

The discussion proceeds in layers: what (scope/features) → how (architecture/patterns) → what-if (error states/edge cases) → quality bar (testing/performance expectations). Each layer gets explicit sign-off before moving to the next. The output is a structured CONTEXT.md with decisions, rationale, interface contracts, and acceptance criteria that the planner treats as a hard constraint.
This isn't a sidebar. It's part of the main pipeline, front-loaded before any roadmap gets created. Auto-mode should not be able to proceed past milestone setup without a signed-off design document.
Scope: Enhancement to the milestone setup / discussion flow within the GSD extension. No changes to the auto-mode dispatch table or pipeline structure.
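The layer-gating rule above can be stated precisely. A minimal sketch, assuming nothing about GSD's internals (the type names here are illustrative, not its real API):

```typescript
// The four discussion layers, in mandatory order.
type Layer = "what" | "how" | "what-if" | "quality-bar";

interface LayerResult {
  layer: Layer;
  decisions: string[]; // decisions + rationale captured in this round
  signedOff: boolean;  // explicit user sign-off for this layer
}

const LAYER_ORDER: Layer[] = ["what", "how", "what-if", "quality-bar"];

// Planning may only begin once every layer has explicit sign-off;
// the signed-off results are what get serialized into CONTEXT.md.
function canStartPlanning(results: LayerResult[]): boolean {
  return LAYER_ORDER.every((layer) =>
    results.some((r) => r.layer === layer && r.signedOff)
  );
}
```

The point of making the gate a boolean over all four layers is that auto-mode cannot skip ahead: a missing sign-off anywhere blocks roadmap creation.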
2. Tiered context injection (supporting ADR-003 Phase 4)
Replace monolithic inlining with scoped, task-relevant injection, following the ETH Zurich evidence. A concrete enforcement mechanism is verifying that the agent actually read the references: plan output must cite the ROADMAP's slice description by title and risk level.
Scope: Refactor of auto-prompts.ts context inlining. Aligns directly with ADR-003 Phase 4.
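The core of scoped injection is a filter over tagged context sections rather than wholesale inlining. A minimal sketch, assuming a hypothetical tagged-section shape (not GSD's actual file format):

```typescript
// A context document (PROJECT.md, DECISIONS.md, ...) split into tagged sections.
interface ContextSection {
  source: string;  // e.g. "DECISIONS.md"
  tags: string[];  // topics this section covers
  body: string;
}

// Inject only the sections whose tags overlap the current slice's topics,
// instead of inlining every context file into every session.
function injectScoped(sections: ContextSection[], sliceTags: string[]): string {
  return sections
    .filter((s) => s.tags.some((t) => sliceTags.includes(t)))
    .map((s) => `<!-- from ${s.source} -->\n${s.body}`)
    .join("\n\n");
}
```

Sections the slice never touches contribute zero tokens, which is exactly the scoping the ETH Zurich data argues for.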
3. Pre-execution plan verification
Mechanical checks (no LLM cost) before tasks execute, such as confirming that referenced packages actually exist (via npm info / pip show).

Catching a bad plan before 3 tasks execute against it saves far more tokens than catching bad implementation after.
Scope: Extension to the verification gate. No pipeline changes.
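To make "mechanical checks" concrete: a minimal sketch of a zero-LLM plan gate, assuming a hypothetical plan shape (in a real gate the package check would shell out to npm info / pip show; here it checks a pre-fetched snapshot):

```typescript
import { existsSync } from "node:fs";

// Illustrative plan shape: just the files and packages the plan references.
interface Plan {
  filesTouched: string[];
  packages: string[];
}

// Purely mechanical verification: no model calls, just filesystem and
// registry-snapshot lookups. Returns a list of problems (empty = plan passes).
function verifyPlan(plan: Plan, knownPackages: Set<string>): string[] {
  const problems: string[] = [];
  for (const f of plan.filesTouched) {
    if (!existsSync(f)) problems.push(`referenced file does not exist: ${f}`);
  }
  for (const p of plan.packages) {
    if (!knownPackages.has(p)) problems.push(`unknown package: ${p}`);
  }
  return problems;
}
```

A plan that references a file the codebase doesn't have, or a package the registry doesn't know, gets bounced before any task burns tokens on it.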
4. Cross-task consistency checking + hallucination cascade prevention
After each task, mechanically verify:
This catches hallucination cascades — where task 1 introduces a subtle error that tasks 2-3 build on. Surge AI's SWE-bench case studies showed SOTA models inventing entire classes and methods without realizing they'd lost touch with the real codebase. Mechanical cross-task verification catches this.
Scope: Extension to post-unit verification. No pipeline changes.
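One mechanical form of the cascade check is symbol accounting: every symbol a task's new code references must either predate the milestone or have been introduced by an earlier task. A minimal sketch under that assumption (how symbols are extracted is left open):

```typescript
// Symbols referenced by task N's diff that exist neither in the pre-milestone
// codebase nor in any earlier task's output are phantoms — likely hallucinated.
function findPhantomSymbols(
  referenced: string[],      // symbols the new code calls or imports
  preExisting: Set<string>,  // symbols in the codebase before the milestone
  introduced: Set<string>    // symbols earlier tasks actually created
): string[] {
  return referenced.filter((s) => !preExisting.has(s) && !introduced.has(s));
}
```

A non-empty result after task 1 stops tasks 2-3 from building on an invented class or method, which is exactly the Surge AI failure mode.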
5. Structured findings for cross-task knowledge transfer
Task executors write T##-FINDINGS.json with tagged discoveries (API behavior surprises, library quirks, pattern decisions). The orchestrator queries relevant findings before dispatching the next task and injects them — structurally, not optionally.

This addresses the tool-use reliability problem directly: the agent doesn't need to "remember" to search because the pipeline does the retrieval before dispatch. The agent never has to decide whether to look something up.
Scope: New artifact type + dispatch prompt enhancement. Moderate change to auto-prompts.ts.
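A minimal sketch of the findings shape and the pre-dispatch retrieval step (field names are assumptions, not a spec):

```typescript
// One tagged discovery from a task executor's T##-FINDINGS.json.
interface Finding {
  tag: "api-surprise" | "library-quirk" | "pattern-decision";
  topic: string; // what area of the codebase/ecosystem it concerns
  note: string;  // the discovery itself
}

// The orchestrator, not the agent, selects the findings relevant to the next
// task and injects them into its prompt. Retrieval is structural: it happens
// before dispatch, never contingent on the model choosing to search.
function findingsForTask(all: Finding[], taskTopics: string[]): Finding[] {
  return all.filter((f) => taskTopics.includes(f.topic));
}
```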
6. Codebase intelligence integration
GSD currently has no structural understanding of the codebase beyond what the agent discovers per-session via grep/file reads. For complex projects, this means every planning session re-discovers the same architecture from scratch. Integrating with a graph-based codebase tool (GrapeRoot, Axon, or codebase-memory-mcp via MCP) would let the planning session query "what modules depend on the auth system?" or "what would be affected by changing the User type?" without burning tokens on manual exploration.
This could be as simple as: if a codebase intelligence MCP server is available, the discussion and planning flows query it for structural context. If not, fall back to the existing grep/read approach. No hard dependency.
Scope: Optional MCP integration in the discussion and planning prompts. Extension-level.
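The soft-dependency behavior can be sketched in a few lines. Here `GraphQuery` stands in for a codebase-intelligence MCP call; nothing in this sketch is GSD's or any MCP server's real API:

```typescript
// A structural query ("what calls login()?") answered by a graph tool,
// or null when the tool has no answer.
type GraphQuery = (question: string) => string[] | null;

// Prefer the graph tool when one is configured; otherwise fall back to the
// existing grep/read discovery path. No hard dependency on any MCP server.
function structuralContext(
  question: string,
  queryGraph: GraphQuery | undefined,
  grepFallback: (q: string) => string[]
): string[] {
  if (queryGraph) {
    const hit = queryGraph(question);
    if (hit !== null) return hit; // graph tool answered
  }
  return grepFallback(question);
}
```

The fallback keeps greenfield and tool-less setups working unchanged while letting brownfield projects with a graph server skip the per-session re-discovery cost.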
What I'm asking for
I'm happy to write ADRs for anything that needs architectural review, and I'll follow the contribution guide (extension-first, simplicity over abstraction, tests for everything). I want to make sure I'm building what the project actually wants before I start writing code.
I have a detailed research synthesis document covering all the papers cited here, the specific GSD impact analysis, risk assessment with mitigations, and a phased implementation plan. Happy to share as a Gist if anyone wants the full picture.