4 changes: 2 additions & 2 deletions docs/ADR-004-capability-aware-model-routing.md
@@ -1,8 +1,8 @@
# ADR-004: Capability-Aware Model Routing

-**Status:** Proposed (Revised)
+**Status:** Implemented (Phase 2)
 **Date:** 2026-03-26
-**Revised:** 2026-03-26
+**Revised:** 2026-04-03
**Deciders:** Jeremy McSpadden
**Related:** ADR-003 (pipeline simplification), [Issue #2655](https://github.com/gsd-build/gsd-2/issues/2655), `docs/dynamic-model-routing.md`

13 changes: 13 additions & 0 deletions docs/configuration.md
@@ -686,6 +686,7 @@ Complexity-based model routing. See [Dynamic Model Routing](./dynamic-model-rout
```yaml
dynamic_routing:
  enabled: true
  capability_routing: true  # score models by task capability (v2.59)
  tier_models:
    light: claude-haiku-4-5
    standard: claude-sonnet-4-6
@@ -695,6 +696,18 @@
  cross_provider: true
```

### `context_management` (v2.59)

Controls observation masking and tool result truncation during auto-mode sessions. Reduces context bloat between compactions with zero LLM overhead.

```yaml
context_management:
  observation_masking: true            # replace old tool results with placeholders (default: true)
  observation_mask_turns: 8            # keep results from last N user turns (1-50, default: 8)
  compaction_threshold_percent: 0.70   # target compaction at 70% context usage (0.5-0.95, default: 0.70)
  tool_result_max_chars: 800           # cap individual tool result content (200-10000, default: 800)
```

### `service_tier` (v2.42)

OpenAI service tier preference for supported models. Toggle with `/gsd fast`.
30 changes: 30 additions & 0 deletions docs/dynamic-model-routing.md
@@ -70,6 +70,36 @@ When approaching the budget ceiling, the router progressively downgrades:

When enabled, the router may select models from providers other than your primary. This uses the built-in cost table to find the cheapest model at each tier. Requires the target provider to be configured.

## Capability-Aware Scoring

*Introduced in v2.59.0 (ADR-004 Phase 2)*

When `capability_routing` is enabled, the router goes beyond tier classification and scores models against task-specific capability requirements. Each known model has a 7-dimension profile:

| Dimension | What It Measures |
|-----------|-----------------|
| `coding` | Code generation, refactoring, implementation quality |
| `debugging` | Error diagnosis, fix accuracy |
| `research` | Information gathering, codebase exploration |
| `reasoning` | Multi-step logic, architectural decisions |
| `speed` | Response latency; cheaper, lighter models generally score higher |
| `longContext` | Performance with large context windows |
| `instruction` | Adherence to structured instructions and templates |

Each unit type maps to a weighted requirement vector. For example, `execute-task` weights `coding: 0.9, reasoning: 0.6, debugging: 0.5` while `research-slice` weights `research: 0.9, reasoning: 0.7, longContext: 0.5`.

For `execute-task` units, the classifier also inspects task metadata (tags, description) to refine requirements. Documentation tasks boost `instruction` and lower `coding`; test tasks boost `debugging`.
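Ranking then reduces to a weighted dot product between a model's capability profile and the unit's requirement vector. A minimal sketch, assuming plain record types and illustrative profile values (not pi's actual numbers):

```typescript
// Minimal sketch of capability scoring; not pi's implementation.
// Profile and requirement values below are illustrative only.
type CapabilityVector = Record<string, number>;

function capabilityScore(profile: CapabilityVector, requirements: CapabilityVector): number {
  let score = 0;
  for (const [dim, weight] of Object.entries(requirements)) {
    score += weight * (profile[dim] ?? 0); // unknown dimensions contribute nothing
  }
  return score;
}

const sonnetProfile: CapabilityVector = { coding: 0.9, reasoning: 0.85, debugging: 0.8 };
const executeTaskRequirements: CapabilityVector = { coding: 0.9, reasoning: 0.6, debugging: 0.5 };

// Candidates within the target tier are ranked by this score, highest first.
const score = capabilityScore(sonnetProfile, executeTaskRequirements);
```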

Enable capability routing:

```yaml
dynamic_routing:
enabled: true
capability_routing: true
```

When enabled, models within the target tier are ranked by capability score rather than selected arbitrarily. When disabled (the default), the existing tier-only selection applies.

## Complexity Classification

Units are classified using pure heuristics — no LLM calls, sub-millisecond:
198 changes: 198 additions & 0 deletions docs/pi-context-optimization-opportunities.md
@@ -0,0 +1,198 @@
# pi-coding-agent: Context Optimization Opportunities

> **Status**: Research only — not planned for implementation.
> Scope: `packages/pi-coding-agent` and `packages/pi-agent-core` infrastructure.
> These changes would benefit every consumer of the pi engine, not just GSD.

---

## 1. Prompt Caching (`cache_control`) — Highest Impact

**Current state**: Every LLM call re-pays full input token cost for the system prompt, tool definitions, and context files. No `cache_control` breakpoints are set anywhere in the API call path.

**Opportunity**: Anthropic's KV cache delivers 90% cost reduction on cached tokens (0.1x input rate). Claude Code achieves 92–98% cache hit rates by placing stable content before volatile content.

**Where to instrument** (`packages/pi-ai/src/providers/anthropic.ts`):
- Set `cache_control: { type: "ephemeral" }` on the last tool definition block
- Set `cache_control` after the static system prompt sections (base boilerplate + context files)
- Leave the per-turn user message uncached

**Critical constraint**: The cache breakpoint must be placed *after* all static content and *before* any dynamic content (timestamps, per-request variables). Moving a timestamp before a cache breakpoint defeats it on every call.

**Cache hierarchy**: Tools → system → messages. Changing a tool definition invalidates system and message caches. Tool definitions should be sorted deterministically (alphabetically) to prevent spurious cache misses.

**Expected savings**: 80–90% reduction in input token cost for multi-turn sessions (the dominant cost pattern in GSD auto-mode).
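A sketch of where the breakpoints would sit in a Messages API request body. The request shape follows Anthropic's public API; the model name, tool definition, and prompt text are placeholders:

```typescript
// Breakpoint placement per the constraints above: stable content first,
// cache_control on the last stable block of each section, dynamic content last.
const request = {
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  tools: [
    // ...other tool definitions, sorted deterministically (alphabetically)...
    {
      name: "read_file",
      description: "Read a file from disk",
      input_schema: { type: "object", properties: { path: { type: "string" } } },
      // Breakpoint 1: caches all tool definitions up to this point.
      cache_control: { type: "ephemeral" },
    },
  ],
  system: [
    {
      type: "text",
      text: "...base boilerplate + context files (static)...",
      // Breakpoint 2: after static sections, before anything dynamic.
      cache_control: { type: "ephemeral" },
    },
    // Dynamic content stays after the breakpoint, uncached.
    { type: "text", text: `Current time: ${new Date().toISOString()}` },
  ],
  messages: [{ role: "user", content: "..." }], // per-turn, uncached
};
```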

---

## 2. Observation Masking in the Message Pipeline

**Current state**: `agent-loop.ts` passes the full `context.messages` array to the LLM on every turn. Tool results from 50 turns ago are re-read in full on every subsequent call. The `transformContext` hook exists on `AgentContext` and fires before every LLM call, but has no default implementation — extensions are responsible for any pruning.

**Opportunity**: Replace old tool result content with lightweight placeholders after N turns. JetBrains Research tested this on SWE-bench Verified (500 tasks, up to 250-turn trajectories) and found:
- 50%+ cost reduction vs. unmanaged history
- Performance matched or slightly exceeded LLM summarization
- Zero overhead (no extra LLM call required)

**Proposed implementation** (default `transformContext` in `pi-agent-core`):
```typescript
// Keep last KEEP_RECENT_TURNS verbatim; mask older tool results
const KEEP_RECENT_TURNS = 8;

// Index of the first message inside the most recent N user turns
// (0 when the history is shorter than N turns).
function findTurnBoundary(messages: AgentMessage[], keepTurns: number): number {
  let turns = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].type === "user" && ++turns === keepTurns) return i;
  }
  return 0;
}

function defaultObservationMask(messages: AgentMessage[]): AgentMessage[] {
  const cutoff = findTurnBoundary(messages, KEEP_RECENT_TURNS);
  return messages.map((m, i) => {
    if (i >= cutoff) return m;
    if (m.type === "toolResult" || m.type === "bashExecution") {
      return { ...m, content: "[result masked — within summarized history]", excludeFromContext: false };
    }
    return m;
  });
}
```

**Compaction interaction**: Observation masking reduces the token accumulation rate, pushing the compaction threshold further out. The two mechanisms are complementary — masking handles the steady state, compaction handles the rare deep-session case.

---

## 3. Earlier Compaction Threshold

**Current state** (`packages/pi-coding-agent/src/core/constants.ts`):
```typescript
COMPACTION_RESERVE_TOKENS = 16_384 // triggers at contextWindow - 16K
COMPACTION_KEEP_RECENT_TOKENS = 20_000
```

For a 200K context window, compaction fires at ~183K tokens — 91.5% utilization.

**Problem**: Context drift (not raw exhaustion) causes ~65% of enterprise agent failures. Performance degrades measurably beyond ~30K tokens per Zylos production data. The current threshold lets sessions run degraded for a long stretch before compaction fires.

**Opportunity**: Lower the trigger to 70% utilization. For a 200K window, this means compacting at ~140K tokens — 43K tokens earlier.

```typescript
// Proposed
COMPACTION_THRESHOLD_PERCENT = 0.70 // fire at 70% of contextWindow
COMPACTION_RESERVE_TOKENS = contextWindow * (1 - COMPACTION_THRESHOLD_PERCENT)
```

**Trade-off**: More frequent compactions, each happening earlier when there's more "fresh" content to keep. Summary quality improves because less material needs to be discarded at each cut.
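The 200K arithmetic from above, worked through the proposed formula:

```typescript
// Worked example of the proposed threshold for a 200K context window.
const contextWindow = 200_000;
const COMPACTION_THRESHOLD_PERCENT = 0.70;

const reserveTokens = contextWindow * (1 - COMPACTION_THRESHOLD_PERCENT); // ~60K reserve
const triggerAt = contextWindow - reserveTokens; // ~140K, vs ~183K today
```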

---

## 4. Tool Result Truncation at Write Time

**Current state**: `TOOL_RESULT_MAX_CHARS = 2_000` in `constants.ts`, but this limit is only applied *during compaction summarization*, not when the tool result enters the message store. A bash result returning 50KB of log output is stored and re-sent verbatim until compaction fires.

**Opportunity**: Truncate at write time in `messages.ts` → `convertToLlm()` or in the tool result handler. Two strategies:

- **Hard truncation**: Slice at N chars, append `"\n[truncated — {original_length} chars]"`. Simple, zero overhead.
- **Semantic head/tail**: Keep first 500 chars (context, command echo) + last 1000 chars (final output, errors). Better for bash results where the end contains the error.

**Recommendation**: Semantic head/tail as the default, configurable per tool type. File read results benefit from head; bash/test output benefits from head+tail.
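A minimal sketch of the head/tail strategy. The 500/1000 budgets mirror the numbers above and would be configurable per tool type:

```typescript
// Semantic head/tail truncation: keep the start (context, command echo)
// and the end (final output, errors) of oversized tool results.
const HEAD_CHARS = 500;
const TAIL_CHARS = 1000;

function truncateToolResult(content: string): string {
  if (content.length <= HEAD_CHARS + TAIL_CHARS) return content;
  const head = content.slice(0, HEAD_CHARS);
  const tail = content.slice(-TAIL_CHARS);
  return `${head}\n[truncated: ${content.length} chars total]\n${tail}`;
}
```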

---

## 5. Context File Deduplication and Trim

**Current state** (`packages/pi-coding-agent/src/core/resource-loader.ts`, lines 84–109):
- Searches from `~/.gsd/agent/` → ancestor dirs → cwd
- Deduplicates by *file path* but not by *content*
- Entire file content concatenated verbatim into system prompt — no trimming, no summarization

**Anti-pattern**: A project with AGENTS.md at 3 ancestor levels (repo root, workspace, home) injects all three in full. If they share common boilerplate, that content is re-injected multiple times.

**Opportunities**:
1. **Content deduplication**: Hash paragraph-level chunks; skip any chunk already seen in a previously-loaded file
2. **Section-aware loading**: Parse `## ` headings in AGENTS.md; only include sections relevant to the current task type (e.g., `## Testing` section only when running tests)
3. **Token budget enforcement**: If total context files exceed N tokens, summarize oldest/most-distant file rather than including verbatim
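Opportunity 1 could look roughly like this. Splitting on blank lines is an assumption; pi might need smarter chunking:

```typescript
import { createHash } from "node:crypto";

// Paragraph-level content deduplication across context files: hash each
// blank-line-delimited chunk and skip any chunk already seen.
function dedupeContextFiles(fileContents: string[]): string {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const content of fileContents) {
    for (const chunk of content.split(/\n{2,}/)) {
      const normalized = chunk.trim();
      if (!normalized) continue;
      const hash = createHash("sha256").update(normalized).digest("hex");
      if (seen.has(hash)) continue;
      seen.add(hash);
      kept.push(normalized);
    }
  }
  return kept.join("\n\n");
}
```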

---

## 6. Skill Content Lazy Loading and Summarization

**Current state**: When `/skill:name` is invoked, the full skill file content is injected inline as `<skill>...</skill>` in the user message. No chunking, no summarization. A 10KB skill file adds ~2,500 tokens to that turn.

**Opportunity**:
- **Cached skill injection**: If the same skill is used across multiple turns (rare but possible), it's re-injected each time. Cache with `cache_control` after first injection.
- **Skill digest mode**: Inject a 200-token summary of the skill on first reference; full content only if the model requests it via a `get_skill_detail` tool call. Reduces cost for skills that don't end up being followed.
- **Skill prefetching**: Before a known long session (e.g., auto-mode start), pre-inject all likely skills with `cache_control` so they're cached for the entire session.
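Digest mode implies an escape-hatch tool. A hypothetical definition; `get_skill_detail` does not exist in pi today:

```typescript
// Hypothetical tool for digest mode: the model calls this only when the
// 200-token digest is insufficient. Schema follows the common JSON
// Schema tool-definition shape; name and fields are assumptions.
const getSkillDetailTool = {
  name: "get_skill_detail",
  description: "Fetch the full content of a skill previously injected as a short digest.",
  input_schema: {
    type: "object",
    properties: {
      skill: { type: "string", description: "Skill name as referenced, e.g. /skill:name" },
    },
    required: ["skill"],
  },
};
```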

---

## 7. Token Estimation Accuracy

**Current state** (`compaction.ts`, line 216): `chars / 4` heuristic. At ~3.5 chars/token for English prose, this underestimates the true count, and it drifts further for code with short identifiers or Unicode-heavy text.

**Opportunity**: Use a proper tokenizer.
- `@anthropic-ai/tokenizer` (tiktoken-compatible, ships with the SDK) — accurate but ~5ms per call
- Tiered approach: use chars/4 for display; use proper tokenizer only for compaction threshold decisions (where accuracy matters)

**Impact**: More accurate compaction timing, fewer unnecessary compactions, slightly better `COMPACTION_KEEP_RECENT_TOKENS` boundary placement.
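The tiered approach could be wired as a cheap pre-filter around the accurate path. `countTokens` below stands in for whichever real tokenizer is chosen:

```typescript
// Tiered estimation sketch: chars/4 stays as the cheap path for display,
// the real tokenizer is paid for only near the compaction threshold.
function estimateTokensFast(text: string): number {
  return Math.ceil(text.length / 4);
}

function shouldCompact(
  text: string,
  contextWindow: number,
  countTokens: (s: string) => number, // placeholder for a real tokenizer
): boolean {
  // Well below the threshold band? Skip the ~5ms tokenizer call entirely.
  if (estimateTokensFast(text) < contextWindow * 0.5) return false;
  // Near the threshold, use the accurate count for the actual decision.
  return countTokens(text) >= contextWindow * 0.7;
}
```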

---

## 8. Format: Markdown over XML for Internal Context

**Current state**: The message pipeline uses `<skill>`, `<summary>`, `<compaction>` XML wrappers in several places. System prompt sections are largely prose Markdown.

**Findings**: XML tags carry 15–40% more tokens than equivalent Markdown for the same semantic content, due to paired open/close tags. However, Claude was optimized for XML and shows higher accuracy on tasks requiring precise section parsing.

**Recommendation**: Audit XML usage in the pipeline and convert to Markdown where the content is:
- Non-nested (flat instructions, status messages)
- Human-readable rather than machine-parsed by the model
- Not requiring precise boundary detection

Keep XML for: few-shot examples with ambiguous boundaries, skill content (requires precise isolation from surrounding text), compaction summaries that the model must treat as authoritative history.

**Estimated savings**: 5–15% reduction in system prompt token count.

---

## 9. Dynamic Tool Set Delivery

**Current state**: All tool definitions are included in every LLM request. Tool descriptions consume 60–80% of input tokens in static configurations. As new extensions register tools, the baseline grows linearly.

**Opportunity** (higher complexity): Implement the three-function Dynamic Toolset pattern:
1. `search_tools(query)` — semantic search over tool catalog
2. `describe_tools(ids[])` — fetch full schemas on demand
3. `execute_tool(id, params)` — unchanged execution

Speakeasy measured 91–97% token reduction with 100% task success rate. Trade-off: 2–3x more tool calls, ~50% longer wall time. Net cost dramatically lower.

**Feasibility for pi**: The tool registry (`packages/pi-coding-agent/src/core/tool-registry.ts`) already stores tool metadata separately from definitions. The primary engineering work is the semantic search index and the `describe_tools` / `search_tools` tool implementations.
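A toy in-memory version of the three-function pattern, with substring matching standing in for semantic search. Names and signatures are assumptions, not pi APIs:

```typescript
interface ToolEntry {
  id: string;
  blurb: string;          // one-line summary returned by search
  inputSchema: object;    // full schema, fetched only on demand
  run: (params: object) => unknown;
}

class DynamicToolset {
  constructor(private catalog: ToolEntry[]) {}

  // 1. search_tools: substring match stands in for semantic search
  searchTools(query: string): { id: string; blurb: string }[] {
    const q = query.toLowerCase();
    return this.catalog
      .filter((t) => t.blurb.toLowerCase().includes(q))
      .map(({ id, blurb }) => ({ id, blurb }));
  }

  // 2. describe_tools: full schemas on demand
  describeTools(ids: string[]): ToolEntry[] {
    return this.catalog.filter((t) => ids.includes(t.id));
  }

  // 3. execute_tool: unchanged execution path
  executeTool(id: string, params: object): unknown {
    const tool = this.catalog.find((t) => t.id === id);
    if (!tool) throw new Error(`unknown tool: ${id}`);
    return tool.run(params);
  }
}
```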

---

## 10. Cost Attribution and Per-Phase Reporting

**Current state**: `SessionManager.getUsageTotals()` accumulates cost across the entire session. No per-phase or per-agent breakdown is stored. Cost visibility is limited to the footer total and `GSD_SHOW_TOKEN_COST=1` per-turn display.

**Opportunity**: Emit structured cost events that extensions can subscribe to:
```typescript
interface CostCheckpointEvent {
  type: "cost_checkpoint";
  label: string;           // "discuss-phase", "execute-slice-3"
  deltaTokens: Usage;      // tokens since last checkpoint
  cumulativeTokens: Usage;
  cumulativeCost: number;
}
```

GSD extension could consume these events to surface per-milestone cost in `/gsd stats` and flag milestones that are disproportionately expensive — enabling budget-aware planning.
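A hypothetical consumer of these events, flagging phases whose cost delta exceeds twice the running average. Only `label` and `cumulativeCost` are needed for this check:

```typescript
// Flag checkpoints whose cost delta exceeds 2x the running average of
// prior phases. Subscription mechanics and the 2x factor are assumptions.
function flagExpensivePhases(
  events: { label: string; cumulativeCost: number }[],
): string[] {
  const flagged: string[] = [];
  const deltas: number[] = [];
  let prevCost = 0;
  for (const e of events) {
    const delta = e.cumulativeCost - prevCost;
    prevCost = e.cumulativeCost;
    if (deltas.length > 0) {
      const avg = deltas.reduce((a, b) => a + b, 0) / deltas.length;
      if (delta > 2 * avg) flagged.push(e.label);
    }
    deltas.push(delta);
  }
  return flagged;
}
```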

---

## Implementation Ordering (if pursued)

| Priority | Item | Effort | Expected Impact |
|----------|------|--------|-----------------|
| 1 | Prompt caching (`cache_control`) | Low | 80–90% input cost reduction |
| 2 | Earlier compaction threshold (70%) | Trivial | Reduces drift in long sessions |
| 3 | Tool result truncation at write time | Low | Reduces context bloat between compactions |
| 4 | Context file deduplication | Medium | Variable — high for multi-level AGENTS.md setups |
| 5 | Observation masking (default `transformContext`) | Medium | 50%+ on long-running agents |
| 6 | Token estimation (proper tokenizer) | Low | Accuracy improvement, minor cost impact |
| 7 | Markdown over XML audit | Low | 5–15% system prompt reduction |
| 8 | Skill caching with `cache_control` | Low | Meaningful for skill-heavy sessions |
| 9 | Dynamic tool set delivery | High | 90%+ on large tool catalogs; major architecture change |
| 10 | Per-phase cost attribution events | Medium | Visibility only; enables future budget routing |
54 changes: 49 additions & 5 deletions docs/token-optimization.md
@@ -262,15 +262,59 @@ PREFERENCES.md
├─ resolveProfileDefaults() → model defaults + phase skip defaults
├─ resolveInlineLevel() → standard
│ └─ prompt builders gate context inclusion by level
-└─ classifyUnitComplexity() → routes to execution/execution_simple model
-   ├─ task plan analysis (steps, files, signals)
-   ├─ unit type defaults
-   ├─ budget pressure adjustment
-   └─ adaptive learning from routing-history.json
+├─ classifyUnitComplexity() → routes to execution/execution_simple model
+│  ├─ task plan analysis (steps, files, signals)
+│  ├─ unit type defaults
+│  ├─ budget pressure adjustment
+│  ├─ adaptive learning from routing-history.json
+│  └─ capability scoring (when capability_routing: true)
+│     └─ 7-dimension model profiles × task requirement vectors
+└─ context_management
+   ├─ observation masking (before_provider_request hook)
+   ├─ tool result truncation (tool_result_max_chars)
+   └─ phase handoff anchors (injected into prompt builders)
```

The profile is resolved once and flows through the entire dispatch pipeline. Explicit preferences override profile defaults at every layer.

## Observation Masking

*Introduced in v2.59.0*

During auto-mode sessions, tool results accumulate in the conversation history and consume context window space. Observation masking replaces tool result content older than N user turns with a lightweight placeholder before each LLM call. This reduces token usage with zero LLM overhead — no summarization calls, no latency.

Masking is enabled by default during auto-mode. Configure via preferences:

```yaml
context_management:
  observation_masking: true    # default: true (set false to disable)
  observation_mask_turns: 8    # keep results from last 8 user turns (range: 1-50)
  tool_result_max_chars: 800   # truncate individual tool results beyond this length
```

### How It Works

1. Before each provider request, the `before_provider_request` hook inspects the messages array
2. Tool results (`toolResult`, `bashExecution`) older than the configured turn threshold are replaced with `[result masked — within summarized history]`
3. Recent tool results (within the keep window) are preserved in full
4. All assistant and user messages are always preserved — only tool result content is masked

This pairs with the existing compaction system: masking reduces context pressure between compactions, and compaction handles the full context reset when the window fills.

### Tool Result Truncation

Individual tool results that exceed `tool_result_max_chars` (default: 800) are truncated with a `…[truncated]` marker. This prevents a single large tool output from dominating the context window.

## Phase Handoff Anchors

*Introduced in v2.59.0*

When auto-mode transitions between phases (research → planning → execution), structured JSON anchors are written to `.gsd/milestones/<mid>/anchors/<phase>.json`. Downstream prompt builders inject these anchors so the next phase inherits intent, decisions, blockers, and next steps without re-inferring from artifact files.

This reduces context drift — the 65% of enterprise agent failures caused by agents losing track of prior decisions across phase boundaries.

Anchors are written automatically after successful completion of `research-milestone`, `research-slice`, `plan-milestone`, and `plan-slice` units. No configuration needed.
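A guess at the anchor file's shape, inferred from the description above. Field names are assumptions, not the actual schema:

```typescript
// Hypothetical shape of .gsd/milestones/<mid>/anchors/<phase>.json,
// inferred from the prose above; field names are assumptions.
interface PhaseAnchor {
  phase: "research-milestone" | "research-slice" | "plan-milestone" | "plan-slice";
  intent: string;      // what this phase set out to do
  decisions: string[]; // choices downstream phases must respect
  blockers: string[];  // unresolved issues carried forward
  nextSteps: string[]; // handoff for the next phase's prompt builder
}

const sample: PhaseAnchor = {
  phase: "plan-slice",
  intent: "plan slice 3 of the routing milestone",
  decisions: ["use tier-only routing as fallback"],
  blockers: [],
  nextSteps: ["execute-task units for slice 3"],
};
```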

## Prompt Compression

*Introduced in v2.29.0*