
Project Horizon: Context Optimization Platform #17

@dean0x


Vision

Evolve skim from a streaming code reader into the definitive context optimization tool for AI coding agents. While skim already excels at source code transformation via tree-sitter AST parsing, there is a much larger surface area of token waste in AI-assisted development workflows that remains unaddressed — or addressed only by fragile regex-based approaches.

North Star: Every token an AI agent consumes should carry maximum information density. Skim should be the single tool that ensures this — for code reading, command output, and beyond.

Market Context

The token optimization space is heating up. Current approaches in the ecosystem rely on:

  • Regex-based text filtering (fragile, breaks on edge cases)
  • Heuristic chars/4 token estimation (25%+ inaccuracy)
  • Per-tool hardcoded filters with no composability
  • No caching, no parallelism, no multi-file support

Skim's moat: Tree-sitter AST parsing guarantees semantically correct output. When an AI agent receives incorrect code structure (broken by regex edge cases), it makes incorrect decisions. Correctness is not just a nice-to-have — it's the difference between a helpful agent and a hallucinating one.

Key Insight: Distribution Strategy

Modern AI coding agents already have intelligent built-in file reading (offsets, limits, selective reading). Hooking `cat` → `skim` adds marginal value. Instead:

  • Phase A: Make skim so capable that agents and humans choose to use it (novel features like token budgets, AST-aware truncation)
  • Phase B: Introduce hooks when we add command output compression — that's where agents have no built-in intelligence and transparent interception genuinely adds value

Phased Roadmap

Phase A: The Sharpest Code Reader

Make skim indispensable by capability, not by interception.

  • --mode=minimal — Strip comments/blank lines, keep bodies (15-30% savings)
  • --max-lines N — AST-aware smart truncation (cut at node boundaries, never mid-expression)
  • --tokens N — Token budget mode: fit maximum information into exactly N tokens (novel — no tool does this)
  • C/C++ and TOML language support (11-12 total languages)
  • Stdin/pipe improvements for agent workflows

Tracking: #18
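The token budget mode above is the novel piece. A minimal sketch of how such a fit could work, assuming a densest-first greedy fill over AST-derived sections — the section scoring, the chars/4 stand-in estimator, and all names here are illustrative, not skim's actual implementation (which would use tiktoken-accurate counts):

```rust
// Hypothetical sketch of a token-budget fit: rank code sections by
// information density and greedily keep the densest ones until the
// budget is spent, never splitting a section mid-node.

#[derive(Debug)]
struct Section {
    name: &'static str,
    text: &'static str,
    density: f64, // information per token, however the tool scores it
}

fn estimate_tokens(text: &str) -> usize {
    // chars/4 placeholder; skim's stated goal is tiktoken accuracy instead
    (text.len() + 3) / 4
}

fn fit_budget(mut sections: Vec<Section>, budget: usize) -> Vec<&'static str> {
    // densest-first greedy fill
    sections.sort_by(|a, b| b.density.partial_cmp(&a.density).unwrap());
    let mut used = 0;
    let mut kept = Vec::new();
    for s in sections {
        let cost = estimate_tokens(s.text);
        if used + cost <= budget {
            used += cost;
            kept.push(s.name);
        }
    }
    kept
}

fn main() {
    let sections = vec![
        Section { name: "signatures", text: "fn parse(src: &str) -> Ast;", density: 3.0 },
        Section { name: "doc_comments", text: "/// Parses source into an AST.", density: 1.5 },
        Section { name: "bodies", text: "{ let tokens = lex(src); build(tokens) }", density: 1.0 },
    ];
    for name in fit_budget(sections, 15) {
        println!("kept: {}", name);
    }
}
```

A real implementation would also have to respect node boundaries when a section itself exceeds the remaining budget, which is where the AST-aware truncation of `--max-lines` comes back in.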

Phase B: Command Output Optimization + Hooks

Expand beyond code files. This is where hooks become valuable.

  • skim init — Agent hook installation (rewrites test/git/build commands)
  • skim rewrite — Command rewriting engine
  • Test output compression (cargo test, pytest, vitest) via structured JSON parsing
  • Git output compression (status, diff, log)
  • Build output compression (cargo build, tsc, clippy)
  • Three-tier parse degradation: Structured → Regex fallback → Passthrough
  • Tee/recovery system for raw output preservation

Tracking: #19
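The three-tier degradation can be sketched as a simple fallback chain. The toy parsers below — a `{`-prefix check standing in for structured JSON parsing, a `contains("error")` line scan standing in for the regex fallback — are illustrative stand-ins, not skim's real filters:

```rust
// Hedged sketch of three-tier parse degradation:
// Structured -> Regex fallback -> Passthrough.

enum Tier {
    Structured,
    Fallback,
    Passthrough,
}

fn try_structured(raw: &str) -> Option<String> {
    // stand-in for parsing machine-readable output (e.g. cargo's JSON)
    if raw.starts_with('{') {
        Some(format!("[structured] {}", raw))
    } else {
        None
    }
}

fn try_fallback(raw: &str) -> Option<String> {
    // stand-in for a line-oriented heuristic scan
    raw.lines()
        .find(|l| l.contains("error"))
        .map(|l| format!("[fallback] {}", l))
}

fn compress(raw: &str) -> (Tier, String) {
    if let Some(out) = try_structured(raw) {
        (Tier::Structured, out)
    } else if let Some(out) = try_fallback(raw) {
        (Tier::Fallback, out)
    } else {
        // never silently corrupt: when nothing parses, pass raw output through
        (Tier::Passthrough, raw.to_string())
    }
}

fn main() {
    let (_, out) = compress("warning: unused\nerror: mismatched types");
    println!("{}", out);
}
```

The passthrough tier is what makes "fail loud" compatible with "never corrupt": an unrecognized format costs tokens but never loses information.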

Phase C: Token Analytics & Intelligence

Prove ROI. Let users see exactly how much context they're saving.

  • SQLite-based token tracking (persistent, cross-session)
  • skim stats dashboard — per-command savings, daily/weekly trends
  • skim discover — scan AI agent session history, find missed optimization opportunities
  • Accurate tiktoken-based measurement throughout
  • Per-project scoping for team analytics

Tracking: #20
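A rough sketch of the per-command bookkeeping that `skim stats` implies. The real store would be SQLite for cross-session persistence; an in-memory map stands in here to keep the example self-contained, and all names are hypothetical:

```rust
// Illustrative per-command token savings tracking.
use std::collections::HashMap;

#[derive(Default)]
struct TokenStats {
    // command name -> (tokens before skim, tokens after skim)
    per_command: HashMap<String, (u64, u64)>,
}

impl TokenStats {
    fn record(&mut self, command: &str, before: u64, after: u64) {
        let entry = self.per_command.entry(command.to_string()).or_insert((0, 0));
        entry.0 += before;
        entry.1 += after;
    }

    fn savings_pct(&self, command: &str) -> Option<f64> {
        self.per_command.get(command).map(|(before, after)| {
            100.0 * before.saturating_sub(*after) as f64 / *before as f64
        })
    }
}

fn main() {
    let mut stats = TokenStats::default();
    stats.record("cargo test", 12_000, 3_000);
    stats.record("cargo test", 8_000, 2_000);
    println!("{:?}", stats.savings_pct("cargo test")); // Some(75.0)
}
```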

Phase D: Platform & Ecosystem

Blue ocean features no existing tool offers.

  • Context budgeting across multiple files
  • Import graph analysis — serve only referenced symbols
  • Diff-aware context — changed regions + structural context
  • Plugin/filter trait for community-contributed command filters
  • Semantic deduplication across files

Tracking: #21

Disruptive Opportunities (Beyond Existing Approaches)

These are areas where no current tool operates:

  1. Token budget allocation (`skim file.rs --tokens 2000`) — fit the most information-dense view into exactly N tokens. Agents can request precisely what fits their remaining context window.
  2. Import graph pruning — AI agents read entire files when they only need specific exports. Skim could analyze import statements and serve only the referenced symbols.
  3. Diff-aware context windows — Instead of full files, serve git diff regions + structural context (function signatures containing the changes).
  4. Cross-file deduplication — When an agent reads 20 files, collapse similar patterns with references.
  5. Output format optimization — Research which format carries the most information per token for different LLM architectures.
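Import graph pruning (item 2) could be sketched as: parse the importing file's `use` line for referenced symbols, then serve only the matching items from the imported file. The simplified `use module::{a, b};` handling below is illustrative only — a real implementation would use tree-sitter, consistent with skim's AST-first approach:

```rust
// Illustrative sketch of import graph pruning (not an existing skim feature).
use std::collections::HashSet;

fn referenced_symbols(import_line: &str) -> HashSet<String> {
    // handles only the simplified form `use module::{a, b};`
    import_line
        .trim_start_matches("use ")
        .trim_end_matches(';')
        .split_once("::{")
        .map(|(_, inner)| {
            inner
                .trim_end_matches('}')
                .split(',')
                .map(|s| s.trim().to_string())
                .collect()
        })
        .unwrap_or_default()
}

fn prune(file_items: &[(&str, &str)], wanted: &HashSet<String>) -> Vec<String> {
    // keep only the items the importing file actually names
    file_items
        .iter()
        .filter(|(name, _)| wanted.contains(*name))
        .map(|(_, sig)| sig.to_string())
        .collect()
}

fn main() {
    let wanted = referenced_symbols("use parser::{parse, Ast};");
    let items = [
        ("parse", "pub fn parse(src: &str) -> Ast"),
        ("lex", "fn lex(src: &str) -> Vec<Token>"),
        ("Ast", "pub struct Ast { /* fields elided */ }"),
    ];
    for sig in prune(&items, &wanted) {
        println!("{}", sig);
    }
}
```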

Success Metrics

| Metric | Target |
| --- | --- |
| Token reduction (code reading) | 70-85% (structure), 15-30% (minimal) |
| Token reduction (command output) | 60-80% (Phase B) |
| Correctness | 100% — zero tolerance for broken output |
| Token budget accuracy | Within 5% of target |
| Startup overhead | <10ms per command |
| Supported languages | 11-12 (Phase A) |

Design Principles

  1. Correctness over coverage — Better to support 10 commands perfectly than 40 with regex hacks
  2. AST where it matters, structured parsing elsewhere — Use the right tool for each job
  3. Zero config — CLI flags only, no config files
  4. Streaming first — stdout always, never intermediate files
  5. Fail loud — Never silently corrupt output; degrade gracefully with clear markers
  6. Measure accurately — tiktoken, not chars/4
  7. Chosen, not forced — Build features so good agents choose to use skim, don't hook-inject it where agents already handle things well

This issue tracks the overall initiative. Individual phases are tracked in linked issues.
