Phase A: The Sharpest Code Reader — Minimal Mode, Token Budget, AST Truncation & New Languages #18

# Phase A: The Sharpest Code Reader

**Parent:** #17 (Project Horizon)
**Priority:** Highest
**Goal:** Make skim the most useful code reading tool an AI agent can reach for — not via hooks, but by being so good that agents and humans actively choose it.

## Strategic Rationale

Modern AI coding agents (Claude Code, Cursor, Windsurf) already have intelligent built-in file reading — they support offsets, limits, and selective reading. Hooking `cat` adds marginal value since agents rarely use raw `cat` as their primary file read path.

**The real opportunity:** features that no agent's built-in `Read` tool can replicate — AST-aware truncation, token budget optimization, and semantic code compression. These make skim the tool agents *choose* when they need to understand code efficiently.

Hooks become valuable in Phase B when we add command output compression (test runners, git, builds) — that's where agents have no built-in intelligence.

## Deliverables

### 1. ✅ `--mode=minimal` — Light-Touch Filtering (PR #22)
Fill the gap between `full` (no transformation) and `structure` (bodies stripped).

- [x] Strip single-line comments (preserve doc comments: `///`, `/**`, `#[doc`)
- [x] Strip block comments (preserve Python docstrings `"""`)
- [x] Normalize excessive blank lines (3+ consecutive → 2)
- [x] Keep all function bodies intact
- [x] Language-aware comment pattern detection via tree-sitter node types
- [x] Expected savings: 15-30% token reduction on typical source files
- [x] Default mode when minimal context cleanup is needed
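
The filters above can be approximated without a parser. The following is a minimal sketch, not the shipped implementation: it assumes Rust-like source and works line by line, whereas the deliverable relies on tree-sitter comment nodes. The function name `minimal_filter` is hypothetical, and the sketch ignores block comments and trailing `//` comments.

```rust
/// Line-based approximation of the minimal-mode filters.
/// Assumption: the real tool walks tree-sitter comment nodes; this sketch
/// only handles whole-line `//` comments and blank-line runs.
fn minimal_filter(source: &str) -> String {
    let mut out: Vec<&str> = Vec::new();
    let mut blank_run = 0usize;

    for line in source.lines() {
        let trimmed = line.trim_start();

        // Drop plain `//` comments, but keep `///` doc comments.
        if trimmed.starts_with("//") && !trimmed.starts_with("///") {
            continue;
        }

        if trimmed.is_empty() {
            blank_run += 1;
            // Collapse runs of 3+ blank lines down to 2.
            if blank_run > 2 {
                continue;
            }
        } else {
            blank_run = 0;
        }
        out.push(line);
    }

    // Note: a trailing newline in the input is not preserved here.
    out.join("\n")
}
```

Function bodies pass through untouched, which matches the intent of keeping `minimal` strictly lighter than `structure`.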

### 2. `--max-lines N` — AST-Aware Smart Truncation
**Unique differentiator.** No other tool truncates at AST boundaries.

- [ ] Given a line budget, select the most informative lines
- [ ] Priority hierarchy: type definitions > function signatures > imports > structural braces > body lines
- [ ] Cut at AST node boundaries — never mid-expression, never mid-function-signature
- [ ] Insert `// ... N lines omitted` markers at truncation points
- [ ] Composable with any mode: `skim file.rs --mode=structure --max-lines 50`
- [ ] Handles files shorter than N lines gracefully (no-op)
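
One way to read the selection rules above is as a greedy pass over AST-aligned units. The sketch below is an illustration under assumptions: `Kind`, `Unit`, and `truncate` are hypothetical names, the units are presumed to come from a parser that has already split the file at node boundaries, and omission markers are not counted against the line budget.

```rust
/// Unit kinds, declared in priority order (derived `Ord` follows this order).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Kind {
    TypeDef, // highest priority
    Signature,
    Import,
    Brace,
    Body, // lowest priority
}

/// A contiguous run of lines covering whole AST nodes (hypothetical type).
struct Unit {
    kind: Kind,
    lines: Vec<String>,
}

fn truncate(units: &[Unit], max_lines: usize) -> Vec<String> {
    let total: usize = units.iter().map(|u| u.lines.len()).sum();
    if total <= max_lines {
        // File already fits: no-op.
        return units.iter().flat_map(|u| u.lines.iter().cloned()).collect();
    }

    // Visit units by priority; the stable sort keeps source order within a kind.
    let mut order: Vec<usize> = (0..units.len()).collect();
    order.sort_by_key(|&i| units[i].kind);

    let mut keep = vec![false; units.len()];
    let mut used = 0;
    for i in order {
        let len = units[i].lines.len();
        if used + len <= max_lines {
            keep[i] = true;
            used += len;
        }
    }

    // Emit in source order, collapsing each skipped run into one marker.
    let mut out = Vec::new();
    let mut omitted = 0;
    for (i, unit) in units.iter().enumerate() {
        if keep[i] {
            if omitted > 0 {
                out.push(format!("// ... {omitted} lines omitted"));
                omitted = 0;
            }
            out.extend(unit.lines.iter().cloned());
        } else {
            omitted += unit.lines.len();
        }
    }
    if omitted > 0 {
        out.push(format!("// ... {omitted} lines omitted"));
    }
    out
}
```

Because whole units are kept or dropped, a cut can never land mid-expression or mid-signature. Whether markers should count toward `N` is an open design question this sketch does not settle.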

### 3. `--tokens N` — Token Budget Mode
**Novel capability — no existing tool does this.** Pull forward from Phase D.

- [ ] `skim file.rs --tokens 2000`: fit the most informative view into at most N tokens
- [ ] Uses tiktoken (cl100k_base) for accurate token measurement
- [ ] Algorithm: start with structure mode, progressively add detail until budget is reached
  - Step 1: All type definitions + function signatures (minimum useful view)
  - Step 2: Add import statements
  - Step 3: Add doc comments on public items
  - Step 4: Add short function bodies (< 5 lines)
  - Step 5: Add remaining bodies by size (smallest first)
- [ ] Output always within budget (never exceeds N; may fall short by <5%)
- [ ] Use case: agent says "I have 4K tokens of context budget, give me what fits"
- [ ] Works with multi-file: `skim 'src/**/*.rs' --tokens 8000` — budget across all files
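
The five steps above amount to a greedy fill. Here is a rough sketch under stated assumptions: `Piece`, `count_tokens`, and `fit_to_budget` are hypothetical names, and the token counter is a whitespace-splitting stand-in rather than tiktoken's cl100k_base encoding, so the counts only approximate what the real tool would measure. The sketch also skips a piece that does not fit and keeps trying later ones; stopping at the first miss would be an equally plausible reading of the algorithm.

```rust
/// Stand-in token counter. Assumption: the real tool would count with
/// tiktoken (cl100k_base); whitespace splitting is only an approximation.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// A candidate piece of output, tagged with the step (1..=5) that adds it.
struct Piece {
    step: u8,
    text: String,
}

/// Greedily add pieces step by step while the total stays within `budget`.
/// Returns the indices of the chosen pieces in source order.
fn fit_to_budget(pieces: &[Piece], budget: usize) -> Vec<usize> {
    let mut order: Vec<usize> = (0..pieces.len()).collect();
    // Earlier steps first; within a step, smaller pieces first.
    order.sort_by_key(|&i| (pieces[i].step, count_tokens(&pieces[i].text)));

    let mut chosen = Vec::new();
    let mut used = 0;
    for i in order {
        let cost = count_tokens(&pieces[i].text);
        if used + cost <= budget {
            chosen.push(i);
            used += cost;
        }
    }

    // Restore source order for output.
    chosen.sort_unstable();
    chosen
}
```

With a real BPE tokenizer, per-piece counts do not always sum to the count of the joined text, so a final re-count of the assembled output would be needed to guarantee the budget is never exceeded.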

### 4. New Language Support — C/C++ and TOML
More languages = more useful. Low effort, high visibility.

- [ ] C/C++ support via tree-sitter-c and tree-sitter-cpp (#10)
- [ ] TOML support: evaluate tree-sitter-toml vs. a serde-based parser such as the `toml` crate (#8)
- [ ] Extensions: `.c`, `.h`, `.cpp`, `.cc`, `.cxx`, `.hpp`, `.hxx`, `.toml`
- [ ] Full test fixtures for each (minimum 4 per language per CLAUDE.md)
- [ ] Update README language table
- [ ] Brings total to 11-12 supported languages (good for launch optics)
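
The extension list above could map to languages roughly as follows. This is a sketch only: `detect_language` and the returned language labels are hypothetical, and treating `.h` as C rather than C++ is an assumption the issue does not settle.

```rust
use std::path::Path;

/// Map a file path to a language label based on its extension.
/// Assumption: `.h` is treated as C; a real implementation may need a
/// heuristic or an explicit override for C++ headers.
fn detect_language(path: &str) -> Option<&'static str> {
    let ext = Path::new(path).extension()?.to_str()?;
    match ext {
        "c" | "h" => Some("c"),
        "cpp" | "cc" | "cxx" | "hpp" | "hxx" => Some("cpp"),
        "toml" => Some("toml"),
        _ => None,
    }
}
```

Returning `None` for unknown extensions leaves room for a caller to fall back to an explicit language flag.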

### 5. Stdin & Pipe Improvements
Agents pipe things. Skim should be the best pipe-friendly code compressor.

- [ ] `git diff | skim --lang rust --mode=structure` — works perfectly
- [ ] `git show HEAD:file.rs | skim --mode=signatures` — language auto-detected from filename hint
- [ ] `--lang` flag for explicit language override when piping (no file extension to detect from)
- [ ] Ensure `BufWriter` streaming works correctly for large piped inputs
- [ ] No regression on existing stdin support
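
For piped input, the pieces above fit together roughly like this. The sketch makes several assumptions: `resolve_language`, `read_source`, and `write_output` are hypothetical helpers, the whole input is read before processing (a parser generally needs the full source), and only the output side is streamed through `BufWriter`.

```rust
use std::io::{self, BufWriter, Read, Write};

/// Pick the language for piped input. An explicit `--lang` value wins,
/// since stdin carries no file extension to inspect.
fn resolve_language<'a>(lang_flag: Option<&'a str>, detected: Option<&'a str>) -> Option<&'a str> {
    lang_flag.or(detected)
}

/// Read the entire piped source into memory before transforming it.
fn read_source<R: Read>(mut input: R) -> io::Result<String> {
    let mut source = String::new();
    input.read_to_string(&mut source)?;
    Ok(source)
}

/// Write output lines through a `BufWriter` and return how many were written.
fn write_output<W: Write>(sink: W, lines: impl IntoIterator<Item = String>) -> io::Result<usize> {
    let mut writer = BufWriter::new(sink);
    let mut written = 0;
    for line in lines {
        writer.write_all(line.as_bytes())?;
        writer.write_all(b"\n")?;
        written += 1;
    }
    // Flush explicitly: errors during the implicit flush on drop are ignored.
    writer.flush()?;
    Ok(written)
}
```

In a binary, `read_source(io::stdin().lock())` and `write_output(io::stdout().lock(), lines)` would connect these helpers to the actual pipes.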

## Sequence

```
1. --mode=minimal              ← Foundation (needed by other features)
2. --max-lines N               ← Quick differentiator
3. --tokens N                  ← Novel capability, killer feature
4. C/C++ + TOML languages      ← Parallelizable, independent work
5. Stdin/pipe improvements     ← Polish
```

## Acceptance Criteria

- [ ] All existing 151 tests still pass
- [ ] New tests for minimal mode, max-lines, token budget, and new languages
- [ ] `--mode=minimal` achieves 15-30% reduction on test fixtures
- [ ] `--max-lines` never cuts mid-AST-node
- [ ] `--tokens` output is always within budget (verified by tiktoken count)
- [ ] C/C++ and TOML have ≥4 test fixtures each
- [ ] Performance: all new features maintain <50ms for 1000-line files
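
The reduction criterion can be checked with a small helper. This is a sketch with hypothetical names (`reduction_percent`, `within_target`); the token counts are assumed to come from whatever tokenizer the test suite uses.

```rust
/// Percentage by which `after` is smaller than `before`.
/// Returns 0.0 for empty input to avoid dividing by zero.
fn reduction_percent(before: usize, after: usize) -> f64 {
    if before == 0 {
        return 0.0;
    }
    (before as f64 - after as f64) * 100.0 / before as f64
}

/// True when the reduction falls inside the 15-30% target band.
fn within_target(before: usize, after: usize) -> bool {
    let pct = reduction_percent(before, after);
    (15.0..=30.0).contains(&pct)
}
```

A fixture test could then assert `within_target(tokens_before, tokens_after)` for each file.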

## Out of Scope (Moved to Phase B)

- ~~`skim init` hook installation~~ → Phase B (hooks are valuable when we have command output compression)
- ~~`skim rewrite` command engine~~ → Phase B
- ~~Multi-agent hook support~~ → Phase B

---

*Phase A focuses on making skim indispensable by capability, not by interception.*

