Phase A: The Sharpest Code Reader
Parent: #17 (Project Horizon)
Priority: Highest
Goal: Make skim the most useful code reading tool an AI agent can reach for — not via hooks, but by being so good that agents and humans actively choose it.
Strategic Rationale
Modern AI coding agents (Claude Code, Cursor, Windsurf) already have intelligent built-in file reading — they support offsets, limits, and selective reading. Hooking cat adds marginal value since agents rarely use raw cat as their primary file read path.
The real opportunity: features that no agent's built-in Read tool can replicate — AST-aware truncation, token budget optimization, and semantic code compression. These make skim the tool agents choose when they need to understand code efficiently.
Hooks become valuable in Phase B when we add command output compression (test runners, git, builds) — that's where agents have no built-in intelligence.
Deliverables
1. ✅ --mode=minimal — Light-Touch Filtering (PR #22)
Fill the gap between full (no transformation) and structure (bodies stripped).
- Strip single-line comments (preserve doc comments: `///`, `/**`, `#[doc`)
- Strip block comments (preserve Python docstrings `"""`)
- Normalize excessive blank lines (3+ consecutive → 2)
- Keep all function bodies intact
- Language-aware comment pattern detection via tree-sitter node types
- Expected savings: 15-30% token reduction on typical source files
- Default mode when minimal context cleanup is needed
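A minimal sketch of the filtering rules above, under loud assumptions: `minimal_filter` is a hypothetical name, and it works line by line on Rust-style sources rather than on tree-sitter node types as the deliverable specifies. Only whole-line `//` comments are handled, so trailing comments and block comments pass through untouched. `//!` is also preserved here because it is Rust's inner doc comment form.

```rust
/// Hypothetical line-based approximation of the "minimal" filter.
/// Not the tree-sitter implementation described in the issue.
fn minimal_filter(source: &str) -> String {
    let mut out: Vec<&str> = Vec::new();
    let mut blank_run = 0usize;

    for line in source.lines() {
        let trimmed = line.trim_start();

        // Drop whole-line `//` comments, but keep doc comments (`///`, `//!`).
        let is_plain_comment = trimmed.starts_with("//")
            && !trimmed.starts_with("///")
            && !trimmed.starts_with("//!");
        if is_plain_comment {
            continue;
        }

        // Collapse runs of 3+ blank lines down to 2.
        if trimmed.is_empty() {
            blank_run += 1;
            if blank_run > 2 {
                continue;
            }
        } else {
            blank_run = 0;
        }
        out.push(line);
    }
    out.join("\n")
}
```

A dropped comment does not reset the blank-line counter, so blank lines on either side of a removed comment are still collapsed as one run.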
2. --max-lines N — AST-Aware Smart Truncation
Unique differentiator. No other tool truncates at AST boundaries.
- Given a line budget, select the most informative lines
- Priority hierarchy: type definitions > function signatures > imports > structural braces > body lines
- Cut at AST node boundaries — never mid-expression, never mid-function-signature
- Insert `// ... N lines omitted` markers at truncation points
- Composable with any mode: `skim file.rs --mode=structure --max-lines 50`
- Handles files shorter than N lines gracefully (no-op)
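The priority selection and marker insertion could look roughly like the sketch below. `LineKind` and `truncate_lines` are hypothetical names, the per-line classification is assumed to come from an earlier parsing pass, and markers are not counted against the budget. Aligning cuts to AST node boundaries is not modeled here; a real implementation would select whole nodes rather than individual lines.

```rust
/// Hypothetical line categories, ordered by the priority hierarchy above.
/// Earlier variants compare as smaller, meaning higher priority.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum LineKind {
    TypeDefinition,
    FunctionSignature,
    Import,
    StructuralBrace,
    Body,
}

/// Keeps at most `max_lines` source lines, preferring higher-priority kinds,
/// and inserts one omission marker per run of dropped lines.
fn truncate_lines(lines: &[(LineKind, &str)], max_lines: usize) -> Vec<String> {
    // Files within the budget are returned unchanged (no-op).
    if lines.len() <= max_lines {
        return lines.iter().map(|(_, text)| text.to_string()).collect();
    }

    // Rank line indices by (kind, original position) and keep the best ones.
    let mut ranked: Vec<usize> = (0..lines.len()).collect();
    ranked.sort_by_key(|&i| (lines[i].0, i));
    let mut keep = vec![false; lines.len()];
    for &i in ranked.iter().take(max_lines) {
        keep[i] = true;
    }

    // Emit kept lines in source order, flushing a marker after each gap.
    let mut out = Vec::new();
    let mut omitted = 0usize;
    for (i, (_, text)) in lines.iter().enumerate() {
        if keep[i] {
            if omitted > 0 {
                out.push(format!("// ... {} lines omitted", omitted));
                omitted = 0;
            }
            out.push(text.to_string());
        } else {
            omitted += 1;
        }
    }
    if omitted > 0 {
        out.push(format!("// ... {} lines omitted", omitted));
    }
    out
}
```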
3. --tokens N — Token Budget Mode
Novel capability: no existing tool does this. Pulled forward from Phase D.
- `skim file.rs --tokens 2000`: fit the most informative view into at most N tokens
- Uses tiktoken (cl100k_base) for accurate token measurement
- Algorithm: start with structure mode, progressively add detail until budget is reached
  - Step 1: All type definitions + function signatures (minimum useful view)
  - Step 2: Add import statements
  - Step 3: Add doc comments on public items
  - Step 4: Add short function bodies (< 5 lines)
  - Step 5: Add remaining bodies by size (smallest first)
- Output is always within budget (never exceeds it; may underuse by <5%)
- Use case: agent says "I have 4K tokens of context budget, give me what fits"
- Works with multi-file: `skim 'src/**/*.rs' --tokens 8000` (one budget shared across all files)
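The progressive algorithm in this deliverable can be sketched as a greedy selection over tiered fragments. Everything named here is hypothetical: `Fragment`, `fit_to_budget`, and especially `count_tokens`, which counts whitespace-separated words as a stand-in for tiktoken (cl100k_base). Skipping a fragment that does not fit and trying later ones is one possible policy, not necessarily the intended one.

```rust
/// A hypothetical unit of output: one signature, import, doc comment, or body.
struct Fragment<'a> {
    /// Step number from the algorithm (1 = signatures ... 5 = remaining bodies).
    tier: u8,
    text: &'a str,
}

/// Stand-in token counter: whitespace-separated words.
/// A real tokenizer such as tiktoken would replace this.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Greedily selects fragments tier by tier (smallest first within a tier)
/// without ever exceeding `budget`, then returns them in source order.
fn fit_to_budget<'a>(fragments: &[Fragment<'a>], budget: usize) -> Vec<&'a str> {
    let mut order: Vec<usize> = (0..fragments.len()).collect();
    order.sort_by_key(|&i| (fragments[i].tier, count_tokens(fragments[i].text), i));

    let mut used = 0usize;
    let mut chosen: Vec<usize> = Vec::new();
    for i in order {
        let cost = count_tokens(fragments[i].text);
        // Only add a fragment if the running total stays within the budget.
        if used + cost <= budget {
            used += cost;
            chosen.push(i);
        }
    }

    // Restore source order so the output reads like the original file.
    chosen.sort_unstable();
    chosen.into_iter().map(|i| fragments[i].text).collect()
}
```

Because a fragment is only added when it fits, the output can undershoot the budget but never exceed it, which matches the "never exceed" requirement.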
4. New Language Support — C/C++ and TOML
More languages = more useful. Low effort, high visibility.
- C/C++ support via tree-sitter-c and tree-sitter-cpp (feat: Add C/C++ language support #10)
- TOML support — evaluate tree-sitter-toml vs serde_toml approach (feat: Add TOML language support #8)
- Extensions: `.c`, `.h`, `.cpp`, `.cc`, `.cxx`, `.hpp`, `.hxx`, `.toml`
- Full test fixtures for each (minimum 4 per language per CLAUDE.md)
- Update README language table
- Brings total to 11-12 supported languages (good for launch optics)
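As an illustration of the extension list above, the mapping might be expressed as below. `NewLanguage` and `detect_new_language` are hypothetical names, not skim's actual types. Treating `.h` as C is an assumption, since headers are often shared with C++, and matching is case-sensitive.

```rust
use std::path::Path;

/// Hypothetical language tags for the formats added in this deliverable.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum NewLanguage {
    C,
    Cpp,
    Toml,
}

/// Maps a file path to one of the new languages based on its extension.
/// Returns `None` for paths without an extension or with an unlisted one.
fn detect_new_language(path: &str) -> Option<NewLanguage> {
    let ext = Path::new(path).extension()?.to_str()?;
    match ext {
        // Assumption: `.h` is treated as C here; it could equally be C++.
        "c" | "h" => Some(NewLanguage::C),
        "cpp" | "cc" | "cxx" | "hpp" | "hxx" => Some(NewLanguage::Cpp),
        "toml" => Some(NewLanguage::Toml),
        _ => None,
    }
}
```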
5. Stdin & Pipe Improvements
Agents pipe things. Skim should be the best pipe-friendly code compressor.
- `git diff | skim --lang rust --mode=structure`: works perfectly
- `git show HEAD:file.rs | skim --mode=signatures`: language auto-detected from filename hint
- `--lang` flag for explicit language override when piping (no file extension to detect from)
- Ensure `BufWriter` streaming works correctly for large piped inputs
- No regression on existing stdin support
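For the `BufWriter` item, a minimal sketch of buffered, line-by-line streaming is shown below. `stream_lines` and its `transform` parameter are hypothetical stand-ins for skim's actual pipeline. AST-based modes generally need the whole input before parsing, so per-line streaming is a simplification that mainly illustrates the output side.

```rust
use std::io::{self, BufRead, BufWriter, Write};

/// Streams `input` to `output` line by line through a `BufWriter`.
/// `transform` returns `Some(text)` to emit a line or `None` to drop it.
fn stream_lines<R: BufRead, W: Write>(
    input: R,
    output: W,
    transform: impl Fn(&str) -> Option<String>,
) -> io::Result<()> {
    let mut writer = BufWriter::new(output);
    for line in input.lines() {
        let line = line?;
        if let Some(kept) = transform(&line) {
            writeln!(writer, "{}", kept)?;
        }
    }
    // Flush explicitly so write errors (for example a closed pipe) surface
    // here; errors during the implicit flush on drop are ignored.
    writer.flush()
}
```

In a CLI this would typically be called with `io::stdin().lock()` and `io::stdout().lock()`, so large piped inputs are never held in memory in full on the output side.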
Sequence
1. --mode=minimal ← Foundation (needed by other features)
2. --max-lines N ← Quick differentiator
3. --tokens N ← Novel capability, killer feature
4. C/C++ + TOML languages ← Parallelizable, independent work
5. Stdin/pipe improvements ← Polish
Acceptance Criteria
- All existing 151 tests still pass
- New tests for minimal mode, max-lines, token budget, and new languages
- `--mode=minimal` achieves 15-30% reduction on test fixtures
- `--max-lines` never cuts mid-AST-node
- `--tokens` output is always within budget (verified by tiktoken count)
- C/C++ and TOML have ≥4 test fixtures each
- Performance: all new features maintain <50ms for 1000-line files
Out of Scope (Moved to Phase B)
- `skim init` hook installation → Phase B (hooks are valuable when we have command output compression)
- `skim rewrite` command engine → Phase B
- Multi-agent hook support → Phase B
Phase A focuses on making skim indispensable by capability, not by interception.