Phase A: The Sharpest Code Reader — Minimal Mode, Token Budget, AST Truncation & New Languages #18

@dean0x

Phase A: The Sharpest Code Reader

Parent: #17 (Project Horizon)
Priority: Highest
Goal: Make skim the most useful code reading tool an AI agent can reach for — not via hooks, but by being so good that agents and humans actively choose it.

Strategic Rationale

Modern AI coding agents (Claude Code, Cursor, Windsurf) already have intelligent built-in file reading — they support offsets, limits, and selective reading. Hooking cat adds marginal value since agents rarely use raw cat as their primary file read path.

The real opportunity: features that no agent's built-in Read tool can replicate — AST-aware truncation, token budget optimization, and semantic code compression. These make skim the tool agents choose when they need to understand code efficiently.

Hooks become valuable in Phase B when we add command output compression (test runners, git, builds) — that's where agents have no built-in intelligence.

Deliverables

1. ✅ --mode=minimal — Light-Touch Filtering (PR #22)

Fill the gap between full (no transformation) and structure (bodies stripped).

  • Strip single-line comments (preserve doc comments: ///, /**, #[doc)
  • Strip block comments (preserve Python docstrings """)
  • Normalize excessive blank lines (3+ consecutive → 2)
  • Keep all function bodies intact
  • Language-aware comment pattern detection via tree-sitter node types
  • Expected savings: 15-30% token reduction on typical source files
  • Default mode when minimal context cleanup is needed
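
The blank-line rule above (3+ consecutive → 2) can be sketched in plain Rust. This is only an illustration of that one rule, not skim's implementation: the function name `normalize_blank_lines` is hypothetical, and comment stripping is left out because it depends on tree-sitter node types.

```rust
/// Collapse runs of three or more blank lines down to two.
/// A line counts as blank when it is empty or whitespace-only.
/// Hypothetical helper: the name and signature are assumptions.
fn normalize_blank_lines(source: &str) -> String {
    let mut out = String::with_capacity(source.len());
    let mut blank_run = 0usize;
    for line in source.lines() {
        if line.trim().is_empty() {
            blank_run += 1;
            if blank_run > 2 {
                continue; // drop the third and later blank lines in a run
            }
        } else {
            blank_run = 0;
        }
        out.push_str(line);
        out.push('\n'); // note: output always ends with a newline
    }
    out
}
```

Function bodies pass through untouched, which matches the "keep all function bodies intact" requirement.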

2. --max-lines N — AST-Aware Smart Truncation

Unique differentiator. No other tool truncates at AST boundaries.

  • Given a line budget, select the most informative lines
  • Priority hierarchy: type definitions > function signatures > imports > structural braces > body lines
  • Cut at AST node boundaries — never mid-expression, never mid-function-signature
  • Insert // ... N lines omitted markers at truncation points
  • Composable with any mode: skim file.rs --mode=structure --max-lines 50
  • Handles files shorter than N lines gracefully (no-op)
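
The priority hierarchy and omission markers can be sketched as follows. This is a simplified illustration under stated assumptions: lines arrive already classified (in skim that classification would come from tree-sitter), selection is per line rather than per AST node, and the markers are not counted against the budget. `LineKind` and `truncate_lines` are hypothetical names.

```rust
/// Variants are declared from highest to lowest priority,
/// so the derived `Ord` ranks them in the order given above.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum LineKind {
    TypeDefinition,
    FunctionSignature,
    Import,
    StructuralBrace,
    Body,
}

/// Keep the `max_lines` highest-priority lines, in source order,
/// inserting a marker wherever a run of lines was dropped.
fn truncate_lines(lines: &[(LineKind, &str)], max_lines: usize) -> Vec<String> {
    // Files within the budget pass through unchanged (no-op).
    if lines.len() <= max_lines {
        return lines.iter().map(|(_, text)| text.to_string()).collect();
    }

    // Rank indices by priority; the stable sort keeps source order within a kind.
    let mut ranked: Vec<usize> = (0..lines.len()).collect();
    ranked.sort_by_key(|&i| lines[i].0);

    let mut keep = vec![false; lines.len()];
    for &i in ranked.iter().take(max_lines) {
        keep[i] = true;
    }

    // Emit kept lines in source order with one marker per omitted run.
    let mut out = Vec::new();
    let mut omitted = 0usize;
    for (i, (_, text)) in lines.iter().enumerate() {
        if keep[i] {
            if omitted > 0 {
                out.push(format!("// ... {} lines omitted", omitted));
                omitted = 0;
            }
            out.push(text.to_string());
        } else {
            omitted += 1;
        }
    }
    if omitted > 0 {
        out.push(format!("// ... {} lines omitted", omitted));
    }
    out
}
```

A real implementation would treat whole AST nodes as the selectable unit so that a cut never lands mid-expression or mid-signature.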

3. --tokens N — Token Budget Mode

Novel capability: no existing tool does this. Pulled forward from Phase D.

  • skim file.rs --tokens 2000: fit the most informative view into at most N tokens
  • Uses tiktoken (cl100k_base) for accurate token measurement
  • Algorithm: start with structure mode, progressively add detail until budget is reached
    • Step 1: All type definitions + function signatures (minimum useful view)
    • Step 2: Add import statements
    • Step 3: Add doc comments on public items
    • Step 4: Add short function bodies (< 5 lines)
    • Step 5: Add remaining bodies by size (smallest first)
  • Output always stays within budget (never exceeds it; may underuse it by <5%)
  • Use case: agent says "I have 4K tokens of context budget, give me what fits"
  • Works with multi-file: skim 'src/**/*.rs' --tokens 8000 — budget across all files
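
The progressive fill in Steps 1 to 5 can be sketched as a greedy loop. Everything here is an assumption made for illustration: `Chunk`, `approx_tokens`, and `fit_to_budget` are hypothetical names, and the whitespace word count is only a stand-in for a tiktoken (cl100k_base) count, so the numbers will not match a real tokenizer.

```rust
/// One candidate piece of output. `tier` follows the steps above:
/// 1 = types and signatures, 2 = imports, 3 = doc comments,
/// 4 = short bodies, 5 = remaining bodies.
struct Chunk<'a> {
    tier: u8,
    text: &'a str,
}

/// Stand-in token counter (whitespace-separated words).
/// A real implementation would count cl100k_base tokens instead.
fn approx_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Add chunks tier by tier (smallest first within a tier) and stop
/// at the first chunk that would push the total over `budget`.
/// Selected chunks are returned in their original source order.
fn fit_to_budget<'a>(chunks: &[Chunk<'a>], budget: usize) -> Vec<&'a str> {
    let mut order: Vec<usize> = (0..chunks.len()).collect();
    order.sort_by_key(|&i| (chunks[i].tier, approx_tokens(chunks[i].text)));

    let mut used = 0;
    let mut keep = vec![false; chunks.len()];
    for i in order {
        let cost = approx_tokens(chunks[i].text);
        if used + cost > budget {
            break; // budget reached: never exceed it
        }
        used += cost;
        keep[i] = true;
    }

    chunks
        .iter()
        .zip(keep)
        .filter(|(_, kept)| *kept)
        .map(|(chunk, _)| chunk.text)
        .collect()
}
```

Stopping at the first chunk that does not fit keeps the output strictly within budget at the cost of possibly leaving some budget unused, which is the trade-off the "never exceed, may underuse" requirement describes.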

4. New Language Support — C/C++ and TOML

More languages = more useful. Low effort, high visibility.

  • C/C++ support via tree-sitter-c and tree-sitter-cpp (feat: Add C/C++ language support #10)
  • TOML support: evaluate tree-sitter-toml vs. a serde-based approach (the toml crate) (feat: Add TOML language support #8)
  • Extensions: .c, .h, .cpp, .cc, .cxx, .hpp, .hxx, .toml
  • Full test fixtures for each (minimum 4 per language per CLAUDE.md)
  • Update README language table
  • Brings total to 11-12 supported languages (good for launch optics)
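
A minimal sketch of how the new extensions might map to languages. The `Language` enum and `detect_new_language` function are hypothetical, and treating `.h` as C (rather than C++) is an assumption, since the issue does not say how ambiguous headers should be resolved.

```rust
use std::path::Path;

/// Only the languages added in this deliverable; hypothetical type.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Language {
    C,
    Cpp,
    Toml,
}

/// Map a file path to one of the new languages by extension.
/// Returns `None` for unknown extensions or paths without one.
fn detect_new_language(path: &str) -> Option<Language> {
    let ext = Path::new(path).extension()?.to_str()?;
    match ext {
        "c" | "h" => Some(Language::C), // assumption: `.h` is treated as C
        "cpp" | "cc" | "cxx" | "hpp" | "hxx" => Some(Language::Cpp),
        "toml" => Some(Language::Toml),
        _ => None,
    }
}
```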

5. Stdin & Pipe Improvements

Agents pipe things. Skim should be the best pipe-friendly code compressor.

  • git diff | skim --lang rust --mode=structure should work end to end
  • git show HEAD:file.rs | skim --mode=signatures — language auto-detected from filename hint
  • --lang flag for explicit language override when piping (no file extension to detect from)
  • Ensure BufWriter streaming works correctly for large piped inputs
  • No regression on existing stdin support
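
The BufWriter streaming requirement can be illustrated with a small generic helper. This is a sketch under assumptions, not skim's code: `stream_lines` is a hypothetical name, and the per-line `transform` closure stands in for whatever AST-driven filtering the selected mode performs.

```rust
use std::io::{self, BufRead, BufWriter, Write};

/// Read `input` line by line, apply `transform`, and write the kept
/// lines through a `BufWriter` so large piped inputs do not trigger
/// one write call per line. `transform` returning `None` drops a line.
fn stream_lines<R: BufRead, W: Write>(
    input: R,
    output: W,
    transform: impl Fn(&str) -> Option<String>,
) -> io::Result<()> {
    let mut writer = BufWriter::new(output);
    for line in input.lines() {
        let line = line?; // propagate read errors (e.g. invalid UTF-8)
        if let Some(kept) = transform(&line) {
            writeln!(writer, "{}", kept)?;
        }
    }
    // Flush explicitly so write errors surface instead of being lost on drop.
    writer.flush()
}
```

In a CLI this would be wired up as `stream_lines(io::stdin().lock(), io::stdout().lock(), ...)`; keeping the reader and writer generic makes the same function testable against in-memory buffers.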

Sequence

1. --mode=minimal              ← Foundation (needed by other features)
2. --max-lines N               ← Quick differentiator
3. --tokens N                  ← Novel capability, killer feature
4. C/C++ + TOML languages      ← Parallelizable, independent work
5. Stdin/pipe improvements     ← Polish

Acceptance Criteria

  • All 151 existing tests still pass
  • New tests for minimal mode, max-lines, token budget, and new languages
  • --mode=minimal achieves 15-30% reduction on test fixtures
  • --max-lines never cuts mid-AST-node
  • --tokens output is always within budget (verified by tiktoken count)
  • C/C++ and TOML have ≥4 test fixtures each
  • Performance: all new features maintain <50ms for 1000-line files

Out of Scope (Moved to Phase B)

  • skim init hook installation → Phase B (hooks are valuable when we have command output compression)
  • skim rewrite command engine → Phase B
  • Multi-agent hook support → Phase B

Phase A focuses on making skim indispensable by capability, not by interception.

Labels: enhancement (New feature or request), horizon (Project Horizon initiative)
