feat: add --max-lines N AST-aware smart truncation (#18) by dean0x · Pull Request #23 · dean0x/skim

dean0x · 2026-03-12T10:55:57Z

Summary

Adds --max-lines N flag for AST-aware output truncation with priority-based node selection
Types and signatures prioritized over imports, which are prioritized over function bodies
Language-appropriate omission markers (//, #, ) inserted between gaps
Composable with all modes: skim file.rs --mode=structure --max-lines 50
Per-file semantics in multi-file/glob mode
Cache key includes max_lines for correct cache isolation

Implementation

New truncate.rs module with NodeSpan-based truncation engine
Priority scoring: types(5) > signatures(4) > imports(3) > structure(2) > other(1)
_with_spans variants for structure/signatures/types transforms
simple_line_truncate() fallback for serde languages (JSON/YAML)
transform_auto_with_config() public API
CLI arg with validation, ProcessOptions threading, stdin support

Test plan

25+ new tests across 3 layers (unit, integration, CLI)
All 274 existing tests pass
Zero clippy warnings
Zero Snyk security issues

Closes #18 (Step 2)

Add priority-based truncation that respects AST node boundaries when limiting output to N lines. Types and interfaces are kept preferentially over functions, which are kept over imports, which are kept over bodies. Truncation engine: - NodeSpan maps transformed output line ranges to tree-sitter node kinds - Greedy span selection with marker budget reservation - Language-appropriate omission markers (// / # / ) - Simple line truncation fallback for JSON/YAML Pipeline integration: - All transform modes (structure, signatures, types, minimal, full) produce NodeSpan metadata via _with_spans variants - TransformConfig gains max_lines: Option<usize> with builder method - Cache key includes max_lines for isolation - New public API: transform_auto_with_config() CLI: - --max-lines <N> flag with validation (>= 1) - Composable with all modes, glob, parallel, stdin, --show-stats - MultiFileOptions struct replaces loose parameters Tests: 22 new tests (14 integration + 8 CLI) covering priority ordering, omission markers, mode composition, per-language behavior, and edge cases. All 274 tests pass (184 core + 90 CLI).

YAML comments use # not //. The previous implementation used // which is not valid YAML syntax. JSON retains // as it has no comment syntax and // is the JSONC convention.

- Consolidate serde-based language handling (JSON/YAML) - Remove duplicate truncation logic in match branches - Apply truncation uniformly after transformation - Improve code readability with clearer flow No behavior change.

dean0x · 2026-03-12T21:07:59Z

Code Review Summary - Comprehensive PR Analysis

This PR introduces AST-aware truncation with solid overall design, but requires fixes for 7 blocking issues and several should-fix items before approval. Review reports synthesized from 8 independent analyses (architecture, complexity, consistency, performance, regression, Rust, security, tests).

CRITICAL BLOCKING ISSUES (Must Fix)

1. Markdown Header Extraction Duplication ⚠️ HIGH

Impact: DRY violation, maintenance risk across future changes
Problem: extract_markdown_headers() (lines 402-497) and extract_markdown_headers_with_spans() (lines 502-598) contain ~100 lines of identical logic
Files: crates/rskim-core/src/transform/structure.rs
Fix: Have extract_markdown_headers delegate to the _with_spans variant and discard spans

2. Node Kind Mapping Tables Can Drift ⚠️ HIGH

Impact: Silent priority degradation if tables diverge
Problem: to_static_node_kind() in structure.rs:337-385 and score_node_kind() in utils.rs:90-136 maintain separate mappings that are already out of sync
Files: crates/rskim-core/src/transform/structure.rs, crates/rskim-core/src/transform/utils.rs
Fix: Unify into single source of truth (single match statement mapping both)

3. Double lines().collect() in Truncation Path ⚠️ HIGH

Impact: Unnecessary allocation on hot path
Problem: truncate_to_lines() line 75 collects lines, then simple_line_truncate() collects again at line 211
Files: crates/rskim-core/src/transform/truncate.rs
Fix: Pass pre-collected lines to avoid re-parsing

4. String Clones in Span Builders ⚠️ HIGH

Impact: Doubles memory allocation for signatures/types
Problem: iter() + .clone() instead of into_iter() in signatures.rs:93, types.rs:183, structure.rs:593
Files: Multiple files
Fix: Use into_iter() to move strings instead of cloning

5. Inconsistent _with_spans Pattern ⚠️ HIGH

Impact: Consistency gap for future maintainers
Problem: Minimal mode inlines span generation in mod.rs:82-88 instead of following the _with_spans pattern
Files: crates/rskim-core/src/transform/mod.rs
Fix: Extract transform_minimal_with_spans() or add architecture comment

6. Missing Upper Bound Validation ⚠️ MEDIUM

Impact: Inconsistent with --jobs parameter pattern
Problem: CLI validates --max-lines >= 1 but no upper bound
Files: crates/rskim/src/main.rs:609
Fix: Add const MAX_MAX_LINES = 1_000_000; validation

7. Potential Out-of-Bounds Span Boundaries ⚠️ MEDIUM

Impact: Could produce incorrect span line ranges
Problem: source_to_output_byte() at structure.rs:301-307 clamps to >= 0 but not <= output.len()
Files: crates/rskim-core/src/transform/structure.rs
Fix: Clamp both output_start and output_end

SHOULD-FIX ISSUES (Approve with fixes committed)

Restore // SECURITY: comment annotations on depth guards
Account for inter-span markers in greedy budget calculation
Optimize text clone on no-truncation path
Document cache key format change in release notes

POSITIVE FINDINGS

✅ Core algorithm is sound with clean priority scoring
✅ 22 new tests covering diverse modes and languages
✅ API backward compatible
✅ Error handling consistent throughout

RECOMMENDATION

APPROVED_WITH_CONDITIONS - Address all 7 blocking issues before merge.
Estimated fix time: 41 minutes

Generated by Claude Code + Multi-Review Synthesis

dean0x · 2026-03-12T21:08:17Z

Detailed Blocking Issue Fixes

Issue 1: Markdown Header Duplication

Current state (structure.rs):

Lines 402-497: extract_markdown_headers()
Lines 502-598: extract_markdown_headers_with_spans()

Suggested fix - collapse to one function:

pub(crate) fn extract_markdown_headers_with_spans(
    source: &str,
    tree: &Tree,
    min_level: u32,
    max_level: u32,
) -> Result<(String, Vec<(String, &'static str)>)> {
    // ... existing implementation (lines 502-598) ...
}

pub(crate) fn extract_markdown_headers(
    source: &str,
    tree: &Tree,
    min_level: u32,
    max_level: u32,
) -> Result<String> {
    let (text, _spans) = extract_markdown_headers_with_spans(source, tree, min_level, max_level)?;
    Ok(text)
}

This follows the pattern already used in transform_structure / transform_structure_with_spans (lines 49-57).

Issue 2: Unify Node Kind Tables

Current problem:

to_static_node_kind() in structure.rs has ~30 arms mapping node kinds to static strings
score_node_kind() in utils.rs has overlapping but different arms mapping kinds to priorities
Already out of sync: struct_specifier, enum_specifier, etc. scored but not mapped to static strings

Suggested fix in utils.rs:

struct NodeKindInfo {
    static_kind: &'static str,
    priority: u8,
}

fn node_kind_info(kind: &str) -> NodeKindInfo {
    match kind {
        "type_alias_declaration" => NodeKindInfo { static_kind: "type_alias_declaration", priority: 5 },
        "interface_declaration" => NodeKindInfo { static_kind: "interface_declaration", priority: 5 },
        "struct_item" => NodeKindInfo { static_kind: "struct_item", priority: 5 },
        "class_declaration" => NodeKindInfo { static_kind: "class_declaration", priority: 5 },
        "enum_declaration" => NodeKindInfo { static_kind: "enum_declaration", priority: 5 },
        "function_declaration" => NodeKindInfo { static_kind: "function_declaration", priority: 4 },
        "method_declaration" => NodeKindInfo { static_kind: "method_declaration", priority: 4 },
        // ... complete coverage with one source of truth
        _ => NodeKindInfo { static_kind: "unknown", priority: 1 },
    }
}

pub(crate) fn score_node_kind(kind: &str) -> u8 {
    node_kind_info(kind).priority
}

pub(crate) fn to_static_node_kind(kind: &str) -> &'static str {
    node_kind_info(kind).static_kind
}

Then remove the old to_static_node_kind() implementation from structure.rs and have it call utils::to_static_node_kind() instead.

Issue 3: Double lines().collect()

Current state (truncate.rs lines 75, 211):

pub(crate) fn truncate_to_lines(text: &str, spans: &[NodeSpan], language: Language, max_lines: usize) -> Result<String> {
    let lines = text.lines().collect::<Vec<&str>>();  // Line 75 - FIRST COLLECT
    
    // ... some early returns ...
    
    // Line 84 or 94
    simple_line_truncate(&text, language, max_lines)  // This function collects again!
}

// Line 211 in simple_line_truncate
let lines = text.lines().collect::<Vec<&str>>();  // SECOND COLLECT - DUPLICATE

Suggested fix:

Create a variant that accepts pre-collected lines:

pub(crate) fn truncate_to_lines_from_lines(
    lines: Vec<&str>,
    text: &str,
    spans: &[NodeSpan],
    language: Language,
    max_lines: usize,
) -> Result<String> {
    if lines.len() <= max_lines {
        return Ok(text.to_string());
    }
    // ... rest of logic using &lines instead of re-collecting
}

fn simple_line_truncate_from_lines(
    lines: &[&str],
    language: Language,
    max_lines: usize,
    trailing_newline: bool,
) -> Result<String> {
    // ... uses &lines directly, no re-collection
}

Issue 4: String Clones in Span Builders

Current state (signatures.rs line 93):

let texts: Vec<String> = signatures
    .iter()                          // Borrows each (String, &'static str)
    .map(|(sig, kind)| {
        let line_count = sig.lines().count().max(1);
        spans.push(NodeSpan::new(current_line..current_line + line_count, kind));
        current_line += line_count;
        sig.clone()  // ← UNNECESSARY CLONE
    })
    .collect();

Suggested fix:

let texts: Vec<String> = signatures
    .into_iter()                     // Consume the Vec, take ownership
    .map(|(sig, kind)| {
        let line_count = sig.lines().count().max(1);
        spans.push(NodeSpan::new(current_line..current_line + line_count, kind));
        current_line += line_count;
        sig  // ← MOVED, NOT CLONED
    })
    .collect();

Apply same fix to types.rs:183 and structure.rs:593.

Issue 5: _with_spans Pattern Consistency

Current inconsistency (mod.rs lines 82-88):

Mode::Minimal => {
    let text = minimal::transform_minimal(source, tree, language, config)?;
    let line_count = text.lines().count();
    let spans = vec![NodeSpan::new(0..line_count, "source_file")];  // ← INLINE SPAN GENERATION
    Ok((text, spans))
}

Option A - Extract dedicated function (recommended):

Create transform_minimal_with_spans in minimal.rs matching the pattern:

// In minimal.rs
pub(crate) fn transform_minimal_with_spans(
    source: &str,
    tree: &Tree,
    language: Language,
    config: &TransformConfig,
) -> Result<(String, Vec<NodeSpan>)> {
    let text = transform_minimal(source, tree, language, config)?;
    let line_count = text.lines().count();
    let spans = vec![NodeSpan::new(0..line_count, "source_file")];
    Ok((text, spans))
}

// Then in mod.rs
Mode::Minimal => minimal::transform_minimal_with_spans(source, tree, language, config)

Option B - Add architecture comment (if intentional):

Mode::Minimal => {
    // ARCHITECTURE: Minimal mode preserves all code, so span
    // decomposition is not meaningful. Returns single source_file span.
    let text = minimal::transform_minimal(source, tree, language, config)?;
    let line_count = text.lines().count();
    let spans = vec![NodeSpan::new(0..line_count, "source_file")];
    Ok((text, spans))
}

Issue 6 & 7: CLI Bounds and Span Boundaries

Issue 6 - Add to main.rs after line 609:

const MAX_MAX_LINES: usize = 1_000_000;

// After validation
if let Some(max_lines) = args.max_lines {
    if max_lines == 0 {
        anyhow::bail!("--max-lines must be at least 1\nUse --max-lines 1 to get a single line of output.");
    }
    if max_lines > MAX_MAX_LINES {
        anyhow::bail!(
            "--max-lines value too high: {} (maximum: {})",
            max_lines,
            MAX_MAX_LINES
        );
    }
}

Issue 7 - Fix in structure.rs lines 316-317:

// Current:
let output_end = source_to_output_byte(source_end).min(output.len());

// Fixed:
let output_start = source_to_output_byte(source_start).min(output.len());
let output_end = source_to_output_byte(source_end).min(output.len());

Estimated time to fix all 7 issues: ~41 minutes

Once complete, run:

cargo test --all
cargo clippy -- -D warnings

Then force-push with message: fix: address code review blocking issues from #23 analysis

Generated by Claude Code - Multi-review synthesis

…ry clones - extract_markdown_headers now delegates to extract_markdown_headers_with_spans and discards spans, matching the existing transform_structure pattern - Replace .iter() + .clone() with .into_iter() in span-building iterators across signatures.rs, types.rs, and structure.rs to avoid unnecessary String allocations (zero-copy design constraint) Co-Authored-By: Claude <noreply@anthropic.com>

…dge-case tests - Move empty-spans check before lines().collect() in truncate_to_lines to avoid redundant allocation when falling back to simple_line_truncate - Add boundary test documenting max_lines=0 behavior at core library level (CLI validates >= 1, but core silently clamps) - Add tests for overlapping and adjacent NodeSpan ranges confirming output never exceeds the max_lines budget Co-Authored-By: Claude <noreply@anthropic.com>

…en span logic - Add MAX_MAX_LINES (1,000,000) validation matching --jobs pattern (B8) - Restore // SECURITY: comments on depth guards in signatures/types (SF1) - Add ARCHITECTURE comment explaining Full/Minimal inline spans (SF2) - Gate dead wrapper functions with #[cfg(test)] instead of #[allow(dead_code)] (SF3) - Add .min(output.len()) clamp to output_start for consistency (SF4) - Add // SAFETY: comments documenting usize-to-i64 cast bounds (SF5) Co-Authored-By: Claude <noreply@anthropic.com>

- Promote Python class_definition to priority 5 (type system) so classes are kept over standalone functions under tight --max-lines budgets - Compute tokens inline on cache hit when cached without --show-stats, fixing silent stats gap on subsequent --show-stats runs - Fix clear_cache ENOTEMPTY race by removing files individually instead of remove_dir_all + create_dir_all during parallel test execution

Dean Sharon added 3 commits March 12, 2026 12:42

fix: use correct YAML comment syntax (#) for omission markers

ee35c79

YAML comments use # not //. The previous implementation used // which is not valid YAML syntax. JSON retains // as it has no comment syntax and // is the JSONC convention.

Dean Sharon and others added 3 commits March 12, 2026 23:59

This was referenced Mar 12, 2026

refactor: Duplicated node-kind mapping tables (to_static_node_kind / score_node_kind) #24

Closed

refactor: Greedy span selection does not account for inter-span marker overhead #25

Closed

Dean Sharon added 4 commits March 13, 2026 00:25

chore: condense comments and standardize unreachable annotations

6f1cb9a

style: fix rustfmt and clippy doc-overindented-list-items warning

be13e65

style: fix rustfmt formatting in main.rs cache hit block

464adbf

dean0x merged commit 5031396 into main Mar 12, 2026
9 checks passed

dean0x deleted the feature/18-max-lines-truncation branch March 12, 2026 23:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add --max-lines N AST-aware smart truncation (#18)#23

feat: add --max-lines N AST-aware smart truncation (#18)#23
dean0x merged 10 commits intomainfrom
feature/18-max-lines-truncation

dean0x commented Mar 12, 2026

Uh oh!

dean0x commented Mar 12, 2026

Uh oh!

dean0x commented Mar 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dean0x commented Mar 12, 2026

Summary

Implementation

Test plan

Uh oh!

dean0x commented Mar 12, 2026

Code Review Summary - Comprehensive PR Analysis

CRITICAL BLOCKING ISSUES (Must Fix)

1. Markdown Header Extraction Duplication ⚠️ HIGH

2. Node Kind Mapping Tables Can Drift ⚠️ HIGH

3. Double lines().collect() in Truncation Path ⚠️ HIGH

4. String Clones in Span Builders ⚠️ HIGH

5. Inconsistent _with_spans Pattern ⚠️ HIGH

6. Missing Upper Bound Validation ⚠️ MEDIUM

7. Potential Out-of-Bounds Span Boundaries ⚠️ MEDIUM

SHOULD-FIX ISSUES (Approve with fixes committed)

POSITIVE FINDINGS

RECOMMENDATION

Uh oh!

dean0x commented Mar 12, 2026

Detailed Blocking Issue Fixes

Issue 1: Markdown Header Duplication

Issue 2: Unify Node Kind Tables

Issue 3: Double lines().collect()

Issue 4: String Clones in Span Builders

Issue 5: _with_spans Pattern Consistency

Issue 6 & 7: CLI Bounds and Span Boundaries

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant