Skip to content

feat: add --max-lines N AST-aware smart truncation (#18)#23

Merged
dean0x merged 10 commits intomainfrom
feature/18-max-lines-truncation
Mar 12, 2026
Merged

feat: add --max-lines N AST-aware smart truncation (#18)#23
dean0x merged 10 commits intomainfrom
feature/18-max-lines-truncation

Conversation

@dean0x
Copy link
Owner

@dean0x dean0x commented Mar 12, 2026

Summary

  • Adds --max-lines N flag for AST-aware output truncation with priority-based node selection
  • Types and signatures prioritized over imports, which are prioritized over function bodies
  • Language-appropriate omission markers (//, #, <!-- -->) inserted between gaps
  • Composable with all modes: skim file.rs --mode=structure --max-lines 50
  • Per-file semantics in multi-file/glob mode
  • Cache key includes max_lines for correct cache isolation

Implementation

  • New truncate.rs module with NodeSpan-based truncation engine
  • Priority scoring: types(5) > signatures(4) > imports(3) > structure(2) > other(1)
  • _with_spans variants for structure/signatures/types transforms
  • simple_line_truncate() fallback for serde languages (JSON/YAML)
  • transform_auto_with_config() public API
  • CLI arg with validation, ProcessOptions threading, stdin support

Test plan

  • 25+ new tests across 3 layers (unit, integration, CLI)
  • All 274 existing tests pass
  • Zero clippy warnings
  • Zero Snyk security issues

Closes #18 (Step 2)

Dean Sharon added 3 commits March 12, 2026 12:42
Add priority-based truncation that respects AST node boundaries when
limiting output to N lines. Types and interfaces are kept preferentially
over functions, which are kept over imports, which are kept over bodies.

Truncation engine:
- NodeSpan maps transformed output line ranges to tree-sitter node kinds
- Greedy span selection with marker budget reservation
- Language-appropriate omission markers (// / # / <!-- -->)
- Simple line truncation fallback for JSON/YAML

Pipeline integration:
- All transform modes (structure, signatures, types, minimal, full)
  produce NodeSpan metadata via _with_spans variants
- TransformConfig gains max_lines: Option<usize> with builder method
- Cache key includes max_lines for isolation
- New public API: transform_auto_with_config()

CLI:
- --max-lines <N> flag with validation (>= 1)
- Composable with all modes, glob, parallel, stdin, --show-stats
- MultiFileOptions struct replaces loose parameters

Tests: 22 new tests (14 integration + 8 CLI) covering priority ordering,
omission markers, mode composition, per-language behavior, and edge cases.
All 274 tests pass (184 core + 90 CLI).
YAML comments use # not //. The previous implementation used // which
is not valid YAML syntax. JSON retains // as it has no comment syntax
and // is the JSONC convention.
- Consolidate serde-based language handling (JSON/YAML)
- Remove duplicate truncation logic in match branches
- Apply truncation uniformly after transformation
- Improve code readability with clearer flow

No behavior change.
@dean0x
Copy link
Owner Author

dean0x commented Mar 12, 2026

Code Review Summary - Comprehensive PR Analysis

This PR introduces AST-aware truncation with solid overall design, but requires fixes for 7 blocking issues and several should-fix items before approval. Review reports synthesized from 8 independent analyses (architecture, complexity, consistency, performance, regression, Rust, security, tests).


CRITICAL BLOCKING ISSUES (Must Fix)

1. Markdown Header Extraction Duplication ⚠️ HIGH

  • Impact: DRY violation, maintenance risk across future changes
  • Problem: extract_markdown_headers() (lines 402-497) and extract_markdown_headers_with_spans() (lines 502-598) contain ~100 lines of identical logic
  • Files: crates/rskim-core/src/transform/structure.rs
  • Fix: Have extract_markdown_headers delegate to the _with_spans variant and discard spans

2. Node Kind Mapping Tables Can Drift ⚠️ HIGH

  • Impact: Silent priority degradation if tables diverge
  • Problem: to_static_node_kind() in structure.rs:337-385 and score_node_kind() in utils.rs:90-136 maintain separate mappings that are already out of sync
  • Files: crates/rskim-core/src/transform/structure.rs, crates/rskim-core/src/transform/utils.rs
  • Fix: Unify into single source of truth (single match statement mapping both)

3. Double lines().collect() in Truncation Path ⚠️ HIGH

  • Impact: Unnecessary allocation on hot path
  • Problem: truncate_to_lines() line 75 collects lines, then simple_line_truncate() collects again at line 211
  • Files: crates/rskim-core/src/transform/truncate.rs
  • Fix: Pass pre-collected lines to avoid re-parsing

4. String Clones in Span Builders ⚠️ HIGH

  • Impact: Doubles memory allocation for signatures/types
  • Problem: iter() + .clone() instead of into_iter() in signatures.rs:93, types.rs:183, structure.rs:593
  • Files: Multiple files
  • Fix: Use into_iter() to move strings instead of cloning

5. Inconsistent _with_spans Pattern ⚠️ HIGH

  • Impact: Consistency gap for future maintainers
  • Problem: Minimal mode inlines span generation in mod.rs:82-88 instead of following the _with_spans pattern
  • Files: crates/rskim-core/src/transform/mod.rs
  • Fix: Extract transform_minimal_with_spans() or add architecture comment

6. Missing Upper Bound Validation ⚠️ MEDIUM

  • Impact: Inconsistent with --jobs parameter pattern
  • Problem: CLI validates --max-lines >= 1 but no upper bound
  • Files: crates/rskim/src/main.rs:609
  • Fix: Add const MAX_MAX_LINES = 1_000_000; validation

7. Potential Out-of-Bounds Span Boundaries ⚠️ MEDIUM

  • Impact: Could produce incorrect span line ranges
  • Problem: source_to_output_byte() at structure.rs:301-307 clamps to >= 0 but not <= output.len()
  • Files: crates/rskim-core/src/transform/structure.rs
  • Fix: Clamp both output_start and output_end

SHOULD-FIX ISSUES (Approve with fixes committed)

  • Restore // SECURITY: comment annotations on depth guards
  • Account for inter-span markers in greedy budget calculation
  • Optimize text clone on no-truncation path
  • Document cache key format change in release notes

POSITIVE FINDINGS

✅ Core algorithm is sound with clean priority scoring
✅ 22 new tests covering diverse modes and languages
✅ API backward compatible
✅ Error handling consistent throughout


RECOMMENDATION

APPROVED_WITH_CONDITIONS - Address all 7 blocking issues before merge.
Estimated fix time: 41 minutes

Generated by Claude Code + Multi-Review Synthesis

@dean0x
Copy link
Owner Author

dean0x commented Mar 12, 2026

Detailed Blocking Issue Fixes

Issue 1: Markdown Header Duplication

Current state (structure.rs):

  • Lines 402-497: extract_markdown_headers()
  • Lines 502-598: extract_markdown_headers_with_spans()

Suggested fix - collapse to one function:

pub(crate) fn extract_markdown_headers_with_spans(
    source: &str,
    tree: &Tree,
    min_level: u32,
    max_level: u32,
) -> Result<(String, Vec<(String, &'static str)>)> {
    // ... existing implementation (lines 502-598) ...
}

pub(crate) fn extract_markdown_headers(
    source: &str,
    tree: &Tree,
    min_level: u32,
    max_level: u32,
) -> Result<String> {
    let (text, _spans) = extract_markdown_headers_with_spans(source, tree, min_level, max_level)?;
    Ok(text)
}

This follows the pattern already used in transform_structure / transform_structure_with_spans (lines 49-57).


Issue 2: Unify Node Kind Tables

Current problem:

  • to_static_node_kind() in structure.rs has ~30 arms mapping node kinds to static strings
  • score_node_kind() in utils.rs has overlapping but different arms mapping kinds to priorities
  • Already out of sync: struct_specifier, enum_specifier, etc. scored but not mapped to static strings

Suggested fix in utils.rs:

struct NodeKindInfo {
    static_kind: &'static str,
    priority: u8,
}

fn node_kind_info(kind: &str) -> NodeKindInfo {
    match kind {
        "type_alias_declaration" => NodeKindInfo { static_kind: "type_alias_declaration", priority: 5 },
        "interface_declaration" => NodeKindInfo { static_kind: "interface_declaration", priority: 5 },
        "struct_item" => NodeKindInfo { static_kind: "struct_item", priority: 5 },
        "class_declaration" => NodeKindInfo { static_kind: "class_declaration", priority: 5 },
        "enum_declaration" => NodeKindInfo { static_kind: "enum_declaration", priority: 5 },
        "function_declaration" => NodeKindInfo { static_kind: "function_declaration", priority: 4 },
        "method_declaration" => NodeKindInfo { static_kind: "method_declaration", priority: 4 },
        // ... complete coverage with one source of truth
        _ => NodeKindInfo { static_kind: "unknown", priority: 1 },
    }
}

pub(crate) fn score_node_kind(kind: &str) -> u8 {
    node_kind_info(kind).priority
}

pub(crate) fn to_static_node_kind(kind: &str) -> &'static str {
    node_kind_info(kind).static_kind
}

Then remove the old to_static_node_kind() implementation from structure.rs and have it call utils::to_static_node_kind() instead.


Issue 3: Double lines().collect()

Current state (truncate.rs lines 75, 211):

pub(crate) fn truncate_to_lines(text: &str, spans: &[NodeSpan], language: Language, max_lines: usize) -> Result<String> {
    let lines = text.lines().collect::<Vec<&str>>();  // Line 75 - FIRST COLLECT
    
    // ... some early returns ...
    
    // Line 84 or 94
    simple_line_truncate(&text, language, max_lines)  // This function collects again!
}

// Line 211 in simple_line_truncate
let lines = text.lines().collect::<Vec<&str>>();  // SECOND COLLECT - DUPLICATE

Suggested fix:

Create a variant that accepts pre-collected lines:

pub(crate) fn truncate_to_lines_from_lines(
    lines: Vec<&str>,
    text: &str,
    spans: &[NodeSpan],
    language: Language,
    max_lines: usize,
) -> Result<String> {
    if lines.len() <= max_lines {
        return Ok(text.to_string());
    }
    // ... rest of logic using &lines instead of re-collecting
}

fn simple_line_truncate_from_lines(
    lines: &[&str],
    language: Language,
    max_lines: usize,
    trailing_newline: bool,
) -> Result<String> {
    // ... uses &lines directly, no re-collection
}

Issue 4: String Clones in Span Builders

Current state (signatures.rs line 93):

let texts: Vec<String> = signatures
    .iter()                          // Borrows each (String, &'static str)
    .map(|(sig, kind)| {
        let line_count = sig.lines().count().max(1);
        spans.push(NodeSpan::new(current_line..current_line + line_count, kind));
        current_line += line_count;
        sig.clone()  // ← UNNECESSARY CLONE
    })
    .collect();

Suggested fix:

let texts: Vec<String> = signatures
    .into_iter()                     // Consume the Vec, take ownership
    .map(|(sig, kind)| {
        let line_count = sig.lines().count().max(1);
        spans.push(NodeSpan::new(current_line..current_line + line_count, kind));
        current_line += line_count;
        sig  // ← MOVED, NOT CLONED
    })
    .collect();

Apply same fix to types.rs:183 and structure.rs:593.


Issue 5: _with_spans Pattern Consistency

Current inconsistency (mod.rs lines 82-88):

Mode::Minimal => {
    let text = minimal::transform_minimal(source, tree, language, config)?;
    let line_count = text.lines().count();
    let spans = vec![NodeSpan::new(0..line_count, "source_file")];  // ← INLINE SPAN GENERATION
    Ok((text, spans))
}

Option A - Extract dedicated function (recommended):

Create transform_minimal_with_spans in minimal.rs matching the pattern:

// In minimal.rs
pub(crate) fn transform_minimal_with_spans(
    source: &str,
    tree: &Tree,
    language: Language,
    config: &TransformConfig,
) -> Result<(String, Vec<NodeSpan>)> {
    let text = transform_minimal(source, tree, language, config)?;
    let line_count = text.lines().count();
    let spans = vec![NodeSpan::new(0..line_count, "source_file")];
    Ok((text, spans))
}

// Then in mod.rs
Mode::Minimal => minimal::transform_minimal_with_spans(source, tree, language, config)

Option B - Add architecture comment (if intentional):

Mode::Minimal => {
    // ARCHITECTURE: Minimal mode preserves all code, so span
    // decomposition is not meaningful. Returns single source_file span.
    let text = minimal::transform_minimal(source, tree, language, config)?;
    let line_count = text.lines().count();
    let spans = vec![NodeSpan::new(0..line_count, "source_file")];
    Ok((text, spans))
}

Issue 6 & 7: CLI Bounds and Span Boundaries

Issue 6 - Add to main.rs after line 609:

const MAX_MAX_LINES: usize = 1_000_000;

// After validation
if let Some(max_lines) = args.max_lines {
    if max_lines == 0 {
        anyhow::bail!("--max-lines must be at least 1\nUse --max-lines 1 to get a single line of output.");
    }
    if max_lines > MAX_MAX_LINES {
        anyhow::bail!(
            "--max-lines value too high: {} (maximum: {})",
            max_lines,
            MAX_MAX_LINES
        );
    }
}

Issue 7 - Fix in structure.rs lines 316-317:

// Current:
let output_end = source_to_output_byte(source_end).min(output.len());

// Fixed:
let output_start = source_to_output_byte(source_start).min(output.len());
let output_end = source_to_output_byte(source_end).min(output.len());

Estimated time to fix all 7 issues: ~41 minutes

Once complete, run:

cargo test --all
cargo clippy -- -D warnings

Then force-push with message: fix: address code review blocking issues from #23 analysis

Generated by Claude Code - Multi-review synthesis

Dean Sharon and others added 3 commits March 12, 2026 23:59
…ry clones

- extract_markdown_headers now delegates to extract_markdown_headers_with_spans
  and discards spans, matching the existing transform_structure pattern
- Replace .iter() + .clone() with .into_iter() in span-building iterators
  across signatures.rs, types.rs, and structure.rs to avoid unnecessary
  String allocations (zero-copy design constraint)

Co-Authored-By: Claude <noreply@anthropic.com>
…dge-case tests

- Move empty-spans check before lines().collect() in truncate_to_lines to
  avoid redundant allocation when falling back to simple_line_truncate
- Add boundary test documenting max_lines=0 behavior at core library level
  (CLI validates >= 1, but core silently clamps)
- Add tests for overlapping and adjacent NodeSpan ranges confirming output
  never exceeds the max_lines budget

Co-Authored-By: Claude <noreply@anthropic.com>
…en span logic

- Add MAX_MAX_LINES (1,000,000) validation matching --jobs pattern (B8)
- Restore // SECURITY: comments on depth guards in signatures/types (SF1)
- Add ARCHITECTURE comment explaining Full/Minimal inline spans (SF2)
- Gate dead wrapper functions with #[cfg(test)] instead of #[allow(dead_code)] (SF3)
- Add .min(output.len()) clamp to output_start for consistency (SF4)
- Add // SAFETY: comments documenting usize-to-i64 cast bounds (SF5)

Co-Authored-By: Claude <noreply@anthropic.com>
Dean Sharon added 4 commits March 13, 2026 00:25
- Promote Python class_definition to priority 5 (type system) so classes
  are kept over standalone functions under tight --max-lines budgets
- Compute tokens inline on cache hit when cached without --show-stats,
  fixing silent stats gap on subsequent --show-stats runs
- Fix clear_cache ENOTEMPTY race by removing files individually instead
  of remove_dir_all + create_dir_all during parallel test execution
@dean0x dean0x merged commit 5031396 into main Mar 12, 2026
9 checks passed
@dean0x dean0x deleted the feature/18-max-lines-truncation branch March 12, 2026 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase A: The Sharpest Code Reader — Minimal Mode, Token Budget, AST Truncation & New Languages

1 participant