
perf: eliminate redundant tokenization and binary search allocation churn#29

Merged
dean0x merged 3 commits into main from perf/28-token-budget-truncation on Mar 15, 2026

Conversation

@dean0x
Owner

@dean0x dean0x commented Mar 14, 2026

Summary

Fixes the two performance issues deferred from #27 to tech debt (#28):

  • B4: truncate_to_token_budget now accepts known_token_count: Option<usize>, eliminating the redundant fast-path tokenization when the cascade loop already computed the count
  • B5: Binary search replaces per-iteration lines[..mid].join("\n") with a single pre-joined string and byte-offset index, reducing allocation work from O(N log N) to O(N)

Also adds 3 new unit tests for the known_token_count parameter and a dedicated criterion benchmark group (token_budget_truncation) at 100–5000 line scales.

Closes B4 and B5 from #28. Remaining items (B6, B8, S2, S5, S6, S13) are unchanged.
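The B5 change can be sketched as follows. This is an illustrative stand-in, not rskim-core's actual API: the function name, the simplified signature (no `Language` parameter, plain `String` return), and the whitespace tokenizer are all assumptions for the example.

```rust
// Sketch of the B5 idea: join the lines once, record the byte offset where
// each prefix of lines ends, then binary-search on prefix length. Each probe
// slices the pre-joined string instead of re-joining, so no O(mid) allocation
// happens per iteration.
fn truncate_by_lines(text: &str, budget: usize, count_tokens: impl Fn(&str) -> usize) -> String {
    let lines: Vec<&str> = text.lines().collect();
    let joined = lines.join("\n");

    // offsets[k] is the byte length of the first k lines joined with '\n',
    // so joined[..offsets[k]] is always a valid prefix slice.
    let mut offsets = Vec::with_capacity(lines.len() + 1);
    offsets.push(0usize);
    let mut end = 0usize;
    for (i, line) in lines.iter().enumerate() {
        if i > 0 {
            end += 1; // '\n' separator between lines
        }
        end += line.len();
        offsets.push(end);
    }

    // Find the largest prefix of lines whose token count fits the budget.
    let (mut lo, mut hi) = (0, lines.len());
    while lo < hi {
        let mid = (lo + hi + 1) / 2;
        if count_tokens(&joined[..offsets[mid]]) <= budget {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    joined[..offsets[lo]].to_string()
}
```

Each iteration still tokenizes the candidate prefix; only the per-iteration string allocation is eliminated, which is exactly the O(N log N) → O(N) allocation claim above.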

Test plan

  • All 154 tests pass (cargo test — 151 existing + 3 new)
  • cargo clippy -- -D warnings clean
  • test_token_budget_known_count_none_behaves_like_before — property test: None and Some(actual) produce identical output for budgets 1..20
  • Issue Tech debt: token budget cascade follow-ups #28 updated with B4/B5 struck through
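The property test in the plan above can be illustrated with a minimal sketch. The signatures here are simplified for the example (no `Language` parameter, whitespace word counting, naive word truncation); only the property itself is taken from the test plan.

```rust
// Property: for every small budget, passing None and passing Some(actual)
// as the token-count hint must yield identical output.
fn truncate(text: &str, budget: usize, hint: Option<usize>, count: impl Fn(&str) -> usize) -> String {
    // Use the hint when provided, otherwise tokenize the full text.
    let total = hint.unwrap_or_else(|| count(text));
    if total <= budget {
        return text.to_string();
    }
    // Naive stand-in truncation for the sketch: keep the first `budget` words.
    text.split_whitespace().take(budget).collect::<Vec<_>>().join(" ")
}

fn check_property(text: &str) {
    let wc = |s: &str| s.split_whitespace().count();
    let actual = wc(text);
    for budget in 1..20 {
        assert_eq!(
            truncate(text, budget, None, wc),
            truncate(text, budget, Some(actual), wc),
            "None and Some(actual) diverged at budget {budget}"
        );
    }
}
```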

perf: eliminate redundant tokenization and binary search allocation churn (#28)

B4: Add `known_token_count` parameter to `truncate_to_token_budget` so
the cascade loop can pass its already-computed count, skipping the
redundant fast-path tokenization of the full text.

B5: Replace per-iteration `lines[..mid].join("\n")` in the binary search
with a single pre-joined string and byte-offset index. Each iteration
now slices from the joined string instead of allocating O(mid) work,
reducing total allocation from O(N log N) to O(N).

Also adds 3 new unit tests for the known_token_count parameter and a
dedicated criterion benchmark group for truncation at 100-5000 lines.
language: Language,
token_budget: usize,
count_tokens: F,
known_token_count: Option<usize>,
Owner Author

Issue: Public API signature change breaks backward compatibility

The function truncate_to_token_budget is exported publicly via pub use in lib.rs (line 41). Adding this known_token_count: Option<usize> parameter changes the signature from 4 to 5 arguments, which is a breaking change for external consumers of rskim-core on crates.io.

Recommended fix (choose one):

Option A - Backward-compatible wrapper (preferred):
Keep the 4-param public API and add a new 5-param variant truncate_to_token_budget_with_hint(); have the 4-param version call the 5-param variant with None.

Option B - Make it internal:
If external consumers aren't intended, reduce visibility to pub(crate) and remove the pub use from lib.rs:41.

Since rskim-core is published on crates.io, this needs to be addressed before merge.
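A minimal sketch of Option A. The signatures are simplified for illustration (the `Language` parameter is omitted, the return type is a plain `String` rather than the crate's actual `Result`, and the truncation body is elided), so this shows the API shape only, not rskim-core's implementation.

```rust
// New 5-parameter variant carrying the token-count hint.
pub fn truncate_to_token_budget_with_hint(
    text: &str,
    token_budget: usize,
    count_tokens: impl Fn(&str) -> usize,
    known_token_count: Option<usize>,
) -> String {
    // Use the caller-provided count when available; otherwise tokenize.
    let total = known_token_count.unwrap_or_else(|| count_tokens(text));
    if total <= token_budget {
        return text.to_string();
    }
    // ... truncation logic elided in this sketch ...
    String::new()
}

// The existing 4-parameter API keeps its signature and delegates with None,
// so external consumers on crates.io see no breaking change.
pub fn truncate_to_token_budget(
    text: &str,
    token_budget: usize,
    count_tokens: impl Fn(&str) -> usize,
) -> String {
    truncate_to_token_budget_with_hint(text, token_budget, count_tokens, None)
}
```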


Reviewed by: Architecture & Regression (HIGH, 90% confidence)

Language::TypeScript,
*budget,
word_count,
None,
Owner Author

Issue: Benchmark doesn't test the Some(count) optimization path (B4)

The B4 optimization (skip redundant tokenization) only executes when known_token_count is Some(count), but this benchmark always passes None. This means:

  • Benchmark only measures B5 (pre-join allocation savings), not B4
  • The B4 optimization has no regression guard
  • Future changes that break the Some fast-path won't be caught

Recommended fix: Add a second benchmark variant that passes Some(total) to measure the B4 improvement:

let total = word_count(&text);
// Add new benchmark group exercising Some(total) path:
group.bench_with_input(
    BenchmarkId::new("known/lines", num_lines),
    &(text, budget, total),
    |b, (input, budget, total)| {
        b.iter(|| {
            truncate_to_token_budget(
                black_box(input), Language::TypeScript,
                *budget, word_count, Some(*total),
            ).unwrap()
        })
    },
);

This provides complete regression coverage for both B4 and B5 optimizations.


Reviewed by: Tests & Performance (MEDIUM, 85% confidence)

}

#[test]
fn test_token_budget_known_count_returns_early_when_within_budget() {
Owner Author


Issue: Test doesn't verify fast-path execution

The test test_token_budget_known_count_returns_early_when_within_budget asserts the return value is correct, but it doesn't verify that count_tokens was actually skipped when known_token_count is Some(count). The test passes word_count as the counter, which would return the same result regardless of whether it was called.

This means if someone broke the B4 optimization (e.g., removed the unwrap_or_else short-circuit), this test would still pass.

Recommended fix: Use a "poison" counter that proves the fast-path was taken:

#[test]
fn test_token_budget_known_count_returns_early_when_within_budget() {
    let text = "line one\nline two\nline three\n";
    
    // Use a poison counter that panics if called on the full text
    let poison_counter = |s: &str| -> usize {
        if s == text {
            panic!("count_tokens should NOT be called when known_token_count is Some");
        }
        s.split_whitespace().count()
    };
    
    let result = truncate_to_token_budget(
        text, Language::TypeScript, 100, poison_counter, Some(5)
    ).unwrap();
    
    assert_eq!(result, text, "Fast-path should return text unchanged");
}

If the fast-path is working, the poison counter is never called. If it's broken, the test panics.


Reviewed by: Tests (MEDIUM, 85% confidence)

{
// Fast path: if text already fits, return unchanged
if count_tokens(text) <= token_budget {
// Fast path: if text already fits, return unchanged (B4: skip recount
Owner Author


Issue: Inline ticket references (B4, B5) deviate from codebase style

Comments like "// B4: skip recount" and similar ticket shorthand use codes that won't be meaningful to future contributors. The rest of the codebase uses descriptive prose without ticket codes (e.g., "// Fast path: if text already fits, return unchanged").

This also appears at line 376 ("// B5: Pre-join...") and in test headers ("// known_token_count tests (B4)").

Recommended fix: Replace with descriptive comments that explain the optimization without external context:

Line 353:

// Fast path: when caller provides a known token count from the cascade loop,
// skip redundant full-text tokenization and use the pre-computed count instead.

Line 376:

// Pre-compute line byte-offsets once to avoid allocating O(N log N) intermediate
// strings during binary search. Each iteration slices from the pre-joined string.

Test section header:

// Tests for known_token_count parameter optimization

This keeps the codebase consistent and makes the intent clear without requiring external knowledge of the optimization tickets.


Reviewed by: Consistency (MEDIUM, 80% confidence)

@dean0x
Owner Author

dean0x commented Mar 14, 2026

Code Review Summary: PR #29

Branch: perf/28-token-budget-truncation → main
Commit: 8025892 (perf: eliminate redundant tokenization and binary search allocation churn)
Review Date: 2026-03-14


Overview

This PR introduces two targeted performance optimizations to token budget truncation:

  • B4: Skip redundant tokenization via known_token_count parameter
  • B5: Pre-join string + byte-offset index to eliminate O(N log N) allocation churn

Inline Comments Posted: 4 high-confidence blocking issues (≥80% confidence)


Findings by Severity

BLOCKING ISSUES (Address Before Merge)

| Severity | Category | Finding | Reviewer(s) | Confidence |
|---|---|---|---|---|
| HIGH | Architecture | Public API signature change breaks backward compatibility for rskim-core consumers on crates.io. Adding a required known_token_count parameter violates semver. | Architecture, Regression | 90% |
| MEDIUM | Testing | Benchmark only tests the None path for known_token_count; the B4 optimization has no regression guard. | Tests, Performance | 85% |
| MEDIUM | Testing | Test doesn't verify fast-path execution; would pass even if the B4 optimization were broken. | Tests | 85% |
| MEDIUM | Style | Inline ticket references (B4, B5) in comments deviate from codebase conventions; use prose instead. | Consistency | 80% |

Action: All 4 issues have detailed inline comments with recommended fixes.


Lower-Confidence Findings (60-79%) → Suggestions

1. last_token_count Initialized to 0 (Rust, Architecture, Security, Performance - MEDIUM)

Multiple reviewers noted that initializing last_token_count to 0 is semantically misleading because 0 is a valid token count (for empty strings). If the guard at line 319 (last_output.is_empty()) is ever removed, Some(0) would bypass truncation.

Current Status: Safe due to upstream guard, but fragile.

Suggestion: Use Option<usize> instead:

let mut last_token_count: Option<usize> = None;
// ... assign Some(token_count) only when a mode produces output

Confidence: 65-75% (Rust reviewer marked as MEDIUM BLOCKING, others as defensive concern)


2. known_token_count Parameter Is Trusted Without Validation (Architecture, Security - MEDIUM)

The function accepts Some(count) without verifying it matches actual token count. If a caller passes Some(5) for text with 100 tokens, the fast-path returns the full text unchecked.

Current Status:

  • Low risk in practice (only internal call site, pre-computed from tokens::count_tokens)
  • But this is a public API on crates.io with no validation
  • Doc comment at lines 331-333 documents the contract; external callers bear responsibility

Suggestion: Add a debug assertion to catch mismatches during development (Architecture review provides code example).

Confidence: 70% (Architecture and Security mark as MEDIUM BLOCKING; defensive improvement)
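One possible shape for the suggested guard (the Architecture review's own code example is not reproduced in this thread, so the helper name and signature here are hypothetical):

```rust
// In debug builds, verify that a caller-provided count matches what the
// tokenizer actually reports; in release builds the check compiles out,
// preserving the B4 fast-path.
fn resolve_token_count(
    text: &str,
    known_token_count: Option<usize>,
    count_tokens: impl Fn(&str) -> usize,
) -> usize {
    match known_token_count {
        Some(count) => {
            // Note: this re-tokenizes in debug builds by design — the point
            // is to catch stale hints during development and testing.
            debug_assert_eq!(
                count,
                count_tokens(text),
                "known_token_count does not match actual token count"
            );
            count
        }
        None => count_tokens(text),
    }
}
```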


3. Binary Search Tokenization Cost Dominates (Performance - MEDIUM)

While B5 eliminates allocation churn from join(), the binary search still calls count_tokens(&candidate) at every iteration. For tiktoken (real BPE tokenizer), per-iteration tokenization is O(N) and dominates the O(log N) search cost overall.

Current Status: B5 optimization is still beneficial for word-counting (micro-benchmark), but production path is limited by tokenizer cost.

Suggestion: This is an inherent limitation of binary-search-on-token-count. A more radical future optimization would build a cumulative token-count array (one pass of tokenization, then prefix-sum queries), reducing total tokenization work from O(N log N) to O(N) tokens.

Confidence: 70% (Acknowledged by Performance reviewer as "not blocked" but future optimization)
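The suggested future direction can be sketched as follows. The function name is illustrative, and the sketch assumes per-line token counts sum to the whole-text count, which holds exactly for word counting but only approximately for a BPE tokenizer like tiktoken:

```rust
// Tokenize each line once, build a prefix-sum array of cumulative token
// counts, then find the cutoff with a single search over the sums —
// O(N) total tokenization instead of O(N log N).
fn max_lines_within_budget(
    lines: &[&str],
    budget: usize,
    count_tokens: impl Fn(&str) -> usize,
) -> usize {
    // prefix[k] = total tokens in the first k lines
    let mut prefix = Vec::with_capacity(lines.len() + 1);
    prefix.push(0usize);
    for line in lines {
        let last = *prefix.last().unwrap();
        prefix.push(last + count_tokens(line));
    }
    // Prefix sums are non-decreasing, so partition_point finds the first
    // entry exceeding the budget; the index before it is the answer.
    prefix.partition_point(|&t| t <= budget) - 1
}
```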


4. truncate_to_token_budget Function Length Approaching Threshold (Complexity - MEDIUM)

At ~50 lines (excluding comments), the function is at the upper boundary of acceptable complexity. It handles three distinct concerns: fast-path early return, pre-join index construction, and binary search with output assembly.

Suggestion: Extract byte-offset index builder into a small helper function (Complexity review provides code example). This would reduce main function to ~40 lines and make optimization independently testable.

Confidence: 65% (Complexity marks as MEDIUM "Should Fix", not blocking)
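One way the suggested extraction could look (the helper name is hypothetical; the Complexity review's own example is not reproduced in this thread):

```rust
// Build byte offsets such that joined[..offsets[k]] equals the first k
// lines joined with '\n', letting the binary search slice the pre-joined
// string instead of re-joining on every iteration.
fn line_prefix_offsets(lines: &[&str]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(lines.len() + 1);
    offsets.push(0usize);
    let mut end = 0usize;
    for (i, line) in lines.iter().enumerate() {
        if i > 0 {
            end += 1; // '\n' separator between lines
        }
        end += line.len();
        offsets.push(end);
    }
    offsets
}
```

Factoring this out would also let the offset invariant be unit-tested independently of the truncation logic.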


Summary Table: All Issues by Category

| Category | CRITICAL | HIGH | MEDIUM | LOW | Total |
|---|---|---|---|---|---|
| Blocking (≥80%) | 0 | 1 | 3 | 0 | 4 |
| Should Fix (60-79%) | 0 | 0 | 4 | 0 | 4 |
| Pre-existing | 0 | 0 | 2 | 4 | 6 |

Review Scores

| Reviewer | Score | Recommendation |
|---|---|---|
| Architecture | 7/10 | CHANGES_REQUESTED |
| Rust | 8/10 | APPROVED |
| Performance | 7/10 | APPROVED_WITH_CONDITIONS |
| Tests | 7/10 | APPROVED_WITH_CONDITIONS |
| Regression | 8/10 | APPROVED_WITH_CONDITIONS |
| Security | 9/10 | APPROVED_WITH_CONDITIONS |
| Complexity | 8/10 | APPROVED |
| Consistency | 8/10 | APPROVED_WITH_CONDITIONS |

Overall: Strong optimizations with good test coverage. Primary blocker is the public API breaking change, which has two straightforward solutions (backward-compatible wrapper or reduce visibility).


What This PR Does Well

✅ Well-motivated optimizations (skip redundant tokenization, eliminate allocation churn)
✅ All 103+ tests pass; 3 new tests for known_token_count parameter
✅ Property test verifies None and Some(actual) produce identical output
✅ Benchmark added to track B5 optimization (though missing B4 variant)
✅ Commit message accurately describes both optimizations (B4, B5 labels in code)
✅ All in-repo call sites updated correctly
✅ No unsafe Rust; no new dependencies; no security issues


Recommendation

Status: APPROVED_WITH_CHANGES

Conditions:

  1. Address the 4 blocking inline comments before merge
  2. Optionally implement 4 "should fix" suggestions (token counting guards, complexity reduction, etc.)

The optimizations are sound and well-tested. The blocking issues are straightforward to address and do not indicate deeper problems with the implementation.


Review consolidation by Claude Code | Deduplicated 8 review reports | 4 inline comments + summary

Dean Sharon added 2 commits March 14, 2026 23:18
  • Change last_token_count from usize to Option<usize> for type safety (#6)
  • Add debug_assert! to validate known_token_count matches actual (#2)
  • Replace B4/B5 ticket shorthand with descriptive comments (#5)
  • Strengthen fast-path test with call-counting poison counter (#3)
  • Add lines_known benchmark variant exercising Some(total) path (#4)
@dean0x dean0x merged commit 5031fc9 into main Mar 15, 2026
5 checks passed
@dean0x dean0x deleted the perf/28-token-budget-truncation branch March 15, 2026 07:16