perf: skip no-op merge passes in analysis pipeline #119

Open
dbwls99706 wants to merge 1 commit into chenglou:main from dbwls99706:perf/skip-no-op-merge-passes

@dbwls99706

Summary

The six post-segmentation passes in buildMergedSegmentation unconditionally allocate and populate new output arrays even when the input contains no patterns that would trigger a merge or split. This adds a linear early-exit guard at the top of each function that returns the input segmentation unchanged when the relevant pattern is absent.

Each guard is intentionally a cheap necessary-condition scan: if the guard returns false, the pass cannot produce any change. A guard may return true when no actual merge happens (false positive), but it will never skip a pass that would have produced a change (no false negative).
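The guard contract described above can be sketched as follows. This is only an illustration, not the code in src/analysis.ts: the `Segmentation` shape, `mayNeedPass`, and `runPassWithGuard` are hypothetical names, and the field types are assumed from the four arrays named in this description.

```typescript
// Hypothetical shape, assumed from the array names in this PR description;
// the real type in src/analysis.ts may differ.
type Segmentation = {
  texts: string[]
  isWordLike: boolean[]
  kinds: string[]
  starts: number[]
}

// Necessary-condition scan: returns true as soon as any segment *might*
// trigger the pass. It may over-report (false positive) but must never
// return false when the pass would change something.
function mayNeedPass(
  seg: Segmentation,
  candidate: (text: string) => boolean,
): boolean {
  for (let i = 0; i < seg.texts.length; i++) {
    if (candidate(seg.texts[i]!)) return true // early break on first match
  }
  return false
}

// Wraps a pass so the no-op case returns the input object itself,
// without allocating or copying any arrays.
function runPassWithGuard(
  seg: Segmentation,
  candidate: (text: string) => boolean,
  pass: (seg: Segmentation) => Segmentation,
): Segmentation {
  if (!mayNeedPass(seg, candidate)) return seg
  return pass(seg)
}
```

In this sketch, a false positive only costs one extra scan before the pass runs as it did before, which is why the guards can stay cheap and conservative.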

Motivation

For text that never hits a given pass — e.g. pure CJK has no URLs, no numeric runs, no ASCII punctuation chains, no hyphenated numbers — each pass still copies four arrays (texts, isWordLike, kinds, starts) element-by-element to produce identical output. The guards are O(n) with early break on first match, and in the no-op cases targeted here, they are cheaper than allocating and populating replacement arrays.

Note: carryTrailingForwardStickyAcrossCJKBoundary is the exception. Its guard looks for adjacent CJK text pairs, so it fires on CJK input and the pass still runs there; the CJK improvement therefore comes primarily from the other five guards. This guard instead benefits non-CJK text, which would otherwise pay for .slice() copies without any carries to perform.

Changes

  • src/analysis.ts — added early-exit guards to 6 internal functions:
    • mergeUrlLikeRuns: skip when no URL-like run starts exist
    • mergeUrlQueryRuns: skip when no URL query boundary segments exist (conservative — isUrlQueryBoundarySegment already requires :// or www. prefix, so the guard is at least as wide as the actual merge condition)
    • mergeNumericRuns: skip when no numeric run segments with decimal digits exist
    • mergeAsciiPunctuationChains: skip when no trailing-joiner wordlike text is followed by another wordlike text (necessary condition for the inner while loop to merge anything)
    • splitHyphenatedNumericRuns: skip when no text contains both - and a decimal digit
    • carryTrailingForwardStickyAcrossCJKBoundary: skip when no adjacent CJK text pairs exist
  • No changes to existing merge/split logic — guards only add an early return path
  • No public API changes, no layout() hot path changes
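As an illustration of one of the guards listed above, here is a minimal sketch of the splitHyphenatedNumericRuns condition ("no text contains both - and a decimal digit"). The helper names are hypothetical, and the sketch assumes ASCII digits only; the actual guard in src/analysis.ts may use a broader digit test.

```typescript
// Hypothetical helper: true if one segment text contains both '-' and a digit.
// Assumption: only ASCII digits 0-9 are checked here.
function hasHyphenAndDigit(text: string): boolean {
  let sawHyphen = false
  let sawDigit = false
  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i)
    if (c === 0x2d) sawHyphen = true // '-'
    else if (c >= 0x30 && c <= 0x39) sawDigit = true // '0'..'9'
    if (sawHyphen && sawDigit) return true
  }
  return false
}

// Hypothetical guard: the pass can be skipped when this returns false.
function anyHyphenatedNumericCandidate(texts: readonly string[]): boolean {
  return texts.some(hasHyphenAndDigit)
}
```

A text such as "x-ray2" satisfies this check whether or not the real pass would split it. That is the kind of false positive the guards accept: the pass simply runs as before.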

Benchmark

Environment: Windows 11, Bun 1.3.11, fake canvas backend.
Method: analyzeText() × 5000 iters, trimmed mean of 20 rounds, alternating patched (P) / original (O) across 3 independent process pairs.
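For readers unfamiliar with the statistic, a trimmed mean over the per-round timings could be computed roughly as below. The trim fraction is not stated in this PR; the 10% default and the `trimmedMean` helper are assumptions for illustration only.

```typescript
// Hypothetical helper: drops the lowest and highest `trimFraction` of the
// samples, then averages the rest. The 0.1 default is an assumption.
function trimmedMean(samples: readonly number[], trimFraction = 0.1): number {
  if (samples.length === 0) throw new Error("no samples")
  if (trimFraction < 0 || trimFraction >= 0.5) {
    throw new Error("trimFraction must be in [0, 0.5)")
  }
  const sorted = [...samples].sort((a, b) => a - b)
  const drop = Math.floor(sorted.length * trimFraction)
  const kept = sorted.slice(drop, sorted.length - drop)
  return kept.reduce((sum, x) => sum + x, 0) / kept.length
}
```

Trimming both tails reduces the influence of outlier rounds (for example a GC pause or a scheduler hiccup) on the reported number.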

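No-pattern text (guards skip the passes; for the Chinese inputs, carryTrailingForwardStickyAcrossCJKBoundary still runs, as noted under Motivation):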

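| Text | P1 | O1 | P2 | O2 | P3 | O3 |
| --- | --- | --- | --- | --- | --- | --- |
| Chinese 150c (µs) | 122.3 | 150.9 | 138.0 | 117.8 | 112.5 | 121.1 |
| English 150c (µs) | 39.0 | 44.4 | 38.5 | 41.2 | 34.6 | 39.8 |
| Long Chinese 1500c (µs) | 1176.4 | 1226.0 | 1231.1 | 1212.9 | 1050.5 | 1274.7 |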

Pattern-heavy text (guards pass through to existing logic):

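| Text | P1 | O1 | P2 | O2 | P3 | O3 |
| --- | --- | --- | --- | --- | --- | --- |
| AllPatterns (µs) | 51.9 | 54.4 | 45.7 | 42.7 | 43.2 | 50.6 |
| URLs (µs) | 31.2 | 36.4 | 28.7 | 28.7 | 27.7 | 34.2 |
| AppText (µs) | 43.4 | 35.3 | 41.2 | 45.0 | 40.4 | 49.0 |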

Bottom line: In this local benchmark, English plain prose consistently improved across all 3 pairs (~10% in the 150c case). Pattern-free CJK inputs showed improvement in most pairs. No consistent regression was observed on pattern-heavy inputs.
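The "~10%" figure for English can be reproduced from the reported numbers. The snippet below is just the arithmetic, using the three patched/original pairs from the English 150c row; the helper name is illustrative.

```typescript
// Values copied from the "English 150c" row above, as [patched, original] in µs.
const englishPairs: ReadonlyArray<readonly [number, number]> = [
  [39.0, 44.4],
  [38.5, 41.2],
  [34.6, 39.8],
]

// Fraction of the original time saved by the patched build.
function relativeImprovement(patched: number, original: number): number {
  return (original - patched) / original
}

const perPair = englishPairs.map(([p, o]) => relativeImprovement(p, o))
const meanImprovement = perPair.reduce((sum, x) => sum + x, 0) / perPair.length

console.log(perPair.map((x) => (x * 100).toFixed(1) + "%"))
console.log("mean: " + (meanImprovement * 100).toFixed(1) + "%")
```

The three pairs come out at roughly 12%, 7%, and 13%, averaging about 10.6%.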

Test plan

  • bun test — 84 tests pass, 0 fail
  • Benchmark: pattern-free text shows improvement in most pairs, no consistent regression on pattern-heavy text
  • Browser benchmark verification on macOS (not run: macOS is not available to the contributor)
