Everything we tried, measured, and learned while building this library.
For the current compact browser-accuracy / benchmark snapshot, see STATUS.md.
For the current compact corpus / sweep snapshot, see corpora/STATUS.md.
For the shared mismatch vocabulary, see corpora/TAXONOMY.md.
This log is historical. The current practical steering picture is:
- Japanese has two real canaries (`羅生門`, `蜘蛛の糸`), both clean at anchor widths and both still exposing a small positive one-line field on broader Chrome sweeps.
- Chinese has two long-form canaries (`祝福`, `故鄉`) showing the same broad Chrome-positive / Safari-clean split, with real font sensitivity between `Songti SC` and `PingFang SC`.
- Myanmar still has two real canaries with residual Chrome/Safari disagreement around quote/follower-style classes, so it remains the main unresolved Southeast Asian frontier.
- Urdu has a real Nastaliq/Naskh canary (`چغد`) with the same narrow-width negative field in Chrome and Safari, so it is clearly a shaping/context class rather than dirty data or a browser-only quirk. It remains parked rather than actively tuned.
- Arabic coarse corpora are clean; the remaining work there is mostly a fine-width edge-fit class, not the old preprocessing/corpus-hygiene problems.
- Mixed app text still matters because it catches product-shaped classes that books miss, especially soft-hyphen and extractor-sensitive cases.
When UI components independently measure text heights with DOM reads like getBoundingClientRect(), each read can force synchronous layout. If those reads interleave with writes, the browser can end up relaying out the whole document repeatedly.
The goal here was always the same:
- do the expensive text work once in `prepare()`
- keep `layout()` arithmetic-only
- make resize-driven relayout cheap and coordination-free
Canvas measureText() avoids DOM layout. It goes straight to the browser's font engine.
That led to the basic two-phase model:
- `prepare(text, font)`: segment text, measure segments, cache widths
- `layout(prepared, maxWidth, lineHeight)`: walk cached widths with pure arithmetic
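The two-phase split can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the `Prepared` shape, the whitespace-based segmentation, and the injected `measure` callback (standing in for a canvas `measureText()` wrapper) are all assumptions made so the sketch runs outside a browser.

```typescript
// Hypothetical shapes for illustration; not the library's actual types.
type Measure = (segment: string) => number;

interface Prepared {
  widths: number[];   // cached width of each segment
  isSpace: boolean[]; // whether each segment is collapsible whitespace
}

// Phase 1: segment and measure once. `measure` is injected so the sketch runs
// without a canvas; in a browser it could wrap ctx.measureText(s).width.
function prepare(text: string, measure: Measure): Prepared {
  const segments = text.match(/\s+|\S+/g) ?? [];
  const cache = new Map<string, number>();
  const widths: number[] = [];
  const isSpace: boolean[] = [];
  for (const segment of segments) {
    let width = cache.get(segment);
    if (width === undefined) {
      width = measure(segment);
      cache.set(segment, width);
    }
    widths.push(width);
    isSpace.push(/^\s/.test(segment));
  }
  return { widths, isSpace };
}

// Phase 2: greedy line breaking as pure arithmetic over cached widths.
function layout(prepared: Prepared, maxWidth: number, lineHeight: number) {
  let lineCount = 0;
  let x = 0; // pen position on the current line
  for (let i = 0; i < prepared.widths.length; i++) {
    const width = prepared.widths[i];
    if (prepared.isSpace[i]) {
      // Spaces advance the pen but never force a break, so trailing spaces hang.
      if (x > 0) x += width;
      continue;
    }
    if (lineCount === 0) {
      lineCount = 1;
    } else if (x > 0 && x + width > maxWidth) {
      lineCount++;
      x = 0;
    }
    x += width;
  }
  return { lineCount, height: lineCount * lineHeight };
}
```

On resize, only `layout()` needs to run again with the new `maxWidth`; the `Prepared` value is reused as-is.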
That architecture held up. The broad browser sweeps are now clean in Chrome, Safari, and Firefox, and the hot layout() path is still the core product win.
Several alternatives were tried and rejected:
- measuring full candidate lines as strings during `layout()`
- moving measurement into hidden DOM elements during `prepare()`
- using SVG `getComputedTextLength()`
The pattern was consistent:
- they either reintroduced DOM reads
- or they were slower than the current two-phase model
- or they looked cleaner locally but regressed the actual benchmark path
The important keep was architectural, not algorithmic:
- `layout()` stayed arithmetic-only on cached widths
Canvas and DOM resolve system-ui to different font variants on macOS at certain sizes:
Machine-readable scan:
In the recorded scan, mismatches clustered at 10-12px, 14px, and 26px.
13px, 15-25px, and 27-28px were exact.
macOS uses SF Pro Text at smaller sizes and SF Pro Display at larger sizes. Canvas and DOM switch between them at different thresholds.
Practical conclusion:
- use a named font if accuracy matters
- keep `system-ui` documented as unsafe
- if we ever support it properly, the believable path is a narrow prepare-time DOM fallback for detected bad tuples
What did not look trustworthy enough:
- lookup tables
- naive scaling
- guessed resolved-font substitution
Canvas is internally consistent enough that summing measured segments works very well, but not perfectly. Over a full paragraph, tiny adjacency differences can accumulate into a line-edge error.
The keeps were small and semantic:
- merge punctuation into the preceding word before measuring
- let trailing collapsible spaces hang instead of forcing a break
What did not survive:
- full-string verification in `layout()`
- uniform rescaling
- generic pair-level correction models
The broad lesson was that local semantic preprocessing paid off more than clever runtime correction.
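As an illustration of the first keep, merging punctuation into the preceding word before measuring might look like the sketch below. The helper name and the small closing-punctuation set are assumptions for the example; the library's actual rules are broader and are not reproduced here.

```typescript
// Hypothetical helper: glue a punctuation-only segment onto the word before it,
// so the pair is measured as one unit instead of two summed widths.
// The character set is a small illustrative subset, not the library's rule set.
const TRAILING_PUNCTUATION = /^[.,;:!?)\]}]+$/;

function mergeTrailingPunctuation(segments: string[]): string[] {
  const out: string[] = [];
  for (const segment of segments) {
    const prev = out[out.length - 1];
    if (
      prev !== undefined &&
      TRAILING_PUNCTUATION.test(segment) &&
      !/^\s+$/.test(prev) // never glue punctuation onto a whitespace segment
    ) {
      out[out.length - 1] = prev + segment;
    } else {
      out.push(segment);
    }
  }
  return out;
}
```

Because the merge happens before measurement, the cached width of the merged unit already reflects how the engine measures the word and its punctuation together, and `layout()` needs no correction step.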
text-shaper was useful reference material, especially for Unicode coverage and bidi ideas, but not a replacement for the current browser-facing model.
What was worth taking:
- broader Unicode coverage, e.g. missing CJK extension blocks
What was not worth taking:
- its segmentation as a runtime replacement for `Intl.Segmenter`
- its paragraph breaker as a substitute for browser-parity layout
Bottom line:
- good reference material
- wrong runtime center of gravity for this repo
The smallest honest second whitespace mode turned out to be:
- preserve ordinary spaces
- preserve `\n` hard breaks
- preserve tabs with default browser-style tab stops
- leave the other wrapping defaults alone
That became:
{ whiteSpace: 'pre-wrap' }
What mattered:
- preserved spaces still hang at line end
- consecutive hard breaks keep empty lines
- a trailing final hard break does not invent an extra empty line
- tabs advance to the next default browser tab stop from the current line start
The mode now covers the textarea-like cases we cared about, and the broad browser sweeps plus the dedicated pre-wrap oracle are green.
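The hard-break and tab-stop rules above can be sketched as plain arithmetic. The function names are hypothetical, and the default of 8 space widths per tab stop is taken from the CSS `tab-size` initial value rather than from this library's source.

```typescript
// Split text on hard breaks. Consecutive "\n" keep their empty lines, but a
// single trailing "\n" only terminates the last line and adds no empty one.
function splitHardBreaks(text: string): string[] {
  const lines = text.split("\n");
  if (lines.length > 1 && lines[lines.length - 1] === "") lines.pop();
  return lines;
}

// Advance the pen from x (measured from the line start) to the next tab stop.
// Assumes tab stops every `tabSize` space widths; 8 is the CSS default.
function advanceToTabStop(x: number, spaceWidth: number, tabSize = 8): number {
  const interval = spaceWidth * tabSize;
  return (Math.floor(x / interval) + 1) * interval;
}
```

A tab that already sits exactly on a stop still advances to the following stop, which is why the formula adds one after flooring.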
One important tooling lesson also came out of this:
- keep a small permanent oracle suite
- justify it once with a broader brute-force validation pass
- do not keep the brute-force pass forever once it has done its job
Chrome and Firefox on macOS can measure emoji wider in canvas than in DOM at small sizes. Safari does not share the same discrepancy.
What held up:
- detect the discrepancy by comparing canvas emoji width against actual DOM emoji width per font
- cache that correction
- keep it outside the hot layout path
This is now one of the small browser-profile shims that is actually justified.
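A minimal sketch of that shim, assuming two injected probes: `canvasWidth` standing in for a canvas `measureText()` call and `domWidth` for a DOM measurement of the same emoji. Both names, the factory, and the single sample emoji are hypothetical.

```typescript
// Hypothetical probe signature: width of `text` rendered in `font`.
type WidthProbe = (font: string, text: string) => number;

// Returns a per-font lookup for how much wider canvas measures an emoji than
// the DOM does. The probes run once per font; later calls hit the cache.
function createEmojiCorrection(canvasWidth: WidthProbe, domWidth: WidthProbe) {
  const cache = new Map<string, number>();
  return function correctionFor(font: string): number {
    const cached = cache.get(font);
    if (cached !== undefined) return cached;
    const sample = "😀"; // single sample emoji; an assumption for this sketch
    const correction = canvasWidth(font, sample) - domWidth(font, sample);
    cache.set(font, correction);
    return correction;
  };
}
```

The correction would be applied while caching segment widths during preparation, so the layout arithmetic never sees the probe.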
We briefly kept a headless HarfBuzz backend in the repo for server-side measurement probes.
What it taught us:
- it was useful for research and algorithm probes
- it was not close enough to our active browser-grounded path to justify keeping it in the main repo
- isolated Arabic words in that probe path needed explicit LTR direction to avoid misleading widths
So if HarfBuzz comes up again later, treat it as explored territory:
- useful as a research reference
- not the runtime direction for Pretext
- not a substitute for browser-oracle or browser-canvas validation
The last browser mismatches were not fixed by moving more work into layout(). That regressed the hot path and was reverted.
What actually held up:
- better preprocessing in `prepare()`
- better browser diagnostics pages and scripts
- a tiny browser-specific line-fit tolerance
What did not change:
- `layout()` stayed arithmetic-only
That remains the right center of gravity for the project.
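A line-fit tolerance of this kind keeps the hot path as a single comparison. The sketch below is an assumption about its shape only; the function name is hypothetical and the actual per-browser tolerance values are not stated in this log.

```typescript
// Hypothetical fit check: a candidate line fits if it overshoots maxWidth by
// no more than `tolerance`. A tolerance of 0 is the strict comparison.
function fitsOnLine(lineWidth: number, maxWidth: number, tolerance = 0): boolean {
  return lineWidth <= maxWidth + tolerance;
}
```

The tolerance would be chosen once per browser profile, outside `layout()`, so the per-line cost stays one addition and one comparison.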
Arabic took several passes, but the pattern is clearer now.
What survived:
- merge no-space Arabic punctuation clusters during `prepare()`
  - e.g. `فيقول:وعليك`, `همزةٌ،ما`
- treat Arabic punctuation-plus-mark clusters like `،ٍ` as left-sticky too
- split `" " + combining marks` into plain space plus marks attached to the following word
- use normalized slices and the exact corpus font during probe work
- trust the better RTL diagnostics path instead of reconstructing offsets from rendered line text
- clean obvious corpus/source artifacts instead of inventing new engine rules for them
- allow a tiny non-Safari line-fit tolerance bump for the remaining positive fine-width field
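The space-plus-marks split in the list above could be sketched like this. It is an illustration under assumptions: the helper name is hypothetical, and the mark range covers only the common Arabic harakat (U+064B to U+065F, plus U+0670), not every combining mark.

```typescript
// Matches a segment made of spaces followed only by Arabic combining marks.
// The range is an illustrative subset of combining marks, not a full list.
const SPACE_THEN_MARKS = /^( +)([\u064B-\u065F\u0670]+)$/;

// Hypothetical helper: turn " " + marks into a plain space segment and move
// the marks onto the front of the following segment.
function reattachLeadingMarks(segments: string[]): string[] {
  const out: string[] = [];
  let carried = "";
  for (const segment of segments) {
    const match = SPACE_THEN_MARKS.exec(segment);
    if (match) {
      out.push(match[1]);  // the plain space stays where it was
      carried += match[2]; // the marks wait for the next segment
      continue;
    }
    out.push(carried + segment);
    carried = "";
  }
  if (carried) out.push(carried); // no following word: keep the marks as-is
  return out;
}
```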
What did not survive:
- pair correction models at segment boundaries
- larger Arabic run-slice width models
- broad phrase-level heuristics derived from one good-looking probe
Those failed for the same reason in different sizes:
- pair corrections were too local to move the real misses
- run-slice widths were much heavier and still did not move the hard widths enough
- both made `prepare()` or `layout()` materially worse without buying a clean Arabic field
So the useful guardrail is:
- if an Arabic idea starts by adding more shaping-aware width caches inside the current segment-sum architecture, be skeptical early
- the Arabic keeps so far have been preprocessing, corpus cleanup, diagnostics, and tiny tolerance shims, not richer width-cache models
Current read:
- Arabic coarse corpora are healthy
- the remaining work is much narrower now
- the unresolved class looks like a mix of fine-width edge-fit and shaping/context, not another obvious preprocessing hole
Once the main browser sweep became a regression gate, the long-form corpora became the real steering canaries.
The mixed app text corpus is the most product-shaped canary.
What it has been good for:
- URL/query-string handling
- escaped quote clusters
- numeric expressions like `२४×७`
- time ranges like `7:00-9:00`
- emoji ZWJ runs
- manual soft hyphens
Important keep:
- model URL/query strings as narrow structured units, not one giant breakable blob
Current status:
- almost entirely clean
- one remaining extractor-sensitive soft-hyphen miss around `710px` still looks paragraph-scale or accumulation-sensitive rather than like a neat local bug
Thai exposed a product-shaped ASCII quote issue more than a dictionary-segmentation failure.
The keep:
- contextual ASCII quote glue during preprocessing
Result:
- two Thai prose corpora are healthy at anchor widths
- maintained step10 sweeps stayed clean enough that Thai now looks broader than one lucky story
Khmer broadened the Southeast Asian class without immediately demanding new engine work.
The keep:
- preserve explicit zero-width separators from the source text
Result:
- anchor widths and the maintained step10 sweep were clean enough to keep Khmer as a real canary
The Lao corpus attempt was a source problem, not an engine problem.
The raw text was wrapped print/legal text, which made it a dirty white-space: normal canary. We rejected it instead of normalizing nonsense into the repo.
Myanmar is still the main unresolved Southeast Asian frontier.
What survived:
- treat `၊` / `။` / `၍` / `၌` / `၏` as left-sticky during preprocessing
- treat `၏` as medial glue in clusters like `ကျွန်ုပ်၏လက်မ`
What did not survive:
- broad Myanmar grapheme breaking in ordinary wrapping
- quote-follower glue like closing-quote + `ဟု`
Current read:
- there are real recurring classes here
- but the obvious tempting heuristics improved one browser and hurt another
- that makes Myanmar a canary, not a license for more instinctive glue rules
Japanese gave us one real semantic keep:
- kana iteration marks like `ゝ` / `ゞ` / `ヽ` / `ヾ` should be treated as CJK line-start-prohibited
What remains:
- a small context-width class around punctuation/quote compression
- good evidence for the exactness ceiling of a width-independent grapheme-sum model in proportional Japanese fonts
So Japanese stays as a canary, not as a place to keep stacking narrow punctuation rules.
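The kana iteration mark keep can be illustrated as a small preprocessing step: glue a line-start-prohibited character onto the unit before it so no break opportunity exists between them. The helper name and the per-character input are assumptions for the sketch; only the four marks named above are included.

```typescript
// The four kana iteration marks named above, treated as line-start-prohibited.
const KANA_ITERATION_MARKS = new Set(["ゝ", "ゞ", "ヽ", "ヾ"]);

// Hypothetical helper: merge each prohibited character into the preceding
// unit, which removes the break opportunity in front of it.
function glueLineStartProhibited(units: string[]): string[] {
  const out: string[] = [];
  for (const unit of units) {
    if (out.length > 0 && KANA_ITERATION_MARKS.has(unit)) {
      out[out.length - 1] += unit;
    } else {
      out.push(unit);
    }
  }
  return out;
}
```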
Chinese is now the clearest active CJK canary.
What we learned:
- Safari is clean on the maintained step10 sweep
- Chrome keeps a broader narrow-width positive field
- the field changes with font choice (`Songti SC` vs `PingFang SC`)
What did not survive:
- carrying closing punctuation forward
- coalescing repeated punctuation runs like `——` or `……`
Current read:
- the remaining Chinese field is real
- it is not another obvious punctuation bug
- it is best treated as a canary for the model’s current exactness ceiling
The first cross-font pass was reassuring:
- Korean, Thai, Khmer, Hindi, Arabic, and Hebrew all stayed exact across the sampled Chrome matrix on this machine
That does not mean font fragility is gone. It just means the next likely surprises are:
- new scripts
- finer width sweeps
- or product-shaped mixed text
The cache used to store just widths. It now stores richer per-segment metrics and computes the more expensive derived facts lazily.
Current useful cached facts include:
- width
- `containsCJK`
- lazily computed emoji count
- lazily computed grapheme widths
That improved repeated prepare() work without moving any live measurement back into layout().
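One way to picture a cache with lazy derived facts is sketched below. The interface, the factory name, the rough CJK range test, and the use of code points in place of true grapheme clusters are all simplifying assumptions, not the library's actual cache.

```typescript
// Hypothetical per-segment record: cheap facts are stored eagerly, the more
// expensive per-grapheme widths are computed on first access and then kept.
interface SegmentMetrics {
  width: number;
  containsCJK: boolean;
  graphemeWidths(): number[];
}

function createMetricsCache(measure: (s: string) => number) {
  const cache = new Map<string, SegmentMetrics>();
  return function metricsFor(segment: string): SegmentMetrics {
    const cached = cache.get(segment);
    if (cached !== undefined) return cached;
    let lazyWidths: number[] | undefined;
    const metrics: SegmentMetrics = {
      width: measure(segment),
      // Rough check over kana and common ideographs; not a full CJK test.
      containsCJK: /[\u3040-\u30ff\u4e00-\u9fff]/.test(segment),
      graphemeWidths() {
        if (lazyWidths === undefined) {
          // Code points approximate graphemes here to keep the sketch small.
          lazyWidths = Array.from(segment).map((g) => measure(g));
        }
        return lazyWidths;
      },
    };
    cache.set(segment, metrics);
    return metrics;
  };
}
```

Segments that never need per-grapheme breaking never pay for the extra measurements.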
Soft hyphen became a real internal break kind instead of ordinary text.
What that bought us:
- unbroken lines keep it invisible
- broken lines can expose a visible trailing `-`
- rich APIs stay aligned with the actual break choice
This was a genuine model improvement, not just a cosmetic API change.
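The visible behavior can be sketched as a rendering rule on a finished line. The helper and its boolean flag are hypothetical; they stand in for whatever break-kind record the engine actually keeps.

```typescript
const SOFT_HYPHEN = "\u00AD";

// Hypothetical helper: produce the visible text of one laid-out line.
// Soft hyphens inside the line stay invisible; a line that was broken at a
// soft hyphen shows a trailing "-".
function renderLine(raw: string, brokeAtSoftHyphen: boolean): string {
  const visible = raw.split(SOFT_HYPHEN).join("");
  return brokeAtSoftHyphen ? visible + "-" : visible;
}
```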
Sebastian’s original prototype already had the right overall instinct:
- words/runs as the unit of caching
- browser-grounded measurement
- streamed greedy line breaking
What changed here was mostly engineering discipline:
- caching
- a clean `prepare()` / `layout()` split
- preprocessing
- browser diagnostics
- and a willingness to keep the hot path simple