Use README.md as the public source of truth for API examples and user-facing limitations. See DEVELOPMENT.md for the current command surface and the canonical dashboards/snapshots to consult before making browser-accuracy or benchmark claims. Use TODO.md for the current priorities. Every time before you commit, ensure you've synced the docs.
Do not change the existing tone of the documents unless they're wrong.
Do bun install if you're in a fresh worktree.
Important: after you're done with a feature, and have enough holistic vision, make sure you do a pass over all the files again and see if you can simplify anything. Don't change things for the sake of, but if there are simplifications, YELL I DID A HOLISTIC PASS AND FOUND SIMPLIFICATIONS with a brief summary.
Important: do NOT monkey-patch. If you found yourself solving the symptom instead of the root cause, reconsider and do a proper fix, then YELL I SOLVED THE ROOT CAUSE NOT THE SYMPTOM with a brief summary.
Changelog updates guideline: don't add dev-facing notes, only user-facing ones. Refer to closed PR numbers.
See DEVELOPMENT.md for the current command surface and packaging/release checks. Keep the higher-level workflow notes below in sync with that command list rather than duplicating it here.
package.json— published entrypoints now targetdist/layout.js+dist/layout.d.ts; keep the package/export surface aligned with the emitted filestsconfig.build.json— publish-time emit config fordist/scripts/package-smoke-test.ts— tarball-level JS/TS consumer verification for the published package shapesrc/layout.ts— core library; keeplayout()fast and allocation-lightsrc/analysis.ts— normalization, segmentation, glue rules, and text-analysis phase forprepare()src/measurement.ts— canvas measurement runtime, segment metrics cache, emoji correction, and engine-profile shimssrc/line-break.ts— internal line-walking core shared by the rich layout APIs and the hot-path line countersrc/bidi.ts— simplified bidi metadata helper for the richprepareWithSegments()pathsrc/rich-inline.ts— inline-only helper for rich-text inline flow, atomic pills, and boundary whitespace collapsesrc/test-data.ts— shared corpus for browser accuracy pages/checkers and benchmarkssrc/layout.test.ts— small durable invariant tests for the exported prepare/layout APIspages/accuracy.ts— browser sweep plus per-line diagnosticsstatus/dashboard.json— machine-readable main status dashboard derived from the checked-in accuracy and benchmark snapshotsaccuracy/chrome.json/accuracy/safari.json/accuracy/firefox.json— checked-in raw accuracy rowspages/benchmark.ts— performance comparisonsbenchmarks/chrome.json/benchmarks/safari.json— checked-in current benchmark snapshotscorpora/dashboard.json— machine-readable long-form corpus dashboard derived from the corpus snapshots and notescorpora/chrome-step10.json/corpora/safari-step10.json— checked-in browserstep=10corpus sweep snapshotspages/diagnostic-utils.ts— shared grapheme-safe diagnostic helpers used by the browser check pagesscripts/pre-wrap-check.ts— small permanent browser-oracle sweep for the non-default{ whiteSpace: 'pre-wrap' }modepages/demos/index.html— public static demo landing page used as the GitHub Pages site rootpages/demos/bubbles.ts— bubble shrinkwrap demo using the rich non-materializing line-range walkerpages/demos/dynamic-layout.ts— fixed-height editorial spread with a continuous two-column flow, obstacle-aware title routing, and live logo-driven reflowpages/demos/markdown-chat.ts— rich chat virtualization demo that stress-tests prepared templates and manual block layoutpages/demos/rich-note.ts— inline-rich-note demo that dogfoods the rich-text inline flow helper at@chenglou/pretext/rich-inline
- The published package ships built ESM from
dist/;dist/is publish-time output, not checked-in source. - Keep shipped library source imports runtime-honest with
.jsspecifiers inside.tsfiles. That keeps plaintscemit producing correct JS and.d.tsfiles without a declaration rewrite step. prepare()/prepareWithSegments()do horizontal-only work.layout()/layoutWithLines()take explicitlineHeight.setLocale(locale?)retargets the hoisted word segmenter for futureprepare()calls and clears shared caches. Use it before preparing new text when the app wants a specificIntl.Segmenterlocale instead of the runtime default.prepare()should stay the opaque fast-path handle. If a page/script needs segment arrays, that should usually flow throughprepareWithSegments()instead of re-exposing internals on the main prepared type.- The rich public surface is intentionally split between stats/range helpers (
walkLineRanges(),measureLineStats(),layoutNextLineRange()) and text-materializing helpers (layoutWithLines(),layoutNextLine(),materializeLineRange()). Keep their break semantics aligned. walkLineRanges()is the rich-path batch range API: no string materialization, but still browser-like line widths/cursors/discretionary-hyphen state. Prefer it over private line walkers for shrinkwrap or aggregate layout work.- Keep prepare-time diagnostics internal to benchmark tooling. Do not grow a second public prepare surface just to expose timing splits.
prepare()is internally split into a text-analysis phase and a measurement phase; keep that seam clear, but keep the public API simple unless requirements force a change.- The internal segment model now distinguishes at least eight break kinds: normal text, collapsible spaces, preserved spaces, tabs, non-breaking glue (
NBSP/NNBSP/WJ-like runs), zero-width break opportunities, soft hyphens, and hard breaks. Do not collapse those back into one boolean unless the model gets richer in a better way. layout()is the resize hot path: no DOM reads, no canvas calls, no string work, and avoid gratuitous allocations.- Segment metrics cache is
Map<font, Map<segment, metrics>>; shared across texts and resettable viaclearCache(). Width is only one cached fact now; grapheme widths and other segment-derived facts can be populated lazily. - Word and grapheme segmenters are hoisted at module scope. Any locale reset should also clear the word cache.
- Punctuation is merged into preceding word-like segments only, never into spaces.
- Keep script-specific break-policy fixes in preprocessing, not
layout(). That includes Arabic no-space punctuation clusters, Arabic punctuation-plus-mark clusters, and" " + combining marksbefore Arabic text. NBSP-style glue should surviveprepare()as visible content and prevent ordinary word-boundary wrapping;ZWSPshould survive as a zero-width break opportunity.- Soft hyphens should stay invisible when unbroken, but if the engine chooses that break, the broken line should expose a visible trailing hyphen in
layoutWithLines(). - If a soft hyphen wins the break, the rich line APIs should still expose the visible trailing
-inline.text, even though the public line types do not currently carry a separate soft-hyphen metadata flag. layoutNextLine()is the rich-path escape hatch for variable-width userland layout. It now hides its grapheme-cache bookkeeping again by internally splitting line stepping from text materialization. Keep that internal split semantically aligned withlayoutWithLines(), but do not pull its extra bookkeeping into the hotlayout()path.- Astral CJK ideographs, compatibility ideographs, and the later extension blocks must still hit the CJK path; do not rely on BMP-only
charCodeAt()checks there. - Non-word, non-space segments are break opportunities, same as words.
- CJK grapheme splitting plus kinsoku merging keeps prohibited punctuation attached to adjacent graphemes.
- Emoji correction is auto-detected per font size, constant per emoji grapheme, and effectively font-independent.
- Bidi levels now stay on the rich
prepareWithSegments()path as custom-rendering metadata only. The opaque fastprepare()handle should not pay for bidi metadata thatlayout()does not consume, and line breaking itself does not read those levels. - The rich-path bidi classifier now comes from checked-in generated Unicode range data. Refresh it manually with
bun run generate:bidi-data; do not turn that into a normal build step. - A larger pure-TS Unicode stack like
text-shaperis useful as reference material, especially for Unicode coverage and richer bidi metadata, but its runtime segmentation and greedy glyph-line breaker are not replacements for our browser-facingIntl.Segmenter+ preprocessing + canvas-measurement model. - Supported CSS target is still the common app-text configuration:
white-space: normal,word-break: normal,overflow-wrap: break-word,line-break: auto. - There is also an explicit opt-in
{ wordBreak: 'keep-all' }mode for CJK/Hangul text and CJK-leading no-space mixed-script runs; keep its policy work in preprocessing, notlayout(). - There is now a second explicit whitespace mode,
{ whiteSpace: 'pre-wrap' }, for ordinary spaces,\ttabs, and\nhard breaks. Tabs follow the default browser-style tab stops. Treat it as editor/input-oriented, not the whole CSSpre-wrapsurface. - Keep the permanent
pre-wrapcoverage small and explicit. A one-time raw-source validation was useful, but the standing repo coverage should stay a compact oracle set rather than a giant sweep over wiki scaffolding. - That default target means narrow widths may still break inside words, but only at grapheme boundaries. Keep the core engine honest to that behavior; if an editorial page wants stricter whole-word handling, layer it on top in userland instead of quietly changing the library default.
system-uiis unsafe for accuracy; canvas and DOM can resolve different fonts on macOS.- Accuracy pages and checkers are now expected to be green in all three installed browsers on fresh runs; if a page disagrees, suspect stale tabs/servers before changing the algorithm.
- The browser automation lock is self-healing for stale dead-owner files now, but it is still single-owner per browser. If a checker times out on the lock, confirm a live checker process still owns it before changing the algorithm.
- Accuracy and corpus checkers can use background-safe browser automation, but benchmark runs should stay foreground. Do not “optimize away” benchmark focus; throttled/background tabs make the numbers less trustworthy.
- Accuracy and the maintained
step=10corpus sweep paths now batch widths in-page after a single navigation. Prefer those sweep entrypoints over userland “navigate once per width” loops, and keep the slow single-width checkers for diagnosis. - Keep the transport split deliberate: small automation reports can ride the hash, but large batched reports should use the local POST side channel instead of stuffing every row into
#report=.... - Browser-automation timeouts now report the last page phase they saw (
loading,measuring, orposting). Treatpostingtimeouts as transport-side clues first; they usually point at the report side channel rather than the text engine. - For deep perf or memory work, prefer an isolated debuggable Chrome over a pure Bun microbenchmark. Bun is fine for quick hypotheses, but Chrome profiling is the better source of truth for CPU hotspots, allocation churn, and retained-heap checks.
- Refresh
benchmarks/chrome.jsonandbenchmarks/safari.jsonwhen a diff changes benchmark methodology or the text engine hot path (src/analysis.ts,src/measurement.ts,src/line-break.ts,src/layout.ts,src/bidi.ts, orpages/benchmark.ts). Regeneratestatus/dashboard.jsonafter those snapshot changes. bun startis the stable human-facing dev server. The scripted checkers intentionally keep using--no-hmrtemporary servers so their runs stay deterministic and easy to tear down.- Do not run multiple browser corpus/sweep/font-matrix jobs in parallel against the same browser. The automation session and temporary page server paths interfere with each other and can make a healthy corpus look hung or flaky.
- An
ERR_CONNECTION_REFUSEDtab onlocalhost:3210or a similar temporary checker port usually means you caught a per-run Bun server after teardown. That is expected after the script exits; it is not, by itself, evidence of a bad measurement. - Keep
src/layout.test.tssmall and durable. For browser-specific or narrow hypothesis work, prefer throwaway probes/scripts and promote only the stable invariants into permanent tests. - For long-form corpus canary work, use the checked-in
step=10sweep first and only diagnose the mismatching widths in detail. The slow detailed checker is for narrowing root causes, not for every width by default. - For Arabic corpus/probe work, use normalized slices, the exact corpus font, and the RTL
Range-based diagnostics. Raw offsets or rough fallback fonts will mislead you. - For
pre-wrapprobe work, Safari span extraction is currently a better cross-check than SafariRangeextraction around preserved spaces and hard breaks. Keep usingRangefor the defaultwhite-space: normaldiagnostics unless the mode itself is the thing under test. - For Southeast Asian and Arabic/Urdu raw-diagnostic work, keep using the script-appropriate extractor instead of forcing one Safari rule everywhere.
- The corpus/probe diagnostic pages now compute our line offsets directly from prepared segments and grapheme fallbacks; do not go back to reconstructing them from
layoutWithLines().line.text.length. /corpus,corpus-check, andcorpus-sweepnow acceptfont/lineHeightoverrides. Use those before inventing a second page or checker when the question is “does this same corpus stay healthy under another font?”- Prefer Chrome for the first font-matrix pass. Safari font-matrix automation is slower and noisier, so treat it as follow-up smoke coverage.
- Mixed app text is now a first-class canary. Use it to catch product-shaped classes like URL/query-string wrapping, emoji ZWJ runs, and mixed-script punctuation before tuning another book corpus.
- URL-like runs such as
https://.../www...are currently modeled as two breakable preprocessing units when a query exists: the path through the query introducer (?), then the query string. This is intentionally narrow and exists to stop obviously bad mid-path URL breaks without forcing the whole query string to fragment character-by-character. - Mixed app text also pulled in two more keep-worthy preprocessing rules: contextual escaped quote clusters like
\"word\", and numeric/time-range runs like२४×७/7:00-9:00. - For Southeast Asian scripts or mixed text containing Thai/Lao/Khmer/Myanmar, trust the
Range-based corpus diagnostics over span-probing; span units can perturb line breaking there. - The remaining Chrome mixed-app
710pxsoft-hyphen miss is extractor-sensitive and not cleanly local. Treat it as paragraph-scale / accumulation-sensitive until a cleaner reproducer appears, and do not patch the engine from only one extractor view. - Safari
Range-based probe extraction can over-advance across URL query text (...path?q) even when the real DOM height and thespanextractor are exact. Cross-check--method=spanbefore changing the engine on Safari URL/query probe misses. - Keep the current corpus lessons in mind:
- Thai: contextual ASCII quotes were a real keep
- Khmer: explicit zero-width separators from clean source text are useful signal
- Lao: wrapped raw-law text was a bad canary and was rejected
- Myanmar: punctuation/medial-glue keeps survived, broader Chrome-only fixes did not
- Japanese: kana iteration marks are CJK line-start-prohibited
- Chinese: the remaining broad Chrome-positive field is real and not obviously another punctuation bug
- The corpus diagnostics should derive our candidate lines from
layoutWithLines(), not from a second local line-walker. That avoids SHY and future custom-break drift between the hot path and the diagnostic path. - Current line-fit tolerance is
0.005for Chromium/Gecko and1/64for Safari/WebKit. That bump was justified by the remaining Arabic fine-width field and did not move the solved browser corpus or the English long-form canary. - Refresh
accuracy/chrome.json,accuracy/safari.json, andaccuracy/firefox.jsonwhen a diff changes the browser sweep methodology or the main text engine behavior (src/analysis.ts,src/measurement.ts,src/line-break.ts,src/layout.ts,src/bidi.ts, orpages/accuracy.ts). - Refresh
corpora/chrome-step10.jsonand then regeneratecorpora/dashboard.jsonwhen the corpus sweep methodology or long-form canary behavior changes in a way that moves the dashboard counts. - Refresh
corpora/safari-step10.jsonalongsidecorpora/chrome-step10.jsonwhen the corpus sweep methodology or long-form canary behavior changes in a way that moves the dashboard counts.