198 changes: 198 additions & 0 deletions docs/classifier-guide.md
@@ -0,0 +1,198 @@
# Divergence Classifier Guide

The divergence classifier (`src/measurement-validator/classifier.ts`) identifies
**why** Pretext canvas measurements diverge from DOM measurements. It runs a
priority-ordered chain of detection strategies and returns a `DivergenceAnalysis`.

## Quick Start

```typescript
import {
  createComparator,
  createDOMAdapter,
  classifyDivergence,
  classifyDivergenceSync,
} from './src/measurement-validator/index.js'

const adapter = createDOMAdapter()
const comparator = createComparator(adapter)

const result = comparator.compare({
  text: 'مرحباً بالعالم',
  font: '16px Arial',
  maxWidth: 300,
  lineHeight: 20,
})

// Async (includes font-fallback detection via DOM)
const analysis = await classifyDivergence(result, adapter)
console.log(analysis.rootCause) // 'bidi_shaping'
console.log(analysis.confidence) // 0.85
console.log(analysis.recommendation)
```

## `DivergenceAnalysis` shape

```typescript
type DivergenceAnalysis = {
  detected: boolean
  severity: 'minor' | 'major' | 'critical'
  rootCause?:
    | 'font_fallback'
    | 'bidi_shaping'
    | 'emoji_rendering'
    | 'browser_quirk'
    | 'variable_font'
    | 'unknown'
  confidence: number // 0–1
  recommendation: string
  details: Record<string, unknown>
}
```

## Detection Strategies

The classifier tests strategies in priority order. The **first** matching strategy
wins and determines the `rootCause`.
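
The first-match-wins behaviour can be pictured with a small sketch. The `Strategy` type, `firstMatch`, and the two sample strategies below are hypothetical names made up for illustration; the real internals of `classifier.ts` may be organised differently.

```typescript
// Hypothetical types for illustration only; not the classifier's actual internals.
type RootCause = 'font_fallback' | 'bidi_shaping' | 'emoji_rendering' | 'browser_quirk' | 'unknown'

type Strategy = {
  rootCause: RootCause
  confidence: number
  matches: (text: string, font: string) => boolean
}

// Walk the strategies in array order and stop at the first one that matches.
function firstMatch(
  strategies: Strategy[],
  text: string,
  font: string,
): { rootCause: RootCause; confidence: number } {
  for (const s of strategies) {
    if (s.matches(text, font)) {
      return { rootCause: s.rootCause, confidence: s.confidence }
    }
  }
  // Nothing fired: fall through to the 'unknown' bucket.
  return { rootCause: 'unknown', confidence: 0.3 }
}

// Two simplified strategies, listed in priority order.
const chain: Strategy[] = [
  { rootCause: 'bidi_shaping', confidence: 0.85, matches: (text) => /[\u0590-\u08FF]/.test(text) },
  { rootCause: 'browser_quirk', confidence: 0.6, matches: (_text, font) => font.indexOf('system-ui') !== -1 },
]
```

With this ordering, a Hebrew string measured in `system-ui` is reported as `bidi_shaping` even though the `browser_quirk` check would also match, because the earlier strategy wins.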

### 1. Font Fallback (`font_fallback`) — async only

**When:** The requested font is not loaded and the browser silently falls back to
a system font.

**How:** Re-measures with the `serif` fallback; if the total line widths are
within 1 % of the specified-font widths, the font was likely never loaded.

**Confidence:** 0.90

**Fix:** Preload fonts with `<link rel="preload">` or use a guaranteed system font.
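
The 1 % comparison described above could look roughly like the following. This is a minimal sketch that assumes the caller already has per-line widths for both measurements; `looksLikeFallback` and its parameters are hypothetical names, and the handling of zero-width input is an assumption.

```typescript
// Sum a list of per-line widths in pixels.
function totalWidth(widths: number[]): number {
  return widths.reduce((acc, w) => acc + w, 0)
}

// Returns true when the widths measured with the requested font are within
// `tolerance` (relative) of the widths measured with the `serif` fallback,
// which suggests the requested font never loaded.
function looksLikeFallback(
  specifiedWidths: number[],
  serifWidths: number[],
  tolerance: number = 0.01,
): boolean {
  const specified = totalWidth(specifiedWidths)
  const fallback = totalWidth(serifWidths)
  // Assumption: with no measurable width, treat the two as equal only if both are zero.
  if (specified === 0) return fallback === 0
  return Math.abs(specified - fallback) / specified <= tolerance
}
```

A loaded font with clearly different metrics produces a relative difference well above the threshold, so the check stays quiet.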

---

### 2. Bidi Shaping (`bidi_shaping`)

**When:** The text contains characters from right-to-left scripts such as Arabic
or Hebrew, including Arabic-script languages such as Urdu
(`U+0590–U+08FF`, `U+FB1D–U+FDFF`, `U+FE70–U+FEFF`).

**How:** Regexp check on the text string; no DOM access required.

**Confidence:** 0.85

**Fix:** Verify that `PreparedTextWithSegments.segLevels` are populated and used
for RTL rendering; check that `canvas.measureText` and DOM agree on shaped glyphs.
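
A regexp check over the ranges listed above might be written as follows. This is a sketch, not the classifier's actual source; `RTL_PATTERN` and `hasRTL` are illustrative names.

```typescript
// Character class built from the three ranges listed in this section.
const RTL_PATTERN = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/

// True when the string contains at least one character from those ranges.
function hasRTL(text: string): boolean {
  return RTL_PATTERN.test(text)
}
```

The pattern deliberately has no `g` flag, so repeated calls to `test` do not carry `lastIndex` state between strings.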

---

### 3. Emoji Rendering (`emoji_rendering`)

**When:** The text contains one or more emoji presentation codepoints
(`\p{Emoji_Presentation}`).

**How:** Unicode regex check; no DOM access required.

**Confidence:** 0.75

**Note:** Pretext auto-corrects Chrome/Firefox canvas emoji metrics at small font
sizes; Safari canvas and DOM agree natively. Divergence here usually means a
font-size-specific correction is off.

**Fix:** Test emoji-heavy strings across Chrome, Firefox, and Safari independently.
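
The Unicode property check can be sketched as below. The helper name `hasEmojiPresentation` is hypothetical; the pattern is built with the `RegExp` constructor so the snippet does not depend on the compiler's regex-literal target settings.

```typescript
// \p{Emoji_Presentation} matches code points that render as emoji by default.
// The `u` flag is required for Unicode property escapes.
const EMOJI_PRESENTATION = new RegExp('\\p{Emoji_Presentation}', 'u')

function hasEmojiPresentation(text: string): boolean {
  return EMOJI_PRESENTATION.test(text)
}
```

Note that ASCII digits and `#` carry the broader `Emoji` property but not `Emoji_Presentation`, so plain numeric text does not trigger this check.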

---

### 4. Browser Quirk (`browser_quirk`)

**When:** Heuristics suggest a known browser/OS rendering difference:
- `system-ui` in the font string → `os_rendering` (macOS vs Windows resolution)
- `variation` in the font string → `variable_font` (canvas axis support)
- Safari user-agent detected → `safari_kerning`

**Confidence:** 0.60

**Fix:** Use named fonts instead of `system-ui`; test variable fonts manually.
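
The three heuristics above could be combined along these lines. This is a sketch under stated assumptions: the order of the checks, the `detectQuirk` name, and the user-agent test (which excludes Chromium-based browsers because they also advertise `Safari/`) are illustrative, not taken from the classifier source.

```typescript
type QuirkKind = 'os_rendering' | 'variable_font' | 'safari_kerning'

// Returns the first matching quirk, or null when none of the heuristics fire.
function detectQuirk(font: string, userAgent: string): QuirkKind | null {
  if (font.indexOf('system-ui') !== -1) return 'os_rendering'
  if (font.indexOf('variation') !== -1) return 'variable_font'
  // Assumption: Chromium-based user agents also contain "Safari/", so exclude them.
  const isSafari = /Safari\//.test(userAgent) && !/Chrome\/|Chromium\/|Edg\//.test(userAgent)
  return isSafari ? 'safari_kerning' : null
}
```

In a browser the second argument would typically be `navigator.userAgent`; passing it in explicitly keeps the function usable in tests without a live DOM.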

---

### 5. Unknown (`unknown`)

**When:** A divergence exists but none of the above strategies fired.

**Confidence:** 0.30

**Action:** File a bug with a minimal reproduction — text string, font, width, and
the two measurements.

---

## Sync vs Async

| Function | Font-fallback | Use when |
|---|---|---|
| `classifyDivergence(result, adapter)` | ✅ async DOM | Production validation |
| `classifyDivergenceSync(result)` | ❌ skipped | Unit tests, scripts without live DOM |

---

## Batch Classification

```typescript
import { classifyAll } from './src/measurement-validator/index.js'

const analyses = await classifyAll(results, adapter)
```
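
Once a batch has been classified, a quick tally by root cause is often the first thing worth looking at. The helper below is a hypothetical sketch, not part of the validator's API; `AnalysisLike` mirrors only the two `DivergenceAnalysis` fields it needs.

```typescript
// Minimal structural subset of DivergenceAnalysis used by this sketch.
type AnalysisLike = { detected: boolean; rootCause?: string }

// Count detected divergences per root cause; analyses without a divergence are skipped.
function countByRootCause(analyses: AnalysisLike[]): Record<string, number> {
  const counts: Record<string, number> = {}
  for (const a of analyses) {
    if (!a.detected) continue
    const key = a.rootCause ?? 'unknown'
    counts[key] = (counts[key] ?? 0) + 1
  }
  return counts
}
```

Passing the array returned by `classifyAll` would give a per-cause summary for the whole fixture set.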

---

## Interpreting Confidence

| Confidence | Interpretation |
|---|---|
| ≥ 0.90 | Very likely the root cause |
| 0.75–0.89 | Probable cause, worth investigating |
| 0.60–0.74 | Possible cause, check manually |
| < 0.60 | Speculative — use as a starting point |
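
For reporting, the bands in the table can be mapped to labels with a small helper. `interpretConfidence` and the label names are hypothetical and only restate the thresholds above.

```typescript
type ConfidenceBand = 'very_likely' | 'probable' | 'possible' | 'speculative'

// Map a 0-1 confidence score to the bands from the table above.
function interpretConfidence(confidence: number): ConfidenceBand {
  if (confidence >= 0.9) return 'very_likely'
  if (confidence >= 0.75) return 'probable'
  if (confidence >= 0.6) return 'possible'
  return 'speculative'
}
```

Under this mapping the built-in strategies land in one band each: font fallback is `very_likely`, bidi shaping and emoji rendering are `probable`, browser quirk is `possible`, and unknown is `speculative`.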

---

## Examples

### Font not loaded

```typescript
const analysis = await classifyDivergence(result, adapter)
// {
//   detected: true,
//   rootCause: 'font_fallback',
//   severity: 'critical',
//   confidence: 0.90,
//   recommendation: 'Font "16px Roboto" may not be loaded ...',
//   details: { fontSpecified: '16px Roboto', fontDetected: 'serif (system fallback)' }
// }
```

### RTL text

```typescript
// sample.text = 'مرحباً بالعالم'
// {
//   detected: true,
//   rootCause: 'bidi_shaping',
//   severity: 'major',
//   confidence: 0.85,
//   recommendation: 'RTL text detected ...',
//   details: { hasRTL: true, isMixedBidi: false }
// }
```

### No divergence

```typescript
// result.overallSeverity === 'pass'
// {
//   detected: false,
//   severity: 'minor',
//   confidence: 1,
//   recommendation: 'No divergence detected.',
//   details: {}
// }
```
165 changes: 165 additions & 0 deletions docs/language-matrix.md
@@ -0,0 +1,165 @@
# Language Support Matrix

This document describes the measurement validator's coverage for each language
group, known divergence patterns, and recommended workarounds.

## Language Groups

| Group | Languages | Fixture file | Status |
|-------|-----------|--------------|--------|
| `ltr-simple` | English, Spanish, French, German | `english-samples.json` | ✅ Phase 1 |
| `rtl` | Arabic, Hebrew, Urdu | `rtl-samples.json` | ✅ Phase 2 |
| `cjk` | Chinese (Simplified/Traditional), Japanese, Korean | `cjk-samples.json` | ✅ Phase 2 |
| `complex-script` | Thai, Myanmar, Khmer | `complex-script-samples.json` | ✅ Phase 2 |
| `mixed-bidi` | English + Arabic/Hebrew in same text | `mixed-bidi-samples.json` | ✅ Phase 2 |

---

## LTR Simple (`ltr-simple`)

**Languages:** English, Spanish, French, German (and other Latin-script languages)

**Accuracy target:** ≥ 99 % of lines within 0.5 px
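
An accuracy target of this form can be checked with a few lines of arithmetic. The sketch below assumes the caller has a list of per-line width differences in pixels (Pretext minus DOM); `fractionWithin` and `meetsTarget` are hypothetical helpers, and treating an empty sample as passing is an assumption.

```typescript
// Fraction of lines whose absolute width difference is within the tolerance.
function fractionWithin(deltasPx: number[], tolerancePx: number): number {
  if (deltasPx.length === 0) return 1 // assumption: no lines means nothing diverged
  const ok = deltasPx.filter((d) => Math.abs(d) <= tolerancePx).length
  return ok / deltasPx.length
}

// True when at least `minFraction` of the lines are within `tolerancePx`.
function meetsTarget(deltasPx: number[], tolerancePx: number, minFraction: number): boolean {
  return fractionWithin(deltasPx, tolerancePx) >= minFraction
}
```

For this group the call would be `meetsTarget(deltas, 0.5, 0.99)`; the other groups in this document use the same shape with their own tolerance and fraction.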

**Known divergences:**

| Divergence | Cause | Impact | Workaround |
|---|---|---|---|
| System font resolution | `system-ui` resolves differently in canvas vs DOM on macOS | Moderate | Use named font (e.g. `Arial`, `Helvetica`) |
| Soft-hyphen visibility | SHY (`\u00AD`) is invisible until chosen as a break point | Minor | Expected; Pretext exposes trailing `-` in `line.text` |
| Non-breaking space | NBSP prevents word breaks but may add visual width | Minor | Expected behaviour |

---

## RTL (`rtl`)

**Languages:** Arabic, Hebrew, Urdu, Persian

**Accuracy target:** ≥ 85 % of lines within 1.0 px

**Key considerations:**

- Arabic is a **connected script** — glyphs change shape depending on position in a word
(initial, medial, final, isolated forms). `canvas.measureText` uses the shaped
glyph widths when the font is loaded; divergence usually means the font is missing.
- The classifier flags RTL text with `rootCause: 'bidi_shaping'` at confidence 0.85.
- Pretext's `prepareWithSegments()` exposes `segLevels` (bidi embedding levels) for
custom RTL rendering; `layout()` itself does not read bidi levels.
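
In the Unicode Bidi Algorithm an odd embedding level means right-to-left and an even level means left-to-right, so custom rendering code can group segments into direction runs. The sketch below assumes `segLevels` behaves like an array of numbers with one entry per segment; `directionRuns` and the `Run` type are hypothetical names, not part of Pretext.

```typescript
type Run = { start: number; end: number; rtl: boolean }

// Group consecutive segments that share a direction into [start, end) runs.
function directionRuns(segLevels: ArrayLike<number>): Run[] {
  const runs: Run[] = []
  for (let i = 0; i < segLevels.length; i++) {
    const rtl = segLevels[i] % 2 === 1 // odd embedding level = RTL
    const last = runs[runs.length - 1]
    if (last !== undefined && last.rtl === rtl) {
      last.end = i + 1
    } else {
      runs.push({ start: i, end: i + 1, rtl })
    }
  }
  return runs
}
```

A pure-Arabic paragraph yields a single RTL run, while an embedded Latin word shows up as an LTR run between two RTL runs.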

**Known divergences:**

| Divergence | Cause | Impact |
|---|---|---|
| Arabic ligature width | Some fonts collapse two glyphs into one ligature | Minor |
| Diacritic (harakat) stacking | Zero-width combining marks may add canvas overhead | Minor |
| RTL line direction | DOM aligns text to the right; Pretext reports pixel widths only | N/A |

**Workarounds:**

1. Always preload the target Arabic/Hebrew font — falling back to a system font will
cause significant divergence.
2. Use `prepareWithSegments()` and inspect `segLevels` to confirm bidi levels.
3. For Urdu (Nastaliq style), use a font that supports the Nastaliq layout engine.

---

## CJK (`cjk`)

**Languages:** Chinese (Simplified), Chinese (Traditional), Japanese, Korean

**Accuracy target:** ≥ 90 % of lines within 0.5 px

**Key considerations:**

- CJK ideographs are normally one grapheme cluster per character and break at every
character boundary (Pretext uses `Intl.Segmenter` for this).
- Japanese kinsoku rules prohibit certain punctuation at line start/end; Pretext merges
these into adjacent graphemes.
- `word-break: keep-all` prevents mid-word breaks in CJK; set `wordBreak: 'keep-all'`
on the `MeasurementSample` to test this mode.
- Full-width punctuation and iteration marks have specific break prohibition rules.
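
The kinsoku merging mentioned above can be illustrated with a toy version. The character list here is a small illustrative subset of line-start-prohibited punctuation, not Pretext's actual table, and `mergeLineStartProhibited` is a hypothetical name; the input is assumed to be already split into graphemes.

```typescript
// Illustrative subset of characters that must not start a line.
const NO_LINE_START: string[] = ['。', '、', '」', '）', '！', '？']

// Attach each prohibited character to the preceding unit so that no break
// opportunity exists directly before it.
function mergeLineStartProhibited(graphemes: string[]): string[] {
  const units: string[] = []
  for (const g of graphemes) {
    if (NO_LINE_START.indexOf(g) !== -1 && units.length > 0) {
      units[units.length - 1] += g
    } else {
      units.push(g)
    }
  }
  return units
}
```

Because breaks are only allowed between units, the merged punctuation always travels with the character before it.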

**Known divergences:**

| Divergence | Cause | Impact |
|---|---|---|
| Kinsoku edge cases | Browser may differ from Pretext on rare punctuation combinations | Minor |
| Mixed CJK + Latin kerning | Latin kerning near CJK glyphs varies by font | Minor |
| `word-break: keep-all` | Hangul syllable block break policy varies | Minor |

---

## Complex Scripts (`complex-script`)

**Languages:** Thai, Myanmar, Khmer

**Accuracy target:** ≥ 80 % of lines within 1.0 px

**Key considerations:**

- These scripts do **not** use spaces as word boundaries; line breaking is
cluster/syllable-based and requires ICU dictionary data.
- Pretext relies on `Intl.Segmenter` which uses the browser's ICU data — this varies
between Chrome, Firefox, and Safari.
- **Use `Range`-based DOM extraction** for these scripts. Span-based extraction
can perturb line breaking around cluster boundaries.
- Divergences here are often extractor-sensitive, not algorithm bugs.

**Known divergences:**

| Language | Divergence | Notes |
|---|---|---|
| Thai | Cluster boundary differences between browsers | Chrome ICU ≥ Firefox |
| Myanmar | Medial/glue glyph stacking | Font-dependent |
| Khmer | Zero-width spaces from clean source text | Can be explicit break hints |

**Workarounds:**

1. Use the exact corpus font for measurements.
2. For diagnosis, prefer Range-based extraction over span-based.
3. Accept higher tolerance (1.0 px) for these scripts.

---

## Mixed Bidi (`mixed-bidi`)

**Languages:** Any combination of LTR + RTL in the same string

**Accuracy target:** ≥ 80 % of lines within 1.0 px

**Key considerations:**

- Mixed bidi text requires the Unicode Bidi Algorithm (UBA) to determine visual order.
- Pretext handles bidi at the segment level via `prepareWithSegments().segLevels`.
- The classifier returns `rootCause: 'bidi_shaping'` for any text containing RTL ranges.
- Line-break opportunities in mixed bidi text depend on the resolved bidi levels.
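
A rough way to tell mixed bidi text from pure RTL text is to look for both RTL characters and Latin letters in the same string. This is an approximation for illustration only: `isMixedBidi` here is a hypothetical helper, the Latin check covers only basic ASCII letters, and the classifier's own `isMixedBidi` detail may be computed differently.

```typescript
// RTL ranges used by the classifier's bidi check.
const RTL_CHARS = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/
// Assumption: ASCII letters stand in for "strong LTR" characters.
const LATIN_LETTERS = /[A-Za-z]/

// True when the text contains both RTL characters and Latin letters.
function isMixedBidi(text: string): boolean {
  return RTL_CHARS.test(text) && LATIN_LETTERS.test(text)
}
```

A URL or brand name inside an Arabic paragraph is the typical case this flags.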

**Known divergences:**

| Divergence | Cause | Impact |
|---|---|---|
| Visual order of neutral characters | Punctuation between LTR and RTL runs | Minor |
| URL / brand names in RTL context | Latin embedded in Arabic paragraph | Minor |
| Number direction | Arabic-Indic vs Western Arabic numerals | Minor |

---

## Browser Compatibility

| Browser | LTR Simple | RTL | CJK | Complex Script | Mixed Bidi |
|---|---|---|---|---|---|
| Chrome (Chromium) | ✅ | ✅ | ✅ | ✅ | ✅ |
| Safari (WebKit) | ✅ | ✅ | ✅ | ⚠️ | ⚠️ |
| Firefox (Gecko) | ✅ | ✅ | ✅ | ⚠️ | ⚠️ |

⚠️ = higher tolerance required; see the accuracy checker pages for current per-browser data.

---

## Adding New Languages

1. Add fixture samples to the appropriate JSON file under `test/fixtures/`.
2. Give each sample an `id`, `description`, `languageGroup`, and `language` field.
3. Run the test suite with `bun test test/` to confirm the new samples load.
4. If a new language group is needed, add it to the `LanguageGroup` union in
`src/measurement-validator/types.ts` and add a corresponding fixture file.
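
Before running the suite, it can help to sanity-check new fixture entries against the required fields from step 2. The validator below is a hypothetical sketch: `fixtureProblems` is not part of the repository, and the `LanguageGroup` union is copied from the table at the top of this document rather than from `types.ts`.

```typescript
// Group names as listed in the Language Groups table.
type LanguageGroup = 'ltr-simple' | 'rtl' | 'cjk' | 'complex-script' | 'mixed-bidi'

const LANGUAGE_GROUPS: LanguageGroup[] = ['ltr-simple', 'rtl', 'cjk', 'complex-script', 'mixed-bidi']
const REQUIRED_FIELDS: string[] = ['id', 'description', 'languageGroup', 'language']

// Return a list of human-readable problems; an empty list means the sample looks fine.
function fixtureProblems(sample: Record<string, unknown>): string[] {
  const problems: string[] = []
  for (const field of REQUIRED_FIELDS) {
    const value = sample[field]
    if (typeof value !== 'string' || value === '') {
      problems.push('missing or empty field: ' + field)
    }
  }
  const group = sample['languageGroup']
  if (typeof group === 'string' && group !== '' && LANGUAGE_GROUPS.indexOf(group as LanguageGroup) === -1) {
    problems.push('unknown languageGroup: ' + group)
  }
  return problems
}
```

Running this over a parsed fixture file and printing any non-empty results gives quick feedback before `bun test test/`.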