fix: decode HTML character references in tokenizer by dannywillems · Pull Request #74 · LeakIX/ironhtml

dannywillems · 2026-02-09T14:54:41Z

Add entity decoding to the tokenizer so that named (`&amp;`), decimal (`&#65;`) and hex (`&#x41;`) character references are resolved to their corresponding characters during tokenization.

- New entities module with 253 named HTML entities and binary search
- Numeric references handle decimal and hex, and map null/surrogates to U+FFFD per the WHATWG spec
- Named references use longest-match and respect the WHATWG attribute context rule (no decode when followed by `=` or an alphanumeric without `;`)
- Entity handling in the Data, AttributeValueDoubleQuoted, AttributeValueSingleQuoted, and AttributeValueUnquoted states
- 28 new tests covering tokenizer and tree builder integration
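
For readers who have not seen the diff, the rules listed above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the names `NAMED`, `lookup_named`, `parse_numeric`, and `blocked_in_attribute` are hypothetical, the table holds four entries instead of 253, and the WHATWG remapping of the 0x80-0x9F range is left out.

```rust
// Hypothetical, tiny stand-in for the named-entity table.
// It must stay sorted by name so that binary search is valid.
const NAMED: &[(&str, char)] = &[
    ("amp", '&'),
    ("gt", '>'),
    ("lt", '<'),
    ("quot", '"'),
];

/// Look up an entity name (without `&` and `;`) by binary search.
fn lookup_named(name: &str) -> Option<char> {
    NAMED
        .binary_search_by(|&(n, _)| n.cmp(name))
        .ok()
        .map(|i| NAMED[i].1)
}

/// Decode the body of a numeric reference, i.e. the text between
/// `&#` and `;`, such as "65" or "x41". Returns None if there are
/// no digits or a non-digit is present.
fn parse_numeric(body: &str) -> Option<char> {
    let (digits, radix) = match body
        .strip_prefix('x')
        .or_else(|| body.strip_prefix('X'))
    {
        Some(hex) => (hex, 16),
        None => (body, 10),
    };
    if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    // Overflow is treated as an out-of-range code point.
    let code = u32::from_str_radix(digits, radix).unwrap_or(u32::MAX);
    Some(match code {
        0 => '\u{FFFD}', // null maps to the replacement character
        // Surrogates and values above U+10FFFF are not valid chars,
        // so `from_u32` returns None and we fall back to U+FFFD.
        _ => char::from_u32(code).unwrap_or('\u{FFFD}'),
    })
}

/// Attribute context rule: a named reference that lacks its `;` is
/// left undecoded when the next character is `=` or alphanumeric.
fn blocked_in_attribute(has_semicolon: bool, next: Option<char>) -> bool {
    !has_semicolon
        && matches!(next, Some(c) if c == '=' || c.is_ascii_alphanumeric())
}
```

Under these assumptions, `parse_numeric("65")` and `parse_numeric("x41")` both yield `'A'`, `parse_numeric("xD800")` yields U+FFFD, and `blocked_in_attribute(false, Some('='))` is true, which is why a value such as `href="?a=1&amp=2"` keeps its `&amp` text undecoded.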

dannywillems wants to merge 1 commit into master from fix/entity-decoding
