fix: decode HTML character references in tokenizer #74

Open
dannywillems wants to merge 1 commit into master from fix/entity-decoding

@dannywillems
Contributor

Add entity decoding to the tokenizer so that named (&amp;), decimal (&#65;), and hex (&#x41;) character references are resolved to their corresponding characters during tokenization.

  • New entities module with a table of 253 named HTML entities, looked up by binary search
  • Numeric references handle decimal and hex forms and map null and surrogate code points to U+FFFD, per the WHATWG spec
  • Named references use longest-match and respect the WHATWG attribute-context rule: inside an attribute value, a reference without a trailing ; is not decoded when the next character is = or alphanumeric
  • Entity handling in the Data, AttributeValueDoubleQuoted, AttributeValueSingleQuoted, and AttributeValueUnquoted states
  • 28 new tests covering tokenizer and tree builder integration
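
For reviewers, the rules in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: it assumes Rust, and the names `NAMED`, `lookup_named`, `decode_numeric`, and `decode_named` are hypothetical, as is the six-entry table standing in for the 253-entry one. It also simplifies by letting every name match without a trailing ;, which the WHATWG table allows only for a legacy subset.

```rust
// Hypothetical sketch; not the PR's actual `entities` module.

/// Entity names (without `&` or `;`), sorted so binary search is valid.
const NAMED: &[(&str, char)] = &[
    ("amp", '&'),
    ("gt", '>'),
    ("lt", '<'),
    ("not", '\u{00AC}'),
    ("notin", '\u{2209}'),
    ("quot", '"'),
];

/// Look up a name in the sorted table by binary search.
fn lookup_named(name: &str) -> Option<char> {
    NAMED
        .binary_search_by(|entry| entry.0.cmp(name))
        .ok()
        .map(|i| NAMED[i].1)
}

/// Decode the digits of a numeric reference (`radix` is 10 or 16).
/// The caller is assumed to pass at least one valid digit.
/// Null, surrogates, and out-of-range values become U+FFFD.
fn decode_numeric(digits: &str, radix: u32) -> char {
    match u32::from_str_radix(digits, radix) {
        Ok(0) | Err(_) => '\u{FFFD}',
        // `char::from_u32` rejects surrogates and values above U+10FFFF.
        Ok(cp) => char::from_u32(cp).unwrap_or('\u{FFFD}'),
    }
}

/// Try to decode a named reference from `rest`, the text right after `&`.
/// Returns the character and the number of bytes consumed, or `None`
/// when the text should be left undecoded.
fn decode_named(rest: &str, in_attribute: bool) -> Option<(char, usize)> {
    // Candidate name: the leading run of ASCII alphanumerics.
    let run = rest
        .find(|c: char| !c.is_ascii_alphanumeric())
        .unwrap_or(rest.len());
    // Longest match: try the longest prefix first, then shorter ones.
    for end in (1..=run).rev() {
        if let Some(ch) = lookup_named(&rest[..end]) {
            let next = rest[end..].chars().next();
            if next == Some(';') {
                return Some((ch, end + 1)); // consume the `;` too
            }
            // Attribute-context rule: no `;`, and the next character is
            // `=` or alphanumeric, so the reference is not decoded.
            if in_attribute
                && matches!(next, Some(c) if c == '=' || c.is_ascii_alphanumeric())
            {
                return None;
            }
            return Some((ch, end));
        }
    }
    None
}
```

Under these assumptions, `decode_named("amp=1", true)` yields `None`, so a URL such as `?a=1&amp=2` inside an attribute is left intact, while the same input in the Data state decodes to `&`.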
