htmlparser Documentation

This is the canonical manual for usage, API, selector behavior, performance workflow, conformance expectations, and internals.

Requirements
Quick Start
Core API
Selector Support
Mode Guidance
Performance and Benchmarks
Latest Benchmark Snapshot
Conformance Status
Architecture
Troubleshooting

Requirements

Zig 0.16.0-dev.2984+cb7d2b056
Mutable input buffers ([]u8) for parsing; parse and lazy decode paths rewrite source bytes in place
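
Input often starts out as read-only bytes (a string literal or an embedded file), so it has to be copied into mutable storage before it can be parsed. A minimal sketch of two standard-library ways to do that; `mutableCopy` is a hypothetical helper name and is not part of htmlparser.

```zig
const std = @import("std");

/// Hypothetical helper (not part of htmlparser): copy read-only bytes into a
/// heap buffer that a parser is allowed to mutate. The caller frees the result.
fn mutableCopy(allocator: std.mem.Allocator, bytes: []const u8) ![]u8 {
    return allocator.dupe(u8, bytes);
}

test "obtaining a mutable []u8" {
    // Dereferencing a string literal copies it into a mutable local array.
    var on_stack = "<p>hi</p>".*;
    const stack_slice: []u8 = &on_stack;
    stack_slice[1] = 'P'; // allowed: this writes to the copy, not the literal

    // Runtime-sized or read-only data can be duplicated onto the heap instead.
    const read_only: []const u8 = "<p>hi</p>";
    const heap_slice = try mutableCopy(std.testing.allocator, read_only);
    defer std.testing.allocator.free(heap_slice);
    heap_slice[1] = 'P';

    try std.testing.expectEqualStrings("<P>hi</p>", stack_slice);
    try std.testing.expectEqualStrings("<P>hi</p>", heap_slice);
    try std.testing.expectEqualStrings("<p>hi</p>", read_only);
}
```

The stack form suits fixed test inputs; the heap form suits data whose size is only known at runtime.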

Quick Start

const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "basic parse + query" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*;
    try doc.parse(&input, .{});

    const a = doc.queryOne("div#app > a.nav") orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("/docs", a.getAttributeValue("href").?);
}

Source examples:

examples/basic_parse_query.zig
examples/query_time_decode.zig

All examples are verified by running zig build examples-check.

Core API

`Document` factory and lifecycle

const opts: ParseOptions = .{};
const Document = opts.GetDocument();
Document.init(allocator)
doc.deinit()
doc.clear()
doc.parse(input: []u8, comptime opts: ParseOptions)

Query APIs

Compile-time selectors:
- doc.queryOne(comptime selector)
- doc.queryAll(comptime selector)
Runtime selectors:
- try doc.queryOneRuntime(selector)
- try doc.queryAllRuntime(selector)
Cached runtime selectors:
- doc.queryOneCached(&selector)
- doc.queryAllCached(&selector)
- selector created via try Selector.compileRuntime(allocator, source)
Diagnostics:
- doc.queryOneDebug(comptime selector, report)
- try doc.queryOneRuntimeDebug(selector, report)

Node APIs

Navigation:
- tagName()
- parentNode()
- firstChild()
- lastChild()
- nextSibling()
- prevSibling()
- children() (iterator of wrapped child nodes; collect(allocator) returns an owned []Node)
Text:
- innerText(allocator) (returns a slice borrowed from the document source when possible, otherwise an allocated one; doc.isOwned(slice) tells them apart)
- innerTextWithOptions(allocator, TextOptions)
- innerTextOwned(allocator) (always allocated)
- innerTextOwnedWithOptions(allocator, TextOptions)
Attributes:
- getAttributeValue(name)
Scoped queries:
- same query family as Document (queryOne/queryAll, runtime, cached, debug)
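
A short sketch of a scoped query plus child iteration, using only the calls listed above. It assumes `collect(allocator)` returns an error union and that `children()` returns an iterator value that can be stored in a local; neither detail is confirmed by this manual.

```zig
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "scoped query and child iteration" {
    const allocator = std.testing.allocator;
    var doc = Document.init(allocator);
    defer doc.deinit();

    var input = "<a href='/c'>C</a><nav><a href='/a'>A</a><span>x</span></nav>".*;
    try doc.parse(&input, .{});

    const nav = doc.queryOne("nav") orelse return error.TestUnexpectedResult;

    // Scoped query: only descendants of `nav` are candidates, so the
    // "/c" link that sits outside of it cannot match.
    const link = nav.queryOne("a") orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("/a", link.getAttributeValue("href").?);

    // collect() hands back an owned []Node, so the caller frees it.
    var it = nav.children();
    const kids = try it.collect(allocator);
    defer allocator.free(kids);
    for (kids) |kid| {
        std.debug.print("child href: {s}\n", .{kid.getAttributeValue("href") orelse "(none)"});
    }
}
```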

Helpers

doc.html(), doc.head(), doc.body()
doc.isOwned(slice) to check whether a slice points into document source bytes

Parse/Text options

ParseOptions
- eager_child_views: bool = true
- drop_whitespace_text_nodes: bool = false
TextOptions
- normalize_whitespace: bool = true
parse/query work split:
- parse keeps raw text and attribute spans in-place
- entity decode and whitespace normalization are applied by query-time APIs (getAttributeValue, innerText*, selector attribute predicates)
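
To make the parse/query split concrete, the sketch below stores an attribute containing `&amp;` and reads it back. Based on the split described above, the decoded value is expected from `getAttributeValue`; the fixture string is made up for illustration, and examples/query_time_decode.zig remains the maintained example.

```zig
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "entities are decoded at query time" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<a href='/s?a=1&amp;b=2'>Search</a>".*;
    // Parsing only records the raw attribute span.
    try doc.parse(&input, .{});

    const a = doc.queryOne("a") orelse return error.TestUnexpectedResult;
    // Decoding happens here, on first access through a query-time API.
    try std.testing.expectEqualStrings("/s?a=1&b=2", a.getAttributeValue("href").?);
}
```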

Instrumentation wrappers

parseWithHooks(doc, input, opts, hooks)
queryOneRuntimeWithHooks(doc, selector, hooks)
queryOneCachedWithHooks(doc, selector, hooks)
queryAllRuntimeWithHooks(doc, selector, hooks)
queryAllCachedWithHooks(doc, selector, hooks)

Selector Support

Supported selectors:

tag selectors and universal *
#id, .class
attributes:
- [a], [a=v], [a^=v], [a$=v], [a*=v], [a~=v], [a|=v]
combinators:
- descendant (a b)
- child (a > b)
- adjacent sibling (a + b)
- general sibling (a ~ b)
grouping: a, b, c
pseudo-classes:
- :first-child
- :last-child
- :nth-child(An+B) with odd/even and forms like 3n+1, +3n-2, -n+6
- :not(...) (simple selector payload)
parser guardrails:
- multiple #id predicates in one compound (for example #a#b) are rejected as invalid

Compilation modes:

comptime selectors fail at compile time when invalid
runtime selectors return error.InvalidSelector
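
A minimal sketch of the runtime failure path, using the `#a#b` guardrail case mentioned above as the invalid selector. The commented-out line shows the comptime equivalent, which per the description above would be rejected at compile time.

```zig
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "invalid runtime selector is reported as an error" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<div id='a'></div>".*;
    try doc.parse(&input, .{});

    // Two #id predicates in one compound: rejected by the selector parser.
    try std.testing.expectError(error.InvalidSelector, doc.queryOneRuntime("#a#b"));

    // The same selector as a comptime argument would fail the build instead:
    // _ = doc.queryOne("#a#b");
}
```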

Mode Guidance

htmlparser is permissive by design. Choose parse options by workload:

| Mode | Parse Options | Best For | Tradeoffs |
| --- | --- | --- | --- |
| `strictest` | `.eager_child_views = true`, `.drop_whitespace_text_nodes = false` | traversal predictability and text fidelity | higher parse-time work |
| `fastest` | `.eager_child_views = false`, `.drop_whitespace_text_nodes = true` | throughput-first scraping | whitespace-only text nodes dropped; child views built lazily |

Fallback playbook:

1. Start with fastest for bulk workloads.
2. Move domains whose pages give unstable results to strictest.
3. Use queryOneRuntimeDebug and QueryDebugReport to diagnose a miss before changing selectors.
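
The two modes and the fallback order can be written down as constants. This is a sketch, assuming the same options value is passed to both `GetDocument()` and `parse`; `hasNavLink` is a hypothetical helper. Because parsing mutates its input, each attempt below works on its own copy.

```zig
const std = @import("std");
const html = @import("htmlparser");

const fastest: html.ParseOptions = .{
    .eager_child_views = false,
    .drop_whitespace_text_nodes = true,
};
const strictest: html.ParseOptions = .{
    .eager_child_views = true,
    .drop_whitespace_text_nodes = false,
};

/// Hypothetical helper: parse `input` under `opts` and report whether the
/// nav link is present.
fn hasNavLink(comptime opts: html.ParseOptions, input: []u8) !bool {
    const Document = opts.GetDocument();
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();
    try doc.parse(input, opts);
    return doc.queryOne("div#app > a.nav") != null;
}

test "fastest first, strictest as fallback" {
    const page = "<div id='app'>\n  <a class='nav' href='/docs'>Docs</a>\n</div>";
    // One private, mutable copy per parse attempt.
    var first_try = page.*;
    var second_try = page.*;

    const found = (try hasNavLink(fastest, &first_try)) or
        (try hasNavLink(strictest, &second_try));
    try std.testing.expect(found);
}
```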

Performance and Benchmarks

Run benchmarks:

zig build bench-compare
zig build tools -- run-benchmarks --profile quick
zig build tools -- run-benchmarks --profile stable

Artifacts:

bench/results/latest.md
bench/results/latest.json

Benchmark policy:

parse comparisons include strlen, lexbor, and parse-only lol-html
query parse/match/cached sections benchmark htmlparser
repeated runtime selector workloads should use cached selectors

Latest Benchmark Snapshot

Warning: throughput numbers are not conformance claims. This parser is permissive by design; see Conformance Status.

Source: bench/results/latest.json (stable profile).

Parse Throughput Comparison (MB/s)

| Fixture | ours | lol-html | lexbor |
| --- | ---: | ---: | ---: |
| `rust-lang.html` | 2124.97 | 801.16 | 323.63 |
| `wiki-html.html` | 1748.06 | 1186.82 | 263.60 |
| `mdn-html.html` | 2993.42 | 1815.62 | 398.56 |
| `w3-html52.html` | 974.07 | 741.27 | 192.43 |
| `hn.html` | 1550.52 | 869.40 | 216.62 |
| `python-org.html` | 2081.63 | 1304.94 | 274.35 |
| `kernel-org.html` | 1981.37 | 1303.81 | 284.72 |
| `gnu-org.html` | 2401.47 | 1444.13 | 306.67 |
| `ziglang-org.html` | 2039.74 | 1267.60 | 285.58 |
| `ziglang-doc-master.html` | 1386.53 | 1013.46 | 221.28 |
| `wikipedia-unicode-list.html` | 1628.86 | 1039.14 | 221.72 |
| `whatwg-html-spec.html` | 1320.16 | 872.68 | 217.95 |
| `synthetic-forms.html` | 1395.06 | 758.12 | 185.42 |
| `synthetic-table-grid.html` | 1059.29 | 698.14 | 165.95 |
| `synthetic-list-nested.html` | 1145.71 | 625.10 | 158.75 |
| `synthetic-comments-doctype.html` | 1807.36 | 895.79 | 217.90 |
| `synthetic-template-rich.html` | 865.06 | 453.51 | 140.55 |
| `synthetic-whitespace-noise.html` | 1458.04 | 1016.04 | 184.45 |
| `synthetic-news-feed.html` | 1150.04 | 621.22 | 154.10 |
| `synthetic-ecommerce.html` | 1115.61 | 617.51 | 159.84 |
| `synthetic-forum-thread.html` | 1139.72 | 624.18 | 157.97 |

Query Match Throughput (ours)

| Case | ours ops/s | ours ns/op |
| --- | ---: | ---: |
| `attr-heavy-button` | 143231.07 | 6981.73 |
| `attr-heavy-nav` | 96598.54 | 10352.12 |

Cached Query Throughput (ours)

| Case | ours ops/s | ours ns/op |
| --- | ---: | ---: |
| `attr-heavy-button` | 154529.03 | 6471.28 |
| `attr-heavy-nav` | 114815.30 | 8709.64 |

Query Parse Throughput (ours)

| Selector case | Ops/s | ns/op |
| --- | ---: | ---: |
| `simple` | 10316641.45 | 96.93 |
| `complex` | 4859071.50 | 205.80 |
| `grouped` | 5679185.67 | 176.08 |

For full per-parser, per-fixture tables and gate output:

bench/results/latest.md
bench/results/latest.json

Conformance Status

Run conformance suites:

zig build conformance
# or
zig build tools -- run-external-suites --mode both

Artifact: bench/results/external_suite_report.json

Tracked suites:

selector suites: nwmatcher, qwery_contextual
parser suites:
- html5lib tree-construction subset
- WHATWG HTML parsing corpus (via WPT html/syntax/parsing/html5lib_*.html)

Fetched suite repos are cached under bench/.cache/suites/ (gitignored).

Architecture

Core modules:

src/html/parser.zig: permissive parse pipeline
src/html/scanner.zig: byte-scanning hot-path helpers
src/html/tags.zig: tag metadata and hash dispatch
src/html/attr_inline.zig: in-place attribute traversal/lazy materialization
src/html/entities.zig: entity decode utilities
src/selector/runtime.zig, src/selector/compile_time.zig: selector parsing
src/selector/matcher.zig: selector matching/combinator traversal

Data model highlights:

Document owns source bytes and node/index storage
nodes are contiguous and linked by indexes for traversal
attributes are traversed directly from source spans (no heap attribute objects)
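
As an illustration of the "contiguous nodes linked by indexes" idea, here is a small standalone sketch. The `Node` struct, its field names, and the `none` sentinel are hypothetical and chosen for the example; they are not the library's actual layout.

```zig
const std = @import("std");

/// Sentinel index meaning "no such node" (hypothetical convention).
const none: u32 = std.math.maxInt(u32);

/// Hypothetical node record: links are indexes into one contiguous slice,
/// so traversal never chases heap pointers.
const Node = struct {
    tag: []const u8,
    parent: u32 = none,
    first_child: u32 = none,
    next_sibling: u32 = none,
};

/// Walk the sibling chain that starts at `first_child` and count its length.
fn countChildren(nodes: []const Node, index: u32) usize {
    var count: usize = 0;
    var child = nodes[index].first_child;
    while (child != none) : (child = nodes[child].next_sibling) count += 1;
    return count;
}

test "index-linked traversal" {
    const nodes = [_]Node{
        .{ .tag = "html", .first_child = 1 },
        .{ .tag = "head", .parent = 0, .next_sibling = 2 },
        .{ .tag = "body", .parent = 0 },
    };
    try std.testing.expectEqual(@as(usize, 2), countChildren(&nodes, 0));
    try std.testing.expectEqual(@as(usize, 0), countChildren(&nodes, 2));
}
```

Index links stay valid when the backing storage is reallocated, which is one common reason to prefer them over pointers in a growable node array.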

Troubleshooting

Query returns nothing

validate selector syntax (queryOneRuntime can return error.InvalidSelector)
check scope (Document vs scoped Node)
use queryOneRuntimeDebug and inspect QueryDebugReport

Unexpected `innerText`

default innerText normalizes whitespace
use innerTextWithOptions(..., .{ .normalize_whitespace = false }) for raw spacing
use innerTextOwned(...) when output must always be allocated
use doc.isOwned(slice) to check borrowed vs allocated
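
The checks above can be combined into one quick experiment. This sketch assumes the `innerText*` calls return an error union of a byte slice; it prints the results instead of asserting them, since the exact normalized output is not specified here. An arena is used as the allocator so that cleanup is the same whether a result was borrowed or allocated.

```zig
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "normalized vs raw innerText" {
    // Anything innerText allocates is released with the arena.
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();
    const scratch = arena.allocator();

    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<p>  hello \n   world  </p>".*;
    try doc.parse(&input, .{});
    const p = doc.queryOne("p") orelse return error.TestUnexpectedResult;

    const normalized = try p.innerText(scratch);
    const raw = try p.innerTextWithOptions(scratch, .{ .normalize_whitespace = false });

    // isOwned reports whether the slice points into the document source bytes.
    std.debug.print("normalized: '{s}' (in source: {})\n", .{ normalized, doc.isOwned(normalized) });
    std.debug.print("raw:        '{s}' (in source: {})\n", .{ raw, doc.isOwned(raw) });
}
```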

Runtime iterator invalidation

queryAllRuntime iterators are invalidated by newer queryAllRuntime calls on the same Document.
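
One way to stay clear of this is to drain each iterator before issuing the next runtime query. The sketch below assumes the iterator exposes a `next()` method that returns an optional node, following the usual Zig convention; this manual does not spell out the iterator API.

```zig
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "finish one runtime iterator before starting the next" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>".*;
    try doc.parse(&input, .{});

    // Assumption: next() yields matches until it returns null.
    var links = try doc.queryAllRuntime("a");
    var link_count: usize = 0;
    while (links.next()) |_| link_count += 1;

    // This call invalidates `links`; it must not be used from here on.
    var items = try doc.queryAllRuntime("li");
    var item_count: usize = 0;
    while (items.next()) |_| item_count += 1;

    try std.testing.expectEqual(@as(usize, 2), link_count);
    try std.testing.expectEqual(@as(usize, 2), item_count);
}
```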

Input buffer changed

This is expected: parse and lazy decode paths mutate source bytes in place. Parse a copy if the original bytes must be preserved.
