This is the canonical manual for usage, API, selector behavior, performance workflow, conformance expectations, and internals.
- Requirements
- Quick Start
- Core API
- Selector Support
- Mode Guidance
- Performance and Benchmarks
- Latest Benchmark Snapshot
- Conformance Status
- Architecture
- Troubleshooting
- Zig `0.16.0-dev.2984+cb7d2b056`
- Mutable input buffers (`[]u8`) for parsing
```zig
const std = @import("std");
const html = @import("htmlparser");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "basic parse + query" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*;
    try doc.parse(&input, .{});

    const a = doc.queryOne("div#app > a.nav") orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("/docs", a.getAttributeValue("href").?);
}
```

Source examples:
- `examples/basic_parse_query.zig`
- `examples/query_time_decode.zig`

All examples are verified by running `zig build examples-check`.
- `const opts: ParseOptions = .{};`
- `const Document = opts.GetDocument();`
- `Document.init(allocator)`
- `doc.deinit()`
- `doc.clear()`
- `doc.parse(input: []u8, comptime opts: ParseOptions)`
- Compile-time selectors:
  - `doc.queryOne(comptime selector)`
  - `doc.queryAll(comptime selector)`
- Runtime selectors:
  - `try doc.queryOneRuntime(selector)`
  - `try doc.queryAllRuntime(selector)`
- Cached runtime selectors:
  - `doc.queryOneCached(&selector)`
  - `doc.queryAllCached(&selector)`
  - selector created via `try Selector.compileRuntime(allocator, source)`
- Diagnostics:
  - `doc.queryOneDebug(comptime selector, report)`
  - `try doc.queryOneRuntimeDebug(selector, report)`
- Navigation:
  - `tagName()`
  - `parentNode()`
  - `firstChild()`
  - `lastChild()`
  - `nextSibling()`
  - `prevSibling()`
  - `children()` (iterator of wrapped child nodes; `collect(allocator)` returns an owned `[]Node`)
- Text:
  - `innerText(allocator)` (borrowed or allocated depending on shape)
  - `innerTextWithOptions(allocator, TextOptions)`
  - `innerTextOwned(allocator)` (always allocated)
  - `innerTextOwnedWithOptions(allocator, TextOptions)`
- Attributes:
  - `getAttributeValue(name)`
- Scoped queries:
  - same query family as `Document` (`queryOne`/`queryAll`, runtime, cached, debug)

- `doc.html()`, `doc.head()`, `doc.body()`
- `doc.isOwned(slice)` to check whether a slice points into document source bytes

- `ParseOptions`
  - `eager_child_views: bool = true`
  - `drop_whitespace_text_nodes: bool = false`
- `TextOptions`
  - `normalize_whitespace: bool = true`
- parse/query work split:
  - parse keeps raw text and attribute spans in-place
  - entity decode and whitespace normalization are applied by query-time APIs (`getAttributeValue`, `innerText*`, selector attribute predicates)

- `parseWithHooks(doc, input, opts, hooks)`
- `queryOneRuntimeWithHooks(doc, selector, hooks)`
- `queryOneCachedWithHooks(doc, selector, hooks)`
- `queryAllRuntimeWithHooks(doc, selector, hooks)`
- `queryAllCachedWithHooks(doc, selector, hooks)`
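
The runtime and cached query entries listed above can be compared side by side. The following is a minimal sketch, not a reference implementation. It assumes that `Selector` is exported as `html.Selector`, that `queryOneRuntime` yields an error union around an optional node, and that an arena is an acceptable way to release a compiled selector (the list above does not describe selector cleanup).

```zig
const std = @import("std");
const html = @import("htmlparser");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "runtime vs cached selector (sketch)" {
    // Assumption: an arena releases whatever compileRuntime allocates.
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();

    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<ul><li class='x'>a</li><li class='y'>b</li></ul>".*;
    try doc.parse(&input, .{});

    // One-off lookup: the selector source is handed over on every call.
    const once = (try doc.queryOneRuntime("li.y")) orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("y", once.getAttributeValue("class").?);

    // Repeated lookups: compile once, then reuse the compiled selector.
    // Assumption: `Selector` is reachable as `html.Selector`.
    var selector = try html.Selector.compileRuntime(arena.allocator(), "li.y");
    const cached = doc.queryOneCached(&selector) orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("y", cached.getAttributeValue("class").?);
}
```

Compiling once and reusing the selector is the pattern the benchmark policy recommends for repeated runtime selector workloads.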
Supported selectors:
- tag selectors and universal `*`
- `#id`, `.class`
- attributes: `[a]`, `[a=v]`, `[a^=v]`, `[a$=v]`, `[a*=v]`, `[a~=v]`, `[a|=v]`
- combinators:
  - descendant (`a b`)
  - child (`a > b`)
  - adjacent sibling (`a + b`)
  - general sibling (`a ~ b`)
- grouping: `a, b, c`
- pseudo-classes:
  - `:first-child`
  - `:last-child`
  - `:nth-child(An+B)` with `odd`/`even` and forms like `3n+1`, `+3n-2`, `-n+6`
  - `:not(...)` (simple selector payload)
- parser guardrails:
  - multiple `#id` predicates in one compound (for example `#a#b`) are rejected as invalid
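
To make the `An+B` forms above concrete, here is a small standalone sketch of the standard CSS rule: a 1-based sibling position matches when it equals `A*n + B` for some integer `n >= 0`. The function name `matchesNth` is hypothetical and this is not the library's matcher, only an illustration of the semantics.

```zig
const std = @import("std");

/// Hypothetical helper: true when the 1-based sibling position `index`
/// equals a*n + b for some integer n >= 0.
fn matchesNth(a: i32, b: i32, index: i32) bool {
    const diff = index - b;
    // With a == 0 the formula selects exactly position b.
    if (a == 0) return diff == 0;
    // Otherwise diff must be a multiple of a with a non-negative quotient.
    if (@rem(diff, a) != 0) return false;
    return @divTrunc(diff, a) >= 0;
}

test "An+B forms from the selector list" {
    // odd is 2n+1, even is 2n+0
    try std.testing.expect(matchesNth(2, 1, 3));
    try std.testing.expect(!matchesNth(2, 1, 4));
    try std.testing.expect(matchesNth(2, 0, 4));
    // 3n+1 and +3n-2 both select positions 1, 4, 7, ...
    try std.testing.expect(matchesNth(3, 1, 7));
    try std.testing.expect(matchesNth(3, -2, 4));
    // -n+6 selects only the first six positions
    try std.testing.expect(matchesNth(-1, 6, 6));
    try std.testing.expect(!matchesNth(-1, 6, 7));
}
```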
Compilation modes:
- comptime selectors fail at compile time when invalid
- runtime selectors return `error.InvalidSelector`
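
The runtime failure mode can be exercised with the `#a#b` guardrail case mentioned earlier. This is a sketch under the assumption that `queryOneRuntime` returns an error union, as the `try` in the API listing suggests.

```zig
const std = @import("std");
const html = @import("htmlparser");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "invalid runtime selector is reported as an error (sketch)" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<p id='a'>x</p>".*;
    try doc.parse(&input, .{});

    // Two #id predicates in one compound are rejected by the selector parser.
    try std.testing.expectError(error.InvalidSelector, doc.queryOneRuntime("#a#b"));
}
```

The same selector passed to `doc.queryOne` would instead stop the build, since comptime selectors are validated at compile time.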
htmlparser is permissive by design. Choose parse options by workload:
| Mode | Parse Options | Best For | Tradeoffs |
|---|---|---|---|
| `strictest` | `.eager_child_views = true, .drop_whitespace_text_nodes = false` | traversal predictability and text fidelity | higher parse-time work |
| `fastest` | `.eager_child_views = false, .drop_whitespace_text_nodes = true` | throughput-first scraping | whitespace-only text nodes dropped; child views built lazily |
Fallback playbook:
- Start with `fastest` for bulk workloads.
- Move unstable domains to `strictest`.
- Use `queryOneRuntimeDebug` and `QueryDebugReport` before changing selectors.
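
As a starting point for the playbook, the `fastest` row of the table might be wired up as follows. This is a sketch: the names `fastest` and `FastDocument` are illustrative, and passing the same options value to both `GetDocument()` and `parse` is an assumption modeled on the quick-start example.

```zig
const std = @import("std");
const html = @import("htmlparser");

// Option values taken from the `fastest` row of the mode table.
const fastest: html.ParseOptions = .{
    .eager_child_views = false,
    .drop_whitespace_text_nodes = true,
};
const FastDocument = fastest.GetDocument();

test "fastest mode (sketch)" {
    var doc = FastDocument.init(std.testing.allocator);
    defer doc.deinit();

    // Whitespace-only text nodes between the elements are dropped in this mode.
    var input = "<ul>\n  <li>a</li>\n  <li>b</li>\n</ul>".*;
    // Assumption: parse receives the same comptime options as GetDocument().
    try doc.parse(&input, fastest);

    try std.testing.expect(doc.queryOne("ul > li:first-child") != null);
}
```

Moving a workload to `strictest` would then mean swapping the two field values, leaving the rest of the code unchanged.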
Run benchmarks:
```bash
zig build bench-compare
zig build tools -- run-benchmarks --profile quick
zig build tools -- run-benchmarks --profile stable
```

Artifacts:
- `bench/results/latest.md`
- `bench/results/latest.json`
Benchmark policy:
- parse comparisons include `strlen`, `lexbor`, and parse-only `lol-html`
- query parse/match/cached sections benchmark `htmlparser`
- repeated runtime selector workloads should use cached selectors
Warning: throughput numbers are not conformance claims. This parser is permissive by design; see Conformance Status.
Source: bench/results/latest.json (stable profile).
| Fixture | ours | lol-html | lexbor |
|---|---|---|---|
| `rust-lang.html` | 2124.97 | 801.16 | 323.63 |
| `wiki-html.html` | 1748.06 | 1186.82 | 263.60 |
| `mdn-html.html` | 2993.42 | 1815.62 | 398.56 |
| `w3-html52.html` | 974.07 | 741.27 | 192.43 |
| `hn.html` | 1550.52 | 869.40 | 216.62 |
| `python-org.html` | 2081.63 | 1304.94 | 274.35 |
| `kernel-org.html` | 1981.37 | 1303.81 | 284.72 |
| `gnu-org.html` | 2401.47 | 1444.13 | 306.67 |
| `ziglang-org.html` | 2039.74 | 1267.60 | 285.58 |
| `ziglang-doc-master.html` | 1386.53 | 1013.46 | 221.28 |
| `wikipedia-unicode-list.html` | 1628.86 | 1039.14 | 221.72 |
| `whatwg-html-spec.html` | 1320.16 | 872.68 | 217.95 |
| `synthetic-forms.html` | 1395.06 | 758.12 | 185.42 |
| `synthetic-table-grid.html` | 1059.29 | 698.14 | 165.95 |
| `synthetic-list-nested.html` | 1145.71 | 625.10 | 158.75 |
| `synthetic-comments-doctype.html` | 1807.36 | 895.79 | 217.90 |
| `synthetic-template-rich.html` | 865.06 | 453.51 | 140.55 |
| `synthetic-whitespace-noise.html` | 1458.04 | 1016.04 | 184.45 |
| `synthetic-news-feed.html` | 1150.04 | 621.22 | 154.10 |
| `synthetic-ecommerce.html` | 1115.61 | 617.51 | 159.84 |
| `synthetic-forum-thread.html` | 1139.72 | 624.18 | 157.97 |

| Case | ours ops/s | ours ns/op |
|---|---|---|
| `attr-heavy-button` | 143231.07 | 6981.73 |
| `attr-heavy-nav` | 96598.54 | 10352.12 |

| Case | ours ops/s | ours ns/op |
|---|---|---|
| `attr-heavy-button` | 154529.03 | 6471.28 |
| `attr-heavy-nav` | 114815.30 | 8709.64 |

| Selector case | Ops/s | ns/op |
|---|---|---|
| `simple` | 10316641.45 | 96.93 |
| `complex` | 4859071.50 | 205.80 |
| `grouped` | 5679185.67 | 176.08 |
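
The two numeric columns in the query tables above are reciprocal views of the same measurement: operations per second is one billion divided by nanoseconds per operation. A small sketch of that conversion follows; the helper name `opsPerSecond` is hypothetical and not part of the benchmark tooling.

```zig
const std = @import("std");

/// Hypothetical helper: converts nanoseconds per operation to operations per second.
fn opsPerSecond(ns_per_op: f64) f64 {
    return 1_000_000_000.0 / ns_per_op;
}

test "ns/op and ops/s columns agree up to rounding" {
    // attr-heavy-button row from the first query table.
    const expected: f64 = 143231.07;
    try std.testing.expectApproxEqAbs(expected, opsPerSecond(6981.73), 1.0);
}
```

Small differences are expected because both columns are rounded to two decimals.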
For full per-parser, per-fixture tables and gate output:
- `bench/results/latest.md`
- `bench/results/latest.json`
Run conformance suites:
```bash
zig build conformance
# or
zig build tools -- run-external-suites --mode both
```

Artifact: `bench/results/external_suite_report.json`
Tracked suites:
- selector suites: `nwmatcher`, `qwery_contextual`
- parser suites:
  - html5lib tree-construction subset
  - WHATWG HTML parsing corpus (via WPT `html/syntax/parsing/html5lib_*.html`)
Fetched suite repos are cached under bench/.cache/suites/ (gitignored).
Core modules:
- `src/html/parser.zig`: permissive parse pipeline
- `src/html/scanner.zig`: byte-scanning hot-path helpers
- `src/html/tags.zig`: tag metadata and hash dispatch
- `src/html/attr_inline.zig`: in-place attribute traversal/lazy materialization
- `src/html/entities.zig`: entity decode utilities
- `src/selector/runtime.zig`, `src/selector/compile_time.zig`: selector parsing
- `src/selector/matcher.zig`: selector matching/combinator traversal
Data model highlights:
- `Document` owns source bytes and node/index storage
- nodes are contiguous and linked by indexes for traversal
- attributes are traversed directly from source spans (no heap attribute objects)
- validate selector syntax (`queryOneRuntime` can return `error.InvalidSelector`)
- check scope (`Document` vs scoped `Node`)
- use `queryOneRuntimeDebug` and inspect `QueryDebugReport`
- default `innerText` normalizes whitespace
- use `innerTextWithOptions(..., .{ .normalize_whitespace = false })` for raw spacing
- use `innerTextOwned(...)` when output must always be allocated
- use `doc.isOwned(slice)` to check borrowed vs allocated
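
The ownership rules in the list above might translate into cleanup code like the following. It is a sketch that assumes the `innerText*` functions return error unions around byte slices and that a slice for which `doc.isOwned` returns true is borrowed from the document and must not be freed.

```zig
const std = @import("std");
const html = @import("htmlparser");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "innerText ownership (sketch)" {
    const allocator = std.testing.allocator;

    var doc = Document.init(allocator);
    defer doc.deinit();

    var input = "<p>  hello   <b>world</b>  </p>".*;
    try doc.parse(&input, .{});
    const p = doc.queryOne("p") orelse return error.TestUnexpectedResult;

    // innerText may borrow from the document or allocate: free only when allocated.
    const text = try p.innerText(allocator);
    defer if (!doc.isOwned(text)) allocator.free(text);

    // innerTextOwned always allocates, so the caller always frees.
    const owned = try p.innerTextOwned(allocator);
    defer allocator.free(owned);

    // Raw spacing: opt out of whitespace normalization.
    const raw = try p.innerTextWithOptions(allocator, .{ .normalize_whitespace = false });
    defer if (!doc.isOwned(raw)) allocator.free(raw);
}
```

Choosing `innerTextOwned` removes the conditional free at the cost of an allocation on every call.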
queryAllRuntime iterators are invalidated by newer queryAllRuntime calls on the same Document.
Expected: parse and lazy decode paths mutate source bytes in place.
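
Because parsing and lazy decoding rewrite the input in place, read-only data needs a mutable copy before it is handed to `parse`. The sketch below shows one way to do that with `Allocator.dupe`. Keeping the buffer alive until after `doc.deinit()` is an assumption about lifetime, not something the text above states.

```zig
const std = @import("std");
const html = @import("htmlparser");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "parse from read-only bytes (sketch)" {
    const allocator = std.testing.allocator;
    const source: []const u8 = "<a href='/docs?a=1&amp;b=2'>Docs</a>";

    // Copy first: parse needs []u8 and may rewrite bytes in place.
    const buffer = try allocator.dupe(u8, source);
    // Declared before the document, so it is freed after doc.deinit() runs.
    defer allocator.free(buffer);

    var doc = Document.init(allocator);
    defer doc.deinit();
    try doc.parse(buffer, .{});

    // Query-time decode works on the copy, not on the caller's bytes.
    const a = doc.queryOne("a") orelse return error.TestUnexpectedResult;
    _ = a.getAttributeValue("href");

    try std.testing.expectEqualStrings("<a href='/docs?a=1&amp;b=2'>Docs</a>", source);
}
```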