Commit cad1d18

Cleanup docs and README
1 parent 525668f commit cad1d18

6 files changed

Lines changed: 137 additions & 277 deletions

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ zig fmt src/**/*.zig examples/*.zig build.zig

 ## Documentation and Snippet Policy

-- User-facing snippets in `README.md` and `docs/` must match canonical code in `examples/`.
+- User-facing snippets in `README.md` and `DOCUMENTATION.md` must match canonical code in `examples/`.
 - Every example file must contain executable tests.
 - Run `zig build examples-check` before merging doc/example changes.

docs/README.md renamed to DOCUMENTATION.md

Lines changed: 66 additions & 20 deletions
@@ -1,6 +1,6 @@
-# htmlparser Manual
+# htmlparser Documentation

-This is the single source of truth for library usage, behavior contracts, performance workflow, and implementation notes.
+This is the canonical manual for usage, API, selector behavior, performance workflow, conformance expectations, and internals.

 ## Table of Contents

@@ -10,6 +10,7 @@ This is the single source of truth for library usage, behavior contracts, perfor
 - [Selector Support](#selector-support)
 - [Mode Guidance](#mode-guidance)
 - [Performance and Benchmarks](#performance-and-benchmarks)
+- [Latest Benchmark Snapshot](#latest-benchmark-snapshot)
 - [Conformance Status](#conformance-status)
 - [Architecture](#architecture)
 - [Troubleshooting](#troubleshooting)
@@ -39,7 +40,7 @@ test "basic parse + query" {
 }
 ```

-Canonical examples live in `examples/` and are verified by `zig build examples-check`
+Source example: `examples/basic_parse_query.zig` (verified by `zig build examples-check`)

 ## Core API

@@ -79,7 +80,7 @@ Canonical examples live in `examples/` and are verified by `zig build examples-c
 - `prevSibling()`
 - `children()` (borrowed `[]const u32` index view)
 - Text:
-- `innerText(allocator)` (may return borrowed or allocated)
+- `innerText(allocator)` (borrowed or allocated depending on shape)
 - `innerTextWithOptions(allocator, TextOptions)`
 - `innerTextOwned(allocator)` (always allocated)
 - `innerTextOwnedWithOptions(allocator, TextOptions)`
@@ -88,12 +89,12 @@ Canonical examples live in `examples/` and are verified by `zig build examples-c
 - Scoped queries:
 - same query family as `Document` (`queryOne/queryAll`, runtime, cached, debug)

-### Additional helpers
+### Helpers

 - `doc.html()`, `doc.head()`, `doc.body()`
-- `doc.isOwned(slice)` to check whether a returned slice points into document source bytes
+- `doc.isOwned(slice)` to check whether a slice points into document source bytes

-### Options
+### Parse/Text options

 - `ParseOptions`
 - `eager_child_views: bool = true`
@@ -136,18 +137,18 @@ Compilation modes:

 ## Mode Guidance

-`htmlparser` is permissive by design. Choose parse options per site behavior:
+`htmlparser` is permissive by design. Choose parse options by workload:

 | Mode | Parse Options | Best For | Tradeoffs |
 |---|---|---|---|
-| `strictest` | `.eager_child_views = true`, `.drop_whitespace_text_nodes = false` | Maximum traversal predictability and text fidelity | More parse-time work |
-| `fastest` | `.eager_child_views = false`, `.drop_whitespace_text_nodes = true` | Throughput-first scraping | Whitespace-only text nodes dropped; child views built lazily |
+| `strictest` | `.eager_child_views = true`, `.drop_whitespace_text_nodes = false` | traversal predictability and text fidelity | higher parse-time work |
+| `fastest` | `.eager_child_views = false`, `.drop_whitespace_text_nodes = true` | throughput-first scraping | whitespace-only text nodes dropped; child views built lazily |

 Fallback playbook:

 1. Start with `fastest` for bulk workloads.
-2. Switch problematic domains to `strictest` if text/navigation assumptions fail.
-3. Use `queryOneRuntimeDebug` and inspect `QueryDebugReport` before changing selectors.
+2. Move unstable domains to `strictest`.
+3. Use `queryOneRuntimeDebug` and `QueryDebugReport` before changing selectors.

 ## Performance and Benchmarks

@@ -164,12 +165,57 @@ Artifacts:
 - `bench/results/latest.md`
 - `bench/results/latest.json`

-Notes:
+Benchmark policy:

 - parse comparisons include `strlen`, `lexbor`, and parse-only `lol-html`
 - query parse/match/cached sections benchmark `htmlparser`
 - repeated runtime selector workloads should use cached selectors

+## Latest Benchmark Snapshot
+
+Warning: throughput numbers are not conformance claims. This parser is permissive by design; see [Conformance Status](#conformance-status).
+
+<!-- BENCHMARK_SNAPSHOT:START -->
+
+Source: `bench/results/latest.json` (`stable` profile).
+
+#### Parse Throughput Comparison (MB/s)
+
+| Fixture | ours-fastest | ours-strictest | lol-html | lexbor |
+|---|---:|---:|---:|---:|
+| `rust-lang.html` | 1657.11 | 1880.99 | 1472.30 | 339.06 |
+| `wiki-html.html` | 1269.14 | 1076.93 | 905.54 | 256.92 |
+| `mdn-html.html` | 1966.96 | 1904.34 | 1757.21 | 315.31 |
+| `w3-html52.html` | 902.64 | 825.04 | 735.91 | 182.75 |
+| `hn.html` | 1355.63 | 1252.24 | 858.87 | 220.22 |
+
+#### Query Match Throughput (ours)
+
+| Case | strictest ops/s | strictest ns/op | fastest ops/s | fastest ns/op |
+|---|---:|---:|---:|---:|
+| `attr-heavy-button` | 140088984.52 | 7.14 | 146858189.33 | 6.81 |
+| `attr-heavy-nav` | 135268575.76 | 7.39 | 143792203.01 | 6.95 |
+
+#### Cached Query Throughput (ours)
+
+| Case | strictest ops/s | strictest ns/op | fastest ops/s | fastest ns/op |
+|---|---:|---:|---:|---:|
+| `attr-heavy-button` | 210881929.32 | 4.74 | 210389894.55 | 4.75 |
+| `attr-heavy-nav` | 169021702.39 | 5.92 | 197867776.84 | 5.05 |
+
+#### Query Parse Throughput (ours)
+
+| Selector case | Ops/s | ns/op |
+|---|---:|---:|
+| `simple` | 20005017.26 | 49.99 |
+| `complex` | 6688312.83 | 149.51 |
+| `grouped` | 7306593.87 | 136.86 |
+
+For full per-parser, per-fixture tables and gate output:
+- `bench/results/latest.md`
+- `bench/results/latest.json`
+<!-- BENCHMARK_SNAPSHOT:END -->
+
 ## Conformance Status

 Run conformance suites:
@@ -180,7 +226,7 @@ zig build conformance
 zig build tools -- run-external-suites --mode both
 ```

-Report artifact: `bench/results/external_suite_report.json`
+Artifact: `bench/results/external_suite_report.json`

 Tracked suites:

@@ -203,21 +249,21 @@ Data model highlights:

 - `Document` owns source bytes and node/index storage
 - nodes are contiguous and linked by indexes for traversal
-- attributes are traversed directly from source spans (no heap attr objects)
+- attributes are traversed directly from source spans (no heap attribute objects)

 ## Troubleshooting

 ### Query returns nothing

-- validate selector syntax (`queryOneRuntime` returns `error.InvalidSelector`)
-- check query scope (`Document` vs scoped `Node`)
-- use `queryOneRuntimeDebug` + `QueryDebugReport` for near-miss reasons
+- validate selector syntax (`queryOneRuntime` can return `error.InvalidSelector`)
+- check scope (`Document` vs scoped `Node`)
+- use `queryOneRuntimeDebug` and inspect `QueryDebugReport`

 ### Unexpected `innerText`

 - default `innerText` normalizes whitespace
 - use `innerTextWithOptions(..., .{ .normalize_whitespace = false })` for raw spacing
-- use `innerTextOwned(...)` when you always require allocated output
+- use `innerTextOwned(...)` when output must always be allocated
 - use `doc.isOwned(slice)` to check borrowed vs allocated

 ### Runtime iterator invalidation
@@ -226,4 +272,4 @@ Data model highlights:

 ### Input buffer changed

-Expected behavior: parsing and lazy decode paths mutate source bytes in place.
+Expected: parse and lazy decode paths mutate source bytes in place.
