You own the harness (correctness + performance) and the reference implementations (ground truth).
Harness code will evolve; the goals and invariants must not.
For the full spec and rationale, see TASK.md.
If you are implementing an optimized candidate renderer, the most relevant starting points are src/benchmarks/README.md and src/impl_candidate/.
Benchmarks are primarily run via the CLI (often headless/automated), not manually in the browser UI.
The demo page should closely match the benchmarked implementation, so the interactive user experience reflects what is actually being measured: same renderer mode, same interaction paths, and the same key rendering/quality policies where applicable. If the demo intentionally differs (e.g. extra UI features), document the differences explicitly.
To avoid drift, the demo and benchmark paths should share as much code as possible via shared modules (core math, interaction semantics, dataset generation, rendering policy knobs, etc.), rather than duplicating logic in multiple entrypoints.
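One way to realize this sharing is a seeded dataset generator that both entrypoints import, which also serves the determinism requirement. A minimal sketch in TypeScript; the module layout, function names, and choice of mulberry32 as the PRNG are illustrative, not the repo's actual API:

```typescript
// Hypothetical shared module (e.g. src/shared/dataset.ts) imported by both
// the demo and the benchmark entrypoints, so dataset generation cannot drift.

// mulberry32: a small, fast, deterministic 32-bit PRNG.
function mulberry32(seed: number): () => number {
  let s = seed | 0;
  return () => {
    s = (s + 0x6d2b79f5) | 0;
    let t = Math.imul(s ^ (s >>> 15), 1 | s);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface Dataset {
  xs: Float64Array;
  ys: Float64Array;
}

// Same seed and count always produce the same points, on every entrypoint.
function generateDataset(seed: number, count: number): Dataset {
  const rand = mulberry32(seed);
  const xs = new Float64Array(count);
  const ys = new Float64Array(count);
  for (let i = 0; i < count; i++) {
    xs[i] = rand() * 2 - 1; // points in [-1, 1)^2
    ys[i] = rand() * 2 - 1;
  }
  return { xs, ys };
}
```

In a real harness these would be exported from the shared module; the point is that neither the demo nor the benchmark owns a private copy of this logic.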
The harness must include (and you must keep up to date) simple, reliable instructions for engineers to run accuracy tests and benchmarks.
Build a lab that can make two objective statements for every supported geometry:
- Correctness: candidate behavior matches the reference (within specified tolerances).
- Performance: candidate is faster under realistic workloads (render + interaction), and regressions are detectable.
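The correctness statement can be made concrete with a tolerance comparison that also feeds actionable failure reports. A sketch, assuming positions are exchanged as flat arrays; the names and shapes here are illustrative:

```typescript
// Worst observed divergence between candidate and reference output.
interface Divergence {
  index: number;
  error: number;
}

// Returns null on pass, or the worst divergence so a failure report can say
// exactly where and by how much the candidate diverged.
function compareWithinTolerance(
  reference: Float64Array,
  candidate: Float64Array,
  tolerance: number,
): Divergence | null {
  let worst: Divergence = { index: -1, error: 0 };
  for (let i = 0; i < reference.length; i++) {
    const err = Math.abs(reference[i] - candidate[i]);
    if (err > worst.error) worst = { index: i, error: err };
  }
  return worst.error <= tolerance ? null : worst;
}
```

Tolerances would be specified per geometry and per operation rather than globally.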
The lab must be geometry-extensible. Euclidean and Poincaré are initial examples, not the final list.
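Geometry extensibility suggests each geometry plugging in behind a common interface. A hypothetical sketch (the interface and names are not the repo's actual API; the Poincaré distance is the standard disk-model formula):

```typescript
// Each geometry supplies its own exact semantics; the harness never falls
// back to Euclidean approximations for non-Euclidean geometries.
interface Geometry {
  name: string;
  distance(ax: number, ay: number, bx: number, by: number): number;
}

const euclidean: Geometry = {
  name: "euclidean",
  distance: (ax, ay, bx, by) => Math.hypot(bx - ax, by - ay),
};

// Poincaré disk model: points lie strictly inside the unit disk.
// d(u, v) = arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))
const poincare: Geometry = {
  name: "poincare",
  distance: (ax, ay, bx, by) => {
    const d2 = (bx - ax) ** 2 + (by - ay) ** 2;
    const denom = (1 - (ax * ax + ay * ay)) * (1 - (bx * bx + by * by));
    return Math.acosh(1 + (2 * d2) / denom);
  },
};
```

Adding a third geometry then means implementing the interface, not touching the harness.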
These are the non-negotiables the harness must enforce:
- Determinism: same dataset seed + same view + same interaction sequence ⇒ same results.
- Explicit semantics per geometry: no “close enough Euclidean” fallbacks for non-Euclidean geometries.
- Anchor-invariant interaction: pan/zoom must preserve the point under the cursor (when the geometry’s model defines such an invariant).
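In the Euclidean case, anchor-invariant zoom reduces to solving for the new view offset so the world point under the cursor projects to the same screen position. A minimal sketch, assuming a simple scale-plus-offset view (the representation and names are illustrative):

```typescript
// A simple Euclidean view: screen = world * scale + offset.
interface View {
  scale: number;
  offsetX: number;
  offsetY: number;
}

const toScreenX = (v: View, wx: number) => wx * v.scale + v.offsetX;
const toScreenY = (v: View, wy: number) => wy * v.scale + v.offsetY;

// Zoom by `factor` while keeping the world point under the cursor (sx, sy)
// fixed on screen -- the invariant the harness should verify.
function zoomAboutCursor(v: View, factor: number, sx: number, sy: number): View {
  // World point currently under the cursor.
  const wx = (sx - v.offsetX) / v.scale;
  const wy = (sy - v.offsetY) / v.scale;
  const scale = v.scale * factor;
  // Choose the new offset so (wx, wy) still projects to (sx, sy).
  return { scale, offsetX: sx - wx * scale, offsetY: sy - wy * scale };
}
```

For non-Euclidean geometries the same invariant is stated in the geometry's own model, and the harness checks it only where that model defines one.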
Measure and report performance in a way that reflects real usage:
- steady-state render
- panning
- hovering/picking
- lasso/selection
Benchmarks must be reproducible and comparable over time.
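Reproducibility follows from describing each workload as data: fixed seeds, fixed sizes, scripted interactions. A hypothetical scenario table (field names and values are illustrative, not the harness's actual config schema):

```typescript
// Each benchmark run is fully described by a scenario, so two runs of the
// same scenario are comparable across machines and over time.
interface Scenario {
  name: "steady-state" | "pan" | "hover" | "lasso";
  datasetSeed: number;   // feeds the shared seeded dataset generator
  pointCount: number;
  warmupFrames: number;  // excluded from measurement
  measuredFrames: number;
}

const scenarios: Scenario[] = [
  { name: "steady-state", datasetSeed: 1, pointCount: 100_000, warmupFrames: 60, measuredFrames: 300 },
  { name: "pan",          datasetSeed: 1, pointCount: 100_000, warmupFrames: 60, measuredFrames: 300 },
  { name: "hover",        datasetSeed: 1, pointCount: 100_000, warmupFrames: 60, measuredFrames: 300 },
  { name: "lasso",        datasetSeed: 1, pointCount: 100_000, warmupFrames: 60, measuredFrames: 300 },
];
```

Because the scenario is pure data, it can be checked into the repo and versioned alongside the results it produced.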
Important harness principle: the harness must not limit the performance potential of the candidate implementation or force it into a particular technical direction (e.g. by requiring a specific rendering pipeline, mandatory GPU readbacks, fixed buffer layouts, or other constraints that bias optimization choices). The harness specifies semantics and measurements; it does not dictate how the candidate achieves them.
Results must be reported in two forms:
- Actionable failures: which operation diverged, by how much, and how to reproduce it.
- Machine-readable summaries: pass/fail + key metrics suitable for CI/regression tracking.
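One possible shape for the machine-readable summary, covering both bullets (pass/fail plus metrics for CI, and divergence details plus repro context for failures); the field names are illustrative, not the harness's actual output schema:

```typescript
// One summary per (geometry, scenario) pair, emitted as a line of JSON so
// CI and regression tooling can parse and diff it trivially.
interface BenchmarkSummary {
  pass: boolean;
  geometry: string;
  scenario: string;
  datasetSeed: number;                   // enough to reproduce the run
  metrics: { meanFrameMs: number; p95FrameMs: number };
  failure?: { operation: string; maxError: number; tolerance: number };
}

function toCiLine(s: BenchmarkSummary): string {
  // One JSON object per line keeps the output grep- and diff-friendly.
  return JSON.stringify(s);
}
```

A failing run would populate `failure` with the diverging operation and error magnitude, directly satisfying the actionable-failures requirement.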
The test suite should continuously evolve to cover accuracy and performance and be robust against “loopholes” (i.e., implementations that game benchmarks or satisfy tests while violating intended semantics).