
perf: cached Scanner API for 13x faster library-mode scanning #42

Merged
garagon merged 1 commit into main from feature/cached-scanner-api
Mar 26, 2026
Conversation

@garagon garagon (Owner) commented Mar 26, 2026

Summary

  • Add NewScanner(opts...) that pre-compiles rules, regex patterns, and Aho-Corasick automata once. Scanner.ScanContent() reuses the cached matcher, dropping per-scan latency geomean from ~10ms to ~1.7ms (-82.7%).
  • Filter decoder rescan to all-target rules only (extension-specific rules are irrelevant for decoded content).
  • NLP fast-path: skip Goldmark parsing for structureless plain text while preserving authority claim and credential exfil combo detection.
  • Post-processing: early return on 0 findings, O(n log n) proximity check replacing O(n^2).
  • Memory: 99.9% fewer allocations per scan (151K to 85).
  • Backwards compatible: existing ScanContent() package-level API unchanged.

Benchmarks (Apple M4 Max, benchstat 6 iterations, p=0.002)

| Scenario | Production | Cached | Change |
| --- | --- | --- | --- |
| Short message | 9.7ms | 0.7ms | -92.5% |
| JSON config | 11.2ms | 1.9ms | -82.6% |
| Structured markdown | 13.4ms | 4.3ms | -68.0% |
| Plain text | 11.3ms | 2.3ms | -79.8% |
| Latency geomean | 9.7ms | 1.7ms | -82.7% |
| Concurrent (8 threads) | 1.9ms/op | 0.08ms/op | -95.6% |
| Memory per scan | 13.2MB | 9KB | -99.9% |

Test plan

  • make build && make test && make vet && make lint all passing (0 issues)
  • Correctness test: cached and production APIs produce identical findings and verdicts across 7 content types
  • Concurrency test: 10 goroutines scanning in parallel without races (-race)
  • NLP fast-path preserves authority claim and cred+exfil combo detection on plain text
  • Comprehensive benchmarks in bench_test.go covering 7 scenarios + mixed workload + concurrent throughput

Add NewScanner() that pre-compiles rules, regex patterns, and
Aho-Corasick automata once at startup. Subsequent ScanContent calls
reuse the cached matcher, dropping per-scan latency from ~10ms to
~0.7ms on typical agent messages.

Additional optimizations:
- Filter decoder rescan to all-target rules only (skip extension-specific)
- NLP fast-path: skip markdown analysis for structureless plain text
- Post-processing: early return on 0 findings, O(n log n) proximity check
- Memory: 99.9% fewer allocations (151K -> 85 per scan)

Backwards compatible: existing ScanContent() package-level API unchanged.
@garagon garagon merged commit d06559e into main Mar 26, 2026
1 check passed