
perf: cached Scanner API for 13x faster library-mode scanning #42

Merged
garagon merged 1 commit into main from feature/cached-scanner-api
Mar 26, 2026
Conversation

@garagon garagon (Owner) commented Mar 26, 2026

Summary

  • Add NewScanner(opts...) that pre-compiles rules, regex patterns, and Aho-Corasick automata once. Scanner.ScanContent() reuses the cached matcher, dropping per-scan latency geomean from ~10ms to ~1.7ms (-82.7%).
  • Filter decoder rescan to all-target rules only (extension-specific rules are irrelevant for decoded content).
  • NLP fast-path: skip Goldmark parsing for structureless plain text while preserving authority claim and credential exfil combo detection.
  • Post-processing: early return on 0 findings, O(n log n) proximity check replacing O(n^2).
  • Memory: 99.9% fewer allocations per scan (151K to 85).
  • Backwards compatible: existing ScanContent() package-level API unchanged.

Benchmarks (Apple M4 Max, benchstat 6 iterations, p=0.002)

| Scenario | Production | Cached | Change |
| --- | --- | --- | --- |
| Short message | 9.7ms | 0.7ms | -92.5% |
| JSON config | 11.2ms | 1.9ms | -82.6% |
| Structured markdown | 13.4ms | 4.3ms | -68.0% |
| Plain text | 11.3ms | 2.3ms | -79.8% |
| Latency geomean | 9.7ms | 1.7ms | -82.7% |
| Concurrent (8 threads) | 1.9ms/op | 0.08ms/op | -95.6% |
| Memory per scan | 13.2MB | 9KB | -99.9% |

Test plan

  • make build && make test && make vet && make lint all passing (0 issues)
  • Correctness test: cached and production APIs produce identical findings and verdicts across 7 content types
  • Concurrency test: 10 goroutines scanning in parallel without races (-race)
  • NLP fast-path preserves authority claim and cred+exfil combo detection on plain text
  • Comprehensive benchmarks in bench_test.go covering 7 scenarios + mixed workload + concurrent throughput

Add NewScanner() that pre-compiles rules, regex patterns, and
Aho-Corasick automata once at startup. Subsequent ScanContent calls
reuse the cached matcher, dropping per-scan latency from ~10ms to
~0.7ms on typical agent messages.

Additional optimizations:
- Filter decoder rescan to all-target rules only (skip extension-specific)
- NLP fast-path: skip markdown analysis for structureless plain text
- Post-processing: early return on 0 findings, O(n log n) proximity check
- Memory: 99.9% fewer allocations (151K -> 85 per scan)

Backwards compatible: existing ScanContent() package-level API unchanged.
@garagon garagon merged commit d06559e into main Mar 26, 2026
1 check passed