Releases: Expl0dingCat/safehere
Releases · Expl0dingCat/safehere
v1.0.0-stable
safehere v1.0.0-stable
Runtime tool-output scanning middleware for Cohere AI agents. Detects and blocks prompt injection attacks hiding in tool results before they reach the model.
Highlights
- 5 detection layers: pattern matching, schema drift, statistical anomaly, heuristic instruction classification, TF-IDF semantic classifier
- 1,028-sample evaluation corpus across 50+ attack categories
- Pre-trained model bundled --
pip install safehere[ml]works out of the box - Regex timeout protection prevents ReDoS denial-of-service
- 0.5% FPR on 405 benign samples, 97.6% TPR on 623 adversarial samples
Install
pip install safehere # core (4 rule-based scanners)
pip install safehere[ml] # + TF-IDF semantic scanner
pip install safehere[cohere] # + Cohere managed loop
pip install safehere[all] # everythingv1.0.0-beta.1
safehere v1.0.0-beta.1
Runtime tool-output scanning middleware for Cohere AI agents. Detects and blocks prompt injection attacks hiding in tool results before they reach the model.
Highlights
- 5 detection layers: pattern matching, schema drift, statistical anomaly, heuristic instruction classification, and TF-IDF semantic classifier
- 1,028-sample evaluation corpus across 50+ attack categories -- narrative injection, roleplay hijacking, fake compliance requests, translation-based attacks, persona splitting, encoding evasion, and more
- Regex timeout protection (50ms per pattern) prevents ReDoS denial-of-service
- Security hardened via red-team audit: recursive encoding decoder, RTL override reversal, Unicode tag character decoding, homoglyph expansion, schema recursion depth guard, anomaly cold-start hardening
Benchmarks
| Metric | Result |
|---|---|
| Detection (623 adversarial) | 97.6% TPR |
| False positives (405 benign) | 0.5% FPR |
| Semantic classifier (held-out 20%) | 0.96 F1 |
| CyberArk-style live attacks | 10/10 blocked |
| Latency (with semantic scanner) | ~12ms P50 |
Install
pip install safehere # core (4 rule-based scanners)
pip install safehere[ml] # + TF-IDF semantic scanner
pip install safehere[cohere] # + Cohere managed loop (run/arun)
pip install safehere[all] # everythingKnown limitations
- Narrative/analogy attacks with zero injection vocabulary can evade all layers
- Low signal density (<5% payload in long documents) evades density-based filtering
- Payload splitting across multiple tool outputs is not detected
- Metrics are self-evaluated, not independently audited
- Semantic model must be trained locally (
python -m safehere.scanners.semantic --train)
Breaking changes from v0.x
cohereis now an optional dependency (pip install safehere[cohere])- Python 3.8 is no longer supported (minimum 3.9)
- Scoring weights redistributed to accommodate the semantic scanner
SemanticScanneradded to the default pipeline (degrades gracefully if scikit-learn is not installed)
v0.3.0-alpha
Full Changelog: v0.1.1-alpha...v0.3.0-alpha
v0.1.1-alpha
Initial pre-release. Pattern matching, schema drift detection, statistical anomaly detection, and heuristic instruction classification for scanning tool outputs before they reach the model. See README for usage.