Detecting Memory Collapse Attacks in State Space Models
Paper Β· Quickstart Β· Experiments Β· Public API
State Space Models (SSMs) such as Mamba achieve linear-time sequence processing through input-dependent recurrence, but this mechanism introduces a critical safety vulnerability: Spectral Collapse. We show that the spectral radius ρ(Ā_t) of the discretized transition operator governs how long information survives in the hidden state, and that adversarial inputs can silently drive it below a critical threshold, destroying long-range recall while outputs remain plausible.
We prove an Evasion Existence Theorem demonstrating that for any output-only defense, adversarial inputs exist that simultaneously induce spectral collapse and evade detection. We then introduce SpectralGuard, a real-time monitor that tracks spectral stability across all model layers, achieving F1 = 0.961 against non-adaptive attackers and retaining F1 = 0.842 under the strongest adaptive setting, with sub-15ms per-token latency.
In architectures like Mamba, information is compressed into a fixed-size hidden state h_t that is updated recurrently through an input-dependent transition matrix Ā_t.

When the spectral radius ρ(Ā_t) drops, the contribution of past tokens decays exponentially and the model loses access to earlier context.

Unlike Transformers, which distribute information across an explicit key-value cache with additive attention aggregation, SSMs multiply the hidden state by Ā_t at every step, so a signal injected t steps ago survives only up to a factor on the order of ρ^t.

We formalize memory capacity through the effective memory horizon: for a retention threshold ε, H_eff(ρ) = ln ε / ln ρ, the number of steps after which the retained signal falls below ε.

In the near-critical regime where ρ ≈ 1, H_eff is extremely sensitive: tiny reductions in ρ shrink the horizon by orders of magnitude, which is precisely the sensitivity an adversary can exploit.
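To make that sensitivity concrete, the horizon can be sketched numerically. This is a minimal illustration assuming the threshold-based definition H(ρ) = ln ε / ln ρ; `memory_horizon` is a hypothetical helper, not part of the repository:

```python
import math

def memory_horizon(rho: float, eps: float = 1e-2) -> float:
    """Number of steps until the retained signal rho**t decays below eps."""
    if rho >= 1.0:
        return float("inf")
    if rho <= 0.0:
        return 0.0
    return math.log(eps) / math.log(rho)

# Near rho = 1 the horizon is enormous; small drops collapse it.
for rho in (0.9999, 0.99, 0.90, 0.32):
    print(f"rho = {rho:<6}  horizon ≈ {memory_horizon(rho):>9.0f} steps")
```

The loop shows the phase-transition behavior: the horizon spans several orders of magnitude as ρ moves only a few percent below 1.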
The Spectral Phase Transition in Mamba-130M (N=500). Left: accuracy as a function of spectral radius, revealing a sharp collapse at Ο β 0.90. Right: accuracy stratified by context distance and spectral regime (r = 0.49, p < 10β»Β²βΆ).
We mathematically prove that no defense operating solely on model outputs (logits, generated text, perplexity filters, or toxicity classifiers) can reliably detect spectral collapse attacks. For any output-only detector D: π΄* β {0,1} and any error tolerance Ξ΄ > 0, there exists an adversarial input xβ such that:
- β[D(yβ:β(xβ)) = 0] β₯ 1 β Ξ΄ β the input passes the defense
- Ο(Δ_t β£ xβ) < Ο_critical for some t β spectral collapse is induced
The core vulnerability is that the output mapping from hidden states to token distributions is many-to-one: a spectrally collapsed state can still emit locally fluent, in-distribution text, so no detector that observes only the outputs can separate collapsed from healthy dynamics with bounded error.
The Threat: Hidden State Poisoning (HiSPA)
By applying gradient-based adversarial optimization (PGD) to the input-dependent step size Δ_t, HiSPA searches for token sequences that minimize the spectral radius of the discretized transition matrix,
subject to output plausibility constraints. This forces eigenvalues toward zero, causing a catastrophic 52.5 percentage-point accuracy collapse (from 72.3% to 19.8%) while the adversarial text remains lexically indistinguishable from benign input. The effective memory horizon drops from ~10βΆ tokens to just 45.
Information retention under adversarial attack on Mamba-130M. Under HiSPA, Ο drops from 0.98 to 0.32, collapsing the effective memory horizon by four orders of magnitude (Cohen's d = 3.2, p < 10β»Β³β°).
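The optimization behind HiSPA can be illustrated with a toy, self-contained PGD loop. Every quantity here (the diagonal decay `a`, the linear step-size model `softplus(w @ x)`, and the L-infinity "plausibility" ball) is an illustrative stand-in, not the paper's actual attack:

```python
import numpy as np

# Toy setting: for a diagonal SSM mode with A = -a and input-dependent
# step delta(x) = softplus(w @ x), the discretized radius is
# rho = exp(-a * delta).  PGD pushes x to inflate delta (collapsing rho)
# while staying inside an L_inf ball around the benign input.
rng = np.random.default_rng(0)
a = 0.5
w = rng.standard_normal(8)
x_benign = rng.standard_normal(8)
x = x_benign.copy()
eps, lr = 0.5, 0.1           # plausibility radius and PGD step size

def softplus(z):
    return np.log1p(np.exp(z))

def rho(x):
    return np.exp(-a * softplus(w @ x))

for _ in range(100):
    # Analytic gradient: d(rho)/dx = rho * (-a) * sigmoid(w @ x) * w
    grad = rho(x) * (-a) * (1.0 / (1.0 + np.exp(-(w @ x)))) * w
    x -= lr * np.sign(grad)                          # signed descent step on rho
    x = np.clip(x, x_benign - eps, x_benign + eps)   # project onto plausibility ball

print(f"rho: benign {rho(x_benign):.3f} -> adversarial {rho(x):.3f}")
```

The projection step is what the real attack replaces with lexical plausibility constraints on the token sequence.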
SpectralGuard intercepts inference at every token and extracts multi-layer spectral features across all L layers of the model: a per-layer estimate of ρ(Ā_t) obtained with a few power-iteration steps, its minimum over layers, and short sliding-window statistics that feed the downstream classifier.

Complexity: each power iteration costs O(d²) per layer for a d×d transition matrix, so with a constant number of iterations the total monitoring overhead is O(L · d²) per token, independent of sequence length.
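As a sketch, per-layer spectral radius estimation via power iteration might look like the following NumPy illustration, assuming direct access to a dense discretized transition matrix `A_bar` (this is not the repository's implementation):

```python
import numpy as np

def estimate_spectral_radius(A_bar: np.ndarray, iters: int = 12, seed: int = 0) -> float:
    """Approximate rho(A_bar) with a few power-iteration steps (O(d^2) each)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A_bar.shape[0])
    v /= np.linalg.norm(v)
    norm = 0.0
    for _ in range(iters):
        w = A_bar @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:
            return 0.0          # state already fully contracted
        v = w / norm
    # The norm ratio converges to |lambda_max| when a dominant real
    # eigenvalue exists (true for Mamba's positive diagonal modes).
    return float(norm)
```

For a diagonal SSM block the same quantity is simply `np.abs(np.diagonal(A_bar)).max()`, which drops the per-layer cost to O(d).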
System Architecture: Adversarial tokens flow through selective SSM discretization. SpectralGuard extracts layer-wise spectral features, feeds a logistic classifier, and gates outputs via a learned hazard threshold.
PCA projections of Mamba-130M hidden-state trajectories (d=16). Benign dynamics maintain a stable orbit (blue), HiSPA forces contraction to the origin (red), SpectralGuard intervenes before complete memory collapse (green).
Under the empirically supported assumption that effective memory-collapsing attacks (inducing >20% accuracy degradation) must reduce the spectral radius below a critical threshold, SpectralGuard with threshold Ο_min and window w provides formal guarantees:
- Conditionally Complete: All attacks satisfying the spectral collapse assumption are detected within latency w tokens.
- Sound: Benign inputs maintaining Ο(Δ_t) > Ο_min for all t incur zero false positives (FPR β 0).
Supporting evidence includes: (i) statistically significant correlation between Ο and task accuracy (r = 0.49, p < 10β»Β²βΆ); (ii) causal intervention experiments confirming that directly clamping Ο degrades performance with model weights frozen; and (iii) consistent spectral collapse signatures across four task categories.
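A minimal version of the thresholded, windowed decision rule can be sketched as follows. This `SpectralMonitor` is hypothetical and omits the repository's learned logistic classifier; it only captures the ρ_min / window-w gating logic:

```python
from collections import deque

class SpectralMonitor:
    """Illustrative sliding-window collapse detector (not the repo's API).

    Flags an attack when the minimum per-layer spectral radius stays
    below rho_min for `window` consecutive tokens.
    """

    def __init__(self, rho_min: float = 0.90, window: int = 3):
        self.rho_min = rho_min
        self.history = deque(maxlen=window)

    def observe(self, layer_rhos: list[float]) -> bool:
        self.history.append(min(layer_rhos) < self.rho_min)
        return len(self.history) == self.history.maxlen and all(self.history)

monitor = SpectralMonitor(rho_min=0.90, window=3)
benign = [monitor.observe([0.97, 0.95]) for _ in range(5)]   # healthy dynamics
attack = [monitor.observe([0.40, 0.35]) for _ in range(5)]   # collapsed dynamics
```

On benign input the alarm never fires; under sustained collapse it fires once the window fills, matching the "detected within latency w tokens" guarantee above.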
We provide a certified perturbation radius against adaptive attacks. For any perturbation ΔΔ to the discretization step size, the induced shift in spectral radius is bounded by |Δρ(Ā)| ≤ L_A · |ΔΔ|,

where L_A = ‖A‖₂ · exp(Δ_max · ‖A‖₂). For Mamba-130M with ‖A‖₂ ≈ 1 and Δ_max ≈ 10, we have L_A ≈ 2.2 × 10⁴, meaning a spectral shift of Δρ = 0.01 requires |ΔΔ| ≥ 4.5 × 10⁻⁷, an extremely tight precision requirement on the gradient optimizer. As model depth increases and L_A compounds through layers, even small discretization noise provides a natural barrier against fine-grained spectral manipulation.
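The arithmetic above can be checked directly:

```python
import math

A_norm, delta_max = 1.0, 10.0                 # Mamba-130M estimates from the text
L_A = A_norm * math.exp(delta_max * A_norm)   # Lipschitz constant, about 2.2e4
min_perturbation = 0.01 / L_A                 # step-size change needed for a 0.01 rho shift

print(f"L_A ≈ {L_A:.3g}, required |ΔΔ| ≥ {min_perturbation:.2g}")
```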
The attack leaves a structural signature that cannot be concealed without sacrificing impact. Our layer-wise analysis reveals that, to successfully collapse memory, adversaries must create a stark spectral "bottleneck" in layers 4–10, where ρ drops far below the benign baseline while the remaining layers stay near-normal.
Layer-wise spectral signature in Mamba-130M (N=500, 3 seeds). Adversarial contraction forces bottlenecks in early-to-mid layers, a structural pattern absent in benign computation.
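As an illustration, a layer-wise bottleneck feature might be computed like this. `bottleneck_score` is a hypothetical feature for exposition, not the repository's exact definition:

```python
import numpy as np

def bottleneck_score(layer_rhos: np.ndarray, lo: int = 4, hi: int = 10) -> float:
    """Ratio of the minimum rho inside layers [lo, hi] to the median rho
    elsewhere; values well below 1 indicate an adversarial bottleneck."""
    inside = layer_rhos[lo:hi + 1].min()
    outside = np.median(np.r_[layer_rhos[:lo], layer_rhos[hi + 1:]])
    return float(inside / outside)

benign = np.full(24, 0.96)       # flat benign profile
attacked = benign.copy()
attacked[6:9] = 0.35             # adversarial contraction in mid layers
```

A flat benign profile scores near 1.0, while a mid-layer contraction drives the score far below it, which is the separation the multi-layer detector exploits.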
SpectralGuard multi-layer detection confusion matrix (N=500): TN=235, FP=15, FN=5, TP=245, yielding F1=0.961, AUC=0.989, and FPR=0.060.
| Setting | Accuracy | F1 | AUC | FPR | Latency |
|---|---|---|---|---|---|
| Non-Adaptive | 0.960 | 0.961 | 0.989 | 0.060 | <15ms/token |
| Adaptive (Threshold Evasion) | β | 0.912 | 0.951 | β | <15ms/token |
| Adaptive (Multi-Layer Imitation) | β | 0.842 | 0.903 | β | <15ms/token |
| Cross-Architecture (Zamba2-2.7B) | β | 0.920 | β | 0.080 | <15ms/token |
The spectral phase transition is consistent across model scales, and detection performance is verified over multiple random seeds (42, 123, 456), confirming that the defense generalizes from Mamba 130M through 1.4B to 2.8B parameters.
Scaling and multi-seed robustness validation across the Mamba model family (130Mβ2.8B). F1 and FPR remain stable across scales, and Wilcoxon rank-sum confirms cross-seed reproducibility.
The spectral monitoring principle extends beyond pure Mamba architectures. We successfully transferred detection to hybrid architectures like Zamba2-2.7B (interleaved Mamba/Attention layers). Since the underlying SSM block drives the recurrence, spectral collapse signatures can be intercepted in the SSM sublayers regardless of surrounding attention mechanisms.
Cross-architecture transfer to Zamba2-2.7B hybrid SSM-Attention model (N=250). Mean spectral radii for benign vs. adversarial prompts remain well-separated in the SSM sublayers.
Can the attacker optimize against SpectralGuard directly? Yes, but they face an intractable trade-off. We demonstrate the existence of a topological lock: an attack that limits its spectral footprint enough to evade detection (keeping every layer's ρ above the monitored threshold) can no longer shrink the memory horizon enough to cause meaningful damage, while an attack that collapses memory necessarily exposes the layer-wise signature the detector keys on.
Pareto frontier mapping lexical evasion capability against spectral damage. The frontier is capped: high evasion implies low impact, and vice versa.
Throughput benchmarking confirms that multi-layer eigenvalue estimation via the power method introduces only a +15% constant overhead relative to standard Mamba autoregressive generation, ensuring viability in live production applications across batch sizes 1β64.
Per-token inference latency benchmark across batch sizes, displaying negligible computational overhead for real-time spectral monitoring.
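The cost of per-token monitoring can be sanity-checked with a toy micro-benchmark. The dimensions below are illustrative, and for a diagonal SSM the per-layer radius reduces to the largest diagonal magnitude:

```python
import time
import numpy as np

layers, d, tokens = 24, 16, 512                 # illustrative Mamba-130M-like sizes
rng = np.random.default_rng(0)
A_bars = [np.diag(rng.uniform(0.80, 0.99, d)) for _ in range(layers)]

start = time.perf_counter()
for _ in range(tokens):
    # Diagonal SSM: rho(A_bar) = max |a_ii|, an O(L * d) scan per token.
    rhos = [float(np.abs(np.diagonal(A)).max()) for A in A_bars]
per_token_ms = (time.perf_counter() - start) * 1e3 / tokens
print(f"spectral monitoring: {per_token_ms:.4f} ms/token for {layers} layers")
```

Even this unoptimized loop stays orders of magnitude below the sub-15ms/token budget reported above.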
The spectral radius of the discretized transition operator is both the mechanism enabling long-range reasoning in State Space Models and the attack surface through which adversaries can silently destroy it. Spectral monitoring provides what gradient clipping once provided for stable training: a lightweight, principled safeguard for the internal dynamics that make reasoning possible. As SSMs enter safety-critical deployments, the mismatch between internal vulnerability and external observability demands direct monitoring of recurrent dynamics.
"The era of recurrent foundation models demands recurrent vigilance."
```bash
# 1. Clone the repository
git clone https://github.com/DaviBonetto/spectralguard.git
cd spectralguard

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the canonical defense evaluation over 500 tasks
python scripts/run_main_defense_evaluation.py \
    --n 500 --output-dir artifacts/main_defense

# 4. Verify local unit tests pass
python -m pytest tests/ -q
```

- GitHub Source Code: https://github.com/DaviBonetto/spectralguard
- arXiv Paper: Submission in review (cs.LG)
- Pre-compiled Paper PDF: `paper/paper.pdf`
- Live Hugging Face Space (Demo): https://huggingface.co/spaces/DaviBonetto/spectralguard-demo
- Dataset Benchmark: https://huggingface.co/datasets/DaviBonetto/spectralguard-dataset
The primary interface into the SpectralGuard defense mechanism:
```python
from security.spectral_guard import SpectralGuard

# Initialize defense layer with selected sensitivity threshold
defender = SpectralGuard(threshold=0.30)

# Simulate token streaming, passing the computed hidden state layer values
is_safe, hazard_score = defender.check_prompt(
    prompt_text="Explain quantum mechanics.",
    rho_values=[0.97, 0.95, 0.94],
)

# is_safe: bool, True if the prompt stays above the collapse threshold
# hazard_score: float in [0, 1], probabilistic confidence of manipulation
```

Stability guarantee: downstream integrations built against the `SpectralGuard` interface will not break across minor version iterations.
spectralguard/
βββ core/ β Mamba wrapper + state extraction layers
βββ spectral/ β Eigenvalue analyzer & horizon predictor
βββ security/ β SpectralGuard detector logic
βββ utils/ β Dataset utilities & validation helpers
βββ visualization/ β Visual tools for trajectory mapping
βββ scripts/ β Canonical experiment pipeline scripts
βββ tests/ β Unit and integration tests for APIs
βββ notebooks/ β Canonical analysis notebooks (01β06)
βββ paper/ β LaTeX source code + compiled paper.pdf
β βββ figures/ β Rendered output plots and diagrams
βββ artifacts/ β Output of experiments [gitignored]
βββ data/ β Stored benchmark datasets
βββ docs/ β Supplementary text & matrices
βββ app.py β Main execution for HuggingFace Spaces
βββ requirements.txt β Pinning model / system dependencies
βββ Dockerfile β Configuration file for container building
We built this repository specifically around reproducibility. You can launch any phase of our paper evaluation using the canonical pipelines linked below. All configuration specifics correspond precisely with the metrics documented in the paper.
| # | Experiment | Script / Notebook | Paper § |
|---|---|---|---|
| E1 | Spectral Horizon Validation | `01_Spectral_Horizon_Validation.ipynb` | §5.1 |
| E2 | Spectral Collapse Under Attack | `python scripts/run_main_defense_evaluation.py` | §5.2 |
| E3 | SpectralGuard Performance | `python scripts/run_main_defense_evaluation.py` | §5.3 |
| E4 | Causal Mechanism Validation | `python scripts/run_causal_intervention.py` | §5.4 |
| E5 | Scaling & Robustness (130M → 2.8B) | `python scripts/run_adaptive_v4.py` | §5.5 |
| E6 | Hybrid Architecture (Zamba2-2.7B) | `python scripts/run_stealthy_transfer.py` | §5.6 |
| E7 | Layer-Wise Collapse Analysis | `python scripts/build_multilayer_features.py` | §5.7 |
| E8 | Real-World Deployment Simulation | `python scripts/benchmark_latency.py` | §5.8 |
Explore the interactive Jupyter notebooks, such as `pareto_sweep_results.ipynb`, to dig deeper into the visualization pipeline and the threshold ablation boundaries.
Deploy locally via standard Python invocation:
```bash
python app.py
```

The system automatically starts the Gradio web server on port `7860`. The Space offers two diagnostic modes:

- `real_model`: hooks directly into local Mamba layer parameters.
- `demo_mode`: a pure visual simulation of spectral drop-offs for low-bandwidth environments.
To modify the source, install the package in editable mode with the dev extras:

```bash
pip install -e ".[dev]"
pytest -q
```

If this work aids your research on state space model safety, adversarial robustness, or mechanistic interpretability, please cite:
```bibtex
@article{bonetto2026spectralguard,
  title   = {SpectralGuard: Detecting Memory Collapse Attacks in State Space Models},
  author  = {Bonetto, Davi},
  journal = {arXiv preprint},
  year    = {2026},
  note    = {Under review. Primary: cs.LG; Secondary: cs.CR}
}
```

- Email: davi.bonetto100@gmail.com
- GitHub: @DaviBonetto
MIT License. See `LICENSE` for the full terms.