Privilege Escalation Cascades at 98% While Domain-Aligned Attacks Are Invisible: A Taxonomy of Agent Semantic Resistance

Agent resistance varies by 98pp across payload types. Privilege escalation (0.976) cascades almost perfectly because "recommend admin access" sounds like legitimate business advice. Domain-aligned attacks (0.000) appear fully resisted but this is a detection artifact — the agent absorbs the framing and may propagate the intent without using detectable keywords.

Blog post: I Tested 6 Attacks on Multi-Agent Systems — Here's Which Ones Agents Can't See

Key Results

Payload Type	Poison Rate (mean +/- std)	Resistance Rate	Interpretation
Privilege escalation	0.976 +/- 0.030	2.4%	Near-zero resistance — "grant admin access" propagates freely
Generic (CryptoScamCoin)	0.688 +/- 0.066	31.2%	Moderate resistance — agents partially detect off-topic financial advice
Data exfiltration	0.552 +/- 0.076	44.8%	Moderate resistance — marker string is detectable but not always caught
Domain-aligned	0.000 +/- 0.000	100% (apparent)	Detection failure — agent absorbs domain-aligned framing, doesn't output raw keyword

Core insight: Agent resistance varies by 98pp across payload types. Privilege escalation (0.976) cascades almost perfectly because "recommend admin access" sounds like legitimate business advice. Domain-aligned attacks (0.000) appear fully resisted but this is a detection artifact — the agent absorbs the framing and may propagate the intent without using detectable keywords.

Quick Start

git clone https://github.com/rexcoleman/agent-semantic-resistance
cd agent-semantic-resistance
pip install -e .
bash reproduce.sh

Project Structure

FINDINGS.md # Research findings with pre-registered hypotheses and full results
EXPERIMENTAL_DESIGN.md # Pre-registered experimental design and methodology
HYPOTHESIS_REGISTRY.md # Hypothesis predictions, results, and verdicts
reproduce.sh # One-command reproduction of all experiments
governance.yaml # govML governance configuration
CITATION.cff # Citation metadata
LICENSE # MIT License
pyproject.toml # Python project configuration
scripts/ # Experiment and analysis scripts
src/ # Source code
tests/ # Test suite
outputs/ # Experiment outputs and results
config/ # Configuration files
docs/ # Documentation and decision records

Methodology

See FINDINGS.md and EXPERIMENTAL_DESIGN.md for detailed methodology, pre-registered hypotheses, and full experimental results with multi-seed validation.

Limitations

Keyword detection conflates evasion with resistance. Domain-aligned and adversarial results (0.000, 0.024) may reflect detection failure, not genuine resistance. Future work needs semantic similarity scoring.
Claude Haiku only. Other models may have different resistance characteristics.
5 seeds, 5 tasks. Limited statistical power. Effect sizes are large enough to be meaningful but CIs are wide.
Single compromised agent (orchestrator). Compromising analyst or writer would produce different patterns.
Static payloads. Real adversaries adapt payloads per-delegation, not use fixed strings.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Privilege Escalation Cascades at 98% While Domain-Aligned Attacks Are Invisible: A Taxonomy of Agent Semantic Resistance

Key Results

Quick Start

Project Structure

Methodology

Limitations

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
blog		blog
docs		docs
figures		figures
scripts		scripts
tests		tests
.gitignore		.gitignore
CITATION.cff		CITATION.cff
EXPERIMENTAL_DESIGN.md		EXPERIMENTAL_DESIGN.md
FINDINGS.md		FINDINGS.md
HYPOTHESIS_REGISTRY.md		HYPOTHESIS_REGISTRY.md
LICENSE		LICENSE
README.md		README.md
governance.yaml		governance.yaml
pyproject.toml		pyproject.toml
reproduce.sh		reproduce.sh

Folders and files

Latest commit

History

Repository files navigation

Privilege Escalation Cascades at 98% While Domain-Aligned Attacks Are Invisible: A Taxonomy of Agent Semantic Resistance

Key Results

Quick Start

Project Structure

Methodology

Limitations

Citation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages