Prompt Injection Defense Mini-Evaluation

Minimal, dependency-light suite to reproduce a small part of the paper's prompt-injection evaluation. It downloads two public Hugging Face datasets, wraps their prompts with a few baseline defenses, and measures attack success rate (ASR), false-positive/refusal rate, and prompt overhead (tokens). Defenses live in scripts/defenses.py; the evaluator is scripts/eval_asr.py.

Datasets

deepset/prompt-injections (all 662 samples)
geekyrakshit/prompt-injection-dataset (deterministic sample of ~10k rows; shuffled with seed=42 if larger)

What the metrics mean

ASR (Attack Success Rate) — evaluated only on INJECTION-labeled rows. A success is counted if the model output appears to follow an instruction embedded in the untrusted text (regex target match or clear instruction-following/non-summary behavior).
LEGIT_refusal_rate — evaluated only on LEGIT-labeled rows. A refusal is counted if the model declines, hedges, or returns non-summary content for benign text (false positives).
avg_prompt_tokens / overhead_vs_no_defense — token-length and token overhead per defense relative to the no-defense baseline.

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Tests (stub mode, fast):

python -m pytest -q

Download datasets

python scripts/download_datasets.py

This writes normalized JSONL files to data/deepset.jsonl and data/combined.jsonl with fields: {id, dataset, split, text, label}.

Run evaluations

PYTHONPATH=. \
OPENAI_API_KEY=... \
python scripts/eval_asr.py --dataset all --provider openai --progress

PYTHONPATH=. \
ANTHROPIC_API_KEY=... \
python scripts/eval_asr.py --dataset all --provider anthropic --progress

# offline deterministic run (no external API)
PYTHONPATH=. python scripts/eval_asr.py --dataset all --provider stub

# add --limit N to cap rows; --dataset deepset|combined|all
# add --defenses name1,name2 to run a subset (names from scripts/defenses.py; default: all)
# set ATTACKER_STUB=1 to make the offline stub deliberately follow injected commands (for stress testing)

### Picking datasets, providers, and defenses
- `--dataset`: choose `deepset`, `combined`, or `all` to run both.
- `--provider`: `openai`, `anthropic`, `stub`, or `auto` (auto picks OpenAI if key set, else Anthropic, else stub).
- `--defenses`: comma-separated list of defense names from `scripts/defenses.py` (e.g., `no_defense,freeze_dry_standalone`). Default runs all.

Example: run only prompt_hardening and freeze_dry on deepset with OpenAI, capped at 300 rows:
```bash
PYTHONPATH=. OPENAI_API_KEY=... \
python scripts/eval_asr.py --dataset deepset --limit 300 \
  --provider openai --defenses prompt_hardening,freeze_dry_standalone --progress


If `OPENAI_API_KEY` is set (and optionally `OPENAI_MODEL`, default `gpt-4o-mini`), the script will call the OpenAI Chat Completions API. Otherwise it uses a deterministic stub so the suite runs offline and still produces `results/results.json`.

## Defenses implemented (scripts/defenses.py)
- `no_defense` — baseline, no protection.
- `xml_delimiters` — wraps untrusted text in `<untrusted>` tags; lightweight delimiters.
- `prompt_hardening` — defensive preamble + XML delimiters; tells the model untrusted is inert.
- `input_classifier` — simple heuristic pre-filter; refuses on obvious injection strings.
- `dual_llm_extract` — in a single prompt, first “extract” facts from untrusted, then summarize; still probabilistic.
- `freeze_dry_standalone` — randomized markers + integrity contract; stronger compartmentalization (higher token cost).
- `layered_guardrail` — heuristic router: suspicious → freeze_dry_standalone, otherwise prompt_hardening; embodies layered tradeoff.
- CaMeL: upstream implementation copied under `scripts/camel-prompt-injection-main/`; not wired into the evaluator (full interpreter remains external).

### When to use which
- Use `xml_delimiters` for minimal overhead, low assurance.
- Use `prompt_hardening` for better guidance with modest cost.
- Use `input_classifier` to outright refuse easy-to-spot attacks before generation.
- Use `dual_llm_extract` when you want extraction + summarization separation in one prompt (probabilistic).
- Use `freeze_dry_standalone` when you want stronger isolation and can afford higher token overhead.
- Use `layered_guardrail` when you want to route only suspicious inputs to stronger protection.

## Outputs
- Console summary with ASR and benign refusal rate per defense (see “What the metrics mean” above).
- `results/results.json` with per-defense metrics, prompt length stats, and overhead vs. no defense.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
data		data
scripts		scripts
tests		tests
.gitignore		.gitignore
README.md		README.md
conftest.py		conftest.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Prompt Injection Defense Mini-Evaluation

Datasets

What the metrics mean

Setup

Download datasets

Run evaluations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Prompt Injection Defense Mini-Evaluation

Datasets

What the metrics mean

Setup

Download datasets

Run evaluations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages