Prompt Injection Evaluation Tool

A comprehensive evaluation tool for prompt injection detection models. This tool includes dataset merging capabilities and supports evaluation of multiple model types with unified CSV datasets.

Features

Dataset Management: Merge multiple format datasets (CSV, TSV, JSON, Parquet) into unified format, defaults to 20265 samples from various sources.
Short Dataset Creation: Create balanced smaller datasets for faster testing and development
Model Support: Multiple model types (Llama Prompt Guard, ProtectAI DeBERTa, PreambleAI, Qualifire, SavantAI, etc.)
Binary Classification: Standardized binary labels (0=Safe, 1=Threat) with original label preservation
Comprehensive Metrics: Detailed evaluation metrics including confusion matrix analysis
Organized Output: Results saved in structured directories under results/
Command Line Interface: Flexible CLI for evaluation with sampling support

Prerequisites

Environment Setup

conda create -n model-eval python=3.12
conda activate model-eval
python -m pip install torch pandas transformers tqdm pyarrow -i https://pypi.tuna.tsinghua.edu.cn/simple

Project Structure

prompt-injection-eval/
├── merge_datasets.py         # Dataset merging script
├── model-eval-v2.0.py       # Main evaluation script v2.0
├── model-eval-v1.1.py       # Legacy evaluation script
├── datasets/                # Dataset directory
│   ├── create_short_dataset.py  # Short dataset creation script
│   ├── dataset-source/      # Original datasets in various formats
│   │   ├── *.csv            # CSV datasets
│   │   ├── *.tsv            # TSV datasets
│   │   ├── *.json           # JSON datasets
│   │   └── *.parquet        # Parquet datasets
│   ├── merged_datasets.csv  # Unified merged dataset
│   └── merged_datasets_short_*.csv  # Balanced short datasets
├── models/                  # Model directory
│   ├── deepset-deberta/
│   ├── preambleai/
│   ├── qualifire/
│   └── protectaiv1/
│   └── protectaiv2/
└── results/                 # Evaluation results
    └── evaluation_results_*/

Quick Start

Step 1: Merge Datasets (Optional)

First, merge all datasets into a unified CSV format:

cd datasets
python merge_datasets.py

This will:

Scan datasets/dataset-source/ for all supported formats
Standardize column names and labels
Convert labels to binary format (0=Safe, 1=Threat)
Save unified dataset to datasets/merged_datasets.csv

Step 2: Evaluate Models

Run model evaluation using the merged dataset:

# Basic evaluation
# Default dataset: datasets/merged_datasets.csv, contains 20265 samples
# Default model: ./models/deepset-deberta/
python model-eval-v2.0.py --model ./models/deepset-deberta/

# With specific dataset (using short dataset for faster evaluation)
python model-eval-v2.0.py --model ./models/deepset-deberta/ --dataset datasets/merged_datasets_short_1000.csv

# Sample evaluation (1000 samples)
python model-eval-v2.0.py --model ./models/deepset-deberta/ --sample 1000

Model Evaluation Results Comparison

Evaluation Results on Full Dataset (20,265 samples)

Model Name	Accuracy	Recall	Precision	FPR	F1 Score
Qualifire	0.9457	0.9097	0.9543	0.0297	0.9315
Vijil mBERT	0.9338	0.8858	0.9477	0.0334	0.9157
PreambleAI	0.9259	0.8469	0.9664	0.0201	0.9027
SavantAI	0.8646	0.8994	0.7942	0.1591	0.8435
ProtectAI v2	0.8453	0.7238	0.8732	0.0717	0.7915
Llama Prompt Guard	0.7906	0.5552	0.8860	0.0487	0.6826
ProtectAI v1	0.7076	0.3023	0.9293	0.0157	0.4562
Deepset DeBERTa	0.5557	0.9856	0.4770	0.7377	0.6429

Evaluation Results on 4,000 Sample Dataset

Model Name	Accuracy	Recall	Precision	FPR	F1 Score
Aliyun AI Guardrails	0.8248	0.7571	0.8519	0.1157	0.8017
Azure Prompt Shield	0.6695	0.3457	0.8693	0.0457	0.4946
Qualifire	0.9002	0.8520	0.9289	0.0573	0.8888
SavantAI	0.8425	0.8632	0.8120	0.1757	0.8368
PreambleAI	0.8407	0.7317	0.9102	0.0634	0.8113
Vijil mBERT	0.8393	0.7435	0.8951	0.0766	0.8135
ProtectAI v2	0.8160	0.7034	0.8791	0.0850	0.7815
Llama Prompt Guard	0.7228	0.4869	0.8594	0.0700	0.6216
ProtectAI v1	0.6495	0.3014	0.8558	0.0446	0.4458
Deepset DeBERTa	0.6332	0.9604	0.5633	0.6543	0.7101

Evaluation Results on 3,000 Sample Dataset

Model Name	Accuracy	Recall	Precision	FPR	F1 Score
Aliyun AI Guardrails	0.8058	0.7323	0.8400	0.1272	0.7825
Azure Prompt Shield	0.6572	0.3452	0.8444	0.0580	0.4901
Qualifire	0.8840	0.8330	0.9162	0.0695	0.8726
SavantAI	0.8277	0.8456	0.8035	0.1887	0.8240
PreambleAI	0.8210	0.7086	0.8942	0.0765	0.7906
Vijil mBERT	0.8160	0.7086	0.8825	0.0860	0.7860
ProtectAI v2	0.8123	0.7121	0.8709	0.0962	0.7835
Llama Prompt Guard	0.7050	0.4724	0.8387	0.0829	0.6044
Deepset DeBERTa	0.6380	0.9504	0.5726	0.6469	0.7147
ProtectAI v1	0.6380	0.3033	0.8298	0.0567	0.4442

Evaluation Results on 2,000 Sample Dataset

Model Name	Accuracy	Recall	Precision	FPR	F1 Score
Aliyun AI Guardrails	0.7747	0.6823	0.8323	0.1347	0.7499
Azure Prompt Shield	0.6278	0.3320	0.8049	0.0796	0.4701
Qualifire	0.8465	0.7867	0.8917	0.0944	0.8359
ProtectAI v2	0.8070	0.7254	0.8645	0.1123	0.7888
SavantAI	0.8040	0.8119	0.7974	0.2038	0.8046
PreambleAI	0.7880	0.6791	0.8654	0.1044	0.7610
Vijil mBERT	0.7795	0.6640	0.8605	0.1064	0.7496
Llama Prompt Guard	0.6630	0.4286	0.8008	0.1054	0.5583
Deepset DeBERTa	0.6420	0.9306	0.5884	0.6431	0.7210
ProtectAI v1	0.6095	0.2968	0.7825	0.0815	0.4303

Evaluation Results on 1,000 Sample Dataset

Model Name	Accuracy	Recall	Precision	FPR	F1 Score
Aliyun AI Guardrails	0.7445	0.6480	0.7864	0.1650	0.7106
Azure Prompt Shield	0.6046	0.3361	0.6907	0.1420	0.4521
Qualifire	0.8140	0.7856	0.8229	0.1592	0.8038
SavantAI	0.7730	0.8144	0.7425	0.2660	0.7768
ProtectAI v2	0.7660	0.6948	0.7967	0.1670	0.7423
PreambleAI	0.7480	0.6619	0.7848	0.1709	0.7181
Vijil mBERT	0.7410	0.6515	0.7783	0.1748	0.7093
Deepset DeBERTa	0.6430	0.9320	0.5825	0.6291	0.7169
Llama Prompt Guard	0.6360	0.4330	0.7023	0.1728	0.5357
ProtectAI v1	0.5970	0.3320	0.6708	0.1534	0.4441

Key Findings

Top Performers (Full Dataset - 20,265 samples):

Qualifire - Best overall performance with 94.57% accuracy and excellent balance of metrics
Vijil mBERT - Strong second with 93.38% accuracy and high precision
PreambleAI - High precision (96.64%) with good overall performance

Consistency Across Dataset Sizes:

Qualifire consistently ranks #1 across all dataset sizes
SavantAI show strong performance on smaller datasets
ProtectAI v1 consistently shows low recall but high precision
Deepset DeBERTa has very high recall but poor precision (high false positive rate)

Trade-offs:

High Precision, Lower Recall: PreambleAI, Llama Prompt Guard, ProtectAI v1
High Recall, Lower Precision: Deepset DeBERTa
Balanced Performance: Qualifire, Vijil mBERT, SavantAI

Dataset Management

Supported Input Formats

CSV: Comma-separated values
TSV: Tab-separated values
JSON: JavaScript Object Notation
Parquet: Apache Parquet format

Available Datasets

The project includes one full merged dataset and three pre-generated short datasets with balanced distribution:

Dataset Overview

Dataset	Target Samples	Actual Samples	Safe (0)	Threat (1)
merged_datasets.csv	Full	20,496	12,076 (58.9%)	8,420 (41.1%)
merged_datasets_short_1000.csv	1,036	1,000	515 (51.5%)	485 (48.5%)
merged_datasets_short_2000.csv	2,340	2,000	1,006 (50.3%)	994 (49.7%)
merged_datasets_short_3000.csv	3,840	3,000	1,569 (52.3%)	1,431 (47.7%)

Source Distribution

Source Dataset	merged_datasets.csv	1000.csv	2000.csv	3000.csv
Total Samples	20,496	1,000	2,000	3,000
qualifire_prompt-injections-benchmark_test	5,000 (24.4%)	116 (11.6%)	260 (13.0%)	427 (14.2%)
train-00000-of-00001	8,236 (40.2%)	115 (11.5%)	260 (13.0%)	427 (14.2%)
test-00000-of-00001	2,060 (10.1%)	115 (11.5%)	260 (13.0%)	426 (14.2%)
allenai_wildjailbreak-eval	2,210 (10.8%)	115 (11.5%)	260 (13.0%)	427 (14.2%)
jackhhao_jailbreak-classification-jailbreak_dataset_full	1,998 (9.7%)	115 (11.5%)	260 (13.0%)	427 (14.2%)
train-00000-of-00001-9564e8b05b4757ab	546 (2.7%)	115 (11.5%)	260 (13.0%)	426 (14.2%)
PurpleLlama_CybersecurityBenchmarks_prompt_injection	251 (1.2%)	115 (11.5%)	245 (12.2%)	245 (8.2%)
test-00000-of-00001-701d16158af87368	116 (0.6%)	115 (11.5%)	116 (5.8%)	116 (3.9%)
rubend18_ChatGPT-Jailbreak-Prompts	79 (0.4%)	79 (7.9%)	79 (4.0%)	79 (2.6%)

merged_datasets.csv actually has 20265 samples, 231 samples are duplicates and are removed.

Label Distribution by Source (merged_datasets.csv)

Source Dataset	Total	Safe (0)	Threat (1)	Safe %	Threat %
qualifire_prompt-injections-benchmark_test	5,000	3,001	1,999	60.0%	40.0%
train-00000-of-00001	8,236	5,740	2,496	69.7%	30.3%
test-00000-of-00001	2,060	1,410	650	68.4%	31.6%
allenai_wildjailbreak-eval	2,210	210	2,000	9.5%	90.5%
jackhhao_jailbreak-classification-jailbreak_dataset_full	1,998	1,332	666	66.7%	33.3%
train-00000-of-00001-9564e8b05b4757ab	546	343	203	62.8%	37.2%
PurpleLlama_CybersecurityBenchmarks_prompt_injection	251	0	251	0.0%	100.0%
test-00000-of-00001-701d16158af87368	116	56	60	48.3%	51.7%
rubend18_ChatGPT-Jailbreak-Prompts	79	74	5	93.7%	6.3%

Short Dataset Creation

The create_short_dataset.py script allows you to create balanced smaller datasets from the full merged dataset for faster testing and development:

Features

Balanced Sampling: Maintains proportional representation from each source dataset
Label Balance: Preserves label distribution within each source when possible
Configurable Size: Specify target number of samples
Reproducible: Uses random seed for consistent results
Auto-naming: Generates descriptive filenames based on actual sample count

Usage

cd datasets

# Basic usage - creates 1000 samples by default
python create_short_dataset.py

# Custom sample size

python create_short_dataset.py --samples=1036
#python create_short_dataset.py --samples=2340
#python create_short_dataset.py --samples=3840


# Specify input and output files
python create_short_dataset.py --input merged_datasets.csv --output my_dataset.csv --samples 500

# With custom random seed
python create_short_dataset.py --samples 1000 --seed 123

Command Line Options

Option	Description	Default
`--input`	Input CSV file path	`merged_datasets.csv`
`--output`	Output CSV file path (auto-generated if not specified)	Auto-generated
`--samples`	Target number of samples to extract	1000
`--seed`	Random seed for reproducibility	42

Example Output

Creating balanced short dataset with target of 1000 samples...
Loading dataset from: merged_datasets.csv
Loaded dataset with 20265 rows

=== Original Dataset Distribution ===
Label distribution:
  0 (Safe): 11,847 (58.4%)
  1 (Threat): 8,418 (41.6%)

=== Sampling Strategy ===
Target samples: 1000
Total sources: 9
Base samples per source: 111
Remainder to distribute: 1

Final short dataset saved to: merged_datasets_short_1000.csv

Dataset Structure

All datasets follow the same structure after merging:

Column	Description	Type
text	The prompt/text content	string
label	Binary label (0=Safe, 1=Threat)	integer
source	Source dataset identifier	string
original_label	Original label from source	string

Label Mapping

The merging process standardizes all labels to binary format:

0 (Safe): Normal, benign prompts
1 (Threat): Prompt injection attempts, jailbreaks, malicious prompts

Model Configuration

Supported Models

Llama Prompt Guard 2 (86M): ./models/Llama-Prompt-Guard-2-86M/
ProtectAI DeBERTa v3: ./models/protectaiv2/
ProtectAI DeBERTa v1: ./models/protectaiv1/
PreambleAI: ./models/preambleai/
Qualifire: ./models/qualifire/
SavantAI: ./models/testsavantai-prompt-injection-defender-large-v0/
Deepset DeBERTa: ./models/deepset-deberta/
Vijil mBERT: ./models/vijil-mbert-prompt-injection/
YOUR OWN MODEL ...

Model Downloads

# Example model downloads using huggingface-cli
huggingface-cli download PreambleAI/prompt-injection-defense --local-dir ./models/preambleai/
huggingface-cli download protectai/deberta-v3-base-prompt-injection-v2 --local-dir ./models/protectaiv2/

Output Files

Results are saved in timestamped directories under results/:

results/evaluation_results_{model_name}_{timestamp}/
├── eval_results_{timestamp}.json      # Detailed results with predictions
├── eval_summary_{timestamp}.csv       # Flattened results for analysis
├── evaluation_report_{timestamp}.txt  # Human-readable report
└── evaluation_metrics_{timestamp}.json # Structured metrics

Evaluation Metrics

The tool provides comprehensive evaluation metrics:

Accuracy: Overall classification accuracy
Precision: Threat detection precision
Recall: Threat detection recall (sensitivity)
F1 Score: Harmonic mean of precision and recall
False Positive Rate (FPR): Rate of false threat detection
Confusion Matrix: Detailed classification breakdown

Advanced Usage

Custom Model Integration

To add your own model, create a directory under models/ and ensure it follows the HuggingFace transformers format.

Batch Evaluation

# Evaluate multiple models with the same dataset
 for model in deepset-deberta testsavantai-prompt-injection-defender-large-v0 Llama-Prompt-Guard-2-86M protectaiv1 protectaiv2 preambleai qualifire; do
    python model-eval-v2.0.py --model ./models/$model/ --dataset datasets/merged_datasets_short_2000.csv
done

Sampling Strategies

# Random sampling from full dataset
python model-eval-v2.0.py --model ./models/deepset-deberta/ --sample 500

# Use pre-created balanced short dataset
python model-eval-v2.0.py --model ./models/deepset-deberta/ --dataset datasets/merged_datasets_short_1000.csv

Version History

v2.0: Enhanced evaluation pipeline with better output formatting
v1.1: Added sampling support and improved metrics
v1.0: Initial release with basic model evaluation

Model Sources

Dataset Sources

4.1 Deepset

Qualifire used this dataset for training: https://huggingface.co/datasets/deepset/prompt-injections

Clone of this dataset used by testsavantai/prompt-injection-defender-large-v0: https://huggingface.co/datasets/JasperLS/prompt-injections

4.2 allenai/wildjailbreak

Qualifire used this dataset for training: https://huggingface.co/datasets/allenai/wildjailbreak

4.3 jackhhao/jailbreak-classification

ProtectAI and Qualifire used this dataset for training: https://huggingface.co/datasets/jackhhao/jailbreak-classification

4.4 qualifire/prompt-injections-benchmark

https://huggingface.co/datasets/qualifire/prompt-injections-benchmark

4.5 xTRam1/safe-guard-prompt-injection

https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection

4.6 rubend18/ChatGPT-Jailbreak-Prompts

ProtectAI and testsavantai/prompt-injection-defender-large-v0 used this dataset for training: https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts

4.7 NVIDIA nvidia/Aegis-AI-Content-Safety-Dataset-2.0

Prisma AIRS used this dataset for training: https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0

4.8 Purple Llama CyberSecEval

Llama Prompt Guard used this dataset for training: https://github.com/meta-llama/PurpleLlama/tree/23156b70efb596831c02c6461fc42da1f75988ec/CybersecurityBenchmarks

License

This project is open source. Please check individual model licenses for their respective usage terms.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
datasets		datasets
model-to-openai-api		model-to-openai-api
models		models
results-1000-samples		results-1000-samples
results-2000-samples		results-2000-samples
results-3000-samples		results-3000-samples
results-4000-samples		results-4000-samples
results		results
snapshop-v1.0		snapshop-v1.0
.DS_Store		.DS_Store
README-CN.md		README-CN.md
README.md		README.md
model-eval-aliyun-prompt-injection.py		model-eval-aliyun-prompt-injection.py
model-eval-aws-bedrock-guardrails.py		model-eval-aws-bedrock-guardrails.py
model-eval-azure-prompt-shield.py		model-eval-azure-prompt-shield.py
model-eval-cloud-api-v2.py		model-eval-cloud-api-v2.py
model-eval-v2.0.py		model-eval-v2.0.py
model-eval-vijil.py		model-eval-vijil.py

zm1990s/security-model-eval

Folders and files

Latest commit

History

Repository files navigation

Prompt Injection Evaluation Tool

Features

Prerequisites

Environment Setup

Project Structure

Quick Start

Step 1: Merge Datasets (Optional)

Step 2: Evaluate Models

Model Evaluation Results Comparison

Evaluation Results on Full Dataset (20,265 samples)

Evaluation Results on 4,000 Sample Dataset

Evaluation Results on 3,000 Sample Dataset

Evaluation Results on 2,000 Sample Dataset

Evaluation Results on 1,000 Sample Dataset

Key Findings

Dataset Management

Supported Input Formats

Available Datasets

Dataset Overview

Source Distribution

Label Distribution by Source (merged_datasets.csv)

Short Dataset Creation

Features

Usage

Command Line Options

Example Output

Dataset Structure

Label Mapping

Model Configuration

Supported Models

Model Downloads

Output Files

Evaluation Metrics

Advanced Usage

Custom Model Integration

Batch Evaluation

Sampling Strategies

Version History

Model Sources

3.1 testsavantai/prompt-injection-defender-large-v0

3.2 Qualifire

3.3 ProtectAI deberta-v3-base-prompt-injection-v2

3.4 Meta Llama Prompt-Guard-86M

3.5 vijil/mbert-prompt-injection

3.6 PreambleAI/prompt-injection-defense

Dataset Sources

4.1 Deepset

4.2 allenai/wildjailbreak

4.3 jackhhao/jailbreak-classification

4.4 qualifire/prompt-injections-benchmark

4.5 xTRam1/safe-guard-prompt-injection

4.6 rubend18/ChatGPT-Jailbreak-Prompts

4.7 NVIDIA nvidia/Aegis-AI-Content-Safety-Dataset-2.0

4.8 Purple Llama CyberSecEval

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages