Automatically detect why your AI agents are failing. Upload your evaluation data and instantly surface failure patterns, characteristics, and actionable recommendations.
When agents fail evaluations, teams get lists of failures but no insights into why:
- Are failures clustered on specific topics?
- Do longer inputs cause more failures?
- Is there a safety pattern?
This tool answers those questions in seconds.
- Upload CSV of your agent evaluation results
- Analyzes failures to find common characteristics
- Detects patterns (e.g., "80% of safety failures are on political topics")
- Provides recommendations (e.g., "Add safety filter for political questions")
- Exports insights for your team
cd backend
npm install
npm start
Server runs on http://localhost:5000
cd frontend
npx serve --single
Or open frontend/index.html directly in your browser.
Your CSV should have these columns:
task_name,topic,input_length,output_length,safety_passed,instruction_passed,efficiency_score,pass_fail
classify_email,finance,150,120,true,true,92,pass
classify_email,politics,250,180,false,false,45,fail
...
Required column: pass_fail (values: pass or fail)
Optional columns: Add any metadata you want to analyze for patterns:
- topic — categorical grouping
- input_length — numerical characteristic
- domain — categorization
- difficulty — custom metric
- etc.
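A minimal sketch of how an uploaded CSV might be validated and parsed before analysis, assuming the only hard requirement is the pass_fail column described above. The function name parseEvalCsv and the error message are illustrative, not the tool's actual API, and this naive split does not handle quoted fields:

```javascript
// Illustrative CSV parsing/validation sketch (not the tool's real code).
// Only pass_fail is required; every other column becomes a candidate
// characteristic to group failures by.
function parseEvalCsv(csvContent) {
  const [headerLine, ...rows] = csvContent.trim().split("\n");
  const headers = headerLine.split(",").map((h) => h.trim());
  if (!headers.includes("pass_fail")) {
    throw new Error("CSV must include a pass_fail column (values: pass or fail)");
  }
  // Turn each row into an object keyed by header name.
  return rows.map((row) => {
    const values = row.split(",").map((v) => v.trim());
    return Object.fromEntries(headers.map((h, i) => [h, values[i]]));
  });
}
```

Any extra columns you include (topic, difficulty, and so on) simply show up as extra keys on each row object, ready for grouping.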
- Isolates failures from your dataset
- Groups by characteristic (topic, length, etc.)
- Calculates frequency — what % of failures share this trait?
- Scores confidence — is this pattern significant?
- Generates recommendations — what should you do about it?
A pattern is flagged if:
- It accounts for >60% of failures
- At least 3 failures match it
- High confidence (validated against dataset size)
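The grouping and flagging steps above can be sketched roughly like this, using the stated thresholds (more than 60% of failures, at least 3 matches). This assumes one categorical characteristic at a time, and the confidence formula here is a simple stand-in scaled by dataset size; the tool's actual scoring may differ:

```javascript
// Sketch of pattern detection over parsed rows (objects keyed by column).
// Thresholds follow the README: >60% of failures and >=3 matching failures.
function detectPatterns(rows, characteristic) {
  const failures = rows.filter((r) => r.pass_fail === "fail");
  // Count failures per value of the chosen characteristic (e.g. topic).
  const counts = {};
  for (const f of failures) {
    const value = f[characteristic];
    counts[value] = (counts[value] || 0) + 1;
  }
  const patterns = [];
  for (const [value, matches] of Object.entries(counts)) {
    const percentage = (matches / failures.length) * 100;
    if (percentage > 60 && matches >= 3) {
      patterns.push({
        characteristic: { [characteristic]: value },
        matches,
        percentage: Math.round(percentage),
        // Illustrative confidence: grows with matches, damped by dataset size.
        confidence: Math.min(0.99, matches / Math.sqrt(rows.length)),
      });
    }
  }
  return patterns;
}
```

For example, 4 politics failures out of 5 total failures yields an 80% pattern, which clears both thresholds and gets flagged.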
We include sample_eval_data.csv with 100 realistic agent evaluations showing:
- ✅ Politics topic → 80% safety failures
- ✅ Input length >2000 → 85% instruction following failures
- ✅ Medical entity extraction → 70% failures
- ✅ Complex multi-step tasks → 60% failures
Try it: Upload the sample CSV to see how patterns are detected.
Analyze CSV data for failure patterns.
Request:
{
"csv_content": "task_name,topic,...\ndata1,val1,...\n..."
}
Response:
{
"success": true,
"total_tests": 100,
"total_failures": 20,
"failure_rate": 20,
"patterns": [
{
"description": "80% of safety failures are on political topics",
"matches": 16,
"percentage": 80,
"confidence": 0.95,
"characteristic": { "topic": "political" },
"recommendation": "Add safety guardrails for political queries"
}
]
}
- Model Evaluation — Understand which categories your agent struggles with
- Data Collection — Identify gaps to fill in training data
- Safety Testing — Detect safety failures clustered by topic/domain
- Performance Analysis — Find input characteristics that cause failures
- Continuous Improvement — Track patterns over time as you retrain
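A client-side call to the analysis API might look like the sketch below. The endpoint path /analyze is an assumption (the source only describes the request/response bodies), and the port follows the backend setup; adjust both to your deployment:

```javascript
// Illustrative client call; "/analyze" is an assumed endpoint path.
// The request/response shapes follow the JSON example above.
async function analyzeCsv(csvContent) {
  const res = await fetch("http://localhost:5000/analyze", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ csv_content: csvContent }),
  });
  if (!res.ok) throw new Error(`Analysis request failed: ${res.status}`);
  // Resolves to { success, total_tests, total_failures, failure_rate, patterns }
  return res.json();
}
```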
Input: 100 agent evaluations, 20 failures
Patterns Detected:
- "80% of safety failures happen on political topics" — 95% confidence
- "Input length >2000 chars → 85% fail instruction following" — 92% confidence
- "Medical entity extraction has 70% failure rate" — 88% confidence
Recommendations:
- Add safety filter for political content
- Retrain model on longer-context examples
- Collect more medical entity training data
Built as a micro-demo to showcase:
- Understanding agent evaluation problems — knows what matters
- Full-stack execution — backend analysis + polished frontend
- Fast shipping — built in ~60 minutes
- Research mindset — pattern detection, not just dashboards
Ideal for teams that want to move faster on agent reliability.
Questions? Check the sample CSV or reach out!