Automatically detect why your AI agents are failing. Upload your evaluation data and instantly surface failure patterns, characteristics, and actionable recommendations.
When agents fail evaluations, teams get lists of failures but no insights into why:
- Are failures clustered on specific topics?
- Do longer inputs cause more failures?
- Is there a safety pattern?
This tool answers those questions in seconds.
- Upload CSV of your agent evaluation results
- Analyzes failures to find common characteristics
- Detects patterns (e.g., "80% of safety failures are on political topics")
- Provides recommendations (e.g., "Add safety filter for political questions")
- Exports insights for your team
cd backend
npm install
npm start
Server runs on http://localhost:5000
cd frontend
npx serve --single
Or open frontend/index.html directly in your browser.
Your CSV should have these columns:
task_name,topic,input_length,output_length,safety_passed,instruction_passed,efficiency_score,pass_fail
classify_email,finance,150,120,true,true,92,pass
classify_email,politics,250,180,false,false,45,fail
...
Required column: pass_fail (values: pass or fail)
Optional columns: Add any metadata you want to analyze for patterns:
- topic — categorical grouping
- input_length — numerical characteristic
- domain — categorization
- difficulty — custom metric
- etc.
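A minimal sketch of how an uploaded CSV might be validated and parsed before analysis, assuming the only hard requirement is the pass_fail column described above. The function name parseEvalCsv and the error message are illustrative, not the tool's actual API, and this naive split does not handle quoted fields:

```javascript
// Illustrative CSV parsing/validation sketch (not the tool's real code).
// Only pass_fail is required; every other column becomes a candidate
// characteristic to group failures by.
function parseEvalCsv(csvContent) {
  const [headerLine, ...rows] = csvContent.trim().split("\n");
  const headers = headerLine.split(",").map((h) => h.trim());
  if (!headers.includes("pass_fail")) {
    throw new Error("CSV must include a pass_fail column (values: pass or fail)");
  }
  // Turn each row into an object keyed by header name.
  return rows.map((row) => {
    const values = row.split(",").map((v) => v.trim());
    return Object.fromEntries(headers.map((h, i) => [h, values[i]]));
  });
}
```

Any extra columns you include (topic, difficulty, and so on) simply show up as extra keys on each row object, ready for grouping.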
- Isolates failures from your dataset
- Groups by characteristic (topic, length, etc.)
- Calculates frequency — what % of failures share this trait?
- Scores confidence — is this pattern significant?
- Generates recommendations — what should you do about it?
A pattern is flagged if:
- It accounts for >60% of failures
- At least 3 failures match it
- High confidence (validated against dataset size)
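The grouping and flagging steps above can be sketched roughly like this, using the stated thresholds (more than 60% of failures, at least 3 matches). This assumes one categorical characteristic at a time, and the confidence formula here is a simple stand-in scaled by dataset size; the tool's actual scoring may differ:

```javascript
// Sketch of pattern detection over parsed rows (objects keyed by column).
// Thresholds follow the README: >60% of failures and >=3 matching failures.
function detectPatterns(rows, characteristic) {
  const failures = rows.filter((r) => r.pass_fail === "fail");
  // Count failures per value of the chosen characteristic (e.g. topic).
  const counts = {};
  for (const f of failures) {
    const value = f[characteristic];
    counts[value] = (counts[value] || 0) + 1;
  }
  const patterns = [];
  for (const [value, matches] of Object.entries(counts)) {
    const percentage = (matches / failures.length) * 100;
    if (percentage > 60 && matches >= 3) {
      patterns.push({
        characteristic: { [characteristic]: value },
        matches,
        percentage: Math.round(percentage),
        // Illustrative confidence: grows with matches, damped by dataset size.
        confidence: Math.min(0.99, matches / Math.sqrt(rows.length)),
      });
    }
  }
  return patterns;
}
```

For example, 4 politics failures out of 5 total failures yields an 80% pattern, which clears both thresholds and gets flagged.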
We include sample_eval_data.csv with 100 realistic agent evaluations showing:
- ✅ Politics topic → 80% safety failures
- ✅ Input length >2000 → 85% instruction following failures
- ✅ Medical entity extraction → 70% failures
- ✅ Complex multi-step tasks → 60% failures
Try it: Upload the sample CSV to see how patterns are detected.
Analyze CSV data for failure patterns.
Request:
{
"csv_content": "task_name,topic,...\ndata1,val1,...\n..."
}
Response:
{
"success": true,
"total_tests": 100,
"total_failures": 20,
"failure_rate": 20,
"patterns": [
{
"description": "80% of safety failures are on political topics",
"matches": 16,
"percentage": 80,
"confidence": 0.95,
"characteristic": { "topic": "political" },
"recommendation": "Add safety guardrails for political queries"
}
]
}
- Model Evaluation — Understand which categories your agent struggles with
- Data Collection — Identify gaps to fill in training data
- Safety Testing — Detect safety failures clustered by topic/domain
- Performance Analysis — Find input characteristics that cause failures
- Continuous Improvement — Track patterns over time as you retrain
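A client-side call to the analysis API might look like the sketch below. The endpoint path /analyze is an assumption (the source only describes the request/response bodies), and the port follows the backend setup; adjust both to your deployment:

```javascript
// Illustrative client call; "/analyze" is an assumed endpoint path.
// The request/response shapes follow the JSON example above.
async function analyzeCsv(csvContent) {
  const res = await fetch("http://localhost:5000/analyze", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ csv_content: csvContent }),
  });
  if (!res.ok) throw new Error(`Analysis request failed: ${res.status}`);
  // Resolves to { success, total_tests, total_failures, failure_rate, patterns }
  return res.json();
}
```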
Input: 100 agent evaluations, 20 failures
Patterns Detected:
- "80% of safety failures happen on political topics" — 95% confidence
- "Input length >2000 chars → 85% fail instruction following" — 92% confidence
- "Medical entity extraction has 70% failure rate" — 88% confidence
Recommendations:
- Add safety filter for political content
- Retrain model on longer-context examples
- Collect more medical entity training data
Built as a micro-demo to showcase:
- Understanding agent evaluation problems — knows what matters
- Full-stack execution — backend analysis + polished frontend
- Fast shipping — built in ~60 minutes
- Research mindset — pattern detection, not just dashboards
Ideal for teams that want to move faster on agent reliability.
Questions? Check the sample CSV or reach out!