
🔍 AgentHub Failure Pattern Detector

Automatically detect why your AI agents are failing. Upload your evaluation data and instantly surface failure patterns, characteristics, and actionable recommendations.

The Problem

When agents fail evaluations, teams get lists of failures but no insights into why:

  • Are failures clustered on specific topics?
  • Do longer inputs cause more failures?
  • Is there a safety pattern?

This tool answers those questions in seconds.

What It Does

  1. Upload a CSV of your agent evaluation results
  2. The tool analyzes failures to find common characteristics
  3. It detects patterns (e.g., "80% of safety failures are on political topics")
  4. It provides recommendations (e.g., "Add safety filter for political questions")
  5. It exports insights for your team

🚀 Quick Start

Backend

cd backend
npm install
npm start

Server runs on http://localhost:5000

Frontend

cd frontend
npx serve --single

Or open frontend/index.html directly in your browser.

📊 CSV Format

Your CSV should have these columns:

task_name,topic,input_length,output_length,safety_passed,instruction_passed,efficiency_score,pass_fail
classify_email,finance,150,120,true,true,92,pass
classify_email,politics,250,180,false,false,45,fail
...

Required column: pass_fail (values: pass or fail)

Optional columns: Add any metadata you want to analyze for patterns:

  • topic — categorical grouping
  • input_length — numerical characteristic
  • domain — categorization
  • difficulty — custom metric
  • etc.
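
A minimal sketch of how a CSV in this format could be parsed and validated — hypothetical helper, not the repo's actual parser, and it assumes no quoted fields containing commas:

```javascript
// Parse the evaluation CSV into row objects and validate the one
// required column. (Illustrative only; a real backend would likely
// use a proper CSV library to handle quoting and escaping.)
function parseEvalCsv(csvText) {
  const lines = csvText.trim().split("\n");
  const headers = lines[0].split(",").map((h) => h.trim());
  if (!headers.includes("pass_fail")) {
    throw new Error("CSV must include a pass_fail column");
  }
  return lines.slice(1).map((line) => {
    const values = line.split(",");
    return Object.fromEntries(headers.map((h, i) => [h, values[i]?.trim()]));
  });
}

const rows = parseEvalCsv(
  "task_name,topic,pass_fail\nclassify_email,finance,pass\nclassify_email,politics,fail"
);
// rows[1] → { task_name: "classify_email", topic: "politics", pass_fail: "fail" }
```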

💡 How Pattern Detection Works

  1. Isolates failures from your dataset
  2. Groups by characteristic (topic, length, etc.)
  3. Calculates frequency — what % of failures share this trait?
  4. Scores confidence — is this pattern significant?
  5. Generates recommendations — what should you do about it?
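
Steps 1–3 above can be sketched as follows. The helper name and row shape are assumptions for illustration; the actual backend may structure this differently:

```javascript
// Isolate failures, group them by one characteristic, and compute
// what share of all failures each group accounts for.
function failureShares(rows, characteristic) {
  const failures = rows.filter((r) => r.pass_fail === "fail"); // step 1
  const counts = {};
  for (const row of failures) {
    const key = row[characteristic]; // step 2: group by trait
    counts[key] = (counts[key] || 0) + 1;
  }
  // step 3: frequency as a percentage of all failures
  return Object.entries(counts).map(([value, matches]) => ({
    value,
    matches,
    percentage: Math.round((matches / failures.length) * 100),
  }));
}

const shares = failureShares(
  [
    { topic: "politics", pass_fail: "fail" },
    { topic: "politics", pass_fail: "fail" },
    { topic: "finance", pass_fail: "pass" },
    { topic: "finance", pass_fail: "fail" },
  ],
  "topic"
);
// politics accounts for 2 of the 3 failures (67%)
```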

Pattern Scoring

A pattern is flagged if:

  • It accounts for >60% of failures
  • At least 3 failures match it
  • High confidence (validated against dataset size)
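
The flagging rule above can be expressed as a simple predicate. The >60% and ≥3 thresholds come from this README; the confidence formula is an illustrative heuristic, not the repo's exact calculation:

```javascript
// Flag a pattern when it covers >60% of failures, matches at least 3
// of them, and the dataset is large enough to trust the share.
function isFlagged(pattern, totalFailures) {
  const share = pattern.matches / totalFailures;
  // Crude confidence: the same share is more trustworthy when it is
  // backed by more failures. (Assumed heuristic for illustration.)
  const confidence = share * (1 - 1 / Math.sqrt(totalFailures + 1));
  return share > 0.6 && pattern.matches >= 3 && confidence > 0.5;
}
```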

📁 Sample Data

We include sample_eval_data.csv with 100 realistic agent evaluations showing:

  • ✅ Politics topic → 80% safety failures
  • ✅ Input length >2000 → 85% instruction following failures
  • ✅ Medical entity extraction → 70% failures
  • ✅ Complex multi-step tasks → 60% failures

Try it: Upload the sample CSV to see how patterns are detected.

🔧 API Reference

POST /analyze

Analyze CSV data for failure patterns.

Request:

{
  "csv_content": "task_name,topic,...\ndata1,val1,...\n..."
}

Response:

{
  "success": true,
  "total_tests": 100,
  "total_failures": 20,
  "failure_rate": 20,
  "patterns": [
    {
      "description": "80% of safety failures are on political topics",
      "matches": 16,
      "percentage": 80,
      "confidence": 0.95,
      "characteristic": { "topic": "political" },
      "recommendation": "Add safety guardrails for political queries"
    }
  ]
}
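
Calling the endpoint from Node 18+ (which ships `fetch`) could look like this. The URL assumes the default backend on localhost:5000, and the request/response shapes follow the reference above:

```javascript
// Build the request separately so it can be inspected or tested
// without a running server. (Helper names are illustrative.)
function buildAnalyzeRequest(csvContent) {
  return {
    url: "http://localhost:5000/analyze",
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ csv_content: csvContent }),
    },
  };
}

async function analyze(csvContent) {
  const { url, options } = buildAnalyzeRequest(csvContent);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Analyze failed: ${res.status}`);
  return res.json(); // { success, total_tests, total_failures, patterns, ... }
}
```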

🎯 Use Cases

  • Model Evaluation — Understand which categories your agent struggles with
  • Data Collection — Identify gaps to fill in training data
  • Safety Testing — Detect safety failures clustered by topic/domain
  • Performance Analysis — Find input characteristics that cause failures
  • Continuous Improvement — Track patterns over time as you retrain

🌐 Deploy to Vercel

vercel

Or use the one-click "Deploy with Vercel" button.

📈 Example Output

Input: 100 agent evaluations, 20 failures

Patterns Detected:

  1. "80% of safety failures happen on political topics" — 95% confidence
  2. "Input length >2000 chars → 85% fail instruction following" — 92% confidence
  3. "Medical entity extraction has 70% failure rate" — 88% confidence

Recommendations:

  • Add safety filter for political content
  • Retrain model on longer-context examples
  • Collect more medical entity training data

🤝 About This Demo

Built as a micro-demo to showcase:

  • Understanding agent evaluation problems — knows what matters
  • Full-stack execution — backend analysis + polished frontend
  • Fast shipping — built in ~60 minutes
  • Research mindset — pattern detection, not just dashboards

Ideal for teams that want to move faster on agent reliability.


Questions? Check the sample CSV or reach out!
