
🤖 Agent Testing Agent (ATA)

A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents

Python Version License OpenAI Agents SDK

Automatically test, evaluate, and improve your conversational AI agents with zero manual effort


Sameer Komoravolu¹      Khalil Mrini²

¹skomo2@illinois.edu      ²hello@drkhalil.ai

Work done during Sameer's internship at Grammarly under Khalil's mentorship


🌟 Overview

The Agent Testing Agent (ATA) is a sophisticated meta-agent that automatically tests and evaluates other conversational AI agents. It uses adaptive difficulty testing, persona-based scenarios, and comprehensive evaluation rubrics to provide actionable insights for improving agent performance.

🎯 Key Features

  • 🔄 Automated Testing Pipeline: From code analysis to report generation, ATA handles the entire testing workflow
  • 🎭 Persona-Based Testing: Dynamic test personas with realistic goals and behaviors
  • 📈 Adaptive Difficulty: Adjusts test difficulty based on agent performance
  • 🎯 Weakness-Targeted Scenarios: Identifies specific vulnerabilities and creates targeted tests
  • 📊 Comprehensive Evaluation: Multi-criteria evaluation with detailed rubrics and scoring
  • 🌐 Dual Interface: Use ATA from the command line or through a web UI
  • 📝 Actionable Reports: Detailed reports with specific code recommendations
  • 🔬 Research-Informed: Integrates academic research on AI testing methodologies

📦 Supported Agents

ATA can test any conversational agent, with pre-configured support for:

  • 📚 Wikipedia Writer Agent: Tests article creation with STORM methodology
  • ✈️ Travel Agent: Tests constraint handling and itinerary planning


🚀 Quick Start

Command Line Interface

# Install dependencies
pip install -r requirements.txt

# Start the Agent Testing Agent
python agent_testing_agent.py

# Follow the interactive prompts to test your agent

Web Interface

# Start the web server
python ata_web_server.py

# Open your browser to http://localhost:5000

πŸ› οΈ Installation

Prerequisites

  • Python 3.12 or higher
  • pip package manager
  • Git

Step-by-Step Installation

  1. Clone the repository

    git clone https://github.com/your-org/agent-testing-agent.git
    cd agent-testing-agent
  2. Create virtual environment (recommended)

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure environment

    export EMAIL="your-email@example.com"
    # Optional: export TRACING_EXPORT_API_KEY="your-api-key"
  5. Create configuration file (optional)

    # config/ata_config.yaml
    # Note: logger_dir should be a DIRECTORY path where session logs are saved
    logger_dir: "data/conversations"
    default_max_turns: 20
    papers_per_research_round: 5
    research_rounds: 3
    num_weaknesses_to_identify: 7

πŸ—‚οΈ Project Structure

agent-testing-agent/
├── agent_testing_agent.py          # CLI entry point (uses src/ for imports)
├── ata_web_server.py               # Web entry point (Flask + Socket.IO)
├── example_agents/                 # Example agents under test
│   ├── chat/
│   │   └── chat_agent.py
│   ├── travel/
│   │   ├── travel_agent.py
│   │   ├── plan_verifier.py
│   │   └── travel_models.py
│   └── wikipedia/
│       └── wikipedia_writer.py
├── src/ata/                        # ATA core implementation
│   ├── agents.py
│   ├── cli.py
│   ├── core.py
│   ├── models.py
│   ├── prompts.py
│   ├── utils.py
│   └── web_server.py
├── src/utils/
│   └── openai_utils.py
├── config/                         # Configuration files
│   ├── ata_config.yaml
│   ├── config.json
│   ├── web_config.json
│   └── agent_evaluation_criteria.json
├── data/
│   ├── conversations/              # Session logs (tracked via .gitkeep)
│   └── evaluations/                # Saved evaluations (tracked via .gitkeep)
├── web/                            # Web UI assets
│   ├── static/
│   │   ├── css/ata_style.css
│   │   └── js/ata_interface.js
│   └── templates/ata_interface.html
├── docs/                           # (removed)
└── requirements.txt

Notes:

  • The CLI and Web entry points remain python agent_testing_agent.py and python ata_web_server.py.
  • Config files live under config/ and data output (logs, reports) under data/.
  • .gitkeep files are placeholders to keep otherwise-empty directories tracked by Git.

📖 Usage Guide

Command Line Interface

The CLI provides a streamlined testing experience with interactive prompts:

--- Agent-Testing Agent (ATA) ---
Welcome to the Agent-Testing Agent! I'm here to help you test your agents.

Available agents for testing:
1. Wikipedia Writer Agent
2. Travel Agent

Which agent would you like to test? 

CLI Commands

  • Agent Selection: Enter number (1-2) or agent name
  • Weakness Review: Type accept, regenerate, or provide feedback
  • Exit: Type quit or exit

Web Interface

The web interface provides real-time testing visualization:

Features

  • Split View: Chat on left, test results on right
  • Real-time Monitoring: Watch tests execute live
  • Interactive Controls: Save/load sessions, view state
  • Visual Test Grid: See all test scenarios at a glance

Navigation

  • Tests Tab: View test scenarios and results
  • Criteria Tab: See evaluation rubrics
  • State Modal: Inspect system state
  • History Modal: Track workflow progression

πŸ—οΈ Architecture

System Overview

┌──────────────────────────────────────────────────────────┐
│                   User Interface Layer                   │
├──────────────┬───────────────────────────────────────────┤
│     CLI      │              Web Interface                │
├──────────────┴───────────────────────────────────────────┤
│                        Core Engine                       │
│                     (src/ata/core.py)                    │
├──────────────────────────────────────────────────────────┤
│                    Specialized Agents                    │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐     │
│  │ Selector │ │ Analyzer │ │Generator │ │Evaluator │     │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘     │
├──────────────────────────────────────────────────────────┤
│              Data Models & State Management              │
└──────────────────────────────────────────────────────────┘

Key Components

  • src/ata/core.py: Core testing engine and workflow orchestration
  • src/ata/agents.py: Specialized AI agents for each testing phase
  • src/ata/models.py: Data models and state management
  • src/ata/prompts.py: Prompt templates and instructions
  • src/ata/utils.py: Utility functions and helpers
  • src/ata/web_server.py: Flask-based web interface implementation
  • ata_web_server.py: Web server entry point

🔄 Testing Workflow

Phase 1: Setup & Analysis

  1. Agent Selection: Choose agent to test
  2. Code Analysis: Parse source code and structure
  3. Parameter Gathering: Define testing requirements

Phase 2: Research & Planning

  1. Web Research: Find relevant testing methodologies
  2. Weakness Analysis: Identify vulnerabilities
  3. Scenario Generation: Create targeted test cases

Phase 3: Execution & Evaluation

  1. Test Execution: Run scenarios programmatically
  2. Performance Evaluation: Score using rubrics
  3. Report Generation: Create actionable insights

📊 Evaluation System

Scoring Framework

Rubric-Based Criteria (1-5 scale)

  • 5: Excellent - Exceeds expectations
  • 4: Good - Meets expectations
  • 3: Adequate - Basic requirements met
  • 2: Poor - Significant issues
  • 1: Failing - Does not meet requirements

Overall Performance (1-10 scale)

Combines criteria scores with efficiency metrics using weighted averaging.
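
As a rough sketch only, that blend could look like the snippet below; the weights, field names, and formula are illustrative assumptions rather than ATA's actual computation.

# Illustrative only: combine 1-5 rubric scores with a 0-1 efficiency metric
# into a single 10-point score. ATA's real weights and formula may differ.
def overall_score(criterion_scores: dict[str, int], efficiency: float,
                  efficiency_weight: float = 0.2) -> float:
    rubric_avg = sum(criterion_scores.values()) / len(criterion_scores)  # average of 1-5 scores
    blended = (1 - efficiency_weight) * (rubric_avg / 5) + efficiency_weight * efficiency
    return round(10 * blended, 1)  # scale the 0-1 blend to a 10-point score

# Example: criteria scored 4, 5, 3 with 70% turn efficiency -> 7.8
print(overall_score({"Use of Citations": 4, "Completeness": 5, "Style": 3}, efficiency=0.7))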

Agent-Specific Criteria

Wikipedia Writer Agent

  • Use of Citations
  • Completeness of Article
  • Style and Organization
  • Overall Utility

Travel Agent

  • Constraint Handling
  • User Communication
  • Overall Utility

Adaptive Difficulty Algorithm

# Simplified algorithm
next_difficulty = weighted_average(
    adjust_difficulty(current_difficulty, score),
    weights=calculate_weight(score)
)
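
One plausible reading of this rule, spelled out as runnable code, is that each evaluation nudges the difficulty toward an adjusted target, with more extreme scores carrying more weight. The helper implementations below are assumptions made for illustration; the real logic lives in src/ata/core.py and may differ.

# Hedged sketch of score-driven difficulty adjustment (difficulty in [0, 1],
# score on the overall 1-10 scale). Step sizes and weights are illustrative.
def adjust_difficulty(difficulty: float, score: float) -> float:
    step = (score - 5.5) / 10.0                # high scores raise difficulty, low scores lower it
    return min(1.0, max(0.0, difficulty + step))

def calculate_weight(score: float) -> float:
    return min(1.0, abs(score - 5.5) / 4.5)    # extreme scores move the difficulty more

def next_difficulty(current: float, score: float) -> float:
    target = adjust_difficulty(current, score)
    w = calculate_weight(score)
    return (1 - w) * current + w * target      # weighted average of current and adjusted value

print(next_difficulty(0.5, score=9))           # strong run -> harder follow-up tests (~0.77)
print(next_difficulty(0.5, score=2))           # weak run   -> easier follow-up tests (~0.23)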

βš™οΈ Configuration

ATA looks for config/ata_config.yaml (or JSON) for configuration. You can override the path with the ATA_CONFIG_PATH environment variable.

# Logging (DIRECTORY for session logs)
logger_dir: "data/conversations"

# Test Parameters
default_max_turns: 20
papers_per_research_round: 3
research_rounds: 2
num_weaknesses_to_identify: 5
num_scenarios_per_weakness: 3

# Workflow Options
skip_parameter_gathering: false
skip_web_search: false
ablate_scanning: false
ablate_difficulty: false
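
For reference, a loader that honors this lookup order (the default path, overridden by ATA_CONFIG_PATH) could be as small as the sketch below. load_ata_config is a hypothetical helper, not part of the ATA codebase, and the sketch requires PyYAML.

# Hypothetical config loader, shown only to illustrate the lookup order.
import os
import yaml  # pip install pyyaml

def load_ata_config(default_path: str = "config/ata_config.yaml") -> dict:
    path = os.environ.get("ATA_CONFIG_PATH", default_path)  # env var wins when set
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f) or {}

config = load_ata_config()
print(config.get("default_max_turns", 20))  # falls back to the documented default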

Additional configuration files in config/:

  • config.json: Web server settings (host, port, debug)
  • agent_evaluation_criteria.json: Criteria used by the judge phase
  • web_config.json: Web UI-specific toggles

Configuration Options

  • Test Execution: Control test duration and complexity
  • Research Settings: Configure research depth
  • Ablation Modes: Skip phases for experimentation
  • Logging: Set log levels and destinations

📚 API Documentation

Core Classes

AgentTesterState

class AgentTesterState:
    agent_selection: Optional[AgentSelection]
    testing_parameters: Optional[TestingParameters]
    code_analysis: Optional[CodeAnalysisResult]
    test_scenarios: Dict[str, List[TestScenario]]
    evaluations: Optional[TestEvaluations]
    final_report: Optional[FinalReport]

TestScenario

class TestScenario(BaseModel):
    scenario_id: str
    description: str
    persona: TestPersona
    max_turns: int = 20
    expected_outcome: str
    difficulty: Optional[float]

Key Functions

execute_single_test

async def execute_single_test(
    state: AgentTesterState, 
    scenario: TestScenario
) -> Dict[str, Any]

verify_agent_availability

async def verify_agent_availability(
    agent_key: str
) -> Tuple[bool, str]
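
A minimal usage sketch for these two coroutines follows. The import path (src.ata.core) and the agent key ("travel") are assumptions rather than documented values, and building the state and scenario objects is left to the normal workflow.

# Hedged usage sketch; run from the repository root so src/ata is importable.
import asyncio
from src.ata.core import execute_single_test, verify_agent_availability  # assumed module path

async def main() -> None:
    ok, message = await verify_agent_availability("travel")  # "travel" is an assumed agent key
    print(f"available={ok}: {message}")
    # With an AgentTesterState and a TestScenario in hand (see the classes above),
    # a single scenario would then run via:
    #   result = await execute_single_test(state, scenario)

asyncio.run(main())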

Adding Custom Agents

  1. Update AVAILABLE_AGENTS in src/ata/models.py
  2. Add evaluation criteria to agent_evaluation_criteria.json
  3. Ensure the agent follows the expected interface (an illustrative registry entry is sketched below)
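
The exact schema of AVAILABLE_AGENTS is defined in src/ata/models.py; the entry below is a purely illustrative sketch with assumed field names, not the real structure.

# Purely illustrative registry entry; field names are assumptions.
AVAILABLE_AGENTS = {
    # ...existing entries (wikipedia, travel)...
    "recipe": {
        "name": "Recipe Planner Agent",                  # display name for the CLI/web UI
        "module": "example_agents.recipe.recipe_agent",  # where ATA would import the agent from
        "description": "Plans meals under dietary constraints",
    },
}

A matching entry in config/agent_evaluation_criteria.json would then list the rubric criteria (for example, Constraint Handling and Overall Utility) that the judge phase applies to the new agent.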

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


Making AI agents more reliable, one test at a time
