A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents
Automatically test, evaluate, and improve your conversational AI agents with zero manual effort
Sameer Komoravolu¹        Khalil Mrini²
¹skomo2@illinois.edu        ²hello@drkhalil.ai
Work done during Sameer's internship at Grammarly under Khalil's mentorship
The Agent Testing Agent (ATA) is a sophisticated meta-agent that automatically tests and evaluates other conversational AI agents. It uses adaptive difficulty testing, persona-based scenarios, and comprehensive evaluation rubrics to provide actionable insights for improving agent performance.
- Automated Testing Pipeline: From code analysis to report generation, ATA handles the entire testing workflow
- Persona-Based Testing: Dynamic test personas with realistic goals and behaviors
- Adaptive Difficulty: Advanced algorithms adjust test difficulty based on agent performance
- Weakness-Targeted Scenarios: Identifies specific vulnerabilities and creates targeted tests
- Comprehensive Evaluation: Multi-criteria evaluation with detailed rubrics and scoring
- Dual Interface: Both command-line and web interfaces available
- Actionable Reports: Detailed reports with specific code recommendations
- Research-Informed: Integrates academic research on AI testing methodologies
ATA can test any conversational agent, with pre-configured support for:
- Wikipedia Writer Agent: Tests article creation with STORM methodology
- Travel Agent: Tests constraint handling and itinerary planning
- Quick Start
- Installation
- Usage Guide
- Project Structure
- Architecture
- Testing Workflow
- Evaluation System
- Configuration
- API Documentation
```bash
# Install dependencies
pip install -r requirements.txt

# Start the Agent Testing Agent
python agent_testing_agent.py

# Follow the interactive prompts to test your agent
```

```bash
# Start the web server
python ata_web_server.py

# Open your browser to http://localhost:5000
```

Prerequisites:
- Python 3.12 or higher
- pip package manager
- Git
- Clone the repository

  ```bash
  git clone https://github.com/your-org/agent-testing-agent.git
  cd agent-testing-agent
  ```

- Create virtual environment (recommended)

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment

  ```bash
  export EMAIL="your-email@example.com"
  # Optional:
  export TRACING_EXPORT_API_KEY="your-api-key"
  ```

- Create configuration file (optional)

  ```yaml
  # config/ata_config.yaml
  # Note: logger_dir should be a DIRECTORY path where session logs are saved
  logger_dir: "data/conversations"
  default_max_turns: 20
  papers_per_research_round: 5
  research_rounds: 3
  num_weaknesses_to_identify: 7
  ```
```
agent-testing-agent/
├── agent_testing_agent.py        # CLI entry point (uses src/ for imports)
├── ata_web_server.py             # Web entry point (Flask + Socket.IO)
├── example_agents/               # Example agents under test
│   ├── chat/
│   │   └── chat_agent.py
│   ├── travel/
│   │   ├── travel_agent.py
│   │   ├── plan_verifier.py
│   │   └── travel_models.py
│   └── wikipedia/
│       └── wikipedia_writer.py
├── src/ata/                      # ATA core implementation
│   ├── agents.py
│   ├── cli.py
│   ├── core.py
│   ├── models.py
│   ├── prompts.py
│   ├── utils.py
│   └── web_server.py
├── src/utils/
│   └── openai_utils.py
├── config/                       # Configuration files
│   ├── ata_config.yaml
│   ├── config.json
│   ├── web_config.json
│   └── agent_evaluation_criteria.json
├── data/
│   ├── conversations/            # Session logs (tracked via .gitkeep)
│   └── evaluations/              # Saved evaluations (tracked via .gitkeep)
├── web/                          # Web UI assets
│   ├── static/
│   │   ├── css/ata_style.css
│   │   └── js/ata_interface.js
│   └── templates/ata_interface.html
├── docs/                         # (removed)
└── requirements.txt
```
Notes:
- The CLI and Web entry points remain `python agent_testing_agent.py` and `python ata_web_server.py`.
- Config files live under `config/` and data output (logs, reports) under `data/`.
- `.gitkeep` files are placeholders to keep otherwise-empty directories tracked by Git.
The CLI provides a streamlined testing experience with interactive prompts:
```
--- Agent-Testing Agent (ATA) ---
Welcome to the Agent-Testing Agent! I'm here to help you test your agents.

Available agents for testing:
1. Wikipedia Writer Agent
2. Travel Agent

Which agent would you like to test?
```
- Agent Selection: Enter number (1-2) or agent name
- Weakness Review: Type `accept`, `regenerate`, or provide feedback
- Exit: Type `quit` or `exit`
The web interface provides real-time testing visualization:
- Split View: Chat on left, test results on right
- Real-time Monitoring: Watch tests execute live
- Interactive Controls: Save/load sessions, view state
- Visual Test Grid: See all test scenarios at a glance
- Tests Tab: View test scenarios and results
- Criteria Tab: See evaluation rubrics
- State Modal: Inspect system state
- History Modal: Track workflow progression
```
┌────────────────────────────────────────────────────────────┐
│                    User Interface Layer                    │
├──────────────┬─────────────────────────────────────────────┤
│     CLI      │                Web Interface                │
├──────────────┴─────────────────────────────────────────────┤
│                        Core Engine                         │
│                     (src/ata/core.py)                      │
├────────────────────────────────────────────────────────────┤
│                     Specialized Agents                     │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │ Selector │ │ Analyzer │ │Generator │ │Evaluator │       │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘       │
├────────────────────────────────────────────────────────────┤
│               Data Models & State Management               │
└────────────────────────────────────────────────────────────┘
```
- `src/ata/core.py`: Core testing engine and workflow orchestration
- `src/ata/agents.py`: Specialized AI agents for each testing phase
- `src/ata/models.py`: Data models and state management
- `src/ata/prompts.py`: Prompt templates and instructions
- `src/ata/utils.py`: Utility functions and helpers
- `src/ata/web_server.py`: Flask-based web interface implementation
- `ata_web_server.py`: Web server entry point
- Agent Selection: Choose agent to test
- Code Analysis: Parse source code and structure
- Parameter Gathering: Define testing requirements
- Web Research: Find relevant testing methodologies
- Weakness Analysis: Identify vulnerabilities
- Scenario Generation: Create targeted test cases
- Test Execution: Run scenarios programmatically
- Performance Evaluation: Score using rubrics
- Report Generation: Create actionable insights
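These phases run strictly in sequence. As a rough, illustrative sketch of the pipeline shape (the phase names below mirror the list above; they are not the actual function names in `src/ata/core.py`):

```python
from typing import Callable

# Illustrative only: phase names mirror the workflow above, not the real ATA API.
PHASES = [
    "agent_selection", "code_analysis", "parameter_gathering",
    "web_research", "weakness_analysis", "scenario_generation",
    "test_execution", "performance_evaluation", "report_generation",
]

def run_workflow(handlers: dict[str, Callable[[dict], dict]]) -> dict:
    """Run each phase in order, threading a shared state dict through the handlers."""
    state: dict = {}
    for phase in PHASES:
        state = handlers[phase](state)
    return state
```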
- 5: Excellent - Exceeds expectations
- 4: Good - Meets expectations
- 3: Adequate - Basic requirements met
- 2: Poor - Significant issues
- 1: Failing - Does not meet requirements
The overall score combines criteria scores with efficiency metrics using weighted averaging.
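As a hedged illustration of what such a combination could look like (the weights and the efficiency scale below are assumptions, not the values ATA actually uses):

```python
def overall_score(criteria_scores: dict[str, int], efficiency: float,
                  criteria_weight: float = 0.8, efficiency_weight: float = 0.2) -> float:
    """Blend the mean 1-5 criteria score with an efficiency score on the same scale."""
    mean_criteria = sum(criteria_scores.values()) / len(criteria_scores)
    return criteria_weight * mean_criteria + efficiency_weight * efficiency

# Example: a run scored on the four Wikipedia Writer criteria listed below
print(overall_score(
    {"Use of Citations": 4, "Completeness of Article": 3,
     "Style and Organization": 5, "Overall Utility": 4},
    efficiency=3.5,
))  # 3.9
```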
Wikipedia Writer Agent
- Use of Citations
- Completeness of Article
- Style and Organization
- Overall Utility
Travel Agent
- Constraint Handling
- User Communication
- Overall Utility
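Per-agent criteria like these are read from `config/agent_evaluation_criteria.json`. The real schema is not reproduced here; purely as a hypothetical illustration of how such a file could be structured:

```python
import json

# Hypothetical structure only -- inspect config/agent_evaluation_criteria.json
# for the actual schema before editing it.
criteria = {
    "wikipedia": [
        {"name": "Use of Citations", "description": "Claims are supported by cited sources", "scale": [1, 5]},
        {"name": "Completeness of Article", "description": "Covers the topic's key aspects", "scale": [1, 5]},
    ],
    "travel": [
        {"name": "Constraint Handling", "description": "Respects budget, dates, and preferences", "scale": [1, 5]},
    ],
}
print(json.dumps(criteria, indent=2))
```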
```python
# Simplified algorithm
next_difficulty = weighted_average(
    adjust_difficulty(current_difficulty, score),
    weights=calculate_weight(score)
)
```
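A minimal runnable reading of this, under the assumptions that difficulty lives in [0, 1], a score of 3/5 is neutral, and more decisive scores carry more weight (the actual functions in `src/ata/core.py` may differ):

```python
def adjust_difficulty(difficulty: float, score: float, step: float = 0.1) -> float:
    """Raise difficulty when the agent scores above 3/5, lower it when below."""
    return min(1.0, max(0.0, difficulty + step * (score - 3)))

def calculate_weight(score: float) -> float:
    """Weigh decisive results (very high or very low scores) more heavily."""
    return abs(score - 3) + 1

def next_difficulty(history: list[tuple[float, float]]) -> float:
    """Weighted average of per-test adjustments over (difficulty, score) pairs."""
    weights = [calculate_weight(s) for _, s in history]
    adjusted = [adjust_difficulty(d, s) for d, s in history]
    return sum(a * w for a, w in zip(adjusted, weights)) / sum(weights)

print(next_difficulty([(0.5, 4), (0.6, 5), (0.7, 2)]))  # ~0.69
```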
ATA looks for `config/ata_config.yaml` (or JSON) for configuration. You can override the path with the `ATA_CONFIG_PATH` environment variable.
```yaml
# Logging (DIRECTORY for session logs)
logger_dir: "data/conversations"

# Test Parameters
default_max_turns: 20
papers_per_research_round: 3
research_rounds: 2
num_weaknesses_to_identify: 5
num_scenarios_per_weakness: 3

# Workflow Options
skip_parameter_gathering: false
skip_web_search: false
ablate_scanning: false
ablate_difficulty: false
```

Additional configuration files in `config/`:
- `config.json`: Web server settings (host, port, debug)
- `agent_evaluation_criteria.json`: Criteria used by the judge phase
- `web_config.json`: Web UI-specific toggles
- Test Execution: Control test duration and complexity
- Research Settings: Configure research depth
- Ablation Modes: Skip phases for experimentation
- Logging: Set log levels and destinations
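For reference, a minimal sketch of loading such a config with the `ATA_CONFIG_PATH` override (an assumption about how a loader could look, not the actual code in `src/ata/`):

```python
import os
import yaml  # PyYAML

def load_ata_config(default_path: str = "config/ata_config.yaml") -> dict:
    """Read the YAML config, honoring the ATA_CONFIG_PATH environment override."""
    path = os.environ.get("ATA_CONFIG_PATH", default_path)
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}

config = load_ata_config()
max_turns = config.get("default_max_turns", 20)
```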
Key data models and functions:

```python
class AgentTesterState:
    agent_selection: Optional[AgentSelection]
    testing_parameters: Optional[TestingParameters]
    code_analysis: Optional[CodeAnalysisResult]
    test_scenarios: Dict[str, List[TestScenario]]
    evaluations: Optional[TestEvaluations]
    final_report: Optional[FinalReport]

class TestScenario(BaseModel):
    scenario_id: str
    description: str
    persona: TestPersona
    max_turns: int = 20
    expected_outcome: str
    difficulty: Optional[float]
```

```python
async def execute_single_test(
    state: AgentTesterState,
    scenario: TestScenario
) -> Dict[str, Any]

async def verify_agent_availability(
    agent_key: str
) -> Tuple[bool, str]
```

To add a new agent for testing:

- Update `AVAILABLE_AGENTS` in `src/ata/models.py` (see the hypothetical sketch below)
- Add evaluation criteria to `agent_evaluation_criteria.json`
- Ensure the agent follows the expected interface
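The exact shape of `AVAILABLE_AGENTS` is not shown here; purely as a hypothetical sketch of what registering a new agent might look like (the field names and the `support` agent are assumptions, so check `src/ata/models.py` for the real structure):

```python
# src/ata/models.py (hypothetical excerpt -- verify the real structure first)
AVAILABLE_AGENTS = {
    "wikipedia": {
        "name": "Wikipedia Writer Agent",
        "module": "example_agents.wikipedia.wikipedia_writer",
    },
    "travel": {
        "name": "Travel Agent",
        "module": "example_agents.travel.travel_agent",
    },
    # New agent under test (illustrative entry):
    "support": {
        "name": "Customer Support Agent",
        "module": "example_agents.support.support_agent",
    },
}
```

A matching entry keyed the same way would then be added to `config/agent_evaluation_criteria.json` so the judge phase knows how to score the new agent.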
This project is licensed under the MIT License - see LICENSE file.
Making AI agents more reliable, one test at a time