A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents
Automatically test, evaluate, and improve your conversational AI agents with zero manual effort
Sameer Komoravolu¹        Khalil Mrini²
¹skomo2@illinois.edu        ²hello@drkhalil.ai
Work done during Sameer's internship at Grammarly under Khalil's mentorship
The Agent Testing Agent (ATA) is a sophisticated meta-agent that automatically tests and evaluates other conversational AI agents. It uses adaptive difficulty testing, persona-based scenarios, and comprehensive evaluation rubrics to provide actionable insights for improving agent performance.
- Automated Testing Pipeline: From code analysis to report generation, ATA handles the entire testing workflow
- Persona-Based Testing: Dynamic test personas with realistic goals and behaviors
- Adaptive Difficulty: Advanced algorithms adjust test difficulty based on agent performance
- Weakness-Targeted Scenarios: Identifies specific vulnerabilities and creates targeted tests
- Comprehensive Evaluation: Multi-criteria evaluation with detailed rubrics and scoring
- Dual Interface: Both command-line and web interfaces available
- Actionable Reports: Detailed reports with specific code recommendations
- Research-Informed: Integrates academic research on AI testing methodologies
ATA can test any conversational agent, with pre-configured support for:
- Wikipedia Writer Agent: Tests article creation with STORM methodology
- Travel Agent: Tests constraint handling and itinerary planning
- Quick Start
- Installation
- Usage Guide
- Project Structure
- Architecture
- Testing Workflow
- Evaluation System
- Configuration
- API Documentation
```bash
# Install dependencies
pip install -r requirements.txt

# Start the Agent Testing Agent
python agent_testing_agent.py

# Follow the interactive prompts to test your agent
```

```bash
# Start the web server
python ata_web_server.py

# Open your browser to http://localhost:5000
```

Prerequisites:
- Python 3.12 or higher
- pip package manager
- Git
- Clone the repository

  ```bash
  git clone https://github.com/your-org/agent-testing-agent.git
  cd agent-testing-agent
  ```

- Create virtual environment (recommended)

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment

  ```bash
  export EMAIL="your-email@example.com"
  # Optional:
  export TRACING_EXPORT_API_KEY="your-api-key"
  ```

- Create configuration file (optional)

  ```yaml
  # config/ata_config.yaml
  # Note: logger_dir should be a DIRECTORY path where session logs are saved
  logger_dir: "data/conversations"
  default_max_turns: 20
  papers_per_research_round: 5
  research_rounds: 3
  num_weaknesses_to_identify: 7
  ```
```
agent-testing-agent/
├── agent_testing_agent.py        # CLI entry point (uses src/ for imports)
├── ata_web_server.py             # Web entry point (Flask + Socket.IO)
├── example_agents/               # Example agents under test
│   ├── chat/
│   │   └── chat_agent.py
│   ├── travel/
│   │   ├── travel_agent.py
│   │   ├── plan_verifier.py
│   │   └── travel_models.py
│   └── wikipedia/
│       └── wikipedia_writer.py
├── src/ata/                      # ATA core implementation
│   ├── agents.py
│   ├── cli.py
│   ├── core.py
│   ├── models.py
│   ├── prompts.py
│   ├── utils.py
│   └── web_server.py
├── src/utils/
│   └── openai_utils.py
├── config/                       # Configuration files
│   ├── ata_config.yaml
│   ├── config.json
│   ├── web_config.json
│   └── agent_evaluation_criteria.json
├── data/
│   ├── conversations/            # Session logs (tracked via .gitkeep)
│   └── evaluations/              # Saved evaluations (tracked via .gitkeep)
├── web/                          # Web UI assets
│   ├── static/
│   │   ├── css/ata_style.css
│   │   └── js/ata_interface.js
│   └── templates/ata_interface.html
├── docs/                         # (removed)
└── requirements.txt
```
Notes:
- The CLI and Web entry points remain `python agent_testing_agent.py` and `python ata_web_server.py`.
- Config files live under `config/` and data output (logs, reports) under `data/`.
- `.gitkeep` files are placeholders to keep otherwise-empty directories tracked by Git.
The CLI provides a streamlined testing experience with interactive prompts:
```
--- Agent-Testing Agent (ATA) ---
Welcome to the Agent-Testing Agent! I'm here to help you test your agents.

Available agents for testing:
1. Wikipedia Writer Agent
2. Travel Agent

Which agent would you like to test?
```
- Agent Selection: Enter number (1-2) or agent name
- Weakness Review: Type `accept`, `regenerate`, or provide feedback
- Exit: Type `quit` or `exit`
The web interface provides real-time testing visualization:
- Split View: Chat on left, test results on right
- Real-time Monitoring: Watch tests execute live
- Interactive Controls: Save/load sessions, view state
- Visual Test Grid: See all test scenarios at a glance
- Tests Tab: View test scenarios and results
- Criteria Tab: See evaluation rubrics
- State Modal: Inspect system state
- History Modal: Track workflow progression
```
┌────────────────────────────────────────────────────────────┐
│                    User Interface Layer                    │
├──────────────┬─────────────────────────────────────────────┤
│     CLI      │                Web Interface                │
├──────────────┴─────────────────────────────────────────────┤
│                        Core Engine                         │
│                     (src/ata/core.py)                      │
├────────────────────────────────────────────────────────────┤
│                     Specialized Agents                     │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │ Selector │ │ Analyzer │ │Generator │ │Evaluator │       │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘       │
├────────────────────────────────────────────────────────────┤
│               Data Models & State Management               │
└────────────────────────────────────────────────────────────┘
```
- `src/ata/core.py`: Core testing engine and workflow orchestration
- `src/ata/agents.py`: Specialized AI agents for each testing phase
- `src/ata/models.py`: Data models and state management
- `src/ata/prompts.py`: Prompt templates and instructions
- `src/ata/utils.py`: Utility functions and helpers
- `src/ata/web_server.py`: Flask-based web interface implementation
- `ata_web_server.py`: Web server entry point
- Agent Selection: Choose agent to test
- Code Analysis: Parse source code and structure
- Parameter Gathering: Define testing requirements
- Web Research: Find relevant testing methodologies
- Weakness Analysis: Identify vulnerabilities
- Scenario Generation: Create targeted test cases
- Test Execution: Run scenarios programmatically
- Performance Evaluation: Score using rubrics
- Report Generation: Create actionable insights
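These phases run strictly in sequence. As a rough, illustrative sketch of the pipeline shape (the phase names below mirror the list above; they are not the actual function names in `src/ata/core.py`):

```python
from typing import Callable

# Illustrative only: phase names mirror the workflow above, not the real ATA API.
PHASES = [
    "agent_selection", "code_analysis", "parameter_gathering",
    "web_research", "weakness_analysis", "scenario_generation",
    "test_execution", "performance_evaluation", "report_generation",
]

def run_workflow(handlers: dict[str, Callable[[dict], dict]]) -> dict:
    """Run each phase in order, threading a shared state dict through the handlers."""
    state: dict = {}
    for phase in PHASES:
        state = handlers[phase](state)
    return state
```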
- 5: Excellent - Exceeds expectations
- 4: Good - Meets expectations
- 3: Adequate - Basic requirements met
- 2: Poor - Significant issues
- 1: Failing - Does not meet requirements
The overall score combines criteria scores with efficiency metrics using weighted averaging.
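As a hedged illustration of what such a combination could look like (the weights and the efficiency scale below are assumptions, not the values ATA actually uses):

```python
def overall_score(criteria_scores: dict[str, int], efficiency: float,
                  criteria_weight: float = 0.8, efficiency_weight: float = 0.2) -> float:
    """Blend the mean 1-5 criteria score with an efficiency score on the same scale."""
    mean_criteria = sum(criteria_scores.values()) / len(criteria_scores)
    return criteria_weight * mean_criteria + efficiency_weight * efficiency

# Example: a run scored on the four Wikipedia Writer criteria listed below
print(overall_score(
    {"Use of Citations": 4, "Completeness of Article": 3,
     "Style and Organization": 5, "Overall Utility": 4},
    efficiency=3.5,
))  # 3.9
```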
Wikipedia Writer Agent
- Use of Citations
- Completeness of Article
- Style and Organization
- Overall Utility
Travel Agent
- Constraint Handling
- User Communication
- Overall Utility
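Per-agent criteria like these are read from `config/agent_evaluation_criteria.json`. The real schema is not reproduced here; purely as a hypothetical illustration of how such a file could be structured:

```python
import json

# Hypothetical structure only -- inspect config/agent_evaluation_criteria.json
# for the actual schema before editing it.
criteria = {
    "wikipedia": [
        {"name": "Use of Citations", "description": "Claims are supported by cited sources", "scale": [1, 5]},
        {"name": "Completeness of Article", "description": "Covers the topic's key aspects", "scale": [1, 5]},
    ],
    "travel": [
        {"name": "Constraint Handling", "description": "Respects budget, dates, and preferences", "scale": [1, 5]},
    ],
}
print(json.dumps(criteria, indent=2))
```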
```python
# Simplified algorithm
next_difficulty = weighted_average(
    adjust_difficulty(current_difficulty, score),
    weights=calculate_weight(score)
)
```
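A minimal runnable reading of this, under the assumptions that difficulty lives in [0, 1], a score of 3/5 is neutral, and more decisive scores carry more weight (the actual functions in `src/ata/core.py` may differ):

```python
def adjust_difficulty(difficulty: float, score: float, step: float = 0.1) -> float:
    """Raise difficulty when the agent scores above 3/5, lower it when below."""
    return min(1.0, max(0.0, difficulty + step * (score - 3)))

def calculate_weight(score: float) -> float:
    """Weigh decisive results (very high or very low scores) more heavily."""
    return abs(score - 3) + 1

def next_difficulty(history: list[tuple[float, float]]) -> float:
    """Weighted average of per-test adjustments over (difficulty, score) pairs."""
    weights = [calculate_weight(s) for _, s in history]
    adjusted = [adjust_difficulty(d, s) for d, s in history]
    return sum(a * w for a, w in zip(adjusted, weights)) / sum(weights)

print(next_difficulty([(0.5, 4), (0.6, 5), (0.7, 2)]))  # ~0.69
```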
ATA looks for `config/ata_config.yaml` (or JSON) for configuration. You can override the path with the `ATA_CONFIG_PATH` environment variable.
```yaml
# Logging (DIRECTORY for session logs)
logger_dir: "data/conversations"

# Test Parameters
default_max_turns: 20
papers_per_research_round: 3
research_rounds: 2
num_weaknesses_to_identify: 5
num_scenarios_per_weakness: 3

# Workflow Options
skip_parameter_gathering: false
skip_web_search: false
ablate_scanning: false
ablate_difficulty: false
```

Additional configuration files in `config/`:
- `config.json`: Web server settings (host, port, debug)
- `agent_evaluation_criteria.json`: Criteria used by the judge phase
- `web_config.json`: Web UI-specific toggles
- Test Execution: Control test duration and complexity
- Research Settings: Configure research depth
- Ablation Modes: Skip phases for experimentation
- Logging: Set log levels and destinations
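For reference, a minimal sketch of loading such a config with the `ATA_CONFIG_PATH` override (an assumption about how a loader could look, not the actual code in `src/ata/`):

```python
import os
import yaml  # PyYAML

def load_ata_config(default_path: str = "config/ata_config.yaml") -> dict:
    """Read the YAML config, honoring the ATA_CONFIG_PATH environment override."""
    path = os.environ.get("ATA_CONFIG_PATH", default_path)
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}

config = load_ata_config()
max_turns = config.get("default_max_turns", 20)
```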
Key data models and functions:

```python
class AgentTesterState:
    agent_selection: Optional[AgentSelection]
    testing_parameters: Optional[TestingParameters]
    code_analysis: Optional[CodeAnalysisResult]
    test_scenarios: Dict[str, List[TestScenario]]
    evaluations: Optional[TestEvaluations]
    final_report: Optional[FinalReport]

class TestScenario(BaseModel):
    scenario_id: str
    description: str
    persona: TestPersona
    max_turns: int = 20
    expected_outcome: str
    difficulty: Optional[float]
```

```python
async def execute_single_test(
    state: AgentTesterState,
    scenario: TestScenario
) -> Dict[str, Any]

async def verify_agent_availability(
    agent_key: str
) -> Tuple[bool, str]
```

To add a new agent for testing:

- Update `AVAILABLE_AGENTS` in `src/ata/models.py` (see the hypothetical sketch below)
- Add evaluation criteria to `agent_evaluation_criteria.json`
- Ensure the agent follows the expected interface
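The exact shape of `AVAILABLE_AGENTS` is not shown here; purely as a hypothetical sketch of what registering a new agent might look like (the field names and the `support` agent are assumptions, so check `src/ata/models.py` for the real structure):

```python
# src/ata/models.py (hypothetical excerpt -- verify the real structure first)
AVAILABLE_AGENTS = {
    "wikipedia": {
        "name": "Wikipedia Writer Agent",
        "module": "example_agents.wikipedia.wikipedia_writer",
    },
    "travel": {
        "name": "Travel Agent",
        "module": "example_agents.travel.travel_agent",
    },
    # New agent under test (illustrative entry):
    "support": {
        "name": "Customer Support Agent",
        "module": "example_agents.support.support_agent",
    },
}
```

A matching entry keyed the same way would then be added to `config/agent_evaluation_criteria.json` so the judge phase knows how to score the new agent.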
This project is licensed under the MIT License - see LICENSE file.
Making AI agents more reliable, one test at a time