This project provides a framework for evaluating the performance of various AI agents, including AutoGen, MetaGPT, and TaskWeaver. It is designed to test agents against a variety of tasks.
The primary goal of this framework is to offer a standardized environment for benchmarking different AI agent implementations. By providing common evaluation scripts and datasets, it allows for fair and reproducible comparisons of agent capabilities.
The project is organized into the following directories:
- agents/: Contains the core logic and evaluation scripts for each supported agent.
  - AutoGen/: Evaluation script for AutoGen.
  - MetaGPT/: Evaluation script for MetaGPT.
  - TaskWeaver/: Evaluation script for TaskWeaver.
  - utils.py: Shared utility functions used by the evaluation scripts.
  - eval_config_template.json: A template for configuring the evaluation parameters for each agent.
- data/: Contains the datasets and attachments used for the evaluation tasks.
  - data.jsonl: The primary data file containing evaluation tasks in JSONL format.
  - attachment/: A directory containing various files (CSV, TXT) that can be used as attachments for the tasks.
- cases/: Contains run logs of representative data examples.
The framework currently supports the evaluation of the following agents:
- AutoGen: A framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks.
- MetaGPT: A meta-programming framework for multi-agent collaboration.
- TaskWeaver: A code-first agent framework for planning and executing data analytics tasks.
The evaluation tasks are defined in data/data.jsonl. Associated data files and attachments required by the tasks are located in the data/attachment/ directory.
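As a minimal sketch of how the tasks can be consumed, the snippet below reads data/data.jsonl line by line with the standard json module. The field names used in the print statement (task_id, instruction) are illustrative assumptions, not the file's actual schema; check data/data.jsonl for the real keys.

```python
import json
from pathlib import Path

DATA_FILE = Path("data/data.jsonl")

def load_tasks(path: Path = DATA_FILE):
    """Yield one task dict per non-empty line of the JSONL file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

if __name__ == "__main__":
    for task in load_tasks():
        # "task_id" and "instruction" are placeholder keys for illustration only.
        print(task.get("task_id"), task.get("instruction"))
```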
Ensure you have the necessary dependencies installed for each agent you intend to evaluate. Refer to the official documentation for each agent for installation instructions.
- Create a configuration file for your evaluation. You can copy the template provided:

  ```bash
  cp agents/eval_config_template.json agents/my_eval_config.json
  ```
- Edit agents/my_eval_config.json to specify the parameters for the agent you want to evaluate. This includes model endpoints, API keys, and other agent-specific settings.
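The exact keys depend on the agent being evaluated. As a rough sketch of how a script might read this file, assuming placeholder keys such as model, api_key, and base_url (the template defines the real names):

```python
import json

# Path assumes you copied the template as described above.
CONFIG_PATH = "agents/my_eval_config.json"

with open(CONFIG_PATH, encoding="utf-8") as f:
    config = json.load(f)

# Placeholder keys for illustration only; see eval_config_template.json for the actual fields.
model = config.get("model")
api_key = config.get("api_key")
base_url = config.get("base_url")
print(f"Evaluating with model={model!r} at {base_url!r}")
```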
To run the evaluation for a specific agent, execute its corresponding evaluation script. For example, to evaluate AutoGen:
```bash
python agents/AutoGen/eval.py --config agents/my_eval_config.json
```

Similarly, for the other agents:
```bash
# For MetaGPT
python agents/MetaGPT/eval.py --config agents/my_eval_config.json

# For TaskWeaver
python agents/TaskWeaver/eval.py --config agents/my_eval_config.json
```

The evaluation scripts will process the tasks from data/data.jsonl, run the specified agent, and log the results.
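The per-agent eval.py scripts differ in how they invoke each framework; the outline below is a simplified, hypothetical sketch of that shared flow. run_agent and log_result are stand-ins, not functions from this repository.

```python
import json

def run_agent(task: dict, config: dict) -> dict:
    """Stand-in for the agent-specific call (AutoGen, MetaGPT, or TaskWeaver)."""
    raise NotImplementedError("Replace with the chosen agent framework's API.")

def log_result(task: dict, result: dict, log_path: str = "results.jsonl") -> None:
    """Append one JSON record per completed task."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"task": task, "result": result}) + "\n")

def main(config_path: str = "agents/my_eval_config.json",
         data_path: str = "data/data.jsonl") -> None:
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    with open(data_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            task = json.loads(line)
            result = run_agent(task, config)
            log_result(task, result)

if __name__ == "__main__":
    main()
```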
```bibtex
@misc{lu2025exploringautonomousagentscloser,
  title={Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks},
  author={Ruofan Lu and Yichen Li and Yintong Huo},
  year={2025},
  url={https://arxiv.org/abs/2508.13143},
}
```