This project provides a framework for evaluating the performance of various AI agents, including AutoGen, MetaGPT, and TaskWeaver. It is designed to test agents against a variety of tasks.
The primary goal of this framework is to offer a standardized environment for benchmarking different AI agent implementations. By providing common evaluation scripts and datasets, it allows for fair and reproducible comparisons of agent capabilities.
The project is organized into the following directories:
- agents/: Contains the core logic and evaluation scripts for each supported agent.
  - AutoGen/: Evaluation script for AutoGen.
  - MetaGPT/: Evaluation script for MetaGPT.
  - TaskWeaver/: Evaluation script for TaskWeaver.
  - utils.py: Shared utility functions used by the evaluation scripts.
  - eval_config_template.json: A template for configuring the evaluation parameters for each agent.
- data/: Contains the datasets and attachments used for the evaluation tasks.
  - data.jsonl: The primary data file containing evaluation tasks in JSONL format.
  - attachment/: A directory containing various files (CSV, TXT) that can be used as attachments for the tasks.
- cases/: Contains run logs of representative data examples.
The framework currently supports the evaluation of the following agents:
- AutoGen: A framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks.
- MetaGPT: A meta-programming framework for multi-agent collaboration.
- TaskWeaver: A code-first agent framework for planning and executing data analytics tasks.
The evaluation tasks are defined in data/data.jsonl. Associated data files and attachments required by the tasks are located in the data/attachment/ directory.
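As a minimal sketch of how the tasks can be consumed, the snippet below reads data/data.jsonl line by line with the standard json module. The field names used in the print statement (task_id, instruction) are illustrative assumptions, not the file's actual schema; check data/data.jsonl for the real keys.

```python
import json
from pathlib import Path

DATA_FILE = Path("data/data.jsonl")

def load_tasks(path: Path = DATA_FILE):
    """Yield one task dict per non-empty line of the JSONL file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

if __name__ == "__main__":
    for task in load_tasks():
        # "task_id" and "instruction" are placeholder keys for illustration only.
        print(task.get("task_id"), task.get("instruction"))
```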
Ensure you have the necessary dependencies installed for each agent you intend to evaluate. Refer to the official documentation for each agent for installation instructions.
- Create a configuration file for your evaluation. You can copy the template provided:

  ```bash
  cp agents/eval_config_template.json agents/my_eval_config.json
  ```
- Edit agents/my_eval_config.json to specify the parameters for the agent you want to evaluate. This includes model endpoints, API keys, and other agent-specific settings.
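The exact keys depend on the agent being evaluated. As a rough sketch of how a script might read this file, assuming placeholder keys such as model, api_key, and base_url (the template defines the real names):

```python
import json

# Path assumes you copied the template as described above.
CONFIG_PATH = "agents/my_eval_config.json"

with open(CONFIG_PATH, encoding="utf-8") as f:
    config = json.load(f)

# Placeholder keys for illustration only; see eval_config_template.json for the actual fields.
model = config.get("model")
api_key = config.get("api_key")
base_url = config.get("base_url")
print(f"Evaluating with model={model!r} at {base_url!r}")
```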
To run the evaluation for a specific agent, execute its corresponding evaluation script. For example, to evaluate AutoGen:
```bash
python agents/AutoGen/eval.py --config agents/my_eval_config.json
```

Similarly, for the other agents:
```bash
# For MetaGPT
python agents/MetaGPT/eval.py --config agents/my_eval_config.json

# For TaskWeaver
python agents/TaskWeaver/eval.py --config agents/my_eval_config.json
```

The evaluation scripts will process the tasks from data/data.jsonl, run the specified agent, and log the results.
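The per-agent eval.py scripts differ in how they invoke each framework; the outline below is a simplified, hypothetical sketch of that shared flow. run_agent and log_result are stand-ins, not functions from this repository.

```python
import json

def run_agent(task: dict, config: dict) -> dict:
    """Stand-in for the agent-specific call (AutoGen, MetaGPT, or TaskWeaver)."""
    raise NotImplementedError("Replace with the chosen agent framework's API.")

def log_result(task: dict, result: dict, log_path: str = "results.jsonl") -> None:
    """Append one JSON record per completed task."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"task": task, "result": result}) + "\n")

def main(config_path: str = "agents/my_eval_config.json",
         data_path: str = "data/data.jsonl") -> None:
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    with open(data_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            task = json.loads(line)
            result = run_agent(task, config)
            log_result(task, result)

if __name__ == "__main__":
    main()
```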
```bibtex
@misc{lu2025exploringautonomousagentscloser,
  title={Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks},
  author={Ruofan Lu and Yichen Li and Yintong Huo},
  year={2025},
  url={https://arxiv.org/abs/2508.13143},
}
```