Agent Evaluation Framework

This project provides a framework for evaluating the performance of various AI agents, including AutoGen, MetaGPT, and TaskWeaver. It is designed to test agents against a variety of tasks.

Project Overview

The primary goal of this framework is to offer a standardized environment for benchmarking different AI agent implementations. By providing common evaluation scripts and datasets, it allows for fair and reproducible comparisons of agent capabilities.

Directory Structure

The project is organized into the following directories:

  • agents/: Contains the core logic and evaluation scripts for each supported agent.
    • AutoGen/: Evaluation script for AutoGen.
    • MetaGPT/: Evaluation script for MetaGPT.
    • TaskWeaver/: Evaluation script for TaskWeaver.
    • utils.py: Shared utility functions used by the evaluation scripts.
    • eval_config_template.json: A template for configuring the evaluation parameters for each agent.
  • data/: Contains the datasets and attachments used for the evaluation tasks.
    • data.jsonl: The primary data file containing evaluation tasks in JSONL format (an illustrative entry is sketched after this list).
    • attachment/: A directory containing various files (CSV, TXT) that can be used as attachments for the tasks.
  • cases/: Contains run logs for representative data examples.
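
For illustration, a single task entry in data/data.jsonl might look roughly like the line below. The field names shown here (task_id, instruction, attachment) are assumptions made for this sketch, not the authoritative schema; consult data/data.jsonl itself for the exact fields.

{"task_id": "001", "instruction": "Compute the average of the 'price' column in the attached file.", "attachment": "attachment/prices.csv"}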

Supported Agents

The framework currently supports the evaluation of the following agents:

  • AutoGen: A framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks.
  • MetaGPT: A meta-programming framework for multi-agent collaboration.
  • TaskWeaver: A code-first agent framework for planning and executing data analytics tasks.

Datasets

The evaluation tasks are defined in data/data.jsonl. Associated data files and attachments required by the tasks are located in the data/attachment/ directory.

Getting Started

Prerequisites

Ensure you have the necessary dependencies installed for each agent you intend to evaluate. Refer to the official documentation for each agent for installation instructions.

Configuration

  1. Create a configuration file for your evaluation. You can copy the template provided:
    cp agents/eval_config_template.json agents/my_eval_config.json
  2. Edit agents/my_eval_config.json to specify the parameters for the agent you want to evaluate. This includes model endpoints, API keys, and other agent-specific settings.
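
As a rough illustration, a filled-in configuration might look like the sketch below. The actual keys are defined by agents/eval_config_template.json; the field names used here (model, base_url, api_key, data_path, output_dir) are placeholders for this example rather than the template's real schema.

{
  "model": "gpt-4o",
  "base_url": "https://api.openai.com/v1",
  "api_key": "YOUR_API_KEY",
  "data_path": "data/data.jsonl",
  "output_dir": "results/"
}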

Running Evaluations

To run the evaluation for a specific agent, execute its corresponding evaluation script. For example, to evaluate AutoGen:

python agents/AutoGen/eval.py --config agents/my_eval_config.json

Similarly, for other agents:

# For MetaGPT
python agents/MetaGPT/eval.py --config agents/my_eval_config.json

# For TaskWeaver
python agents/TaskWeaver/eval.py --config agents/my_eval_config.json

The evaluation scripts will process the tasks from data/data.jsonl, run the specified agent, and log the results.
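
Conceptually, each evaluation script follows the same loop: load the configuration, read the tasks from data/data.jsonl line by line, hand each task to the agent, and record the output. The minimal sketch below illustrates that flow only; run_agent and run_evaluation are hypothetical names used for this example, not functions exported by the scripts in agents/.

import json

def run_agent(task, config):
    # Placeholder for the agent-specific call: the real scripts in agents/
    # dispatch to AutoGen, MetaGPT, or TaskWeaver using the settings in config.
    raise NotImplementedError

def run_evaluation(config_path, data_path="data/data.jsonl"):
    # Load the evaluation configuration (model endpoint, API key, etc.).
    with open(config_path) as f:
        config = json.load(f)

    results = []
    # Each line of data.jsonl is one evaluation task.
    with open(data_path) as f:
        for line in f:
            task = json.loads(line)
            output = run_agent(task, config)
            results.append({"task": task, "output": output})
    return results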

Citation

If you use this framework, please cite the associated ASE'25 NIER paper:

@misc{lu2025exploringautonomousagentscloser,
      title={Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks}, 
      author={Ruofan Lu and Yichen Li and Yintong Huo},
      year={2025},
      url={https://arxiv.org/abs/2508.13143}, 
}
