
🎨 ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks

English | 简体中文

👋 Welcome to the ColorBench repository — a graph-structured benchmark designed to evaluate mobile GUI agents on complex, long-horizon tasks composed of multiple atomic operations. This project provides:

  • A graph-based benchmark construction methodology to expand or reconstruct environments.
  • A plug-and-play evaluation framework for safe, reproducible testing.



📢 News


🧭 Overview


📦 175 Complex Long-Horizon Tasks

  • 🌐 Covering 21 major apps – WeChat, Meituan, JD, Xiaohongshu, etc.
  • 🔄 101 cross-app and 74 single-app tasks
  • 🧭 Average optimal path length >13 steps

🎨 Graph-Based Design & Multi-Path Evaluation

  • 🔀 Multiple correct and error paths supported
  • 🔁 Enables reflection, replanning, and backtracking behaviors

📊 Comprehensive Evaluation Metrics

  • ✅ 3 Core Indicators: Success Rate (SR), Completion Rate (CR), Atomic Capability (AC)
  • 🧩 15 Atomic Capabilities – e.g., Search, Filter, Save, Share, Memory
  • 🎯 Fine-grained diagnostics for weak atomic capabilities

🤖 Plug-and-Play Evaluation Framework

  • 📱 Static but interactive graph environment
  • 📐 Safe and repeatable testing without real devices or accounts
  • 🧰 Fully automated evaluation – no human verification required



📂 Repository Structure

ColorBench/
├── config/
│   ├── default.yaml                # Config for evaluating agents
│   └── customized_config...
├── data/
│   ├── graph.json                  # Graph structure
│   ├── task.json                   # Task details
│   ├── graph_image/                # Screenshots
│   │   ├── Screenshot0.png
│   │   ├── Screenshot1.jpg
│   └── ...
├── HammerEnv/                      # BFS-based trajectory collection
├── src/
│   ├── agent/                      # Evaluation agents
│   ├── graph_construction/         # Graph construction utilities
│   ├── test/                       # Evaluation scripts
│   └── utils.py
├── construct_graph.py
├── run_colorbench_multi_agent.py
├── run_colorbench.py
└── README.md

🚀 Quick Start

🛠️ Installation

git clone https://github.com/MadeAgents/ColorBench
cd ColorBench
pip install -r requirements.txt

🧪 Evaluation

python3 run_colorbench.py --config config/default.yaml --model your_model_name

Alternatively, use the provided script:

bash run_colorbench.sh

Customize Your Agent

Define your agent by inheriting from AgentBase in src/agent/agent_base.py and implementing the agent_step function, which is responsible for executing actions and logging. Then register your agent in run_colorbench.py and create a new config file under ./config/.
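
As a minimal sketch, a custom agent might look like the following; the exact agent_step signature and the call_model helper are illustrative, so check agent_base.py for the actual interface:

from src.agent.agent_base import AgentBase

class MyAgent(AgentBase):
    """Hypothetical custom agent; adapt to the actual AgentBase interface."""

    def agent_step(self, task, observation):
        # Query your model with the task and the current screen observation,
        # then execute and log the chosen action (signature is illustrative).
        action = self.call_model(task, observation)  # hypothetical helper
        return action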

Evaluation results are saved under ./checkpoints/.

🧩 Graph-Structured Benchmark Construction

Breadth-First Search (BFS) Application Exploration

We use HammerEnv, our self-developed Android device interaction environment, for breadth-first application exploration. It supports dynamic exploration and automated operation of mobile applications.

Installation Steps

  1. Download and install the android_env and android_world open-source projects:

     https://github.com/google-deepmind/android_env
     https://github.com/google-research/android_world

     Note: when installing via pip, use editable mode: pip install -e .

  2. Configure the ADB connection (see https://developer.android.com/tools).

  3. Set the environment variables:

     export OPENAI_API_KEY="EMPTY"
     export OPENAI_BASE_URL="http://xxx.xxx.xxx.xxx/v1"

  4. Start the interaction environment server:

     python HammerEnv/src/server/gradio_web_server_physical_device.py

  5. Run the BFS application explorer:

     python HammerEnv/examples/bfs_app_explorer_fixed.py
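
Conceptually, the explorer performs a breadth-first traversal over screens: screens are nodes, UI actions are edges, and the frontier expands level by level. The sketch below is illustrative only; every env method in it is hypothetical, not the HammerEnv API:

from collections import deque

def bfs_explore(env, max_depth=3, max_trajectories=50):
    root = env.current_screen()                   # hypothetical API
    queue = deque([(root, 0)])
    visited = {env.fingerprint(root)}             # hypothetical screen hash
    edges = []
    while queue and len(edges) < max_trajectories:
        screen, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for action in env.enumerate_actions(screen):  # hypothetical
            nxt = env.execute(screen, action)         # hypothetical
            fp = env.fingerprint(nxt)
            if fp not in visited:                 # expand unseen screens only
                visited.add(fp)
                edges.append((screen, action, nxt))
                queue.append((nxt, depth + 1))
    return edges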

Configuration

Exploration Configuration Parameters

  Parameter                    Description                                   Default
  max_depth                    Maximum exploration depth                     3
  max_trajectories             Maximum number of trajectories to generate   50
  app_name                     Target application name                      "小红书"
  output_dir                   Trajectory output directory                   "trajectories"
  delay_between_actions        Delay between actions (seconds)              2.0
  model_name                   Model used for analysis                      "Qwen2.5-VL-72B-Instruct"
  reset_environment_per_task   Reset the environment after each task        True
  reset_delay                  Environment reset delay (seconds)            1.0

Command Line Parameters

python HammerEnv/examples/bfs_app_explorer_fixed.py \
    --server-name "http://localhost:7880/" \
    --model-name "xxx" \
    --app-name "小红书" \
    --max-depth 3 \
    --max-trajectories 20 \
    --output-dir "trajectories" \
    --delay 2.0

Depth-First Search (DFS) Application Exploration

To capture long-horizon user tasks, we manually record sequences of mobile operation screenshots in a depth-first manner, then generate structured trajectory data through model-based analysis.

Workflow

  1. Screenshot Collection: manually capture application operation screenshots in order
  2. Trajectory Analysis: use a large model to analyze adjacent screenshot pairs (a sketch follows this list)
  3. Action Recognition: extract precise click coordinates, input text, and other operations
  4. Trajectory Generation: build trajectory files from the analyzed actions
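
A minimal sketch of steps 2 and 3, assuming an OpenAI-compatible endpoint (as configured in the BFS section) serving Qwen2.5-VL; the function and prompt are illustrative, and the actual logic lives in pic2trajectory.py:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def infer_action(before_path, after_path):
    # Ask the VLM which single UI action turns `before` into `after`.
    resp = client.chat.completions.create(
        model="Qwen2.5-VL-72B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Identify the single UI action (click, scroll, or text "
                    "input) that transforms the first screenshot into the "
                    "second, with coordinates or the typed text.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64(before_path)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64(after_path)}"}},
            ],
        }],
    )
    return resp.choices[0].message.content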
Usage

# Run depth-first trajectory generation
python src/graph_construction/pic2trajectory.py

Input Requirements

  • Directory structure: dfs/pic/trajectory1/
  • Required files: query.txt (task description) plus Screenshot_step_*_raw.{png|jpg}
  • Naming convention: screenshot files are numbered in operation order (trajectory1 denotes the first trajectory)

Output Results

  • Trajectory file: dfs/trajectory/trajectory1/trajectory_v0.txt
  • Adjacency matrix: dfs/trajectory/trajectory1/{query}.csv

Output Structure

The system generates well-organized trajectory data with the following structure:

trajectories/
├── 小红书/
│   ├── 小红书.json
│   ├── Screenshot_2025-01-10-20-10-21_0.jpg
│   ├── Screenshot_2025-01-10-20-10-21_1.jpg
│   └── Screenshot_2025-01-10-20-10-21_2.jpg
└── 搜索/
    ├── 搜索.json
    ├── Screenshot_2025-01-10-20-15-30_0.jpg
    └── Screenshot_2025-01-10-20-15-30_1.jpg

Graph Construction

To merge multiple trajectory files into a unified task graph, run:

python construct_graph.py --input_folder <trajectories> --output_file <path/to/graph.json>

During merging, we use the following default models:

  • models--BAAI--bge-large-zh-v1.5 for text feature embedding
  • Qwen2.5-VL-72B for visual-language understanding
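
For intuition, merging screens into one node by description similarity might look like the sketch below; the 0.9 threshold is a made-up example, and BAAI/bge-large-zh-v1.5 is the embedding model listed above:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("BAAI/bge-large-zh-v1.5")

def same_node(desc_a, desc_b, threshold=0.9):
    # Two screens are merged when their textual descriptions are close
    # in embedding space; the threshold here is illustrative.
    emb = encoder.encode([desc_a, desc_b], normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1])) >= threshold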

You can modify these in ./src/graph_construction/graph.py according to your setup. The generated graph.json records all node and edge information in the following format:

{
  "node_id": 0,
  "screenlists": [
    {
      "screenshot_path": "",
      "node_description": ""
    }
  ],
  "ui_element_edge_list": [
    {
      "source_node": 0,
      "target_node": 1,
      "action_type": "",
      "action_parameter": {}
    }
  ]
}
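
A quick sanity check after merging, assuming the top level of graph.json is a list of node records in the schema above:

import json

with open("data/graph.json") as f:
    nodes = json.load(f)  # assumption: a list of node records

for node in nodes:
    print(f"node {node['node_id']}: "
          f"{len(node['screenlists'])} screenshot(s), "
          f"{len(node['ui_element_edge_list'])} outgoing edge(s)")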

Frontend Inspection Tool

After graph merging, you can manually inspect and adjust graph data using the frontend visualization tool. Convert the merged graph.json into a CSV file:

  • In ./src/graph_construction/parse_json_to_cvs.py, set json_file (path to graph JSON) and save_file (output CSV path).
  • In ./src/graph_construction/matrix_analyzer.py, set BASE_RECORD_PATH to your image directory.

Run the following commands:

python src/graph_construction/parse_json_to_cvs.py
python src/graph_construction/matrix_analyzer.py

After manual corrections, convert the updated CSV file back into the JSON format for evaluation.

python src/graph_construction/matrix_to_json.py

Bounding Box Annotation

This step automatically generates bounding boxes for interface elements.

  • In src/graph_construction/image_jump_parser.py, set the input paths in the main function: the path to the graph dataset JSON file and the path to the corresponding image folder.
  • Set your model service API key.

Run the following command:

python src/graph_construction/image_jump_parser.py
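
To eyeball the generated annotations, here is a hypothetical helper that draws boxes onto a screenshot; the (x1, y1, x2, y2) box format is an assumption:

from PIL import Image, ImageDraw

def draw_boxes(image_path, boxes, out_path):
    # Draw each bounding box in red on top of the screenshot.
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for box in boxes:  # box: (x1, y1, x2, y2), assumed format
        draw.rectangle(box, outline="red", width=3)
    img.save(out_path)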

🤝 Contributing & Citation

Contributions via Issues or Pull Requests are welcome! If you use this project, please consider citing our paper:

ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
arXiv:2510.14621

📚 Dataset available at: HuggingFace Dataset (Placeholder)
