[WWW'26 Oral] ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks

English | 简体中文

👋 Welcome to the ColorBench repository — a graph-structured benchmark designed to evaluate mobile GUI agents on complex, long-horizon tasks composed of multiple atomic operations. This project provides:

A graph-based benchmark construction methodology to expand or reconstruct environments.
A plug-and-play evaluation framework for safe, reproducible testing.

📢 News

[2 Dec '25] Released the core code and dataset (including evaluation environment and benchmark graphs).
[16 Oct '25] Our paper ColorBench: Benchmarking Mobile Agents with Graph Structured Framework for Complex Long-Horizon Task is now available on arXiv!

🧭 Overview

📦 175 Complex Long-Horizon Tasks

🌐 Covering 21 major apps – WeChat, Meituan, JD, Xiaohongshu, etc.
🔄 101 cross-app and 74 single-app tasks
🧭 Average optimal path length >13 steps

🎨 Graph-Based Design & Multi-Path Evaluation

🔀 Multiple correct and error paths supported
🔁 Enables reflection, replanning, and backtracking behaviors

📊 Comprehensive Evaluation Metrics

✅ 3 Core Indicators: Success Rate (SR), Completion Rate (CR), Atomic Capability (AC)
🧩 15 Atomic Capabilities – e.g., Search, Filter, Save, Share, Memory
🎯 Fine-grained diagnostics for weak atomic capabilities

🤖 Plug-and-Play Evaluation Framework

📱 Static but interactive graph environment
📐 Safe and repeatable testing without real devices or accounts
🧰 Fully automated evaluation – no human verification required

📂 Repository Structure

ColorBench/
├── config/
│   ├── default.yaml            # Config for evaluating agents
│   └── customized_config...
├── data/
│   ├── graph.json              # Graph structure
│   ├── task.json               # Task details
│   ├── graph_image/            # Screenshots
│   │   ├── Screenshot0.png
│   │   ├── Screenshot1.jpg
│   └── ...
├── HammerEnv/                  # BFS-based trajectory collection
├── src/
│   ├── agent/                  # Evaluation agents
│   ├── graph_construction/     # Graph construction utilities
│   ├── test/                   # Evaluation scripts
│   └── utils.py
├── construct_graph.py
├── run_colorbench_multi_agent.py
├── run_colorbench.py
└── README.md

🚀 Quick Start

🛠️ Installation

git clone https://github.com/MadeAgents/ColorBench
cd ColorBench
pip install -r requirements.txt

🧪 Evaluation

python3 run_colorbench.py --config config/default.yaml --model your_model_name

Alternatively, use the provided script:

bash run_colorbench.sh

Customize Your Agent

To evaluate your own agent, define it in src/agent/agent_base.py by inheriting from AgentBase and implementing the agent_step function, which is responsible for executing actions and logging. Then register the agent in run_colorbench.py and create a new config file under ./config/.

Evaluation results are saved under ./checkpoints/.
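
The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not the repository's actual interface: the AgentBase defined here is a stand-in, and the agent_step signature, the observation keys, and the "back" action type are all assumptions. Check src/agent/agent_base.py for the real definitions before adapting it.

```python
from abc import ABC, abstractmethod
from typing import Any


class AgentBase(ABC):
    """Stand-in for the repository's AgentBase (assumed shape; the real class may differ)."""

    @abstractmethod
    def agent_step(self, observation: dict[str, Any]) -> dict[str, Any]:
        """Choose one action for the current screen and log it."""


class AlwaysBackAgent(AgentBase):
    """Toy agent that always presses back, recording every step it takes."""

    def __init__(self) -> None:
        self.history: list[dict[str, Any]] = []

    def agent_step(self, observation: dict[str, Any]) -> dict[str, Any]:
        # "back" and the empty parameter dict are hypothetical values.
        action = {"action_type": "back", "action_parameter": {}}
        # Logging: keep the task text alongside the chosen action.
        self.history.append({"task": observation.get("task"), "action": action})
        return action


agent = AlwaysBackAgent()
first_action = agent.agent_step(
    {"task": "open settings", "screenshot_path": "Screenshot0.png"}
)
```

A real agent would replace the fixed action with a model call that reads the screenshot and task description.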

🧩 Graph-Structured Benchmark Construction

Breadth-First Search (BFS) Application Exploration

We use HammerEnv, our self-developed Android device interaction environment, for breadth-first application exploration. It supports dynamic exploration and automated operation of mobile applications.

Installation Steps

Download and install the open-source android_env and android_world projects:

https://github.com/google-deepmind/android_env
https://github.com/google-research/android_world

Note: When installing with pip, use editable mode: pip install -e .

Configure ADB connection: Refer to https://developer.android.com/tools
Set environment variables:

export OPENAI_API_KEY="EMPTY"
export OPENAI_BASE_URL="http://xxx.xxx.xxx.xxx/v1"

Start interaction environment server:

python HammerEnv/src/server/gradio_web_server_physical_device.py

Run BFS application explorer:

python HammerEnv/examples/bfs_app_explorer_fixed.py

Configuration

Exploration Configuration Parameters

Parameter	Description	Default Value
`max_depth`	Maximum exploration depth	3
`max_trajectories`	Maximum number of trajectories to generate	50
`app_name`	Target application name	"小红书"
`output_dir`	Trajectory output directory	"trajectories"
`delay_between_actions`	Delay between actions (seconds)	2.0
`model_name`	AI model name for analysis	"Qwen2.5-VL-72B-Instruct"
`reset_environment_per_task`	Reset environment after each task	True
`reset_delay`	Environment reset delay (seconds)	1.0
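
To illustrate how max_depth and max_trajectories bound a breadth-first exploration, here is a minimal sketch over a toy screen graph. It is not HammerEnv's implementation: the function bfs_trajectories, the expand callback, and the TOY_APP data are hypothetical names introduced only for this example.

```python
from collections import deque
from typing import Callable, Hashable


def bfs_trajectories(
    start: Hashable,
    expand: Callable[[Hashable], list[tuple[str, Hashable]]],
    max_depth: int = 3,
    max_trajectories: int = 50,
) -> list[list[str]]:
    """Return action sequences discovered breadth-first, shallowest first."""
    trajectories: list[list[str]] = []
    queue = deque([(start, [])])  # (screen, actions taken to reach it)
    seen = {start}
    while queue and len(trajectories) < max_trajectories:
        screen, actions = queue.popleft()
        if actions:
            trajectories.append(actions)
        if len(actions) >= max_depth:
            continue  # do not expand screens at the depth limit
        for action, next_screen in expand(screen):
            if next_screen not in seen:
                seen.add(next_screen)
                queue.append((next_screen, actions + [action]))
    return trajectories


# Toy screen graph standing in for a real app (hypothetical data).
TOY_APP = {
    "home": [("tap_search", "search"), ("tap_profile", "profile")],
    "search": [("type_query", "results")],
    "profile": [],
    "results": [],
}

paths = bfs_trajectories("home", lambda screen: TOY_APP[screen], max_depth=2)
```

Because the queue is first-in, first-out, every one-step trajectory is recorded before any two-step trajectory, which is what distinguishes this from the depth-first collection described later.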

Command Line Parameters

python HammerEnv/examples/bfs_app_explorer_fixed.py \
 --server-name "http://localhost:7880/" \
 --model-name "xxx" \
 --app-name "小红书" \
 --max-depth 3 \
 --max-trajectories 20 \
 --output-dir "trajectories" \
 --delay 2.0
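
For readers wiring these flags into their own tooling, the following sketch shows one way the command-line options could map onto the configuration parameters above. It is an assumption-laden illustration rather than the script's real parser: build_parser is a hypothetical helper, the defaults are copied from the configuration table, and the --server-name default is taken from the example command rather than from any documented default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the flags shown in the command above."""
    parser = argparse.ArgumentParser(description="BFS app exploration (illustrative)")
    # Assumed default: taken from the example command, not from documentation.
    parser.add_argument("--server-name", default="http://localhost:7880/")
    # The remaining defaults follow the configuration table.
    parser.add_argument("--model-name", default="Qwen2.5-VL-72B-Instruct")
    parser.add_argument("--app-name", default="小红书")
    parser.add_argument("--max-depth", type=int, default=3)
    parser.add_argument("--max-trajectories", type=int, default=50)
    parser.add_argument("--output-dir", default="trajectories")
    parser.add_argument("--delay", type=float, default=2.0)
    return parser


# Override only two options; the rest fall back to the table defaults.
args = build_parser().parse_args(["--max-trajectories", "20", "--delay", "2.0"])
```

argparse converts dashes to underscores, so --max-trajectories is read back as args.max_trajectories.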

Depth-First Search (DFS) Application Exploration

To capture long-horizon user tasks, we manually record sequences of mobile operation screenshots in a depth-first manner, then use AI model analysis to generate structured trajectory data.

Workflow

Screenshot Collection: Manually capture application operation screenshots in order
Trajectory Analysis: Use large models to analyze adjacent screenshot pairs
Action Recognition: Extract precise click coordinates, input text, and other operations
Trajectory Generation: Build trajectory files from the recognized actions

Usage

# Run depth-first trajectory generation
python src/graph_construction/pic2trajectory.py

Input Requirements

Directory Structure: dfs/pic/trajectory1/
Required Files: query.txt (task description) + Screenshot_step_*_raw.{png|jpg}
Naming Convention: Screenshot files numbered in operation order (trajectory1 represents the first trajectory)
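
The input layout above can be read with a short helper like the one below. This is a sketch under stated assumptions, not the logic of pic2trajectory.py: it assumes the * in Screenshot_step_*_raw is an integer step index, and the names STEP_PATTERN, ordered_screenshots, adjacent_pairs, and load_trajectory_dir are hypothetical.

```python
import re
from pathlib import Path

# Assumes the wildcard in Screenshot_step_*_raw.{png|jpg} is an integer step index.
STEP_PATTERN = re.compile(r"^Screenshot_step_(\d+)_raw\.(?:png|jpg)$")


def ordered_screenshots(names: list[str]) -> list[str]:
    """Keep files that match the naming convention, sorted by numeric step."""
    matched = []
    for name in names:
        match = STEP_PATTERN.match(name)
        if match:
            matched.append((int(match.group(1)), name))
    return [name for _, name in sorted(matched)]


def adjacent_pairs(names: list[str]) -> list[tuple[str, str]]:
    """Pair each screenshot with the one that follows it."""
    ordered = ordered_screenshots(names)
    return list(zip(ordered, ordered[1:]))


def load_trajectory_dir(directory: Path) -> tuple[str, list[tuple[str, str]]]:
    """Read query.txt and the screenshot pairs from one trajectory folder."""
    query = (directory / "query.txt").read_text(encoding="utf-8").strip()
    names = [path.name for path in directory.iterdir() if path.is_file()]
    return query, adjacent_pairs(names)


EXAMPLE_NAMES = [
    "Screenshot_step_10_raw.png",
    "Screenshot_step_2_raw.jpg",
    "query.txt",
    "Screenshot_step_1_raw.png",
]
pairs = adjacent_pairs(EXAMPLE_NAMES)
```

Sorting on the parsed integer rather than the filename keeps step 10 after step 2, which a plain string sort would get wrong.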

Output Results

Trajectory File: dfs/trajectory/trajectory1/trajectory_v0.txt
Adjacency Matrix: dfs/trajectory/trajectory1/{query}.csv

Output Structure

The system generates well-organized trajectory data with the following structure:

trajectories/
├── 小红书/
│ ├── 小红书.json
│ ├── Screenshot_2025年01月10日-20-10-21_0.jpg
│ ├── Screenshot_2025年01月10日-20-10-21_1.jpg
│ └── Screenshot_2025年01月10日-20-10-21_2.jpg
└── 搜索/
 ├── 搜索.json
 ├── Screenshot_2025年01月10日-20-15-30_0.jpg
 └── Screenshot_2025年01月10日-20-15-30_1.jpg

Graph Construction

To merge multiple trajectory files into a unified task graph, run:

python construct_graph.py --input_folder <trajectories> --output_file <path/to/graph.json>

During merging, we use the following default models:

BAAI/bge-large-zh-v1.5 (stored in the local Hugging Face cache as models--BAAI--bge-large-zh-v1.5) for text feature embedding
Qwen2.5-VL-72B for visual-language understanding

You can modify these in ./src/graph_construction/graph.py according to your setup. The generated graph.json records all node and edge information in the following format:

{
 "node_id": ,
 "screenlists": [
 {
 "screenshot_path": "",
 "node_description": ""
 }
 ],
 "ui_element_edge_list": [
 {
 "source_node": ,
 "target_node": ,
 "action_type": "",
 "action_parameter": {}
 }
 ]
}
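
As a rough illustration of how records in this format can be consumed, the sketch below builds an adjacency map from node records and finds a shortest action sequence between two nodes. Several things here are assumptions: that graph.json holds a list of such records, that the SAMPLE_NODES values (including the "click" and "type" action types and the parameter keys) resemble real data, and the helper names build_adjacency and shortest_action_path, which are hypothetical.

```python
from collections import deque

# Hand-written sample in the record format above; not real benchmark data.
SAMPLE_NODES = [
    {
        "node_id": 0,
        "screenlists": [{"screenshot_path": "Screenshot0.png", "node_description": "home"}],
        "ui_element_edge_list": [
            {"source_node": 0, "target_node": 1, "action_type": "click",
             "action_parameter": {"x": 100, "y": 200}},
            {"source_node": 0, "target_node": 2, "action_type": "click",
             "action_parameter": {"x": 300, "y": 400}},
        ],
    },
    {
        "node_id": 1,
        "screenlists": [{"screenshot_path": "Screenshot1.jpg", "node_description": "search"}],
        "ui_element_edge_list": [
            {"source_node": 1, "target_node": 3, "action_type": "type",
             "action_parameter": {"text": "coffee"}},
        ],
    },
    {"node_id": 2, "screenlists": [], "ui_element_edge_list": []},
    {"node_id": 3, "screenlists": [], "ui_element_edge_list": []},
]


def build_adjacency(nodes):
    """Map each node id to its outgoing (action_type, target_node) pairs."""
    adjacency = {node["node_id"]: [] for node in nodes}
    for node in nodes:
        for edge in node["ui_element_edge_list"]:
            adjacency.setdefault(edge["source_node"], []).append(
                (edge["action_type"], edge["target_node"])
            )
    return adjacency


def shortest_action_path(adjacency, start, goal):
    """Breadth-first search for the fewest actions from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, actions = queue.popleft()
        if node == goal:
            return actions
        for action, target in adjacency.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, actions + [action]))
    return None  # goal is unreachable from start


adjacency = build_adjacency(SAMPLE_NODES)
route = shortest_action_path(adjacency, 0, 3)
```

The length of such a route is one way to think about the optimal path length of a task, although the benchmark's own computation may differ.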

Frontend Inspection Tool

After graph merging, you can manually inspect and adjust graph data using the frontend visualization tool. Convert the merged graph.json into a CSV file:

In ./src/graph_construction/parse_json_to_cvs.py, set json_file (path to graph JSON) and save_file (output CSV path).
In ./src/graph_construction/matrix_analyzer.py, set BASE_RECORD_PATH to your image directory.

Run the following commands:

python src/graph_construction/parse_json_to_cvs.py
python src/graph_construction/matrix_analyzer.py

After manual corrections, convert the updated CSV file back into JSON format for evaluation:

python src/graph_construction/matrix_to_json.py

Bounding Box Annotation

This step automatically generates bounding boxes for interface elements.

In src/graph_construction/image_jump_parser.py, modify the input paths in the main function: the path to the graph dataset JSON file and the path to the corresponding image folder.
Set your model service API key.

Run the following command:

python src/graph_construction/image_jump_parser.py

🤝 Contributing & Citation

Contributions via Issues or Pull Requests are welcome! If you use this project, please consider citing our paper:

ColorBench: Benchmarking Mobile Agents with Graph Structured Framework for Complex Long-Horizon Task

📚 Dataset available at:

HuggingFace Dataset
