
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

A full-stack framework for GUI agents, covering online RL training, standardized evaluation, and deployment.

Demo video (clawgui-agent.mp4): ClawGUI-Agent controls a real phone via natural language

Demo video (clawgui-rl.mp4): ClawGUI-RL trains a GUI agent with online reinforcement learning

News

📄 [2026-04-14] Our paper is available on arXiv: ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents.
🔥 [2026-04-13] ClawGUI is released: train with ClawGUI-RL (GiGPO), evaluate with ClawGUI-Eval, deploy with ClawGUI-Agent. ClawGUI-2B, a 2B agent trained end-to-end with this pipeline, reaches 17.1 MobileWorld SR vs. the 11.1 baseline. See Quick Start.

Overview
Architecture
Quick Start
Roadmap
Acknowledgements
License

💡 Overview

ClawGUI is a research framework for GUI agents, covering the complete lifecycle from online RL training and standardized evaluation to real-device deployment.

Building a capable GUI agent involves three tightly coupled problems that are rarely solved together: you need an environment to train the agent online, rigorous benchmarks to measure what it has learned, and a production system to deploy it on real devices. ClawGUI addresses all three.

Module	Role
🚀 ClawGUI-RL	Build — Train GUI agents online with scalable RL: parallel Docker environments, real Android devices, and GiGPO+PRM for fine-grained step-level rewards
📊 ClawGUI-Eval	Evaluate — Measure what the agent has learned: 6 benchmarks, 11+ models, 95.8% faithful reproduction of official results
🤖 ClawGUI-Agent	Deploy — Use GUI agents in the real world: control mobile devices via natural language through 12+ chat platforms, with one-command evaluation built in
🧩 ClawGUI-Skills	Self-evolving skills — Training-free skill evolution proposed and validated in our paper: structured packages, retrieval, failure diagnosis, restricted revision, and reuse
📱 ClawGUI-APP	On-Device Deploy — Run the full brain + GUI agent stack directly on one Android phone, no desktop coordinator needed, powered by Shizuku
🏆 ClawGUI-2B	End-to-end validation: trained entirely with ClawGUI-RL and GiGPO, achieving 17.1 MobileWorld SR vs. the 11.1 baseline

🏗️ Architecture

[Figure: ClawGUI System Architecture]

🚀 Quick Start

git clone https://github.com/ZJU-REAL/ClawGUI.git
cd ClawGUI

Each module is independent with its own environment. Click into each one for full installation and usage instructions.

🚀 ClawGUI-RL — Build

📁 clawgui-rl/ · 📖 Full Documentation

ClawGUI-RL trains GUI agents with online reinforcement learning. It runs dozens of Docker-based Android emulators in parallel or trains directly on physical devices — and replaces standard GRPO with GiGPO+PRM for fine-grained step-level rewards that drive stronger policy learning.

Parallel multi-environment — Dozens of Docker-based virtual Android environments simultaneously
Real-device training — Physical or cloud Android phones with the same API
GiGPO + PRM — Fine-grained step-level reward for better policy optimization than standard GRPO
Spare server rotation — Automatic failover keeps training running without interruption
Episode visualization — Record and replay any training trajectory

[Figure: ClawGUI-RL Architecture]

→ Get started with ClawGUI-RL

📊 ClawGUI-Eval — Evaluate

📁 clawgui-eval/ · 📖 Full Documentation · 🤗 Dataset · 🤖 ModelScope

ClawGUI-Eval gives GUI grounding research a reliable measurement baseline. Its three-stage Infer → Judge → Metric pipeline covers 6 benchmarks and 11+ models, with a 95.8% reproduction rate against official results — so numbers across papers are actually comparable.

6 benchmarks — ScreenSpot-Pro, ScreenSpot-V2, UIVision, MMBench-GUI, OSWorld-G, AndroidControl
11+ models — Qwen3-VL, Qwen2.5-VL, UI-TARS, MAI-UI, GUI-G2, UI-Venus, Gemini, Seed 1.8, and more
Dual backend — Local GPU (transformers) or remote API (OpenAI-compatible)
Multi-GPU & multi-thread — Parallel inference with automatic resume
ClawGUI-Agent integration — Pair with ClawGUI-Agent to run the full pipeline via natural language
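
To make the three-stage Infer → Judge → Metric flow concrete, here is a minimal sketch for a grounding task, assuming the model predicts a click point and a prediction counts as correct when the point falls inside the target bounding box. The `Sample` class, the function names, and the point-in-box rule are illustrative assumptions, not the ClawGUI-Eval API.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    bbox: tuple  # (x1, y1, x2, y2) of the target element, in pixels


def infer(sample, predict_fn):
    """Stage 1: ask the model for a click point (x, y)."""
    return predict_fn(sample.instruction)


def judge(point, bbox):
    """Stage 2: a prediction is correct if the point lies inside the box."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def metric(verdicts):
    """Stage 3: aggregate per-sample verdicts into an accuracy score."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def run_pipeline(samples, predict_fn):
    predictions = [infer(s, predict_fn) for s in samples]
    verdicts = [judge(p, s.bbox) for p, s in zip(predictions, samples)]
    return metric(verdicts)
```

Keeping the stages separate means cached predictions can be re-judged or re-scored without running inference again, which is one plausible reason for splitting a pipeline this way.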

[Figure: ClawGUI-Eval Architecture]

→ Get started with ClawGUI-Eval

🤖 ClawGUI-Agent — Deploy

📁 clawgui-agent/ · 📖 Full Documentation · 中文

ClawGUI-Agent closes the loop from training to production. Built on OpenClaw and powered by nanobot, it lets you control Android, HarmonyOS, or iOS devices with natural language from 12+ chat platforms — and trigger the full ClawGUI-Eval benchmark pipeline with a single sentence, no scripts required.

Cross-platform — Android (ADB), HarmonyOS (HDC), iOS (XCTest)
Multi-model — AutoGLM, MAI-UI, GUI-Owl, Qwen-VL, UI-TARS via OpenAI-compatible API
One-command evaluation — Say "benchmark qwen3vl on screenspot-pro" and it handles env check → multi-GPU inference → judging → metrics → result comparison
Personalized memory — Automatically learns user preferences and injects context across tasks
Episode recording — Every task saved as structured episodes for replay and dataset building
Web UI — Gradio interface for device management, task execution, and memory inspection
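
For the Android (ADB) path, a predicted GUI action eventually has to become a device command. The sketch below maps a simple action dictionary to an `adb shell input` invocation. The action schema (`type`, `x`, `y`, and so on) is a hypothetical format chosen for illustration, not the one ClawGUI-Agent uses, and HarmonyOS (HDC) and iOS (XCTest) are not covered.

```python
def adb_command(action, serial=None):
    """Build the argv list for one GUI action (hypothetical action schema)."""
    base = ["adb"] + (["-s", serial] if serial else [])
    kind = action["type"]
    if kind == "tap":
        args = ["input", "tap", str(action["x"]), str(action["y"])]
    elif kind == "swipe":
        args = [
            "input", "swipe",
            str(action["x1"]), str(action["y1"]),
            str(action["x2"]), str(action["y2"]),
            str(action.get("duration_ms", 300)),  # default duration is an assumption
        ]
    elif kind == "back":
        args = ["input", "keyevent", "KEYCODE_BACK"]
    else:
        raise ValueError(f"unsupported action type: {kind}")
    return base + ["shell"] + args
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where `adb` is installed and a device is connected.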

[Figure: ClawGUI-Agent]

→ Get started with ClawGUI-Agent

🧩 ClawGUI-Skills — Self-Evolving Skills

📁 clawgui-skills/ · 📖 Full Documentation · 中文

ClawGUI-Skills implements the training-free self-evolving GUI skill architecture proposed and validated in our paper "Reflect, Revise, Reuse: Training-Free Skill Evolution for GUI Agents." It stores procedural task knowledge as structured skill packages and lets PhoneAgent retrieve, inject, diagnose, and revise them on demand.

Four modes — off, trace, reuse, and evolve; disabled by default to avoid extra context cost
Structured packages — meta_info.json, plan.md, backup.md, recover.md, and failure_examples/
Instant revision — failed runs are diagnosed by an isolated verifier and mapped to targeted skill-file edits
Visual inspection — the Web UI shows matched skill name, skill_id, injected context, revisions, and failure examples
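
The package layout listed above can be checked and loaded with a few lines of code. This is a minimal sketch that assumes only the file and directory names given in the list; the contents of `meta_info.json` and the shape of the returned dictionary are assumptions, and the function names are hypothetical.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("meta_info.json", "plan.md", "backup.md", "recover.md")
REQUIRED_DIRS = ("failure_examples",)


def missing_entries(package_dir):
    """List the required files and directories absent from a skill package."""
    root = Path(package_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing


def load_skill(package_dir):
    """Read a complete skill package into a plain dictionary."""
    root = Path(package_dir)
    missing = missing_entries(root)
    if missing:
        raise FileNotFoundError(f"incomplete skill package, missing: {missing}")
    return {
        "meta": json.loads((root / "meta_info.json").read_text(encoding="utf-8")),
        "plan": (root / "plan.md").read_text(encoding="utf-8"),
        "backup": (root / "backup.md").read_text(encoding="utf-8"),
        "recover": (root / "recover.md").read_text(encoding="utf-8"),
        "failure_examples": sorted(
            p.name for p in (root / "failure_examples").iterdir()
        ),
    }
```

Validating the layout before loading makes an incomplete package fail early with a clear message instead of surfacing later as a missing-context error during a task.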

→ Get started with ClawGUI-Skills

📱 ClawGUI-APP — On-Device Deploy

📁 clawgui-app/ · 📖 Setup Guide

ClawGUI-APP runs the full ClawGUI "brain + GUI agent" stack directly on one Android phone, removing the old split architecture where a desktop host orchestrates tasks and the phone only executes them. Built on Shizuku for high-privilege, non-root device control.

Phone-only workflow — No desktop coordinator required; a device with Shizuku is enough
Two-agent design — Brain LLM handles planning and tool orchestration, phone agent handles screen understanding and actions
Multi-model support — AutoGLM, MAI-UI, GUI-Owl, Qwen-VL, UI-TARS and more via OpenAI-compatible API
Voice input (STT) — Tap-to-record microphone with OpenAI-compatible speech-to-text transcription (SiliconFlow, Groq Whisper, etc.)
Conversation + automation — Sessions, long-term memory, external channels (Feishu), and trace replay
Built for real usage — Floating overlay status, built-in IME, session persistence, and diagnostics

→ Build ClawGUI-APP

🎯 Roadmap

ClawGUI-Agent — GUI agent framework for phone control and evaluation via natural language
ClawGUI-RL — Scalable mobile online RL training infrastructure with GiGPO + PRM
ClawGUI-Eval — Standardized GUI grounding evaluation suite with 6 benchmarks and 95%+ reproduction rate
ClawGUI-2B — 2B GUI agent trained with GiGPO, achieving 17.1 MobileWorld SR (vs. 11.1 baseline)
On-device ClawGUI-Agent (ClawGUI-APP) — Deploy ClawGUI-Agent directly on real phones — no desktop coordinator, paving the way for fully on-device inference (brain/VLM still served via cloud API today)
Desktop Online RL — Extend ClawGUI-RL to desktop environments for online reinforcement learning
Web Online RL — Extend ClawGUI-RL to web environments for online reinforcement learning
More Skills for ClawGUI-Agent — Add more pluggable skills to expand ClawGUI-Agent's capabilities
Hybrid CLI & GUI Mechanism — Explore hybrid interaction combining command-line and GUI operations
Real-time RL — Integrate real-time reinforcement learning based on the OPD algorithm for ClawGUI-RL and ClawGUI-Agent

🤝 Contributing

We welcome contributions of all kinds — new model support, new RL environments, bug fixes, and documentation improvements. See CONTRIBUTING.md for how to get started, module-specific guidelines, and PR requirements.

🙏 Acknowledgements

ClawGUI is built upon the following excellent open-source projects. We sincerely thank their contributors:

License

This project is licensed under the Apache License 2.0.

📝 Citation

If you find ClawGUI useful in your research, please consider citing our paper:

@article{tang2026clawgui,
 title={ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents},
 author={Tang, Fei and Lu, Zhiqiong and Zhang, Boxuan and Lu, Weiming and Xiao, Jun and Zhuang, Yueting and Shen, Yongliang},
 journal={arXiv preprint arXiv:2604.11784},
 year={2026}
}
