ZJU-REAL/ClawGUI

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents


A full-stack framework for GUI agents, covering online RL training, standardized evaluation, and deployment.

Demo video (clawgui-agent.mp4): ClawGUI-Agent controls a real phone via natural language
Demo video (clawgui-rl.mp4): ClawGUI-RL trains a GUI agent with online reinforcement learning

💡 Overview

ClawGUI is a research framework for GUI agents, covering the complete lifecycle from online RL training and standardized evaluation to real-device deployment.

Building a capable GUI agent involves three tightly coupled problems that are rarely solved together: you need an environment to train the agent online, rigorous benchmarks to measure what it has learned, and a production system to deploy it on real devices. ClawGUI addresses all three.

Module | Role | Description
--- | --- | ---
🚀 ClawGUI-RL | Build | Train GUI agents online with scalable RL: parallel Docker environments, real Android devices, and GiGPO+PRM for fine-grained step-level rewards
📊 ClawGUI-Eval | Evaluate | Measure what the agent has learned: 6 benchmarks, 11+ models, 95.8% faithful reproduction of official results
🤖 ClawGUI-Agent | Deploy | Use GUI agents in the real world: control mobile devices via natural language through 12+ chat platforms, with one-command evaluation built in
🧩 ClawGUI-Skills | Self-evolving skills | Training-free skill evolution proposed and validated in our paper: structured packages, retrieval, failure diagnosis, restricted revision, and reuse
📱 ClawGUI-APP | On-Device Deploy | Run the full brain + GUI agent stack directly on one Android phone, no desktop coordinator needed, powered by Shizuku
🏆 ClawGUI-2B | End-to-end validation | Trained entirely with ClawGUI-RL and GiGPO, achieving 17.1 MobileWorld SR vs. the 11.1 baseline

🚀 Quick Start

git clone https://github.com/ZJU-REAL/ClawGUI.git
cd ClawGUI

Each module is independent with its own environment. Click into each one for full installation and usage instructions.

🚀 ClawGUI-RL — Build

📁 clawgui-rl/ · 📖 Full Documentation

ClawGUI-RL trains GUI agents with online reinforcement learning. It runs dozens of Docker-based Android emulators in parallel or trains directly on physical devices — and replaces standard GRPO with GiGPO+PRM for fine-grained step-level rewards that drive stronger policy learning.

  • Parallel multi-environment — Dozens of Docker-based virtual Android environments simultaneously
  • Real-device training — Physical or cloud Android phones with the same API
  • GiGPO + PRM — Fine-grained step-level reward for better policy optimization than standard GRPO
  • Spare server rotation — Automatic failover keeps training running without interruption
  • Episode visualization — Record and replay any training trajectory

Get started with ClawGUI-RL

📊 ClawGUI-Eval — Evaluate

📁 clawgui-eval/ · 📖 Full Documentation · 🤗 Dataset · 🤖 ModelScope

ClawGUI-Eval gives GUI grounding research a reliable measurement baseline. Its three-stage Infer → Judge → Metric pipeline covers 6 benchmarks and 11+ models, with a 95.8% reproduction rate against official results — so numbers across papers are actually comparable.

  • 6 benchmarks — ScreenSpot-Pro, ScreenSpot-V2, UIVision, MMBench-GUI, OSWorld-G, AndroidControl
  • 11+ models — Qwen3-VL, Qwen2.5-VL, UI-TARS, MAI-UI, GUI-G2, UI-Venus, Gemini, Seed 1.8, and more
  • Dual backend — Local GPU (transformers) or remote API (OpenAI-compatible)
  • Multi-GPU & multi-thread — Parallel inference with automatic resume
  • ClawGUI-Agent integration — Pair with ClawGUI-Agent to run the full pipeline via natural language
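
The three-stage Infer → Judge → Metric split can be sketched in a few lines of Python. This is an illustrative outline, not the ClawGUI-Eval API: the `Sample` type, the function names, and the `predict` callback are hypothetical, and the judging rule shown (a predicted click point counts as correct when it falls inside the ground-truth box) is a common grounding criterion assumed here rather than taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    bbox: tuple  # ground-truth target box as (x1, y1, x2, y2) in pixels


def infer(samples, predict):
    """Stage 1: run the model and collect one predicted point per sample."""
    return {s.sample_id: predict(s) for s in samples}


def judge(samples, predictions):
    """Stage 2: mark a prediction correct if the point lies inside the box."""
    verdicts = {}
    for s in samples:
        x, y = predictions[s.sample_id]
        x1, y1, x2, y2 = s.bbox
        verdicts[s.sample_id] = x1 <= x <= x2 and y1 <= y <= y2
    return verdicts


def metric(verdicts):
    """Stage 3: aggregate the verdicts into an accuracy score."""
    return sum(verdicts.values()) / len(verdicts) if verdicts else 0.0
```

Keeping the stages separate means predictions can be stored once and re-judged or re-scored later without running inference again.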

Get started with ClawGUI-Eval

🤖 ClawGUI-Agent — Deploy

📁 clawgui-agent/ · 📖 Full Documentation · 中文

ClawGUI-Agent closes the loop from training to production. Built on OpenClaw and powered by nanobot, it lets you control Android, HarmonyOS, or iOS devices with natural language from 12+ chat platforms — and trigger the full ClawGUI-Eval benchmark pipeline with a single sentence, no scripts required.

  • Cross-platform — Android (ADB), HarmonyOS (HDC), iOS (XCTest)
  • Multi-model — AutoGLM, MAI-UI, GUI-Owl, Qwen-VL, UI-TARS via OpenAI-compatible API
  • One-command evaluation — Say "benchmark qwen3vl on screenspot-pro" and it handles env check → multi-GPU inference → judging → metrics → result comparison
  • Personalized memory — Automatically learns user preferences and injects context across tasks
  • Episode recording — Every task saved as structured episodes for replay and dataset building
  • Web UI — Gradio interface for device management, task execution, and memory inspection

Get started with ClawGUI-Agent

🧩 ClawGUI-Skills — Self-Evolving Skills

📁 clawgui-skills/ · 📖 Full Documentation · 中文

ClawGUI-Skills implements the training-free self-evolving GUI skill architecture proposed and validated in our paper "Reflect, Revise, Reuse: Training-Free Skill Evolution for GUI Agents." It stores procedural task knowledge as structured skill packages and lets PhoneAgent retrieve, inject, diagnose, and revise them on demand.

  • Four modes: off, trace, reuse, and evolve; disabled by default to avoid extra context cost
  • Structured packages: meta_info.json, plan.md, backup.md, recover.md, and failure_examples/
  • Instant revision: failed runs are diagnosed by an isolated verifier and mapped to targeted skill-file edits
  • Visual inspection: the Web UI shows matched skill name, skill_id, injected context, revisions, and failure examples
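
As a rough picture of a skill package on disk, the Python sketch below creates the files listed above and looks up a skill by keyword overlap. The file names come from the list; everything else is an assumption for illustration, including the `meta_info.json` fields (`skill_id`, `name`, `keywords`), the helper names, and the naive retrieval rule, which stands in for whatever retrieval ClawGUI-Skills actually uses.

```python
import json
from pathlib import Path

SKILL_FILES = ("plan.md", "backup.md", "recover.md")


def create_skill(root, skill_id, name, keywords):
    """Create an empty skill package directory under root."""
    pkg = Path(root) / skill_id
    (pkg / "failure_examples").mkdir(parents=True, exist_ok=True)
    # Hypothetical metadata schema; the real meta_info.json may differ.
    meta = {"skill_id": skill_id, "name": name, "keywords": keywords}
    (pkg / "meta_info.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    for fname in SKILL_FILES:
        (pkg / fname).touch()
    return pkg


def retrieve(root, task):
    """Return the skill_id whose keywords overlap most with the task, or None."""
    words = set(task.lower().split())
    best_id, best_score = None, 0
    for meta_path in sorted(Path(root).glob("*/meta_info.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        score = len(words & {k.lower() for k in meta["keywords"]})
        if score > best_score:
            best_id, best_score = meta["skill_id"], score
    return best_id
```

Storing each skill as a small directory of text files keeps revisions easy to diff and lets a failed run be attached to the skill it came from.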

Get started with ClawGUI-Skills

📱 ClawGUI-APP — On-Device Deploy

📁 clawgui-app/ · 📖 Setup Guide

ClawGUI-APP runs the full ClawGUI "brain + GUI agent" stack directly on one Android phone, removing the old split architecture where a desktop host orchestrates tasks and the phone only executes them. It is built on Shizuku for high-privilege, non-root device control.

  • Phone-only workflow — No desktop coordinator required; a device with Shizuku is enough
  • Two-agent design — Brain LLM handles planning and tool orchestration, phone agent handles screen understanding and actions
  • Multi-model support — AutoGLM, MAI-UI, GUI-Owl, Qwen-VL, UI-TARS and more via OpenAI-compatible API
  • Voice input (STT) — Tap-to-record microphone with OpenAI-compatible speech-to-text transcription (SiliconFlow, Groq Whisper, etc.)
  • Conversation + automation — Sessions, long-term memory, external channels (Feishu), and trace replay
  • Built for real usage — Floating overlay status, built-in IME, session persistence, and diagnostics

Build ClawGUI-APP

🎯 Roadmap

  • ClawGUI-Agent — GUI agent framework for phone control and evaluation via natural language
  • ClawGUI-RL — Scalable mobile online RL training infrastructure with GiGPO + PRM
  • ClawGUI-Eval — Standardized GUI grounding evaluation suite with 6 benchmarks and 95%+ reproduction rate
  • ClawGUI-2B — 2B GUI agent trained with GiGPO, achieving 17.1 MobileWorld SR (vs. 11.1 baseline)
  • On-device ClawGUI-Agent (ClawGUI-APP) — Deploy ClawGUI-Agent directly on real phones — no desktop coordinator, paving the way for fully on-device inference (brain/VLM still served via cloud API today)
  • Desktop Online RL — Extend ClawGUI-RL to desktop environments for online reinforcement learning
  • Web Online RL — Extend ClawGUI-RL to web environments for online reinforcement learning
  • More Skills for ClawGUI-Agent — Add more pluggable skills to expand ClawGUI-Agent's capabilities
  • Hybrid CLI & GUI Mechanism — Explore hybrid interaction combining command-line and GUI operations
  • Real-time RL — Integrate real-time reinforcement learning based on the OPD algorithm for ClawGUI-RL and ClawGUI-Agent

🤝 Contributing

We welcome contributions of all kinds — new model support, new RL environments, bug fixes, and documentation improvements. See CONTRIBUTING.md for how to get started, module-specific guidelines, and PR requirements.

🙏 Acknowledgements

ClawGUI is built upon the following excellent open-source projects. We sincerely thank their contributors:

License

This project is licensed under the Apache License 2.0.

📝 Citation

If you find ClawGUI useful in your research, please consider citing our paper:

@article{tang2026clawgui,
 title={ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents},
 author={Tang, Fei and Lu, Zhiqiong and Zhang, Boxuan and Lu, Weiming and Xiao, Jun and Zhuang, Yueting and Shen, Yongliang},
 journal={arXiv preprint arXiv:2604.11784},
 year={2026}
}
