UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

🔥 Overview

We propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks.

Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on AndroidControl. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples.

Grounding Leaderboard: UI-I2E-Bench

| Model | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Average |
| --- | --- | --- | --- | --- |
| UI-TARS-1.5-7B | 88.1 | 73.2 | 42.2 | 67.8 |
| Uground-V1-72B | 89.7 | 76.3 | 34.3 | 66.8 |
| UI-TARS-72B | 88.4 | 73.7 | 38.1 | 66.7 |
| UI-R1-E-3B | 89.2 | 69.1 | 33.5 | 63.9 |
| Uground-V1-7B | 87.1 | 70.3 | 31.1 | 62.8 |
| InfiGUI-R1 | 87.5 | 69.7 | 29.6 | 62.3 |
| UI-TARS-7B | 89.5 | 61.4 | 35.7 | 62.2 |
| Qwen2.5-VL-72B | 87.1 | 51.4 | 43.6 | 60.7 |
| UI-I2E-VLM-7B | 82.5 | 69.5 | 23.6 | 58.5 |
| UI-TARS-2B | 82.3 | 62.0 | 27.7 | 57.3 |
| Qwen2.5-VL-7B | 84.7 | 53.8 | 29.0 | 55.8 |
| OmniParser-V2 | 72.0 | 54.8 | 39.6 | 55.5 |
| Uground-V1-2B | 78.8 | 57.4 | 26.6 | 54.3 |
| OS-Atlas-7B | 82.5 | 58.6 | 18.9 | 53.3 |
| UI-R1-3B | 83.3 | 58.5 | 17.8 | 53.2 |
| UGround-7B | 74.1 | 54.2 | 16.5 | 48.3 |
| UI-I2E-VLM-4B | 70.4 | 53.4 | 12.2 | 45.3 |
| OmniParser | 73.9 | 53.1 | 8.3 | 45.1 |
| ShowUI-2B | 76.8 | 41.5 | 7.7 | 42.0 |
| Qwen2.5-VL-3B | 55.5 | 41.7 | 23.9 | 41.3 |
| Aguvis-7B | 84.4 | 53.2 | 22.9 | 40.4 |
| OS-Atlas-4B | 70.1 | 44.3 | 3.7 | 39.4 |
| Qwen2-VL-7B | 42.6 | 48.7 | 1.6 | 31.0 |
| Seeclick | 55.8 | 26.4 | 1.1 | 27.8 |
| InternVL2-4B | 4.2 | 0.9 | 0.3 | 1.8 |

🔥 Insight 1: Fast Grounding

Thinking is not needed for GUI grounding.

Inspired by concurrent work on efficient LRMs, we achieve efficient reasoning through RFT training. UI-R1-E-3B's training consists of two steps:

  1. DAST (Difficulty-Adaptive Slow-Thinking): add a difficulty-adaptive length reward that gradually shifts reasoning from slow to fast (a sketch of such a reward follows the note below).
  2. NoThinking: output the answer directly, without a reasoning process.

Note: UI-R1-3B (v2) and UI-R1-E-3B are both trained on a larger dataset (2K grounding samples from GUI-R1-3K) than UI-R1-3B (v1).
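
As a rough illustration of the DAST idea (not the exact reward used in this repository; the function and parameter names below are hypothetical), a difficulty-adaptive length reward can scale the allowed reasoning budget by sample difficulty, so easy samples are pushed toward short or no reasoning while hard samples keep a larger token budget:

```python
# Hypothetical sketch of a difficulty-adaptive length reward (not the exact
# implementation in this repo): easy samples are penalized for long reasoning,
# hard samples are allowed a larger thinking budget.

def length_reward(num_reasoning_tokens: int,
                  difficulty: float,
                  max_budget: int = 128) -> float:
    """difficulty in [0, 1]: 0 = easy, 1 = hard."""
    budget = max(1, int(difficulty * max_budget))  # adaptive token budget
    if num_reasoning_tokens <= budget:
        # staying within budget earns a bonus that grows as reasoning shrinks
        return 1.0 - num_reasoning_tokens / budget
    # overshooting the budget is penalized, clipped at -1
    return max(-1.0, -(num_reasoning_tokens - budget) / budget)

# Easy sample (difficulty 0.1): 40 reasoning tokens overshoot the ~12-token
# budget and get a negative reward; a hard sample tolerates the same length.
print(length_reward(40, difficulty=0.1))  # negative
print(length_reward(40, difficulty=0.9))  # positive
```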

Benchmark 1: ScreenSpotV2

| ScreenSpotV2 | Inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T | Web-I | Avg↑ / Len↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OS-ATLAS-7B | w/o thinking | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 / - |
| UI-TARS-7B | w/o thinking | 95.2 | 79.1 | 90.7 | 68.6 | 90.6 | 78.3 | 84.7 / - |
| UI-R1-3B (v1) | w/ thinking | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 / 67 |
| GUI-R1-3B | w/ thinking | 97.6 | 78.2 | 94.3 | 64.3 | 91.0 | 72.4 | 85.0 / 80 |
| UI-R1-3B (v2) | w/ thinking | 97.6 | 79.6 | 92.3 | 67.9 | 88.9 | 77.8 | 85.8 / 60 |
| UI-R1-E-3B | w/o thinking | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 / 28 |

Benchmark 2: ScreenSpot-Pro

| ScreenSpot-Pro | Inference mode | Average Length↓ | Average Accuracy↑ |
| --- | --- | --- | --- |
| UGround-7B | w/o thinking | - | 16.5 |
| OS-ATLAS-7B | w/o thinking | - | 18.9 |
| UI-R1-3B (v1) | w/ thinking | 102 | 17.8 |
| GUI-R1-3B | w/ thinking | 114 | 26.6 |
| UI-R1-3B (v2) | w/ thinking | 129 | 29.8 |
| UI-R1-E-3B | w/o thinking | 28 | 33.5 |

Analysis
  1. Our UI-R1-E-3B achieves state-of-the-art accuracy with the fewest answer tokens among 3B/7B open-source methods, demonstrating that GUI grounding needs no explicit reasoning.
Todo
  • The conclusion may be reversed at the 7B scale.
  • The conclusion may be reversed for planning tasks; the authors expect fast grounding but slow planning.
  • The checkpoints of UI-R1-E-3B will be released soon.
  • The updated paper will come soon.
  • The efficient training code will come soon (in src/script/train_e.sh).

Setup

conda create -n ui-r1 python=3.10
conda activate ui-r1
bash setup.sh

Data

Our mobile training data is a subset of AndroidControl and ScreenSpot.

You can also prepare your own training or inference data in the following layout:

images/:
    image1.png
    image2.png
test.json:
[
    {
        "img_filename": "image1.png",
        "bbox": [825, 72, 1673, 149],
        "instruction": "search bar"
    },
    {
        "img_filename": "image2.png",
        "bbox": [123, 732, 334, 812],
        "instruction": "check weather"
    }
]

where bbox = [x1, y1, x2, y2] gives the coordinates of the top-left and bottom-right corners of the ground-truth bounding box.
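
As a minimal sketch of how this format is typically consumed (the helper below is not part of this repo; `predict_click` is a placeholder for your own inference call), a grounding prediction is usually counted as correct when the predicted click point falls inside the ground-truth bbox:

```python
import json

def point_in_bbox(x: float, y: float, bbox: list) -> bool:
    # bbox = [x1, y1, x2, y2]: top-left and bottom-right corners
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def evaluate(test_json: str, predict_click) -> float:
    """Fraction of samples whose predicted click lands in the ground-truth box."""
    with open(test_json) as f:
        samples = json.load(f)
    hits = 0
    for sample in samples:
        x, y = predict_click(sample["img_filename"], sample["instruction"])
        hits += point_in_bbox(x, y, sample["bbox"])
    return hits / len(samples)

# Example with a dummy predictor that always clicks the same point:
# acc = evaluate("test.json", lambda img, instr: (900, 100))
```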

Inference

We provide an evaluation example here:

cd evaluation/
bash test.sh

Please fill in MODEL_PATH, IMG_PATH, and TEST_JSON with your actual checkpoint and data paths.
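
If you prefer to query a checkpoint directly instead of using test.sh, a minimal sketch with Hugging Face transformers looks roughly as follows (UI-R1 checkpoints are Qwen2.5-VL fine-tunes; the paths and prompt wording are placeholders, and the repo's evaluation scripts may use a different prompt format):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholder path: point this at your downloaded UI-R1 checkpoint.
model_path = "/path/to/UI-R1-checkpoint"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_path)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "images/image1.png"},
        {"type": "text", "text": "Output the click coordinate for: search bar"},
    ],
}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(answer)  # expected to contain a click coordinate for the instruction
```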

Training

cd src/script/
bash train.sh
# efficient training
bash train_e.sh
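
For intuition about the rule-based rewards used in this style of RFT (an illustration of the general idea, not the exact reward implemented in src/; the regex and weights are assumptions), a grounding reward typically combines a format check with a correctness check on the predicted click:

```python
import re

def parse_point(text: str):
    """Extract the first '(x, y)' integer pair from the model output, if any."""
    m = re.search(r"\((\d+),\s*(\d+)\)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

def rule_based_reward(output: str, bbox: list) -> float:
    point = parse_point(output)
    if point is None:
        return 0.0                      # unparsable answer: no reward
    reward = 0.1                        # small reward for correct format
    x, y = point
    x1, y1, x2, y2 = bbox
    if x1 <= x <= x2 and y1 <= y <= y2:
        reward += 1.0                   # click inside the ground-truth box
    return reward

# Example: rule_based_reward("(900, 100)", [825, 72, 1673, 149]) -> 1.1
```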

🗞️ News

  • 2025.11.08: Our paper was accepted by AAAI-2026.
  • 2025.05.14: We update the paper with UI-R1-E-3B.
  • 2025.05.12: We release the checkpoints of the UI-R1-E-3B model.
  • 2025.05.12: We fix the scale bug when batch_size > 1.
  • 2025.05.11: We release the efficient training code of the UI-R1-E-3B model.
  • 2025.04.02: We release the datasets of the UI-R1-3B (v1) model.
  • 2025.03.30: We release the checkpoints of the UI-R1-3B (v1) model.
  • 2025.03.30: We release the UI-R1 repository.
  • 2025.03.27: We release our paper.

⭐️ Citation

If you find this project useful, please consider citing us:

@article{lu2025ui,
 title={UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning},
 author={Lu, Zhengxi and Chai, Yuxiang and Guo, Yaxuan and Yin, Xi and Liu, Liang and Wang, Hao and Xiong, Guanjing and Li, Hongsheng},
 journal={arXiv preprint arXiv:2503.21620},
 year={2025}
}

🤝 Acknowledgements

We sincerely thank the projects R1-V, Open-R1, Open-r1-multimodal, and VLM-R1 for providing their open-source resources.
