UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

🔥 Overview

We propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks.

Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on AndroidControl. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples.

Grounding Leaderboard: UI-I2E-Bench

| Model | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Average |
| --- | --- | --- | --- | --- |
| UI-TARS-1.5-7B | 88.1 | 73.2 | 42.2 | 67.8 |
| Uground-V1-72B | 89.7 | 76.3 | 34.3 | 66.8 |
| UI-TARS-72B | 88.4 | 73.7 | 38.1 | 66.7 |
| UI-R1-E-3B | 89.2 | 69.1 | 33.5 | 63.9 |
| Uground-V1-7B | 87.1 | 70.3 | 31.1 | 62.8 |
| InfiGUI-R1 | 87.5 | 69.7 | 29.6 | 62.3 |
| UI-TARS-7B | 89.5 | 61.4 | 35.7 | 62.2 |
| Qwen2.5-VL-72B | 87.1 | 51.4 | 43.6 | 60.7 |
| UI-I2E-VLM-7B | 82.5 | 69.5 | 23.6 | 58.5 |
| UI-TARS-2B | 82.3 | 62.0 | 27.7 | 57.3 |
| Qwen2.5-VL-7B | 84.7 | 53.8 | 29.0 | 55.8 |
| OmniParser-V2 | 72.0 | 54.8 | 39.6 | 55.5 |
| Uground-V1-2B | 78.8 | 57.4 | 26.6 | 54.3 |
| OS-Atlas-7B | 82.5 | 58.6 | 18.9 | 53.3 |
| UI-R1-3B | 83.3 | 58.5 | 17.8 | 53.2 |
| UGround-7B | 74.1 | 54.2 | 16.5 | 48.3 |
| UI-I2E-VLM-4B | 70.4 | 53.4 | 12.2 | 45.3 |
| OmniParser | 73.9 | 53.1 | 8.3 | 45.1 |
| ShowUI-2B | 76.8 | 41.5 | 7.7 | 42.0 |
| Qwen2.5-VL-3B | 55.5 | 41.7 | 23.9 | 41.3 |
| Aguvis-7B | 84.4 | 53.2 | 22.9 | 40.4 |
| OS-Atlas-4B | 70.1 | 44.3 | 3.7 | 39.4 |
| Qwen2-VL-7B | 42.6 | 48.7 | 1.6 | 31.0 |
| Seeclick | 55.8 | 26.4 | 1.1 | 27.8 |
| InternVL2-4B | 4.2 | 0.9 | 0.3 | 1.8 |

🔥 Insight 1: Fast Grounding

Thinking is not needed for GUI grounding.

Inspired by concurrent work on efficient LRMs, we achieve efficient reasoning through RFT training. UI-R1-E-3B's training consists of two steps:

  1. DAST (Difficulty-Adaptive Slow-Thinking): add a difficulty-adaptive length reward that gradually shifts reasoning from slow to fast (a sketch of such a reward follows the note below).
  2. NoThinking: output the answer directly, without a reasoning process.

Note: UI-R1-3B (v2) and UI-R1-E-3B are both trained on a larger dataset (2K grounding samples from GUI-R1-3K) than UI-R1-3B (v1).
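
As a rough illustration of the DAST idea (not the exact reward used in this repository; the function and parameter names below are hypothetical), a difficulty-adaptive length reward can scale the allowed reasoning budget by sample difficulty, so easy samples are pushed toward short or no reasoning while hard samples keep a larger token budget:

```python
# Hypothetical sketch of a difficulty-adaptive length reward (not the exact
# implementation in this repo): easy samples are penalized for long reasoning,
# hard samples are allowed a larger thinking budget.

def length_reward(num_reasoning_tokens: int,
                  difficulty: float,
                  max_budget: int = 128) -> float:
    """difficulty in [0, 1]: 0 = easy, 1 = hard."""
    budget = max(1, int(difficulty * max_budget))  # adaptive token budget
    if num_reasoning_tokens <= budget:
        # staying within budget earns a bonus that grows as reasoning shrinks
        return 1.0 - num_reasoning_tokens / budget
    # overshooting the budget is penalized, clipped at -1
    return max(-1.0, -(num_reasoning_tokens - budget) / budget)

# Easy sample (difficulty 0.1): 40 reasoning tokens overshoot the ~12-token
# budget and get a negative reward; a hard sample tolerates the same length.
print(length_reward(40, difficulty=0.1))  # negative
print(length_reward(40, difficulty=0.9))  # positive
```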

Benchmark 1: ScreenSpotV2

| ScreenSpotV2 | Inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T | Web-I | Avg↑ / Len↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OS-ATLAS-7B | w/o thinking | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 / - |
| UI-TARS-7B | w/o thinking | 95.2 | 79.1 | 90.7 | 68.6 | 90.6 | 78.3 | 84.7 / - |
| UI-R1-3B (v1) | w/ thinking | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 / 67 |
| GUI-R1-3B | w/ thinking | 97.6 | 78.2 | 94.3 | 64.3 | 91.0 | 72.4 | 85.0 / 80 |
| UI-R1-3B (v2) | w/ thinking | 97.6 | 79.6 | 92.3 | 67.9 | 88.9 | 77.8 | 85.8 / 60 |
| UI-R1-E-3B | w/o thinking | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 / 28 |

Benchmark 2: ScreenSpot-Pro

| ScreenSpot-Pro | Inference mode | Average Length↓ | Average Accuracy↑ |
| --- | --- | --- | --- |
| UGround-7B | w/o thinking | - | 16.5 |
| OS-ATLAS-7B | w/o thinking | - | 18.9 |
| UI-R1-3B (v1) | w/ thinking | 102 | 17.8 |
| GUI-R1-3B | w/ thinking | 114 | 26.6 |
| UI-R1-3B (v2) | w/ thinking | 129 | 29.8 |
| UI-R1-E-3B | w/o thinking | 28 | 33.5 |

Analysis
  1. Our UI-R1-E-3B achieves state-of-the-art accuracy with the fewest answer tokens among 3B/7B open-source methods, demonstrating that GUI grounding needs no explicit reasoning.
Todo
  • The conclusion may be reversed at the 7B scale.
  • The conclusion may be reversed for planning tasks; the authors expect fast grounding but slow planning.
  • The checkpoints of UI-R1-E-3B will be released soon.
  • The updated paper will come soon.
  • The efficient training code will come soon (in src/script/train_e.sh).

Setup

conda create -n ui-r1 python=3.10
conda activate ui-r1
bash setup.sh

Data

Our mobile training data is a subset of AndroidControl and ScreenSpot.

You can also prepare your own training or inference data in the following layout:

images/:
    image1.png
    image2.png
test.json:
[
    {
        "img_filename": "image1.png",
        "bbox": [825, 72, 1673, 149],
        "instruction": "search bar"
    },
    {
        "img_filename": "image2.png",
        "bbox": [123, 732, 334, 812],
        "instruction": "check weather"
    }
]

where bbox = [x1, y1, x2, y2] gives the coordinates of the top-left and bottom-right corners of the ground-truth bounding box.
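
As a minimal sketch of how this format is typically consumed (the helper below is not part of this repo; `predict_click` is a placeholder for your own inference call), a grounding prediction is usually counted as correct when the predicted click point falls inside the ground-truth bbox:

```python
import json

def point_in_bbox(x: float, y: float, bbox: list) -> bool:
    # bbox = [x1, y1, x2, y2]: top-left and bottom-right corners
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def evaluate(test_json: str, predict_click) -> float:
    """Fraction of samples whose predicted click lands in the ground-truth box."""
    with open(test_json) as f:
        samples = json.load(f)
    hits = 0
    for sample in samples:
        x, y = predict_click(sample["img_filename"], sample["instruction"])
        hits += point_in_bbox(x, y, sample["bbox"])
    return hits / len(samples)

# Example with a dummy predictor that always clicks the same point:
# acc = evaluate("test.json", lambda img, instr: (900, 100))
```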

Inference

We provide an evaluation example here:

cd evaluation/
bash test.sh

Please fill in MODEL_PATH, IMG_PATH, and TEST_JSON with your actual checkpoint and data paths.
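
If you prefer to query a checkpoint directly instead of using test.sh, a minimal sketch with Hugging Face transformers looks roughly as follows (UI-R1 checkpoints are Qwen2.5-VL fine-tunes; the paths and prompt wording are placeholders, and the repo's evaluation scripts may use a different prompt format):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholder path: point this at your downloaded UI-R1 checkpoint.
model_path = "/path/to/UI-R1-checkpoint"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_path)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "images/image1.png"},
        {"type": "text", "text": "Output the click coordinate for: search bar"},
    ],
}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(answer)  # expected to contain a click coordinate for the instruction
```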

Training

cd src/script/
bash train.sh
# efficient training
bash train_e.sh
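
For intuition about the rule-based rewards used in this style of RFT (an illustration of the general idea, not the exact reward implemented in src/; the regex and weights are assumptions), a grounding reward typically combines a format check with a correctness check on the predicted click:

```python
import re

def parse_point(text: str):
    """Extract the first '(x, y)' integer pair from the model output, if any."""
    m = re.search(r"\((\d+),\s*(\d+)\)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

def rule_based_reward(output: str, bbox: list) -> float:
    point = parse_point(output)
    if point is None:
        return 0.0                      # unparsable answer: no reward
    reward = 0.1                        # small reward for correct format
    x, y = point
    x1, y1, x2, y2 = bbox
    if x1 <= x <= x2 and y1 <= y <= y2:
        reward += 1.0                   # click inside the ground-truth box
    return reward

# Example: rule_based_reward("(900, 100)", [825, 72, 1673, 149]) -> 1.1
```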

🗞️ News

  • 2025.11.08: Our paper was accepted by AAAI-2026.
  • 2025.05.14: We update the paper with UI-R1-E-3B.
  • 2025.05.12: We release the checkpoints of the UI-R1-E-3B model.
  • 2025.05.12: We fix the scale bug when batch_size > 1.
  • 2025.05.11: We release the efficient training code of the UI-R1-E-3B model.
  • 2025.04.02: We release the datasets of the UI-R1-3B (v1) model.
  • 2025.03.30: We release the checkpoints of the UI-R1-3B (v1) model.
  • 2025.03.30: We release the UI-R1 repository.
  • 2025.03.27: We release our paper.

⭐️ Citation

If you find this project useful, please consider citing us:

@article{lu2025ui,
 title={UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning},
 author={Lu, Zhengxi and Chai, Yuxiang and Guo, Yaxuan and Yin, Xi and Liu, Liang and Wang, Hao and Xiong, Guanjing and Li, Hongsheng},
 journal={arXiv preprint arXiv:2503.21620},
 year={2025}
}

🤝 Acknowledgements

We sincerely thank the projects R1-V, Open-R1, Open-r1-multimodal, and VLM-R1 for providing their open-source resources.
