Training a Vision-Language-Action (VLA) model for GUI & computer-use tasks by watching online tutorials. Fully open-sourced dataset, models, and training pipeline. A cost-efficient solution for GUI task data generation.
A VLA model and agent framework designed for graphical user interface tasks.
📑 Paper | 🤗 HuggingFace Collections (Models & Datasets) | 🤖 ModelScope Collections (Models & Datasets) | 🤗 Spaces Demo | 🌐 Webpage
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Bofei Zhang*, Zirui Shang*, Zhi Gao*, Wang Zhang, Rui Xie, Xiaojian Ma, Yuan Tao, Xinxiao Wu, Song-Chun Zhu, Qing Li✉
- Release all experiment/evaluation scripts [WIP].
- [2025-10-23] Release TongUI-Absolute, an annotated dataset with absolute coordinate labels.
- [2025-07-10] Release crawler code and intermediate crawler data 🤗. Please feel free to process your own SFT dataset!
- [2025-06-16] Submit evaluation to UI-Vision! Check out the results here and how to reproduce them here.
- [2025-05-27] Release TongUI-32B model and Training Details.
- [2025-05-06] Release TongUI-7B model and GUI-Net-1M dataset.
- [2025-04-21] Release 🔧 Training pipeline.
- [2025-04-17] Release TongUI-3B model.
Key findings
- Training with this cost-efficient dataset gives SOTA👑 performance on multiple GUI benchmarks!
- Training with the 1M version of the dataset makes performance scale up🚀!
Results on ScreenSpot. † indicates results reproduced by us. We report accuracy on the six ScreenSpot splits and the average score; the best result in each column is marked in bold. (1M) denotes models trained on the 1M version of the dataset.
| Model | Data Num | Data Size | Desktop Icon | Desktop Text | Mobile Icon | Mobile Text | Web Icon | Web Text | Average |
|---|---|---|---|---|---|---|---|---|---|
| SeeClick-9.6B | 364K | - | 30.0 | 72.2 | 52.0 | 78.0 | 32.5 | 55.7 | 53.4 |
| UGround-7B | 1.3M | - | 63.6 | 82.5 | 60.3 | 82.8 | 80.4 | 73.3 | 70.4 |
| OmniParser-GPT-4V | - | - | 63.6 | 91.3 | 57.0 | 93.9 | 51.0 | 81.3 | 73.0 |
| ShowUI-2B | 256K | 0.72B | 61.1 | 76.3 | 75.5 | 92.3 | 63.6 | 81.7 | 75.1 |
| Qwen2.5-VL-3B † | - | - | 7.8 | 22.2 | 5.2 | 8.4 | 1.7 | 2.4 | 8.0 |
| Qwen2.5-VL-7B † | - | - | 16.4 | 26.8 | 5.2 | 6.6 | 7.3 | 13.0 | 12.6 |
| TongUI-3B | 399K | 1.24B | 68.5 | 86.5 | 76.0 | 90.5 | 68.4 | 87.4 | 79.6 |
| TongUI-7B | 399K | 1.24B | 75.0 | 91.2 | 79.9 | 93.0 | 72.3 | 88.7 | 83.4 |
| TongUI-3B(1M) | 1.3M | - | 77.1 | 92.3 | 77.7 | 92.6 | 74.8 | 87.8 | 83.6 |
| TongUI-7B(1M) | 1.3M | - | **80.0** | 93.8 | 79.5 | 91.9 | 81.6 | 89.1 | 86.0 |
| TongUI-32B(1M) | 1.3M | - | **80.0** | **94.8** | **84.3** | **96.3** | **84.5** | **91.3** | **88.5** |
Results on Mind2Web. We report results on three task splits: cross-task, cross-website, and cross-domain. Elem. Acc measures whether the correct element is selected, Op. F1 is the F1 score of the predicted action, and Step SR is the step success rate. † indicates results reproduced by us. (1M) denotes models trained on the 1M version of the dataset.
| Method | Cross-Task Elem. Acc | Cross-Task Op. F1 | Cross-Task Step SR | Cross-Website Elem. Acc | Cross-Website Op. F1 | Cross-Website Step SR | Cross-Domain Elem. Acc | Cross-Domain Op. F1 | Cross-Domain Step SR |
|---|---|---|---|---|---|---|---|---|---|
| CogAgent | 22.4 | 53.0 | 17.6 | 18.4 | 42.4 | 13.4 | 20.6 | 42.0 | 15.5 |
| MindAct | 55.1 | 75.7 | 52.0 | 42.0 | 65.2 | 38.9 | 42.1 | 66.5 | 39.6 |
| OmniParser | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
| ShowUI-2B | 39.9 | 88.6 | 37.2 | 41.6 | 83.5 | 35.1 | 39.4 | 86.8 | 35.2 |
| SeeClick-9.6B | 28.3 | 87.0 | 25.5 | 21.4 | 80.6 | 16.4 | 23.2 | 84.8 | 20.8 |
| Qwen2.5-VL-3B † | 2.5 | 14.5 | 0.4 | 2.7 | 12.6 | 1.0 | 3.3 | 24.2 | 1.7 |
| Qwen2.5-VL-7B † | 6.2 | 72.8 | 5.0 | 6.3 | 68.2 | 4.5 | 8.4 | 73.6 | 7.2 |
| Qwen2.5-VL-3B-ShowUI | 43.2 | 88.7 | 39.7 | 41.3 | 86.7 | 35.5 | 45.1 | 86.1 | 40.7 |
| TongUI-3B | 48.0 | 88.4 | 44.2 | 48.9 | 85.4 | 42.6 | 50.0 | 87.7 | 46.0 |
| TongUI-7B | 51.1 | 88.7 | 46.9 | 50.4 | 87.5 | 43.7 | 53.9 | 88.6 | 49.1 |
| TongUI-3B(1M) | 53.4 | 89.0 | 48.8 | 54.2 | 86.4 | 48.1 | 53.8 | 88.2 | 49.5 |
| TongUI-7B(1M) | 58.1 | 88.7 | 53.4 | 55.6 | 87.2 | 49.0 | 57.6 | 88.7 | 52.9 |
| TongUI-32B(1M) | 57.2 | 88.1 | 52.4 | 57.4 | 85.8 | 50.6 | 59.2 | 87.8 | 54.1 |
For other experiments, please refer to our paper.
We use uv to manage the dependencies.

```bash
uv sync --all-groups
```

Alternatively, you can use conda and pip to install the dependencies.

```bash
conda create -n tongui python=3.12
conda activate tongui
pip install -e .
```

To execute any script with uv, use the following command.

```bash
uv run <script_name>.py
```

If you installed the dependencies with conda and pip, just replace `uv run` with `python`.

```bash
python <script_name>.py
```
We host an online Gradio demo on Hugging Face Spaces; please feel free to try it. We also open-source the code for this demo, so feel free to run it locally.

```bash
git clone https://huggingface.co/spaces/Bofeee5675/TongUI
cd TongUI
uv run app.py
```

You can also call the TongUI API programmatically by using the following code.

```bash
uv run examples/api.py
```
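If you want a rough idea of what such a programmatic call can look like, below is a minimal sketch using gradio_client against the hosted Space. The endpoint name and argument order are hypothetical placeholders, not TongUI's documented API; examples/api.py is the authoritative version, and view_api() will show the Space's real signature.

```python
# Minimal sketch (hypothetical endpoint/argument names) of calling the hosted
# TongUI Space with gradio_client. See examples/api.py for the real usage.
from gradio_client import Client, handle_file

client = Client("Bofeee5675/TongUI")
client.view_api()  # prints the Space's actual endpoints and their parameters

# Hypothetical call shape: replace api_name and the arguments with what
# view_api() reports for this Space.
result = client.predict(
    handle_file("screenshot.png"),  # a local GUI screenshot
    "Click the search button",      # the instruction to ground
    api_name="/predict",            # hypothetical endpoint name
)
print(result)
```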
You can serve the model with vLLM.

```bash
uv run vllm serve Bofeee5675/TongUI-3B --port 8000 --served-model-name tongui-3b --limit-mm-per-prompt image=3
```

Then you can call the model through its OpenAI-compatible API. Check out examples/call_vllm.py for more details.

```bash
uv run examples/call_vllm.py
```
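For reference, here is a minimal sketch of such a call with the openai Python client, assuming the server started above is listening on localhost:8000 and serving the model as tongui-3b. The screenshot path and instruction are illustrative only; the exact prompt format TongUI expects is shown in examples/call_vllm.py.

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# The image path and instruction below are placeholders.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:  # any GUI screenshot
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="tongui-3b",  # must match --served-model-name
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
                {
                    "type": "text",
                    "text": "Where should I click to open the Settings menu?",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```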
Check out examples/inference.py for local inference.

```bash
uv run examples/inference.py
```
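If you prefer to skip the repository scripts, the sketch below shows the general shape of local inference with plain transformers, under the assumption that the TongUI checkpoints load with the Qwen2.5-VL model class (TongUI builds on Qwen2.5-VL). The prompt and file name are illustrative; examples/inference.py remains the authoritative reference for TongUI's actual prompt and action format.

```python
# Minimal local-inference sketch, assuming the checkpoint loads as Qwen2.5-VL.
# The screenshot path and question are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Bofeee5675/TongUI-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("screenshot.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Where should I click to open the Settings menu?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
new_tokens = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```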
For detailed information about model training, including hyperparameters, data preprocessing, and training configurations, please refer to our Training Documentation.
For comprehensive experimental results, ablation studies, and evaluation details, please check our Experiments Documentation.
We thank the following projects for their wonderful work.
- We adopt the experiment setup and data preprocessing pipeline from ShowUI.
- We train our models with LLaMA-Factory.
- Thanks to the Qwen2.5-VL model series and UI-TARS for their great work.
If you find this work useful in your research, please consider citing:
```bibtex
@article{zhang2025tongui,
  title={TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials},
  author={Zhang, Bofei and Shang, Zirui and Gao, Zhi and Zhang, Wang and Xie, Rui and Ma, Xiaojian and Yuan, Tao and Wu, Xinxiao and Zhu, Song-Chun and Li, Qing},
  journal={arXiv preprint arXiv:2504.12679},
  year={2025}
}
```