
Training a Vision-Language-Action (VLA) model for GUI and computer-use tasks by watching online tutorials. The dataset, model, and training pipeline are fully open-sourced, providing a cost-efficient solution for generating GUI task data.

A VLA model and agent framework designed for GUI tasks.

📑 Paper | 🤗 HuggingFace Collections (Models & Datasets) | 🤖 ModelScope Collections (Models & Datasets) | 🤗 Spaces Demo | 🌐 Webpage

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Bofei Zhang*, Zirui Shang*, Zhi Gao*, Wang Zhang, Rui Xie, Xiaojian Ma, Yuan Tao, Xinxiao Wu, Song-Chun Zhu, Qing Li✉


🌟 Updates

📊 Performance

Key findings

  • Training on this cost-efficient dataset gives SOTA👑 performance on multiple GUI benchmarks!
  • Training on the 1M version of the dataset scales performance up further🚀!

Results on ScreenSpot; † means the results are reproduced by us. We report results on the six ScreenSpot splits and the average score. The best method is marked in bold. (1M) denotes models trained on the 1M version of the dataset.

| Model | Data Num | Data Size | Desktop Icon | Desktop Text | Mobile Icon | Mobile Text | Web Icon | Web Text | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SeeClick-9.6B | 364K | - | 30.0 | 72.2 | 52.0 | 78.0 | 32.5 | 55.7 | 53.4 |
| UGround-7B | 1.3M | - | 63.6 | 82.5 | 60.3 | 82.8 | 80.4 | 73.3 | 70.4 |
| OmniParser-GPT-4V | - | - | 63.6 | 91.3 | 57.0 | 93.9 | 51.0 | 81.3 | 73.0 |
| ShowUI-2B | 256K | 0.72B | 61.1 | 76.3 | 75.5 | 92.3 | 63.6 | 81.7 | 75.1 |
| Qwen2.5-VL-3B † | - | - | 7.8 | 22.2 | 5.2 | 8.4 | 1.7 | 2.4 | 8.0 |
| Qwen2.5-VL-7B † | - | - | 16.4 | 26.8 | 5.2 | 6.6 | 7.3 | 13.0 | 12.6 |
| TongUI-3B | 399K | 1.24B | 68.5 | 86.5 | 76.0 | 90.5 | 68.4 | 87.4 | 79.6 |
| TongUI-7B | 399K | 1.24B | 75.0 | 91.2 | 79.9 | 93.0 | 72.3 | 88.7 | 83.4 |
| TongUI-3B (1M) | 1.3M | - | 77.1 | 92.3 | 77.7 | 92.6 | 74.8 | 87.8 | 83.6 |
| TongUI-7B (1M) | 1.3M | - | 80.0 | 93.8 | 79.5 | 91.9 | 81.6 | 89.1 | 86.0 |
| TongUI-32B (1M) | 1.3M | - | 80.0 | 94.8 | 84.3 | 96.3 | 84.5 | 91.3 | 88.5 |

Results on Mind2Web. We report results on three task types: cross-task, cross-website, and cross-domain. Elem. Acc measures whether the correct element is selected, OP. F1 is the F1 score of the predicted action, and Step SR is the step-level success rate. (1M) denotes models trained on the 1M version of the dataset.

| Method | Cross-Task Elem. Acc | Cross-Task OP. F1 | Cross-Task Step SR | Cross-Website Elem. Acc | Cross-Website OP. F1 | Cross-Website Step SR | Cross-Domain Elem. Acc | Cross-Domain OP. F1 | Cross-Domain Step SR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CogAgent | 22.4 | 53.0 | 17.6 | 18.4 | 42.4 | 13.4 | 20.6 | 42.0 | 15.5 |
| MindAct | 55.1 | 75.7 | 52.0 | 42.0 | 65.2 | 38.9 | 42.1 | 66.5 | 39.6 |
| OmniParser | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
| ShowUI-2B | 39.9 | 88.6 | 37.2 | 41.6 | 83.5 | 35.1 | 39.4 | 86.8 | 35.2 |
| SeeClick-9.6B | 28.3 | 87.0 | 25.5 | 21.4 | 80.6 | 16.4 | 23.2 | 84.8 | 20.8 |
| Qwen2.5-VL-3B † | 2.5 | 14.5 | 0.4 | 2.7 | 12.6 | 1.0 | 3.3 | 24.2 | 1.7 |
| Qwen2.5-VL-7B † | 6.2 | 72.8 | 5.0 | 6.3 | 68.2 | 4.5 | 8.4 | 73.6 | 7.2 |
| Qwen2.5-VL-3B-ShowUI | 43.2 | 88.7 | 39.7 | 41.3 | 86.7 | 35.5 | 45.1 | 86.1 | 40.7 |
| TongUI-3B | 48.0 | 88.4 | 44.2 | 48.9 | 85.4 | 42.6 | 50.0 | 87.7 | 46.0 |
| TongUI-7B | 51.1 | 88.7 | 46.9 | 50.4 | 87.5 | 43.7 | 53.9 | 88.6 | 49.1 |
| TongUI-3B (1M) | 53.4 | 89.0 | 48.8 | 54.2 | 86.4 | 48.1 | 53.8 | 88.2 | 49.5 |
| TongUI-7B (1M) | 58.1 | 88.7 | 53.4 | 55.6 | 87.2 | 49.0 | 57.6 | 88.7 | 52.9 |
| TongUI-32B (1M) | 57.2 | 88.1 | 52.4 | 57.4 | 85.8 | 50.6 | 59.2 | 87.8 | 54.1 |

For other experiments, please refer to our paper.

👋 Getting Started

We use uv to manage the dependencies.

uv sync --all-groups

Alternatively, you can use conda and pip to install the dependencies:

conda create -n tongui python=3.12
conda activate tongui
pip install -e .

To execute any script with uv, use the following command:

uv run <script_name>.py

If you installed the dependencies with conda and pip, simply replace uv run with python:

python <script_name>.py

Gradio Demo (Local or Online)

We host an online Gradio demo on Hugging Face Spaces; feel free to try it. The demo code is also open source, so you can run it locally:

git clone https://huggingface.co/spaces/Bofeee5675/TongUI
cd TongUI
uv run app.py

API Calling

You can call the TongUI API programmatically using the following script:

uv run examples/api.py
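
The script examples/api.py is the authoritative reference. As a rough illustration only, programmatic access to the hosted Spaces demo could look like the sketch below using gradio_client; the endpoint name, input order, and file names are assumptions and will likely differ from the real demo.

```python
# Hedged sketch: only the Space ID comes from this README; the endpoint name
# and inputs below are assumptions -- see examples/api.py for the real recipe.
from gradio_client import Client, handle_file

client = Client("Bofeee5675/TongUI")       # the hosted Spaces demo
result = client.predict(
    handle_file("screenshot.png"),         # hypothetical screenshot input
    "Click the search button.",            # hypothetical instruction input
    api_name="/predict",                   # hypothetical endpoint name
)
print(result)
```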

Serve Model By vLLM

You can serve the model with vLLM:

uv run vllm serve Bofeee5675/TongUI-3B --port 8000 --served-model-name tongui-3b --limit-mm-per-prompt image=3

Then, you can call the model through its OpenAI-compatible API. Check out examples/call_vllm.py for more details.

uv run examples/call_vllm.py
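
For orientation, here is a minimal sketch of such a call with the openai Python client, assuming the vllm serve command above is running on localhost:8000 with served model name tongui-3b; the prompt wording and image URL are placeholders, and examples/call_vllm.py remains the authoritative version.

```python
# Minimal sketch: query the vLLM server via its OpenAI-compatible endpoint.
# Assumes the `vllm serve` command above is running locally on port 8000;
# the image URL and prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tongui-3b",  # matches --served-model-name above
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/screenshot.png"}},  # placeholder image
            {"type": "text", "text": "Where should I click to open Settings?"},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)
```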

Local Model

Check out examples/inference.py for local inference.

uv run examples/inference.py
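
For reference, a minimal local-inference sketch is shown below, assuming TongUI-3B loads with the standard Qwen2.5-VL recipe (transformers plus qwen_vl_utils, which TongUI is fine-tuned from); the prompt format and screenshot path are assumptions, and examples/inference.py remains the authoritative version.

```python
# Hedged sketch of local inference following the standard Qwen2.5-VL recipe;
# the prompt and "screenshot.png" are placeholders.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Bofeee5675/TongUI-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},   # hypothetical local screenshot
        {"type": "text", "text": "Click the search button."},
    ],
}]

# Build the chat prompt and pack the image into model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```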

🔧 Training Details

For detailed information about model training, including hyperparameters, data preprocessing, and training configurations, please refer to our Training Documentation.

📚 Experiments

For comprehensive experimental results, ablation studies, and evaluation details, please check our Experiments Documentation.

🌟 Star History

Star History Chart

Acknowledgement

We thank the following projects for their wonderful work.

Citation

If you find this work useful in your research, please consider citing:

@article{zhang2025tongui,
 title={TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials},
 author={Zhang, Bofei and Shang, Zirui and Gao, Zhi and Zhang, Wang and Xie, Rui and Ma, Xiaojian and Yuan, Tao and Wu, Xinxiao and Zhu, Song-Chun and Li, Qing},
 journal={arXiv preprint arXiv:2504.12679},
 year={2025}
}
