Training a Vision-Language-Action (VLA) model for GUI & computer-use tasks by watching online tutorials. Fully open-sourced dataset, models, and training pipeline. A cost-efficient solution for GUI task data generation.
A VLA model and agent framework designed for graphical user interface tasks.
📑 Paper | 🤗 HuggingFace Collections (Models & Datasets) | 🤖 ModelScope Collections (Models & Datasets) | 🤗 Spaces Demo | 🌐 Webpage
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Bofei Zhang*, Zirui Shang*, Zhi Gao*, Wang Zhang, Rui Xie, Xiaojian Ma, Yuan Tao, Xinxiao Wu, Song-Chun Zhu, Qing Li✉
- Release all experiment/evaluation scripts [WIP].
- [2025-10-23] Release TongUI-Absolute, an annotated dataset with absolute coordinate labels.
- [2025-07-10] Release crawler code and intermediate crawler data 🤗. Please feel free to process your own SFT dataset!
- [2025-06-16] Submit evaluation to UI-Vision! Check out the results here and how to reproduce them here.
- [2025-05-27] Release TongUI-32B model and Training Details.
- [2025-05-06] Release TongUI-7B model and GUI-Net-1M dataset.
- [2025-04-21] Release 🔧 Training pipeline.
- [2025-04-17] Release TongUI-3B model.
Key findings
- Training with this cost-efficient dataset gives SOTA👑 performance on multiple GUI benchmarks!
- Training with the 1M version of the dataset makes performance scale up🚀!
Results on ScreenSpot. † indicates results reproduced by us. We report accuracy on the six ScreenSpot splits and the average score; the best result in each column is marked in bold. (1M) denotes models trained on the 1M version of the dataset.
| Model | Data Num | Data Size | Desktop Icon | Desktop Text | Mobile Icon | Mobile Text | Web Icon | Web Text | Average |
|---|---|---|---|---|---|---|---|---|---|
| SeeClick-9.6B | 364K | - | 30.0 | 72.2 | 52.0 | 78.0 | 32.5 | 55.7 | 53.4 |
| UGround-7B | 1.3M | - | 63.6 | 82.5 | 60.3 | 82.8 | 80.4 | 73.3 | 70.4 |
| OmniParser-GPT-4V | - | - | 63.6 | 91.3 | 57.0 | 93.9 | 51.0 | 81.3 | 73.0 |
| ShowUI-2B | 256K | 0.72B | 61.1 | 76.3 | 75.5 | 92.3 | 63.6 | 81.7 | 75.1 |
| Qwen2.5-VL-3B † | - | - | 7.8 | 22.2 | 5.2 | 8.4 | 1.7 | 2.4 | 8.0 |
| Qwen2.5-VL-7B † | - | - | 16.4 | 26.8 | 5.2 | 6.6 | 7.3 | 13.0 | 12.6 |
| TongUI-3B | 399K | 1.24B | 68.5 | 86.5 | 76.0 | 90.5 | 68.4 | 87.4 | 79.6 |
| TongUI-7B | 399K | 1.24B | 75.0 | 91.2 | 79.9 | 93.0 | 72.3 | 88.7 | 83.4 |
| TongUI-3B(1M) | 1.3M | - | 77.1 | 92.3 | 77.7 | 92.6 | 74.8 | 87.8 | 83.6 |
| TongUI-7B(1M) | 1.3M | - | **80.0** | 93.8 | 79.5 | 91.9 | 81.6 | 89.1 | 86.0 |
| TongUI-32B(1M) | 1.3M | - | **80.0** | **94.8** | **84.3** | **96.3** | **84.5** | **91.3** | **88.5** |
Results on Mind2Web. We report results on three task splits: cross-task, cross-website, and cross-domain. Elem. Acc measures whether the correct element is selected, Op. F1 is the F1 score of the predicted action, and Step SR is the step success rate. † indicates results reproduced by us. (1M) denotes models trained on the 1M version of the dataset.
| Method | Cross-Task Elem. Acc | Cross-Task Op. F1 | Cross-Task Step SR | Cross-Website Elem. Acc | Cross-Website Op. F1 | Cross-Website Step SR | Cross-Domain Elem. Acc | Cross-Domain Op. F1 | Cross-Domain Step SR |
|---|---|---|---|---|---|---|---|---|---|
| CogAgent | 22.4 | 53.0 | 17.6 | 18.4 | 42.4 | 13.4 | 20.6 | 42.0 | 15.5 |
| MindAct | 55.1 | 75.7 | 52.0 | 42.0 | 65.2 | 38.9 | 42.1 | 66.5 | 39.6 |
| OmniParser | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
| ShowUI-2B | 39.9 | 88.6 | 37.2 | 41.6 | 83.5 | 35.1 | 39.4 | 86.8 | 35.2 |
| SeeClick-9.6B | 28.3 | 87.0 | 25.5 | 21.4 | 80.6 | 16.4 | 23.2 | 84.8 | 20.8 |
| Qwen2.5-VL-3B † | 2.5 | 14.5 | 0.4 | 2.7 | 12.6 | 1.0 | 3.3 | 24.2 | 1.7 |
| Qwen2.5-VL-7B † | 6.2 | 72.8 | 5.0 | 6.3 | 68.2 | 4.5 | 8.4 | 73.6 | 7.2 |
| Qwen2.5-VL-3B-ShowUI | 43.2 | 88.7 | 39.7 | 41.3 | 86.7 | 35.5 | 45.1 | 86.1 | 40.7 |
| TongUI-3B | 48.0 | 88.4 | 44.2 | 48.9 | 85.4 | 42.6 | 50.0 | 87.7 | 46.0 |
| TongUI-7B | 51.1 | 88.7 | 46.9 | 50.4 | 87.5 | 43.7 | 53.9 | 88.6 | 49.1 |
| TongUI-3B(1M) | 53.4 | 89.0 | 48.8 | 54.2 | 86.4 | 48.1 | 53.8 | 88.2 | 49.5 |
| TongUI-7B(1M) | 58.1 | 88.7 | 53.4 | 55.6 | 87.2 | 49.0 | 57.6 | 88.7 | 52.9 |
| TongUI-32B(1M) | 57.2 | 88.1 | 52.4 | 57.4 | 85.8 | 50.6 | 59.2 | 87.8 | 54.1 |
For other experiments, please refer to our paper.
We use uv to manage the dependencies.

```bash
uv sync --all-groups
```

Alternatively, you can use conda and pip to install the dependencies.

```bash
conda create -n tongui python=3.12
conda activate tongui
pip install -e .
```

To execute any script with uv, use the following command.

```bash
uv run <script_name>.py
```

If you installed the dependencies with conda and pip, just replace `uv run` with `python`.

```bash
python <script_name>.py
```
We host an online Gradio demo on Hugging Face Spaces; please feel free to try it. We also open-source the code for this demo, so feel free to run it locally.

```bash
git clone https://huggingface.co/spaces/Bofeee5675/TongUI
cd TongUI
uv run app.py
```

You can also call the TongUI API programmatically by using the following code.

```bash
uv run examples/api.py
```
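If you want a rough idea of what such a programmatic call can look like, below is a minimal sketch using gradio_client against the hosted Space. The endpoint name and argument order are hypothetical placeholders, not TongUI's documented API; examples/api.py is the authoritative version, and view_api() will show the Space's real signature.

```python
# Minimal sketch (hypothetical endpoint/argument names) of calling the hosted
# TongUI Space with gradio_client. See examples/api.py for the real usage.
from gradio_client import Client, handle_file

client = Client("Bofeee5675/TongUI")
client.view_api()  # prints the Space's actual endpoints and their parameters

# Hypothetical call shape: replace api_name and the arguments with what
# view_api() reports for this Space.
result = client.predict(
    handle_file("screenshot.png"),  # a local GUI screenshot
    "Click the search button",      # the instruction to ground
    api_name="/predict",            # hypothetical endpoint name
)
print(result)
```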
You can serve the model with vLLM.

```bash
uv run vllm serve Bofeee5675/TongUI-3B --port 8000 --served-model-name tongui-3b --limit-mm-per-prompt image=3
```

Then you can call the model through its OpenAI-compatible API. Check out examples/call_vllm.py for more details.

```bash
uv run examples/call_vllm.py
```
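For reference, here is a minimal sketch of such a call with the openai Python client, assuming the server started above is listening on localhost:8000 and serving the model as tongui-3b. The screenshot path and instruction are illustrative only; the exact prompt format TongUI expects is shown in examples/call_vllm.py.

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# The image path and instruction below are placeholders.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:  # any GUI screenshot
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="tongui-3b",  # must match --served-model-name
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
                {
                    "type": "text",
                    "text": "Where should I click to open the Settings menu?",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```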
Check out examples/inference.py for local inference.

```bash
uv run examples/inference.py
```
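If you prefer to skip the repository scripts, the sketch below shows the general shape of local inference with plain transformers, under the assumption that the TongUI checkpoints load with the Qwen2.5-VL model class (TongUI builds on Qwen2.5-VL). The prompt and file name are illustrative; examples/inference.py remains the authoritative reference for TongUI's actual prompt and action format.

```python
# Minimal local-inference sketch, assuming the checkpoint loads as Qwen2.5-VL.
# The screenshot path and question are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Bofeee5675/TongUI-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("screenshot.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Where should I click to open the Settings menu?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
new_tokens = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```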
For detailed information about model training, including hyperparameters, data preprocessing, and training configurations, please refer to our Training Documentation.
For comprehensive experimental results, ablation studies, and evaluation details, please check our Experiments Documentation.
We thank the following projects for their wonderful work.
- We adopt the experiment setup and data preprocessing pipeline from ShowUI.
- We train our models with LLaMA-Factory.
- Thanks to the Qwen2.5-VL model series and UI-TARS for their great work.
If you find this work useful in your research, please consider citing:
```bibtex
@article{zhang2025tongui,
  title={TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials},
  author={Zhang, Bofei and Shang, Zirui and Gao, Zhi and Zhang, Wang and Xie, Rui and Ma, Xiaojian and Yuan, Tao and Wu, Xinxiao and Zhu, Song-Chun and Li, Qing},
  journal={arXiv preprint arXiv:2504.12679},
  year={2025}
}
```