GitHub - CircleRadon/TokenPacker: The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025


Comparisons with existing methods 💡

Updates 📌

[May 23, 2025] TokenPacker was accepted by IJCV 🎉🎉🎉.
[October 22, 2024] We integrated the TokenPacker-HD framework with Osprey to achieve fine-grained, high-resolution pixel-level understanding with large performance gains. Please see the code in this branch for reference.
[July 25, 2024] We released the checkpoints.
[July 3, 2024] We released the TokenPacker paper on arXiv.
[July 3, 2024] We released the training and inference code.

What is TokenPacker 👀

TokenPacker is a novel visual projector that adopts a coarse-to-fine scheme, injecting enriched characteristics to generate condensed visual tokens. With TokenPacker, we can compress the visual tokens by 75%∼89% while achieving comparable or even better performance across diverse benchmarks, with significantly higher efficiency.
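The compression range quoted above follows from simple arithmetic if the projector shrinks the visual token grid by a scale factor s along each side, so that 1/s² of the tokens remain. The sketch below assumes a 336x336 input split into 14x14-pixel patches (a 24x24 grid of 576 tokens, as produced by the CLIP ViT-L/14 encoder used in LLaVA-1.5). The function name is hypothetical and is not part of this repository.

```python
def packed_token_count(
    image_size: int = 336, patch_size: int = 14, scale_factor: int = 2
) -> tuple[int, int, float]:
    """Return (tokens_before, tokens_after, fraction_removed).

    Assumption: the token grid is downsampled by `scale_factor` per side.
    """
    grid = image_size // patch_size  # tokens per side before compression
    if grid % scale_factor != 0:
        raise ValueError("grid side must be divisible by scale_factor")
    before = grid * grid
    after = (grid // scale_factor) ** 2
    return before, after, 1 - after / before


for s in (2, 3, 4):
    before, after, removed = packed_token_count(scale_factor=s)
    print(f"scale_factor={s}: {before} -> {after} tokens ({removed:.1%} removed)")
```

Under these assumptions a scale factor of 2 leaves 144 tokens (75% removed), which matches the token number listed for the 336x336 models in the Model Zoo, and a scale factor of 3 leaves 64 tokens (about 89% removed).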

Algorithms

We provide pseudocode to showcase the detailed processing flow.

Core codes

As a visual projector, TokenPacker is implemented as the class TokenPacker, which can be found in multimodal_projector/builder.py.
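One way to picture a coarse-to-fine scheme is to pair each condensed token with the block of original visual tokens it summarizes. The sketch below only computes that index pairing on a flattened, row-major token grid. It is an illustration under that assumption, not the code in builder.py, and `region_indices` is a hypothetical helper.

```python
def region_indices(grid: int, scale_factor: int) -> list[list[int]]:
    """For each coarse token, list the flat indices of its fine-grid block.

    Assumption: coarse token (ci, cj) covers the scale_factor x scale_factor
    block of fine tokens starting at row ci*scale_factor, col cj*scale_factor.
    """
    if grid % scale_factor != 0:
        raise ValueError("grid side must be divisible by scale_factor")
    coarse = grid // scale_factor
    regions = []
    for ci in range(coarse):
        for cj in range(coarse):
            block = [
                (ci * scale_factor + di) * grid + (cj * scale_factor + dj)
                for di in range(scale_factor)
                for dj in range(scale_factor)
            ]
            regions.append(block)
    return regions


regions = region_indices(grid=24, scale_factor=2)
print(len(regions), regions[0])
```

Every fine token lands in exactly one block, so a projector built on such a pairing could let each condensed token draw on its own local region instead of the whole image.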

Comparisons with various projectors

High-Resolution Image Understanding with TokenPacker 🔬

To support efficient high-resolution image understanding, we further developed an effective image cropping method, TokenPacker-HD.
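The cropping method itself is not described in this README. As a generic illustration of how a patch budget can be turned into a crop layout, the sketch below picks the rows x cols grid (with rows * cols no larger than the budget) whose aspect ratio is closest to the image's. This heuristic is an assumption made for illustration and may differ from what TokenPacker-HD actually does; `choose_grid` is a hypothetical name.

```python
import math


def choose_grid(width: int, height: int, max_patches: int = 9) -> tuple[int, int]:
    """Return (rows, cols) with rows * cols <= max_patches.

    Assumption: the best grid minimizes the log aspect-ratio error;
    ties are broken in favor of the grid with more patches.
    """
    target = math.log(width / height)
    best, best_key = (1, 1), None
    for rows in range(1, max_patches + 1):
        for cols in range(1, max_patches // rows + 1):
            err = abs(math.log(cols / rows) - target)
            key = (err, -(rows * cols))  # closer ratio first, then more patches
            if best_key is None or key < best_key:
                best_key, best = key, (rows, cols)
    return best


print(choose_grid(1000, 1000, max_patches=9))
print(choose_grid(336, 1344, max_patches=4))
```

Comparing aspect ratios on a log scale treats wide and tall images symmetrically, so a 4:1 image and a 1:4 image receive mirrored grids.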

Install 🛠️

Clone this repository and navigate to the TokenPacker folder

git clone https://github.com/CircleRadon/TokenPacker.git
cd TokenPacker

Install packages

conda create -n tokenpacker python=3.10 -y
conda activate tokenpacker
pip install --upgrade pip # enable PEP 660 support
pip install -e .

Install additional packages for training

pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Training 🚀

LLaVA-TokenPacker

Dataset

To make a fair comparison, we use the same training data as in LLaVA-1.5, i.e., LLaVA-Pretrain-558K for stage 1, and Mix665k for stage 2.

Training

Stage1: Image-Text Alignment Pre-training

bash scripts/v1_5/pretrain.sh

Stage2: Visual Instruction Tuning

bash scripts/v1_5/finetune.sh

Note: Use --scale_factor to control the compression ratio; supported values are 2, 3, and 4.

LLaVA-TokenPacker-HD

Dataset

To obtain competitive high-resolution performance, we use the 2.7M samples organized by Mini-Gemini, i.e., 1.2M for stage 1 and 1.5M for stage 2.

Training

Stage1: Image-Text Alignment Pre-training

bash scripts/v1_5/pretrain_hd.sh

Stage2: Visual Instruction Tuning

bash scripts/v1_5/finetune_hd.sh

Note:

Use --scale_factor to control the compression ratio; supported values are 2, 3, and 4.
Use --patch_num to control the maximum number of patches an image is divided into; supported values are 9, 16, and 25.

Experiments

Model Zoo

Model	Max Resolution	Compression Ratio	Token Number	Max Patch Number	Training Data	Download
TokenPacker-7b	336x336	1/4	144	-	558K+665K	checkpoints
TokenPacker-13b	336x336	1/4	144	-	558K+665K	checkpoints
TokenPacker-HD-7b	1088x1088	1/4	~954	9	1.2M+1.5M	checkpoints
TokenPacker-HD-13b	1088x1088	1/4	~954	9	1.2M+1.5M	checkpoints
TokenPacker-HD-13b	1344x1344	1/4	~1393	16	1.2M+1.5M	checkpoints
TokenPacker-HD-13b	1344x1344	1/9	~619	16	1.2M+1.5M	checkpoints
TokenPacker-HD-13b	1344x1344	1/16	~347	16	1.2M+1.5M	checkpoints

Note:

The token number of TokenPacker-HD is the average across all training and test data.
The 558K+665K training data follows LLaVA-1.5; the 1.2M+1.5M training data follows Mini-Gemini.
All models use Vicuna-7b/13b as the base LLM.

Visualization

We provide some visual examples.

High-resolution image understanding.

TODO List 📝

Release the training and inference codes.
Release all checkpoints.

Acknowledgement 💌

LLaVA-v1.5: the codebase we built upon.
Mini-Gemini: the organized data we used to train the high-resolution method.

For more recent related work, please refer to the Awesome-Token-Compress repo.

BibTeX 🖊️

@misc{TokenPacker,
 title={TokenPacker: Efficient Visual Projector for Multimodal LLM},
 author={Wentong Li and Yuqian Yuan and Jian Liu and Dongqi Tang and Song Wang and Jianke Zhu and Lei Zhang},
 year={2024},
 eprint={2407.02392},
 archivePrefix={arXiv},
 primaryClass={cs.CV}
}
