CircleRadon/TokenPacker


Comparisons with existing methods 💡

Updates 📌

  • [2025-05-23] TokenPacker has been accepted by IJCV 🎉🎉🎉.
  • [2024-10-22] We integrated the TokenPacker-HD framework with Osprey to achieve fine-grained, high-resolution pixel-level understanding with large performance gains. Please see the code in this branch for reference.
  • [2024-07-25] We released the checkpoints; please check them.
  • [2024-07-03] We released the TokenPacker paper on arXiv.
  • [2024-07-03] We released the training and inference code.

What is TokenPacker 👀

TokenPacker is a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics into condensed visual tokens. With TokenPacker, the visual tokens can be compressed by 75%~89% while achieving comparable or even better performance across diverse benchmarks, with significantly higher efficiency.

Algorithms

We provide pseudocode to showcase the detailed processing flow.

Core code

As a visual projector, TokenPacker is implemented by the class TokenPacker, which can be found in multimodal_projector/builder.py.

Comparisons with various projectors

High-Resolution Image Understanding with TokenPacker 🔬

To support efficient high-resolution image understanding, we further develop an effective image cropping method, TokenPacker-HD.

Install 🛠️

  1. Clone this repository and navigate to the TokenPacker folder
git clone https://github.com/CircleRadon/TokenPacker.git
cd TokenPacker
  2. Install packages
conda create -n tokenpacker python=3.10 -y
conda activate tokenpacker
pip install --upgrade pip # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Training 🚀

LLaVA-TokenPacker

Dataset

To make a fair comparison, we use the same training data as in LLaVA-1.5, i.e., LLaVA-Pretrain-558K for stage 1, and Mix665k for stage 2.

Training

  • Stage1: Image-Text Alignment Pre-training
bash scripts/v1_5/pretrain.sh
  • Stage2: Visual Instruction Tuning
bash scripts/v1_5/finetune.sh

Note: Use --scale_factor to control the compression ratio; supported values are [2, 3, 4].
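
As a rough illustration of what the scale factor means in token counts, the sketch below makes two assumptions that are not taken from the repository code: the vision encoder splits the image into 14-pixel patches (so a 336x336 input gives a 24x24 grid of 576 tokens, as in LLaVA-1.5), and a scale factor s shrinks each side of that grid by s. The helper names packed_token_count and reduction are hypothetical.

```python
def packed_token_count(image_size: int = 336, patch_size: int = 14,
                       scale_factor: int = 2) -> int:
    """Visual tokens left after packing, under the assumptions stated above."""
    grid = image_size // patch_size  # tokens per side before packing
    if grid % scale_factor != 0:
        raise ValueError("scale_factor must divide the token grid evenly")
    side = grid // scale_factor      # tokens per side after packing
    return side * side


def reduction(scale_factor: int) -> float:
    """Fraction of tokens removed when each grid side shrinks by scale_factor."""
    return 1 - 1 / scale_factor**2


if __name__ == "__main__":
    for s in (2, 3, 4):
        print(f"scale_factor={s}: {packed_token_count(scale_factor=s)} tokens, "
              f"{reduction(s):.1%} fewer")
```

Under these assumptions, a scale factor of 2 leaves 144 tokens, which agrees with the token number listed for the 336x336 models in the Model Zoo, and scale factors 2 and 3 correspond to reductions of 75% and roughly 89%.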

LLaVA-TokenPacker-HD

Dataset

To obtain competitive high-resolution performance, we use the 2.7M data organized by Mini-Gemini, i.e., 1.2M for stage 1 and 1.5M for stage 2.

Training

  • Stage1: Image-Text Alignment Pre-training
bash scripts/v1_5/pretrain_hd.sh
  • Stage2: Visual Instruction Tuning
bash scripts/v1_5/finetune_hd.sh

Note:

  • Use --scale_factor to control the compression ratio; supported values are [2, 3, 4].
  • Use --patch_num to control the maximum number of patches the image is divided into; supported values are [9, 16, 25].

Experiments

Model Zoo

| Model | Max Res. | Compression Ratio | Token Num. | Max Patch Num. | Training Data | Download |
|---|---|---|---|---|---|---|
| TokenPacker-7b | 336x336 | 1/4 | 144 | - | 558K+665K | checkpoints |
| TokenPacker-13b | 336x336 | 1/4 | 144 | - | 558K+665K | checkpoints |
| TokenPacker-HD-7b | 1088x1088 | 1/4 | ~954 | 9 | 1.2M+1.5M | checkpoints |
| TokenPacker-HD-13b | 1088x1088 | 1/4 | ~954 | 9 | 1.2M+1.5M | checkpoints |
| TokenPacker-HD-13b | 1344x1344 | 1/4 | ~1393 | 16 | 1.2M+1.5M | checkpoints |
| TokenPacker-HD-13b | 1344x1344 | 1/9 | ~619 | 16 | 1.2M+1.5M | checkpoints |
| TokenPacker-HD-13b | 1344x1344 | 1/16 | ~347 | 16 | 1.2M+1.5M | checkpoints |

Note:

  • The token number for TokenPacker-HD is the average computed over all training and test data.
  • The 558K+665K training data follows LLaVA-1.5; the 1.2M+1.5M training data follows Mini-Gemini.
  • All models use Vicuna-7b/13b as the base LLM.
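
A quick consistency check on the table above: if the average token count scales in proportion to the compression ratio, the 1/9 and 1/16 rows for TokenPacker-HD-13b at 1344x1344 should be close to the 1/4 row rescaled. The proportionality is an assumption made here for illustration, and rescale is a hypothetical helper, not part of the repository.

```python
from fractions import Fraction

# Average token counts listed for TokenPacker-HD-13b at 1344x1344.
reported = {Fraction(1, 4): 1393, Fraction(1, 9): 619, Fraction(1, 16): 347}


def rescale(tokens_at_base: int, base_ratio: Fraction, new_ratio: Fraction) -> float:
    """Token count expected at new_ratio if counts are proportional to the ratio."""
    return float(tokens_at_base * new_ratio / base_ratio)


if __name__ == "__main__":
    base = Fraction(1, 4)
    for ratio, tokens in reported.items():
        estimate = rescale(reported[base], base, ratio)
        print(f"ratio {ratio}: listed ~{tokens}, rescaled from 1/4: {estimate:.1f}")
```

The rescaled estimates (about 619.1 and 348.3) sit within a couple of tokens of the listed averages, which suggests the listed counts are mutually consistent under this assumption.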

Visualization

We provide some visual examples.

High-resolution image understanding.

TODO List 📝

  • Release the training and inference codes.
  • Release all checkpoints.

Acknowledgement 💌

  • LLaVA-v1.5: the codebase we built upon.
  • Mini-Gemini: the organized data we used for training the high-resolution method.

More

For more recent related works, please refer to this repo of Awesome-Token-Compress.

BibTeX 🖊️

@misc{TokenPacker,
 title={TokenPacker: Efficient Visual Projector for Multimodal LLM},
 author={Wentong Li and Yuqian Yuan and Jian Liu and Dongqi Tang and Song Wang and Jianke Zhu and Lei Zhang},
 year={2024},
 eprint={2407.02392},
 archivePrefix={arXiv},
 primaryClass={cs.CV}
}

About

The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025
