[ECCV 2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.

Ucas-HaoranWei/Vary

Haoran Wei*, Lingyu Kong*, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang

Release

  • [2024/12/24] 🔥🔥🔥 My new work on system-2 perception, slow-perception, is released.
  • [2024/9/03] 🔥🔥🔥 We release a very strong and comprehensive OCR model, GOT-OCR2.0.
  • [2024/7/16] 🎉🎉🎉 OneChart is accepted as an oral at ACM MM 2024 (3.97%)!
  • [2024/7/2] 🔥🔥🔥 Vary is accepted by ECCV 2024. To thank everyone for their attention, I will soon release a model that performs on par with Vary-document.
  • [2024/5/27] 🔥🔥🔥 We present a document understanding benchmark in Fox.
  • [2024/5/24] 🔥🔥🔥 We propose a multi-page document understanding work, Fox, which supports 8-page PDF-image input!
  • [2024/4/21] 🔥🔥🔥 For OneChart, we have released the web demo on the Project Page. Have fun!
  • [2024/4/21] 🔥🔥🔥 We present a Vary-tiny LAVIS codebase (for training from scratch) and the Vary-600k dataset (300K English and 300K Chinese pages) here!
  • [2024/4/15] 🔥🔥🔥 We release a chart parsing model, OneChart, here.
  • [2024/4/12] 🔥🔥🔥 We will release a chart parsing model based on Vary-tiny next week. The model supports both English and Chinese charts.
  • [2024/3/16] 🔥🔥🔥 Many people have shown interest in Vary-tiny (OPT-125M), so I have open-sourced it here: a PDF-dense OCR and object detection version.
  • [2024/1/23] 🔥🔥🔥 We release Vary-toy here. We also show the strong Vary-family results here.
  • [2023/12/29] 🔥🔥🔥 We will release a new model (a small-size Vary, about 2B) at the beginning of next month and introduce a new feature (object detection). Our online demo will be temporarily closed to prepare for the deployment of the new model.
  • [2023/12/11] We released the online demo. Have fun!
  • [2023/12/11] We released the code of Vary (training and inference)!

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. Their use is also restricted to uses that follow the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.

Install

  1. Clone this repository and navigate to the Vary folder
git clone https://github.com/Ucas-HaoranWei/Vary.git
cd Vary
  2. Install Package
conda create -n vary python=3.10 -y
conda activate vary
pip install -e .
  3. Install Flash-Attention
pip install ninja
pip install flash-attn --no-build-isolation
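
After the install steps above, it can help to confirm that the key packages are visible to the active conda environment before running anything heavier. The following is a minimal sketch, not part of the repository. The module names `torch`, `deepspeed`, `flash_attn`, and `vary` are assumptions (the last is inferred from the `vary/demo/...` path used in the Demo section), so adjust them to your setup.

```python
import importlib.util

# Assumed module names; they are not confirmed by the README and may differ.
REQUIRED_MODULES = ["torch", "deepspeed", "flash_attn", "vary"]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules were found.")
```

Run it inside the activated `vary` environment. A missing `flash_attn` usually means the Flash-Attention build step did not complete.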

Vary Weights

  • If you urgently need the weights for your research, please contact me by email.
  • Download CLIP-ViT-L from Hugging Face.
  • The Vary-toy weights are available here.

Demo

  1. Update the CLIP-ViT path in the code (/cache/vit-large-patch14/) to your own path.
  2. Run the demo:
python vary/demo/run_qwen_vary.py --model-name /vary/model/path/ --image-file /an/image/file.png
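
To try the demo on several images, the command above can be wrapped in a small script. This is only a sketch under stated assumptions: the helper names `build_demo_command` and `run_demo_on_folder` are hypothetical, only the `--model-name` and `--image-file` flags shown above are used, and the script is assumed to handle one image per invocation and then exit.

```python
import shlex
import subprocess
from pathlib import Path

# Script path taken from the demo command in this README.
DEMO_SCRIPT = "vary/demo/run_qwen_vary.py"


def build_demo_command(model_path, image_file):
    """Build the argument list for one demo run (hypothetical helper)."""
    return [
        "python", DEMO_SCRIPT,
        "--model-name", str(model_path),
        "--image-file", str(image_file),
    ]


def run_demo_on_folder(model_path, image_dir, dry_run=True):
    """Build (and optionally run) one demo command per PNG file in image_dir."""
    images = sorted(Path(image_dir).glob("*.png"))
    commands = [build_demo_command(model_path, image) for image in images]
    for command in commands:
        if dry_run:
            # Only print the command so it can be inspected first.
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)
    return commands
```

With `dry_run=True` the function only prints the commands. Because every invocation starts a new process, the model is loaded again for each image, so this is suited to a handful of files rather than large batches.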

Train

  • We currently do not plan to open-source the intermediate weights.
  • However, we release the training code, so you can train on your own dataset as follows:
  1. For Vary-base (single machine; if you have multiple machines, you need to prepare a host file)
deepspeed Vary/train/train_qwen_vary.py --deepspeed /Vary/zero_config/zero2.json \
 --model_name_or_path /Qwen-7B/path/ \
 --vision_tower /vit-large-patch14/path/ \
 --freeze_vision_tower True \
 --freeze_lm_model False \
 --vision_select_layer -2 \
 --use_im_start_end True \
 --bf16 True \
 --per_device_eval_batch_size 4 \
 --gradient_accumulation_steps 1 \
 --evaluation_strategy "no" \
 --save_strategy "steps" \
 --save_steps 5000 \
 --save_total_limit 1 \
 --weight_decay 0. \
 --warmup_ratio 0.03 \
 --lr_scheduler_type "cosine" \
 --logging_steps 1 --tf32 True \
 --model_max_length 4096 \
 --gradient_checkpointing True \
 --dataloader_num_workers 4 \
 --report_to none \
 --per_device_train_batch_size 4 \
 --num_train_epochs 1 \
 --learning_rate 5e-5 \
 --datasets data_name1+data_name2+data_name3 \
 --output_dir /path/to/output/
  2. For Vary-tiny
deepspeed Vary/train/train_opt.py --deepspeed /Vary/zero_config/zero2.json \
 --model_name_or_path /opt125m/path/ \
 --conversation_version opt \
 --freeze_vision_tower False \
 --freeze_lm_model False \
 --use_im_start_end True \
 --bf16 True \
 --per_device_eval_batch_size 4 \
 --gradient_accumulation_steps 1 \
 --evaluation_strategy "no" \
 --save_strategy "steps" \
 --save_steps 5000 \
 --save_total_limit 1 \
 --weight_decay 0. \
 --warmup_ratio 0.03 \
 --lr_scheduler_type "cosine" \
 --logging_steps 1 --tf32 True \
 --model_max_length 4096 \
 --gradient_checkpointing True \
 --dataloader_num_workers 4 \
 --report_to none \
 --per_device_train_batch_size 16 \
 --num_train_epochs 1 \
 --learning_rate 5e-5 \
 --datasets data_name1+data_name2+data_name3 \
 --output_dir /path/to/output/
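
The batch size, warmup, and checkpoint flags in the two commands above interact with the number of GPUs and the dataset size, neither of which is stated here. The sketch below estimates the global batch size, the number of optimizer steps, and the number of warmup steps. The GPU count and sample count in the usage example are illustrative assumptions, and the step count is an approximation that ignores details such as dropped last batches.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Number of samples consumed per optimizer step across all GPUs."""
    return per_device_batch * grad_accum_steps * num_gpus


def schedule_summary(num_samples, per_device_batch, grad_accum_steps,
                     num_gpus, num_epochs, warmup_ratio):
    """Approximate step counts for a run (a rough estimate, not exact)."""
    global_batch = effective_batch_size(per_device_batch, grad_accum_steps, num_gpus)
    steps_per_epoch = math.ceil(num_samples / global_batch)
    total_steps = steps_per_epoch * num_epochs
    # warmup_ratio is a fraction of the total number of training steps.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "global_batch": global_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    # Assumed values for illustration: 8 GPUs and 1,000,000 training samples.
    # The other values come from the Vary-tiny command above.
    summary = schedule_summary(
        num_samples=1_000_000,
        per_device_batch=16,
        grad_accum_steps=1,
        num_gpus=8,
        num_epochs=1,
        warmup_ratio=0.03,
    )
    print(summary)
```

If the estimated total is below the `--save_steps` value of 5000, no intermediate checkpoint is written during training, so consider lowering `--save_steps` for small datasets. With `--save_total_limit 1`, only the most recent intermediate checkpoint is kept.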

Contact

If you have any questions related to the code or the paper, feel free to email (weihaoran18@mails.ucas.ac.cn).

Acknowledgement

  • LLaVA: the codebase we built upon!
  • Qwen: the LLM base model of Vary, which is good at both English and Chinese!

Citation

If you find our work useful in your research, please consider citing Vary:

@article{wei2023vary,
 title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
 author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
 journal={arXiv preprint arXiv:2312.06109},
 year={2023}
}
@article{wei2024small,
 title={Small Language Model Meets with Reinforced Vision Vocabulary},
 author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yu, En and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
 journal={arXiv preprint arXiv:2401.12503},
 year={2024}
}
