Paper • Demo • 🤗 LongLLaVA-53B-A13B • 🤗 LongLLaVA-9B
- [2024-09-05] LongLLaVA repo is published!
- [2024-10-12] LongLLaVA-53B-A13B, LongLLaVA-9B and Jamba-9B-Instruct are released!
[Figure: LongLLaVA architecture]
[Figures: Results]
- Main Results
- Diagnostic Results
- Video-NIAH
pip install -r requirements.txt
Dataset Taxonomy
- Dataset Downloading and Construction
Coming Soon.
- Downloading Language Models
🤗 Jamba-9B-Instruct
- Stage I: Single-image Alignment.
bash Align.sh
- Stage II: Single-image Instruction-tuning.
bash SingleImageSFT.sh
- Stage III: Multi-image Instruction-tuning.
bash MultiImageSFT.sh
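The three stages build on each other, so it can help to launch them from one driver that stops as soon as a stage fails. The following is a minimal sketch, not part of the repository: it assumes the three scripts sit in the repository root and take no arguments, and the names `STAGE_SCRIPTS`, `build_commands` and `run_stages` are hypothetical.

```python
import subprocess

# Stage scripts in the order listed above (Stage I, II, III).
STAGE_SCRIPTS = ["Align.sh", "SingleImageSFT.sh", "MultiImageSFT.sh"]


def build_commands():
    """Return one `bash <script>` command per stage, in training order."""
    return [["bash", name] for name in STAGE_SCRIPTS]


def run_stages(repo_root=".", dry_run=False):
    """Run the stages sequentially from repo_root; stop at the first failure."""
    executed = []
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so a later stage never starts after a failed earlier one.
            subprocess.run(cmd, cwd=repo_root, check=True)
        executed.append(cmd)
    return executed


if __name__ == "__main__":
    # Assumption: dry run by default; pass dry_run=False to launch training.
    run_stages(dry_run=True)
```

With `dry_run=True` the driver only prints the commands, which is a cheap way to confirm the order before committing GPU time.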
- Command Line Interface
python cli.py --model_dir path-to-longllava
- Model Inference
from cli import Chatbot

query = 'What does the picture show?'
image_paths = ['image_path1']  # image or video path

bot = Chatbot('path-to-longllava')  # path to the model directory
output = bot.chat(query, image_paths)
print(output)  # prints the output of the model
- Benchmarks
bash Eval.sh
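Since the model targets many-image inputs, the Model Inference call above can be wrapped so that one query covers every image or video in a folder. This is a sketch under assumptions: it relies only on the `chat(query, image_paths)` call shown above and assumes the list may hold more than one path; the helpers `collect_media_paths` and `ask_about_directory` and the extension set are hypothetical.

```python
from pathlib import Path

# Extensions treated as images or videos here; an assumption, adjust as needed.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".mp4"}


def collect_media_paths(directory):
    """Return the media files in `directory` as sorted path strings."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )


def ask_about_directory(bot, query, directory):
    """Send one query over every media file found in `directory`.

    `bot` is any object exposing chat(query, image_paths), such as the
    Chatbot from cli.py used in the Model Inference example.
    """
    paths = collect_media_paths(directory)
    if not paths:
        raise ValueError(f"no image or video files found in {directory!r}")
    return bot.chat(query, paths)
```

Usage would then look like `ask_about_directory(Chatbot('path-to-longllava'), 'What do these pictures show?', 'my_images')`, where `my_images` is a placeholder folder name.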
- FLOPs
python ./utils/cal_flops.py
- Prefill Time & Throughput & GPU Memory Usage
python ./benchmarks/Efficiency/evaluate.py
python ./benchmarks/Efficiency/evaluatevllm.py
- DownCycling To Transfer Jamba-MoE to Dense
python ./utils/dense_downcycling.py
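For readers unfamiliar with the efficiency metrics named above, the sketch below shows one common way prefill time and decoding throughput are defined. It is an illustration only and does not reproduce what the scripts in `./benchmarks/Efficiency/` do: prefill time is approximated as the delay before the first token, throughput as the remaining tokens per second, and GPU memory is not covered. `measure_generation` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a real model.

```python
import time


def measure_generation(token_stream):
    """Time a token iterator; return prefill time and decode throughput.

    Prefill time is approximated as the delay until the first token arrives;
    throughput is the remaining tokens divided by the remaining time.
    """
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in token_stream:
        count += 1
        if first_token_time is None:
            first_token_time = time.perf_counter()
    end = time.perf_counter()
    if first_token_time is None:
        raise ValueError("token_stream produced no tokens")
    decode_tokens = count - 1
    decode_s = end - first_token_time
    throughput = decode_tokens / decode_s if decode_tokens and decode_s > 0 else 0.0
    return {
        "tokens": count,
        "prefill_s": first_token_time - start,
        "decode_tokens_per_s": throughput,
    }


def fake_stream(n_tokens, prefill_delay=0.05, step_delay=0.01):
    """Stand-in generator: one slow prefill, then faster decode steps."""
    time.sleep(prefill_delay)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(step_delay)
        yield i


if __name__ == "__main__":
    print(measure_generation(fake_stream(20)))
```

Separating the two phases matters for long multi-image contexts, where the prefill cost grows with the number of input tokens while per-token decoding cost is what throughput reports.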
- Release Data Construction Code
- LLaVA: Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
@misc{wang2024longllavascalingmultimodalllms,
title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture},
author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
year={2024},
eprint={2409.02889},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.02889},
}