GitHub - FreedomIntelligence/LongLLaVA: LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

📃 Paper • 🌐 Demo • 🤗 LongLLaVA-53B-A13B • 🤗 LongLLaVA-9B

efficiency

🌈 Update

[2024/09/05] The LongLLaVA repo is published!🎉
[2024/10/12] LongLLaVA-53B-A13B, LongLLaVA-9B and Jamba-9B-Instruct are released!🎉

Architecture

Architecture Image

Results

Main Results
Diagnostic Results
Video-NIAH

Results reproduction

1. Environment Setup

pip install -r requirements.txt

2. Data Download and Construction

Dataset Taxonomy

Dataset

Dataset Downloading and Construction

Coming Soon.

3. Training

Downloading Language Models

🤗 Jamba-9B-Instruct
Stage I: Single-image Alignment.
```
bash Align.sh
```
Stage II: Single-image Instruction-tuning.
```
bash SingleImageSFT.sh
```
Stage III: Multi-image Instruction-tuning.
```
bash MultiImageSFT.sh
```
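
The three stages are meant to run in order. As a convenience, a small Python driver along the following lines could run them one after another and stop at the first failure. This is only a sketch: the `run_stages` helper and its `runner` parameter are hypothetical and not part of the repository; only the script names come from the steps above.

```python
import subprocess
from typing import Callable, List, Sequence

# Script names taken from the three training stages listed above.
STAGES = ("Align.sh", "SingleImageSFT.sh", "MultiImageSFT.sh")


def _run_with_bash(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_stages(
    stages: Sequence[str] = STAGES,
    runner: Callable[[Sequence[str]], int] = _run_with_bash,
) -> List[str]:
    """Run each stage script with bash, in order; stop at the first non-zero exit.

    Returns the stages that completed successfully.
    `runner` is a hypothetical hook so the loop can be dry-run without bash.
    """
    completed: List[str] = []
    for script in stages:
        if runner(["bash", script]) != 0:
            break  # a later stage should not start after a failed one
        completed.append(script)
    return completed


# Usage (from the repository root):
#   done = run_stages()
#   print("Completed stages:", done)
```

Stopping at the first non-zero exit code avoids launching instruction tuning on top of a failed alignment run.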

4. Evaluation

Command Line Interface

python cli.py --model_dir path-to-longllava

Model Inference

from cli import Chatbot  # cli.py sits at the repository root

query = 'What does the picture show?'
image_paths = ['image_path1']  # image or video path
model_dir = 'path-to-longllava'  # placeholder: path to the LongLLaVA weights

bot = Chatbot(model_dir)
output = bot.chat(query, image_paths)
print(output)  # prints the model's response
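
Because `image_paths` is a plain list, a multi-image query can be prepared by gathering files from a folder. The sketch below assumes that `bot.chat` accepts several paths in that list, which the list-typed argument suggests but which is not confirmed here. The `collect_image_paths` helper and the suffix set are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed set of image suffixes; adjust to whatever the model loader supports.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def collect_image_paths(folder: str, suffixes: Iterable[str] = IMAGE_SUFFIXES) -> List[str]:
    """Return the sorted paths of files in `folder` whose suffix is in `suffixes`."""
    allowed = {s.lower() for s in suffixes}
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    )


# Hypothetical usage with the Chatbot shown above:
#   image_paths = collect_image_paths("path-to-images")
#   output = bot.chat("What do these pictures show?", image_paths)
```

Sorting keeps the image order deterministic, which matters when the question refers to images by position.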

Benchmarks

bash Eval.sh

5. Reproduce other results in Paper

FLOPs

python ./utils/cal_flops.py

Prefill Time & Throughput & GPU Memory Usage

python ./benchmarks/Efficiency/evaluate.py
python ./benchmarks/Efficiency/evaluatevllm.py
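
For orientation, latency and throughput figures of this kind reduce to wall-clock arithmetic around a model call. The sketch below is a generic timing harness, not the logic of `evaluate.py` or `evaluatevllm.py`; the names `time_call` and `tokens_per_second` are hypothetical.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def time_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call `fn` once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def tokens_per_second(num_tokens: int, seconds: float) -> float:
    """Throughput as the number of tokens divided by elapsed seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


# Hypothetical usage: wrap a generation call and count the tokens it produced.
#   output, elapsed = time_call(lambda: bot.chat(query, image_paths))
```

When timing GPU work, the device generally needs to be synchronized (for example with `torch.cuda.synchronize()`) before each clock read, otherwise asynchronous kernels make the measured time too short.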

Downcycling to Convert Jamba-MoE to Dense

python ./utils/dense_downcycling.py

TO DO

Release Data Construction Code

Acknowledgement

LLaVA: Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Citation

@misc{wang2024longllavascalingmultimodalllms,
 title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture}, 
 author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
 year={2024},
 eprint={2409.02889},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2409.02889}, 
}
