FreedomIntelligence/LongLLaVA

📃 Paper • 🌐 Demo • 🤗 LongLLaVA-53B-A13B • 🤗 LongLLaVA-9B

[Figure: efficiency]

🌈 Update

Architecture

[Figure: architecture image]

Results

[Figure: results]

Results reproduction

1. Environment Setup

pip install -r requirements.txt

2. Data Download and Construction

Dataset Taxonomy

[Figure: dataset taxonomy]

  • Dataset Downloading and Construction

    Coming Soon.

3. Training

  • Downloading Language Models

    🤗 Jamba-9B-Instruct

  • Stage I: Single-image Alignment.

    bash Align.sh
  • Stage II: Single-image Instruction-tuning.

    bash SingleImageSFT.sh
  • Stage III: Multi-image Instruction-tuning.

    bash MultiImageSFT.sh
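
The three stages above can also be launched in order from Python. This is a minimal sketch, assuming the stage scripts sit in the current working directory and are already configured; `STAGE_SCRIPTS`, `build_commands`, and `run_stages` are illustrative names, not part of the repository.

```python
import subprocess

# Stage scripts in the order listed above (assumed to be in the working directory).
STAGE_SCRIPTS = ["Align.sh", "SingleImageSFT.sh", "MultiImageSFT.sh"]


def build_commands(scripts=STAGE_SCRIPTS):
    """Return one `bash <script>` command per training stage."""
    return [["bash", script] for script in scripts]


def run_stages(runner=subprocess.run):
    """Run each stage in order.

    check=True makes subprocess.run raise CalledProcessError when a script
    exits with a non-zero status, so later stages do not start after a failure.
    """
    for command in build_commands():
        print("Running:", " ".join(command))
        runner(command, check=True)
```

Calling `run_stages()` from the repository root would start Stage I and continue only while each script succeeds.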

4. Evaluation

  • Command Line Interface
python cli.py --model_dir path-to-longllava
  • Model Inference
from cli import Chatbot

query = 'What does the picture show?'
image_paths = ['image_path1']  # image or video path
bot = Chatbot('path-to-longllava')  # directory containing the LongLLaVA weights
output = bot.chat(query, image_paths)
print(output)  # Prints the output of the model
  • Benchmarks
bash Eval.sh
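
Because the Model Inference call takes a list of paths, a whole folder can be passed in one query. The sketch below is only an illustration: `collect_media_paths`, `ask_about_folder`, and the `MEDIA_SUFFIXES` set are hypothetical helpers, and it assumes only that the bot object exposes `chat(query, image_paths)` as in the Model Inference snippet. Whether images and videos can be mixed in one call is not stated in this README.

```python
from pathlib import Path

# File extensions treated as images or videos here; an assumption, adjust as needed.
MEDIA_SUFFIXES = {".jpg", ".jpeg", ".png", ".mp4"}


def collect_media_paths(directory):
    """Return the sorted paths of media files directly inside `directory`."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in MEDIA_SUFFIXES
    )


def ask_about_folder(bot, query, directory):
    """Send one query together with every media file found in `directory`.

    `bot` is any object exposing chat(query, image_paths).
    """
    image_paths = collect_media_paths(directory)
    if not image_paths:
        raise ValueError(f"no images or videos found in {directory}")
    return bot.chat(query, image_paths)
```

With the repository on the import path, this could be used as `ask_about_folder(Chatbot('path-to-longllava'), 'What do these pictures show?', 'path-to-images')`.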

5. Reproduce other results in Paper

  • FLOPs
python ./utils/cal_flops.py
  • Prefill Time & Throughput & GPU Memory Usage
python ./benchmarks/Efficiency/evaluate.py
python ./benchmarks/Efficiency/evaluatevllm.py
  • DownCycling To Transfer Jamba-MoE to Dense
python ./utils/dense_downcycling.py
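
For orientation, throughput is the number of tokens processed divided by wall-clock time. The helper below is a generic sketch of that arithmetic, not the logic of the repository's `evaluate.py`; `measure_throughput` and its parameters are hypothetical names.

```python
import time


def measure_throughput(generate, num_tokens):
    """Time one call to `generate()` and return (seconds, tokens per second).

    `generate` is any zero-argument callable that processes `num_tokens`
    tokens; the caller is responsible for supplying the correct count.
    """
    start = time.perf_counter()
    generate()
    elapsed = time.perf_counter() - start
    return elapsed, num_tokens / elapsed
```

When timing GPU work with PyTorch, kernels launch asynchronously, so `torch.cuda.synchronize()` is normally called before reading the clock; that step is left out here to keep the sketch dependency-free.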

TO DO

  • Release Data Construction Code

Acknowledgement

  • LLaVA: Visual Instruction Tuning, built towards GPT-4V level capabilities and beyond.

Citation

@misc{wang2024longllavascalingmultimodalllms,
 title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture}, 
 author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
 year={2024},
 eprint={2409.02889},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2409.02889}, 
}
