ustcwhy/BitVLA

BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

Open Source Plan

  • ✅ Paper, pre-trained VLM, and evaluation code
  • ✅ Fine-tuned VLA code and models
  • ✅ Pre-trained VLA
  • 🧭 Pre-training code

Checkpoints

| Models | Size | Memory Usage ↓ | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Avg. |
|---|---|---|---|---|---|---|---|
| Large Models | | | | | | | |
| OpenVLA | 7.5B | 15.1GB | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| CoT-VLA | 8.0B | 16.2GB | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| UniVLA | 8.5B | 17.0GB | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| UnifiedVLA | 8.5B | 17.0GB | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| OpenVLA-OFT | 7.7B | 15.4GB | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| Small Models | | | | | | | |
| SpatialVLA | 4.2B | 8.5GB | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| NORA-Long | 3.8B | 7.5GB | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| 4D-VLA | 4.1B | 8.3GB | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| SmolVLA | 2.3B | 4.6GB | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| GROOT-N1 | 2.2B | 4.4GB | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| π0 | 3.5B | 7.0GB | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| BitVLA w/o pre-training | 3.0B | 1.4GB | 97.4 | 99.6 | 94.4 | 87.6 | 94.8 |
| 🚀 BitVLA | 3.0B | 1.4GB | 96.6 | 99.0 | 95.4 | 92.8 | 96.0 |
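
The memory column is easier to compare as a ratio against BitVLA's 1.4GB footprint. The short sketch below derives those ratios from the figures in the table; the ratios are computed here for illustration and are not quoted from the README, and the helper name `memory_ratio` is hypothetical.

```python
# Memory footprints in GB, copied from the table above.
memory_gb = {
    "OpenVLA": 15.1, "CoT-VLA": 16.2, "UniVLA": 17.0, "UnifiedVLA": 17.0,
    "OpenVLA-OFT": 15.4, "SpatialVLA": 8.5, "NORA-Long": 7.5, "4D-VLA": 8.3,
    "SmolVLA": 4.6, "GROOT-N1": 4.4, "π0": 7.0, "BitVLA": 1.4,
}


def memory_ratio(model, baseline="BitVLA"):
    """Return how many times more memory `model` uses than `baseline`."""
    return memory_gb[model] / memory_gb[baseline]


for name, gb in memory_gb.items():
    print(f"{name:12s} {gb:5.1f} GB  {memory_ratio(name):5.2f}x")
```

Under these figures, the smallest baseline in the table (GROOT-N1 at 4.4GB) still uses roughly three times the memory reported for BitVLA.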
| Model | Path |
|---|---|
| 🚀 BitVLA - VL&VLA pre-trained | lxsy/bitvla-bf16 |
| BitVLA - VL pre-trained | hongyuw/bitvla-bitsiglipL-224px-bf16 |
| BitVLA finetuned on LIBERO-Spatial | hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 |
| BitVLA finetuned on LIBERO-Object | hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16 |
| BitVLA finetuned on LIBERO-Goal | hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16 |
| BitVLA finetuned on LIBERO-Long | hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16 |
| BitVLA w/ BF16 SigLIP | hongyuw/bitvla-siglipL-224px-bf16 |

*Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the bitnet.cpp inference framework to accurately measure the reduction in inference cost. A dedicated inference framework and model are coming soon.
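
To make "quantize the weights to 1.58-bit precision" concrete, here is a minimal sketch of ternary weight quantization. It assumes the absmean scheme associated with BitNet b1.58 (scale by the mean absolute weight, round, clip to {-1, 0, +1}); the README does not spell out the exact quantizer used in this repository, so treat the function names and the `eps` default as illustrative. The "1.58" comes from log2(3), the information content of a three-valued weight.

```python
import math


def absmean_ternary_quantize(weights, eps=1e-5):
    """Quantize a flat list of floats to {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling, i.e. scale = mean(|w|), then round and clip.
    """
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize_ternary(ternary, scale):
    """Map ternary codes back to approximate float weights."""
    return [t * scale for t in ternary]


weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
ternary, scale = absmean_ternary_quantize(weights)
print("bits per ternary weight:", round(math.log2(3), 3))
print("ternary codes:", ternary)
print("reconstructed:", dequantize_ternary(ternary, scale))
```

Online quantization applies a step like this on the fly to the stored master weights, which is why the released BF16 checkpoints do not shrink on disk until they are quantized offline.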

Vision-Language

Evaluation on VQA

We use the LMM-Eval toolkit to evaluate on VQA tasks. We provide a modified copy of the transformers repo in which modeling_llava.py and modeling_siglip.py are changed to support W1.58-A8 quantization.

Run the evaluation inside the nvidia_24_07 Docker container. Start the container and install the packages:

docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity # only use for multimodal evaluation
docker exec -it nvidia_24_07 bash
git clone https://github.com/ustcwhy/BitVLA.git
cd BitVLA/
bash vl_eval_setup.sh # only use for multimodal evaluation

First, download the BitVLA model from HuggingFace:

git clone https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16 # BitVLA w/ W1.58-A8 SigLIP-L
git clone https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16 # BitVLA w/ BF16 SigLIP-L

Then run the following scripts to conduct evaluations:

cd lmms-eval/
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-bitsiglipL-224px-bf16
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16

Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the bitnet.cpp inference framework to accurately measure the reduction in inference cost.
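
The "A8" half of the W1.58-A8 scheme refers to 8-bit activations. As a rough illustration, the sketch below quantizes one activation vector to signed 8-bit integers with absmax scaling. This is an assumption about the general technique, not a transcription of the modified modeling_llava.py or modeling_siglip.py; the function names are hypothetical.

```python
def absmax_int8_quantize(activations, eps=1e-5):
    """Quantize one activation vector to int8 codes with a single scale.

    Assumption: absmax scaling, so the largest magnitude maps to 127.
    """
    max_abs = max(max(abs(a) for a in activations), eps)
    scale = 127.0 / max_abs
    codes = [max(-128, min(127, round(a * scale))) for a in activations]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float activations from int8 codes."""
    return [c / scale for c in codes]


activations = [0.5, -2.0, 0.25, 0.0]
codes, scale = absmax_int8_quantize(activations)
print(codes)
print(dequantize_int8(codes, scale))
```

With ternary weights and int8 activations, a matrix multiply reduces to integer additions and subtractions followed by one rescale, which is the property dedicated inference frameworks exploit.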

Vision-Language-Action

Robotics Pre-training

To endow BitVLA with generalizable manipulation priors that transfer across embodiments and environments, we pre-train it with an autoregressive next-action prediction objective following OpenVLA.

Pre-training Details:

  • Base model: We use hongyuw/bitvla-bitsiglipL-224px-bf16 as the base model.
  • Dataset: Following OpenVLA, we use a curated large-scale corpus based on a subset of the Open X-Embodiment dataset, resulting in ~1M training samples.
  • Hyperparameters: We train the model for 200K steps with a total batch size of 2048. The peak learning rates are set to ×10⁻⁴ for the LLM and ×10⁻⁴ for the ViT.
  • Compute: The full pre-training takes approximately 14 days on 16 NVIDIA H800 (80GB) GPUs.
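
For a sense of scale, the figures above imply the following totals. This is simple arithmetic on the stated numbers; since the README gives the duration as approximate, the GPU-hour figure is approximate as well.

```python
# Values stated in the pre-training details above.
train_steps = 200_000
total_batch_size = 2048
days = 14          # stated as approximate
num_gpus = 16

# Each optimizer step consumes one full batch.
samples_processed = train_steps * total_batch_size
gpu_hours = days * 24 * num_gpus

print(f"samples processed: {samples_processed:,}")
print(f"approx. GPU-hours: {gpu_hours:,}")
```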

OFT Training

1. Preparing OFT

We fine-tune BitVLA with the OFT training recipe from OpenVLA-OFT. First, set up the environment as required by that project; see SETUP.md and LIBERO.md for detailed instructions.

conda create -n bitvla python=3.10 -y
conda activate bitvla
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
# or use the provided docker
# docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity
cd BitVLA
pip install -e openvla-oft/
pip install -e transformers
cd openvla-oft/
# install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO/
pip install -r experiments/robot/libero/libero_requirements.txt
# install bitvla
pip install -e bitvla/

We use the same dataset as OpenVLA-OFT for fine-tuning on LIBERO. You can download it from HuggingFace:

git clone git@hf.co:datasets/openvla/modified_libero_rlds
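
The SSH clone above requires an SSH key registered with HuggingFace. As an alternative, the sketch below fetches the same repositories over HTTPS with `snapshot_download` from the `huggingface_hub` package. The `REPOS` table, the `download` helper, and the `./downloads` target directory are hypothetical choices for illustration, and the package must be installed separately.

```python
# Repositories named in this README: (repo_id, repo_type).
REPOS = {
    "bitvla-bitsiglipL-224px-bf16": ("hongyuw/bitvla-bitsiglipL-224px-bf16", "model"),
    "modified_libero_rlds": ("openvla/modified_libero_rlds", "dataset"),
}


def download(name, root="./downloads"):
    """Download one repository snapshot and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    repo_id, repo_type = REPOS[name]
    return snapshot_download(
        repo_id=repo_id,
        repo_type=repo_type,
        local_dir=f"{root}/{name}",
    )

# Example (requires network access):
# print(download("modified_libero_rlds"))
```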

2. OFT fine-tuning

Preparing BitVLA
  • 🚀 New pre-trained model (recommended): This model is ready to use out of the box. No additional processing is required; you can run the provided scripts directly.
  • 🕰️ Old model: This version was not pre-trained on the Open X-Embodiment dataset. Before use, convert the checkpoint into a format compatible with our codebase:
    python convert_ckpt.py /path/to/bitvla-bitsiglipL-224px-bf16
    
Fine-tuning BitVLA

After that, you can fine-tune BitVLA using the provided shell scripts, one per LIBERO suite:

sh ft_script/ft_bitvla_libero_spatial.sh
sh ft_script/ft_bitvla_libero_object.sh
sh ft_script/ft_bitvla_libero_goal.sh
sh ft_script/ft_bitvla_libero_long.sh

Evaluation on LIBERO

You can download our fine-tuned BitVLA models from HuggingFace. For example, to evaluate on the LIBERO-Spatial suite, run:

python experiments/robot/libero/run_libero_eval_bitnet.py \
 --pretrained_checkpoint /path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 \
 --task_suite_name libero_spatial \
 --info_in_path "information you want to show in path" \
 --model_family "bitnet" 
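
If you want to launch the evaluation from Python, for instance to log the exact command per run, a small wrapper can assemble the same arguments. The sketch below uses only the flags shown above; `build_eval_command` is a hypothetical helper, and the checkpoint path and run label are placeholders.

```python
import shlex


def build_eval_command(checkpoint_dir, task_suite_name, info_in_path=""):
    """Assemble the evaluation command as an argument list."""
    return [
        "python", "experiments/robot/libero/run_libero_eval_bitnet.py",
        "--pretrained_checkpoint", checkpoint_dir,
        "--task_suite_name", task_suite_name,
        "--info_in_path", info_in_path,
        "--model_family", "bitnet",
    ]


cmd = build_eval_command(
    "/path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16",
    "libero_spatial",
    info_in_path="spatial-run",
)
print(shlex.join(cmd))
# To execute, run from the openvla-oft/ directory:
# import subprocess; subprocess.run(cmd, check=True)
```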

Acknowledgement

This repository is built on LMM-Eval, HuggingFace transformers, OpenVLA-OFT, and OpenVLA.

Citation

If you find this repository useful, please consider citing our work:

@article{bitvla,
 title={BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation}, 
 author={Hongyu Wang and Chuyan Xiong and Ruiping Wang and Xilin Chen},
 year={2025},
 eprint={2506.07530},
 archivePrefix={arXiv},
 primaryClass={cs.RO},
}

License

This project is licensed under the MIT License.

Contact Information

For help or issues using models, please submit a GitHub issue.
