VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model


[Paper PDF] [Project Page] [Hugging Face] [Code License]

⭐ If our project helps you, please give us a star on GitHub to support us!

TODO

  • Partial training code
  • LIBERO evaluation code
  • LIBERO-Plus evaluation code
  • SimplerEnv evaluation code
  • Training codes for custom datasets

Environment Setup

git clone https://github.com/ginwind/VLA-JEPA
cd VLA-JEPA
# Create conda environment
conda create -n VLA_JEPA python=3.10 -y
conda activate VLA_JEPA
# Install requirements
pip install -r requirements.txt
# Install FlashAttention2
pip install flash-attn --no-build-isolation
# Install project
pip install -e .

This repository's code is based on starVLA.

Training

0️⃣ Pretrained Model Preparation

Download the Qwen3-VL-2B model and the V-JEPA2 encoder.
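
For example, both can be fetched with huggingface-cli; the repository IDs below are placeholders, so substitute the official Qwen3-VL-2B and V-JEPA2 model repositories:

# Placeholder repository IDs; replace with the official Qwen3-VL-2B and V-JEPA2 repos
huggingface-cli download <qwen3-vl-2b-repo-id> --local-dir ./pretrained/Qwen3-VL-2B
huggingface-cli download <vjepa2-encoder-repo-id> --local-dir ./pretrained/V-JEPA2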

1️⃣ Data Preparation

Download the following datasets:

2️⃣ Start Training

Depending on whether you are conducting pre-training or post-training, select the appropriate training script and YAML configuration file from the /scripts directory.

Ensure the following configurations are updated in the YAML file:

  • framework.qwenvl.basevlm and framework.vj2_model.base_encoder should be set to the paths of your respective checkpoints.
  • Update datasets.vla_data.data_root_dir, datasets.video_data.video_dir, and datasets.video_data.text_file to match the paths of your datasets.
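
For example, assuming the dotted keys above map to nested YAML sections, the relevant part of the configuration might look like this (all paths are placeholders):

framework:
  qwenvl:
    basevlm: /path/to/Qwen3-VL-2B
  vj2_model:
    base_encoder: /path/to/vjepa2-encoder
datasets:
  vla_data:
    data_root_dir: /path/to/vla_dataset
  video_data:
    video_dir: /path/to/videos
    text_file: /path/to/video_text_annotations.txt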

Once the configurations are updated, you can proceed to start the training process.
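
A hypothetical launch command, assuming a pre-training script named pretrain.sh under /scripts (pick the actual script and YAML that match your stage):

# Hypothetical script name; use the actual pre-training or post-training script from /scripts
bash scripts/pretrain.sh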

Evaluation

Download the model checkpoints from Hugging Face: https://huggingface.co/ginwind/VLA-JEPA
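
For example, with huggingface-cli (the local directory is a placeholder):

huggingface-cli download ginwind/VLA-JEPA --local-dir ./checkpoints/VLA-JEPA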

Environment: Install the required Python packages into your VLA-JEPA environment:

pip install tyro matplotlib mediapy websockets msgpack
pip install numpy==1.24.4

LIBERO

  • LIBERO setup: Prepare the LIBERO benchmark in a separate conda environment following the official LIBERO instructions: https://github.com/Lifelong-Robot-Learning/LIBERO

  • Configuration: In the downloaded checkpoint folder, update config.json and config.yaml to point the following fields to your local checkpoints:

    • framework.qwenvl.basevlm: path to the Qwen3-VL-2B checkpoint
    • framework.vj2_model.base_encoder: path to the V-JEPA encoder checkpoint
  • Evaluation script: In examples/LIBERO/eval_libero.sh, set the LIBERO_HOME environment variable (line 4) to your local LIBERO code path, the sim_python variable (line 9) to the Python executable of the LIBERO conda environment, and the your_ckpt variable (line 11) to the path of the downloaded LIBERO/checkpoints/VLA-JEPA-LIBERO.pt (see the sketch after the run command below).

  • Run evaluation: Launch the evaluation (the script runs the four task suites in parallel across 4 GPUs):

bash ./examples/LIBERO/eval_libero.sh
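
For reference, the variables edited in eval_libero.sh would look roughly like this (all paths are placeholders for your local setup):

export LIBERO_HOME=/path/to/LIBERO                         # line 4: local LIBERO code path
sim_python=/path/to/conda/envs/libero/bin/python           # line 9: Python of the LIBERO conda environment
your_ckpt=/path/to/LIBERO/checkpoints/VLA-JEPA-LIBERO.pt   # line 11: downloaded checkpoint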

LIBERO-Plus

  • LIBERO-Plus setup: Clone the LIBERO-Plus repository: https://github.com/sylvestf/LIBERO-plus. In ./examples/LIBERO-Plus/libero_plus_init.py, update line 121 to point to your LIBERO-Plus/libero/libero/benchmark/task_classification.json. Replace the original LIBERO-Plus/libero/libero/benchmark/__init__.py with the provided modified implementation (see ./examples/LIBERO-Plus/libero_plus_init.py) to enable evaluation over perturbation dimensions. Finally, follow the official LIBERO-Plus installation instructions and build the benchmark in a separate conda environment.

  • Configuration: In the downloaded checkpoint folder, update config.json and config.yaml to point the following fields to your local checkpoints:

    • framework.qwenvl.basevlm: path to the Qwen3-VL-2B checkpoint
    • framework.vj2_model.base_encoder: path to the V-JEPA encoder checkpoint
  • Evaluation script: In examples/LIBERO-Plus/eval_libero_plus.sh, set the LIBERO_HOME environment variable (line 4) to your local LIBERO-Plus code path, the sim_python variable (line 9) to the Python executable of the LIBERO-Plus conda environment, and the your_ckpt variable (line 11) to the path of the downloaded LIBERO/checkpoints/VLA-JEPA-LIBERO.pt.

  • Run evaluation: Launch the evaluation (the script runs the seven perturbation dimensions in parallel across 7 GPUs):

bash ./examples/LIBERO-Plus/eval_libero_plus.sh

Notes: Ensure each process has access to a GPU and verify that all checkpoint paths in the configuration files are correct before running the evaluation.

Acknowledgement

We extend our sincere gratitude to the starVLA project and the V-JEPA2 project for their invaluable open-source contributions.
