Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

arXiv RSS 2025 Python Jupyter

Repository: text2nav

Accepted to Robotics: Science and Systems (RSS) 2025 Workshop on Robot Planning in the Era of Foundation Models (FM4RoboPlan)

๐Ÿ“ Overview

This repository contains the implementation for our research investigating whether frozen vision-language model embeddings can guide robot navigation without fine-tuning or specialized architectures. We present a minimalist framework that achieves a 74% success rate in language-guided navigation using only pretrained SigLIP embeddings.

Key Findings

  • 74% success rate using frozen VLM embeddings alone (vs. 100% for the privileged expert)
  • 3.2× longer paths than the privileged expert, revealing efficiency limitations
  • SigLIP outperforms CLIP and ViLT on the navigation task (74% vs. 62% vs. 40%)
  • Clear performance-complexity tradeoffs for resource-constrained applications
  • Strong semantic grounding, but limited spatial reasoning and planning

Method

Our approach consists of two phases:

  1. Expert Demonstration Phase: Train a privileged policy with full state access using PPO
  2. Behavioral Cloning Phase: Distill expert knowledge into a policy using only frozen VLM embeddings

The key insight is using frozen vision-language embeddings as drop-in representations without any fine-tuning, providing an empirical baseline for understanding foundation model capabilities in embodied tasks.
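
A minimal sketch of the resulting student policy is shown below; the layer sizes, 2-D action, and class name are illustrative assumptions, not the repository's exact code:

# Sketch of the distilled policy: frozen VLM embeddings in, action out.
# Layer sizes, action dimension, and names are illustrative assumptions.
import torch
import torch.nn as nn

class EmbeddingPolicy(nn.Module):
    def __init__(self, img_dim=1152, txt_dim=1152, hidden=256, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),  # e.g. wheel velocity commands
        )

    def forward(self, img_emb, txt_emb):
        # The VLM that produced img_emb/txt_emb stays frozen; only this head is trained.
        return self.net(torch.cat([img_emb, txt_emb], dim=-1))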

Installation

Prerequisites

  • Python 3.8+
  • NVIDIA Isaac Sim/Isaac Lab
  • PyTorch
  • CUDA-compatible GPU

Setup

git clone https://github.com/oadamharoon/text2nav.git
cd text2nav
# Install dependencies
pip install torch torchvision
pip install transformers
pip install numpy matplotlib
pip install gymnasium
# For Isaac Lab simulation (follow official installation guide)
# https://isaac-sim.github.io/IsaacLab/

๐Ÿ“ Repository Structure

text2nav/
├── CITATION.cff              # Citation information
├── LICENSE                   # MIT License
├── README.md                 # This documentation
├── IsaacLab/                 # Isaac Lab simulation environment setup
├── embeddings/               # Vision-language embedding generation
├── rl/                       # Reinforcement learning expert training
├── generate_embeddings.ipynb # Generate VLM embeddings from demonstrations
├── revised_gen_embed.ipynb   # Revised embedding generation
├── train_offline.py          # Behavioral cloning training script
├── offlin_train.py           # Alternative offline training
├── bc_model.pt               # Trained behavioral cloning model
├── td3_bc_model.pt           # TD3+BC baseline model
├── habitat_test.ipynb        # Testing and evaluation notebook
└── replay_buffer.py          # Data handling utilities

Usage

1. Expert Demonstration Collection

cd rl/
python train_expert.py --env isaac_sim --num_episodes 500
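
Rolling out the trained expert also produces the demonstration data used later for distillation. A rough sketch of that collection loop is below; the environment object and observation keys ("rgb", "instruction", "privileged_state") are assumptions for illustration, following the gymnasium interface listed in the prerequisites:

# Sketch: roll out the trained expert and log (image, instruction, expert action)
# tuples for behavioral cloning. Env and key names are illustrative assumptions.
def collect_demonstrations(env, expert_policy, num_episodes=500):
    dataset = []
    for _ in range(num_episodes):
        obs, info = env.reset()
        done = False
        while not done:
            action = expert_policy(obs["privileged_state"])  # expert sees the full state
            dataset.append({
                "image": obs["rgb"],                # what the student policy will see
                "instruction": obs["instruction"],  # e.g. "navigate to the red sphere"
                "action": action,                   # supervision signal for BC
            })
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
    return dataset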

2. Generate VLM Embeddings

jupyter notebook generate_embeddings.ipynb
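
The core step in the notebook is encoding each logged image and instruction with the frozen VLM. A minimal sketch using the Hugging Face transformers SigLIP interface is below; the specific checkpoint name is an assumption, chosen because the so400m variant produces the 1152-dimensional embeddings used in the training command that follows:

# Sketch: extract frozen SigLIP embeddings with Hugging Face transformers.
# The checkpoint name is an assumption; so400m yields 1152-d embeddings.
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

ckpt = "google/siglip-so400m-patch14-384"
model = SiglipModel.from_pretrained(ckpt).eval()
processor = AutoProcessor.from_pretrained(ckpt)

@torch.no_grad()
def embed(image: Image.Image, instruction: str):
    inputs = processor(text=[instruction], images=image,
                       padding="max_length", return_tensors="pt")
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"])
    return img_emb.squeeze(0), txt_emb.squeeze(0)  # each 1152-d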

3. Train Navigation Policy

python train_offline.py --model siglip --embedding_dim 1152 --batch_size 32
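
Under the hood this is plain behavioral cloning on the cached embeddings. A condensed sketch of the objective is below, reusing the illustrative EmbeddingPolicy from the Method section; the optimizer, loss, and hyperparameters are assumptions rather than the script's exact settings:

# Sketch: behavioral cloning on cached frozen embeddings (MSE to the expert action).
# Hyperparameters and data handling are illustrative assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_bc(policy, img_embs, txt_embs, expert_actions,
             epochs=50, batch_size=32, lr=1e-3):
    loader = DataLoader(TensorDataset(img_embs, txt_embs, expert_actions),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for img, txt, act in loader:
            loss = torch.nn.functional.mse_loss(policy(img, txt), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy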

Results

Model          Success Rate (%)   Avg Steps
Expert (πβ)    100.0              113.97
SigLIP          74.0              369.4
CLIP            62.0              417.6
ViLT            40.0              472.0

Experimental Setup

  • Environment: 3m × 3m arena in NVIDIA Isaac Sim
  • Robot: NVIDIA JetBot with RGB camera (256×256)
  • Task: Navigate to colored spheres based on language instructions
  • Targets: 5 colored spheres (red, green, blue, yellow, pink)
  • Success Criteria: Reach within 0.1 m of the correct target (see the check sketched below)
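
As a concrete reading of the success criterion, the check below is a sketch; the positions are assumed to be planar coordinates from the simulator:

# Sketch: an episode succeeds if the robot ends within 0.1 m of the commanded target.
import numpy as np

def is_success(robot_xy, target_xy, threshold=0.1):
    return float(np.linalg.norm(np.asarray(robot_xy) - np.asarray(target_xy))) < threshold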

Key Insights

  1. Semantic Grounding: Pretrained VLMs excel at connecting language descriptions to visual observations
  2. Spatial Limitations: Frozen embeddings struggle with long-horizon planning and spatial reasoning
  3. Prompt Engineering: Including relative spatial cues in the instruction significantly improves performance (illustrated below)
  4. Embedding Dimensionality: Higher-dimensional embeddings (SigLIP: 1152D) outperform lower-dimensional ones
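
To make the prompt-engineering point concrete, the template below contrasts an instruction with and without a relative spatial cue; the wording is hypothetical and not the paper's exact prompts:

# Hypothetical instruction templates (not the paper's exact wording): the second form
# adds the kind of relative spatial cue that the Key Insights refer to.
from typing import Optional

def make_instruction(color: str, relative_cue: Optional[str] = None) -> str:
    base = f"Navigate to the {color} sphere"
    return f"{base}, {relative_cue}." if relative_cue else f"{base}."

# make_instruction("red")                          -> "Navigate to the red sphere."
# make_instruction("red", "slightly to your left") -> "Navigate to the red sphere, slightly to your left."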

Future Work

  • Hybrid architectures combining frozen embeddings with lightweight spatial memory
  • Data-efficient adaptation techniques to bridge the efficiency gap
  • Testing in more complex environments with obstacles and natural language variation
  • Integration with world models for better spatial reasoning

Citation

@misc{subedi2025pretrainedvisionlanguageembeddingsguide,
 title={Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?}, 
 author={Nitesh Subedi and Adam Haroon and Shreyan Ganguly and Samuel T. K. Tetteh and Prajwal Koirala and Cody Fleming and Soumik Sarkar},
 year={2025},
 eprint={2506.14507},
 archivePrefix={arXiv},
 primaryClass={cs.RO},
 url={https://arxiv.org/abs/2506.14507}, 
}

๐Ÿ™ Acknowledgments

This work is funded by NSF-USDA COALESCE grant #2021-67021-34418. Special thanks to the Iowa State University Mechanical Engineering Department for their support.

Contributors

*Equal contribution

License

This project is licensed under the MIT License - see the LICENSE file for details.

Links


For questions or issues, please open a GitHub issue or contact the authors.
