Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

arXiv RSS 2025 Python Jupyter

Repository: text2nav

Accepted to Robotics: Science and Systems (RSS) 2025 Workshop on Robot Planning in the Era of Foundation Models (FM4RoboPlan)

๐Ÿ“ Overview

This repository contains the implementation for our research investigating whether frozen vision-language model embeddings can guide robot navigation without fine-tuning or specialized architectures. We present a minimalist framework that achieves a 74% success rate in language-guided navigation using only pretrained SigLIP embeddings.

Key Findings

  • 74% success rate using frozen VLM embeddings alone (vs. 100% for the privileged expert)
  • 3.2× longer paths than the privileged expert, revealing efficiency limitations
  • SigLIP outperforms CLIP and ViLT on the navigation task (74% vs. 62% vs. 40%)
  • Clear performance-complexity tradeoffs for resource-constrained applications
  • Strong semantic grounding, but limited spatial reasoning and planning

Method

Our approach consists of two phases:

  1. Expert Demonstration Phase: Train a privileged policy with full state access using PPO
  2. Behavioral Cloning Phase: Distill expert knowledge into a policy using only frozen VLM embeddings

The key insight is using frozen vision-language embeddings as drop-in representations without any fine-tuning, providing an empirical baseline for understanding foundation model capabilities in embodied tasks.
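
A minimal sketch of the resulting student policy is shown below; the layer sizes, 2-D action, and class name are illustrative assumptions, not the repository's exact code:

# Sketch of the distilled policy: frozen VLM embeddings in, action out.
# Layer sizes, action dimension, and names are illustrative assumptions.
import torch
import torch.nn as nn

class EmbeddingPolicy(nn.Module):
    def __init__(self, img_dim=1152, txt_dim=1152, hidden=256, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),  # e.g. wheel velocity commands
        )

    def forward(self, img_emb, txt_emb):
        # The VLM that produced img_emb/txt_emb stays frozen; only this head is trained.
        return self.net(torch.cat([img_emb, txt_emb], dim=-1))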

Installation

Prerequisites

  • Python 3.8+
  • NVIDIA Isaac Sim/Isaac Lab
  • PyTorch
  • CUDA-compatible GPU

Setup

git clone https://github.com/oadamharoon/text2nav.git
cd text2nav
# Install dependencies
pip install torch torchvision
pip install transformers
pip install numpy matplotlib
pip install gymnasium
# For Isaac Lab simulation (follow official installation guide)
# https://isaac-sim.github.io/IsaacLab/

๐Ÿ“ Repository Structure

text2nav/
├── CITATION.cff              # Citation information
├── LICENSE                   # MIT License
├── README.md                 # This documentation
├── IsaacLab/                 # Isaac Lab simulation environment setup
├── embeddings/               # Vision-language embedding generation
├── rl/                       # Reinforcement learning expert training
├── generate_embeddings.ipynb # Generate VLM embeddings from demonstrations
├── revised_gen_embed.ipynb   # Revised embedding generation
├── train_offline.py          # Behavioral cloning training script
├── offlin_train.py           # Alternative offline training
├── bc_model.pt               # Trained behavioral cloning model
├── td3_bc_model.pt           # TD3+BC baseline model
├── habitat_test.ipynb        # Testing and evaluation notebook
└── replay_buffer.py          # Data handling utilities

Usage

1. Expert Demonstration Collection

cd rl/
python train_expert.py --env isaac_sim --num_episodes 500
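
Rolling out the trained expert also produces the demonstration data used later for distillation. A rough sketch of that collection loop is below; the environment object and observation keys ("rgb", "instruction", "privileged_state") are assumptions for illustration, following the gymnasium interface listed in the prerequisites:

# Sketch: roll out the trained expert and log (image, instruction, expert action)
# tuples for behavioral cloning. Env and key names are illustrative assumptions.
def collect_demonstrations(env, expert_policy, num_episodes=500):
    dataset = []
    for _ in range(num_episodes):
        obs, info = env.reset()
        done = False
        while not done:
            action = expert_policy(obs["privileged_state"])  # expert sees the full state
            dataset.append({
                "image": obs["rgb"],                # what the student policy will see
                "instruction": obs["instruction"],  # e.g. "navigate to the red sphere"
                "action": action,                   # supervision signal for BC
            })
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
    return dataset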

2. Generate VLM Embeddings

jupyter notebook generate_embeddings.ipynb
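
The core step in the notebook is encoding each logged image and instruction with the frozen VLM. A minimal sketch using the Hugging Face transformers SigLIP interface is below; the specific checkpoint name is an assumption, chosen because the so400m variant produces the 1152-dimensional embeddings used in the training command that follows:

# Sketch: extract frozen SigLIP embeddings with Hugging Face transformers.
# The checkpoint name is an assumption; so400m yields 1152-d embeddings.
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

ckpt = "google/siglip-so400m-patch14-384"
model = SiglipModel.from_pretrained(ckpt).eval()
processor = AutoProcessor.from_pretrained(ckpt)

@torch.no_grad()
def embed(image: Image.Image, instruction: str):
    inputs = processor(text=[instruction], images=image,
                       padding="max_length", return_tensors="pt")
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"])
    return img_emb.squeeze(0), txt_emb.squeeze(0)  # each 1152-d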

3. Train Navigation Policy

python train_offline.py --model siglip --embedding_dim 1152 --batch_size 32
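
Under the hood this is plain behavioral cloning on the cached embeddings. A condensed sketch of the objective is below, reusing the illustrative EmbeddingPolicy from the Method section; the optimizer, loss, and hyperparameters are assumptions rather than the script's exact settings:

# Sketch: behavioral cloning on cached frozen embeddings (MSE to the expert action).
# Hyperparameters and data handling are illustrative assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_bc(policy, img_embs, txt_embs, expert_actions,
             epochs=50, batch_size=32, lr=1e-3):
    loader = DataLoader(TensorDataset(img_embs, txt_embs, expert_actions),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for img, txt, act in loader:
            loss = torch.nn.functional.mse_loss(policy(img, txt), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy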

Results

Model          Success Rate (%)   Avg Steps
Expert (πβ)    100.0              113.97
SigLIP          74.0              369.4
CLIP            62.0              417.6
ViLT            40.0              472.0

Experimental Setup

  • Environment: 3m × 3m arena in NVIDIA Isaac Sim
  • Robot: NVIDIA JetBot with RGB camera (256×256)
  • Task: Navigate to colored spheres based on language instructions
  • Targets: 5 colored spheres (red, green, blue, yellow, pink)
  • Success Criteria: Reach within 0.1 m of the correct target (see the check sketched below)
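
As a concrete reading of the success criterion, the check below is a sketch; the positions are assumed to be planar coordinates from the simulator:

# Sketch: an episode succeeds if the robot ends within 0.1 m of the commanded target.
import numpy as np

def is_success(robot_xy, target_xy, threshold=0.1):
    return float(np.linalg.norm(np.asarray(robot_xy) - np.asarray(target_xy))) < threshold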

Key Insights

  1. Semantic Grounding: Pretrained VLMs excel at connecting language descriptions to visual observations
  2. Spatial Limitations: Frozen embeddings struggle with long-horizon planning and spatial reasoning
  3. Prompt Engineering: Including relative spatial cues in the instruction significantly improves performance (illustrated below)
  4. Embedding Dimensionality: Higher-dimensional embeddings (SigLIP: 1152D) outperform lower-dimensional ones
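
To make the prompt-engineering point concrete, the template below contrasts an instruction with and without a relative spatial cue; the wording is hypothetical and not the paper's exact prompts:

# Hypothetical instruction templates (not the paper's exact wording): the second form
# adds the kind of relative spatial cue that the Key Insights refer to.
from typing import Optional

def make_instruction(color: str, relative_cue: Optional[str] = None) -> str:
    base = f"Navigate to the {color} sphere"
    return f"{base}, {relative_cue}." if relative_cue else f"{base}."

# make_instruction("red")                          -> "Navigate to the red sphere."
# make_instruction("red", "slightly to your left") -> "Navigate to the red sphere, slightly to your left."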

Future Work

  • Hybrid architectures combining frozen embeddings with lightweight spatial memory
  • Data-efficient adaptation techniques to bridge the efficiency gap
  • Testing in more complex environments with obstacles and natural language variation
  • Integration with world models for better spatial reasoning

Citation

@misc{subedi2025pretrainedvisionlanguageembeddingsguide,
 title={Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?}, 
 author={Nitesh Subedi and Adam Haroon and Shreyan Ganguly and Samuel T. K. Tetteh and Prajwal Koirala and Cody Fleming and Soumik Sarkar},
 year={2025},
 eprint={2506.14507},
 archivePrefix={arXiv},
 primaryClass={cs.RO},
 url={https://arxiv.org/abs/2506.14507}, 
}

๐Ÿ™ Acknowledgments

This work is funded by NSF-USDA COALESCE grant #2021-67021-34418. Special thanks to the Iowa State University Mechanical Engineering Department for their support.

Contributors

*Equal contribution

License

This project is licensed under the MIT License - see the LICENSE file for details.

Links


For questions or issues, please open a GitHub issue or contact the authors.
