
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation


Talk2DINO is an open-vocabulary segmentation architecture that combines the localized and semantically rich patch-level features of DINOv2 with the multimodal understanding capabilities of CLIP. This is achieved by learning a projection from the CLIP text encoder to the embedding space of DINOv2 using only image-caption pairs and exploiting the self-attention properties of DINOv2 to understand which part of the image has to be aligned to the corresponding caption.

Updates

☄️ 10/2025: Added support for DINOv3 🦖🦖🦕!
🚀 10/2025: Gradio demo is now live! Try Talk2DINO interactively on the Hugging Face Spaces 🦖
🤗 09/2025: Talk2DINO ViT-B and Talk2DINO ViT-L are now available on the Hugging Face Hub 🎉
- Talk2DINO-ViT-B
- Talk2DINO-ViT-L
🔥 06/2025: "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation" has been accepted to ICCV 2025 in Honolulu! 🌺🌴🏖️

Results

[Figure: five qualitative examples, each shown as Image | Ground Truth | FreeDA | ProxyCLIP | CLIP-DINOiser | Ours (Talk2DINO)]


Installation

1️⃣ Hugging Face Interface (for inference)

To quickly run Talk2DINO on your own images:

# Clone the repository
git clone https://github.com/lorebianchi98/Talk2DINO.git
cd Talk2DINO
# Install dependencies
pip install -r requirements.txt
# Install PyTorch (CUDA 12.6 example)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

This setup allows you to load Hugging Face models (Talk2DINO-ViTB / Talk2DINO-ViTL) and generate segmentation masks without setting up MMCV or MMSegmentation.

2️⃣ MMCV Interface (for evaluation & full pipelines)

If you want to perform benchmark evaluation using MMSegmentation:

# Create a dedicated environment
conda create --name talk2dino python=3.10 -c conda-forge
conda activate talk2dino
# Install C++/CUDA compilers
conda install -c conda-forge "gxx_linux-64=11.*" "gcc_linux-64=11.*"
# Install CUDA toolkit and cuDNN
conda install -c nvidia/label/cuda-11.7.0 cuda 
conda install -c nvidia/label/cuda-11.7.0 cuda-nvcc
conda install -c conda-forge cudnn cudatoolkit=11.7.0
# Install PyTorch 2.1 + CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
# Install remaining dependencies
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
# Install MMCV (compatible with PyTorch 2.1 + CUDA 11.8)
pip install mmcv-full==1.7.2 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1.0/index.html
# Install MMSegmentation
pip install mmsegmentation==0.30.0
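
After installing, it can be useful to confirm that the pinned versions actually ended up in the environment. The snippet below is a minimal sketch that uses only the Python standard library; `PINNED`, `installed_version`, and `check_environment` are hypothetical helper names that are not part of this repository, and the version numbers are copied from the commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the MMCV installation commands above.
PINNED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "mmcv-full": "1.7.2",
    "mmsegmentation": "0.30.0",
}


def installed_version(package):
    """Return the installed version without a local suffix such as '+cu118', or None."""
    try:
        return version(package).split("+")[0]
    except PackageNotFoundError:
        return None


def check_environment(pinned=PINNED):
    """Map each package to (found_version, matches_pin)."""
    report = {}
    for package, expected in pinned.items():
        found = installed_version(package)
        report[package] = (found, found == expected)
    return report


for package, (found, ok) in check_environment().items():
    status = "ok" if ok else f"expected {PINNED[package]}"
    print(f"{package}: {found or 'not installed'} ({status})")
```

A mismatch here does not necessarily mean the setup is broken, but it is a reasonable first thing to check if MMCV or MMSegmentation fails to import.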

Mapping CLIP Text Embeddings to DINOv2 space with Talk2DINO

Talk2DINO enables you to align CLIP text embeddings with the patch-level embedding space of DINOv2.
You can try it in two ways:

🔹 Using the Hugging Face Hub

Easily load pretrained models with the HF interface:

from transformers import AutoModel
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModel.from_pretrained("lorebianchi98/Talk2DINO-ViTB").to(device).eval()
with torch.no_grad():
    text_embed = model.encode_text("a pikachu")

🔹 Using the Original Talk2DINO Interface

If you prefer local configs and weights:

import clip
from src.model import ProjectionLayer
import torch, os
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load Talk2DINO projection layer
proj_name = 'vitb_mlp_infonce'
config_path = os.path.join("configs", f"{proj_name}.yaml")
weights_path = os.path.join("weights", f"{proj_name}.pth")
talk2dino = ProjectionLayer.from_config(config_path)
talk2dino.load_state_dict(torch.load(weights_path, map_location=device))
talk2dino.to(device)
# Load CLIP model
clip_model, _ = clip.load("ViT-B/16", device=device, jit=False)
tokenizer = clip.tokenize
# Example: Tokenize and project text features
texts = ["a cat"]
text_tokens = tokenizer(texts).to(device)
text_features = clip_model.encode_text(text_tokens)
projected_text_features = talk2dino.project_clip_txt(text_features)
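
Once the text features live in the DINOv2 space, one plausible way to use them is to compare them with DINOv2 patch features and assign each patch to the most similar text prompt. The sketch below illustrates that idea on random tensors only. `segment_by_similarity` is a hypothetical helper, not a function from this repository, and the shapes (a 32x32 patch grid for a 448x448 input with patch size 14, embedding size 768 for ViT-B) are assumptions made for the illustration. The actual pipeline in `demo.py` may differ.

```python
import torch
import torch.nn.functional as F


def segment_by_similarity(patch_feats, text_feats):
    """Assign each patch to its most similar text embedding.

    patch_feats: (H, W, D) patch features; text_feats: (C, D) text embeddings.
    Returns an (H, W) tensor of class indices in [0, C).
    """
    patches = F.normalize(patch_feats, dim=-1)  # unit-norm so the dot product is cosine similarity
    texts = F.normalize(text_feats, dim=-1)
    similarity = torch.einsum("hwd,cd->hwc", patches, texts)
    return similarity.argmax(dim=-1)


torch.manual_seed(0)
patch_feats = torch.randn(32, 32, 768)  # stand-in for DINOv2 patch features (assumed shape)
text_feats = torch.randn(3, 768)        # stand-in for projected_text_features of 3 prompts
mask = segment_by_similarity(patch_feats, text_feats)
print(mask.shape)
```

The resulting low-resolution map would still need to be upsampled to the image size before it can be used as a segmentation mask.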

Feature Extraction

To speed up training, we use pre-extracted features. Follow these steps:

Download the 2014 images and annotations from the COCO website.

Run the following commands to extract features:

mkdir ../coco2014_b14
python dino_extraction_v2.py --ann_path ../coco/captions_val2014.json --out_path ../coco2014_b14/val.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
python dino_extraction_v2.py --ann_path ../coco/captions_train2014.json --out_path ../coco2014_b14/train.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
python text_features_extraction.py --ann_path ../coco2014_b14/train.pth
python text_features_extraction.py --ann_path ../coco2014_b14/val.pth
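
If you extract features for more than one setting, it may be convenient to generate the commands instead of typing them. The following is a minimal sketch that rebuilds the four commands above from a few parameters. `extraction_commands` is a hypothetical helper, not part of this repository, and it only uses the flags shown above.

```python
import shlex


def extraction_commands(coco_dir="../coco", out_dir="../coco2014_b14",
                        model="dinov2_vitb14_reg", size=448):
    """Build the feature-extraction commands as argument lists."""
    commands = []
    # Visual features: validation split first, then training split, as above.
    for split in ("val", "train"):
        commands.append([
            "python", "dino_extraction_v2.py",
            "--ann_path", f"{coco_dir}/captions_{split}2014.json",
            "--out_path", f"{out_dir}/{split}.pth",
            "--model", model,
            "--resize_dim", str(size),
            "--crop_dim", str(size),
            "--extract_avg_self_attn",
            "--extract_disentangled_self_attn",
        ])
    # Text features are extracted from the files written in the previous step.
    for split in ("train", "val"):
        commands.append([
            "python", "text_features_extraction.py",
            "--ann_path", f"{out_dir}/{split}.pth",
        ])
    return commands


for command in extraction_commands():
    print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` to execute it from the repository root.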

Training

To train the model, use the following command (this example runs training for the ViT-Base configuration):

python train.py --model configs/vitb_mlp_infonce.yaml --train_dataset ../coco2014_b14/train.pth --val_dataset ../coco2014_b14/val.pth

Evaluation

This section is adapted from GroupViT, TCL, and FreeDA. The segmentation datasets should be organized as follows:

data
├── cityscapes
│ ├── leftImg8bit
│ │ ├── train
│ │ ├── val
│ ├── gtFine
│ │ ├── train
│ │ ├── val
├── VOCdevkit
│ ├── VOC2012
│ │ ├── JPEGImages
│ │ ├── SegmentationClass
│ │ ├── ImageSets
│ │ │ ├── Segmentation
│ ├── VOC2010
│ │ ├── JPEGImages
│ │ ├── SegmentationClassContext
│ │ ├── ImageSets
│ │ │ ├── SegmentationContext
│ │ │ │ ├── train.txt
│ │ │ │ ├── val.txt
│ │ ├── trainval_merged.json
│ ├── VOCaug
│ │ ├── dataset
│ │ │ ├── cls
├── ade
│ ├── ADEChallengeData2016
│ │ ├── annotations
│ │ │ ├── training
│ │ │ ├── validation
│ │ ├── images
│ │ │ ├── training
│ │ │ ├── validation
├── coco_stuff164k
│ ├── images
│ │ ├── train2017
│ │ ├── val2017
│ ├── annotations
│ │ ├── train2017
│ │ ├── val2017

Please download and set up the PASCAL VOC, PASCAL Context, COCO-Stuff164k, Cityscapes, and ADE20k datasets following the MMSegmentation data preparation document.
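
Before launching an evaluation, it can help to confirm that the folders are where the configs expect them. The snippet below is a minimal sketch that checks the directories from the tree above. `EXPECTED_DIRS` and `missing_dirs` are hypothetical names that are not part of this repository, and files such as `train.txt` or `trainval_merged.json` are not checked.

```python
from pathlib import Path

# Directories taken from the dataset tree above, relative to the data root.
EXPECTED_DIRS = [
    "cityscapes/leftImg8bit/train",
    "cityscapes/leftImg8bit/val",
    "cityscapes/gtFine/train",
    "cityscapes/gtFine/val",
    "VOCdevkit/VOC2012/JPEGImages",
    "VOCdevkit/VOC2012/SegmentationClass",
    "VOCdevkit/VOC2012/ImageSets/Segmentation",
    "VOCdevkit/VOC2010/JPEGImages",
    "VOCdevkit/VOC2010/SegmentationClassContext",
    "VOCdevkit/VOC2010/ImageSets/SegmentationContext",
    "VOCdevkit/VOCaug/dataset/cls",
    "ade/ADEChallengeData2016/annotations/training",
    "ade/ADEChallengeData2016/annotations/validation",
    "ade/ADEChallengeData2016/images/training",
    "ade/ADEChallengeData2016/images/validation",
    "coco_stuff164k/images/train2017",
    "coco_stuff164k/images/val2017",
    "coco_stuff164k/annotations/train2017",
    "coco_stuff164k/annotations/val2017",
]


def missing_dirs(root="data"):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


missing = missing_dirs("data")
print(f"{len(missing)} of {len(EXPECTED_DIRS)} expected folders are missing")
for directory in missing:
    print(f"  data/{directory}")
```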

The COCO-Object dataset uses only the object classes from COCO-Stuff164k, collected from its instance segmentation annotations. Run the following command to convert the instance segmentation annotations to semantic segmentation annotations:

python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/

To evaluate the model on open-vocabulary segmentation benchmarks, use the src/open_vocabulary_segmentation/main.py script. Select the configuration that matches the model, benchmark, and PAMR settings. The available models are [vitb, vitl], and the available benchmarks are [ade, cityscapes, voc, voc_bg, context, context_bg, coco_object, stuff]. The commands below reproduce the results reported in the paper for the ViT-Base architecture:

# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml
# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml
# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml
# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_pamr.yml
# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml
# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_pamr.yml
# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml
# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_pamr.yml

The corresponding evaluations for the ViT-Large architecture are:

# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml
# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml
# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml
# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_vitl_pamr.yml
# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml
# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_vitl_pamr.yml
# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml
# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_vitl_pamr.yml
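
The commands above follow a regular naming pattern, so they can also be generated. The snippet below is a minimal sketch that rebuilds them from a model and a benchmark name. `eval_command` and the constants around it are hypothetical and not part of this repository. The only irregularity it encodes is the one visible in the ViT-Large list: the voc_bg, context_bg, and coco_object benchmarks use a `_vitl_pamr` base config there.

```python
CONFIG_ROOT = "src/open_vocabulary_segmentation/configs"
MODELS = ("vitb", "vitl")
BENCHMARKS = ("ade", "cityscapes", "voc", "voc_bg",
              "context", "context_bg", "stuff", "coco_object")
# Benchmarks whose ViT-Large commands above use a model-specific base config.
VITL_SPECIFIC_BASE = {"voc_bg", "context_bg", "coco_object"}


def eval_command(model, benchmark):
    """Build the evaluation command for one model/benchmark pair as an argument list."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark}")
    eval_cfg = f"{CONFIG_ROOT}/{benchmark}/dinotext_{benchmark}_{model}_mlp_infonce.yml"
    suffix = "_vitl" if model == "vitl" and benchmark in VITL_SPECIFIC_BASE else ""
    eval_base = f"{CONFIG_ROOT}/{benchmark}/eval_{benchmark}{suffix}_pamr.yml"
    return [
        "python", "-m", "torch.distributed.run",
        "src/open_vocabulary_segmentation/main.py",
        "--eval", "--eval_cfg", eval_cfg, "--eval_base", eval_base,
    ]


for benchmark in BENCHMARKS:
    print(" ".join(eval_command("vitl", benchmark)))
```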

Demo

We provide two simple entry points for trying out Talk2DINO:

hf_demo.ipynb – an interactive notebook showing how to generate segmentation masks directly using the Hugging Face interface.
demo.py – a lightweight script for running inference on a custom image with your own textual categories. Run:

python demo.py --input custom_input_image --output custom_output_seg [--with_background] --textual_categories category_1,category_2,..

Example:

python demo.py --input assets/pikachu.png --output pikachu_seg.png --textual_categories pikachu,traffic_sign,forest,route

Result: the segmentation is saved to pikachu_seg.png.

Acknowledgments

Thanks to AyoubDamak for contributing to the updated installation instructions.

Reference

If you found this code useful, please cite the following paper:

@inproceedings{barsellotti2025talking,
 title={Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation},
 author={Barsellotti, Luca and Bianchi, Lorenzo and Messina, Nicola and Carrara, Fabio and Cornia, Marcella and Baraldi, Lorenzo and Falchi, Fabrizio and Cucchiara, Rita},
 booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
 pages={22025--22035},
 year={2025}
}
