lorebianchi98/Talk2DINO

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Talk2DINO is an open-vocabulary segmentation architecture that combines the localized and semantically rich patch-level features of DINOv2 with the multimodal understanding capabilities of CLIP. This is achieved by learning a projection from the CLIP text encoder to the embedding space of DINOv2 using only image-caption pairs and exploiting the self-attention properties of DINOv2 to understand which part of the image has to be aligned to the corresponding caption.

Updates

  • ☄️ 10/2025: Added support for DINOv3 🦖🦖🦕!
  • 🚀 10/2025: Gradio demo is now live! Try Talk2DINO interactively on Hugging Face Spaces 🦖
  • 🤗 09/2025: Talk2DINO ViT-B and Talk2DINO ViT-L are now available on the Hugging Face Hub 🎉
  • 🔥 06/2025: "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation" has been accepted to ICCV 2025 in Honolulu! 🌺🌴🏖️

Results

Qualitative comparison (images not reproduced here). Columns: Image, Ground Truth, FreeDA, ProxyCLIP, CLIP-DINOiser, Ours (Talk2DINO).

Installation

1️⃣ Hugging Face Interface (for inference)

To quickly run Talk2DINO on your own images:

# Clone the repository
git clone https://github.com/lorebianchi98/Talk2DINO.git
cd Talk2DINO
# Install dependencies
pip install -r requirements.txt
# Install PyTorch (CUDA 12.6 example)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

This setup allows you to load Hugging Face models (Talk2DINO-ViTB / Talk2DINO-ViTL) and generate segmentation masks without setting up MMCV or MMSegmentation.


2️⃣ MMCV Interface (for evaluation & full pipelines)

If you want to perform benchmark evaluation using MMSegmentation:

# Create a dedicated environment
conda create --name talk2dino python=3.10 -c conda-forge
conda activate talk2dino
# Install C++/CUDA compilers
conda install -c conda-forge "gxx_linux-64=11.*" "gcc_linux-64=11.*"
# Install CUDA toolkit and cuDNN
conda install -c nvidia/label/cuda-11.7.0 cuda 
conda install -c nvidia/label/cuda-11.7.0 cuda-nvcc
conda install -c conda-forge cudnn cudatoolkit=11.7.0
# Install PyTorch 2.1 + CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
# Install remaining dependencies
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
# Install MMCV (compatible with PyTorch 2.1 + CUDA 11.8)
pip install mmcv-full==1.7.2 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1.0/index.html
# Install MMSegmentation
pip install mmsegmentation==0.30.0
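
Because this setup pins several package versions, it can help to confirm what actually got installed before launching an evaluation. The sketch below compares installed versions against the pins from the commands above using the standard-library importlib.metadata module. The names PINS, installed_version, and check_pins are illustrative and not part of the repository.

```python
from importlib import metadata

# Pins copied from the install commands above (pip distribution names).
PINS = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "mmcv-full": "1.7.2",
    "mmsegmentation": "0.30.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, get_version=installed_version):
    """Return {name: (expected, found)} for every pin that is not satisfied."""
    problems = {}
    for name, expected in pins.items():
        found = get_version(name)
        # Wheels from the PyTorch index carry a local suffix such as "+cu118".
        base = found.split("+")[0] if found else None
        if base != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result from check_pins(PINS) means every pinned package is present at the expected version.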

Mapping CLIP Text Embeddings to DINOv2 space with Talk2DINO

Talk2DINO enables you to align CLIP text embeddings with the patch-level embedding space of DINOv2.
You can try it in two ways:

🔹 Using the Hugging Face Hub

Easily load pretrained models with the HF interface:

from transformers import AutoModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load the pretrained ViT-B model from the Hub and switch to inference mode
model = AutoModel.from_pretrained("lorebianchi98/Talk2DINO-ViTB").to(device).eval()
# Encode a text prompt without tracking gradients
with torch.no_grad():
    text_embed = model.encode_text("a pikachu")

🔹 Using the Original Talk2DINO Interface

If you prefer local configs and weights:

import os

import clip
import torch
from src.model import ProjectionLayer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load Talk2DINO projection layer
proj_name = 'vitb_mlp_infonce'
config_path = os.path.join("configs", f"{proj_name}.yaml")
weights_path = os.path.join("weights", f"{proj_name}.pth")
talk2dino = ProjectionLayer.from_config(config_path)
talk2dino.load_state_dict(torch.load(weights_path, map_location=device))
talk2dino.to(device)
# Load CLIP model
clip_model, _ = clip.load("ViT-B/16", device=device, jit=False)
tokenizer = clip.tokenize
# Example: tokenize and project text features (no gradients needed for inference)
texts = ["a cat"]
text_tokens = tokenizer(texts).to(device)
with torch.no_grad():
    text_features = clip_model.encode_text(text_tokens)
    projected_text_features = talk2dino.project_clip_txt(text_features)
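
Once text embeddings live in the DINOv2 space, they can be compared directly with patch features. The sketch below illustrates the general idea with cosine similarity and a per-patch argmax, using plain Python lists and made-up 3-dimensional vectors. The function names and toy values are assumptions for illustration only; the repository's actual segmentation pipeline may differ, for example in how masks are refined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_patches(patch_feats, text_feats):
    """Assign each patch the index of its most similar text embedding."""
    labels = []
    for patch in patch_feats:
        sims = [cosine(patch, text) for text in text_feats]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy stand-ins: in practice these would be DINOv2 patch features and
# projected text features with the same dimensionality.
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two hypothetical categories
patch_feats = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.7, 0.6, 0.0]]
print(label_patches(patch_feats, text_feats))  # [0, 1, 0]
```

Reshaping the resulting labels to the patch grid gives a coarse segmentation map with one category index per patch.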

Feature Extraction

To speed up training, we use pre-extracted features. Follow these steps:

  1. Download the 2014 images and annotations from the COCO website.
  2. Run the following commands to extract features:
    mkdir ../coco2014_b14
    python dino_extraction_v2.py --ann_path ../coco/captions_val2014.json --out_path ../coco2014_b14/val.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
    python dino_extraction_v2.py --ann_path ../coco/captions_train2014.json --out_path ../coco2014_b14/train.pth --model dinov2_vitb14_reg --resize_dim 448 --crop_dim 448 --extract_avg_self_attn --extract_disentangled_self_attn
    python text_features_extraction.py --ann_path ../coco2014_b14/train.pth
    python text_features_extraction.py --ann_path ../coco2014_b14/val.pth
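
The extraction commands differ only in the split name, so they can be generated from a small helper. The following is a minimal sketch that builds the same argument lists as the commands above. The helper names build_extraction_cmd and build_text_cmd are hypothetical, and the default paths assume the ../coco and ../coco2014_b14 layout used in those commands.

```python
import sys


def build_extraction_cmd(split, ann_dir="../coco", out_dir="../coco2014_b14",
                         model="dinov2_vitb14_reg", size=448):
    """Build the dino_extraction_v2.py argument list for one COCO 2014 split."""
    return [
        sys.executable, "dino_extraction_v2.py",
        "--ann_path", f"{ann_dir}/captions_{split}2014.json",
        "--out_path", f"{out_dir}/{split}.pth",
        "--model", model,
        "--resize_dim", str(size),
        "--crop_dim", str(size),
        "--extract_avg_self_attn",
        "--extract_disentangled_self_attn",
    ]


def build_text_cmd(split, out_dir="../coco2014_b14"):
    """Build the text_features_extraction.py argument list for one split."""
    return [sys.executable, "text_features_extraction.py",
            "--ann_path", f"{out_dir}/{split}.pth"]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(cmd, check=True)
    # from the repository root to actually execute them.
    for split in ("val", "train"):
        print(" ".join(build_extraction_cmd(split)))
    for split in ("train", "val"):
        print(" ".join(build_text_cmd(split)))
```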

Training

To train the model, use the following command (this example runs training for the ViT-Base configuration):

python train.py --model configs/vitb_mlp_infonce.yaml --train_dataset ../coco2014_b14/train.pth --val_dataset ../coco2014_b14/val.pth

Evaluation

This section is adapted from GroupViT, TCL, and FreeDA. The segmentation datasets should be organized as follows:

data
├── cityscapes
│   ├── leftImg8bit
│   │   ├── train
│   │   ├── val
│   ├── gtFine
│   │   ├── train
│   │   ├── val
├── VOCdevkit
│   ├── VOC2012
│   │   ├── JPEGImages
│   │   ├── SegmentationClass
│   │   ├── ImageSets
│   │   │   ├── Segmentation
│   ├── VOC2010
│   │   ├── JPEGImages
│   │   ├── SegmentationClassContext
│   │   ├── ImageSets
│   │   │   ├── SegmentationContext
│   │   │   │   ├── train.txt
│   │   │   │   ├── val.txt
│   │   ├── trainval_merged.json
│   ├── VOCaug
│   │   ├── dataset
│   │   │   ├── cls
├── ade
│   ├── ADEChallengeData2016
│   │   ├── annotations
│   │   │   ├── training
│   │   │   ├── validation
│   │   ├── images
│   │   │   ├── training
│   │   │   ├── validation
├── coco_stuff164k
│   ├── images
│   │   ├── train2017
│   │   ├── val2017
│   ├── annotations
│   │   ├── train2017
│   │   ├── val2017

Please download and set up the PASCAL VOC, PASCAL Context, COCO-Stuff164k, Cityscapes, and ADE20k datasets following the MMSegmentation data preparation document.
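
Before running an evaluation, it can be useful to confirm that the data folder matches the expected layout. The sketch below checks a subset of the directories from the tree above, focusing on the validation splits. The names EXPECTED_DIRS and missing_dirs are illustrative and not part of the repository, and the list can be extended to cover the remaining entries.

```python
from pathlib import Path

# Directories taken from the layout above, relative to the data root.
EXPECTED_DIRS = [
    "cityscapes/leftImg8bit/val",
    "cityscapes/gtFine/val",
    "VOCdevkit/VOC2012/JPEGImages",
    "VOCdevkit/VOC2012/SegmentationClass",
    "VOCdevkit/VOC2010/SegmentationClassContext",
    "ade/ADEChallengeData2016/images/validation",
    "ade/ADEChallengeData2016/annotations/validation",
    "coco_stuff164k/images/val2017",
    "coco_stuff164k/annotations/val2017",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("data")
    print("layout OK" if not missing else "missing: " + ", ".join(missing))
```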

The COCO-Object dataset uses only the object classes of COCO-Stuff164k, obtained by collecting the instance segmentation annotations. Run the following command to convert instance segmentation annotations to semantic segmentation annotations:

python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/

To evaluate the model on open-vocabulary segmentation benchmarks, use the src/open_vocabulary_segmentation/main.py script. Select the configuration that matches the model, benchmark, and PAMR settings. The available models are [vitb, vitl], and the available benchmarks are [ade, cityscapes, voc, voc_bg, context, context_bg, coco_object, stuff]. The following commands reproduce the results reported in the paper for the ViT-Base architecture:

# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml
# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml
# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml
# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_pamr.yml
# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml
# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_pamr.yml
# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml
# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitb_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_pamr.yml

For the ViT-Large architecture, the corresponding evaluations are:

# ADE20K
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/ade/dinotext_ade_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/ade/eval_ade_pamr.yml
# Cityscapes
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/cityscapes/dinotext_cityscapes_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/cityscapes/eval_cityscapes_pamr.yml
# Pascal VOC (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc/dinotext_voc_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc/eval_voc_pamr.yml
# Pascal VOC (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/voc_bg/dinotext_voc_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/voc_bg/eval_voc_bg_vitl_pamr.yml
# Pascal Context (without background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context/dinotext_context_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context/eval_context_pamr.yml
# Pascal Context (with background)
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/context_bg/dinotext_context_bg_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/context_bg/eval_context_bg_vitl_pamr.yml
# COCOStuff
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/stuff/dinotext_stuff_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/stuff/eval_stuff_pamr.yml
# COCO Object
python -m torch.distributed.run src/open_vocabulary_segmentation/main.py --eval --eval_cfg src/open_vocabulary_segmentation/configs/coco_object/dinotext_coco_object_vitl_mlp_infonce.yml --eval_base src/open_vocabulary_segmentation/configs/coco_object/eval_coco_object_vitl_pamr.yml
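
The commands above follow a regular naming pattern, so they can also be generated programmatically. The sketch below rebuilds them from a model name and a benchmark name. The helper eval_command is hypothetical, and the set of benchmarks that use a "_vitl" base config is inferred only from the ViT-Large commands listed above, so it should be checked against the config files actually present in the repository.

```python
CONFIG_ROOT = "src/open_vocabulary_segmentation/configs"
BENCHMARKS = ["ade", "cityscapes", "voc", "voc_bg", "context", "context_bg",
              "coco_object", "stuff"]
# Benchmarks whose ViT-L command above uses a "_vitl" base config.
VITL_BASE = {"voc_bg", "context_bg", "coco_object"}


def eval_command(model, benchmark):
    """Return the evaluation command for one model/benchmark pair."""
    if model not in ("vitb", "vitl"):
        raise ValueError(f"unknown model: {model}")
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark}")
    cfg_dir = f"{CONFIG_ROOT}/{benchmark}"
    suffix = "_vitl" if model == "vitl" and benchmark in VITL_BASE else ""
    return [
        "python", "-m", "torch.distributed.run",
        "src/open_vocabulary_segmentation/main.py", "--eval",
        "--eval_cfg", f"{cfg_dir}/dinotext_{benchmark}_{model}_mlp_infonce.yml",
        "--eval_base", f"{cfg_dir}/eval_{benchmark}{suffix}_pamr.yml",
    ]


if __name__ == "__main__":
    # Print every ViT-Base evaluation command, one per benchmark.
    for benchmark in BENCHMARKS:
        print(" ".join(eval_command("vitb", benchmark)))
```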

Demo

We provide two simple entry points for trying out Talk2DINO:

  • hf_demo.ipynb – an interactive notebook showing how to generate segmentation masks directly using the Hugging Face interface.
  • demo.py – a lightweight script for running inference on a custom image with your own textual categories. Run it as follows:
python demo.py --input custom_input_image --output custom_output_seg [--with_background] --textual_categories category_1,category_2,..

Example:

python demo.py --input assets/pikachu.png --output pikachu_seg.png --textual_categories pikachu,traffic_sign,forest,route

Result: (output segmentation image not reproduced here)
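
To run the demo over several images, the command line can be assembled in a small helper. This is a minimal sketch that uses only the flags shown in the usage line above. The function demo_command and the "<name>_seg.png" output naming are assumptions for illustration, and category names are assumed not to contain commas.

```python
from pathlib import Path


def demo_command(image, out_dir, categories, with_background=False):
    """Build a demo.py argument list for one image."""
    image = Path(image)
    output = Path(out_dir) / f"{image.stem}_seg.png"
    cmd = ["python", "demo.py", "--input", str(image), "--output", str(output)]
    if with_background:
        cmd.append("--with_background")
    cmd += ["--textual_categories", ",".join(categories)]
    return cmd


if __name__ == "__main__":
    cats = ["pikachu", "traffic_sign", "forest", "route"]
    # Pass the list to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(demo_command("assets/pikachu.png", ".", cats)))
```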

Acknowledgments

Thanks to AyoubDamak for contributing to the updated installation instructions.

Reference

If you found this code useful, please cite the following paper:

@inproceedings{barsellotti2025talking,
 title={Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation},
 author={Barsellotti, Luca and Bianchi, Lorenzo and Messina, Nicola and Carrara, Fabio and Cornia, Marcella and Baraldi, Lorenzo and Falchi, Fabrizio and Cucchiara, Rita},
 booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
 pages={22025--22035},
 year={2025}
}

About

[ICCV 2025] Official repository of the paper "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation"
