Rohit Girdhar

Research Scientist

AMI Labs

I am a Research Scientist at AMI Labs. My current research focuses on multimodal understanding, generation and world modeling. I obtained a PhD from Carnegie Mellon University (here’s a link to my dissertation), where I worked on learning from and understanding videos. I was previously part of Meta Superintelligence Labs and Facebook AI Research (FAIR) at Meta, and have spent time at DeepMind, Adobe and Facebook as an intern. See here for a formal bio.

News

Jun 2025
EgoVis 2023-24 Distinguished Paper Award for HierVL.
Oct 2024
Mark Zuckerberg announced our work on MovieGen, the new state-of-the-art media generation and editing system, outperforming Sora, Emu Video and more! Covered in NY Times, FT, Forbes, WIRED, Bloomberg, TechCrunch, etc.
Jul 2024
Mark Zuckerberg announced Llama 3.1, along with our state-of-the-art video recognition capabilities!
Jun 2024
Invited panelist for the AI for Content Creation (AI4CC) workshop at CVPR 2024 (along with Cynthia Lu and Robin Rombach).
Jun 2024
LaViLa and Ego4D among the winners of the EgoVis 2022-23 Distinguished Paper Awards!
Apr 2024
/animate functionality based on Emu Video is publicly released! Try it out to animate images generated using /imagine on meta.ai!
Apr 2024
Presented Emu Video at RunwayML’s inaugural Research and Art (RNA) event.
Feb 2024
Invited judge for the MIT Filmmaking Hackathon 2024.
Nov 2023
Mark Zuckerberg announced our state-of-the-art video generation work, Emu Video! Also see coverage by TechCrunch, TheVerge, VentureBeat, Reuters, and others!

Jun 2023
Giving a talk at HVU Workshop and presenting 5 papers at CVPR 2023!

May 2023
Mark Zuckerberg announced our multimodal embedding work, ImageBind! Also see coverage by TheVerge, Engadget, SiliconANGLE, maginative and others!

Jun 2022
Presenting 3 papers at CVPR 2022, including Omnivore, a single model that obtains state-of-the-art results across 3 different modalities: images, videos and single-view 3D!
Oct 2021
We announced Ego4D, the largest egocentric video dataset to date! See this video for a quick intro, and see coverage from TechCrunch, TheVerge, Axios, Fast Company, and others!

Education

PhD in Robotics, 2019

Carnegie Mellon University, Pittsburgh PA
MS in Robotics, 2016

Carnegie Mellon University, Pittsburgh PA
B. Tech. in Computer Science, 2014

IIIT Hyderabad, India

Experience

AMI Labs · Research Scientist

New York · 2026 -- Present
Meta · Research Scientist

New York · 2019 -- 2026
DeepMind · Research Scientist Intern

London · Summer 2018
Facebook · Research Scientist Intern

Menlo Park · Summer 2017
Adobe · Research Scientist Intern

San Francisco · Summer 2016
Facebook · Software Engineering Intern

Menlo Park · Summer 2013

Highlights

Videos powered by MovieGen and Emu Video!

Projects and Publications

Kumar Ashutosh, Xudong Wang, Xi Yin, Kristen Grauman, Adam Polyak, Ishan Misra, Rohit Girdhar

January, 2026 In arXiv, 2026

Human detectors are surprisingly powerful reward models

Using human detection confidence as a simple yet effective reward model to improve human motion in video generation.

PDF Cite Video

Bolin Lai, Xudong Wang, Saketh Rambhatla, James M. Rehg, Zsolt Kira, Rohit Girdhar, Ishan Misra

November, 2025 In CVPR, 2026

Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective

Introducing FreqWarm, a plug-and-play frequency warm-up curriculum that improves high-dimensional latent diffusion by increasing early-stage exposure to high-frequency signals.

PDF Cite

Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra

January, 2025 In arXiv, 2025

Diffusion Autoencoders are Scalable Image Tokenizers

Simplified image tokenization using diffusion

PDF Cite Code

Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar

January, 2025 In ICML, 2025

LLMs can see and hear without any training

Pure text-only LLMs can use off-the-shelf multimodal embedding models to do various multimodal tasks!

PDF Cite Code

Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin

December, 2024 In CVPR, 2025

MotiF: Making Text Count in Image Animation with Motion Focal Loss

Using flow to improve motion in video generation

PDF Cite

MovieGen Team (Core-Contributor)

October, 2024 In arXiv, 2024

Movie Gen: A Cast of Media Foundation Models

State-of-the-Art Video (+Audio) Generation Model

PDF Cite Video

Llama 3 Team (Co-led the Video Recognition Efforts)

July, 2024 In arXiv, 2024

The Llama 3 Herd of Models

State-of-the-Art open-source LLM with multimodal capabilities

PDF Cite Code

Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

April, 2024 In CVPR, 2024

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.

PDF Cite

Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra

February, 2024 In CVPR, 2024

InstanceDiffusion: Instance-level Control for Image Generation

SOTA instance-conditioned diffusion model for image generation.

PDF Cite Code

Sachit Menon, Ishan Misra, Rohit Girdhar

December, 2023 In CVPR, 2024

Generating Illustrated Instructions

Introducing a new task: generating illustrated instructions for the task you want to solve, along with an LLM + diffusion model-based solution.

PDF Cite Code

Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi

November, 2023 In arXiv, 2023

Motion-Conditioned Image Animation for Video Editing

Image animation FTW again! SOTA video editing results by animating the first frame with motion conditioning.

PDF Cite

Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

November, 2023 In CVPR, 2024

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

SOTA unsupervised video segmentation using CutLER.

PDF Cite Code

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

November, 2023 In ECCV, 2024

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

A simple and effective approach to high-quality video generation by learning to animate high quality images.

PDF Cite Demo

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

May, 2023 In CVPR, 2023 (Highlighted Presentation)

ImageBind: One Embedding Space To Bind Them All

One embedding space for 6 different modalities, enabling zero-shot recognition on all of them!

PDF Cite Video Code

Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

March, 2023 In ICCV, 2023

The effectiveness of MAE pre-pretraining for billion-scale pretraining

Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.

PDF Cite Code

Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra

January, 2023 In CVPR, 2023

CutLER: Cut and Learn for Unsupervised Object Detection and Instance Segmentation

Discovering objects using DINO features, and learning an unsupervised detection + segmentation model

PDF Cite Code

Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

January, 2023 In CVPR, 2023 (Highlighted Presentation)

HierVL: Learning Hierarchical Video-Language Embeddings

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.

PDF Cite Video Code

Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

December, 2022 In CVPR, 2023 (Highlighted Presentation)

Learning Video Representations from Large Language Models

Leveraging LLMs to auto-annotate videos for representation learning.

PDF Cite Colab Code

Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

June, 2022 In CVPR, 2023

OmniMAE: Single Model Masked Pretraining on Images and Videos

Single self-supervised representation for images and videos.

PDF Cite Video Code

Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens Van Der Maaten, Armand Joulin, Ishan Misra

June, 2022 In CVPR, 2022 (Oral Presentation)

Omnivore: A Single Model for Many Visual Modalities

A single model for images, video and single-view 3D.

PDF Cite Code

Kristen Grauman, Andrew Westbury, Rohit Girdhar, et al.

March, 2022 In CVPR, 2022 (Best paper finalist)

Ego4D: Around the World in 3,000 Hours of Egocentric Video

The largest egocentric video dataset.

PDF Cite Video Code

Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra

January, 2022 In ECCV, 2022

Detecting Twenty-thousand Classes using Image-level Supervision

Leverages image classification data to build an object detector

PDF Cite Colab Code

Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing

December, 2021 In arXiv, 2021

Mask2Former for Video Instance Segmentation

SOTA video segmentation using Mask2Former.

PDF Cite Code

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

December, 2021 In CVPR, 2022

Masked-attention Mask Transformer for Universal Image Segmentation

A single architecture that achieves state-of-the-art results in instance, semantic and panoptic segmentation.

PDF Cite Code

Ishan Misra, Rohit Girdhar, Armand Joulin

September, 2021 In ICCV, 2021 (Oral Presentation)

3DETR: An End-to-End Transformer Model for 3D Object Detection

First Transformer based detection architecture for 3D data.

PDF Cite Code

Rohit Girdhar, Kristen Grauman

June, 2021 In ICCV, 2021

Anticipative Video Transformer

An autoregressive video transformer architecture for action anticipation in videos.

PDF Cite Code

Zhongzheng Ren, Ishan Misra, Alexander G. Schwing, Rohit Girdhar

May, 2021 In CVPR, 2021

3D Spatial Recognition without Spatially Labeled 3D

WyPR can detect and segment objects in a 3D scene without needing any spatial labels at all!

PDF Cite Slides Code

Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra

May, 2021 In CVPR, 2021

Self-Supervised Pretraining of 3D Features on any Point-Cloud

SOTA 3D detection/segmentation results by learning contrastive representations on 3D data

PDF Cite Code

Eltayeb Ahmed, Anton Bakhtin, Laurens Van Der Maaten, Rohit Girdhar

February, 2021 In ICML Workshops, 2021

Physical Reasoning Using Dynamics Aware Embeddings

Self-supervised representations for physical reasoning.

PDF Cite Code

Rohit Girdhar, Laura Gustafson, Aaron Adcock, Laurens Van Der Maaten

June, 2020 In ICML Workshops, 2021

Forward Prediction for Physical Reasoning

Forward prediction for PHYRE benchmark.

PDF Cite Code

Rohit Girdhar, Deva Ramanan

October, 2019 In ICLR, 2020 (Oral Presentation)

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

A dataset to evaluate temporal reasoning in video models.

PDF Cite Slides Video Code

Jessica Lee, Deva Ramanan, Rohit Girdhar

October, 2019 In ICLR, 2020

MetaPix: Few-Shot Video Retargeting

Few-shot retargeting of videos.

PDF Cite Slides Video Code

Rohit Girdhar, Du Tran, Lorenzo Torresani, Deva Ramanan

January, 2019 In ICCV, 2019

DistInit: Learning Video Representations Without a Single Labeled Video

Distilling representations from image models to video models.

PDF Cite

Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

December, 2018 In CVPR, 2019 (Oral Presentation)

Video Action Transformer Network

Among the first applications of Transformers to model videos. SOTA results: close 2nd at AVA Challenge, CVPR'18.

PDF Cite Video

Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran

December, 2017 In CVPR, 2018

Detect-and-Track: Efficient Pose Estimation in Videos

Human keypoint tracking approach that ranked first in the ICCV 2017 PoseTrack keypoint tracking challenge!

PDF Cite Code

Rohit Girdhar, Deva Ramanan

November, 2017 In NeurIPS, 2017

Attentional Pooling for Action Recognition

Among the first applications of attention for contemporary video/action understanding.

PDF Cite Code

Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

April, 2017 In CVPR, 2017

ActionVLAD: Learning spatio-temporal aggregation for action classification

Aggregating visual features for action recognition.

PDF Cite Video Code

Xiaolong Wang, Rohit Girdhar, Abhinav Gupta

March, 2016 In CVPR, 2017 (Spotlight Presentation)

Binge Watching: Scaling Affordance Learning from Sitcoms

Learning how humans interact with their environment by watching TV.

PDF Cite

Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, Abhinav Gupta

March, 2016 In ECCV, 2016 (Spotlight Presentation)

Learning a Predictable and Generative Vector Representation for Objects

A single embedding space, good for both generating and understanding 3D models

PDF Cite Video Code