[ICCV 2025] VALLR: Visual ASR Language Model for Lip Reading

VALLR is a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (VASR) that achieves state-of-the-art performance in lip reading. This approach significantly reduces Word Error Rate (WER) by first predicting a sequence of phonemes from visual inputs and then using a fine-tuned Large Language Model (LLM) to reconstruct coherent words and sentences. This repository contains the official PyTorch implementation of the VALLR model, along with tools for data preprocessing and inference.

Key Features

State-of-the-Art Performance: Achieves a SOTA WER of 18.7% on the LRS3 dataset, outperforming existing methods.
Two-Stage Framework: Decouples the visual feature extraction from linguistic modeling, leading to improved accuracy and data efficiency.
Phoneme-Centric Approach: By predicting phonemes as an intermediate representation, VALLR effectively handles the ambiguities of visemes and coarticulation effects.
Data Efficient: Requires 99.4% less labeled data than the next best approach, making it highly practical for real-world applications without the need for self-supervised pre-training.
Modular Design: The codebase is organized into distinct components for data processing, model architecture, and inference pipelines, allowing for easy customization and extension.
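
The viseme ambiguity mentioned above can be illustrated with a small sketch. The mapping below is a deliberately partial, illustrative grouping (bilabial and labiodental consonants only) and is not taken from the VALLR codebase; the names `PHONEME_TO_VISEME` and `viseme_sequence` are hypothetical.

```python
# Illustrative only: a partial phoneme-to-viseme map, not part of VALLR.
# Bilabials (P, B, M) look the same on the lips, as do labiodentals (F, V).
PHONEME_TO_VISEME = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
}


def viseme_sequence(phonemes):
    """Map each phoneme to its viseme class; unmapped phonemes are kept as-is."""
    return [PHONEME_TO_VISEME.get(p, p) for p in phonemes]


# ARPAbet-style transcriptions of three words that differ only in the first phoneme.
words = {
    "pat": ["P", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "mat": ["M", "AE", "T"],
}
visemes = {word: viseme_sequence(phones) for word, phones in words.items()}

# All three words collapse to the same viseme sequence, so a viseme-level
# transcript cannot tell them apart without further context.
print(visemes["pat"] == visemes["bat"] == visemes["mat"])
```

Three distinct phoneme sequences map to one viseme sequence, which is the kind of ambiguity that linguistic context from a language model can help resolve.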

Model Architecture

The VALLR model consists of two main components:

Video-to-Phoneme Network: A Video Transformer with a CTC head that takes video frames of a speaker's mouth as input and predicts a sequence of phonemes.
Phoneme-to-Sentence LLM: A fine-tuned Large Language Model (LLM) that takes the phoneme sequence as input and reconstructs the corresponding words and sentences.

This two-stage design allows the model to first learn the complex visual features of speech and then leverage the linguistic knowledge of an LLM to generate coherent text.
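
As a rough illustration of how a CTC head's per-frame predictions become a phoneme sequence, the sketch below applies standard greedy CTC collapsing: merge consecutive repeats, then drop blanks. This is the textbook procedure, not necessarily the decoding used in this repository (which may, for example, use beam search); the blank symbol name and the function name are assumptions.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels: merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# One predicted label per video frame (made-up example data).
frames = ["HH", "HH", BLANK, "AH", "AH", "L", BLANK, BLANK, "OW", "OW"]
phonemes = ctc_greedy_collapse(frames)
print(phonemes)

# A blank between two identical labels keeps them as separate phonemes.
print(ctc_greedy_collapse(["T", BLANK, "T"]))
```

The resulting phoneme list is what the second stage would receive as input for sentence reconstruction.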

Results

Here's a comparison of VALLR's performance against other state-of-the-art methods on the LRS3 and LRS2 datasets. Our method achieves SOTA performance on LRS3 using only the supervised training set, without any self-supervised pre-training.

LRS3 Dataset Comparison

| Method | Unlabeled (hrs) | Labeled (hrs) | WER (%) |
|---|---|---|---|
| *Self-supervised pre-training + supervised fine-tuning* | | | |
| AV-HuBERT Large [44] | 1,759 | 30 | 32.5 |
| Lip2Vec [12] | 1,759 | 30 | 31.2 |
| Whisper [41] | 1,759 | 30 | 25.5 |
| RAVEn [18] | 1,759 | 433 | 23.1 |
| USR [19] | 1,326 | 433 | 21.5 |
| *Supervised fine-tuning only* | | | |
| Ours | - | 30 | 18.7 |

LRS2 Dataset Comparison

| Method | Unlabeled (hrs) | Labeled (hrs) | WER (%) |
|---|---|---|---|
| *Self-supervised pre-training + supervised fine-tuning* | | | |
| Sub-Word [40] | 2,676 | 2,676 | 22.6 |
| RAVEn [18] | 1,759 | 223 | 17.9 |
| USR [19] | 1,759 | 223 | 15.4 |
| *Supervised fine-tuning only* | | | |
| Ours | - | 28 | 20.8 |
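
For readers unfamiliar with the metric in these tables, WER is conventionally the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a minimal implementation of that definition. It is not the evaluation script used for the reported numbers, which may apply additional text normalization; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Multiplying the returned fraction by 100 gives the percentage form used in the tables.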

Getting Started

Prerequisites

Python 3.10 or higher
PyTorch 2.4.1
Other dependencies listed in requirements.txt

Installation

Clone the repository:

git clone https://github.com/MarshallT-99/VALLR.git
cd VALLR

Install the required packages:
```
pip install -r requirements.txt
```

Inference

To run inference on a single video, use the infer mode and provide the path to the trained model and the video file.

Download the pretrained model weights:
- Download VALLR Model Weights from Google Drive

Run inference:

python main.py --mode infer --model_path /path/to/your/downloaded/model.pth --infer_video_path /path/to/your/video.mp4
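
To run the same command over a folder of videos, a small wrapper along the following lines may help. It only assembles the flags shown above (`--mode`, `--model_path`, `--infer_video_path`); the helper names, the `*.mp4` pattern, and the assumption that `main.py` sits in the current working directory are illustrative and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_infer_command(model_path, video_path, script="main.py"):
    """Assemble the documented inference command as an argument list."""
    return [
        sys.executable, script,
        "--mode", "infer",
        "--model_path", str(model_path),
        "--infer_video_path", str(video_path),
    ]


def infer_directory(model_path, video_dir, pattern="*.mp4"):
    """Run inference once per matching video, stopping on the first failure."""
    for video in sorted(Path(video_dir).glob(pattern)):
        subprocess.run(build_infer_command(model_path, video), check=True)


# Example (hypothetical paths):
# infer_directory("/path/to/your/downloaded/model.pth", "/path/to/videos")
```

Each video is processed in a separate process, so the model is reloaded every time; this keeps the sketch simple at the cost of speed.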

To train the model, use the train mode:

python3 main.py --mode train --version V1 --save_model_path path/to/model --videos_root path/to/videos

Codebase Overview

main.py: The entry point for training and inference, selected with the --mode flag.
Models/ML_VALLR.py: Contains the implementation of the VALLR model.
Data/dataset.py: The VideoDataset class for loading and preprocessing video data.
face_cropper.py: A utility for detecting and cropping faces from video frames using MediaPipe.
config.py: Configuration file for setting hyperparameters and other settings.
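
The cropping step can be pictured with the following sketch, which converts a normalized face bounding box into a clamped pixel crop. It assumes a detector that reports boxes as relative `(xmin, ymin, width, height)` values, as MediaPipe's face detection does; the function name and the `margin` parameter are hypothetical, and the actual logic in face_cropper.py may differ.

```python
def relative_box_to_pixels(box, frame_width, frame_height, margin=0.1):
    """Convert a relative (xmin, ymin, width, height) box to pixel bounds.

    The box is expanded by `margin` (a fraction of its own size) on every
    side and clamped to the frame. Returns (left, top, right, bottom).
    """
    xmin, ymin, width, height = box
    pad_x, pad_y = margin * width, margin * height

    left = int(round((xmin - pad_x) * frame_width))
    top = int(round((ymin - pad_y) * frame_height))
    right = int(round((xmin + width + pad_x) * frame_width))
    bottom = int(round((ymin + height + pad_y) * frame_height))

    # Clamp so the crop never extends past the frame edges.
    left, top = max(0, left), max(0, top)
    right, bottom = min(frame_width, right), min(frame_height, bottom)
    return left, top, right, bottom


# A centered box covering half of a 100x200 frame, with no extra margin.
print(relative_box_to_pixels((0.25, 0.25, 0.5, 0.5), 100, 200, margin=0.0))
```

With a NumPy image array `frame`, the crop would then be `frame[top:bottom, left:right]`.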

Citation

If you use this code or the VALLR model in your research, please cite the following paper:

@article{thomas2025vallr,
 title={VALLR: Visual ASR Language Model for Lip Reading},
 author={Thomas, Marshall and Fish, Edward and Bowden, Richard},
 journal={arXiv preprint arXiv:2503.21408},
 year={2025}
}

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
