Marshall Thomas, Edward Fish, Richard Bowden
VALLR is a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (VASR) that achieves state-of-the-art performance in lip reading. This approach significantly reduces Word Error Rate (WER) by first predicting a sequence of phonemes from visual inputs and then using a fine-tuned Large Language Model (LLM) to reconstruct coherent words and sentences. This repository contains the official PyTorch implementation of the VALLR model, along with tools for data preprocessing and inference.
- State-of-the-Art Performance: Achieves a SOTA WER of 18.7% on the LRS3 dataset, outperforming existing methods.
- Two-Stage Framework: Decouples the visual feature extraction from linguistic modeling, leading to improved accuracy and data efficiency.
- Phoneme-Centric Approach: By predicting phonemes as an intermediate representation, VALLR effectively handles the ambiguities of visemes and coarticulation effects.
- Data Efficient: Requires 99.4% less labeled data than the next best approach, making it highly practical for real-world applications without the need for self-supervised pre-training.
- Modular Design: The codebase is organized into distinct components for data processing, model architecture, and inference pipelines, allowing for easy customization and extension.
The VALLR model consists of two main components:
- Video-to-Phoneme Network: A Video Transformer with a CTC head that takes video frames of a speaker's mouth as input and predicts a sequence of phonemes.
- Phoneme-to-Sentence LLM: A fine-tuned Large Language Model (LLM) that takes the phoneme sequence as input and reconstructs the corresponding words and sentences.
This two-stage design allows the model to first learn the complex visual features of speech and then leverage the linguistic knowledge of an LLM to generate coherent text.
Here's a comparison of VALLR's performance against other state-of-the-art methods on the LRS3 and LRS2 datasets. Our method achieves SOTA performance on LRS3 using only the supervised training set, without any self-supervised pre-training.
| Method | Unlabeled (hrs) | Labeled (hrs) | WER (%) |
|---|---|---|---|
| Self-supervised pre-training + Supervised fine-tuning | |||
| AV-HuBERT Large [44] | 1,759 | 30 | 32.5 |
| Lip2Vec [12] | 1,759 | 30 | 31.2 |
| Whisper [41] | 1,759 | 30 | 25.5 |
| RAVEn [18] | 1759 | 433 | 23.1 |
| USR [19] | 1,326 | 433 | 21.5 |
| Supervised fine-tuning only | |||
| Ours | - | 30 | 18.7 |
| Method | Unlabeled (hrs) | Labeled (hrs) | WER (%) |
|---|---|---|---|
| Self-supervised pre-training + Supervised fine-tuning | |||
| Sub-Word [40] | 2,676 | 2,676 | 22.6 |
| RAVEn [18] | 1,759 | 223 | 17.9 |
| USR [19] | 1,759 | 223 | 15.4 |
| Supervised fine-tuning only | |||
| Ours | - | 28 | 20.8 |
- Python 3.10 or higher
- PyTorch 2.4.1
- Other dependencies listed in
requirements.txt
- Clone the repository:
git clone [https://github.com/MarshallT-99/VALLR.git](https://github.com/MarshallT-99/VALLR.git) cd VALLR - Install the required packages:
pip install -r requirements.txt
To run inference on a single video, use the infer mode and provide the path to the trained model and the video file.
-
Download the pretrained model weights:
-
Run inference:
python main.py --mode infer --model_path /path/to/your/downloaded/model.pth --infer_video_path /path/to/your/video.mp4
-
Train mode:
python3 main.py --mode train --version V1 --save_model_path path/to/model --videos_root path/to/videos
main.py: The main script for running inference.Models/ML_VALLR.py: Contains the implementation of the VALLR model.Data/dataset.py: TheVideoDatasetclass for loading and preprocessing video data.face_cropper.py: A utility for detecting and cropping faces from video frames using MediaPipe.config.py: Configuration file for setting hyperparameters and other settings.
If you use this code or the VALLR model in your research, please cite the following paper:
@article{thomas2025vallr,
title={VALLR: Visual ASR Language Model for Lip Reading},
author={Thomas, Marshall and Fish, Edward and Bowden, Richard},
journal={arXiv preprint arXiv:2503.21408},
year={2025}
}
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.