Transcribe audio to text in multiple languages.
For most needs, use vaibhavs10/incredibly-fast-whisper. It is fast (10x quicker than the original Whisper), cheap, and accurate, and it supports many languages.
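As a rough illustration of what running this model could look like, here is a minimal sketch using the Replicate Python client. The model name comes from the text above. The input field names (`audio`, `language`) are assumptions, so check the model page for the actual schema. Depending on your client version, you may also need to append a `:version` id to the model name.

```python
import os
from typing import Optional

# Model name taken from the text above.
MODEL = "vaibhavs10/incredibly-fast-whisper"


def build_input(audio_url: str, language: Optional[str] = None) -> dict:
    """Assemble the input payload for a transcription run.

    The keys "audio" and "language" are assumed field names;
    confirm them against the model's input schema.
    """
    payload = {"audio": audio_url}
    if language:
        payload["language"] = language
    return payload


def transcribe(audio_url: str, language: Optional[str] = None):
    """Run the model through the Replicate client and return its raw output."""
    # Imported lazily so build_input() works without the client installed.
    import replicate  # pip install replicate

    if not os.environ.get("REPLICATE_API_TOKEN"):
        raise RuntimeError("Set REPLICATE_API_TOKEN before calling transcribe().")
    return replicate.run(MODEL, input=build_input(audio_url, language))
```

Calling `transcribe("https://example.com/audio.mp3")` (a placeholder URL) would start a run and return whatever the model outputs. The shape of that output varies by model.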
Need to label speakers or get word-level timestamps? victor-upmeet/whisperx has you covered. It is slightly more expensive than incredibly-fast-whisper but still very fast.
You can also check out our Speaker Diarization collection for models that can identify speakers from audio and video.
To translate speech between languages, cjwbw/seamless_communication is your friend.
This unified model handles multiple tasks without relying on separate models.
Featured models
A speech-to-text model that uses GPT-4o to transcribe audio
Updated 1 month, 3 weeks ago
33.7K runs
Accelerated transcription, word-level timestamps and diarization with whisperX large-v3
Updated 1 year, 4 months ago
5.5M runs
whisper-large-v3, incredibly fast, powered by Hugging Face Transformers! 🤗
Updated 1 year, 10 months ago
22.5M runs
Recommended Models
If speed is your top priority, vaibhavs10/incredibly-fast-whisper and openai/gpt-4o-transcribe are among the fastest models in the speech-to-text collection. They’re designed for low-latency transcription, which makes them ideal for live or near real-time scenarios like voice notes, quick interviews, or interactive applications.
Keep in mind that faster models may not include advanced features like speaker labeling or word-level timestamps.
openai/whisper is a reliable general-purpose option that works well with clean audio and single-speaker recordings. It offers multilingual support and solid accuracy for most everyday transcription needs.
If you need more structure—like timestamps or speaker labels—victor-upmeet/whisperx adds those capabilities without a massive jump in runtime.
For clear recordings like lectures, podcasts, or voice memos, vaibhavs10/incredibly-fast-whisper or openai/whisper are great choices. They deliver accurate transcripts quickly and handle common accents well.
If your audio includes multiple speakers—like team meetings, interviews, or panel discussions—victor-upmeet/whisperx is your best bet. It adds speaker diarization and word-level timestamps so you can keep track of who said what.
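To make "who said what" concrete, here is a small sketch that merges consecutive segments from the same speaker into readable turns. It assumes a hypothetical output shape: a list of segments, each with `speaker`, `start`, `end`, and `text` keys. The real output of a diarization model may differ, so adapt the key names to what the model page documents.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments spoken by the same speaker into one turn."""
    turns: List[Dict] = []
    for seg in segments:
        # Segments without a speaker label fall back to a placeholder name.
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous turn: extend it.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = seg["end"]
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns


def format_turns(turns: List[Dict]) -> str:
    """Render turns as one line each: start time, speaker, text."""
    return "\n".join(f"[{t['start']:.1f}s] {t['speaker']}: {t['text']}" for t in turns)
```

Merging by adjacency keeps the transcript in chronological order, which usually reads better for meetings and interviews than grouping all of one speaker's text together.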
If you need transcription in multiple languages or want translations built in, cjwbw/seamless_communication is a strong option. It supports multiple languages and can handle more complex audio scenarios like mixed-language conversations.
Most models produce plain text transcripts. Some also include word-level timestamps or speaker labels.
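When a model returns timestamps, one common use is generating subtitles. The sketch below converts timestamped segments into SRT text. It assumes each segment is a dict with `start` and `end` in seconds plus a `text` field. That shape is an assumption for illustration, not something every model guarantees.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Build SRT subtitle text from a list of timestamped segments."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        # Each cue is: sequence number, time range, then the subtitle text.
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}")
    # Cues are separated by a blank line.
    return "\n\n".join(blocks) + "\n"
```

Writing the result of `to_srt(segments)` to a `.srt` file gives you subtitles that most video players can load.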
You can package your own model with Cog and push it to Replicate. This lets you control how it’s run, updated, and shared, whether you’re adapting an open-source model or deploying a fine-tuned one.
Many models in the speech-to-text collection allow commercial use, but licenses vary. Some models have conditions or attribution requirements, so always check the model page before using transcripts in commercial projects.
Recommended Models
Google's most advanced reasoning Gemini model
Updated 1 month ago
98.6K runs
A speech-to-text model that uses GPT-4o mini to transcribe audio
Updated 1 month, 3 weeks ago
10.4K runs
⚡️ Blazing fast audio transcription with speaker diarization | Whisper Large V3 Turbo | word & sentence level timestamps | prompt
Updated 10 months, 1 week ago
4.1M runs
Convert speech in audio to text
Updated 1 year, 1 month ago
142.9M runs
🗣️ Nvidia + Suno.ai's speech-to-text conversion with high accuracy and efficiency 📝
Updated 1 year, 11 months ago
20.4K runs
ASR from video URL based on whisperx using large-v2 model
Updated 2 years, 3 months ago
19.6K runs
SeamlessM4T—Massively Multilingual & Multimodal Machine Translation
Updated 2 years, 3 months ago
92.7K runs
Accelerated transcription of audio using WhisperX
Updated 2 years, 6 months ago
93.2K runs
Generate subtitles from an audio file, using OpenAI's Whisper model.
Updated 3 years, 3 months ago
73.9K runs