transcribe() v4.0.131

Transcribes a media file using Whisper.cpp.
You should first install Whisper.cpp, for example through installWhisperCpp().

note

This function only works with Whisper.cpp 1.5.5 or later, unless tokenLevelTimestamps is set to false.

transcribe.mjs
tsx
import path from 'path';
import {transcribe} from '@remotion/install-whisper-cpp';

const {transcription} = await transcribe({
  inputPath: '/path/to/audio.wav',
  whisperPath: path.join(process.cwd(), 'whisper.cpp'),
  whisperCppVersion: '1.5.5',
  model: 'medium.en',
  tokenLevelTimestamps: true,
});

for (const token of transcription) {
  console.log(token.timestamps.from, token.timestamps.to, token.text);
}

Options

`inputPath`

The path to the file you want to extract text from.

The file must be a 16-bit, 16 kHz WAVE file. See Resample audio to 16kHz for more information.
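To catch an unsuitable input file before starting a transcription, the header of the WAVE file can be inspected. The following is a minimal sketch, not part of the library: `readWavFormat` and `isWhisperCompatible` are hypothetical helper names, and the code assumes the canonical PCM WAVE layout in which the `fmt ` chunk directly follows the RIFF header.

```typescript
// Hypothetical helpers, not part of @remotion/install-whisper-cpp.
// Assumes the canonical PCM WAVE layout, where the "fmt " chunk directly
// follows the RIFF header. Files with other chunks before "fmt " yield null.
type WavFormat = {
  channels: number;
  sampleRate: number;
  bitsPerSample: number;
};

const readAscii = (view: DataView, offset: number, length: number): string => {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += String.fromCharCode(view.getUint8(offset + i));
  }
  return out;
};

const readWavFormat = (bytes: Uint8Array): WavFormat | null => {
  // The fields read below end at byte 36.
  if (bytes.byteLength < 36) {
    return null;
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (
    readAscii(view, 0, 4) !== 'RIFF' ||
    readAscii(view, 8, 4) !== 'WAVE' ||
    readAscii(view, 12, 4) !== 'fmt '
  ) {
    return null;
  }
  return {
    channels: view.getUint16(22, true), // little-endian
    sampleRate: view.getUint32(24, true),
    bitsPerSample: view.getUint16(34, true),
  };
};

// True only for the 16-bit, 16 kHz format that this page requires.
const isWhisperCompatible = (format: WavFormat | null): boolean =>
  format !== null && format.sampleRate === 16000 && format.bitsPerSample === 16;
```

In Node.js, the bytes could come from `fs.readFileSync(inputPath)`, since a `Buffer` is also a `Uint8Array`. If the check fails, the file likely needs to be resampled first.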

`whisperPath`

The path to your whisper.cpp folder.
If you haven't installed Whisper.cpp, you can do so for example through installWhisperCpp() and use the same folder.

`tokenLevelTimestamps` v4.0.131

Passes the --dtw flag to Whisper.cpp to generate more accurate timestamps, which are returned in the t_dtw field.
Recommended for accurate timings, but only available with Whisper.cpp 1.5.5 or later.
Set to false if you use an older version of Whisper.cpp.

`model?`

default: base.en

The Whisper model to use for the transcription.

Possible values: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2, large-v3, large-v3-turbo.

Make sure the model you want to use exists in your whisper.cpp/models folder. You can ensure a specific model is available locally by using the downloadWhisperModel() API.

Note: large-v3-turbo only works properly with Whisper.cpp versions built from November 2024 or later and Remotion v4.0.229 or later.

`modelFolder?`

default: whisperPath/models

If you saved Whisper models to a specific folder, pass its path here.

Uses the whisper.cpp/models folder at the location defined through whisperPath as default.

`translateToEnglish?`

default: false

Set this flag to true to get an English translation of the transcription. Do not use a *.en model, as those models cannot translate another language into English.

note

We recommend using at least the medium model to get satisfactory results when translating.
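The restriction on *.en models can be checked before calling transcribe(). The sketch below is not part of the library: `isEnglishOnlyModel` and `assertCanTranslate` are hypothetical helper names, and the code assumes that English-only models are exactly those whose name ends in `.en`, as in the list of possible `model` values.

```typescript
// Hypothetical helpers, not part of @remotion/install-whisper-cpp.
// Assumption: English-only models are exactly those whose name ends in ".en".
const isEnglishOnlyModel = (model: string): boolean => model.endsWith('.en');

// Throws early instead of starting a transcription that cannot translate.
const assertCanTranslate = (
  model: string,
  translateToEnglish: boolean,
): void => {
  if (translateToEnglish && isEnglishOnlyModel(model)) {
    throw new Error(
      `Model "${model}" is English-only and cannot translate. Use a multilingual model such as "medium".`,
    );
  }
};
```

Calling `assertCanTranslate(model, translateToEnglish)` right before transcribe() turns a silent misconfiguration into an explicit error.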

`printOutput?` v4.0.132

Whether to print the output of the transcription process to the console. Defaults to true.

`tokensPerItem?` v4.0.141

default: 1

The maximum number of tokens included in each transcription item.

Set this to null to use whisper.cpp's default token grouping (useful for generating a movie-style transcription).

info

tokensPerItem can only be set when tokenLevelTimestamps is set to false.

`splitOnWord?` v4.0.208

Adds the --split-on-word flag to Whisper.cpp for cleaner word-for-word output.

`language?` v4.0.142

default: null

Passes the -l flag to Whisper.cpp to specify the spoken language of the audio file.

Possible values: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Burmese, Castilian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, Flemish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Letzeburgesch, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Moldavian, Moldovan, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Panjabi, Pashto, Persian, Polish, Portuguese, Punjabi, Pushto, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Sinhalese, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Valencian, Vietnamese, Welsh, Yiddish, Yoruba, Zulu. af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, zh or auto.

`signal?` v4.0.156

A signal from an AbortController to cancel the transcription process.
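As an illustration, a signal that aborts on its own after a timeout can be built with the standard `AbortController` API. This is a minimal sketch: `createTimeoutAbort` is a hypothetical helper name, not part of the library, and how transcribe() reports a cancellation is not covered here.

```typescript
// Hypothetical helper, not part of @remotion/install-whisper-cpp.
// Creates a signal that aborts after `timeoutMs`, unless `clear()` is
// called first. `abort()` cancels immediately.
const createTimeoutAbort = (timeoutMs: number) => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  return {
    signal: controller.signal,
    abort: () => {
      clearTimeout(timer);
      controller.abort();
    },
    // Call this once the work has finished so the timer does not fire later.
    clear: () => clearTimeout(timer),
  };
};
```

The returned `signal` can then be passed as the `signal` option of transcribe(), with `clear()` called once the transcription has finished.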

`onProgress?` v4.0.156

Listen for progress updates from the transcription process.
The progress is a number between 0 and 1.

tsx
import type {TranscribeOnProgress} from '@remotion/install-whisper-cpp';

const onProgress: TranscribeOnProgress = (progress) => {
  console.log(`Transcription progress: ${progress * 100}%`);
};
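Because the progress value is a fraction, multiplying it by 100 can print long decimals, and the callback may fire more often than is useful for logging. The sketch below rounds the value and only logs when the rounded percentage changes. `OnProgress` is a local stand-in for the library's `TranscribeOnProgress` type (its exact shape is an assumption here), and `formatProgress` and `createProgressLogger` are hypothetical helper names.

```typescript
// Local stand-in for the library's TranscribeOnProgress type.
// Assumed shape: a callback that receives a number between 0 and 1.
type OnProgress = (progress: number) => void;

// Clamps to [0, 1] and formats as a whole-number percentage.
const formatProgress = (progress: number): string => {
  const clamped = Math.min(1, Math.max(0, progress));
  return `${Math.round(clamped * 100)}%`;
};

// Returns a callback that only logs when the rounded percentage changes.
const createProgressLogger = (log: (message: string) => void): OnProgress => {
  let last = '';
  return (progress) => {
    const formatted = formatProgress(progress);
    if (formatted !== last) {
      last = formatted;
      log(`Transcription progress: ${formatted}`);
    }
  };
};
```

A logger created with `createProgressLogger(console.log)` could be passed as the `onProgress` option.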

`flashAttention?` v4.0.324

Boolean. Set to true to enable flash attention.

`additionalArgs?` v4.0.324

Additional arguments to pass to Whisper.cpp, as an array. The array can contain strings or string pairs, for example:

js
transcribe({
  // ...other options
  additionalArgs: ['-tdrz', ['--max-len', '1']],
});
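To make the mixed array shape concrete, the sketch below flattens such an array into a flat, argv-style list of strings. It only illustrates the data structure and is not necessarily how the library processes the option. `AdditionalArgsSketch` is a local stand-in for the library's `AdditionalArgs` type (assumed from the description above), and `flattenAdditionalArgs` is a hypothetical helper name.

```typescript
// Local stand-in for the library's AdditionalArgs type.
// Assumed shape: single flags as strings, or [flag, value] string pairs.
type AdditionalArgsSketch = (string | [string, string])[];

// Hypothetical helper: flattens single flags and pairs into one list.
const flattenAdditionalArgs = (args: AdditionalArgsSketch): string[] => {
  const out: string[] = [];
  for (const arg of args) {
    if (typeof arg === 'string') {
      out.push(arg);
    } else {
      out.push(arg[0], arg[1]);
    }
  }
  return out;
};
```

Using string pairs keeps each flag next to its value, which makes it harder to accidentally separate them when the array is built up conditionally.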

Return value

`TranscriptionJson`

An object containing all the metadata and transcriptions resulting from the transcription process.

ts
type Timestamps = {
  from: string;
  to: string;
};
type Offsets = {
  from: number;
  to: number;
};
type WordLevelToken = {
  t_dtw: number;
  text: string;
  timestamps: Timestamps;
  offsets: Offsets;
  id: number;
  p: number;
};
type TranscriptionItem = {
  timestamps: Timestamps;
  offsets: Offsets;
  text: string;
};
type TranscriptionItemWithTimestamp = TranscriptionItem & {
  tokens: WordLevelToken[];
};
type Model = {
  type: string;
  multilingual: boolean;
  vocab: number;
  audio: {
    ctx: number;
    state: number;
    head: number;
    layer: number;
  };
  text: {
    ctx: number;
    state: number;
    head: number;
    layer: number;
  };
  mels: number;
  ftype: number;
};
type Params = {
  model: string;
  language: string;
  translate: boolean;
};
type Result = {
  language: string;
};
export type TranscriptionJson<WithTokenLevelTimestamp extends boolean> = {
  systeminfo: string;
  model: Model;
  params: Params;
  result: Result;
  transcription: true extends WithTokenLevelTimestamp
    ? TranscriptionItemWithTimestamp[]
    : TranscriptionItem[];
};

For accurate timestamps, prefer the t_dtw value over offsets.
Use convertToCaptions() to use our opinionated suggestion for postprocessing the captions.
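When tokenLevelTimestamps is true, each transcription item carries its own `tokens` array. The sketch below shows one way to flatten all tokens into a single list for custom postprocessing. The types are local copies of the ones documented above, trimmed to the fields used. `flattenTokens` and `FlatToken` are hypothetical names, and the values are passed through without interpreting their units.

```typescript
// Local copies of the documented types, trimmed to the fields used here.
type Offsets = {from: number; to: number};
type WordLevelToken = {
  t_dtw: number;
  text: string;
  offsets: Offsets;
};
type TranscriptionItemWithTimestamp = {
  text: string;
  offsets: Offsets;
  tokens: WordLevelToken[];
};

// Hypothetical output shape for this sketch.
type FlatToken = {text: string; tDtw: number; offsets: Offsets};

// Hypothetical helper: collects the tokens of all items into one list,
// skipping tokens whose text is empty after trimming whitespace.
const flattenTokens = (
  items: TranscriptionItemWithTimestamp[],
): FlatToken[] => {
  const out: FlatToken[] = [];
  for (const item of items) {
    for (const token of item.tokens) {
      if (token.text.trim() === '') {
        continue;
      }
      out.push({text: token.text, tDtw: token.t_dtw, offsets: token.offsets});
    }
  }
  return out;
};
```

For typical captioning use cases, convertToCaptions() is likely the better starting point. A helper like this is mainly useful when a different grouping of tokens is needed.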
