Generate WebVTT and SRT captions
This page describes how to use the Cloud Speech-to-Text API to automatically generate captions from audio files, in SRT and WebVTT formats. These formats can store the text and timing information of audio, making it possible to display subtitles or captions in sync with the media for subtitling and closed captioning.
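Both formats carry the same cue text and timing information; the main syntactic difference is the timestamp notation (SRT separates milliseconds with a comma, WebVTT with a period and a leading WEBVTT header). A minimal illustration of the two timing syntaxes, independent of the API:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    return srt_timestamp(seconds).replace(",", ".")

# The same cue rendered in each format (cue text is illustrative):
start, end, text = 1.25, 3.5, "Ask not what your country can do for you."
srt_cue = f"1\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
vtt_cue = f"WEBVTT\n\n{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n{text}\n"
```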
Caption output in a request to Cloud Speech-to-Text is supported only in the
V2 API. Specifically, you can only use BatchRecognize to transcribe
long audio files. Outputs can be saved to a Cloud Storage bucket, or they can
be returned inline. For the Cloud Storage output configuration, multiple formats
can be specified at the same time; each is written to the specified bucket
with a different file extension.
Enable caption outputs in a request
To generate SRT or VTT caption outputs for your audio using Cloud Speech-to-Text, follow these steps to enable caption outputs in your transcription request:
- Make a request to the Cloud Speech-to-Text API
BatchRecognize method with the output_format_config field populated. Supported values are:
- srt, for output that follows the SRT format.
- vtt, for output that follows the WebVTT format.
- native, the default format if none is specified; results are returned as a serialized BatchRecognizeResults message.
- Since the operation is async, poll the request until it's complete.
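The steps above hinge on polling the asynchronous operation until it completes. A minimal, client-library-independent polling helper (the get_operation callable and its done/result attributes are illustrative assumptions, not part of the API):

```python
import time

def poll_until_done(get_operation, interval_s=5.0, timeout_s=600.0):
    """Poll a long-running operation until it reports done, or time out.

    get_operation is any callable returning an object with `done` (bool)
    and `result` attributes -- for example, a wrapper around the client
    library's operations.get call.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        op = get_operation()
        if op.done:
            return op.result
        time.sleep(interval_s)
    raise TimeoutError("operation did not complete in time")
```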
Multiple formats can be specified at the same time for the Cloud Storage
output configuration. They're written to the specified bucket with different
file extensions (either .json, .srt, or .vtt).
If multiple formats are specified for the inline output config, each format will
be available as a field in the BatchRecognizeFileResult.inline_result message.
The following code snippet demonstrates how to enable caption outputs in a transcription request to Cloud Speech-to-Text for an audio file stored in Cloud Storage:
API
curl -X POST \
  -H "Content-Type: application/json; charset=utf-8" \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  https://speech.googleapis.com/v2/projects/my-project/locations/global/recognizers/_:batchRecognize \
  --data '{
    "files": [{
      "uri": "gs://my-bucket/jfk_and_the_press.wav"
    }],
    "config": {
      "features": { "enableWordTimeOffsets": true },
      "autoDecodingConfig": {},
      "model": "long",
      "languageCodes": ["en-US"]
    },
    "recognitionOutputConfig": {
      "gcsOutputConfig": { "uri": "gs://my-bucket" },
      "output_format_config": { "srt": {} }
    }
  }'
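Once the operation finishes, the bucket holds an .srt file for each input. A small sketch for inspecting the cues in such a file (assumes well-formed SRT input; this parser is illustrative and not part of the Speech-to-Text client):

```python
def parse_srt(srt_text: str):
    """Parse SRT text into a list of (start, end, text) cues."""
    cues = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks (index, timing, text expected)
        start, _, end = lines[1].partition(" --> ")
        cues.append((start.strip(), end.strip(), "\n".join(lines[2:])))
    return cues
```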
What's next
- Learn how to transcribe long audio files.
- Learn how to choose the best transcription model.
- Transcribe audio files using Chirp.
- For best performance, accuracy, and other tips, see the best practices documentation.