Get audio track transcription

The Video Intelligence API transcribes speech to text from supported video files. There are two supported models: "default" and "video".

Request speech transcription for a video

REST

Send the process request

The following shows how to send a POST request to the videos:annotate method. The example uses the access token for a service account set up for the project using the Google Cloud CLI. For instructions on installing the Google Cloud CLI, setting up a project with a service account, and obtaining an access token, see the Video Intelligence quickstart.

Before using any of the request data, make the following replacements:

  • INPUT_URI: the Cloud Storage location of the file that you want to annotate, including the file name. Must start with gs://.
    For example: "inputUri": "gs://cloud-videointelligence-demo/assistant.mp4"
  • LANGUAGE_CODE: [Optional] See supported languages.
  • PROJECT_NUMBER: the numeric identifier for your Google Cloud project.

HTTP method and URL:

POST https://videointelligence.googleapis.com/v1/videos:annotate

Request JSON body:

{
  "inputUri": "INPUT_URI",
  "features": ["SPEECH_TRANSCRIPTION"],
  "videoContext": {
    "speechTranscriptionConfig": {
      "languageCode": "LANGUAGE_CODE",
      "enableAutomaticPunctuation": true,
      "filterProfanity": true
    }
  }
}
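
The speechTranscriptionConfig object accepts further optional fields beyond the ones shown above, such as speaker diarization, audio track selection, and phrase hints. The following request body is an illustrative sketch only, with placeholder values; see the SpeechTranscriptionConfig reference for the full list of fields:

{
  "inputUri": "INPUT_URI",
  "features": ["SPEECH_TRANSCRIPTION"],
  "videoContext": {
    "speechTranscriptionConfig": {
      "languageCode": "LANGUAGE_CODE",
      "maxAlternatives": 2,
      "enableSpeakerDiarization": true,
      "diarizationSpeakerCount": 2,
      "audioTracks": [0],
      "speechContexts": [{
        "phrases": ["Video Intelligence API"]
      }]
    }
  }
}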

To send your request, use one of these options:

curl (Linux, macOS, or Cloud Shell)

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: PROJECT_NUMBER" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://videointelligence.googleapis.com/v1/videos:annotate"

PowerShell (Windows)

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "PROJECT_NUMBER" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://videointelligence.googleapis.com/v1/videos:annotate" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
 "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID"
}

If the request is successful, Video Intelligence returns the name of your operation. The preceding output is an example of such a response, where PROJECT_NUMBER is the number of your project and OPERATION_ID is the ID of the long-running operation created for the request.

Get the results

To get the results of your request, you must send a GET request, using the operation name returned from the call to videos:annotate, as shown in the following example.

Before using any of the request data, make the following replacements:

  • OPERATION_NAME: the name of the operation as returned by the Video Intelligence API. The operation name has the format projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID.
  • PROJECT_NUMBER: the numeric identifier for your Google Cloud project.

HTTP method and URL:

GET https://videointelligence.googleapis.com/v1/OPERATION_NAME

To send your request, use one of these options:

curl (Linux, macOS, or Cloud Shell)

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: PROJECT_NUMBER" \
"https://videointelligence.googleapis.com/v1/OPERATION_NAME"

PowerShell (Windows)

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "PROJECT_NUMBER" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://videointelligence.googleapis.com/v1/OPERATION_NAME" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoProgress",
    "annotationProgress": [{
      "inputUri": "/bucket-name-123/sample-video-short.mp4",
      "progressPercent": 100,
      "startTime": "2018-04-09T15:19:38.919779Z",
      "updateTime": "2018-04-09T15:21:17.652470Z"
    }]
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoResponse",
    "annotationResults": [
      {
        "speechTranscriptions": [
          {
            "alternatives": [
              {
                "transcript": "and laughing going to talk about is the video intelligence API how many of you saw it at the keynote yesterday ",
                "confidence": 0.8442509,
                "words": [
                  {
                    "startTime": "0.200s",
                    "endTime": "0.800s",
                    "word": "and"
                  },
                  {
                    "startTime": "0.800s",
                    "endTime": "1.100s",
                    "word": "laughing"
                  },
                  {
                    "startTime": "1.100s",
                    "endTime": "1.200s",
                    "word": "going"
                  },
                  ...

Download annotation results

Copy the annotation results from the source location to your destination bucket (see Copy files and objects):

gcloud storage cp gcs_uri gs://my-bucket

Note: If you provide an output Cloud Storage URI in the request, the annotation results are stored at that URI, as shown in the sketch below.
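
If you want Video Intelligence to write the results to a bucket you control instead of copying them afterwards, you can set the optional outputUri field on the annotate request. A minimal sketch, where the bucket and object names are placeholders:

{
  "inputUri": "INPUT_URI",
  "features": ["SPEECH_TRANSCRIPTION"],
  "outputUri": "gs://my-bucket/speech-transcription-output.json"
}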

Go

To authenticate to Video Intelligence, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


func speechTranscriptionURI(w io.Writer, file string) error {
	ctx := context.Background()

	client, err := video.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	op, err := client.AnnotateVideo(ctx, &videopb.AnnotateVideoRequest{
		Features: []videopb.Feature{
			videopb.Feature_SPEECH_TRANSCRIPTION,
		},
		VideoContext: &videopb.VideoContext{
			SpeechTranscriptionConfig: &videopb.SpeechTranscriptionConfig{
				LanguageCode:               "en-US",
				EnableAutomaticPunctuation: true,
			},
		},
		InputUri: file,
	})
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// A single video was processed. Get the first result.
	result := resp.AnnotationResults[0]

	for _, transcription := range result.SpeechTranscriptions {
		// The number of alternatives for each transcription is limited by
		// SpeechTranscriptionConfig.MaxAlternatives.
		// Each alternative is a different possible transcription
		// and has its own confidence score.
		for _, alternative := range transcription.GetAlternatives() {
			fmt.Fprintf(w, "Alternative level information:\n")
			fmt.Fprintf(w, "\tTranscript: %v\n", alternative.GetTranscript())
			fmt.Fprintf(w, "\tConfidence: %v\n", alternative.GetConfidence())

			fmt.Fprintf(w, "Word level information:\n")
			for _, wordInfo := range alternative.GetWords() {
				startTime := wordInfo.GetStartTime()
				endTime := wordInfo.GetEndTime()
				fmt.Fprintf(w, "\t%4.1f - %4.1f: %v (speaker %v)\n",
					float64(startTime.GetSeconds())+float64(startTime.GetNanos())*1e-9, // start as seconds
					float64(endTime.GetSeconds())+float64(endTime.GetNanos())*1e-9, // end as seconds
					wordInfo.GetWord(),
					wordInfo.GetSpeakerTag())
			}
		}
	}

	return nil
}

Java

To authenticate to Video Intelligence, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Instantiate a com.google.cloud.videointelligence.v1.VideoIntelligenceServiceClient
try (VideoIntelligenceServiceClient client = VideoIntelligenceServiceClient.create()) {
  // Set the language code
  SpeechTranscriptionConfig config =
      SpeechTranscriptionConfig.newBuilder()
          .setLanguageCode("en-US")
          .setEnableAutomaticPunctuation(true)
          .build();

  // Set the video context with the above configuration
  VideoContext context = VideoContext.newBuilder().setSpeechTranscriptionConfig(config).build();

  // Create the request
  AnnotateVideoRequest request =
      AnnotateVideoRequest.newBuilder()
          .setInputUri(gcsUri)
          .addFeatures(Feature.SPEECH_TRANSCRIPTION)
          .setVideoContext(context)
          .build();

  // asynchronously perform speech transcription on videos
  OperationFuture<AnnotateVideoResponse, AnnotateVideoProgress> response =
      client.annotateVideoAsync(request);

  System.out.println("Waiting for operation to complete...");
  // Display the results
  for (VideoAnnotationResults results :
      response.get(600, TimeUnit.SECONDS).getAnnotationResultsList()) {
    for (SpeechTranscription speechTranscription : results.getSpeechTranscriptionsList()) {
      try {
        // Print the transcription
        if (speechTranscription.getAlternativesCount() > 0) {
          SpeechRecognitionAlternative alternative = speechTranscription.getAlternatives(0);

          System.out.printf("Transcript: %s\n", alternative.getTranscript());
          System.out.printf("Confidence: %.2f\n", alternative.getConfidence());

          System.out.println("Word level information:");
          for (WordInfo wordInfo : alternative.getWordsList()) {
            double startTime =
                wordInfo.getStartTime().getSeconds() + wordInfo.getStartTime().getNanos() / 1e9;
            double endTime =
                wordInfo.getEndTime().getSeconds() + wordInfo.getEndTime().getNanos() / 1e9;
            System.out.printf(
                "\t%4.2fs - %4.2fs: %s\n", startTime, endTime, wordInfo.getWord());
          }
        } else {
          System.out.println("No transcription found");
        }
      } catch (IndexOutOfBoundsException ioe) {
        System.out.println("Could not retrieve frame: " + ioe.getMessage());
      }
    }
  }
}

Node.js

To authenticate to Video Intelligence, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Imports the Google Cloud Video Intelligence library
const videoIntelligence = require('@google-cloud/video-intelligence');

// Creates a client
const client = new videoIntelligence.VideoIntelligenceServiceClient();

/**
 * TODO(developer): Uncomment the following line before running the sample.
 */
// const gcsUri = 'GCS URI of video to analyze, e.g. gs://my-bucket/my-video.mp4';

async function analyzeVideoTranscript() {
  const videoContext = {
    speechTranscriptionConfig: {
      languageCode: 'en-US',
      enableAutomaticPunctuation: true,
    },
  };

  const request = {
    inputUri: gcsUri,
    features: ['SPEECH_TRANSCRIPTION'],
    videoContext: videoContext,
  };

  const [operation] = await client.annotateVideo(request);
  console.log('Waiting for operation to complete...');
  const [operationResult] = await operation.promise();
  // There is only one annotation_result since only
  // one video is processed.
  const annotationResults = operationResult.annotationResults[0];

  for (const speechTranscription of annotationResults.speechTranscriptions) {
    // The number of alternatives for each transcription is limited by
    // SpeechTranscriptionConfig.max_alternatives.
    // Each alternative is a different possible transcription
    // and has its own confidence score.
    for (const alternative of speechTranscription.alternatives) {
      console.log('Alternative level information:');
      console.log(`Transcript: ${alternative.transcript}`);
      console.log(`Confidence: ${alternative.confidence}`);

      console.log('Word level information:');
      for (const wordInfo of alternative.words) {
        const word = wordInfo.word;
        const start_time =
          wordInfo.startTime.seconds + wordInfo.startTime.nanos * 1e-9;
        const end_time =
          wordInfo.endTime.seconds + wordInfo.endTime.nanos * 1e-9;
        console.log('\t' + start_time + 's - ' + end_time + 's: ' + word);
      }
    }
  }
}

analyzeVideoTranscript();

Python

To authenticate to Video Intelligence, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

"""Transcribe speech from a video stored on GCS."""
fromgoogle.cloudimport videointelligence
video_client = videointelligence.VideoIntelligenceServiceClient ()
features = [videointelligence.Feature .SPEECH_TRANSCRIPTION]
config = videointelligence.SpeechTranscriptionConfig (
 language_code="en-US", enable_automatic_punctuation=True
)
video_context = videointelligence.VideoContext (speech_transcription_config=config)
operation = video_client.annotate_video (
 request={
 "features": features,
 "input_uri": path,
 "video_context": video_context,
 }
)
print("\nProcessing video for speech transcription.")
result = operation.result(timeout=600)
# There is only one annotation_result since only
# one video is processed.
annotation_results = result.annotation_results[0]
for speech_transcription in annotation_results.speech_transcriptions:
 # The number of alternatives for each transcription is limited by
 # SpeechTranscriptionConfig.max_alternatives.
 # Each alternative is a different possible transcription
 # and has its own confidence score.
 for alternative in speech_transcription.alternatives:
 print("Alternative level information:")
 print("Transcript: {}".format(alternative.transcript))
 print("Confidence: {}\n".format(alternative.confidence))
 print("Word level information:")
 for word_info in alternative.words:
 word = word_info.word
 start_time = word_info.start_time
 end_time = word_info.end_time
 print(
 "\t{}s - {}s: {}".format(
 start_time.seconds + start_time.microseconds * 1e-6,
 end_time.seconds + end_time.microseconds * 1e-6,
 word,
 )
 )

Additional languages

C#: Please follow the C# setup instructions on the client libraries page and then visit the Video Intelligence reference documentation for .NET.

PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Video Intelligence reference documentation for PHP.

Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Video Intelligence reference documentation for Ruby.
