Get audio track transcription
The Video Intelligence API transcribes speech to text from supported video files. Two models are supported: "default" and "video".
Request speech transcription for a video
REST
Send the process request
The following shows how to send a POST request to the
videos:annotate method.
The example uses the access token for a service account set up for the project
using the Google Cloud CLI. For instructions on installing the Google Cloud CLI,
setting up a project with a service
account, and obtaining an access token, see the
Video Intelligence quickstart.
Before using any of the request data, make the following replacements:
- INPUT_URI: the Cloud Storage location of the file that you want to annotate, including the file name. Must start with gs://. For example: "inputUri": "gs://cloud-videointelligence-demo/assistant.mp4"
- LANGUAGE_CODE: [Optional] the language of the audio. See supported languages.
- PROJECT_NUMBER: The numeric identifier for your Google Cloud project
HTTP method and URL:
POST https://videointelligence.googleapis.com/v1/videos:annotate
Request JSON body:
{
"inputUri": "INPUT_URI",
"features": ["SPEECH_TRANSCRIPTION"],
"videoContext": {
"speechTranscriptionConfig": {
"languageCode": "LANGUAGE_CODE",
"enableAutomaticPunctuation": true,
"filterProfanity": true
}
}
}
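The speechTranscriptionConfig shown above enables automatic punctuation and profanity filtering. If you also want the results written to Cloud Storage, or want additional alternatives and speaker tags like the ones the code samples later on this page print, the request accepts a few more optional fields. The following variant is only a sketch; the field names (outputUri, maxAlternatives, enableSpeakerDiarization, diarizationSpeakerCount) and the bucket path are assumptions based on the v1 API, so verify them against the API reference before relying on them:
{
  "inputUri": "INPUT_URI",
  "outputUri": "gs://YOUR_BUCKET/annotation-results.json",
  "features": ["SPEECH_TRANSCRIPTION"],
  "videoContext": {
    "speechTranscriptionConfig": {
      "languageCode": "LANGUAGE_CODE",
      "enableAutomaticPunctuation": true,
      "filterProfanity": true,
      "maxAlternatives": 2,
      "enableSpeakerDiarization": true,
      "diarizationSpeakerCount": 2
    }
  }
}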
To send your request, use one of the following options:
curl (Linux, macOS, or Cloud Shell)
Save the request body in a file named request.json,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: PROJECT_NUMBER" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://videointelligence.googleapis.com/v1/videos:annotate"
PowerShell (Windows)
Save the request body in a file named request.json,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "PROJECT_NUMBER" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://videointelligence.googleapis.com/v1/videos:annotate" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{
"name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID"
}
If the request is successful, Video Intelligence returns the name of your operation. The preceding response is an example, where PROJECT_NUMBER is the number of your project and OPERATION_ID is the ID of the long-running operation created for the request.
Get the results
To get the results of your request, send a GET request using the operation name returned from the
call to videos:annotate, as shown in the following example.
Before using any of the request data, make the following replacements:
- OPERATION_NAME: the name of the operation as returned by the Video Intelligence API. The operation name has the format projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID
- PROJECT_NUMBER: The numeric identifier for your Google Cloud project
HTTP method and URL:
GET https://videointelligence.googleapis.com/v1/OPERATION_NAME
To send your request, use one of the following options:
curl (Linux, macOS, or Cloud Shell)
Execute the following command:
curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: PROJECT_NUMBER" \
"https://videointelligence.googleapis.com/v1/OPERATION_NAME"
PowerShell (Windows)
Execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "PROJECT_NUMBER" }
Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://videointelligence.googleapis.com/v1/OPERATION_NAME" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{
"name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID",
"metadata": {
"@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoProgress",
"annotationProgress": [{
"inputUri": "/bucket-name-123/sample-video-short.mp4",
"progressPercent": 100,
"startTime": "2018-04-09T15:19:38.919779Z",
"updateTime": "2018-04-09T15:21:17.652470Z"
}]
},
"done": true,
"response": {
"@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoResponse",
"annotationResults": [
{
"speechTranscriptions": [
{
"alternatives": [
{
"transcript": "and laughing going to talk about is the video intelligence API how many of you saw it at the keynote yesterday ",
"confidence": 0.8442509,
"words": [
{
"startTime": "0.200s",
"endTime": "0.800s",
"word": "and"
},
{
"startTime": "0.800s",
"endTime": "1.100s",
"word": "laughing"
},
{
"startTime": "1.100s",
"endTime": "1.200s",
"word": "going"
},
...
Download annotation results
Copy the annotation from the source bucket to your destination bucket (see Copy files and objects):
gcloud storage cp gcs_uri gs://my-bucket
Note: If you provide an output Cloud Storage URI in the request, the annotation is stored at that URI.
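After you have the annotation JSON locally (either the file copied above or a saved copy of the operation response from the previous section), you can pull the transcripts out with a short script. The following Python sketch is not part of the official samples; the file name is a placeholder, and it assumes the camelCase field names shown in the operation response above.
import json

# Placeholder file name: the annotation JSON copied with gcloud storage cp,
# or a saved copy of the operation response shown earlier.
with open("annotation-results.json") as f:
    data = json.load(f)

# An operation response wraps the results in a "response" object;
# a file written directly to Cloud Storage may contain the response body itself.
body = data.get("response", data)

for result in body.get("annotationResults", []):
    for transcription in result.get("speechTranscriptions", []):
        for alternative in transcription.get("alternatives", []):
            print("Transcript:", alternative.get("transcript", ""))
            print("Confidence:", alternative.get("confidence"))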
Go
To authenticate to Video Intelligence, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
func speechTranscriptionURI(w io.Writer, file string) error {
	ctx := context.Background()
	client, err := video.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	op, err := client.AnnotateVideo(ctx, &videopb.AnnotateVideoRequest{
		Features: []videopb.Feature{
			videopb.Feature_SPEECH_TRANSCRIPTION,
		},
		VideoContext: &videopb.VideoContext{
			SpeechTranscriptionConfig: &videopb.SpeechTranscriptionConfig{
				LanguageCode:               "en-US",
				EnableAutomaticPunctuation: true,
			},
		},
		InputUri: file,
	})
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// A single video was processed. Get the first result.
	result := resp.AnnotationResults[0]

	for _, transcription := range result.SpeechTranscriptions {
		// The number of alternatives for each transcription is limited by
		// SpeechTranscriptionConfig.MaxAlternatives.
		// Each alternative is a different possible transcription
		// and has its own confidence score.
		for _, alternative := range transcription.GetAlternatives() {
			fmt.Fprintf(w, "Alternative level information:\n")
			fmt.Fprintf(w, "\tTranscript: %v\n", alternative.GetTranscript())
			fmt.Fprintf(w, "\tConfidence: %v\n", alternative.GetConfidence())

			fmt.Fprintf(w, "Word level information:\n")
			for _, wordInfo := range alternative.GetWords() {
				startTime := wordInfo.GetStartTime()
				endTime := wordInfo.GetEndTime()
				fmt.Fprintf(w, "\t%4.1f - %4.1f: %v (speaker %v)\n",
					float64(startTime.GetSeconds())+float64(startTime.GetNanos())*1e-9, // start as seconds
					float64(endTime.GetSeconds())+float64(endTime.GetNanos())*1e-9, // end as seconds
					wordInfo.GetWord(),
					wordInfo.GetSpeakerTag())
			}
		}
	}

	return nil
}
Java
To authenticate to Video Intelligence, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
// Instantiate a com.google.cloud.videointelligence.v1.VideoIntelligenceServiceClient
try (VideoIntelligenceServiceClient client = VideoIntelligenceServiceClient.create()) {
  // Set the language code
  SpeechTranscriptionConfig config =
      SpeechTranscriptionConfig.newBuilder()
          .setLanguageCode("en-US")
          .setEnableAutomaticPunctuation(true)
          .build();

  // Set the video context with the above configuration
  VideoContext context = VideoContext.newBuilder().setSpeechTranscriptionConfig(config).build();

  // Create the request
  AnnotateVideoRequest request =
      AnnotateVideoRequest.newBuilder()
          .setInputUri(gcsUri)
          .addFeatures(Feature.SPEECH_TRANSCRIPTION)
          .setVideoContext(context)
          .build();

  // asynchronously perform speech transcription on videos
  OperationFuture<AnnotateVideoResponse, AnnotateVideoProgress> response =
      client.annotateVideoAsync(request);

  System.out.println("Waiting for operation to complete...");
  // Display the results
  for (VideoAnnotationResults results :
      response.get(600, TimeUnit.SECONDS).getAnnotationResultsList()) {
    for (SpeechTranscription speechTranscription : results.getSpeechTranscriptionsList()) {
      try {
        // Print the transcription
        if (speechTranscription.getAlternativesCount() > 0) {
          SpeechRecognitionAlternative alternative = speechTranscription.getAlternatives(0);

          System.out.printf("Transcript: %s\n", alternative.getTranscript());
          System.out.printf("Confidence: %.2f\n", alternative.getConfidence());

          System.out.println("Word level information:");
          for (WordInfo wordInfo : alternative.getWordsList()) {
            double startTime =
                wordInfo.getStartTime().getSeconds() + wordInfo.getStartTime().getNanos() / 1e9;
            double endTime =
                wordInfo.getEndTime().getSeconds() + wordInfo.getEndTime().getNanos() / 1e9;
            System.out.printf(
                "\t%4.2fs - %4.2fs: %s\n", startTime, endTime, wordInfo.getWord());
          }
        } else {
          System.out.println("No transcription found");
        }
      } catch (IndexOutOfBoundsException ioe) {
        System.out.println("Could not retrieve frame: " + ioe.getMessage());
      }
    }
  }
}
Node.js
To authenticate to Video Intelligence, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
// Imports the Google Cloud Video Intelligence library
const videoIntelligence = require('@google-cloud/video-intelligence');

// Creates a client
const client = new videoIntelligence.VideoIntelligenceServiceClient();

/**
 * TODO(developer): Uncomment the following line before running the sample.
 */
// const gcsUri = 'GCS URI of video to analyze, e.g. gs://my-bucket/my-video.mp4';

async function analyzeVideoTranscript() {
  const videoContext = {
    speechTranscriptionConfig: {
      languageCode: 'en-US',
      enableAutomaticPunctuation: true,
    },
  };

  const request = {
    inputUri: gcsUri,
    features: ['SPEECH_TRANSCRIPTION'],
    videoContext: videoContext,
  };

  const [operation] = await client.annotateVideo(request);
  console.log('Waiting for operation to complete...');
  const [operationResult] = await operation.promise();
  // There is only one annotation_result since only
  // one video is processed.
  const annotationResults = operationResult.annotationResults[0];

  for (const speechTranscription of annotationResults.speechTranscriptions) {
    // The number of alternatives for each transcription is limited by
    // SpeechTranscriptionConfig.max_alternatives.
    // Each alternative is a different possible transcription
    // and has its own confidence score.
    for (const alternative of speechTranscription.alternatives) {
      console.log('Alternative level information:');
      console.log(`Transcript: ${alternative.transcript}`);
      console.log(`Confidence: ${alternative.confidence}`);

      console.log('Word level information:');
      for (const wordInfo of alternative.words) {
        const word = wordInfo.word;
        const start_time =
          wordInfo.startTime.seconds + wordInfo.startTime.nanos * 1e-9;
        const end_time =
          wordInfo.endTime.seconds + wordInfo.endTime.nanos * 1e-9;
        console.log('\t' + start_time + 's - ' + end_time + 's: ' + word);
      }
    }
  }
}

analyzeVideoTranscript();
Python
To authenticate to Video Intelligence, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
"""Transcribe speech from a video stored on GCS."""
fromgoogle.cloudimport videointelligence
video_client = videointelligence.VideoIntelligenceServiceClient ()
features = [videointelligence.Feature .SPEECH_TRANSCRIPTION]
config = videointelligence.SpeechTranscriptionConfig (
language_code="en-US", enable_automatic_punctuation=True
)
video_context = videointelligence.VideoContext (speech_transcription_config=config)
operation = video_client.annotate_video (
request={
"features": features,
"input_uri": path,
"video_context": video_context,
}
)
print("\nProcessing video for speech transcription.")
result = operation.result(timeout=600)
# There is only one annotation_result since only
# one video is processed.
annotation_results = result.annotation_results[0]
for speech_transcription in annotation_results.speech_transcriptions:
# The number of alternatives for each transcription is limited by
# SpeechTranscriptionConfig.max_alternatives.
# Each alternative is a different possible transcription
# and has its own confidence score.
for alternative in speech_transcription.alternatives:
print("Alternative level information:")
print("Transcript: {}".format(alternative.transcript))
print("Confidence: {}\n".format(alternative.confidence))
print("Word level information:")
for word_info in alternative.words:
word = word_info.word
start_time = word_info.start_time
end_time = word_info.end_time
print(
"\t{}s - {}s: {}".format(
start_time.seconds + start_time.microseconds * 1e-6,
end_time.seconds + end_time.microseconds * 1e-6,
word,
)
)Additional languages
C#: Please follow the C# setup instructions on the client libraries page and then visit the Video Intelligence reference documentation for .NET.
PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Video Intelligence reference documentation for PHP.
Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Video Intelligence reference documentation for Ruby.