Transcribe long audio files into text
This page demonstrates how to transcribe long audio files (longer than 1 minute) to text using the Cloud Speech-to-Text API and asynchronous speech recognition.
About asynchronous speech recognition
Asynchronous speech recognition starts a long-running audio processing operation. Use asynchronous speech recognition to transcribe audio that is longer than 60 seconds; for shorter audio, synchronous speech recognition is faster and simpler. The upper limit for asynchronous speech recognition is 480 minutes.
Cloud Speech-to-Text and asynchronous processing
You can send audio content from a local file directly to Cloud Speech-to-Text for asynchronous processing. However, local files are limited to 60 seconds of audio, and attempting to transcribe a local audio file longer than 60 seconds results in an error. To use asynchronous speech recognition on audio longer than 60 seconds, you must store the audio in a Cloud Storage bucket.
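The limits above can be summarized in a small decision helper. This is a minimal sketch, not part of any client library: choose_recognition_method and its "sync"/"async" return values are hypothetical, and the function assumes you already know the audio's duration in seconds.

```python
# Hypothetical helper encoding the limits described on this page:
# audio up to 60 seconds can be sent from a local file, longer audio must be
# stored in Cloud Storage, and 480 minutes is the asynchronous upper limit.
SYNC_LIMIT_SECONDS = 60
ASYNC_LIMIT_SECONDS = 480 * 60


def choose_recognition_method(duration_seconds: float, audio_source: str) -> str:
    """Return "sync" or "async", or raise ValueError if the audio can't be transcribed.

    audio_source is a local file path or a Cloud Storage URI such as gs://bucket/file.flac.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds > ASYNC_LIMIT_SECONDS:
        raise ValueError("audio exceeds the 480-minute limit for asynchronous recognition")
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        # Short audio: synchronous recognition is faster and simpler.
        return "sync"
    if not audio_source.startswith("gs://"):
        # Local audio longer than 60 seconds results in an error.
        raise ValueError("audio longer than 60 seconds must be stored in a Cloud Storage bucket")
    return "async"


print(choose_recognition_method(45, "clip.wav"))                       # sync
print(choose_recognition_method(20 * 60, "gs://my-bucket/talk.flac"))  # async
```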
You can retrieve the results of the operation using the google.longrunning.Operations service. Results remain available for retrieval for 5 days (120 hours). Alternatively, you can have the results uploaded directly to a Cloud Storage bucket.
Transcribe long audio files using a Cloud Storage bucket
These samples use a Cloud Storage bucket to store the raw audio input for the
long-running transcription process. For an example of a typical
longrunningrecognize operation response, see the reference documentation.
Protocol
Refer to the speech:longrunningrecognize API endpoint for complete
details.
To perform asynchronous speech recognition, make a POST request and provide the
appropriate request body. The following shows an example of a POST request using
curl. The example uses the Google Cloud CLI to generate an access
token. For instructions on installing the gcloud CLI,
see the quickstart.
curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'config': {
    'language_code': 'en-US'
  },
  'audio': {
    'uri': 'gs://cloud-samples-tests/speech/brooklyn.flac'
  }
}" "https://speech.googleapis.com/v1/speech:longrunningrecognize"
See the RecognitionConfig and RecognitionAudio reference documentation for more information on configuring the request body.
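Because the curl example embeds single-quoted, JSON-like text inside a double-quoted shell string, it can be easier to generate a strict JSON body with a library and pass it to curl from a file. The following minimal sketch does that with Python's json module; build_longrunning_request is a hypothetical helper, and it only sets the two fields used in the curl example.

```python
import json


def build_longrunning_request(gcs_uri: str, language_code: str = "en-US") -> str:
    """Return a JSON body for a speech:longrunningrecognize POST request.

    Field names mirror the curl example (config.language_code, audio.uri).
    """
    body = {
        "config": {"language_code": language_code},
        "audio": {"uri": gcs_uri},
    }
    return json.dumps(body, indent=2)


print(build_longrunning_request("gs://cloud-samples-tests/speech/brooklyn.flac"))
```

You could save the printed body as request.json and send it with curl's --data @request.json option instead of the inline string.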
If the request is successful, the server returns a 200 OK HTTP status code and
the response in JSON format:
{
  "name": "7612202767953098924"
}
In this response, name is the name of the long-running operation created for the request.
Wait for processing to complete. Processing time depends on your source audio;
in most cases, results are available in about half the duration of the source
audio.
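As a rough illustration of the "half the duration" guideline, the sketch below turns an audio duration into a wait timeout. suggest_timeout_seconds is hypothetical, and its safety_factor and minimum_seconds defaults are assumptions chosen for illustration (the 90-second floor matches the timeout in the Python sample on this page), not service limits.

```python
def suggest_timeout_seconds(audio_seconds: float,
                            safety_factor: float = 2.0,
                            minimum_seconds: float = 90.0) -> float:
    """Hypothetical rule of thumb for how long to wait on a long-running operation.

    Results usually arrive in about half the audio's duration; the safety
    factor and minimum are illustrative assumptions, not API limits.
    """
    expected_seconds = audio_seconds / 2
    return max(minimum_seconds, expected_seconds * safety_factor)


print(suggest_timeout_seconds(60 * 60))  # 3600.0 for one hour of audio
```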
You can check the status of your long-running operation by sending a GET
request to the https://speech.googleapis.com/v1/operations/your-operation-name
endpoint. Replace your-operation-name with the name
returned by your longrunningrecognize request.
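A minimal sketch of building the status URL from the longrunningrecognize response body, assuming the response JSON has the "name" field shown earlier; operation_status_url is a hypothetical helper.

```python
import json
from urllib.parse import quote

OPERATIONS_ENDPOINT = "https://speech.googleapis.com/v1/operations/"


def operation_status_url(create_response_json: str) -> str:
    """Build the GET URL for an operation from the longrunningrecognize response body."""
    name = json.loads(create_response_json)["name"]
    # Percent-encode characters that are not URL-safe (quote keeps "/" by default).
    return OPERATIONS_ENDPOINT + quote(name)


print(operation_status_url('{"name": "7612202767953098924"}'))
```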
curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     "https://speech.googleapis.com/v1/operations/your-operation-name"
If the request is successful, the server returns a 200 OK HTTP status code and
the response in JSON format:
{
  "name": "7612202767953098924",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2017-07-20T16:36:55.033650Z",
    "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "how old is the Brooklyn Bridge",
            "confidence": 0.96096134
          }
        ]
      },
      {
        "alternatives": [
          {
            ...
          }
        ]
      }
    ]
  }
}
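Once done is true, the transcript is in response.results, with the most likely alternative listed first in each result. The sketch below extracts those transcripts from the decoded JSON; extract_transcripts is hypothetical, and it assumes the operation shape in the example response (a finished google.longrunning.Operation carries either a response or an error field).

```python
import json


def extract_transcripts(operation: dict) -> list:
    """Return the most likely transcript for each result in a finished operation.

    Raises ValueError if the operation is still running and RuntimeError if it failed.
    """
    if not operation.get("done"):
        raise ValueError("operation has not completed yet")
    if "error" in operation:
        raise RuntimeError(f"operation failed: {operation['error']}")
    results = operation.get("response", {}).get("results", [])
    # Each result covers a consecutive portion of the audio; the first
    # alternative is the most likely one for that portion.
    return [r["alternatives"][0]["transcript"] for r in results if r.get("alternatives")]


sample_operation = json.loads("""
{
  "name": "7612202767953098924",
  "done": true,
  "response": {
    "results": [
      {"alternatives": [{"transcript": "how old is the Brooklyn Bridge", "confidence": 0.96096134}]}
    ]
  }
}
""")
print(" ".join(extract_transcripts(sample_operation)))  # how old is the Brooklyn Bridge
```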
If the operation has not completed, you can poll the endpoint by repeatedly
making the GET request until the done property of the response is true.
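The polling loop can be sketched as follows. poll_until_done is hypothetical: it takes any fetch callable that performs the GET request and returns decoded JSON, so the HTTP client, the 10-second interval, and the 1-hour timeout are all assumptions you would adapt.

```python
import time
from typing import Callable


def poll_until_done(fetch: Callable[[], dict],
                    interval_seconds: float = 10.0,
                    timeout_seconds: float = 3600.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> dict:
    """Call fetch() until it returns an operation with "done": true.

    fetch is expected to send the GET request to the operations endpoint and
    return the decoded JSON. Raises TimeoutError after timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while True:
        operation = fetch()
        if operation.get("done"):
            return operation
        if clock() >= deadline:
            raise TimeoutError("operation did not finish before the timeout")
        progress = operation.get("metadata", {}).get("progressPercent", 0)
        print(f"Waiting for operation... {progress}% complete")
        sleep(interval_seconds)


# Usage with a stand-in fetch that finishes on the second call.
fake_responses = iter([
    {"done": False, "metadata": {"progressPercent": 40}},
    {"done": True, "response": {"results": []}},
])
operation = poll_until_done(lambda: next(fake_responses), sleep=lambda seconds: None)
print(operation["done"])  # True
```

Injecting sleep and clock keeps the loop testable without real waiting.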
gcloud
Refer to the
recognize-long-running command for complete details.
To perform asynchronous speech recognition, use the Google Cloud CLI, providing the path of a local file or a Cloud Storage URL.
gcloud ml speech recognize-long-running \
    'gs://cloud-samples-tests/speech/brooklyn.flac' \
    --language-code='en-US' --async
If the request is successful, the server returns the ID of the long-running operation in JSON format.
{
  "name": OPERATION_ID
}
You can then get information about the operation by running the following command.
gcloud ml speech operations describe OPERATION_ID
You can also poll the operation until it completes by running the following command.
gcloud ml speech operations wait OPERATION_ID
After the operation completes, the operation returns a transcript of the audio in JSON format.
{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}
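If you want to script these gcloud steps, a minimal Python sketch could start the operation with --async, read the operation ID from the JSON output, and then block on operations wait. transcribe_with_gcloud is a hypothetical helper; it adds gcloud's global --format=json flag so the output is always JSON, and it assumes the gcloud CLI is installed and authenticated.

```python
import json
import subprocess
from typing import Callable


def transcribe_with_gcloud(gcs_uri: str, language_code: str = "en-US",
                           run: Callable = subprocess.run) -> dict:
    """Start an asynchronous gcloud recognition, wait for it, and return the result JSON.

    run defaults to subprocess.run and is injectable so the command sequence
    can be exercised without gcloud installed.
    """
    # Equivalent to: gcloud ml speech recognize-long-running URI --language-code=... --async
    started = run(
        ["gcloud", "ml", "speech", "recognize-long-running", gcs_uri,
         f"--language-code={language_code}", "--async", "--format=json"],
        capture_output=True, text=True, check=True,
    )
    operation_id = json.loads(started.stdout)["name"]

    # Equivalent to: gcloud ml speech operations wait OPERATION_ID
    finished = run(
        ["gcloud", "ml", "speech", "operations", "wait", operation_id, "--format=json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(finished.stdout)
```

For example, transcribe_with_gcloud("gs://cloud-samples-tests/speech/brooklyn.flac")["results"][0]["alternatives"][0]["transcript"] would give the first transcript segment. Injecting run keeps the sketch testable without calling the real CLI.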
Go
To learn how to install and use the client library for Cloud STT, see Cloud STT client libraries. For more information, see the Cloud STT Go API reference documentation.
To authenticate to Cloud STT, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import (
	"context"
	"fmt"
	"io"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// sendGCS transcribes the audio file at gcsURI with a long-running request
// and writes every recognition alternative to w.
func sendGCS(w io.Writer, client *speech.Client, gcsURI string) error {
	ctx := context.Background()

	// Send the Cloud Storage URI of the audio file, along with its encoding
	// and sample rate, to be transcribed.
	req := &speechpb.LongRunningRecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
		},
	}

	op, err := client.LongRunningRecognize(ctx, req)
	if err != nil {
		return err
	}
	// Block until the long-running operation completes.
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(w, "\"%v\" (confidence=%.3f)\n", alt.Transcript, alt.Confidence)
		}
	}
	return nil
}
Java
To learn how to install and use the client library for Cloud STT, see Cloud STT client libraries. For more information, see the Cloud STT Java API reference documentation.
To authenticate to Cloud STT, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import com.google.api.gax.longrunning.OperationFuture;
import com.google.api.gax.longrunning.OperationTimedPollAlgorithm;
import com.google.api.gax.retrying.RetrySettings;
import com.google.api.gax.retrying.TimedRetryAlgorithm;
import com.google.cloud.speech.v1.LongRunningRecognizeMetadata;
import com.google.cloud.speech.v1.LongRunningRecognizeResponse;
import com.google.cloud.speech.v1.RecognitionAudio;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1.SpeechRecognitionResult;
import com.google.cloud.speech.v1.SpeechSettings;
import java.util.List;
import org.threeten.bp.Duration;

/**
 * Performs non-blocking speech recognition on a remote FLAC file and prints the transcription.
 *
 * @param gcsUri the path to the remote FLAC audio file to transcribe.
 */
public static void asyncRecognizeGcs(String gcsUri) throws Exception {
  // Configure polling algorithm
  SpeechSettings.Builder speechSettings = SpeechSettings.newBuilder();
  TimedRetryAlgorithm timedRetryAlgorithm =
      OperationTimedPollAlgorithm.create(
          RetrySettings.newBuilder()
              .setInitialRetryDelay(Duration.ofMillis(500L))
              .setRetryDelayMultiplier(1.5)
              .setMaxRetryDelay(Duration.ofMillis(5000L))
              .setInitialRpcTimeout(Duration.ZERO) // ignored
              .setRpcTimeoutMultiplier(1.0) // ignored
              .setMaxRpcTimeout(Duration.ZERO) // ignored
              .setTotalTimeout(Duration.ofHours(24L)) // set polling timeout to 24 hours
              .build());
  speechSettings.longRunningRecognizeOperationSettings().setPollingAlgorithm(timedRetryAlgorithm);

  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create(speechSettings.build())) {

    // Configure remote file request for FLAC
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.FLAC)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .build();
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);
    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    for (SpeechRecognitionResult result : results) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcription: %s\n", alternative.getTranscript());
    }
  }
}
Node.js
To learn how to install and use the client library for Cloud STT, see Cloud STT client libraries. For more information, see the Cloud STT Node.js API reference documentation.
To authenticate to Cloud STT, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

async function transcribeGcs() {
  const config = {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
  };
  const audio = {
    uri: gcsUri,
  };
  const request = {
    config: config,
    audio: audio,
  };

  // Detects speech in the audio file. This creates a recognition job that you
  // can wait for now, or get its result later.
  const [operation] = await client.longRunningRecognize(request);
  // Get a Promise representation of the final result of the job
  const [response] = await operation.promise();
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log(`Transcription: ${transcription}`);
}

// In a CommonJS module, await is only valid inside an async function.
transcribeGcs().catch(console.error);
Python
To learn how to install and use the client library for Cloud STT, see Cloud STT client libraries. For more information, see the Cloud STT Python API reference documentation.
To authenticate to Cloud STT, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
from google.cloud import speech


def transcribe_gcs(gcs_uri: str) -> str:
    """Asynchronously transcribes the audio file from Cloud Storage.

    Args:
        gcs_uri: The Google Cloud Storage path to an audio file.
            E.g., "gs://storage-bucket/file.flac".

    Returns:
        The generated transcript from the audio file provided.
    """
    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=44100,
        language_code="en-US",
    )

    operation = client.long_running_recognize(config=config, audio=audio)

    print("Waiting for operation to complete...")
    # Increase the timeout for longer audio: processing usually takes about
    # half the duration of the source audio.
    response = operation.result(timeout=90)

    transcript_builder = []
    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        transcript_builder.append(f"\nTranscript: {result.alternatives[0].transcript}")
        transcript_builder.append(f"\nConfidence: {result.alternatives[0].confidence}")

    transcript = "".join(transcript_builder)
    print(transcript)

    return transcript
Additional languages
C#: Please follow the C# setup instructions on the client libraries page and then visit the Cloud STT reference documentation for .NET.
PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Cloud STT reference documentation for PHP.
Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Cloud STT reference documentation for Ruby.
Upload your transcription results to a Cloud Storage bucket
Cloud Speech-to-Text can upload your long-running recognition results directly to a Cloud Storage bucket. If you combine this feature with Cloud Storage triggers, each upload can trigger a notification that calls a Cloud Function, which removes the need to poll Cloud Speech-to-Text for recognition results.
To have your results uploaded to a Cloud Storage bucket, provide the optional
TranscriptOutputConfig
output configuration in your long-running recognition request.
message TranscriptOutputConfig {

  oneof output_type {
    // Specifies a Cloud Storage URI for the recognition results. Must be
    // specified in the format: `gs://bucket_name/object_name`
    string gcs_uri = 1;
  }
}
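A request that sets this output configuration can be assembled as in the following sketch. build_request_with_output is a hypothetical helper; its regular expression is a simplified check of the gs://bucket_name/object_name format described in the message comment, not a full validation of Cloud Storage naming rules.

```python
import json
import re

# Simplified gs://bucket_name/object_name check: a bucket without "/",
# followed by a non-empty object name.
_GCS_URI = re.compile(r"^gs://([^/]+)/(.+)$")


def build_request_with_output(audio_uri: str, output_uri: str,
                              language_code: str = "en-US") -> dict:
    """Return a longrunningrecognize request body that uploads results to output_uri."""
    if not _GCS_URI.match(output_uri):
        raise ValueError(f"output URI must look like gs://bucket_name/object_name: {output_uri}")
    return {
        "config": {"language_code": language_code},
        "output_config": {"gcs_uri": output_uri},
        "audio": {"uri": audio_uri},
    }


body = build_request_with_output("gs://bucket/audio-path", "gs://bucket/result-output-path.json")
print(json.dumps(body, indent=2))
```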
Protocol
Refer to the longrunningrecognize
API endpoint for complete details.
The following example shows how to send a POST request using curl,
where the body of the request specifies the path to a Cloud Storage
bucket. The results are uploaded to this location as a JSON file that stores
SpeechRecognitionResult messages.
curl -X POST \
     -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'config': {...},
  'output_config': {
    'gcs_uri': 'gs://bucket/result-output-path.json'
  },
  'audio': {
    'uri': 'gs://bucket/audio-path'
  }
}" "https://speech.googleapis.com/v1/speech:longrunningrecognize"
The LongRunningRecognizeResponse
includes the path to the Cloud Storage bucket where the upload was attempted. If
the upload fails, the response includes an output error. If a file with
the same name already exists, the results are written to a new file with a
timestamp as the suffix.
{
  ...
  "metadata": {
    ...
    "outputConfig": {...}
  },
  ...
  "response": {
    ...
    "results": [...],
    "outputConfig": {
      "gcs_uri": "gs://bucket/result-output-path"
    },
    "outputError": {...}
  }
}
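Because the results may land in a timestamp-suffixed file rather than the path you requested, it is safer to read the final location from the response. The following minimal sketch does that and surfaces outputError; uploaded_result_uri is hypothetical, and it accepts both gcs_uri (as shown in the example) and gcsUri, since the proto3 JSON mapping normally emits lowerCamelCase field names.

```python
def uploaded_result_uri(response: dict) -> str:
    """Return where the results were written, or raise if the upload failed.

    response is the "response" object of a finished operation. The path is read
    from the response rather than reused from the request, because an existing
    object with the same name causes a new, timestamp-suffixed file to be written.
    """
    if response.get("outputError"):
        raise RuntimeError(f"upload to Cloud Storage failed: {response['outputError']}")
    output_config = response.get("outputConfig", {})
    # Accept either spelling of the field name.
    uri = output_config.get("gcs_uri") or output_config.get("gcsUri")
    if not uri:
        raise KeyError("response has no outputConfig.gcs_uri")
    return uri


print(uploaded_result_uri({"outputConfig": {"gcs_uri": "gs://bucket/result-output-path"}}))
```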