Overview

The goal of text2speech is to harmonize various text-to-speech engines, including Amazon Polly, Coqui TTS, Google Cloud Text-to-Speech API, and Microsoft Cognitive Services Text to Speech REST API.

With the exception of Coqui TTS, all these engines are accessible as R packages:

aws.polly is a client for Amazon Polly
googleLanguageR is a client to the Google Cloud Text-to-Speech API
conrad is a client to the Microsoft Cognitive Services Text to Speech REST API

You might notice Coqui TTS doesn’t have its own R package. This is because, at this time, text2speech directly incorporates the functionality of Coqui TTS. The R wrapper of Coqui is under development.

Installation

You can install this package from CRAN or the development version from GitHub with:

 # Install from CRAN
 install.packages("text2speech")
 
 # or the development version from GitHub
 # install.packages("devtools")
devtools::install_github("jhudsl/text2speech")

Authentication

Check for authentication. If not already authenticated, users must individually configure it for each service.

 library(text2speech)
 
 # Amazon Polly
 tts_auth("amazon")
 #> [1] TRUE
 # Coqui TTS
 tts_auth("coqui")
 #> [1] TRUE
 # Google Cloud Text-to-Speech API 
 tts_auth("google")
 #> [1] TRUE
 # Microsoft Cognitive Services Text to Speech REST API
 tts_auth("microsoft")
 #> [1] TRUE

Voices

List different voice options for each service.

 # Amazon Polly
voices_amazon <- tts_amazon_voices()
 head(voices_amazon)
 #> voice language language_code gender service
 #> 1 Zeina Arabic arb Female amazon
 #> 2 Zhiyu Chinese Mandarin cmn-CN Female amazon
 #> 3 Naja Danish da-DK Female amazon
 #> 4 Mads Danish da-DK Male amazon
 #> 5 Ruben Dutch nl-NL Male amazon
 #> 6 Lotte Dutch nl-NL Female amazon
 
 # Coqui TTS
voices_coqui <- tts_coqui_voices()
 #> i Test out different voices on the CoquiTTS Demo (<https://huggingface.co/spaces/coqui/CoquiTTS>)
 head(voices_coqui)
 #> # A tibble: 6 ×ばつ 5
 #> type language dataset model_name service
 #> <chr> <chr> <chr> <chr> <chr> 
 #> 1 tts_models multilingual multi-dataset your_tts coqui 
 #> 2 tts_models multilingual multi-dataset bark coqui 
 #> 3 tts_models bg cv vits coqui 
 #> 4 tts_models cs cv vits coqui 
 #> 5 tts_models da cv vits coqui 
 #> 6 tts_models et cv vits coqui
 
 # Google Cloud Text-to-Speech API 
voices_google <- tts_google_voices()
 head(voices_google)
 #> voice language language_code gender service
 #> 1 af-ZA-Standard-A <NA> af-ZA FEMALE google
 #> 2 af-ZA-Standard-A <NA> af-ZA FEMALE google
 #> 3 ar-XA-Wavenet-C Arabic ar-XA MALE google
 #> 4 ar-XA-Standard-C Arabic ar-XA MALE google
 #> 5 ar-XA-Standard-D Arabic ar-XA FEMALE google
 #> 6 ar-XA-Wavenet-A Arabic ar-XA FEMALE google
 
 # Microsoft Cognitive Services Text to Speech REST API
voices_microsoft <- tts_microsoft_voices()
 head(voices_microsoft)
 #> voice
 #> 1 Microsoft Server Speech Text to Speech Voice (af-ZA, AdriNeural)
 #> 2 Microsoft Server Speech Text to Speech Voice (af-ZA, WillemNeural)
 #> 3 Microsoft Server Speech Text to Speech Voice (am-ET, MekdesNeural)
 #> 4 Microsoft Server Speech Text to Speech Voice (am-ET, AmehaNeural)
 #> 5 Microsoft Server Speech Text to Speech Voice (ar-AE, FatimaNeural)
 #> 6 Microsoft Server Speech Text to Speech Voice (ar-AE, HamdanNeural)
 #> language language_code gender service
 #> 1 Afrikaans (South Africa) af-ZA Female microsoft
 #> 2 Afrikaans (South Africa) af-ZA Male microsoft
 #> 3 Amharic (Ethiopia) am-ET Female microsoft
 #> 4 Amharic (Ethiopia) am-ET Male microsoft
 #> 5 Arabic (United Arab Emirates) ar-AE Female microsoft
 #> 6 Arabic (United Arab Emirates) ar-AE Male microsoft

Convert text to speech

Synthesize speech with tts(text = "TEXT", service = "ENGINE")

 # Amazon Polly
 tts("Hello world!", service = "amazon")
 
 # Coqui TTS
 tts("Hello world!", service = "coqui")
 
 # Google Cloud Text-to-Speech API 
 tts("Hello world!", service = "google")
 
 # Microsoft Cognitive Services Text to Speech REST API
 tts("Hello world!", service = "microsoft")

The resulting output will consist of a standardized tibble featuring the following columns:

index: Sequential identifier number
original_text: The text input provided by the user
text: In case original_text exceeds the character limit, text represents the outcome of splitting original_text. Otherwise, text remains the same as original_text.
wav: Wave object (S4 class)
file: File path to the audio file
audio_type: The audio format, either mp3 or wav
duration : The duration of the audio file
service: The text-to-speech engine used