🔔 Notice: This project is no longer being updated. Please refer to the new project, LightSpeed, which includes a new male voice.
Duration model + Acoustic model + HiFiGAN vocoder for Vietnamese text-to-speech applications.
Online demo at https://huggingface.co/spaces/ntt123/vietTTS.
A synthesized audio clip: clip.wav. A Colab notebook: notebook.
Check out the experimental multi-speaker branch (`git checkout multi-speaker`) for multi-speaker support.
```sh
git clone https://github.com/NTT123/vietTTS.git
cd vietTTS
pip3 install -e .
```
```sh
bash ./scripts/quick_start.sh
```
```sh
python ./scripts/download_aligned_infore_dataset.py
```
Note: this is a denoised and aligned version of the original dataset, which was donated by the InfoRe Technology company (see here). You can download the original dataset (InfoRe Technology 1) here.
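After the download finishes, it can help to sanity-check the files. A minimal sketch, assuming the aligned dataset lands in the `train_data/` directory referenced later in this README:

```python
# minimal sanity check, assuming the aligned dataset is extracted to train_data/
from pathlib import Path

files = sorted(Path("train_data").iterdir())
print(f"{len(files)} files, e.g.: {[f.name for f in files[:5]]}")
```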
See notebooks/denoise_infore_dataset.ipynb for instructions on how to denoise the dataset. We use the Montreal Forced Aligner (MFA) to align the transcripts and speech (producing TextGrid files).
See notebooks/align_text_audio_infore_mfa.ipynb for instructions on how to create the TextGrid files.
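If you want to inspect the alignments yourself, here is a minimal sketch using the third-party `textgrid` package (an assumption; the notebooks may use a different reader) to list phone durations from one file:

```python
# minimal sketch: read phone durations from a TextGrid alignment
# assumes `pip install textgrid` and a tier named "phones" (MFA's default)
import textgrid

tg = textgrid.TextGrid.fromFile("train_data/example.TextGrid")  # hypothetical file name
phones = tg.getFirst("phones")
for interval in phones:
    duration = interval.maxTime - interval.minTime
    print(f"{interval.mark or 'sil':>6}  {duration:.3f}s")
```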
```sh
# train the duration model
python -m vietTTS.nat.duration_trainer
```
```sh
# train the acoustic model
python -m vietTTS.nat.acoustic_trainer
```
We use the original implementation from the HiFiGAN authors at https://github.com/jik876/hifi-gan. Use the config file at assets/hifigan/config.json to train your model.
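Before launching training, you can sanity-check a few config values. A minimal sketch, assuming the file follows the upstream HiFiGAN config layout (key names like `sampling_rate` are an assumption):

```python
# minimal sketch: peek at the vocoder config before training
# key names assume the upstream hifi-gan config layout
import json

with open("assets/hifigan/config.json") as f:
    cfg = json.load(f)
print("sampling_rate:", cfg.get("sampling_rate"))
print("num_mels:", cfg.get("num_mels"))
print("segment_size:", cfg.get("segment_size"))
```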
```sh
git clone https://github.com/jik876/hifi-gan.git

# create dataset in hifi-gan format
ln -sf `pwd`/train_data hifi-gan/data
cd hifi-gan/data
ls -1 *.TextGrid | sed -e 's/\.TextGrid$//' > files.txt
cd ..
head -n 100 data/files.txt > val_files.txt
tail -n +101 data/files.txt > train_files.txt
rm data/files.txt

# training
python train.py \
  --config ../assets/hifigan/config.json \
  --input_wavs_dir=data \
  --input_training_file=train_files.txt \
  --input_validation_file=val_files.txt
```
Fine-tune on ground-truth aligned (GTA) mel-spectrograms:
```sh
cd /path/to/vietTTS                                       # go to vietTTS directory
python -m vietTTS.nat.zero_silence_segments -o train_data # zero all [sil, sp, spn] segments
python -m vietTTS.nat.gta -o /path/to/hifi-gan/ft_dataset # create GTA mel-spectrograms at hifi-gan/ft_dataset directory

# turn on finetune
cd /path/to/hifi-gan
python train.py \
  --fine_tuning True \
  --config ../assets/hifigan/config.json \
  --input_wavs_dir=data \
  --input_training_file=train_files.txt \
  --input_validation_file=val_files.txt
```
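After the GTA step, you can spot-check one generated mel-spectrogram. A minimal sketch, assuming vietTTS.nat.gta writes NumPy arrays (check that module for the actual file format and shape):

```python
# minimal sketch: inspect one GTA mel-spectrogram
# assumes .npy output from vietTTS.nat.gta -- verify the actual format in that module
import numpy as np

mel = np.load("/path/to/hifi-gan/ft_dataset/example.npy")  # hypothetical file name
print(mel.shape, mel.dtype, mel.min(), mel.max())
```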
Then, use the following command to convert the PyTorch checkpoint to Haiku format:
```sh
cd ..
python -m vietTTS.hifigan.convert_torch_model_to_haiku \
  --config-file=assets/hifigan/config.json \
  --checkpoint-file=hifi-gan/cp_hifigan/g_[latest_checkpoint]
```

To synthesize speech from text:

```sh
python -m vietTTS.synthesizer \
  --lexicon-file=train_data/lexicon.txt \
  --text="hôm qua em tới trường" \
  --output=clip.wav
```
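To quickly verify the result, here is a minimal sketch using the `soundfile` package (an assumption; any audio library works) to check the duration and sample rate of the output:

```python
# minimal sketch: verify the synthesized clip
# assumes `pip install soundfile`
import soundfile as sf

wav, sr = sf.read("clip.wav")
print(f"{len(wav) / sr:.2f}s at {sr} Hz")
```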