Light Speed ⚡

Light Speed ⚡ is an open-source text-to-speech model based on VITS, with two modifications:

  • It uses ground-truth phoneme durations, obtained from an external forced aligner (such as the Montreal Forced Aligner), to upsample phoneme-level information to frame-level information (see the first sketch below). The result is a more robust model, at a slight cost in speech quality.
  • It employs dilated convolutions to expand the receptive field of the WaveNet flow module, enhancing its ability to capture long-range interactions (see the second sketch below).
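
The first modification amounts to a length-regulator-style expansion: each phoneme's hidden vector is repeated for as many frames as the aligner assigned it. A minimal sketch of that expansion, assuming durations are integer frame counts per phoneme (illustrative, not the repository's actual code):

```python
import torch

def upsample_to_frames(phoneme_hidden: torch.Tensor,
                       durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme's hidden vector `duration` times along time.

    phoneme_hidden: (num_phonemes, channels)
    durations:      (num_phonemes,) integer frame counts from the aligner
    returns:        (num_frames, channels) frame-level sequence
    """
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(3, 192)         # 3 phonemes, 192 channels
durations = torch.tensor([4, 2, 6])  # frames per phoneme from the aligner
print(upsample_to_frames(hidden, durations).shape)  # torch.Size([12, 192])
```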
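For the second modification, here is a sketch of how dilation grows a convolutional stack's receptive field; the channel count and dilation schedule are illustrative placeholders, not the repository's actual configuration:

```python
import torch
import torch.nn as nn

# Receptive field of a stack of 1-D convs: 1 + sum((kernel - 1) * dilation).
kernel_size, dilations = 3, [1, 2, 4, 8]           # illustrative schedule
stack = nn.Sequential(*[
    nn.Conv1d(192, 192, kernel_size, dilation=d,
              padding=(kernel_size - 1) // 2 * d)  # "same" padding
    for d in dilations
])
rf = 1 + sum((kernel_size - 1) * d for d in dilations)
print(rf)  # 31 frames, vs. 9 for the same stack with dilation=1 everywhere
y = stack(torch.randn(1, 192, 100))                # shape preserved: (1, 192, 100)
```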

Pretrained models and demos

We provide two pretrained models and demos.

FAQ

Q: How do I create training data?
A: See the ./prepare_ljs_tfdata.ipynb notebook for instructions on preparing the training data.

Q: How can I train the model with 1 GPU?
A: Run `python train.py`.

Q: How can I train the model with 4 GPUs?
A: Run `torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py`.

Q: How can I train a model to predict phoneme durations?
A: See the ./train_duration_model.ipynb notebook.

Q: How can I generate speech with a trained model?
A: See the ./inference.ipynb notebook.

Credits
