Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.
The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to,
- TTS: Text to Speech (⛳ supported)
- SVS: Singing Voice Synthesis (👨💻 developing)
- VC: Voice Conversion (⛳ supported)
- AC: Accent Conversion (⛳ supported)
- SVC: Singing Voice Conversion (⛳ supported)
- TTA: Text to Audio (⛳ supported)
- TTM: Text to Music (👨💻 developing)
- more...
In addition to the specific generation tasks, Amphion includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent comparisons across generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building large-scale datasets for speech synthesis.
- 2025/05/26: We release DualCodec, a low-frame-rate (12.5Hz or 25Hz), semantically-enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation. (paper, Colab, demo page, code)
- 2025/04/12: We release Vevo1.5, which extends Vevo and focuses on unified and controllable generation for both speech and singing voice. Vevo1.5 can be applied to a range of speech and singing voice generation tasks, including VC, TTS, AC, SVS, SVC, Speech/Singing Voice Editing, Singing Style Conversion, and more. (blog)
- 2025/02/26: We release Metis, a foundation model for unified speech generation. The system supports zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech. (arXiv, hf)
- 2025/02/26: The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!! Emilia-Large combines the original 101k-hour Emilia dataset (licensed under CC BY-NC 4.0) with the brand-new 114k-hour Emilia-YODAS dataset (licensed under CC BY 4.0). Download at hf. Check details at arXiv.
- 2025/01/30: We release the Amphion v0.2 Technical Report, which provides a comprehensive overview of the Amphion updates in 2024. (arXiv)
- 2025/01/23: MaskGCT and Vevo got accepted by ICLR 2025! 🎉
- 2024/12/22: We release the reproduction of Vevo, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied to a range of speech generation tasks, including VC, TTS, AC, and more. The released pre-trained models are trained on the Emilia dataset and achieve SOTA zero-shot VC performance. (arXiv, hf, WebPage, readme)
- 2024/10/19: We release MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the Emilia dataset and achieves SOTA zero-shot TTS performance. (arXiv, hf, ModelScope, readme)
- 2024/09/01: Amphion, Emilia and DSFF-SVC got accepted by IEEE SLT 2024! 🤗
- 2024/08/28: Join Amphion's Discord channel to stay connected and engage with our community!
- 2024/08/20: SingVisio got accepted by Computers & Graphics, available here! 🎉
- 2024/08/27: The Emilia dataset is now publicly available! Discover the most extensive and diverse speech generation dataset with 101k hours of in-the-wild speech data now at hf or OpenDataLab! 👑👑👑
- 2024/07/01: Amphion now releases Emilia, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and Emilia-Pipe, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! (arXiv, hf, demo, readme)
- 2024/03/12: Amphion now supports NaturalSpeech3 FACodec and releases pretrained checkpoints. (arXiv, hf, readme)
- 2024/02/22: The first Amphion visualization tool, SingVisio, is released. (arXiv, openxlab, Video, readme)
- 2023/12/18: Amphion v0.1 release. (arXiv, hf, youtube, readme)
- 2023/11/28: Amphion alpha release. (readme)
- Amphion achieves state-of-the-art performance compared to existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
- FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks. code
- VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning. code
- VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes. code
- NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices. code
- Jets: An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module. code
- MaskGCT: A fully non-autoregressive TTS architecture that eliminates the need for explicit alignment information between text and speech supervision. code
- Vevo-TTS: A zero-shot TTS architecture with controllable timbre and style. It consists of an autoregressive transformer and a flow-matching transformer. code
- DualCodec-VALLE: A VALLE model trained on 12.5Hz DualCodec tokens for super fast generation.
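To make the frame-rate figures above concrete, the following minimal sketch counts how many codec frames cover a given stretch of audio, assuming one frame every 1/frame_rate seconds. The helper `num_frames` is hypothetical and not part of Amphion, the 50 Hz row is only an illustrative comparison point rather than a DualCodec setting, and the number of tokens per frame (which depends on the number of codebooks) is not modeled here.

```python
import math


def num_frames(duration_s: float, frame_rate_hz: float) -> int:
    """Number of codec frames needed to cover `duration_s` seconds of audio.

    Assumes one frame every 1 / frame_rate_hz seconds; a partial final
    frame is rounded up.
    """
    return math.ceil(duration_s * frame_rate_hz)


if __name__ == "__main__":
    # 12.5 Hz and 25 Hz are the frame rates mentioned for DualCodec;
    # 50 Hz is a hypothetical higher-rate comparison point.
    for rate in (12.5, 25.0, 50.0):
        print(f"{rate:>5} Hz: {num_frames(10.0, rate)} frames for 10 s of audio")
```

A language model that generates one step per frame therefore has fewer steps to produce at a lower frame rate, which is the intuition behind the speed claim for 12.5Hz tokens.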
 
Amphion supports the following voice conversion models:
- Vevo: A zero-shot voice imitation framework with controllable timbre and style. Vevo-Timbre conducts the style-preserved voice conversion, and Vevo-Voice conducts the style-converted voice conversion. code
- FACodec: FACodec decomposes speech into subspaces representing different attributes like content, prosody, and timbre. It can achieve zero-shot voice conversion. code
- Noro: A noise-robust zero-shot voice conversion system. Noro introduces innovative components tailored for VC using noisy reference speeches, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. code
- DualCodec: A low-frame-rate (12.5Hz or 25Hz), semantically-enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation. (paper, Colab, demo page, code)
- FACodec: FACodec decomposes speech into subspaces representing different attributes like content, prosody, and timbre. code
- Amphion supports AC with Vevo-Style. In particular, it can perform accent conversion in a zero-shot manner. code
- Amphion supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec. Their specific roles in SVC have been investigated in our SLT 2024 paper. arXiv code
- Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE- and flow-based models. The diffusion-based architecture uses Bidirectional dilated CNN as a backend and supports several sampling algorithms such as DDPM, DDIM, and PNDM. Additionally, it supports single-step inference based on the Consistency Model. code
- Amphion supports TTA with a latent diffusion model. Its design follows AudioLDM, Make-an-Audio, and AUDIT. It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. arXiv code
- Amphion supports various widely-used neural vocoders, including:
- Amphion provides the official implementation of the Multi-Scale Constant-Q Transform Discriminator (our ICASSP 2024 paper). It can be used to enhance any GAN-based vocoder architecture during training while leaving the inference stage (e.g., memory and speed) unchanged. arXiv code
Amphion provides a comprehensive objective evaluation of the generated audio. code
The supported evaluation metrics include:
- F0 Modeling: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
- Energy Modeling: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
- Intelligibility: Character/Word Error Rate, which can be calculated based on Whisper and more.
- Spectrogram Distortion: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
- Speaker Similarity: Cosine similarity, which can be calculated based on RawNet3, Resemblyzer, WeSpeaker, WavLM and more.
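As a rough illustration of how a few of the metrics listed above are commonly defined, here is a minimal pure-Python sketch. It is not Amphion's implementation: the function names are hypothetical, the F0 RMSE here only counts frames voiced in both sequences (an assumption; other conventions exist), the WER applies no text normalization, and the speaker embeddings would in practice come from a speaker encoder such as those named above rather than from toy vectors.

```python
import math
from typing import Sequence


def f0_rmse(ref_f0: Sequence[float], pred_f0: Sequence[float]) -> float:
    """RMSE over frames that are voiced (f0 > 0) in both sequences (assumed convention)."""
    voiced = [(r, p) for r, p in zip(ref_f0, pred_f0) if r > 0 and p > 0]
    if not voiced:
        raise ValueError("no frames are voiced in both sequences")
    return math.sqrt(sum((r - p) ** 2 for r, p in voiced) / len(voiced))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    return dot / (norm_a * norm_b)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance, keeping only the previous row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(f0_rmse([100.0, 0.0, 200.0], [110.0, 120.0, 190.0]))
    print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
    print(word_error_rate("the cat sat", "the cat sat down"))
```

Character error rate follows the same edit-distance recipe with characters in place of words.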
- Amphion unifies the data preprocessing of open-source datasets including AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK, and more. The supported dataset list can be seen here (updating).
- Amphion (exclusively) supports the Emilia dataset and its preprocessing pipeline Emilia-Pipe for in-the-wild speech data!
Amphion provides visualization tools to interactively illustrate the internal processing mechanism of classic models. This provides an invaluable resource for educational purposes and for facilitating understandable research.
Currently, Amphion supports SingVisio, a visualization tool of the diffusion model for singing voice conversion. arXiv openxlab Video
Amphion can be installed through either Setup Installer or Docker Image.
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion
# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion
# Install Python Packages Dependencies
sh env.sh
- Install Docker, NVIDIA Driver, NVIDIA Container Toolkit, and CUDA.
- Run the following commands:
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion
docker pull realamphion/amphion
docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion
Mounting the dataset with the -v argument is necessary when using Docker. Please refer to Mount dataset in Docker container and Docker Docs for more details.
We detail the instructions of different tasks in the following recipes:
- Text to Speech (TTS)
- Voice Conversion (VC)
- Accent Conversion (AC)
- Singing Voice Conversion (SVC)
- Text to Audio (TTA)
- Vocoder
- Evaluation
- Visualization
We appreciate all contributions to improve Amphion. Please refer to CONTRIBUTING.md for the contributing guidelines.
- ming024's FastSpeech2 and jaywalnut310's VITS for model architecture code.
- lifeiteng's VALL-E for training pipeline and model architecture design.
- SpeechTokenizer for semantic-distilled tokenizer design.
- WeNet, Whisper, ContentVec, and RawNet3 for pretrained models and inference code.
- HiFi-GAN for GAN-based Vocoder's architecture design and training strategy.
- Encodec for well-organized GAN Discriminator's architecture and basic blocks.
- Latent Diffusion for model architecture design.
- TensorFlowTTS for preparing the MFA tools.
Amphion is under the MIT License. It is free for both research and commercial use cases.
Amphion v0.2:
@article{amphion_v0.2,
  title   = {Overview of the Amphion Toolkit (v0.2)},
  author  = {Jiaqi Li and Xueyao Zhang and Yuancheng Wang and Haorui He and Chaoren Wang and Li Wang and Huan Liao and Junyi Ao and Zeyu Xie and Yiqiao Huang and Junan Zhang and Zhizheng Wu},
  year    = {2025},
  journal = {arXiv preprint arXiv:2501.15442},
}
Amphion v0.1:
@inproceedings{amphion,
  author    = {Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
  title     = {Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
  booktitle = {{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
  year      = {2024}
}