💡 Some other speech AI projects from our team may interest you ✨.
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction
Haoqiu Yan#, Yongxin Zhu#, Kai Zheng, Bing Liu, Haoyu Cao, Deqiang Jiang, Linli Xu
This is a PyTorch implementation of the paper Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer.
Demo page: https://youngsheen.github.io/GPST/demo
The following picture shows an overview of GPST.
[Figure: The overview of GPST]
- Download the code

      git clone https://github.com/youngsheen/GPST.git
      cd GPST

- Install `fairseq` and `encodec` via `pip`. Install seamless_communication and fairseq2.
- [Optional] Install `flash-attn` for faster attention computation.
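Before moving on, it can help to confirm which of the packages above are importable in the current environment. The sketch below is only an illustration: the helper `check_dependencies` is hypothetical and not part of GPST, and the module names are assumed to be the usual import names of the packages listed above (for example `flash_attn` for `flash-attn`).

```python
import importlib.util

# Assumed import names for the packages listed above; flash_attn is optional.
REQUIRED = ["fairseq", "encodec", "seamless_communication", "fairseq2"]
OPTIONAL = ["flash_attn"]


def check_dependencies(names):
    """Map each module name to True if it can be located, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_dependencies(REQUIRED).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
    for name, found in check_dependencies(OPTIONAL).items():
        print(f"{name}: {'found' if found else 'not installed (optional)'}")
```

`find_spec` only locates a module, so the check is cheap and does not trigger CUDA initialization or model loading.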
Download the LibriSpeech or LibriLight dataset and place it in your directory at $PATH_TO_YOUR_WORKSPACE/datasets. We use xlsr2_1b_v2 from SeamlessM4T to extract semantic tokens and Encodec to extract acoustic tokens. You can set the bandwidth to 6 kbps or 12 kbps to control the quality of speech resynthesis. We suggest using bandwidth=12, since the first half of its acoustic tokens is identical to the 6 kbps tokens. The scripts generate a manifest listing the paths of all files, plus two LMDB folders, one containing the semantic tokens and one containing the acoustic tokens.
bash preprocess/run.sh
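The relation between the two bandwidths comes from residual vector quantization: a higher bandwidth adds more codebooks on top of the ones used at a lower bandwidth. The sketch below works through the arithmetic with toy data. It assumes the 24 kHz Encodec model, which produces 75 token frames per second with 1024-entry codebooks; the GPST preprocessing scripts may be configured differently, and the function names are hypothetical.

```python
import math

FRAME_RATE_HZ = 75     # assumption: 24 kHz Encodec model, 75 token frames per second
CODEBOOK_SIZE = 1024   # assumption: 1024 entries per codebook, i.e. 10 bits per token


def num_codebooks(bandwidth_kbps):
    """Number of residual codebooks that fit into the given bandwidth."""
    bits_per_codebook_per_sec = FRAME_RATE_HZ * math.log2(CODEBOOK_SIZE)  # 750 bits/s
    return int(bandwidth_kbps * 1000 // bits_per_codebook_per_sec)


def truncate_to_bandwidth(codes, bandwidth_kbps):
    """Keep only the leading codebooks of a [n_codebooks][n_frames] token grid."""
    return codes[: num_codebooks(bandwidth_kbps)]


# Toy token grid standing in for real 12 kbps acoustic tokens (3 frames).
n_frames = 3
codes_12k = [[q * 100 + t for t in range(n_frames)] for q in range(num_codebooks(12))]
codes_6k = truncate_to_bandwidth(codes_12k, 6)

print(len(codes_12k), len(codes_6k))  # prints "16 8"
```

Under these assumptions, tokens extracted once at 12 kbps can be reused for 6 kbps experiments by keeping the first 8 of the 16 codebooks.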
OUTPUT_DIR=outputs
ROOT=PATH
mkdir -p $OUTPUT_DIR
CUDA_VISIBLE_DEVICES=4,5 torchrun --nnodes=1 --nproc_per_node=2 --master_port=36666 \
    $(which fairseq-hydra-train) --config-dir config \
    --config-name st2at \
    hydra.run.dir=$ROOT/gpst \
    hydra.output_subdir=$OUTPUT_DIR \
    hydra.job.name=$OUTPUT_DIR/train \
    common.tensorboard_logdir=$OUTPUT_DIR/tb \
    checkpoint.save_dir=$OUTPUT_DIR/checkpoints \
    +task.data=$ROOT/LibriSpeech \
If you find GPST useful for your research and applications, please cite using this BibTeX:
@inproceedings{zhu-etal-2024-generative,
    title = "Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer",
    author = "Zhu, Yongxin and Su, Dan and He, Liqiang and Xu, Linli and Yu, Dong",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.97",
    doi = "10.18653/v1/2024.acl-long.97",
    pages = "1764--1775",
}