
SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

News

  • Checkpoints are released.
Other projects from our team on discrete-tokenizer-based multimodal GenAI may also interest you:

[NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective
Yongxin Zhu, Bocheng Li, Hang Zhang, Xin Li, Linli Xu, Lidong Bing
GitHub · arXiv

[ACL 2024] Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer
Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu
GitHub · arXiv (adopted by Moshi)

[EMNLP 2023] DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation
Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu
arXiv

Algorithm for SimVQ

The core code is here: https://github.com/youngsheen/SimVQ/blob/main/taming/modules/vqvae/quantize.py#L28-L33

Note: Optimizing both the codebook C and the linear layer W can work as well.
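
In short, SimVQ keeps a frozen, randomly initialized codebook C and trains only the single linear layer W on top of it, quantizing each latent to the nearest row of CW. Below is a minimal PyTorch sketch of that idea for orientation only; the class name, loss weighting, and tensor shapes are illustrative, and the linked quantize.py is the authoritative implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimVQSketch(nn.Module):
    """Illustrative SimVQ-style quantizer: frozen codebook C, trainable W."""

    def __init__(self, num_codes: int = 65536, dim: int = 8):
        super().__init__()
        # Frozen, randomly initialized codebook C.
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.requires_grad = False
        # The single trainable linear layer W reparameterizing C.
        self.proj = nn.Linear(dim, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, tokens, dim) encoder outputs.
        b, n, d = z.shape
        basis = self.proj(self.codebook.weight)      # effective codebook C W
        dist = torch.cdist(z.reshape(-1, d), basis)  # (b*n, num_codes)
        idx = dist.argmin(dim=-1)
        z_q = basis[idx].view(b, n, d)
        # VQ losses, computed before the straight-through step: the first
        # term trains W, the second keeps the encoder committed.
        loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z_q.detach(), z)
        # Straight-through estimator so gradients reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(b, n), loss
```

Freezing C and training only W is the default setting; as noted above, jointly optimizing C and W also works.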

Quantitative Comparison

Table 1. Reconstruction performance of different tokenizers on the $128 \times 128$ ImageNet 50k validation set.

| Method | Codebook Size | Codebook Utilization | rFID ↓ | LPIPS ↓ | PSNR (dB) ↑ | SSIM (%) ↑ | Checkpoint |
|---|---|---|---|---|---|---|---|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 70.6 | - |
| VQGAN | 65,536 | 4.5% | 3.23 | 0.15 | 22.89 | 72.3 | - |
| VQGAN-FC | 65,536 | 100.0% | 2.63 | 0.13 | 23.79 | 77.5 | - |
| FSQ | 64,000 | 100.0% | 2.80 | 0.13 | 23.63 | 75.8 | - |
| LFQ | 65,536 | 100.0% | 2.88 | 0.13 | 23.60 | 77.2 | - |
| VQGAN-LC | 65,536 | 100.0% | 2.40 | 0.13 | 23.98 | 77.3 | - |
| SimVQ (ours) | 1,024 | 100.0% | 3.67 | 0.16 | 22.34 | 70.8 | huggingface |
| SimVQ (ours) | 8,192 | 100.0% | 2.98 | 0.14 | 23.23 | 74.7 | huggingface |
| SimVQ (ours) | 65,536 | 100.0% | 2.24 | 0.12 | 24.15 | 78.4 | huggingface |
| SimVQ (ours) | 262,144 | 100.0% | 1.99 | 0.11 | 24.68 | 80.3 | huggingface |

Table 2. Reconstruction performance of different tokenizers on the LibriTTS test-clean/test-other sets (values reported as clean/other).

| Method | Bandwidth | Codebook Utilization | UTMOS ↑ | PESQ ↑ | STOI ↑ | V/UV F1 ↑ | Checkpoint |
|---|---|---|---|---|---|---|---|
| Encodec | 3.0 kbps | -/- | 2.31/2.09 | 2.05/2.05 | 0.90/0.88 | 0.92/0.89 | - |
| Vocos | 3.0 kbps | -/- | 3.53/3.06 | 2.40/2.19 | 0.92/0.90 | 0.94/0.91 | - |
| SpeechTokenizer | 3.0 kbps | -/- | 3.56/3.02 | 1.93/1.74 | 0.88/0.84 | 0.93/0.89 | - |
| WavTokenizer | 0.9 kbps | 100%/100% | 3.74/3.43 | 2.01/2.26 | 0.89/0.89 | 0.92/0.92 | - |
| WavTokenizer | 1.05 kbps | 27%/- | 4.00/- | 2.36/- | 0.81/- | 0.94/- | - |
| SimVQ (ours) | 0.9 kbps | 100.0%/100.0% | 4.00/3.51 | 2.33/2.08 | 0.91/0.88 | 0.94/0.91 | huggingface |
| SimVQ (ours) | 0.975 kbps | 99.4%/99.4% | 4.03/3.52 | 2.42/2.15 | 0.92/0.88 | 0.94/0.92 | huggingface |
| SimVQ (ours) | 1.2 kbps | 99.4%/99.0% | 4.03/3.52 | 2.54/2.26 | 0.93/0.90 | 0.94/0.92 | huggingface |
| SimVQ (ours) | 1.35 kbps | 95.6%/94.7% | 4.03/3.53 | 2.61/2.31 | 0.93/0.90 | 0.95/0.93 | huggingface |
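
In both tables, codebook utilization is the fraction of codes selected at least once over the evaluation set. A minimal sketch of computing it from the indices a quantizer emits (the helper name is ours, not part of the repo):

```python
import torch

def codebook_utilization(indices: torch.Tensor, num_codes: int) -> float:
    """Fraction of codebook entries hit at least once over an eval set."""
    return torch.unique(indices.flatten()).numel() / num_codes

# Usage: collect the code indices over the whole validation set, then e.g.
#   util = codebook_utilization(torch.cat(all_indices), 65536) * 100
```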

Implementations

Installation

  • Dependencies: pip install -r requirements.txt
  • Extra dependencies for audio evaluation: pip install -r requirements_audio.txt
  • Datasets
imagenet
├── train/
│   ├── n01440764/
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   └── ...
│   ├── n01443537/
│   └── ...
└── val/
    └── ...
LibriTTS
├── train-clean-100/
│   ├── 103/
│   │   ├── 1241/
│   │   │   ├── 103_1241_000000_000001.wav
│   │   │   └── ...
│   │   └── ...
│   ├── 1034/
│   └── ...
├── train-clean-360/
│   └── ...
├── train-other-500/
│   └── ...
├── dev-other/
│   └── ...
├── dev-clean/
│   └── ...
├── test-other/
│   └── ...
└── test-clean/
    └── ...

Training Scripts

  • Image Tokenizer Training
XDG_CACHE_HOME="dataset/ILSVRC2012" python main.py fit --config configs/imagenet_simvq_128_B.yaml
  • Audio Tokenizer Training

You can generate the manifest .txt files with generate_manifest.py; a rough sketch of what such a script does follows the command below.

DATA_ROOT="/data3/yongxinzhu/libritts/LibriTTS" CUDA_VISIBLE_DEVICES=4,5,6,7 python main.py fit --config configs/libritts_24khz.yaml
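
For reference, a hypothetical stand-in for generate_manifest.py is sketched below, assuming a one-wav-path-per-line manifest; the repo's script is authoritative and may use a different format.

```python
from pathlib import Path

def write_manifest(root: str, split: str, out_txt: str) -> None:
    # Recursively list all .wav files of one LibriTTS split and write
    # one absolute path per line (the format is an assumption).
    wavs = sorted(Path(root, split).rglob("*.wav"))
    with open(out_txt, "w") as f:
        for wav in wavs:
            f.write(f"{wav.resolve()}\n")

if __name__ == "__main__":
    write_manifest("dataset/libritts/LibriTTS", "train-clean-100",
                   "train-clean-100.txt")
```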

Note: Some users have reported NaN issues when training SimVQ on audio data. This appears to occur sporadically, but we have found that learning-rate warmup helps mitigate it.
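
One common way to add such a warmup in PyTorch is sketched below; the repo configures its schedule through the YAML configs, so the optimizer choice and warmup length here are placeholders.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the tokenizer
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
warmup_steps = 1000            # placeholder length
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps)
)
# In the training loop, call sched.step() after each opt.step() so the
# learning rate ramps linearly from ~0 to the base value.
```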

Evaluation Scripts

  • Image Tokenizer Evaluation
XDG_CACHE_HOME="dataset/ILSVRC2012" python evaluation.py --config_file vq_log/simvq_262k/size128/config.yaml --ckpt_path vq_log/simvq_262k/epoch=49-step=250250.ckpt
  • Audio Tokenizer Evaluation
DATA_ROOT="dataset/libritts" python evaluation_speech.py --config_file vq_audio_log/simvq_262k/1second/config.yaml --ckpt_path vq_audio_log/simvq_262k/epoch=49-step=138600.ckpt

Reconstruction Visualization

Figure 2. Visualization of the SimVQ tokenizer trained at $128 \times 128$ resolution (imagenet_simvq_128_Base version). (a) shows the original images and (b) the reconstructions.

Figure 3. Visualization of the SimVQ tokenizer trained on LibriTTS (libritts_24khz version). (a) shows the originals and (b) the reconstructions.

Acknowledgement

The codebase of SimVQ is adapted from Open-MAGVIT2 and WavTokenizer. Thanks for their wonderful work.
