ThinkSound

🌐 English | 简体中文 | 繁體中文 | Español | Français | 日本語

NeurIPS 2025

arXiv | Online Demo | Hugging Face | ModelScope

If you find this project useful,
a star ⭐ on GitHub would be greatly appreciated!


ThinkSound is a unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning.

PyTorch implementation for multimodal audio generation and editing: generate or edit audio from video, text, and audio, powered by step-by-step reasoning from Multimodal Large Language Models (MLLMs).

📰 News

  • 2025.09.19 🎉 ThinkSound has been accepted to the NeurIPS 2025 Main Conference!
  • 2025.09.01 🔥 Our AudioCoT dataset is now open-sourced and available on Hugging Face!
  • 2025.07.17 🧠 Finetuning enabled: training and finetuning code is now publicly available, along with clear usage instructions to help you customize and extend ThinkSound with your own data.
  • 2025.07.15 📦 Simplified installation and usability: dependencies are published on PyPI for easy cross-platform setup, and Windows .bat scripts automate environment creation and script execution.
  • 2025.07.08 🔧 Major update: the model is now lighter with optimized memory and GPU usage, and supports high-throughput audio generation at scale!
  • 2025.07.01 🔥 Online demo on Hugging Face Spaces and ModelScope for an interactive experience!
  • 2025.07.01 🔥 Released inference scripts and the web interface.
  • 2025.06 🔥 ThinkSound paper released on arXiv!
  • 2025.06 🔥 Online Demo is live - try it now!

🚀 Features

  • Any2Audio: Generate audio from arbitrary modalities — video, text, audio, or their combinations.
  • Video-to-Audio SOTA: Achieves state-of-the-art results on multiple V2A benchmarks.
  • CoT-Driven Reasoning: Chain-of-Thought reasoning for compositional and controllable audio generation via MLLMs.
  • Interactive Object-centric Editing: Refine or edit specific sound events by clicking on visual objects or using text instructions.
  • Unified Framework: One foundation model supports generation, editing, and interactive workflows.

✨ Method Overview

ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning (a conceptual sketch follows the overview figure):

  1. Foley Generation: Generate foundational, semantically and temporally aligned soundscapes from video.
  2. Object-Centric Refinement: Refine or add sounds for user-specified objects via clicks or regions in the video.
  3. Targeted Audio Editing: Modify generated audio using high-level natural language instructions.

ThinkSound Overview
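
To make the staged design concrete, here is a minimal, purely illustrative Python sketch of how the three stages compose. The function names (generate_foley, refine_object, edit_audio) and their signatures are hypothetical placeholders, not the actual ThinkSound API; see the inference scripts in the Quick Start below for the real entry points.

# Purely illustrative pseudocode of the three-stage pipeline.
# All names and signatures here are hypothetical, not the real ThinkSound API.
from dataclasses import dataclass

@dataclass
class AudioTrack:
    samples: list  # placeholder for a generated waveform

def generate_foley(video_path: str, cot_plan: str) -> AudioTrack:
    # Stage 1: produce a semantically and temporally aligned base soundscape.
    return AudioTrack(samples=[])

def refine_object(track: AudioTrack, region: tuple, cot_plan: str) -> AudioTrack:
    # Stage 2: refine or add sounds for a user-selected object or region.
    return track

def edit_audio(track: AudioTrack, instruction: str) -> AudioTrack:
    # Stage 3: apply a high-level natural-language edit to the result.
    return track

track = generate_foley("demo.mp4", cot_plan="footsteps on gravel, then a door slam")
track = refine_object(track, region=(120, 80, 200, 160), cot_plan="emphasize the door slam")
track = edit_audio(track, instruction="make the footsteps softer")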


⚡ Quick Start

Environment Preparation:

git clone https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound
conda create -n thinksound python=3.10
conda activate thinksound
pip install thinksound
conda install -y -c conda-forge 'ffmpeg<7'
# Download the pretrained weights from https://huggingface.co/liuhuadai/ThinkSound into the directory ckpts/
# Model weights can also be downloaded from https://www.modelscope.cn/models/iic/ThinkSound
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
# To improve inference and training speed, you may optionally install a FlashAttention backend compatible with your system and PyTorch version.

Windows Tip:
Windows users can simply run setup_windows.bat (or double-click it) to automatically create the conda environment, install all dependencies (including FFmpeg), and download the pretrained model — no manual setup required.
Make sure conda and git are installed and available in your system PATH before running the script.
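
After installation (on any platform), a quick sanity check can confirm that the key pieces are in place. The snippet below only verifies that the thinksound package, PyTorch, and FFmpeg are visible; it assumes the import name matches the PyPI package name and does not load any model weights.

# sanity_check.py - quick post-install check (no model weights are loaded)
import importlib.util
import shutil

# Assumes the import name matches the PyPI package name "thinksound".
for pkg in ("thinksound", "torch"):
    status = "found" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"package {pkg}: {status}")

print("ffmpeg on PATH:", shutil.which("ffmpeg") or "MISSING")

try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch is not installed")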

▶️ Run the Demo

Linux/macOS

chmod +x scripts/demo.sh
./scripts/demo.sh <path-to-your-demo-video> <title> <CoT description> [use-half]

Windows

You can use the provided .bat script instead:

.\scripts\demo.bat <path-to-your-demo-video> <title> <CoT description> [use-half]

Note:

  • <path-to-your-demo-video>: Path to a single video file.
  • [use-half] (optional): Add use-half at the end to enable half-precision feature extraction.
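
If you prefer to drive the demo from Python, for example to loop over several clips, the shell script can also be invoked with subprocess. The video path, title, and CoT description below are illustrative placeholders; the script itself is the one shipped in scripts/.

# run_demo.py - call scripts/demo.sh from Python (Linux/macOS)
import subprocess

video = "examples/my_clip.mp4"  # placeholder: path to your own video
title = "Door slam in a hallway"  # placeholder title
cot = "A heavy wooden door slams shut, followed by a short echo."  # placeholder CoT description

# Drop "use-half" from the argument list to run with full precision.
subprocess.run(
    ["bash", "scripts/demo.sh", video, title, cot, "use-half"],
    check=True,
)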

📦 Batch Inference

Linux/macOS

chmod +x scripts/eval_batch.sh
./scripts/eval_batch.sh <video_path> <csv_path> <save_path (optional)> [use-half]

Windows

Use the equivalent .bat script:

.\scripts\eval_batch.bat <video_path> <csv_path> <save_path (optional)> [use-half]

Note:

  • <video_path>: Path to the root directory containing all .mp4 videos to be processed (all videos must be of equal duration).
  • <csv_path>: A CSV file with text prompts for each video (see demo_test.csv for format).
  • <save_path> (optional): Where to save generated audio. Defaults to results/features.
  • [use-half] (optional): Add use-half at the end to enable half-precision feature extraction.
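
The snippet below shows one way to assemble the prompt CSV and launch the batch script from Python. The column names (id, caption) are assumptions used for illustration only; check demo_test.csv in the repository for the exact schema that eval_batch.sh expects.

# batch_prepare.py - build a prompt CSV and run batch inference (Linux/macOS)
# NOTE: the CSV column names below are assumed; match them to demo_test.csv.
import csv
import subprocess

rows = [
    {"id": "clip_001", "caption": "Rain hitting a tin roof, with distant thunder."},
    {"id": "clip_002", "caption": "A cat meows twice, then starts purring."},
]

with open("prompts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "caption"])
    writer.writeheader()
    writer.writerows(rows)

# videos/ is a placeholder for the directory containing your .mp4 files.
subprocess.run(
    ["bash", "scripts/eval_batch.sh", "videos/", "prompts.csv", "results/features", "use-half"],
    check=True,
)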

Web Interface Usage

For an interactive experience, launch the Gradio web interface:

python app.py

🏋️ Train the Model

See Training.md


📝 TODO & Future Plans

    • Release a more powerful foundation model covering multiple domains to enable more engaging and immersive Foley creation
    • Add support for additional modalities and downstream tasks
    • Release models at different scales
    • Open-source AudioCoT dataset and automated pipeline
    • Release training scripts for ThinkSound models
    • A beginner-friendly Windows quick-start README

📄 License

This project is released under the Apache 2.0 License.

Note: The code, models, and dataset are for research and educational purposes only. Commercial use is NOT permitted. For commercial licensing, please contact the authors.

📦 Third-Party Components

  • Stable Audio Open VAE (by Stability AI): This repository includes a fine-tuned VAE from Stable Audio Open, licensed under the Stability AI Community License. Commercial use and redistribution require prior permission from Stability AI.

  • 📘 All other code and models are released under the Apache License 2.0.


Acknowledgements

Many thanks to:

  • stable-audio-tools (by Stability AI): For providing an easy-to-use framework for audio generation, as well as the VAE module and weights.
  • MMAudio: For the implementation of the MM-DiT backbone in the audio domain.

📖 Citation

If you find ThinkSound useful in your research or work, please cite our paper:

@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
 title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing}, 
 author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
 year={2025},
 eprint={2506.21448},
 archivePrefix={arXiv},
 primaryClass={eess.AS},
 url={https://arxiv.org/abs/2506.21448}, 
}

📬 Contact

✨ Feel free to open an issue or contact us via email (liuhuadai@zju.edu.cn) if you have any questions or suggestions!
