ThinkSound

🌐 English | 简体中文 | 繁體中文 | Español | Français | 日本語

NeurIPS 2025

arXiv | Online Demo | Hugging Face | ModelScope

If you find this project useful,
a star ⭐ on GitHub would be greatly appreciated!


ThinkSound is a unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning.

PyTorch implementation for multimodal audio generation and editing: generate or edit audio from video, text, and audio, powered by step-by-step reasoning from Multimodal Large Language Models (MLLMs).

📰 News

  • 2025.09.19 🎉 ThinkSound has been accepted to the NeurIPS 2025 Main Conference!
  • 2025.09.01 🔥 Our AudioCoT dataset is now open-sourced and available on Hugging Face!
  • 2025.07.17 🧠 Finetuning enabled: training and finetuning code is now publicly available, along with clear usage instructions to help you customize and extend ThinkSound with your own data.
  • 2025.07.15 📦 Simplified installation and usability: dependencies are published on PyPI for easy cross-platform setup, and Windows .bat scripts automate environment creation and script execution.
  • 2025.07.08 🔧 Major update: the model is now lighter with optimized memory and GPU usage, and supports high-throughput audio generation at scale!
  • 2025.07.01 🔥 Online demo on Hugging Face Spaces and ModelScope for an interactive experience!
  • 2025.07.01 🔥 Released inference scripts and the web interface.
  • 2025.06 🔥 ThinkSound paper released on arXiv!
  • 2025.06 🔥 Online Demo is live - try it now!

🚀 Features

  • Any2Audio: Generate audio from arbitrary modalities — video, text, audio, or their combinations.
  • Video-to-Audio SOTA: Achieves state-of-the-art results on multiple V2A benchmarks.
  • CoT-Driven Reasoning: Chain-of-Thought reasoning for compositional and controllable audio generation via MLLMs.
  • Interactive Object-centric Editing: Refine or edit specific sound events by clicking on visual objects or using text instructions.
  • Unified Framework: One foundation model supports generation, editing, and interactive workflows.

✨ Method Overview

ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning (a conceptual sketch follows the overview figure):

  1. Foley Generation: Generate foundational, semantically and temporally aligned soundscapes from video.
  2. Object-Centric Refinement: Refine or add sounds for user-specified objects via clicks or regions in the video.
  3. Targeted Audio Editing: Modify generated audio using high-level natural language instructions.

ThinkSound Overview
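
To make the staged design concrete, here is a minimal, purely illustrative Python sketch of how the three stages compose. The function names (generate_foley, refine_object, edit_audio) and their signatures are hypothetical placeholders, not the actual ThinkSound API; see the inference scripts in the Quick Start below for the real entry points.

# Purely illustrative pseudocode of the three-stage pipeline.
# All names and signatures here are hypothetical, not the real ThinkSound API.
from dataclasses import dataclass

@dataclass
class AudioTrack:
    samples: list  # placeholder for a generated waveform

def generate_foley(video_path: str, cot_plan: str) -> AudioTrack:
    # Stage 1: produce a semantically and temporally aligned base soundscape.
    return AudioTrack(samples=[])

def refine_object(track: AudioTrack, region: tuple, cot_plan: str) -> AudioTrack:
    # Stage 2: refine or add sounds for a user-selected object or region.
    return track

def edit_audio(track: AudioTrack, instruction: str) -> AudioTrack:
    # Stage 3: apply a high-level natural-language edit to the result.
    return track

track = generate_foley("demo.mp4", cot_plan="footsteps on gravel, then a door slam")
track = refine_object(track, region=(120, 80, 200, 160), cot_plan="emphasize the door slam")
track = edit_audio(track, instruction="make the footsteps softer")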


⚡ Quick Start

Environment Preparation:

git clone https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound
conda create -n thinksound python=3.10
conda activate thinksound
pip install thinksound
conda install -y -c conda-forge 'ffmpeg<7'
# Download the pretrained weights from https://huggingface.co/liuhuadai/ThinkSound into the directory ckpts/
# Model weights can also be downloaded from https://www.modelscope.cn/models/iic/ThinkSound
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
# To improve inference and training speed, you may optionally install a FlashAttention backend compatible with your system and PyTorch version.

Windows Tip:
Windows users can simply run setup_windows.bat (or double-click it) to automatically create the conda environment, install all dependencies (including FFmpeg), and download the pretrained model — no manual setup required.
Make sure conda and git are installed and available in your system PATH before running the script.
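
After installation (on any platform), a quick sanity check can confirm that the key pieces are in place. The snippet below only verifies that the thinksound package, PyTorch, and FFmpeg are visible; it assumes the import name matches the PyPI package name and does not load any model weights.

# sanity_check.py - quick post-install check (no model weights are loaded)
import importlib.util
import shutil

# Assumes the import name matches the PyPI package name "thinksound".
for pkg in ("thinksound", "torch"):
    status = "found" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"package {pkg}: {status}")

print("ffmpeg on PATH:", shutil.which("ffmpeg") or "MISSING")

try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch is not installed")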

▶️ Run the Demo

Linux/macOS

chmod +x scripts/demo.sh
./scripts/demo.sh <path-to-your-demo-video> <title> <CoT description> [use-half]

Windows

You can use the provided .bat script instead:

.\scripts\demo.bat <path-to-your-demo-video> <title> <CoT description> [use-half]

Note:

  • <path-to-your-demo-video>: Path to a single video file.
  • [use-half] (optional): Add use-half at the end to enable half-precision feature extraction.
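
If you prefer to drive the demo from Python, for example to loop over several clips, the shell script can also be invoked with subprocess. The video path, title, and CoT description below are illustrative placeholders; the script itself is the one shipped in scripts/.

# run_demo.py - call scripts/demo.sh from Python (Linux/macOS)
import subprocess

video = "examples/my_clip.mp4"  # placeholder: path to your own video
title = "Door slam in a hallway"  # placeholder title
cot = "A heavy wooden door slams shut, followed by a short echo."  # placeholder CoT description

# Drop "use-half" from the argument list to run with full precision.
subprocess.run(
    ["bash", "scripts/demo.sh", video, title, cot, "use-half"],
    check=True,
)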

📦 Batch Inference

Linux/macOS

chmod +x scripts/eval_batch.sh
./scripts/eval_batch.sh <video_path> <csv_path> <save_path (optional)> [use-half]

Windows

Use the equivalent .bat script:

.\scripts\eval_batch.bat <video_path> <csv_path> <save_path (optional)> [use-half]

Note:

  • <video_path>: Path to the root directory containing all .mp4 videos to be processed (all videos must be of equal duration).
  • <csv_path>: A CSV file with text prompts for each video (see demo_test.csv for format).
  • <save_path> (optional): Where to save generated audio. Defaults to results/features.
  • [use-half] (optional): Add use-half at the end to enable half-precision feature extraction.
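
The snippet below shows one way to assemble the prompt CSV and launch the batch script from Python. The column names (id, caption) are assumptions used for illustration only; check demo_test.csv in the repository for the exact schema that eval_batch.sh expects.

# batch_prepare.py - build a prompt CSV and run batch inference (Linux/macOS)
# NOTE: the CSV column names below are assumed; match them to demo_test.csv.
import csv
import subprocess

rows = [
    {"id": "clip_001", "caption": "Rain hitting a tin roof, with distant thunder."},
    {"id": "clip_002", "caption": "A cat meows twice, then starts purring."},
]

with open("prompts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "caption"])
    writer.writeheader()
    writer.writerows(rows)

# videos/ is a placeholder for the directory containing your .mp4 files.
subprocess.run(
    ["bash", "scripts/eval_batch.sh", "videos/", "prompts.csv", "results/features", "use-half"],
    check=True,
)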

Web Interface Usage

For an interactive experience, launch the Gradio web interface:

python app.py

🏋️ Train the Model

See Training.md


📝 TODO & Future Plans

    • Release a more powerful foundation model covering multiple domains to enable more engaging and immersive Foley creation
    • Add support for additional modalities and downstream tasks
    • Release models at different scales
    • Open-source AudioCoT dataset and automated pipeline
    • Release training scripts for ThinkSound models
    • A beginner-friendly Windows quick-start README

📄 License

This project is released under the Apache 2.0 License.

Note: The code, models, and dataset are for research and educational purposes only. Commercial use is NOT permitted. For commercial licensing, please contact the authors.

📦 Third-Party Components

  • Stable Audio Open VAE (by Stability AI): This repository includes a fine-tuned VAE from Stable Audio Open, licensed under the Stability AI Community License. Commercial use and redistribution require prior permission from Stability AI.

  • 📘 All other code and models are released under the Apache License 2.0.


Acknowledgements

Many thanks to:

  • stable-audio-tools (by Stability AI): For providing an easy-to-use framework for audio generation, as well as the VAE module and weights.
  • MMAudio: For the implementation of the MM-DiT backbone in the audio domain.

📖 Citation

If you find ThinkSound useful in your research or work, please cite our paper:

@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
 title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing}, 
 author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
 year={2025},
 eprint={2506.21448},
 archivePrefix={arXiv},
 primaryClass={eess.AS},
 url={https://arxiv.org/abs/2506.21448}, 
}

📬 Contact

✨ Feel free to open an issue or contact us via email (liuhuadai@zju.edu.cn) if you have any questions or suggestions!
