Prometheus-AI-3team/NetfLips

🎬 NetfLips

Unit-based Audiovisual Translation for Korean
Text-free Direct Speech Translation with Synchronized Lip Movement


📋 Overview

NetfLips takes an English video as input and generates a Korean-translated video in which the speech and lip movements are synchronized.

✨ Key Features

  • 🎯 Unit-based Translation: models speech and visual information directly as a shared unit representation, with no intermediate text representation
  • 🔊 Speech & Visual Sync: aligns audio and video at the unit level in a shared feature space for robust translation
  • 🇰🇷 Korean Fine-tuning: fine-tuning that adds Korean, which the base model did not previously support
  • 💬 Natural Synthesis: natural-sounding speech synthesis and lip-sync generation

🎯 Keywords

#Unit-based Audiovisual Translation #Text-free Direct Speech Translation #Lip Sync #Speech Translation


🎥 Demo

🌐 Demo Link

πŸ—οΈ Architecture

NetfLipsλŠ” 3단계 νŒŒμ΄ν”„λΌμΈμœΌλ‘œ κ΅¬μ„±λ©λ‹ˆλ‹€:

1️⃣ Unit Extraction

  • FLAC 볡원 (wav)
  • νŠΉμ§• μΆ”μΆœ (Mel Spectrogram)
  • K-means λΆ„λ₯˜
  • μ •μˆ˜ sequence둜 λ³€ν™˜
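
The last two steps above (K-means classification and conversion to an integer sequence) amount to vector quantization: each feature frame is replaced by the index of its nearest cluster centroid. Below is a minimal sketch in plain Python, assuming the centroids have already been fitted. The function name `frames_to_units` and the toy 2-D vectors are hypothetical; they stand in for real Mel-spectrogram frames and are not the repository's code.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def frames_to_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    units = []
    for frame in frames:
        # Pick the centroid index with the smallest Euclidean distance.
        nearest = min(range(len(centroids)), key=lambda k: dist(frame, centroids[k]))
        units.append(nearest)
    return units


# Toy data (hypothetical): three 2-D "frames" and two pre-fitted centroids.
centroids = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (0.8, 0.7)]

units = frames_to_units(frames, centroids)
print(units)  # [0, 1, 1]
```

In practice the frames would be much higher-dimensional and the number of centroids far larger, but the resulting integer sequence is what the translation stage consumes.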

2️⃣ Unit Translation

  • Base Model: AV2AV (Choi, J., et al., 2024)
  • Translation: English units → Korean units
  • Framework: unit-sequence training built on the Fairseq toolkit
  • Backbone: the large-scale pre-trained model mBART
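
Because the translation model is trained on unit sequences rather than text, the parallel data can be laid out like an ordinary text translation corpus: one utterance per line, units written as space-separated integers, with source and target files aligned line by line. The sketch below assumes the common `<split>.<lang>` file naming that `fairseq-preprocess` accepts through `--trainpref`, `--source-lang`, and `--target-lang`; the helper `write_parallel_units` and the sample unit IDs are hypothetical, and the repository's own data layout may differ.

```python
import tempfile
from pathlib import Path


def write_parallel_units(pairs, out_dir, split="train", src="en", tgt="ko"):
    """Write (source_units, target_units) pairs as line-aligned text files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    src_path = out_dir / f"{split}.{src}"
    tgt_path = out_dir / f"{split}.{tgt}"
    with src_path.open("w", encoding="utf-8") as fs, tgt_path.open("w", encoding="utf-8") as ft:
        for src_units, tgt_units in pairs:
            # One utterance per line; units are space-separated integers.
            fs.write(" ".join(map(str, src_units)) + "\n")
            ft.write(" ".join(map(str, tgt_units)) + "\n")
    return src_path, tgt_path


# Hypothetical unit sequences for two English/Korean utterance pairs.
pairs = [([12, 7, 7, 3], [45, 45, 9]), ([5, 1], [8, 2, 2])]

with tempfile.TemporaryDirectory() as tmp:
    src_path, tgt_path = write_parallel_units(pairs, tmp)
    src_lines = src_path.read_text(encoding="utf-8").splitlines()
    tgt_lines = tgt_path.read_text(encoding="utf-8").splitlines()

print(src_lines)  # ['12 7 7 3', '5 1']
print(tgt_lines)  # ['45 45 9', '8 2 2']
```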

3️⃣ AV Generation

  • Unit → Audio conversion
  • Uses Korean units and speaker embeddings
  • Speech Resynthesis

📊 Dataset

This project was trained on the following datasets:

| Dataset | Description | Size |
| --- | --- | --- |
| Zeroth Korean ASR | Korean speech recognition data | 12,245 sentences |
| AIHub Ko-X interpretation and translation speech | Korean-English (US) parallel speech data | 169,488 sentences |

🚀 Getting Started

Prerequisites

# 1. Clone the repository
git clone https://github.com/Prometheus-AI-3team/NetfLips.git
cd NetfLips
# 2. Update the fairseq submodule
git submodule init
git submodule update
# 3. Create the base Conda environment
conda env create -f environment.yml
conda activate unit2a
# 4. Downgrade pip (avoids metadata errors)
pip install "pip<24.1"
# 5. Install PyTorch (built for CUDA 11.7)
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117

Installation

# 6. Install the remaining libraries
pip install -r requirements.txt
# 7. Install fairseq
cd av2av-main/fairseq
pip install -e .
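
After the steps above, it can be useful to confirm that the pinned packages actually landed in the active environment. The following is a small sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package list simply mirrors the names installed above. It reports `MISSING` for anything not found rather than failing.

```python
from importlib import metadata


def installed_versions(packages=("torch", "torchvision", "torchaudio", "fairseq")):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


report = installed_versions()
for name, version in report.items():
    print(f"{name}: {version or 'MISSING'}")
```

If the PyTorch install succeeded, the reported `torch` version should match the pin used in the install command.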

💻 Usage

Checkpoints

| Model | Name | Link |
| --- | --- | --- |
| AV2Unit | mavhubert_large_noise.pt | download |
| Unit2Unit | utut_sts_ft.pt | download |
| Unit2AV | unit_av_renderer_withKO.pt | download |

End-to-End Inference

PYTHONPATH=fairseq python inference.py \
 --in-vid-path /path/to/input.mp4 \
 --out-vid-path /path/to/output.mp4 \
 --src-lang en --tgt-lang ko \
 --av2unit-path /path/to/mavhubert_large_noise.pt \
 --utut-path /path/to/utut_sts_ft.pt \
 --unit2av-path /path/to/unit_av_renderer_withKO.pt
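
When running inference over many videos, it may be convenient to assemble the same command from Python. The sketch below only builds and prints the argument list, using exactly the flags shown above; `build_inference_command` is a hypothetical helper, not part of the repository. The script name is a parameter because the command above uses `inference.py` while the project structure lists `inference_av2av.py`, so adjust it to whichever file exists in your checkout.

```python
import shlex


def build_inference_command(in_vid, out_vid, av2unit, utut, unit2av,
                            src_lang="en", tgt_lang="ko", script="inference.py"):
    """Assemble the argument list for the end-to-end inference script."""
    return [
        "python", script,
        "--in-vid-path", str(in_vid),
        "--out-vid-path", str(out_vid),
        "--src-lang", src_lang,
        "--tgt-lang", tgt_lang,
        "--av2unit-path", str(av2unit),
        "--utut-path", str(utut),
        "--unit2av-path", str(unit2av),
    ]


cmd = build_inference_command(
    "/path/to/input.mp4",
    "/path/to/output.mp4",
    "/path/to/mavhubert_large_noise.pt",
    "/path/to/utut_sts_ft.pt",
    "/path/to/unit_av_renderer_withKO.pt",
)
print(shlex.join(cmd))
```

To execute it, the list can be passed to `subprocess.run` with `PYTHONPATH=fairseq` added to the environment, for example `env={**os.environ, "PYTHONPATH": "fairseq"}`.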

Training & Inference

For training and inference of each module (av2unit, unit2unit, unit2av), see the README.md in that module's directory.

πŸ“ Project Structure

NetfLips/
β”œβ”€β”€ av2unit/ # Audio-Visual to Unit Extraction
β”‚ β”œβ”€β”€ avhubert/ # Feature extraction using AV-HuBERT
β”‚ └── inference.py # Unit extraction inference script
β”œβ”€β”€ unit2unit/ # Unit to Unit Translation
β”‚ β”œβ”€β”€ utut_pretrain/ # Pre-training modules
β”‚ β”œβ”€β”€ utut_finetune/ # Fine-tuning modules
β”‚ └── inference.py # Translation inference script
β”œβ”€β”€ unit2av/ # Unit to Audio-Visual Generation
β”‚ β”œβ”€β”€ model.py # Unit2AV model definition
β”‚ β”œβ”€β”€ train_unit2a.py # Training script for Unit2Audio
β”‚ └── inference_unit2av.py # Inference scripts
β”œβ”€β”€ fairseq/ # Fairseq Toolkit (Submodule)
β”œβ”€β”€ scripts/ # Utility Scripts for Data Preparation
β”œβ”€β”€ inference_av2av.py # Main End-to-End Inference Script
β”œβ”€β”€ environment.yml # Conda Environment Configuration
└── requirements.txt # Python Dependencies

🔬 Methodology

Data Preprocessing

  • Decode FLAC files and convert them to wav
  • Mel Spectrogram-based feature extraction
  • Unit classification via K-means clustering

Model Training

  • mBART-based sequence-to-sequence training
  • Built on the Fairseq toolkit
  • Optimized for unit-to-unit translation

Audio-Visual Generation

  • Speech resynthesis from Korean units
  • Natural-sounding speech generation using speaker embeddings
  • Generation of lip-synced video

πŸ› οΈ Technical Details

Base Model

  • AV2AV: Audio-Visual to Audio-Visual translation model
  • Reference: Choi, J., et al., 2024

Fine-tuning Strategy

  • ν•œκ΅­μ–΄ 미지원 문제 해결을 μœ„ν•œ Fine-tuning
  • 병렬 ν•œ-영 μŒμ„± 데이터 ν™œμš©
  • Unit-level translation ν•™μŠ΅

👥 Team Members from Prometheus (AI club)

| Name | Batch |
| --- | --- |
| 장지수 | 6th |
| 유지혜 | 6th |
| 신규철 | 8th |
| 이가연 | 8th |

πŸ“ Citation

@misc{netflips2024,
 title={NetfLips: Unit-based Audiovisual Translation for Korean},
 author={μž₯μ§€μˆ˜, μœ μ§€ν˜œ, μ‹ κ·œμ² , 이가연},
 year={2024}
}

References

  • Choi, J., et al. (2024). AV2AV: Audio-Visual to Audio-Visual Translation

License

This project is distributed under the MIT License. See the LICENSE file for details.

Acknowledgments

This repository is built upon AV2AV and Fairseq. We thank the authors for open-sourcing these projects.

About

[2025-2] Textless Direct Audio-Visual Speech Translation
