Turn any song into a karaoke track. Paste a YouTube URL, type a song name, or drop in a local file — Jiyajale strips the vocals using Meta's Demucs AI and hands you a studio-quality instrumental.
Built for my mom's YouTube singing channel so she can sing over clean instrumentals of her favorite ghazals and Bollywood songs.
Jiyajale runs a two-stage pipeline:
- Acquire — download from YouTube via yt-dlp (search by name or URL), or convert a local file (iTunes/ALAC) via ffmpeg
- Separate — run Demucs htdemucs_ft (Meta's fine-tuned 4-model ensemble) to split the audio into no_vocals.wav and vocals.wav
Everything stays lossless WAV throughout — no lossy compression at any stage.
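The two stages above can be sketched as command lines assembled in Python. This is a minimal sketch, not the project's actual separate.sh or processing.py: the function names and the output template are hypothetical, while the flags used (`-x`, `--audio-format`, `-o` for yt-dlp; `-i` for ffmpeg; `-n`, `--two-stems`, `-o` for Demucs) are standard options of those tools.

```python
from pathlib import Path


def acquire_cmd(target: str, song_dir: Path) -> list[str]:
    """yt-dlp command that saves the audio as <song_dir>/original.wav.

    `target` is either a URL or a "ytsearch1:<query>" search string.
    """
    return [
        "yt-dlp",
        "-x",                     # extract audio only
        "--audio-format", "wav",  # have yt-dlp convert the result to WAV
        "-o", str(song_dir / "original.%(ext)s"),
        target,
    ]


def convert_cmd(local_file: Path, song_dir: Path) -> list[str]:
    """ffmpeg command that converts a local file (e.g. ALAC) to WAV."""
    return ["ffmpeg", "-i", str(local_file), str(song_dir / "original.wav")]


def separate_cmd(wav: Path, song_dir: Path, model: str = "htdemucs_ft") -> list[str]:
    """Demucs command that writes vocals.wav and no_vocals.wav under stems/."""
    return [
        "demucs",
        "-n", model,
        "--two-stems", "vocals",  # vocals + everything else
        "-o", str(song_dir / "stems"),
        str(wav),
    ]


if __name__ == "__main__":
    song_dir = Path("output/kisi-ranjish")
    print(acquire_cmd("ytsearch1:kisi ranjish", song_dir))
    print(separate_cmd(song_dir / "original.wav", song_dir))
```

Each list can be passed to `subprocess.run(cmd, check=True)`; building the commands separately from running them keeps the sketch easy to inspect without downloading anything.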
output/kisi-ranjish/
  original.wav                   — downloaded/converted source
  stems/htdemucs_ft/original/
    no_vocals.wav                — instrumental (sing over this)
    vocals.wav                   — isolated vocals (use as reference)
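For scripts that consume these files, the layout above can be expressed as a small path helper. This is an illustrative sketch: the function name `stem_paths` is hypothetical, and the keys reuse the track names that appear in the API (original, instrumental, vocals).

```python
from pathlib import Path


def stem_paths(output_root: Path, song_name: str, model: str = "htdemucs_ft") -> dict[str, Path]:
    """Map track names to the files in the output layout shown above."""
    song_dir = output_root / song_name
    # Demucs nests stems under <model>/<input file stem>/, here "original".
    stems_dir = song_dir / "stems" / model / "original"
    return {
        "original": song_dir / "original.wav",
        "instrumental": stems_dir / "no_vocals.wav",
        "vocals": stems_dir / "vocals.wav",
    }


if __name__ == "__main__":
    for track, path in stem_paths(Path("output"), "kisi-ranjish").items():
        print(f"{track}: {path}")
```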
Separation runs faster than real time on an M4 Mac (Apple Silicon MPS): a measured 4½-minute song took ~3 minutes end-to-end with htdemucs_ft. CPU-only runs (the Docker/Railway build) are several times slower.
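As a quick sanity check on that measurement, the ratio of processing time to song length can be worked out directly. The helper name below is hypothetical; the inputs are the two figures quoted above.

```python
def realtime_factor(processing_seconds: float, song_seconds: float) -> float:
    """Processing time as a fraction of song length (below 1.0 is faster than real time)."""
    return processing_seconds / song_seconds


# ~3 minutes of processing for a 4.5-minute song
factor = realtime_factor(3 * 60, 4.5 * 60)
print(f"{factor:.2f}x song length")  # prints "0.67x song length"
```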
┌─────────────────────────────────────────────────────────┐
│                          Input                          │
│  YouTube URL / Search Query      Local File (ALAC/MP3)  │
└────────────────┬────────────────────────────┬───────────┘
                 │                            │
          yt-dlp download              ffmpeg convert
                 │                            │
                 └────────────┬───────────────┘
                              │
                        original.wav
                              │
                     Demucs htdemucs_ft
                     (4 neural networks)
                  MPS GPU on Apple Silicon
                              │
               ┌──────────────┴──────────────┐
               │                             │
         no_vocals.wav                  vocals.wav
        (Instrumental)               (Isolated vocals)
Phase 2 — Web UI (fully implemented, ships in Docker):
React (Vite) → FastAPI → Demucs pipeline
↓
WebSocket progress updates
Pitch-shift export (librosa)
Song library browser
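The pitch-shift export works in semitones, and the relationship between a semitone offset and the resulting frequency ratio is simple equal-temperament arithmetic. The sketch below shows only that math; the function name is hypothetical, and how pitch.py actually calls librosa is not shown in this README (librosa exposes the operation as `librosa.effects.pitch_shift`, which takes an `n_steps` value in semitones).

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


if __name__ == "__main__":
    for steps in (-2, 0, 2, 12):
        print(f"{steps:+d} semitones -> x{semitones_to_ratio(steps):.4f}")
```

Shifting down two semitones, for example, multiplies every frequency by about 0.89, which is often enough to bring a song into a more comfortable singing range.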
| Layer | Technology |
|---|---|
| AI separation | Demucs htdemucs_ft — fine-tuned 4-model ensemble |
| GPU acceleration | PyTorch 2.10 + MPS (Apple Silicon) / CPU (Docker/Railway) |
| YouTube download | yt-dlp — search-by-name or direct URL |
| Audio conversion | ffmpeg |
| Pitch shifting | librosa |
| Backend API | FastAPI + WebSocket progress streaming |
| Frontend | React 19, Vite, Tone.js |
| Container | Docker (CPU PyTorch build for deployment) |
| Deploy target | Railway |
- Python 3.13+
- ffmpeg — brew install ffmpeg
- ~2 GB disk for PyTorch + Demucs models (downloaded on first run)
    git clone https://github.com/upneja/jiyajale.git
    cd jiyajale
    python3.13 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

    # By song name (yt-dlp searches YouTube for the top result)
    ./separate.sh "Chupke Chupke Raat Din Ghulam Ali" "chupke-chupke"
    # By YouTube URL
    ./separate.sh "https://youtube.com/watch?v=..." "song-name"

    # Backend (from repo root, venv active)
    uvicorn backend.main:app --reload --port 8000
    # Frontend (separate terminal)
    cd frontend
    npm install
    npm run dev    # → http://localhost:5173

    docker build -t jiyajale .
    docker run -p 8000:8000 jiyajale    # → http://localhost:8000
The Docker build uses a CPU-only PyTorch image to keep the container lean. Set DEMUCS_MODEL=htdemucs (env var) for the faster single-model variant if processing time is a constraint.
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/health | Health check |
| GET | /api/songs | List all processed songs |
| POST | /api/process | Submit a song (form: query or file upload + optional song_name) |
| GET | /api/jobs/{song_name} | Poll processing status |
| GET | /api/audio/{song_name}/{track} | Stream audio (track: original, instrumental, or vocals) |
| POST | /api/pitch-shift | Export pitch-shifted track (form: song_name, track, semitones) |
| WS | /ws/status/{song_name} | WebSocket: real-time progress during separation |
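A client that polls the job endpoint instead of using the WebSocket could look roughly like this. It is a sketch under stated assumptions: the response schema is not documented here, so the `status` field and its terminal values `done` and `error` are hypothetical, as are the helper names. Only the endpoint path and the local port come from this README.

```python
import json
import time
from typing import Callable
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # local Docker / uvicorn address


def job_url(song_name: str, base: str = BASE_URL) -> str:
    """URL of the polling endpoint for one song."""
    return f"{base}/api/jobs/{quote(song_name, safe='')}"


def http_get_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def wait_for_job(
    song_name: str,
    fetch: Callable[[str], dict] = http_get_json,
    interval: float = 2.0,
    max_polls: int = 300,
) -> dict:
    """Poll until the job reports a terminal status, then return the last response."""
    for _ in range(max_polls):
        job = fetch(job_url(song_name))
        # ASSUMPTION: the body has a "status" field ending in "done" or "error".
        if job.get("status") in {"done", "error"}:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {song_name!r} did not finish after {max_polls} polls")
```

Passing `fetch` as a parameter keeps the loop testable without a running server; for live progress the WebSocket endpoint avoids polling altogether.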
htdemucs_ft over htdemucs — The fine-tuned variant runs 4 neural network passes instead of 1, which is slower but produces audibly cleaner vocal removal. The residual bleed on the standard model was noticeable when singing over it; with _ft it is not.
WAV throughout — Lossless at every stage. Lossy intermediate formats would degrade the separation quality and the final output.
--two-stems vocals — Only splits into vocals + everything-else. The full four-stem output (drums / bass / vocals / other) isn't needed here and would be slower.
ytsearch1: prefix — Lets yt-dlp accept both a raw search query and a direct URL through the same code path.
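The single code path described for the ytsearch1: prefix can be sketched as a small normalizer. The function name and the URL check are illustrative assumptions; `ytsearch1:` itself is yt-dlp's prefix for "first YouTube search result".

```python
def to_ytdlp_target(query_or_url: str) -> str:
    """Return a string yt-dlp can take directly: a URL as-is, or a search for the top hit."""
    text = query_or_url.strip()
    # ASSUMPTION: anything that does not start with http(s):// is a search query.
    if text.startswith(("http://", "https://")):
        return text
    return f"ytsearch1:{text}"


if __name__ == "__main__":
    print(to_ytdlp_target("Chupke Chupke Raat Din Ghulam Ali"))
    print(to_ytdlp_target("https://youtube.com/watch?v=..."))
```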
MPS on Apple Silicon, CPU in Docker — The local workflow uses PyTorch MPS for GPU acceleration. The Docker build installs the smaller CPU-only wheel to keep image size manageable and avoid CUDA dependencies on the deploy target.
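Choosing between MPS and CPU at run time might look like the sketch below. The helper name is hypothetical and this is not necessarily how the project selects its device; `torch.backends.mps.is_available()` is PyTorch's standard check, and the result can be handed to Demucs through its `-d` device option.

```python
def pick_device() -> str:
    """Return "mps" when PyTorch can use the Apple Silicon GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: nothing to accelerate with.
        return "cpu"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f"Running separation on: {pick_device()}")
```

In the CPU-only Docker image the check simply returns "cpu", so the same code works in both environments.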
jiyajale/
├── separate.sh           — CLI entry point (download + separate)
├── requirements.txt      — Python dependencies (pip freeze)
├── Dockerfile            — Multi-stage build: Python + Node + ffmpeg
├── railway.json          — Railway deploy config
├── backend/
│   ├── main.py           — FastAPI app, endpoints, WebSocket
│   ├── processing.py     — yt-dlp download + Demucs separation pipeline
│   ├── pitch.py          — Pitch-shift via librosa
│   └── test_*.py         — pytest test suite
├── frontend/
│   ├── src/
│   │   ├── App.jsx
│   │   └── components/
│   │       ├── SongInput.jsx    — URL / search / file upload
│   │       └── AudioPlayer.jsx  — Playback with pitch slider
│   └── package.json
├── docs/plans/           — Design docs and implementation plans
└── PROCESS.md            — Technical one-pager
Built with Claude Code: the CLI tool, web UI, and Railway deploy came together in a single evening session, with iteration in the months since. See PROCESS.md for technical details.
MIT