A RAG (Retrieval-Augmented Generation) system for DataCommit podcast episodes. Downloads audio from YouTube, transcribes with Whisper, and enables Q&A using Haystack, ChromaDB, and Gemini.
DataCommit is a Turkish podcast series where data science experts share their career journeys, technical knowledge, and industry experiences. Watch all episodes on YouTube
- Audio Download: yt-dlp
- Speech-to-Text: Local Whisper-Turbo
- Audio Processing: FFmpeg, librosa, K-Means
- Text Correction: Gemini 2.5 Flash Agent
- Backend: Python, Flask
- RAG Framework: Haystack 2.22
- Vector Database: ChromaDB
- LLM: Google Gemini 3 Flash
- Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
- Frontend: HTML, CSS, JavaScript
- Python 3.10+
- Google Gemini API Key
git clone https://github.com/enesmanan/DataCommit.git
cd DataCommit
python -m venv venv
venv\Scripts\activate # Windows
pip install -r requirements.txt
Create a .env file in the project root:
GEMINI_API_KEY=your_gemini_api_key_here
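As an illustration of how a script could pick up this key at startup, here is a minimal standard-library sketch. The helper names (`parse_env_text`, `load_env_file`, `require_gemini_key`) are hypothetical and are not taken from this repository; the project itself may well rely on a dedicated library for this instead.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip whitespace and any quote characters wrapping the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Load variables from a .env file into os.environ without overwriting."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        # Variables already set in the real environment take precedence.
        os.environ.setdefault(key, value)
    return values


def require_gemini_key() -> str:
    """Return GEMINI_API_KEY or raise a clear error if it is missing."""
    key = os.environ.get("GEMINI_API_KEY", "")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env")
    return key
```

Calling `load_env_file()` followed by `require_gemini_key()` early in a script fails fast with a readable message when the key is missing, rather than failing later on the first API call.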
python create_database.py
This will:
- Load all episode transcripts from data/Final/
- Split them into chunks with metadata
- Create embeddings and store in ChromaDB
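The chunking step above can be sketched in plain Python. This is an illustrative approximation, not the code in `create_database.py`: the `Chunk` class, the `split_transcript` function, the word-window sizes, and the metadata keys are all assumptions, and the real script may use Haystack's own splitter components instead.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_transcript(text, episode, chunk_words=200, overlap_words=40):
    """Split a transcript into overlapping word-window chunks with metadata."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        chunks.append(Chunk(
            text=" ".join(window),
            # Hypothetical metadata keys, kept so answers can cite their source.
            metadata={
                "episode": episode,
                "chunk_index": len(chunks),
                "start_word": start,
            },
        ))
        # Stop once the window has reached the end of the transcript.
        if start + chunk_words >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval. Each chunk's text would then be embedded and stored alongside its metadata.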
To rebuild the database, delete the chroma_db/ folder and run python create_database.py again.
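For a cross-platform way to script the delete step, a small helper along these lines could be used. The function name `remove_vector_store` is hypothetical; only the `chroma_db` folder name comes from this README.

```python
import shutil
from pathlib import Path


def remove_vector_store(path: str = "chroma_db") -> bool:
    """Delete the vector store directory; return True if something was removed."""
    store = Path(path)
    if not store.is_dir():
        # Nothing to delete, e.g. the database was never built.
        return False
    shutil.rmtree(store)
    return True
```

After it returns, running `python create_database.py` regenerates the folder from the transcripts.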
python app.py
Open your browser at: http://localhost:5000
For audio preprocessing (YouTube to transcript), see /preprocessing
DataCommit/
├── app.py                 # Flask web server
├── rag_pipeline.py        # RAG pipeline & Gemini integration
├── create_database.py     # Vector database creation
├── data/                  # Episode transcripts
├── chroma_db/             # Vector database (auto-generated)
├── static/                # Frontend assets (CSS, JS, images)
├── templates/             # HTML templates
└── preprocessing/         # Audio-to-text scripts
Enes Fehmi Manan
- LinkedIn: linkedin.com/in/enesfehmimanan
- GitHub: github.com/enesmanan
- Email: enesmanan768@gmail.com
Made with ❤️ for the Turkish Data Science Community