enesmanan/DataCommit

DataCommit RAG Chatbot

A RAG (Retrieval-Augmented Generation) system for DataCommit podcast episodes. It downloads episode audio from YouTube, transcribes it with Whisper, and answers questions about the episodes using Haystack, ChromaDB, and Gemini.

DataCommit is a Turkish podcast series where data science experts share their career journeys, technical knowledge, and industry experiences. 🎙️ Watch all episodes on YouTube

Tech Stack

Audio to Text Pipeline

  • Audio Download: yt-dlp
  • Speech-to-Text: Local Whisper-Turbo
  • Audio Processing: FFmpeg, librosa, K-Means
  • Text Correction: Gemini 2.5 Flash Agent

RAG Pipeline

  • Backend: Python, Flask
  • RAG Framework: Haystack 2.22
  • Vector Database: ChromaDB
  • LLM: Google Gemini 3 Flash
  • Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
  • Frontend: HTML, CSS, JavaScript
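
To make the retrieve-then-generate flow behind this stack concrete, here is a minimal, dependency-free sketch. It is not the project's code: the bag-of-words `embed` function is a toy stand-in for the Sentence Transformers model, the in-memory list stands in for ChromaDB, the sample chunks are invented for illustration, and the final prompt is only built, not sent to Gemini. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (stand-in for a real sentence-embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be handed to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Invented sample chunks, for illustration only.
chunks = [
    "The guest explains how she moved from statistics into machine learning.",
    "This episode covers feature stores and model monitoring in production.",
    "A discussion about career advice for junior data scientists.",
]
question = "What career advice is given to junior data scientists?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
```

In the real pipeline the similarity search would run over dense embeddings stored in ChromaDB, but the shape is the same: embed the question, rank the stored chunks, and pass the best matches to the LLM as context.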

Preprocessing Architecture

Preprocessing architecture


Setup

Prerequisites

1. Clone & Set Up Environment

git clone https://github.com/enesmanan/DataCommit.git
cd DataCommit
python -m venv venv
venv\Scripts\activate # Windows
source venv/bin/activate # macOS/Linux
pip install -r requirements.txt

2. Configure Environment

Create a .env file in the project root:

GEMINI_API_KEY=your_gemini_api_key_here
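
As a rough illustration of how the key could be read at startup, the sketch below parses a `.env` file using only the standard library and fails early if the key is missing or still set to the placeholder. The helper names `load_env_file` and `get_gemini_key` are hypothetical. The project may load the file differently, for example with a dotenv library.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_gemini_key(path: str = ".env") -> str:
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY", "")
    if not key or key == "your_gemini_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is missing or still set to the placeholder")
    return key
```

Keep the `.env` file out of version control so the key is not committed.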

3. Create Vector Database

python create_database.py

This will:

  • Load all episode transcripts from data/Final/
  • Split them into chunks with metadata
  • Create embeddings and store in ChromaDB
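
The first two steps above can be sketched in plain Python. This is only an illustration under assumptions, not the contents of `create_database.py`: the `.txt` extension, the word-based chunk size and overlap, and the metadata field names are all hypothetical, and the embedding and ChromaDB write steps are left out.

```python
from pathlib import Path


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def load_documents(folder: str = "data/Final", pattern: str = "*.txt") -> list[dict]:
    """Read every transcript and attach source metadata to each chunk."""
    docs = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        for i, chunk in enumerate(split_into_chunks(text)):
            docs.append({
                "content": chunk,
                "meta": {"source": path.name, "chunk_index": i},
            })
    return docs
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, and the metadata lets an answer be traced back to its episode.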

To rebuild the database, delete the chroma_db/ folder and run python create_database.py again.
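
If you rebuild often, the delete step can be scripted. The helper below is a small sketch (the function name `reset_vector_store` is hypothetical) that removes the folder when it exists and reports whether anything was deleted. Run `python create_database.py` afterwards as usual.

```python
import shutil
from pathlib import Path


def reset_vector_store(path: str = "chroma_db") -> bool:
    """Delete the vector store folder; return True if it existed."""
    store = Path(path)
    if store.is_dir():
        shutil.rmtree(store)
        return True
    return False
```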

4. Run the Application

python app.py

Open your browser at: http://localhost:5000

For audio preprocessing (YouTube to transcript), see the preprocessing/ directory.


Project Structure

DataCommit/
├── app.py # Flask web server
├── rag_pipeline.py # RAG pipeline & Gemini integration
├── create_database.py # Vector database creation
├── data/ # Episode transcripts
├── chroma_db/ # Vector database (auto-generated)
├── static/ # Frontend assets (CSS, JS, images)
├── templates/ # HTML templates
└── preprocessing/ # Audio-to-text scripts

📬 Contact

Enes Fehmi Manan

Made with ❤️ for the Turkish Data Science Community
