A tool that extracts, cleans, chunks, and summarizes PDF documents and answers questions about them with retrieval-augmented generation (RAG). It uses Google Gemini for generation and embeddings, and FAISS for vector storage.
- Document Processing: Extracts text from PDFs, cleans noise (headers/footers), and chunks text intelligently.
- AI Summarization: Generates concise summaries of document sections using Gemini 2.5 Flash.
- RAG Q&A: Ask questions about your document and get answers grounded in the passages retrieved from it.
- Dual Interfaces:
  - API: FastAPI backend for integration.
  - Web UI: Modern Streamlit interface.
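The feature list does not say how the text is chunked, so the following is only a minimal sketch of one common approach: fixed-size word windows with overlap. The function name `chunk_words` and the default sizes are hypothetical and not taken from chunker.py.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.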
- Python: Version 3.8 or higher.
- Dependencies:
pip install -r requirements.txt
- Environment Variables: Create a .env file in the root directory:
  GEMINI_API_KEY=your_api_key_here
- Streamlit Secrets (for hosting the UI): If deploying to Streamlit Cloud, add GEMINI_API_KEY to your app's secrets.
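As a rough illustration of how the key could be resolved at startup, here is a standard-library sketch that checks the process environment first and falls back to the .env file. The helpers `parse_env_file` and `get_api_key` are hypothetical; the project itself may rely on a package such as python-dotenv or on Streamlit's secrets instead.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_key(
    env_path: Path = Path(".env"),
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return GEMINI_API_KEY from the environment, else from the .env file."""
    environ = os.environ if environ is None else environ
    key = environ.get("GEMINI_API_KEY") or parse_env_file(env_path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set in the environment or in .env")
    return key
```

Failing fast with a clear error is usually friendlier than letting the first Gemini call fail with an authentication error.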
The Streamlit UI is the dynamic interface where you can upload any PDF.
cd src
python -m streamlit run streamlit_app.py
- Upload: Drag & drop any PDF in the sidebar.
- Chat: Ask questions about the uploaded document immediately.
If you want to process the default file (file/data.pdf) without the UI:
cd src
python main.py
Outputs: output/cleaned.txt, output/index.faiss, output/metadata.pkl
For backend integration, start the FastAPI server. It uses the data already processed by main.py or by the UI.
cd src
python -m uvicorn api:app --reload --port 8000
Test Endpoint:
curl -X POST "http://127.0.0.1:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main topic?"}'
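The same request can be sent from Python. This is a minimal standard-library sketch that mirrors the curl call: the URL, the /ask path, and the "question" field come from the command, while the helper names and the assumption that the server replies with JSON are mine.

```python
import json
import urllib.request


def build_ask_request(
    question: str, base_url: str = "http://127.0.0.1:8000"
) -> urllib.request.Request:
    """Build the POST /ask request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = "http://127.0.0.1:8000", timeout: float = 30.0):
    """Send the question and decode the reply (assumed to be JSON)."""
    request = build_ask_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `ask("What is the main topic?")` should return whatever payload api.py produces; its exact shape is not documented here.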
graph TD
    A[Input PDF] -->|extractor.py| B(Raw Text)
    B -->|cleaner.py| C{Clean Data?}
    C -->|Remove Gibberish/Headers| D[Cleaned Text]
    D -->|chunker.py| E[Text Chunks]
    subgraph "Vector Search (RAG)"
        E -->|embedding.py| K[Vector Embeddings]
        K --> M[FAISS Index]
        M --> N[Retrieval System]
    end
    subgraph "Interfaces"
        N --> P[Streamlit UI]
        N --> Q[FastAPI]
    end
    Q & P --> R[Gemini Answer]
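To make the retrieval step in the diagram concrete, here is a toy sketch that ranks chunks by cosine similarity using brute force in plain Python. It is a stand-in for illustration only: the project uses FAISS for the index and Gemini for the embeddings, the metric its index actually uses is not stated, and the names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    chunks: Sequence[str],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), chunk) for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In the real pipeline the retrieved chunks would then be passed to Gemini as context for the answer; a vector index such as FAISS replaces the linear scan so that search stays fast as the number of chunks grows.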
- src/streamlit_app.py: Dynamic Web UI (Upload & Chat).
- src/main.py: Pipeline orchestrator (Extract -> Chunk -> Index).
- src/rag_core.py: Shared logic for RAG initialization and retrieval.
- src/api.py: FastAPI application.
- src/embedding/: Handles Gemini Embeddings and FAISS indexing.
- src/summarization/: Summarization logic modules.
- Dockerfile: Configuration for Docker deployment.
- output/: Stores generated artifacts (index, metadata, cleaned text).