Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

hk2166/RAG_PDF_Summarizer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

History

16 Commits

Repository files navigation

PDF Summarizer & RAG Q&A System

A comprehensive tool to extract, clean, chunk, summarize, and perform Question-Answering (RAG) on PDF documents. It uses Google Gemini for generation and embeddings, and FAISS for vector storage.

✨ Features

  • Document Processing: Extracts text from PDFs, cleans noise (headers/footers), and chunks text intelligently.
  • AI Summarization: Generates concise summaries of document sections using Gemini 2.5 Flash.
  • RAG Q&A: Ask questions about your document and get answers based on accurate context retrieval.
  • Dual Interfaces:
    • API: FastAPI backend for integration.
    • Web UI: Modern Streamlit interface.

πŸ“‹ Requirements

  • Python: Version 3.8 or higher.
  • Dependencies:
    pip install -r requirements.txt

πŸ”‘ Configuration

  1. Environment Variables: Create a .env file in the root directory:

    GEMINI_API_KEY=your_api_key_here
  2. Streamlit Secrets (For hosting UI): If deploying to Streamlit Cloud, add GEMINI_API_KEY to your app's secrets.

πŸš€ How to Run

1. Web UI (Recommended)

This is the dynamic interface where you can upload ANY PDF.

cd src
/Users/hemant/.pyenv/versions/3.10.16/bin/python -m streamlit run streamlit_app.py
  • Upload: Drag & drop any PDF in the sidebar.
  • Chat: Ask questions about the uploaded document immediately.

2. Output Data Pipeline (Manual)

If you want to process the default file (file/data.pdf) without the UI:

cd src
/Users/hemant/.pyenv/versions/3.10.16/bin/python main.py

Outputs: output/cleaned.txt, output/index.faiss, output/metadata.pkl

3. API Server (FastAPI)

For backend integration (uses the processed data from step 2 or UI).

cd src
/Users/hemant/.pyenv/versions/3.10.16/bin/python -m uvicorn api:app --reload --port 8000

Test Endpoint:

curl -X POST "http://127.0.0.1:8000/ask" \
 -H "Content-Type: application/json" \
 -d '{"question": "What is the main topic?"}'

πŸ”„ System Architecture

graph TD
 A[πŸ“„ Input PDF] -->|extractor.py| B(πŸ“ Raw Text)
 B -->|cleaner.py| C{Clean Data?}
 C -->|Remove Gibberish/Headers| D[🧹 Cleaned Text]
 D -->|chunker.py| E[🧩 Text Chunks]
 subgraph "Vector Search (RAG)"
 E -->|embedding.py| K[πŸ“ Vector Embeddings]
 K --> M[πŸ—„οΈ FAISS Index]
 M --> N[πŸ” Retrieval System]
 end
 subgraph "Interfaces"
 N --> P[πŸ–₯️ Streamlit UI]
 N --> Q[πŸ”Œ FastAPI]
 end
 Q & P --> R[πŸ€– Gemini Answer]
Loading

πŸ“‚ Project Structure

  • src/streamlit_app.py: Dynamic Web UI (Upload & Chat).
  • src/main.py: Pipeline orchestrator (Extract -> Chunk -> Index).
  • src/rag_core.py: Shared logic for RAG initialization and retrieval.
  • src/api.py: FastAPI application.
  • src/embedding/: Handles Gemini Embeddings and FAISS indexing.
  • src/summarization/: Summarization logic modules.
  • Dockerfile: Configuration for Docker deployment.
  • output/: Stores generated artifacts (index, metadata, cleaned text).

About

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

AltStyle γ«γ‚ˆγ£γ¦ε€‰ζ›γ•γ‚ŒγŸγƒšγƒΌγ‚Έ (->γ‚ͺγƒͺγ‚ΈγƒŠγƒ«) /