spyrchat/ReRag

ReRag: a Reconfigurable Retrieval-Augmented-Generation Experimentation and Validation framework

Version: 2.0.0
Author: Spiros Chatzigeorgiou

Production-ready Retrieval-Augmented Generation (RAG) system with hybrid retrieval, Self-RAG agent workflows, cross-encoder reranking, and comprehensive benchmarking.


🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose
  • 16GB+ RAM recommended
  • API keys: Google AI, OpenAI (optional: Voyage AI)

1. Setup Environment

# Clone repository
git clone <repository-url>
cd ReRag
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Configure API keys
cp .env_example .env
# Edit .env and add your API keys:
# GOOGLE_API_KEY=your_key_here
# OPENAI_API_KEY=your_key_here

2. Start Vector Database

# Start Qdrant
docker-compose up -d
# Verify it's running
curl http://localhost:6333/healthz
# After ingestion, you can inspect the collections in Qdrant's Web UI:
# http://localhost:6333/dashboard#/collections

3. Run Your First Pipeline

# First download the dataset (see the scripts folder)
# Ingest documents (requires dataset - see Data Ingestion section)
python bin/ingest.py ingest --config pipelines/configs/datasets/stackoverflow_hybrid.yml
# Run agent in interactive mode
python main.py
# Run agent with single query
python main.py --query "What are Python best practices?"
# Run Self-RAG mode (with iterative refinement)
python main.py --mode self-rag --query "Explain how asyncio works"

📚 User Guide

Data Ingestion

Ingest documents into the vector database:

# Basic ingestion from config
python bin/ingest.py ingest --config pipelines/configs/datasets/stackoverflow_hybrid.yml
# Test with dry run (no upload)
python bin/ingest.py ingest --config my_config.yml --dry-run --max-docs 100
# Check ingestion status
python bin/ingest.py status
# Cleanup canary collections
python bin/ingest.py cleanup

Configuration File Format (pipelines/configs/datasets/*.yml):

dataset:
  name: "my_dataset"
  adapter: "stackoverflow" # or full path: "pipelines.adapters.custom.MyAdapter"
  path: "datasets/sosum/data"
embedding:
  strategy: "hybrid" # or "dense" or "sparse"
  dense:
    provider: "google"
    model: "text-embedding-004"
  sparse:
    provider: "sparse"
    model: "Qdrant/bm25"
qdrant:
  collection: "my_collection"
  host: "localhost"
  port: 6333
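
A quick sanity check before ingestion can catch a malformed config early. The sketch below validates an already-parsed config (a plain dict, for example the result of a YAML loader) against the layout shown above. The function name `validate_ingest_config` and the choice of which keys count as required are assumptions made for illustration; the project's own loader may apply different rules.

```python
from typing import Any, Dict, List

VALID_STRATEGIES = {"dense", "sparse", "hybrid"}


def validate_ingest_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable.

    Hypothetical helper: the required keys are assumed from the example config.
    """
    problems: List[str] = []
    for section in ("dataset", "embedding", "qdrant"):
        if not isinstance(cfg.get(section), dict):
            problems.append(f"missing section: {section}")
    if problems:
        return problems  # later checks need all three sections

    for key in ("name", "adapter", "path"):
        if key not in cfg["dataset"]:
            problems.append(f"dataset.{key} is required")

    strategy = cfg["embedding"].get("strategy")
    if strategy not in VALID_STRATEGIES:
        problems.append(f"embedding.strategy must be one of {sorted(VALID_STRATEGIES)}")
    else:
        # A hybrid strategy needs both sub-sections; otherwise only its own.
        needed = ("dense", "sparse") if strategy == "hybrid" else (strategy,)
        for kind in needed:
            if not isinstance(cfg["embedding"].get(kind), dict):
                problems.append(f"embedding.{kind} is required for strategy '{strategy}'")

    if "collection" not in cfg["qdrant"]:
        problems.append("qdrant.collection is required")
    return problems


example_config = {
    "dataset": {"name": "my_dataset", "adapter": "stackoverflow", "path": "datasets/sosum/data"},
    "embedding": {
        "strategy": "hybrid",
        "dense": {"provider": "google", "model": "text-embedding-004"},
        "sparse": {"provider": "sparse", "model": "Qdrant/bm25"},
    },
    "qdrant": {"collection": "my_collection", "host": "localhost", "port": 6333},
}
issues = validate_ingest_config(example_config)
```

Returning a list of problems instead of raising on the first one lets a CLI report every issue in a single pass.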

Retrieval Testing

Test retrieval pipelines before using them in agents:

# Use any retrieval configuration
python bin/retrieval_pipeline.py \
 --config pipelines/configs/retrieval/basic_dense.yml \
 --query "How to handle Python exceptions?" \
 --top-k 5

Agent Workflows

Run the RAG agent in one of two modes:

# Standard RAG mode (single-pass)
python main.py --query "Explain Python decorators"
# Self-RAG mode (iterative refinement with verification)
python main.py --mode self-rag --query "How does asyncio work?"
# Interactive chat
python main.py
# or
python main.py --mode self-rag

Benchmarking

Run evaluation experiments:

# Run experiment with output directory
python -m benchmarks.experiment1 --output-dir results/exp1
# Run 2D grid optimization for hybrid search parameters
python -m benchmarks.optimize_2d_grid_alpha_rrfk \
 --scenario-yaml benchmark_scenarios/your_scenario.yml \
 --dataset-path datasets/sosum/data \
 --n-folds 5 \
 --output-dir results/optimization
# Generate ground truth for evaluation
python -m benchmarks.generate_ground_truth \
 --queries-file queries.json \
 --output-file ground_truth.json

See benchmarks/README.md for detailed documentation.
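
To illustrate what a 2D grid search over `alpha` and the RRF constant involves, here is a minimal sketch. It is not the code in `benchmarks/optimize_2d_grid_alpha_rrfk.py`: the names `grid_search`, `k_fold_indices`, and `toy_metric` are hypothetical, and the protocol is simplified to scoring each parameter pair on every fold and averaging. The `evaluate` callable stands in for a real retrieval run that returns a metric such as Recall@K.

```python
from itertools import product
from statistics import mean
from typing import Callable, List, Optional, Sequence, Tuple

# evaluate(alpha, rrf_k, query_subset) -> metric value (higher is better)
Evaluate = Callable[[float, int, List[str]], float]


def k_fold_indices(n_items: int, n_folds: int) -> List[List[int]]:
    """Split range(n_items) into n_folds interleaved index lists."""
    return [list(range(start, n_items, n_folds)) for start in range(n_folds)]


def grid_search(
    queries: Sequence[str],
    alphas: Sequence[float],
    rrf_ks: Sequence[int],
    evaluate: Evaluate,
    n_folds: int = 5,
) -> Tuple[float, int, float]:
    """Return (alpha, rrf_k, mean fold score) for the best-scoring pair."""
    if not queries:
        raise ValueError("need at least one query")
    # Drop empty folds, which occur when there are fewer queries than folds.
    folds = [fold for fold in k_fold_indices(len(queries), n_folds) if fold]
    best: Optional[Tuple[float, int, float]] = None
    for alpha, rrf_k in product(alphas, rrf_ks):
        fold_scores = [
            evaluate(alpha, rrf_k, [queries[i] for i in fold]) for fold in folds
        ]
        score = mean(fold_scores)
        if best is None or score > best[2]:
            best = (alpha, rrf_k, score)
    assert best is not None  # the grid is assumed to be non-empty
    return best


def toy_metric(alpha: float, rrf_k: int, subset: List[str]) -> float:
    """Synthetic metric peaking at alpha=0.6, rrf_k=60 (illustration only)."""
    return 1.0 - abs(alpha - 0.6) - abs(rrf_k - 60) / 1000


queries = [f"q{i}" for i in range(10)]
best = grid_search(queries, [0.2, 0.4, 0.6, 0.8], [10, 60, 100], toy_metric)
```

Keeping the per-fold scores (rather than only their mean) would also allow a variance estimate or a significance test between parameter pairs.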


📖 System Architecture

Overview

Modular RAG system with three main subsystems:

┌────────────────────────────────────────────────────────────┐
│                         RAG System                         │
├────────────────────────────────────────────────────────────┤
│                                                            │
│   📊 INGESTION   →   🔍 RETRIEVAL   →   🤖 AGENT           │
│                                                            │
│   Documents          Vector Search       LangGraph         │
│   Chunking           Reranking           Response Gen      │
│   Embedding          Filtering           Verification      │
│       ↓                   ↓                   ↓            │
│       └──────────────→ Qdrant ←───────────────┘            │
│                                                            │
│   📈 BENCHMARKS: Evaluation & Optimization                 │
└────────────────────────────────────────────────────────────┘

Core Components

| Component | Purpose | Documentation |
|---|---|---|
| pipelines/ | Data ingestion & processing | README |
| components/ | Retrieval pipeline (filters, rerankers) | README |
| embedding/ | Multi-provider embeddings | README |
| retrievers/ | Dense/sparse/hybrid search | README |
| agent/ | LangGraph workflows (Standard + Self-RAG) | README |
| database/ | Qdrant vector database interface | README |
| benchmarks/ | Evaluation framework | README |
| config/ | Configuration system | - |

🔧 Installation

1. Python Environment

# Clone repository
git clone <repository-url>
cd ReRag
# Create virtual environment (Python 3.11+ required)
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

2. API Keys

# Create environment file
cp .env_example .env

Edit .env and add your API keys:

# Required
GOOGLE_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
# Optional
VOYAGE_API_KEY=your_key_here

3. Start Vector Database

# Start Qdrant using Docker
docker-compose up -d
# Verify it's running
curl http://localhost:6333/healthz

📁 Project Structure

ReRag/
├── readme.md  # This file
├── main.py  # Agent entry point (Standard & Self-RAG modes)
├── config.yml  # Main configuration file
├── docker-compose.yml  # Qdrant database setup
├── requirements.txt  # Python dependencies
│
├── agent/  # LangGraph agent workflows
│   ├── graph_refined.py  # Standard RAG workflow
│   ├── graph_self_rag.py  # Self-RAG workflow (iterative refinement)
│   ├── schema.py  # State definitions
│   └── nodes/  # Agent nodes (retriever, generator, grader)
│
├── pipelines/  # Data ingestion
│   ├── adapters/  # Dataset adapters (StackOverflow, custom)
│   ├── ingest/  # Ingestion pipeline core
│   ├── eval/  # Retrieval evaluator
│   └── configs/  # Dataset configurations
│       └── datasets/  # Per-dataset configs
│
├── components/  # Retrieval pipeline components
│   ├── retrieval_pipeline.py  # Pipeline orchestration
│   ├── rerankers.py  # CrossEncoder, Semantic, ColBERT, MultiStage
│   ├── filters.py  # Tag, duplicate, relevance filters
│   └── post_processors.py  # Result enhancement & limiting
│
├── retrievers/  # Core retrieval implementations
│   ├── dense_retriever.py  # Dense/sparse/hybrid retrieval
│   └── base.py  # Abstract interfaces
│
├── embedding/  # Embedding providers
│   ├── factory.py  # Provider factory
│   ├── providers/  # Google, OpenAI, Voyage, HuggingFace
│   └── base_embedder.py  # Abstract interfaces
│
├── database/  # Vector database
│   ├── qdrant_controller.py  # Qdrant integration
│   └── base.py  # Abstract interfaces
│
├── config/  # Configuration system
│   ├── config_loader.py  # YAML config loader
│   └── llm_factory.py  # LLM provider factory
│
├── benchmarks/  # Evaluation framework
│   ├── experiment1.py  # Main experiment runner
│   ├── optimize_2d_grid_alpha_rrfk.py  # Grid search optimization
│   ├── llm_as_judge_eval.py  # LLM-based evaluation
│   ├── generate_ground_truth.py  # Ground truth generation
│   ├── benchmarks_runner.py  # Core benchmark runner
│   ├── benchmarks_metrics.py  # Metrics (Recall, Precision, MRR, NDCG)
│   ├── report_generator.py  # Report generation (used by experiments)
│   └── statistical_analyzer.py  # Statistical analysis
│
├── bin/  # CLI tools
│   ├── ingest.py  # Ingestion CLI
│   ├── retrieval_pipeline.py  # Retrieval testing CLI
│   ├── qdrant_inspector.py  # Database inspection
│   └── switch_agent_config.py  # Config switcher
│
├── logs/  # Application logs
│   ├── agent.log  # Main agent log
│   ├── ingestion.log  # Ingestion log
│   └── utils/logger.py  # Custom logger
│
└── tests/  # Test suite
    ├── test_self_rag_integration.py  # Self-RAG integration tests
    └── [other test files]

⚙️ Configuration

Configuration Files

Main Config (config.yml):

  • System-wide settings
  • Loaded by config/config_loader.py

Pipeline Configs (pipelines/configs/):

  • datasets/ - Dataset-specific configs (ingestion)
  • retrieval/ - Retrieval pipeline configs

Example: Ingestion Config

dataset:
  name: "stackoverflow"
  adapter: "stackoverflow" # or full path
  path: "datasets/sosum/data"
embedding:
  strategy: "hybrid" # dense, sparse, or hybrid
  dense:
    provider: "google"
    model: "text-embedding-004"
  sparse:
    provider: "sparse"
    model: "Qdrant/bm25"
qdrant:
  collection: "my_collection"
  host: "localhost"
  port: 6333

Environment Variables

| Variable | Description | Required |
|---|---|---|
| GOOGLE_API_KEY | Google AI API key | Yes |
| OPENAI_API_KEY | OpenAI API key | Yes |
| VOYAGE_API_KEY | Voyage AI API key | No |

🔌 Extension Points

Add Custom Dataset Adapter

  1. Create adapter class:

    # pipelines/adapters/my_adapter.py
    from typing import List

    from pipelines.contracts import BaseAdapter, Document

    class MyAdapter(BaseAdapter):
        def load_documents(self) -> List[Document]:
            documents: List[Document] = []
            # Load your data and append Document objects here
            return documents
  2. Use in config:

    dataset:
     adapter: "pipelines.adapters.my_adapter.MyAdapter"
     path: "path/to/data"
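
As a more concrete illustration of the adapter idea, the sketch below reads a JSON Lines file into document objects. It is self-contained so it can run outside the project: the `Document` dataclass here is a stand-in, not the project's `pipelines.contracts.Document`, and `JsonlAdapter` and the `id`/`text` record fields are hypothetical. In the project, the class would subclass `BaseAdapter` and use the real `Document` type.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class Document:
    """Stand-in for the project's Document type (fields are assumptions)."""
    id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class JsonlAdapter:
    """Hypothetical adapter: one JSON object per line with a 'text' field."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def load_documents(self) -> List[Document]:
        documents: List[Document] = []
        with self.path.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                documents.append(
                    Document(
                        # Fall back to the line number when no id is given.
                        id=str(record.get("id", line_no)),
                        text=record["text"],
                        metadata={"source": self.path.name},
                    )
                )
        return documents


# Usage: write a small sample file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"id": "q1", "text": "How do I read a file?"}\n\n{"text": "Use open()."}\n',
        encoding="utf-8",
    )
    docs = JsonlAdapter(str(sample)).load_documents()
```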

Add Custom Reranker

Implement in components/rerankers.py or components/advanced_rerankers.py:

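from typing import List

from components.rerankers import BaseReranker
# SearchResult must also be imported from wherever the project defines it

class MyReranker(BaseReranker):
    def rerank(self, query: str, results: List[SearchResult]) -> List[SearchResult]:
        # Your reranking logic; this placeholder keeps the input order
        reranked_results = list(results)
        return reranked_results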

Add Custom Agent Node

  1. Create node in agent/nodes/:

    from agent.schema import AgentState

    def my_node(state: AgentState) -> AgentState:
        # Process state
        return state
  2. Add to graph in agent/graph_refined.py or agent/graph_self_rag.py


🎯 Key Features

Retrieval Strategies

  • Dense Retrieval: Semantic search using embeddings (Google, OpenAI, Voyage, HuggingFace)
  • Sparse Retrieval: BM25-style keyword matching (Qdrant/bm25, SPLADE)
  • Hybrid Retrieval: Combines dense + sparse with RRF (Reciprocal Rank Fusion)
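
For readers unfamiliar with Reciprocal Rank Fusion, the following is a minimal sketch of the standard formula, in which each document scores the sum of `1 / (k + rank)` over the rankings it appears in. The function name and the sample document IDs are illustrative, and this is not necessarily how the project's retrievers implement fusion (for example, any `alpha` weighting is omitted here).

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Ranks start at 1. A larger k flattens the difference between
    top-ranked and lower-ranked documents.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical sparse results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers, whose raw scores are on different scales.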

Reranking

  • Cross-Encoder: ms-marco-MiniLM-L-6-v2 (default)
  • Semantic: Sentence transformers for semantic similarity
  • ColBERT: Token-level contextual matching
  • Multi-Stage: Cascading rerankers for efficiency
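
The idea behind multi-stage reranking is to run a cheap scorer over many candidates and a costlier one over the survivors. The sketch below shows that cascade with two toy scoring functions. `cascade_rerank`, `overlap_score`, and `phrase_score` are hypothetical names, and the toy scorers merely stand in for real models such as a cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple

# (query, document text) -> relevance score, higher is better
Scorer = Callable[[str, str], float]


def cascade_rerank(
    query: str, docs: Sequence[str], stages: Sequence[Tuple[Scorer, int]]
) -> List[str]:
    """Apply each (scorer, keep) stage in order, keeping the top `keep` docs."""
    candidates = list(docs)
    for scorer, keep in stages:
        ranked = sorted(candidates, key=lambda d: scorer(query, d), reverse=True)
        candidates = ranked[:keep]
    return candidates


def overlap_score(query: str, doc: str) -> float:
    """Cheap stage: number of query words that also occur in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_score(query: str, doc: str) -> float:
    """Stand-in for an expensive model: overlap plus an exact-phrase bonus."""
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return overlap_score(query, doc) + bonus


docs = [
    "python exceptions try except",
    "handle exceptions in python with try except finally blocks",
    "java streams",
    "python lists",
]
top = cascade_rerank(
    "python exceptions", docs, stages=[(overlap_score, 3), (phrase_score, 2)]
)
```

The expensive scorer runs on three documents here instead of four; with thousands of candidates and a neural reranker, that reduction is where the efficiency gain comes from.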

Agent Modes

  • Standard RAG: Single-pass retrieval → generation
  • Self-RAG: Iterative refinement with hallucination detection and context verification
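
The difference between the two modes can be sketched as a control loop: Standard RAG retrieves and generates once, while a Self-RAG style loop checks the answer against the retrieved context and retries with a rewritten query when the check fails. The code below is a simplified illustration with plain callables, not the LangGraph workflow in `agent/graph_self_rag.py`; all function names and the toy corpus are assumptions.

```python
from typing import Callable, List, Tuple


def self_rag_loop(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    is_grounded: Callable[[str, List[str]], bool],
    rewrite: Callable[[str], str],
    max_iterations: int = 3,
) -> Tuple[str, int]:
    """Return (answer, iterations used). Stops early once the answer is grounded."""
    current_query = query
    answer = ""
    for iteration in range(1, max_iterations + 1):
        docs = retrieve(current_query)
        answer = generate(query, docs)
        if is_grounded(answer, docs):
            return answer, iteration
        # The answer is not supported by the context: refine the query and retry.
        current_query = rewrite(current_query)
    return answer, max_iterations


# Toy stand-ins for the retriever, generator, grader, and query rewriter.
corpus = ["asyncio runs coroutines on an event loop", "decorators wrap functions"]


def retrieve(q: str) -> List[str]:
    words = set(q.lower().split())
    return [doc for doc in corpus if words & set(doc.split())]


def generate(q: str, docs: List[str]) -> str:
    return docs[0] if docs else "no supporting context found"


def is_grounded(answer: str, docs: List[str]) -> bool:
    return answer in docs


def rewrite(q: str) -> str:
    return q + " asyncio"


answer, rounds = self_rag_loop("how does async work", retrieve, generate, is_grounded, rewrite)
```

With `max_iterations=1` the loop reduces to the single-pass behaviour of Standard RAG plus one verification step.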

Benchmarking

  • Metrics: Recall@K, Precision@K, MRR, NDCG@K
  • Optimization: Grid search for hybrid parameters (alpha, RRF-k)
  • LLM-as-Judge: Automated quality evaluation (faithfulness, relevance, helpfulness)
  • Statistical Analysis: Cross-validation, significance testing
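
The retrieval metrics listed above have standard definitions, sketched here for binary relevance on a single query. These are illustrative implementations, not the code in `benchmarks/benchmarks_metrics.py`; the function names and sample IDs are assumptions. MRR is the mean of `reciprocal_rank` over all queries.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k positions occupied by relevant documents."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d7", "d2"]  # hypothetical retrieval output
relevant = {"d1", "d2"}            # hypothetical ground truth
recall = recall_at_k(ranked, relevant, k=3)        # 0.5
precision = precision_at_k(ranked, relevant, k=3)  # 1/3
rr = reciprocal_rank(ranked, relevant)             # 0.5
ndcg = ndcg_at_k(ranked, relevant, k=3)
```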

📊 Testing

Run Integration Tests

# Self-RAG integration tests
pytest tests/test_self_rag_integration.py -v
# All tests
pytest tests/ -v

Verify Components

See components/LOGGING_GUIDE.md for how to verify rerankers and filters are working correctly via logs.


🔍 CLI Tools

| Tool | Purpose | Example |
|---|---|---|
| bin/ingest.py | Ingest datasets | python bin/ingest.py ingest --config my_config.yml |
| bin/retrieval_pipeline.py | Test retrieval | python bin/retrieval_pipeline.py --config config.yml --query "test" |
| bin/qdrant_inspector.py | Inspect database | python bin/qdrant_inspector.py list |
| bin/switch_agent_config.py | Switch configs | python bin/switch_agent_config.py |

📈 System Requirements

Minimum:

  • Python 3.11+
  • 8GB RAM
  • 10GB storage

Recommended:

  • 16GB+ RAM
  • SSD storage
  • 4+ CPU cores

📚 Documentation

  • Main README: This file
  • Components: components/README.md - Retrieval pipeline components
  • Pipelines: pipelines/README.md - Data ingestion system
  • Benchmarks: benchmarks/README.md - Evaluation framework
  • Agent: agent/README.md - LangGraph workflows
  • CLI Reference: CLI_REFERENCE.md - Command-line tools
  • Logging Guide: components/LOGGING_GUIDE.md - Verify components work

🛠️ Technologies

  • LangGraph: Agent workflow orchestration
  • Qdrant: Vector database
  • LangChain: Document processing
  • Sentence Transformers: Embeddings and reranking
  • Pydantic: Data validation

📧 Contact

Author: Spiros Chatzigeorgiou
Email: spyrchat@ece.auth.gr


Built for production RAG workflows with hybrid retrieval, advanced reranking, and comprehensive evaluation.
