Ablation-oriented RAG system featuring adaptive query rewriting, multihop retrieval, Chain-of-Verification (CoVe), and cross-encoder reranking. Designed for Polish and multilingual knowledge bases.
- Overview
- Features
- Architecture
- Requirements
- Installation
- Configuration
- Usage
- API Reference
- Evaluation
- Experimental Results
- Project Structure
- License
- References
RAGx is an advanced Retrieval-Augmented Generation system that combines semantic search with large language models to provide accurate, citation-backed answers. The system features:
- Adaptive Query Processing - Automatic detection of query complexity and decomposition into sub-queries
- Multihop Retrieval - Three-stage reranking pipeline for complex questions requiring information synthesis
- Chain-of-Verification (CoVe) - Post-generation claim verification and correction
- Multi-Provider LLM Support - HuggingFace, Ollama, vLLM, and OpenAI-compatible APIs
The system is optimized for Polish Wikipedia but supports any multilingual corpus through configurable embeddings.
Core RAG pipeline:

| Component | Description |
|---|---|
| Semantic Retrieval | Multilingual embeddings (GTE, E5, BGE) with Qdrant HNSW indexing |
| Cross-Encoder Reranking | Jina Reranker v2 for precision improvement on top-K results |
| LLM Generation | Multi-provider support with 4-bit quantization |
| Citation Enforcement | Inline source citations [N] for every factual claim |
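A minimal sketch of this retrieve-then-rerank flow, assuming a running Qdrant instance, the default models listed here, and an illustrative payload field `text` (this is not the project's exact API):

```python
# Minimal retrieve-then-rerank sketch; collection and payload field names are assumptions.
from qdrant_client import QdrantClient
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)
reranker = CrossEncoder("jinaai/jina-reranker-v2-base-multilingual", trust_remote_code=True)
qdrant = QdrantClient(url="http://localhost:6333")

query = "Co to jest sztuczna inteligencja?"
query_vector = embedder.encode(query).tolist()

# Stage 1: dense retrieval over the HNSW index (top-K candidates).
hits = qdrant.search(collection_name="ragx_documents_v3", query_vector=query_vector, limit=100)
candidates = [hit.payload["text"] for hit in hits]

# Stage 2: cross-encoder reranking of the candidates, keeping the best few for the prompt.
scores = reranker.predict([(query, doc) for doc in candidates])
top_context = [doc for doc, _ in sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)[:8]]
```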
Adaptive query processing:

| Feature | Description |
|---|---|
| Linguistic Analysis | spaCy-based POS tagging, dependency parsing, NER |
| Query Type Detection | Automatic classification: comparison, verification, similarity, chaining, temporal, aggregation, superlative |
| Sub-Query Decomposition | LLM-powered decomposition of complex questions |
| Adaptive Rewriting | Query expansion and reformulation based on linguistic features |
Retrieval modes:

| Mode | Description |
|---|---|
| Single Query | Standard retrieval with optional reranking |
| Multihop | Parallel retrieval for sub-queries with three-stage fusion (local, fusion, global) |
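The multihop path's three-stage fusion can be sketched roughly as follows; the `max` strategy and the 0.6 global weight mirror the defaults shown in the configuration section, but the data structures are illustrative, not the project's exact code:

```python
# Illustrative three-stage fusion for multihop retrieval (not the exact implementation).
from collections import defaultdict

def fuse_multihop(per_subquery_hits, global_scores, global_weight=0.6, final_top_k=10):
    """per_subquery_hits: {sub_query: [(doc_id, local_rerank_score), ...]}
    global_scores: {doc_id: rerank score against the ORIGINAL question}."""
    # Stage 1 (local): each sub-query's candidates are assumed already reranked independently.
    # Stage 2 (fusion): merge candidates by doc_id, keeping the best local score ("max" strategy).
    fused = defaultdict(float)
    for hits in per_subquery_hits.values():
        for doc_id, score in hits:
            fused[doc_id] = max(fused[doc_id], score)

    # Stage 3 (global): blend with a rerank score computed against the original query.
    combined = {
        doc_id: global_weight * global_scores.get(doc_id, 0.0) + (1 - global_weight) * local
        for doc_id, local in fused.items()
    }
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:final_top_k]
```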
Chain-of-Verification (CoVe):

| Component | Description |
|---|---|
| Claim Extraction | Automatic extraction of verifiable claims from generated answers |
| NLI Verification | Natural Language Inference-based claim verification against evidence |
| Correction | Automatic correction of unsupported or contradicted claims |
| Citation Injection | Adding citations for verified claims |
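A compact sketch of what one CoVe verification pass might look like, using a generic NLI cross-encoder as the verifier; the model name and thresholds are assumptions, not the project's exact choices:

```python
# Illustrative Chain-of-Verification pass: verify extracted claims against evidence via NLI.
from sentence_transformers import CrossEncoder

# A generic NLI cross-encoder stands in for the project's verifier (assumption).
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def verify_claims(claims, evidence_passages, threshold=0.5):
    """Return (claim, verdict) pairs; unsupported claims are flagged for correction."""
    results = []
    for claim in claims:
        pairs = [(passage, claim) for passage in evidence_passages]
        # Label order for this model family: contradiction, entailment, neutral.
        probs = nli.predict(pairs, apply_softmax=True)
        entailed = max(row[1] for row in probs)
        contradicted = max(row[0] for row in probs)
        if entailed >= threshold:
            verdict = "supported"
        elif contradicted >= threshold:
            verdict = "contradicted"
        else:
            verdict = "unsupported"
        results.append((claim, verdict))
    return results
```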
User Query
|
v
+------------------------------------------+
| 1. LINGUISTIC ANALYSIS (spaCy) |
| - POS tagging, dependency parsing |
| - Entity extraction, clause counting |
+------------------------------------------+
|
v
+------------------------------------------+
| 2. ADAPTIVE QUERY REWRITING (LLM) |
| - Query type detection |
| - Decomposition decision |
| - Sub-query generation |
+------------------------------------------+
|
+---------------+---------------+
| | |
[Simple] [Multihop]
| |
v v
+-------------+ +------------------------+
| Single | | Parallel Retrieval |
| Retrieval | | (per sub-query) |
+-------------+ +------------------------+
| |
v v
+-------------+ +------------------------+
| Standard | | Three-Stage Reranking: |
| Reranking | | 1. Local (per query) |
| | | 2. Fusion (by doc_id) |
| | | 3. Global (original Q) |
+-------------+ +------------------------+
| |
+-------+-------+
|
v
+------------------------------------------+
| 3. PROMPT ENGINEERING |
| - Template selection (basic/enhanced) |
| - Context formatting with metadata |
| - Language detection |
+------------------------------------------+
|
v
+------------------------------------------+
| 4. LLM GENERATION |
| - Multi-provider (HF/Ollama/vLLM/API) |
| - Chain-of-Thought reasoning |
| - Citation formatting |
+------------------------------------------+
|
v
+------------------------------------------+
| 5. CHAIN-OF-VERIFICATION (CoVe) |
| - Claim extraction |
| - NLI verification |
| - Correction and citation injection |
+------------------------------------------+
|
v
Final Answer + Sources + Metadata
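Put together, the routing above amounts to something like this hedged sketch; every helper name is a placeholder, not the project's actual API:

```python
# Hypothetical orchestration of the pipeline above; all helpers are placeholders.
def answer(query: str) -> dict:
    analysis = analyze_linguistics(query)             # 1. spaCy-based analysis
    rewrite = rewrite_query(query, analysis)          # 2. LLM rewriting / decomposition

    if rewrite.is_multihop:
        contexts = multihop_retrieve(rewrite.sub_queries, original_query=query)
    else:
        contexts = retrieve_and_rerank(rewrite.query or query)

    prompt = build_prompt(query, contexts)            # 3. template, metadata, language
    draft = generate_answer(prompt)                   # 4. multi-provider LLM generation
    verified = verify_and_correct(draft, contexts)    # 5. Chain-of-Verification
    return {"answer": verified, "sources": contexts}
```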
Development workstation:

| Resource | Specification |
|---|---|
| GPU | NVIDIA RTX 4070 (12GB VRAM) |
| CPU | AMD Ryzen 7 7800X3D (8 cores / 16 threads) |
| RAM | 32GB DDR5 |
| Storage | 1TB |
| OS | Windows 11 Home |
| LLM | Qwen2.5-14B / Qwen3-8B (4-bit via Ollama) |
Evaluation server:

| Resource | Specification |
|---|---|
| GPU | NVIDIA H100 (96GB VRAM) |
| CPU | AMD Ryzen 7 7800X3D (8 cores / 16 threads) |
| RAM | 128GB DDR5 |
| Storage | 2TB |
| OS | Ubuntu 22.04 LTS |
| LLM | Qwen3-32B (4-bit via vLLM) |
| Component | Model |
|---|---|
| Embedding | Alibaba-NLP/gte-multilingual-base |
| Reranker | jinaai/jina-reranker-v2-base-multilingual |
| Linguistic Analysis | spaCy pl_core_news_md |
| Vector Store | Qdrant (self-hosted) |
| Requirement | Version |
|---|---|
| Python | 3.12+ |
| Docker | 20.10+ |
| CUDA | 12.0+ (optional) |
```bash
git clone https://github.com/floressek/ragx.git
cd ragx
```

Using uv (recommended):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install --system -e .
```

Using pip:

```bash
pip install -e .
```

Using make:

```bash
make install
```

Download the spaCy language models:

```bash
python -m spacy download pl_core_news_md
python -m spacy download en_core_web_sm
```
```bash
docker-compose up -d qdrant
```
```bash
cp .env.example .env
# Edit .env with your settings
```

Key settings in `.env`:

```env
# Vector Store
QDRANT_URL=http://localhost:6333
QDRANT_COLLECTION=ragx_documents_v3

# Embeddings
EMBEDDING_MODEL=Alibaba-NLP/gte-multilingual-base
EMBEDDING_BATCH_SIZE=64
EMBEDDING_USE_PREFIXES=true

# Reranking
RERANKER_MODEL=jinaai/jina-reranker-v2-base-multilingual
RERANKER_BATCH_SIZE=16

# LLM Provider: huggingface | ollama | vllm | api
LLM_PROVIDER=huggingface
LLM_MODEL=Qwen/Qwen2.5-7B-Instruct
LLM_LOAD_IN_4BIT=true
LLM_TEMPERATURE=0.2
LLM_MAX_NEW_TOKENS=2000

# Ollama (alternative)
# LLM_PROVIDER=ollama
# OLLAMA_HOST=http://localhost:11434
# LLM_MODEL_NAME_OLLAMA=qwen3:4b

# vLLM (production)
# LLM_PROVIDER=vllm
# LLM_API_BASE_URL=http://localhost:8000/v1
# LLM_API_MODEL_NAME=Qwen/Qwen3-32B

# Query Rewriting
REWRITE_ENABLED=true
REWRITE_TEMPERATURE=0.2
REWRITE_MAX_TOKENS=4096

# Multihop Configuration
MULTIHOP_FUSION_STRATEGY=max
MULTIHOP_GLOBAL_RANKER_WEIGHT=0.6
MULTIHOP_TOP_K_PER_SUBQUERY=20
MULTIHOP_FINAL_TOP_K=10

# Retrieval Pipeline
TOP_K_RETRIEVE=100
RERANK_TOP_M=80
CONTEXT_TOP_N=8

# Chain-of-Verification
COVE_ENABLED=true
COVE_USE_BATCH_NLI=true
```
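At runtime these variables are loaded into typed settings (the project ships a settings module under `src/ragx/utils/settings.py`). A minimal sketch of how such a mapping can look with pydantic-settings, assuming field names that mirror the keys above (the actual module may differ):

```python
# Hedged sketch of env-driven settings; field names mirror the .env keys above,
# but the real src/ragx/utils/settings.py may be structured differently.
from pydantic_settings import BaseSettings, SettingsConfigDict

class RagxSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    qdrant_url: str = "http://localhost:6333"
    qdrant_collection: str = "ragx_documents_v3"
    embedding_model: str = "Alibaba-NLP/gte-multilingual-base"
    reranker_model: str = "jinaai/jina-reranker-v2-base-multilingual"
    llm_provider: str = "huggingface"
    llm_model: str = "Qwen/Qwen2.5-7B-Instruct"
    llm_load_in_4bit: bool = True
    top_k_retrieve: int = 100
    rerank_top_m: int = 80
    context_top_n: int = 8
    cove_enabled: bool = True

settings = RagxSettings()  # environment variables and .env override the defaults
```

Model and application defaults additionally live in YAML files under `configs/`: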
| File | Purpose |
|---|---|
| `configs/models.yaml` | Model configurations (embedder, reranker, LLM) |
| `configs/app.yaml` | Application settings |
| `configs/vector_store.qdrant.yaml` | Qdrant connection settings |
| `configs/eval.yaml` | Evaluation settings |
```bash
# Download Polish Wikipedia dump
make download-wiki

# Extract articles
make extract-wiki

# Ingest to Qdrant (test - 1k articles)
make ingest-test

# Ingest full corpus (200k+ articles)
make ingest-full

# Custom ingestion
make ingest-custom MAX_ARTICLES=50000
```
Pre-built Qdrant snapshot available at: https://huggingface.co/datasets/Floressek/wiki-1m-qdrant-snapshot
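To restore it into a local Qdrant instance, one option is the snapshot-recovery call in `qdrant-client`; this is a hedged sketch, and the snapshot file path and collection name are illustrative and should match your download and configuration:

```python
# Hedged sketch: recover a downloaded snapshot into a local Qdrant instance.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# The location can be a file:// path visible to the Qdrant server or an HTTP URL;
# the exact file name inside the dataset is illustrative.
client.recover_snapshot(
    collection_name="ragx_documents_v3",
    location="file:///qdrant/snapshots/wiki-1m-qdrant-snapshot.snapshot",
)
```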
```bash
# Start FastAPI server
make api

# Or with auto-reload for development
make api-dev

# Manual start
python -m uvicorn src.ragx.api.main:app --host 0.0.0.0 --port 8000
```
```bash
./launch_ui.sh

# Or directly
streamlit run src/ragx/ui/chat_app.py
```

Other CLI commands:

```bash
# Search
make search QUERY="sztuczna inteligencja"

# Check status
make status
```
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api` | API information and available endpoints |
| GET | `/info/health` | Health check with model status |
| POST | `/ask/baseline` | Simple RAG pipeline (retrieval + LLM) |
| POST | `/ask/enhanced` | Full pipeline with query rewriting and multihop |
| POST | `/llm/generate` | Direct LLM access (no RAG) |
| POST | `/search/search` | Vector search only |
| POST | `/search/rerank` | Search with reranking |
| POST | `/analysis/linguistic` | Linguistic analysis of query |
| POST | `/analysis/rewrite` | Query rewriting analysis |
| POST | `/cove/verify` | CoVe verification of answer |
| POST | `/eval/ablation` | Ablation study endpoint with configurable toggles |
Baseline Pipeline:
```bash
curl -X POST "http://localhost:8000/ask/baseline" \
  -H "Content-Type: application/json" \
  -d '{"query": "Co to jest sztuczna inteligencja?", "top_k": 5}'
```
Enhanced Pipeline (with query rewriting and multihop):
```bash
curl -X POST "http://localhost:8000/ask/enhanced" \
  -H "Content-Type: application/json" \
  -d '{"query": "ziemniaki vs pomidory, co ma wiecej blonnika?"}'
```
Ablation Study:
```bash
curl -X POST "http://localhost:8000/eval/ablation" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Kto zalozyl Krakow?",
    "top_k": 8,
    "query_analysis_enabled": true,
    "reranker_enabled": true,
    "cot_enabled": true,
    "cove_mode": "auto",
    "prompt_template": "auto"
  }'
```
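The same endpoints can be called from Python with `requests`; this mirrors the enhanced-pipeline call above and reads fields from the response format shown next:

```python
# Calling the enhanced pipeline from Python; mirrors the curl example above.
import requests

resp = requests.post(
    "http://localhost:8000/ask/enhanced",
    json={"query": "ziemniaki vs pomidory, co ma wiecej blonnika?"},
    timeout=120,
)
resp.raise_for_status()
data = resp.json()

print(data["answer"])
for src in data["sources"]:
    print(src["doc_title"], src.get("rerank_score"))
```

A typical enhanced-pipeline response has the following shape: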
{
"answer": "Answer text with citations [1][2]...",
"sources": [
{
"id": "doc_id",
"text": "Source text...",
"doc_title": "Document Title",
"retrieval_score": 0.85,
"rerank_score": 0.92
}
],
"metadata": {
"pipeline": "enhanced",
"is_multihop": true,
"sub_queries": ["sub-query 1", "sub-query 2"],
"query_type": "comparison",
"rewrite_time_ms": 450.2,
"retrieval_time_ms": 25.8,
"rerank_time_ms": 180.5,
"llm_time_ms": 920.1,
"cove_time_ms": 340.0,
"total_time_ms": 1916.6
}
}

Evaluation was conducted using ablation experiments on a test set of 1000 synthetic questions generated from the Polish Wikipedia corpus (1M articles). The evaluation uses RAGAS metrics and was performed on infrastructure provided by the Military University of Technology Cloud Laboratory.
Unlike standard benchmarks (PolQA, MKQA), synthetic questions generated directly from the indexed corpus provide:
- Full control over grounding - each question has exact source documents in the Qdrant database
- Coverage of all query types supported by the system
- Polish language optimization
The WikipediaQuestionGenerator creates evaluation questions through the following pipeline:
- Article Sampling - Random selection of article chunks from ingested data (minimum 200 characters)
- Question Generation - LLM-based generation (Qwen3:32B) with dedicated prompts per question type
- Grounding Validation - Cross-encoder verification that "ground truth" exists in the source article
- Export - JSONL format with fields `ground_truth`, `type`, and `contexts`
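A simplified sketch of that loop is shown below; `generate_with_llm`, the chunk fields, and the grounding threshold are placeholders, and only the four-step flow mirrors the description above:

```python
# Illustrative question-generation loop; generate_with_llm and thresholds are placeholders.
import json
import random
from sentence_transformers import CrossEncoder

# Cross-encoder used for grounding validation (assumption: same reranker as retrieval).
grounding_checker = CrossEncoder("jinaai/jina-reranker-v2-base-multilingual", trust_remote_code=True)

def generate_eval_set(chunks, question_types, out_path, min_chars=200, min_grounding=0.5):
    with open(out_path, "w", encoding="utf-8") as out:
        for qtype in question_types:
            # 1. Article sampling: random chunk with at least `min_chars` characters.
            chunk = random.choice([c for c in chunks if len(c["text"]) >= min_chars])
            # 2. Question generation: dedicated LLM prompt per question type (placeholder call).
            question, ground_truth = generate_with_llm(chunk["text"], qtype)
            # 3. Grounding validation: the ground truth must be supported by the source chunk.
            score = grounding_checker.predict([(ground_truth, chunk["text"])])[0]
            if score < min_grounding:
                continue
            # 4. Export one JSONL record.
            record = {
                "question": question,
                "ground_truth": ground_truth,
                "type": qtype,
                "source_title": chunk["title"],
                "contexts": [chunk["url"]],
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```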
Example generated question (JSONL format):
{
"question": "Jak nazywa sie wspolnik do Neville'a Roundego?",
"ground_truth": "Tekstor zostal przez Aldine...",
"type": "simple",
"source_title": "Neville'a...",
"contexts": [
"https://pl.wikipedia.org/wiki/Aldine#12661"
]
}

| Metric | Description |
|---|---|
| Faithfulness | Factual consistency of answer with retrieved contexts |
| Answer Relevancy | Relevance of answer to the question |
| Context Precision | Proportion of relevant contexts in retrieved set |
| Context Recall | Coverage of ground truth by retrieved contexts |
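Scoring relies on the RAGAS library; this is a hedged sketch of how a single configuration's outputs could be evaluated (column names follow RAGAS conventions, and an evaluator LLM must be configured separately, e.g. via environment variables):

```python
# Hedged RAGAS scoring sketch; the project's own harness (make eval-run) wraps this
# with checkpointing and per-configuration toggles.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

records = {
    "question": ["Kto zalozyl Krakow?"],
    "answer": ["Pipeline answer with citations [1]..."],
    "contexts": [["Source passage about Krakow..."]],
    "ground_truth": ["Expected answer from the test set..."],
}
dataset = Dataset.from_dict(records)

scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)
```

In practice the full study is driven through the Makefile targets: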
```bash
# Generate test questions (default: 1000)
make eval-generate NUM_QUESTIONS=1000

# Start RAG API server
make eval-api

# Run ablation study with checkpointing
make eval-run

# Resume interrupted evaluation
make eval-resume RUN_ID=study_20240115_143022

# Quick validation (10 questions, 3 configs)
make eval-quick

# Clean checkpoints and results
make eval-clean
```
Full results across 12 configurations on 1000 test questions. Bold values mark the best result for each RAGAS metric.
| Configuration | Faithfulness | Relevancy | Precision | Recall | Latency | Multihop Cov. |
|---|---|---|---|---|---|---|
| baseline | 0.768 | 0.594 | 0.463 | 0.600 | 2.7s | 0.00 |
| enhanced_only | 0.850 | 0.646 | 0.443 | 0.622 | 7.6s | 0.00 |
| cot_only | 0.884 | 0.641 | 0.440 | 0.614 | 9.6s | 0.00 |
| reranker_only | 0.838 | 0.680 | 0.501 | 0.698 | 3.9s | 0.00 |
| cove_auto_only | 0.872 | 0.621 | 0.448 | 0.613 | 44.3s | 0.00 |
| cot_enhanced | 0.823 | 0.653 | 0.431 | 0.610 | 12.7s | 0.00 |
| multihop_only | **0.891** | 0.714 | 0.494 | **0.829** | 18.4s | 0.64 |
| multihop+cot | 0.870 | **0.762** | 0.493 | 0.828 | 25.7s | 0.62 |
| full_no_cove | 0.881 | 0.721 | 0.506 | 0.823 | 24.5s | 0.61 |
| full_cove_auto | 0.855 | 0.732 | 0.516 | 0.810 | 62.8s | 0.64 |
| full_cove_metadata | 0.858 | 0.743 | 0.498 | 0.827 | 60.4s | 0.62 |
| full_cove_suggest | 0.832 | 0.756 | **0.522** | 0.810 | 63.6s | 0.60 |
Individual component contribution compared to baseline:
| Component | Faithfulness | Relevancy | Recall | Latency |
|---|---|---|---|---|
| Reranker | +9.1% (0.77->0.84) | +15.3% (0.59->0.68) | +16.7% (0.60->0.70) | +1.2s |
| CoT | +14.2% (0.77->0.88) | +8.5% (0.59->0.64) | +1.6% (0.60->0.61) | +6.9s |
| Multihop | +15.6% (0.77->0.89) | +20.3% (0.59->0.71) | +38.4% (0.60->0.83) | +15.7s |
| CoVe (auto) | +11.6% (0.77->0.86) | +5.1% (0.59->0.62) | +1.6% (0.60->0.61) | +41.6s |
- Multihop module dominates in Faithfulness (0.891) and Context Recall (0.829)
- Multihop + CoT combination achieves highest Answer Relevancy (0.762)
- Baseline performs lowest across all quality metrics
- Any component addition improves results over baseline
- Best ROI: Multihop provides +38.4% Recall improvement with acceptable latency cost
61-64% of test queries are classified as multihop, indicating a significant proportion of complex questions in the test set:
| Configuration | Multihop | Simple | Coverage |
|---|---|---|---|
| multihop_only | 645 | 355 | 64.0% |
| multihop+cot | 620 | 380 | 62.0% |
| full_no_cove | 615 | 385 | 61.0% |
| full_cove_auto | 640 | 360 | 63.5% |
| full_cove_metadata | 630 | 370 | 61.5% |
| full_cove_suggest | 620 | 380 | 62.0% |
| Tier | Configurations | Latency | Characteristics |
|---|---|---|---|
| Fast | baseline, reranker_only | 2.7-3.9s | Production real-time, basic quality |
| Medium | enhanced, cot_only, cot_enhanced | 7.6-12.7s | Quality/speed balance |
| Slow | multihop_only, multihop+cot, full_no_cove | 18.4-25.7s | High quality, acceptable latency |
| Very Slow | cove_auto_only, full_cove_* | 44.3-63.6s | Highest quality, offline use |
| Use Case | Configuration | Expected Latency |
|---|---|---|
| Real-time chat | reranker_only | ~4s |
| Balanced production | multihop_only | ~18s |
| Maximum quality (async) | full_cove_metadata | ~60s |
ragx/
├── src/ragx/
│ ├── api/ # FastAPI server
│ │ ├── routers/ # Endpoint handlers
│ │ ├── schemas/ # Pydantic models
│ │ ├── dependencies.py # Dependency injection
│ │ └── main.py # Application entry point
│ │
│ ├── ingestion/ # Data ingestion pipeline
│ │ ├── chunkers/ # Text chunking strategies
│ │ ├── pipelines/ # Ingestion orchestration
│ │ ├── extractions/ # Wikipedia extraction
│ │ └── utils/ # Ingestion utilities
│ │
│ ├── retrieval/ # Retrieval components
│ │ ├── embedder/ # Bi-encoder embeddings
│ │ ├── rerankers/ # Cross-encoder reranking
│ │ ├── analyzers/ # Linguistic analysis
│ │ ├── rewriters/ # Query rewriting
│ │ ├── cove/ # Chain-of-Verification
│ │ └── vector_stores/ # Qdrant integration
│ │
│ ├── pipelines/ # RAG pipelines
│ │ ├── base.py # Abstract base
│ │ ├── baseline.py # Simple RAG
│ │ ├── enhanced.py # Full pipeline
│ │ └── enhancers/ # Pipeline enhancers
│ │
│ ├── generation/ # LLM generation
│ │ ├── inference.py # Multi-provider inference
│ │ ├── providers/ # Provider implementations
│ │ └── prompts/ # Prompt templates
│ │
│   ├── ui/                  # Streamlit chat interface (Claude-generated, beta)
│ │ ├── chat_app.py # Main application
│ │ ├── components/ # UI components
│ │ └── config/ # UI configuration
│ │
│ └── utils/ # Shared utilities
│ ├── settings.py # Configuration management
│ ├── logging_config.py # Logging setup
│ └── model_registry.py # Model caching
│
├── configs/ # Configuration files
├── data/ # Data directory
│ ├── raw/ # Raw Wikipedia dumps
│ ├── processed/ # Extracted articles
│ └── db_snapshots/ # Qdrant snapshots
├── scripts/ # Utility scripts
├── results/ # Evaluation results
├── docker-compose.yml # Docker services
├── Makefile # Build commands
├── pyproject.toml # Project dependencies
└── README.md
This project is licensed under the MIT License. See LICENSE for details.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
- Query Rewriting in Retrieval-Augmented Large Language Models (Ma et al., 2023)
- HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (Yang et al., 2018)
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (Reimers & Gurevych, 2019)
- Chain-of-Verification Reduces Hallucination in Large Language Models (Dhuliawala et al., 2023)
- GTE-multilingual (Alibaba)
- Jina Reranker v2 (Jina AI)
- Qwen2.5 / Qwen3 (Alibaba Cloud)