
GainRAG: Preference Alignment in Retrieval-Augmented Generation through Gain Signal Synthesis

GainRAG Framework

🛠 Installation

The main dependencies are torch 2.5.1, vllm 0.7.3, FlagEmbedding 1.3.3, DeepSpeed, trl, peft, faiss/faiss-gpu.

conda create -n GainRAG python=3.9.18
conda activate GainRAG
pip install -r requirements.txt
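
After installation, the pinned versions can be checked from Python. The following is a minimal sketch, not part of the repository: `check_pins` and `PINNED` are hypothetical names, and only the three versions stated above are checked.

```python
from importlib import metadata

# Versions stated in this README; the other dependencies have no stated pin.
PINNED = {"torch": "2.5.1", "vllm": "0.7.3", "FlagEmbedding": "1.3.3"}


def check_pins(pinned=PINNED, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # A local build tag such as "2.5.1+cu121" still counts as a match.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_pins().items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means the three pinned packages match; anything printed is worth resolving before training.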

💡 Preparation

Download Corpus & Index

Retrieval is performed on the set of Wikipedia passages used in DPR. Download the passages:

wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz

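Download passage embeddings pre-computed with Contriever or Contriever-msmarco. Both archives are named wikipedia_embeddings.tar, so fetch only the one you need or save each to its own directory: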

wget https://dl.fbaipublicfiles.com/contriever/embeddings/contriever/wikipedia_embeddings.tar
wget https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
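
To sanity-check the passage download, the file can be read directly. This is a small sketch that assumes the usual DPR layout of a tab-separated file with an `id`, `text`, `title` header row; `iter_passages` is a hypothetical helper, not a function from this repository.

```python
import csv
import gzip


def iter_passages(path):
    """Yield {"id", "text", "title"} dicts from a DPR-style passages TSV (.gz or plain)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and the header row (assumed to be: id, text, title).
            if not row or row[0] == "id":
                continue
            yield {"id": row[0], "text": row[1], "title": row[2]}
```

For example, `next(iter_passages("psgs_w100.tsv.gz"))` returns the first passage without loading the whole file into memory.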

Retrieve top-k passages:

cd ./gainRAG/retrieval_engine
python retrieval.py # Remember to configure your parameters
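
For orientation, dense retrieval with Contriever-style embeddings ranks passages by the inner product between the query embedding and each passage embedding. The sketch below illustrates that ranking step in plain Python on toy vectors. It is an assumption about what the retrieval step computes, not the code in retrieval.py, and `top_k_by_inner_product` is a hypothetical name; a real run over the full Wikipedia index would use faiss.

```python
import heapq


def top_k_by_inner_product(query_vec, passage_vecs, k):
    """Return [(passage_index, score)] for the k highest inner products, best first."""
    scored = (
        (sum(q * p for q, p in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(passage_vecs)
    )
    # nlargest keeps only k candidates in memory while scanning all passages.
    best = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in best]


# Toy example: three 2-dimensional "passage embeddings" and one query.
hits = top_k_by_inner_product([1.0, 0.0], [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]], k=2)
print(hits)
```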

🎯 Train Selector

Gain Signal Synthesis:

cd ./gainRAG
python -m llm_supervision.construct_hf \
 --data_path TODOpath/data.jsonl \
 --output_path TODOpath/data_train.json \
 --task HotpotQA \
 --alpha 0.5

Data format conversion:

cd ./data
python data2selector.py # Remember to configure your parameters

Selector Training:

The selector is initialized from the bge-reranker-base weights:

cd ./gainRAG
torchrun --nproc_per_node 1 \
 -m selector_finetune \
 --model_name_or_path path/bge-rerank-base \
 --train_data TODOpath/data.jsonl \
 --deepspeed TODOpath/deepspeed/ds_stage0.json \
 --output_dir TODOpath/model_outputs/ \
 --overwrite_output_dir \
 --train_group_size 16 \
 --knowledge_distillation True \
 --query_max_len 512 \
 --passage_max_len 512 \
 --pad_to_multiple_of 8 \
 --learning_rate 6e-5 \
 --fp16 \
 --num_train_epochs 2 \
 --per_device_train_batch_size 8 \
 --gradient_accumulation_steps 1 \
 --dataloader_drop_last True \
 --warmup_ratio 0.1 \
 --gradient_checkpointing \
 --weight_decay 0.01 \
 --logging_steps 1 \
 --save_steps 1000
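
The flags `--knowledge_distillation True` and `--train_group_size 16` suggest listwise training: each query is scored against a group of 16 passages, and the selector is fit to soft teacher scores rather than hard labels. The exact loss used by selector_finetune is not stated here. The sketch below shows one common formulation, the cross-entropy between the softmax of the teacher scores and the softmax of the student scores, as an illustration only. The function names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one group of scores."""
    shift = max(scores)
    log_total = math.log(sum(math.exp(s - shift) for s in scores))
    return [s - shift - log_total for s in scores]


def listwise_distillation_loss(student_scores, teacher_scores):
    """Cross-entropy between teacher and student distributions for one passage group.

    Assumption: teacher_scores are the soft targets (for example, synthesized
    gain signals) and student_scores are the selector's raw logits.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_scores)]
    student_log_probs = log_softmax(student_scores)
    return -sum(t * lp for t, lp in zip(teacher_probs, student_log_probs))


# A student that ranks passages like the teacher gets a lower loss.
teacher = [2.0, 0.5, -1.0]
print(listwise_distillation_loss([2.0, 0.5, -1.0], teacher))
print(listwise_distillation_loss([-1.0, 0.5, 2.0], teacher))
```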

📈 Run Evaluation

0. Download Evaluation Data:

HotpotQA, 2WikiMultiHopQA, WebQuestions, NaturalQA, TriviaQA, SQuAD

1. Retrieve top-k passages:

cd ./gainRAG/retrieval_engine
python retrieval.py # Remember to configure your parameters

2. Select top-1 passages:

cd ./gainRAG
python -m selector_engine.selector_gainRag \
 --model_name_or_path "model_path/" \
 --data_path "path/GainRAG/data/eval_data/HotpotQA.jsonl" \
 --output_path "path/GainRAG/data/test.json" \
 --K_docs 1
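
Conceptually, this step scores each retrieved passage against the query and keeps the `K_docs` highest-scoring ones. The sketch below covers only the selection logic, assuming the scores have already been computed (for example by the fine-tuned selector). `select_top_k` is a hypothetical helper, not the repository's implementation.

```python
def select_top_k(passages, scores, k=1):
    """Return the k passages with the highest selector scores, best first."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    # sorted() is stable, so passages with equal scores keep their retrieval order.
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:k]]


print(select_top_k(["passage A", "passage B", "passage C"], [0.1, 2.3, -0.5], k=1))
```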

3. Run generation & evaluation:

cd ./gainRAG
python -m rag_workflow.rag_generation \
 --data_path "selector_output_path" \
 --task "HotpotQA" \
 --lm_type "Llama-3-8B-Instruct" \
 --K_docs 1
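
The metrics reported by rag_workflow.rag_generation are not listed here. For the QA datasets above, exact match and token-level F1 after SQuAD-style answer normalization are common choices, so the sketch below shows those two as an assumption. The function names are illustrative, not taken from the repository.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def token_f1(prediction, gold):
    """Token-overlap F1 between a prediction and a single gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several gold answers, the usual convention is to take the maximum F1 over them.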

Citation

@article{jiang2025gainrag,
 title={GainRAG: Preference Alignment in Retrieval-Augmented Generation through Gain Signal Synthesis},
 author={Jiang, Yi and Zhao, Sendong and Li, Jianbo and Wang, Haochun and Qin, Bing},
 journal={arXiv preprint arXiv:2505.18710},
 year={2025}
}

Thanks for your interest in our work!
