A Study of Hard Negative Samples on Reasoning-aware Retrieval Models
To check the status of our codebase, see CODEBASE.md.
This is the newer guide. It explains how to generate better negative samples using an LLM.
- Step 1: create a new file '.env' under the synthetic_data_generation folder. Enter
OPENAI_API_KEY = [The key]
Replace [The key] with an OpenAI API key. You can use the one provided by Jianming; it is on our Notion page -> Misc.
- Step 2: from the synthetic_data_generation folder, run
pip install -r requirements.txt
- Step 3: if you are using Linux, run (not tested):
bash setup_java_linux.sh
If you are using Windows, run:
bash setup_java_winos.sh
You will very likely run into errors at this point. Once you have resolved them, everything should be installed and you can proceed to the commands introduced below.
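Before moving on, it can help to confirm that the '.env' file from Step 1 is in place. The following is a minimal sketch, not part of the codebase: the helper names `read_env_file` and `has_openai_key` are hypothetical, and the parser only handles the simple `KEY = VALUE` form shown in Step 1.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY = VALUE lines from a .env file into a dict."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_openai_key(path=".env"):
    """Return True if the file exists and OPENAI_API_KEY is set to a real value."""
    env_path = Path(path)
    if not env_path.is_file():
        return False
    key = read_env_file(env_path).get("OPENAI_API_KEY", "")
    # The literal placeholder from Step 1 does not count as a key.
    return bool(key) and key != "[The key]"


if __name__ == "__main__":
    print("OPENAI_API_KEY found:", has_openai_key(".env"))
```

Run it from the synthetic_data_generation folder so that the relative path '.env' resolves to the file you created.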
First, make sure you cd to synthetic_data_generation/MyCode.
Second, make sure you are logged in to Hugging Face. Go to data_reader.py and uncomment
# login()
Then run:
python data_reader.py
It will ask for your Hugging Face token. Enter your token and you are logged in! After that, you can comment that line out again if you want.
Below is the general template you should use to generate hard queries.
python -m hard_query_gen --dataset $DATASET --model_id $MODEL --queries_per_doc $queries_per_doc --num_docs $num_docs --filter fineweb --prompt_id $prompt_id
Example usage:
python -m hard_query_gen --dataset MedicalSciences --model_id gpt-4o --queries_per_doc 3 --num_docs 2 --prompt_id "hq_gen"
This command generates 6 tuples (3 queries for each of 2 documents) of (hard query, positive document, negative document) based on the MedicalSciences dataset. When --filter fineweb is passed, as in the template above, documents are filtered by fineweb (discarding low-score documents and keeping 1 in the end). "hq_gen" stands for hard-query generation. You can find the generated outputs in outputs/synthetic_questions/all_docs_train_data/hq_gen/gpt-4o/MedicalSciences_2_train_data.jsonl
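The arithmetic and the output location can be sketched as follows. This is an illustration only: the function names are hypothetical, and the path layout (prompt_id, then model_id, then a file name built from the dataset and num_docs) is an assumption inferred from the single example path above, not confirmed against the code.

```python
from pathlib import Path


def expected_tuple_count(queries_per_doc, num_docs):
    """Each selected document yields queries_per_doc tuples."""
    return queries_per_doc * num_docs


def expected_output_path(dataset, model_id, num_docs, prompt_id="hq_gen"):
    """Assumed layout, inferred from the one example path in this guide."""
    return (
        Path("outputs/synthetic_questions/all_docs_train_data")
        / prompt_id
        / model_id
        / f"{dataset}_{num_docs}_train_data.jsonl"
    )


print(expected_tuple_count(3, 2))
# → 6
print(expected_output_path("MedicalSciences", "gpt-4o", 2).as_posix())
# → outputs/synthetic_questions/all_docs_train_data/hq_gen/gpt-4o/MedicalSciences_2_train_data.jsonl
```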
💡 Note that the positive and negative documents in this step are produced by BM25, and we won't be using the negative document from this step. The hard negative document will be generated by supplement_negative_passage.py
⚠️ For production, I highly recommend setting --num_docs to 50 and running the command multiple times until you reach your goal. Doing it all at once is not encouraged.
| argument | type | default |
|---|---|---|
| --dataset | str | MedicalSciences |
| --mode | str | None |
| --model_id | str | gpt-4o |
| --queries_per_doc | int | 3 |
| --num_docs | int | None |
| --filter | str | None |
| --output_dir | str | Don't change |
| --cache_dir | str | Don't change |
| --prompt_id | str | hq_gen |
| --temperature | float | 0 |
| --top_p | float | 0 |
| dataset |
|---|
| MedicalSciences |
| PMCTreatment |
| IIYiClinical |
First, make sure you cd to synthetic_data_generation/MyCode.
Run the following line.
python hq_to_hard_neg_doc.py
Check the outputs/negative_passages folder for the result.
You can configure the following variables in hq_to_hard_neg_doc.py.
# ================================================
data_path = "./outputs/synthetic_questions/all_docs_train_data/hq_gen/gpt-4o"
output_path = "./outputs/negative_passages"
num_workers: int = None
# ================================================
| variable | usage |
|---|---|
| data_path | the input folder containing results of hard queries |
| output_path | where generated hard negative documents should go |
| num_workers | not important |
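To check the result, a short script can count the records in the output folder. This is a sketch under an assumption: it supposes the script writes .jsonl files into output_path, which this guide does not state explicitly. The helper name `summarize_jsonl_dir` is hypothetical, and no field names are assumed; the script only reports whichever fields it finds.

```python
import json
from pathlib import Path


def summarize_jsonl_dir(folder):
    """Return {file name: (record count, sorted field names)} for each .jsonl file."""
    folder = Path(folder)
    if not folder.is_dir():
        return {}
    summary = {}
    for path in sorted(folder.glob("*.jsonl")):
        count = 0
        fields = set()
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # ignore blank lines
                record = json.loads(line)
                count += 1
                if isinstance(record, dict):
                    fields.update(record)
        summary[path.name] = (count, sorted(fields))
    return summary


if __name__ == "__main__":
    results = summarize_jsonl_dir("./outputs/negative_passages")
    for name, (count, fields) in results.items():
        print(f"{name}: {count} records, fields={fields}")
```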
First, make sure you cd to synthetic_data_generation/MyCode.
Run the following line.
python qrel_maker.py --dataset MedicalSciences
| argument | type | default |
|---|---|---|
| --dataset | str | MedicalSciences |
| dataset |
|---|
| MedicalSciences |
| PMCTreatment |
| IIYiClinical |
The walkthrough below assumes you are working on 'MedicalSciences'.
First, cd to MyCode. Run the following command multiple times, checking MyCode/outputs/generation_record/MedicalSciences.jsonl after each run. Stop once the number of entries no longer increases. This ensures each query gets exactly one hard negative document.
python -m hard_query_gen --dataset MedicalSciences --model_id gpt-4o --queries_per_doc 1 --num_docs 50 --prompt_id "hq_gen"
As a stopping point, you can move on to the next step once the record has the following number of entries.
| dataset | number of entries |
|---|---|
| MedicalSciences | 88 |
| PMCTreatment | 150 |
| IIYiClinical | 129 |
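The stop check can be automated with a small helper. This is a sketch, assuming the generation record holds one entry per non-empty line; the names `count_entries` and `reached_stop_point` are hypothetical, and the counts are copied from the table above.

```python
from pathlib import Path

# Entry counts copied from the table above.
EXPECTED_ENTRIES = {"MedicalSciences": 88, "PMCTreatment": 150, "IIYiClinical": 129}


def count_entries(record_path):
    """Count non-empty lines in a record file; a missing file counts as 0."""
    path = Path(record_path)
    if not path.is_file():
        return 0
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def reached_stop_point(dataset, record_dir="outputs/generation_record"):
    """True once the record for this dataset has at least the expected number of entries."""
    record_path = Path(record_dir) / f"{dataset}.jsonl"
    return count_entries(record_path) >= EXPECTED_ENTRIES[dataset]


if __name__ == "__main__":
    print("Stop point reached:", reached_stop_point("MedicalSciences"))
```

Run it from MyCode so that the relative record path resolves correctly.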
Second, run
python hq_to_hard_neg_doc.py
Third, run
python qrel_maker.py --dataset MedicalSciences
Fourth, delete the jsonl file (MyCode/outputs/generation_record/MedicalSciences.jsonl) and repeat the three steps above to generate one more hard negative document for each query.
So, in total, you need to do this five times for one dataset. In the end, each query in the qrels has five hard negative documents, each based on a randomly selected passage associated with that query.
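The whole procedure can be sketched as a driver script. This is an illustration, not part of the codebase: `run_round` and `run_all` are hypothetical names, the record is assumed to hold one entry per non-empty line, and the `max_runs` cap is an added safety limit. The `run` parameter exists so the loop can be tried with a stand-in before it spends API calls.

```python
import subprocess
from pathlib import Path

ROUNDS = 5  # one hard negative document per query in each round


def count_entries(path):
    """Count non-empty lines in the record file; a missing file counts as 0."""
    path = Path(path)
    if not path.is_file():
        return 0
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def run_round(dataset, run, record_path, max_runs=100):
    """Generate queries until the record stops growing, then run the two follow-up steps."""
    previous = -1
    for _ in range(max_runs):
        run(["python", "-m", "hard_query_gen", "--dataset", dataset,
             "--model_id", "gpt-4o", "--queries_per_doc", "1",
             "--num_docs", "50", "--prompt_id", "hq_gen"])
        current = count_entries(record_path)
        if current <= previous:
            break  # the number of entries no longer increases
        previous = current
    run(["python", "hq_to_hard_neg_doc.py"])
    run(["python", "qrel_maker.py", "--dataset", dataset])


def run_all(dataset, run=None, record_dir="outputs/generation_record"):
    """Repeat the round ROUNDS times, deleting the record before each repeat."""
    if run is None:
        def run(argv):
            subprocess.run(argv, check=True)
    record_path = Path(record_dir) / f"{dataset}.jsonl"
    for round_index in range(ROUNDS):
        if round_index > 0:
            record_path.unlink(missing_ok=True)
        run_round(dataset, run, record_path)
```

Calling `run_all("MedicalSciences")` from MyCode would execute the real commands. Each round makes one extra hard_query_gen call to confirm that the record has stopped growing.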