JianxinLin28/COMPSCI-646-group

COMPSCI-646-group

A Study of Hard Negative Samples on Reasoning-aware Retrieval Models

🧑‍🏫 For graders

To check the status of our codebase, see CODEBASE.md.

📕 Setup

⚠️ I strongly recommend installing everything in a dedicated Python environment (Python >= 3.10).

This is the newer guide. It explains how to generate better negative samples using an LLM.

  • Step 1: Create a new file named '.env' in the synthetic_data_generation folder with the following content:
OPENAI_API_KEY = [The key]

Replace [The key] with an OpenAI API key. You can use the one provided by Jianming, found on our Notion page under Misc.

  • Step 2: From the synthetic_data_generation folder, run
pip install -r requirements.txt
  • Step 3: Run the Java setup script for your operating system.

If you are using Linux, run (not tested):

bash setup_java_linux.sh

If you are using Windows, run:

bash setup_java_winos.sh

You will very likely run into errors during setup. Once they are resolved, everything should be installed and you can proceed to the commands below.
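Before moving on, it can help to confirm the two prerequisites from the steps above: the Python version and the '.env' file. The following is a minimal sketch, not part of the repository. The helper names `read_env_file` and `check_setup` are hypothetical, and the parser only handles simple `KEY = VALUE` lines.

```python
import sys
from pathlib import Path
from typing import Dict, List


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY = VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_setup(env_path: Path) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems: List[str] = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or newer is required")
    if not env_path.is_file():
        problems.append(f"{env_path} does not exist")
    elif not read_env_file(env_path).get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is missing or empty in .env")
    return problems
```

Calling `check_setup(Path("synthetic_data_generation/.env"))` from the repository root should return an empty list once Step 1 is done and the interpreter is recent enough. It does not verify that the key itself is valid.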

📖 Guide 1: Generate Hard Query

First, make sure you cd to synthetic_data_generation/MyCode.

Second, log in to Hugging Face. Open data_reader.py and uncomment the line

# login()

Then run:

python data_reader.py

It will ask for your Hugging Face token. Enter the token and you are logged in. Afterwards, you can comment that line out again if you want.


Below is the general template for generating hard queries.

python -m hard_query_gen --dataset $DATASET --model_id $MODEL --queries_per_doc $queries_per_doc --num_docs $num_docs --filter fineweb --prompt_id $prompt_id

Example usage:

python -m hard_query_gen --dataset MedicalSciences --model_id gpt-4o --queries_per_doc 3 --num_docs 2 --prompt_id "hq_gen"

This command generates 6 tuples (3 queries for each of 2 documents) of (hard query, positive document, negative document) from the MedicalSciences dataset. When --filter fineweb is passed, as in the template above, low-scoring documents are discarded and only one is kept in the end. The prompt ID "hq_gen" stands for hard-query generation. The generated output is written to outputs/synthetic_questions/all_docs_train_data/hq_gen/gpt-4o/MedicalSciences_2_train_data.jsonl

💡 Note that the positive and negative documents in this step come from BM25, and we won't use the negative document from this step. The hard negative document will be generated by supplement_negative_passage.py

⚠️ For production, I highly recommend setting num_docs to 50 and running the command multiple times until you reach your goal. Generating everything in a single run is not encouraged.
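To inspect the output file from the example above, a small reader is enough. This is a sketch that assumes the file is standard JSON Lines (one JSON object per line). It makes no assumption about field names and only reports which top-level keys appear. The helpers `load_jsonl` and `summarize` are hypothetical, not repository code.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_jsonl(path: Path) -> List[Dict]:
    """Read one JSON object per non-empty line."""
    records = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                records.append(json.loads(line))
    return records


def summarize(records: List[Dict]) -> Dict[str, int]:
    """Count how many records contain each top-level key."""
    counts: Dict[str, int] = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts
```

For the example command, `len(load_jsonl(path))` on MedicalSciences_2_train_data.jsonl should be 6 (3 queries for each of 2 documents), provided each tuple is written as one line and none is dropped.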

Command arguments

| argument | type | default |
| --- | --- | --- |
| --dataset | str | MedicalSciences |
| --mode | str | None |
| --model_id | str | gpt-4o |
| --queries_per_doc | int | 3 |
| --num_docs | int | None |
| --filter | str | None |
| --output_dir | str | Don't change |
| --cache_dir | str | Don't change |
| --prompt_id | str | hq_gen |
| --temperature | float | 0 |
| --top_p | float | 0 |
Dataset

dataset
MedicalSciences
PMCTreatment
IIYiClinical

📖 Guide 2: Generate Hard Negative Documents

First, make sure you cd to synthetic_data_generation/MyCode.

Run the following line.

python hq_to_hard_neg_doc.py

Check the outputs/negative_passages folder for the results.

You can configure the following variables in hq_to_hard_neg_doc.py.

# ================================================
data_path = "./outputs/synthetic_questions/all_docs_train_data/hq_gen/gpt-4o"
output_path = "./outputs/negative_passages"
num_workers: int = None
# ================================================

| variable | usage |
| --- | --- |
| data_path | the input folder containing the hard-query results |
| output_path | where the generated hard negative documents should go |
| num_workers | not important |
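Before running the script, it can be useful to check that data_path actually contains hard-query results. The sketch below assumes the inputs are .jsonl files directly under data_path, as the Guide 1 output path suggests. Both helper names are hypothetical and are not taken from hq_to_hard_neg_doc.py.

```python
from pathlib import Path
from typing import List


def find_hard_query_files(data_path: str) -> List[Path]:
    """Return the .jsonl files directly under data_path, sorted by name."""
    return sorted(Path(data_path).glob("*.jsonl"))


def prepare_output_dir(output_path: str) -> Path:
    """Create the output folder if it is missing and return it as a Path."""
    out = Path(output_path)
    out.mkdir(parents=True, exist_ok=True)
    return out
```

An empty list from `find_hard_query_files("./outputs/synthetic_questions/all_docs_train_data/hq_gen/gpt-4o")` usually means Guide 1 has not been run yet, or was run with a different model_id or prompt_id.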

📖 Guide 3: Indexing Hard Negative Document for qrel

First, make sure you cd to synthetic_data_generation/MyCode.

Run the following line.

python qrel_maker.py --dataset MedicalSciences

Command arguments

| argument | type | default |
| --- | --- | --- |
| --dataset | str | MedicalSciences |

Dataset

dataset
MedicalSciences
PMCTreatment
IIYiClinical

🚀 Workload

The steps below assume you are working on the 'MedicalSciences' dataset.

First, cd to MyCode and run the following command repeatedly, checking MyCode/outputs/generation_record/MedicalSciences.jsonl after each run. Stop once the number of entries no longer increases. This ensures each query gets exactly one hard negative document.

python -m hard_query_gen --dataset MedicalSciences --model_id gpt-4o --queries_per_doc 1 --num_docs 50 --prompt_id "hq_gen"

As a stop point, you can move on to the next step once the record has the following number of entries.

| dataset | number of entries |
| --- | --- |
| MedicalSciences | 88 |
| PMCTreatment | 150 |
| IIYiClinical | 129 |

Second, run

python hq_to_hard_neg_doc.py

Third, run

python qrel_maker.py --dataset MedicalSciences

Fourth, delete the JSONL file (MyCode/outputs/generation_record/MedicalSciences.jsonl) and repeat the previous three steps to generate one more hard negative document for each query.

In total, you need to run this cycle five times per dataset.

In the end, each query in the qrels has five hard negative documents, each based on a randomly selected passage associated with that query.
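The four workload steps can be scripted as one round. The sketch below rests on several assumptions: it is run from MyCode, the generation record has one entry per line, and `run` is a caller-supplied function that executes a command. The names `count_entries` and `run_round`, and the `max_runs` safety cap, are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import Callable, Sequence

# Stop points copied from the table in this section.
TARGET_ENTRIES = {"MedicalSciences": 88, "PMCTreatment": 150, "IIYiClinical": 129}


def count_entries(record: Path) -> int:
    """Count non-empty lines in the generation record (0 if it does not exist)."""
    if not record.exists():
        return 0
    with record.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def run_round(dataset: str, run: Callable[[Sequence[str]], object],
              max_runs: int = 20) -> int:
    """Run one round of the workload and return the final record size."""
    record = Path("outputs/generation_record") / f"{dataset}.jsonl"
    gen_cmd = ["python", "-m", "hard_query_gen", "--dataset", dataset,
               "--model_id", "gpt-4o", "--queries_per_doc", "1",
               "--num_docs", "50", "--prompt_id", "hq_gen"]
    previous = count_entries(record)
    for _ in range(max_runs):
        run(gen_cmd)  # step one: generate hard queries in batches of 50
        current = count_entries(record)
        # Stop when the record stops growing or reaches the documented target.
        if current == previous or current >= TARGET_ENTRIES[dataset]:
            break
        previous = current
    run(["python", "hq_to_hard_neg_doc.py"])                # step two
    run(["python", "qrel_maker.py", "--dataset", dataset])  # step three
    total = count_entries(record)
    record.unlink(missing_ok=True)  # step four: reset the record for the next round
    return total
```

To execute the real commands, pass something like `lambda cmd: subprocess.run(cmd, check=True)` as `run` and call `run_round` five times per dataset. Passing `print` instead gives a dry run that only shows the commands.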
