COMPSCI-646-group

A Study of Hard Negative Samples on Reasoning-aware Retrieval Models

🧑‍🏫 For graders

To check the status of our codebase, see CODEBASE.md.

📕 Setup

⚠️ I strongly recommend installing everything in a dedicated Python environment (Python >= 3.10).
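A quick way to confirm that the active environment meets the version requirement above is a small check like the following. This is a minimal sketch; the helper name `python_is_supported` is made up for illustration and is not part of this repository.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not python_is_supported():
    # Only a warning; nothing is installed or changed here.
    print(f"Python {sys.version.split()[0]} is older than 3.10; switch environments first.")
```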

This is the newer guide, which shows how to generate better negative samples using an LLM.

Step 1: Create a new file named '.env' in the synthetic_data_generation folder and enter:

OPENAI_API_KEY = [The key]

Replace [The key] with an OpenAI API key. You can use the one provided by Jianming; it is on our Notion page under Misc.
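To sanity-check the file before running anything that calls the API, something like the sketch below can help. It is not how the project itself loads the key (that mechanism is not described here); `read_env_file` and `has_openai_key` are hypothetical helpers that only parse simple `KEY = VALUE` lines using the standard library.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY = VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around '=' and optional surrounding quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(path=".env"):
    """Return True if the file exists and defines a non-empty OPENAI_API_KEY."""
    env_path = Path(path)
    if not env_path.is_file():
        return False
    return bool(read_env_file(env_path).get("OPENAI_API_KEY"))
```

Calling `has_openai_key("synthetic_data_generation/.env")` from the repository root should return `True` once Step 1 is done.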

Step 2: from synthetic_data_generation, run

pip install -r requirements.txt

Step 3:

if you are using Linux, run (not tested):

bash setup_java_linux.sh

if you are using Windows, run:

bash setup_java_winos.sh

You will very likely encounter errors along the way. Once you have resolved them, everything should be installed and you can proceed to the commands below.

📖Guide 1: Generate Hard Query

First, make sure you cd to synthetic_data_generation/MyCode.

Second, make sure you are logged in to Hugging Face. Open data_reader.py and uncomment

# login()

Run this line next:

python data_reader.py

It will ask for your Hugging Face token. Enter your token and you are logged in. After that, you can comment that line out again if you want.

Below is the general template for generating hard queries.

python -m hard_query_gen --dataset $DATASET --model_id $MODEL --queries_per_doc $queries_per_doc --num_docs $num_docs --filter fineweb --prompt_id $prompt_id

Example usage:

python -m hard_query_gen --dataset MedicalSciences --model_id gpt-4o --queries_per_doc 3 --num_docs 2 --prompt_id "hq_gen"

This command generates 6 tuples (3 queries for each of 2 documents) of (hard query, positive document, negative document) based on the MedicalSciences dataset. With --filter fineweb, as in the template above, low-score documents are discarded and 1 is kept in the end. "hq_gen" stands for hard-query generation. You can find the generated outputs in outputs/synthetic_questions/all_docs_train_data/hq_gen/gpt-4o/MedicalSciences_2_train_data.jsonl
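To inspect the generated file, a minimal sketch like this one loads it line by line. It assumes only that the output is standard JSON Lines (one JSON object per line), as the `.jsonl` extension suggests; the field names inside each record are not documented here, so the sketch does not rely on them. `load_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return one parsed JSON object per non-empty line of the file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For the example command above, `len(load_jsonl(...))` on the output path should be 6 if each line holds one tuple, which is an assumption rather than something this README states.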

💡 Note that the positive and negative documents in this step are produced by BM25, and we won't use the negative document from this step. The hard negative documents are generated by supplement_negative_passage.py.

⚠️ For production, I highly recommend setting --num_docs to 50 and running the command multiple times until you reach your goal. Generating everything in one run is not encouraged.

Command arguments

argument	type	default
--dataset	str	MedicalSciences
--mode	str	None
--model_id	str	gpt-4o
--queries_per_doc	int	3
--num_docs	int	None
--filter	str	None
--output_dir	str	Don't change
--cache_dir	str	Don't change
--prompt_id	str	hq_gen
--temperature	float	0
--top_p	float	0
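For readers who want to see how the table maps onto a command line, here is an illustrative `argparse` declaration with the same names, types, and defaults. It is a sketch reconstructed from the table only, not the actual parser in hard_query_gen, which may differ. `--output_dir` and `--cache_dir` are left out because their real defaults are not listed.

```python
import argparse


def build_parser():
    """Build a parser mirroring the argument table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="hard_query_gen")
    parser.add_argument("--dataset", type=str, default="MedicalSciences")
    parser.add_argument("--mode", type=str, default=None)
    parser.add_argument("--model_id", type=str, default="gpt-4o")
    parser.add_argument("--queries_per_doc", type=int, default=3)
    parser.add_argument("--num_docs", type=int, default=None)
    parser.add_argument("--filter", type=str, default=None)
    parser.add_argument("--prompt_id", type=str, default="hq_gen")
    parser.add_argument("--temperature", type=float, default=0.0)
    parser.add_argument("--top_p", type=float, default=0.0)
    return parser


# Parsing an explicit list avoids reading the real command line.
args = build_parser().parse_args(["--num_docs", "50", "--filter", "fineweb"])
```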

Dataset

dataset
MedicalSciences
PMCTreatment
IIYiClinical

📖Guide 2: Generate Hard Negative Documents

First, make sure you cd to synthetic_data_generation/MyCode.

Run the following line.

python hq_to_hard_neg_doc.py

Check the outputs/negative_passages folder for the results.

You can configure the following variables in hq_to_hard_neg_doc.py.

# ================================================
data_path = "./outputs/synthetic_questions/all_docs_train_data/hq_gen/gpt-4o"
output_path = "./outputs/negative_passages"
num_workers: int = None
# ================================================

variable	usage
data_path	the input folder containing results of hard queries
output_path	where generated hard negative documents should go
num_workers	not important
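Before running the script, it can be useful to confirm that the input folder actually contains hard-query files. The sketch below assumes the inputs are `.jsonl` files directly under `data_path`, which matches the output path shown in Guide 1. Whether hq_to_hard_neg_doc.py creates `output_path` on its own is not stated, so `ensure_output_dir` is just a harmless precaution. Both helper names are hypothetical.

```python
from pathlib import Path


def list_hard_query_files(data_path):
    """Return the sorted .jsonl files directly under data_path."""
    return sorted(Path(data_path).glob("*.jsonl"))


def ensure_output_dir(output_path):
    """Create output_path (and any missing parents) and return it as a Path."""
    out = Path(output_path)
    out.mkdir(parents=True, exist_ok=True)
    return out
```

An empty list from `list_hard_query_files` usually means Guide 1 has not been run yet, or the script is being run from a directory other than MyCode.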

📖Guide 3: Indexing Hard Negative Document for qrel

First, make sure you cd to synthetic_data_generation/MyCode.

Run the following line.

python qrel_maker.py --dataset MedicalSciences

Command arguments

argument	type	default
--dataset	str	MedicalSciences

Dataset

dataset
MedicalSciences
PMCTreatment
IIYiClinical

🚀Workload

Assuming you are doing 'MedicalSciences'.

First, cd to MyCode. Run the following command multiple times, checking MyCode/outputs/generation_record/MedicalSciences.jsonl after each run. Stop once the number of entries no longer increases. This ensures each query gets exactly one hard negative document.

python -m hard_query_gen --dataset MedicalSciences --model_id gpt-4o --queries_per_doc 1 --num_docs 50 --prompt_id "hq_gen"

As a stopping point, you can move on to the next step once the record has the following number of entries.

dataset	number of entries
MedicalSciences	88
PMCTreatment	150
IIYiClinical	129
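The "run repeatedly until the record stops growing" step can be automated along the lines of the sketch below. It assumes one record entry per non-empty line of the `.jsonl` file and must be run from inside MyCode. The function names are hypothetical, and the loop takes its run and count steps as parameters so it can be tried without calling the API.

```python
import subprocess
import sys
from pathlib import Path


def count_entries(record_path):
    """Count non-empty lines in the record file (0 if it does not exist)."""
    path = Path(record_path)
    if not path.is_file():
        return 0
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def run_until_stable(run_once, count, target=None, max_runs=20):
    """Call run_once() until count() stops increasing or reaches target."""
    previous = count()
    for _ in range(max_runs):
        if target is not None and previous >= target:
            break
        run_once()
        current = count()
        if current <= previous:  # no new entries: the record is complete
            break
        previous = current
    return previous


def generate_hard_queries(dataset, num_docs=50):
    """Run one batch of the generation command shown in this section."""
    subprocess.run(
        [sys.executable, "-m", "hard_query_gen", "--dataset", dataset,
         "--model_id", "gpt-4o", "--queries_per_doc", "1",
         "--num_docs", str(num_docs), "--prompt_id", "hq_gen"],
        check=True,
    )
```

A possible call is `run_until_stable(lambda: generate_hard_queries("MedicalSciences"), lambda: count_entries("outputs/generation_record/MedicalSciences.jsonl"), target=88)`. Passing `target` avoids the one extra run that is otherwise needed to notice that the count has stopped growing.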

Second, run

python hq_to_hard_neg_doc.py

Third, run

python qrel_maker.py --dataset MedicalSciences

Fourth, delete the jsonl file (MyCode/outputs/generation_record/MedicalSciences.jsonl) and repeat the previous three steps to generate one more hard negative document for each query.
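Deleting the record between rounds can be scripted as in this minimal sketch. It assumes the default `record_dir` is resolved relative to MyCode, matching the path given above; `reset_generation_record` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path


def reset_generation_record(dataset, record_dir="outputs/generation_record"):
    """Delete <record_dir>/<dataset>.jsonl; return True if a file was removed."""
    record = Path(record_dir) / f"{dataset}.jsonl"
    if record.is_file():
        record.unlink()
        return True
    return False
```

Only the record file is removed; the generated queries and negative passages under outputs/ are left untouched.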

So, in total, you need to do this five times for one dataset.

In the end, each query in the qrels has five hard negative documents, each based on a randomly selected passage associated with that query.
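As a rough size estimate, the arithmetic can be written out as below. It assumes that each entry in the generation record corresponds to one query and that every round adds exactly one hard negative per entry. The README implies both but does not state them outright.

```python
# Record sizes at the stopping point, taken from the table in this section.
ENTRIES_PER_ROUND = {
    "MedicalSciences": 88,
    "PMCTreatment": 150,
    "IIYiClinical": 129,
}
ROUNDS = 5  # the workload is repeated five times per dataset


def total_hard_negatives(entries_per_round, rounds=ROUNDS):
    """Return the expected number of hard negatives per dataset."""
    return {name: count * rounds for name, count in entries_per_round.items()}


totals = total_hard_negatives(ENTRIES_PER_ROUND)
for name, total in totals.items():
    print(f"{name}: {total}")
# MedicalSciences: 440
# PMCTreatment: 750
# IIYiClinical: 645
```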

JianxinLin28/COMPSCI-646-group
