Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

LuckyyySTA/GOLF

Folders and files

NameName
Last commit message
Last commit date

Latest commit

History

3 Commits

Repository files navigation

GOLF: Guidance-Optimized Learning with Feedback

Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Paper GitHub

GOLF is an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. It aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedbacks are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities.

This repository supports fuzzy tasks (e.g. chat) and verifiable tasks (math, code, IF) with task-specific reward and critique pipelines built on verl and AMPO-style hybrid GRPO.


Table of Contents


Installation

Requirements: Python 3.10, PyTorch, CUDA, vLLM (for rollout). We recommend a dedicated conda environment.

conda create -n golf python=3.10
conda activate golf
cd golf
cd verl
# For FSDP (no Megatron):
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
# For Megatron-backed training, use: bash scripts/install_vllm_sglang_mcore.sh
cd ..
pip install -r requirements.txt

Repository Structure

GOLF/
├── golf/ # Core training code (built on verl)
│ └── verl/
│ └── verl/adaptive_mix_src/ # GOLF trainer, critique refiner, reward
├── data/ # Data preparation scripts
├── exp_scripts/ # Training launch scripts
│ ├── critique_grpo_hybrid_math.sh # Verifiable: math
│ ├── critique_grpo_hybrid_if.sh # Verifiable: instruction following (IF)
│ ├── critique_grpo_hybrid_code.sh # Verifiable: code
│ └── critique_grpo_hybrid_wildchat.sh # Fuzzy: wildchat / chat
├── eval_scripts/ # Evaluation and generation
│ ├── eval_math.sh # Verifiable math eval
│ ├── eval_fuzzy.sh # Fuzzy benchmark eval (RLMT-style)
│ ├── generate_vllm.py
│ └── ...
└── README.md

Data Preparation

Data format and preprocessing per task:

Task Reference
Fuzzy (chat / instruction following) RLMT
Math critique-GRPO
Code SDPO
IF allenai/IF_multi_constraints_upto5

Training

Scripts assume repo root at $PROJECT_ROOT/GOLF. Set PROJECT_ROOT, MODEL_PATH, TRAIN_FILE, TEST_FILE (and for IF: IFEVAL_VAL_FILE, IFBENCH_VAL_FILE) as needed. Optional: export WANDB_API_KEY=your_key or WANDB_MODE=disabled.

Prepare data per task using the Data Preparation references above, then run:

Verifiable: Math

export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Qwen3-8B
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/openr1_math_4k_train.parquet
export TEST_FILE=$PROJECT_ROOT/GOLF/data/openr1_math_4k_test.parquet
bash exp_scripts/critique_grpo_hybrid_math.sh

Verifiable: Instruction Following (IF)

export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Qwen3-4B
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/if_train.parquet
export IFEVAL_VAL_FILE=$PROJECT_ROOT/GOLF/data/ifeval_test.parquet
export IFBENCH_VAL_FILE=$PROJECT_ROOT/GOLF/data/ifbench_test.parquet
bash exp_scripts/critique_grpo_hybrid_if.sh

Verifiable: Code

export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Qwen3-8B
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/lcb_v6_train.parquet
export TEST_FILE=$PROJECT_ROOT/GOLF/data/lcb_v6_test.parquet
bash exp_scripts/critique_grpo_hybrid_code.sh

Fuzzy: Wildchat / chat

export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Llama-3.1-8B-Instruct
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/wildchat-if_train.parquet
export TEST_FILE=$PROJECT_ROOT/GOLF/data/wildchat-if_val.parquet
bash exp_scripts/critique_grpo_hybrid_wildchat.sh

Checkpoints: $PROJECT_ROOT/GOLF/checkpoints/<model_name>/golf/<exp_name>/. Merge FSDP shards via eval_scripts/model_merge.sh when needed.


Evaluation

Verifiable: Math

Run inference then score (e.g. with Math-Verify or your validator). Set PROJECT_ROOT, EVAL_DATA, EVAL_OUTPUT_DIR, and the MODEL_PATHS array to your merged checkpoints.

export PROJECT_ROOT=/path/to/your/projects
# Edit eval_scripts/eval_math.sh: set MODEL_PATHS and MODEL_NAMES to your checkpoints
bash eval_scripts/eval_math.sh

Then run your preferred math metric on the generated *.jsonl under EVAL_OUTPUT_DIR.

Verifiable: Code / IF

Use the same pattern: point the eval scripts to your checkpoint dirs and data, run generate_vllm.py (or equivalent), then run task-specific scoring (e.g. pass@k for code, IFEval/IFBench for IF).

Fuzzy (RLMT-style)

For fuzzy benchmarks (e.g. creative writing, WildBench, arena), use:

export PROJECT_ROOT=/path/to/your/projects
export OPENAI_BASE_URL=http://your-vllm-server:80/v1 # or local vLLM
# Edit eval_scripts/eval_fuzzy.sh: set MODELS, MODEL_NAMES, BENCHMARKS
bash eval_scripts/eval_fuzzy.sh

Benchmark list and scoring follow the same spirit as RLMT; adjust BENCHMARKS and paths as in the script.


Acknowledgement

GOLF builds on the following projects:

  • AMPO — "More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration"; we adopt and extend the adaptive multi-guidance and hybrid training ideas.
  • verl — Volcano Engine Reinforcement Learning for LLMs; our training stack is built on verl’s GRPO/PPO and infrastructure.

We also thank RLMT, critique-GRPO, and SDPO, Math-Verify for data, benchmarks, and tooling.


Citation

If you use GOLF or this code, please cite:

@misc{huang2026bootstrappingexplorationgrouplevelnatural,
 title = {Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning},
 author = {Lei Huang and Xiang Cheng and Chenxiao Zhao and Guobin Shen and Junjie Yang and Xiaocheng Feng and Yuxuan Gu and Xing Yu and Bing Qin},
 year = {2026},
 eprint = {2603.04597},
 archivePrefix = {arXiv},
 primaryClass = {cs.CL},
 url = {https://arxiv.org/abs/2603.04597},
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

Contributors

Languages

AltStyle によって変換されたページ (->オリジナル) /