Name	Name	Last commit message	Last commit date
Latest commit History 244 Commits
AEPO	AEPO
ARPO	ARPO
LLaMA-Factory	LLaMA-Factory
assets	assets
evaluation	evaluation
.DS_Store	.DS_Store
.gitignore	.gitignore
README.md	README.md
logo1.png	logo1.png

✨ Agentic Reinforced Policy Optimization

If you like our project, please give us a star ⭐ on GitHub for the latest update.

Note

This project includes the codebase, datasets and chckpoints for two RL algorithms: Agentic Reinforced Policy Optimization (ARPO) and Agentic Entropy-Balanced Policy Optimization (AEPO). We will continue to iterate and expand our Agentic RL series.

📣 Latest News

[Jan 26, 2026]: 🎉 Our paper Agentic Reinforced Policy Optimization has been accepted at ICLR 2026!
[Jan 20, 2026]: 🎉 Our paper Agentic Entropy-Balanced Policy Optimization has been accepted at WWW 2026 (Oral)!
[Dec 20, 2025]: 🚀🚀🚀 We released AEPO-32B and ARPO-32B (based on QwQ-32B), achieving 53.4/12.8 and 51.5/11.2 on GAIA/HLE.
[Nov 03, 2025]: The brief introduction of AEPO can be found on a series of platforms like X, WeChat .
[Oct 17, 2025]: 📄 Our AEPO paper is now available on arXiv and Hugging Face daily paper.
[Oct 16, 2025]: 🚀🚀🚀 We propose a new algorithm AEPO, which focused on entropy-balanced agentic RL and consistently outperforms ARPO on datasets like GAIA, HLE, and AIME. Full codebase and 🤗 HF-Models of AEPO released.
[Aug 11, 2025]: The brief introduction of ARPO can be found on a series of platforms like X, WeChat, Zhihu, YouTube and Xiaohongshu .
[July 29, 2025]: 🔥 We are honored to be featured as 🤗 HuggingFace Daily Paper #1 and Weekly Paper #1 .
[July 29, 2025]: 📄 Our ARPO paper is now available on arXiv and Hugging Face daily paper.
[July 25, 2025]: 🔥 We released all our ARPO model checkpoints (3B~14B) and datasets(SFT, RL, Evaluation). Checkout 🤗ARPO Collection here. We will keep update it!
[July 25, 2025]: We have implemented extensive tool-call acceleration and memory optimization during RL training in ARPO.(Training Qwen3-14B in 1 node with a batch size of 128 takes only 10 minutes per step!!! we also maintain a dynamic cache mechanism to save your tool call results in real-time!!)
[July 25, 2025]: 🚀🚀🚀 Full codebase of ARPO released. ARPO supports multi-tool agentic RL training for the Qwen2.5, 3 and Llama3 models in 🤗 HF-Models .

🔥 Agentic RL Family

👏 Welcome to try our agentic RL series of algorithms:

Agentic Entropy-Balanced Policy Optimization
Authors: Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou†, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou†
TLDR: An agentic RL algorithm designed to balance entropy in both the rollout and policy update phases.
github github arXiv Paper Collection X (formerly Twitter) URL

Agentic Reinforced Policy Optimization
Authors: Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao , Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou†, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou†
TLDR: An agentic RL algorithm encourage the policy model to adaptively branch sampling during high-entropy tool-call rounds,
github github arXiv Paper Collection X (formerly Twitter) URL

Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
Authors: Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou†, Ji-Rong Wen
TLDR: An end-to-end TIR post-training framework that empowers LLMs to autonomously interact with multi-tool environments through Self-Critic RL design
github github arXiv Paper Collection X (formerly Twitter) URL

DeepAgent: A General Reasoning Agent with Scalable Toolsets (New!)
Authors: Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
TLDR: An end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution with brain-inspired memory folding mechanism.
github github arXiv Paper

📦 Dataset & Model Zoo

Dataset	Download
54K Agentic SFT Data	🤗 HuggingFace
1K Agentic Deep Search RL Data	🤗 HuggingFace
10K Agentic Reasoning RL Data	🤗 HuggingFace

Model(ARPO)	Download
Qwen3-8B-ARPO-DeepSearch	🤗 HuggingFace
Qwen3-14B-ARPO-DeepSearch	🤗 HuggingFace
QwQ-32B-ARPO-DeepSearch	🤗 HuggingFace
Qwen2.5-3B-ARPO	🤗 HuggingFace
Qwen2.5-7B-ARPO	🤗 HuggingFace
Llama3.1-8B-ARPO	🤗 HuggingFace

Model(AEPO)	Download
Qwen3-8B-AEPO-DeepSearch	🤗 HuggingFace
Qwen3-14B-AEPO-DeepSearch	🤗 HuggingFace
QwQ-32B-AEPO-DeepSearch	🤗 HuggingFace
Qwen2.5-7B-AEPO	🤗 HuggingFace

Overview
Quick Start
Citation

💡 Overview

AEPO (🔥New!)

We propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components:

image

Dynamic Entropy-Balanced Rollout Mechanism that adaptively allocates global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues;
Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens (Entropy Clipping-Balanced Mechanism), while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens (Entropy-aware Advantage Estimation).

ARPO

We propose Agentic Reinforced Policy Optimization (ARPO), an agentic RL algorithm tailored for training multi-turn LLM-based agent. The core principle of ARPO is to encourage the policy model to adaptively branch sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors.

intro

In figure (left), The initial tokens generated by the LLM after receiving each round of tool-call feedback consistently exhibit a high entropy. This indicates that external tool-call significantly introduces uncertainty into the LLM’s reasoning process.
In the figure (right), we validate ARPO's performance across 13 datasets. Notably, Qwen3-14B with ARPO excelled in Pass@5, achieving 61.2% on GAIA and 24.0% on HLE, while requiring only about half the tool calls compared to GRPO during training.

🏃 Quick Start

Reproducing ARPO/AEPO requires three steps: cold start fine-tuning (optional), ARPO/AEPO training, and evaluation. Below, we will provide a detailed explanation.

❄️ Cold-Start SFT Stage (Optional)

This stage is meant to help you reproduce our experimental results. If your want to RL from scratch, you can skip this stage.

1. Environment Setup

In this step, we will describe how to perform a cold start for the SFT stage using the LLaMA Factory repository. First, set up the environment as follows:

# Clone the ARPO repository (which includes LLaMA-Factory)
git clone https://github.com/dongguanting/ARPO
cd ARPO/LLaMA-Factory
# Create a new conda environment
conda create -n sft python=3.10
conda activate sft
# Install dependencies
pip install -r requirements.txt

2. Fine-Tuning Model

Download your SFT dataset from 🤗ARPO-SFT-54K and place it in LLaMA-Factory-main/data/final_sft_edition9.json. Define the dataset in dataset_info.json.
Configure Training

Update LLaMA-Factory/arpo_train_sft/yaml with the following content:

Training Configuration (click to expand)

### model
model_name_or_path: <your_model_path>
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: ../examples/deepspeed/ds_z3_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
### dataset
dataset_dir: dataset_info
dataset: <your_dataset>
template: qwen
cutoff_len: 15000
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: <your_output_dir>
logging_steps: 10
save_steps: 2000
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 7.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

Also, update the output directory in arpo_train_sft/sft_train.sh:

# Output directory
OUTPUT_DIR="<your_output_dir>"

After completing the information, you can fine-tune the model using the following command:

bash arpo_train_sft/sft_train.sh

🔥 ARPO/AEPO Stage

In this step, we will load the cold-start data for GRPO training. We reference the ReCall and VERL frameworks for RL training.

1. Environment Setup

you can install our additional environment as follow:

#create env
conda create -n arpo python==3.10
conda activate arpo
# install torch & flash-atten
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
# install RL basic env
cd ARPO
# This is our RL env freeze file. You can install it as a supplement or use it for checking.
pip install -r requirements.txt

2. Preparation

2.1 Data Preparation

In our paper, we offer two type of train & validation datasets to verify the effectiveness of ARPO:

Reasoning and Knowledge Dataset: This dataset is used to test the benchmarks listed in Table 1.
- train_10k.parquet: Contains 10K samples for mathematical and knowledge reasoning.
- test.parquet: Comprises 300 test samples from 8 datasets, including AIME24, AIME25, MATH500, GSM8k, HotpotQA, 2Wiki, Misque, and Bamboogle.
Deep Search Dataset: This dataset is used to test the benchmarks listed in Table 2.
- hard_search.parquet: Contains 1K samples, including 800 samples from simpledeepsearch and 200 samples from webdancer.
- gaia_test.parquet/hle_test.parquet: Contains test samples from GAIA and Humanity Last Exam (HLE).

2.2 API Key Configuration

Our search api tool utilizes Bright Data (A third-party Bing API, without the retirement risk of official Bing API). Before starting the training, please replace the API key and zone in the following files: ARPO/scripts/config/ppo_trainer_dr.yaml and ARPO/scripts/config/ppo_trainer.yaml.

Additionally, please also replace the API key and zone in the following file: /verl_arpo_entropy/verl/workers/rollout/tools/config_example.yaml. Below is the instruction on how to do this:

🔍 Click here! Watch the details of tool API configuration YAML

tools:
 # General tool configuration
 call_limit: 3 # Maximum number of tool calls allowed per sample
 max_workers: 64 # Maximum number of threads for concurrent tool execution
 timeout: 120 # Tool execution timeout (seconds)
 retry_count: 3 # Number of retry attempts for tool execution failures
 verbose_logging: true # Enable detailed logging
 fail_on_error: false # Throw an exception if tool loading fails
 
 # Tool instance definitions
 tool_instances:
 python: 
 class_path: verl.workers.rollout.tools.python_tool.PythonTool # Tool class path
 params: # Tool-specific parameters
 conda_path: /path/to/conda
 conda_env: verl
 
 search:
 class_path: verl.workers.rollout.tools.search_tool.BingSearchTool
 params:
 api_key: <your_API_key> # Replace with your Bright Data API key
 zone: <your_zone> # Replace with your Bright Data zone
 max_results: 10
 result_length: 1000
 location: cn

Make sure to replace <your_API_key> and <your_zone> with your actual Bright Data API key and zone. This configuration ensures that the search tool is properly set up to perform searches during the training process. If you have any questions or need further assistance, feel free to ask!

3. ARPO RL Training

We have open-sourced a series of ARPO scripts located in the /ARPO/scripts/ directory, which includes configurations for 7B, 8B, and 14B models. Below is an example of how to set up and run training for training ARPO. Make sure to replace placeholders like <your_path_to_ARPO>, <your_model_path>, and <your_checkpoint_save_dir> with your actual paths.

🔍 Click here! Watch the details of train bash

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
PARENT_DIR="$(dirname "$SCRIPT_DIR")"
cd "$PARENT_DIR"
echo "Switched to parent directory: $PARENT_DIR"
# ============================ Environment Setup ============================
# Set basic environment variables
export PYTHONUNBUFFERED=1 
export HYDRA_FULL_ERROR=1 
export VLLM_ATTENTION_BACKEND=XFORMERS 
export VERL_LOGGING_LEVEL=DEBUG
export MKL_SERVICE_FORCE_INTEL=1 
export MKL_THREADING_LAYER=GNU 
export RAY_memory_usage_threshold=0.8 
export RAY_memory_monitor_refresh_ms=0 
# Set Python path
export PYTHONPATH="<your_path_to_ARPO>"/verl_arpo_entropy:$PYTHONPATH
# ============================ Basic Configuration ============================
# Experiment name and project
PROJECT_NAME="reasoning_tasks" # Modify experiment group
EXPERIMENT_NAME="ARPO_global_16_init_8_beam_2_random_0_arpo_0.2_entropy" # Modify experiment name
# Configuration file path
CONFIG_PATH="<your_path_to_ARPO>/scripts/config" # Modify the absolute path of the config folder, relative path is not recommended
CONFIG_NAME="ppo_trainer.yaml"
# Distributed training settings
NNODES=1 
N_GPUS_PER_NODE=8 
# ============================ Data Configuration ============================
# Data parameters
PROMPT_KEY="prompt" # Prompt field name
TRAIN_BATCH_SIZE=128 # Training batch size
PPO_MINI_BATCH_SIZE=16 # PPO mini-batch size
MAX_PROMPT_LENGTH=1536 # Maximum prompt length
MAX_RESPONSE_LENGTH=4096 # Maximum response length
# Data file paths
TRAIN_FILES="<your_path_to_ARPO>/rl_datasets/train.parquet" # Modify training data path
VALID_FILES="<your_path_to_ARPO>/rl_datasets/valid.parquet" # Modify validation data path
# ============================ Model Configuration ============================
# Actor model path
ACTOR_MODEL_PATH="<your_model_path>" # Modify training model path
# ============================ Rollout Configuration ==========================
# Rollout settings
ROLLOUT_NAME="vllm" # Use vllm engine
ROLLOUT_MODE="sync_with_tool" # Synchronous mode with tool support
ROLLOUT_N=16 # Number of responses generated per sample
INITIAL_ROLLOUTS=8 # Initial rollout number
BEAM_SIZE=2 # Beam size
BRANCH_PROBABILITY=0.5 # Branch probability
Entropy_weight=0.2
# ============================ Rollout Tools Configuration ==========================
SEARCH_CACHE_PATH="<your_path_to_ARPO>/search_cache/search_cache.json" # Modify
# ============================ Reward Model Configuration ==========================
# Reward model settings
REWARD_MANAGER="naive" # Reward manager type
CUSTOM_REWARD_FUNCTION_PATH="<your_path_to_ARPO>/verl_arpo_entropy/verl/utils/reward_score/deep_research.py" # Modify reward function path
CUSTOM_REWARD_FUNCTION_NAME="compute_score"
# ============================ Training Configuration ============================
# Training parameters
TOTAL_EPOCHS=2 # Total training epochs
SAVE_FREQ=5 # Save frequency
TEST_FREQ=5 # Test frequency
# ============================ Path Configuration ============================
# Save path
SAVE_PATH="<your_checkpoint_save_dir>/${EXPERIMENT_NAME}" # Modify save path
ROLLOUT_SAVE_PATH="${SAVE_PATH}/rollout"
# ============================ WandB Configuration ============================
# WandB settings
WANDB_API_KEY="<your_wandb_key>" # Modify your wandb key
SEARCH_CLASS_PATH="verl.workers.agent.tools.search_tool.BingSearchTool"
# ============================ Preparation ============================
# Login to WandB (if API key is provided)
if [ "$WANDB_API_KEY" != "" ]; then
 wandb login --relogin $WANDB_API_KEY
 export WANDB_DIR=${SAVE_PATH}
fi
# Create save directory
if [ ! -d "$SAVE_PATH" ]; then
 mkdir -p $SAVE_PATH
fi
# Create rollout save directory
if [ ! -d "$ROLLOUT_SAVE_PATH" ]; then
 mkdir -p $ROLLOUT_SAVE_PATH
fi
# ============================ Start Training ============================
python3 -m verl.trainer.main_ppo \
 --config-path=$CONFIG_PATH \
 --config-name=$CONFIG_NAME \
 algorithm.adv_estimator=grpo \
 algorithm.kl_ctrl.kl_coef=0.0 \
 data.train_files=${TRAIN_FILES} \
 data.val_files=${VALID_FILES} \
 data.prompt_key=${PROMPT_KEY} \
 data.train_batch_size=${TRAIN_BATCH_SIZE} \
 data.max_prompt_length=${MAX_PROMPT_LENGTH} \
 data.max_response_length=${MAX_RESPONSE_LENGTH} \
 actor_rollout_ref.model.path=${ACTOR_MODEL_PATH} \
 actor_rollout_ref.model.enable_gradient_checkpointing=True \
 actor_rollout_ref.model.use_remove_padding=True \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \
 actor_rollout_ref.actor.use_dynamic_bsz=True \
 actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$((2*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
 actor_rollout_ref.actor.use_kl_loss=True \
 actor_rollout_ref.actor.kl_loss_coef=0.0 \
 actor_rollout_ref.actor.kl_loss_type=low_var_kl \
 actor_rollout_ref.actor.fsdp_config.param_offload=False \
 actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
 actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.name=${ROLLOUT_NAME} \
 actor_rollout_ref.rollout.mode=${ROLLOUT_MODE} \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
 actor_rollout_ref.rollout.n=${ROLLOUT_N} \
 actor_rollout_ref.rollout.initial_rollouts=${INITIAL_ROLLOUTS} \
 actor_rollout_ref.rollout.beam_size=${BEAM_SIZE} \
 actor_rollout_ref.rollout.branch_probability=${BRANCH_PROBABILITY} \
 actor_rollout_ref.rollout.entropy_weight=${Entropy_weight} \
 actor_rollout_ref.rollout.tools.tool_instances.search.params.cache_file=${SEARCH_CACHE_PATH} \
 actor_rollout_ref.rollout.tools.tool_instances.search.class_path=${SEARCH_CLASS_PATH} \
 actor_rollout_ref.rollout.multi_turn.enable=${ENABLE_MULTI_TURN} \
 actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
 actor_rollout_ref.ref.fsdp_config.param_offload=True \
 reward_model.reward_manager=${REWARD_MANAGER} \
 custom_reward_function.path=${CUSTOM_REWARD_FUNCTION_PATH} \
 custom_reward_function.name=${CUSTOM_REWARD_FUNCTION_NAME} \
 trainer.critic_warmup=0 \
 trainer.logger="[console, wandb]" \
 trainer.project_name=${PROJECT_NAME} \
 trainer.experiment_name=${EXPERIMENT_NAME} \
 trainer.n_gpus_per_node=${N_GPUS_PER_NODE} \
 trainer.nnodes=${NNODES} \
 trainer.save_freq=${SAVE_FREQ} \
 trainer.test_freq=${TEST_FREQ} \
 trainer.total_epochs=${TOTAL_EPOCHS} \
 trainer.default_local_dir=${SAVE_PATH} \
 trainer.val_before_train=False \
 trainer.rollout_data_dir=${ROLLOUT_SAVE_PATH} \
 hydra.run.dir=${SAVE_PATH}/outputs 2>&1 | tee ${SAVE_PATH}/run.log

You can then run the following script to start training:

cd ./ARPO/scripts/
bash ARPO_7B_Reasoning_1node.sh

For the trained RL checkpoint, you can follow the code below to convert the weights to Hugging Face format:

bash ./ARPO/merge_ckpt/convert_checkpoint_from_verl_to_hf_qwen3.sh

4. AEPO RL Training

We have open-sourced a series of AEPO scripts located in the /AEPO/scripts/ directory, which includes configurations for 7B and 14B models. Below is an example of how to set up and run training for training AEPO. Make sure to replace placeholders like <your_path_to_AEPO>, <your_model_path>, and <your_checkpoint_save_dir> with your actual paths. Note that AEPO reuses the same dataset and search cache from the ARPO folder, so please ensure the related paths are correctly set.

You can modify the hyperparameters in our scripts to enable different modules of AEPO described in our paper:

ENABLE_DYNAMIC_ROLLOUTS: Whether to enable the Dynamic Entropy-Balanced Rollout Mechanism, defaults to False
ENABLE_ENTROPY_BALANCED_CLIPPING: Whether to enable the Entropy Clipping-Balanced Mechanism.
ENABLE_ENTROPY_BALANCED_ADVANTAGE: Whether to enable Entropy-aware Advantage Estimation.

🔍 Click here! Watch the details of train bash

# Switch to the directory of the script
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
PARENT_DIR="$(dirname "$SCRIPT_DIR")"
cd "$PARENT_DIR"
echo "Switched to parent directory: $PARENT_DIR"
# ============================ Environment Setting ============================
# Set basic environment variables
export PYTHONUNBUFFERED=1
export HYDRA_FULL_ERROR=1 
export VLLM_ATTENTION_BACKEND=XFORMERS 
export VERL_LOGGING_LEVEL=DEBUG
export MKL_SERVICE_FORCE_INTEL=1 
export MKL_THREADING_LAYER=GNU 
export RAY_memory_usage_threshold=0.8 
export RAY_memory_monitor_refresh_ms=0 
# Set Python path
export PYTHONPATH=${PARENT_DIR}/verl_aepo_entropy:$PYTHONPATH
# ============================ Basic Configuration ============================
# Experiment name and project
PROJECT_NAME="deep_research"
EXPERIMENT_NAME="aepo_qwen3_14b_deepresearch"
# Configuration file path
CONFIG_PATH="${PARENT_DIR}/scripts/config" # Modify the absolute path of the config folder, relative path is not recommended
CONFIG_NAME="ppo_trainer_dr.yaml"
# Distributed training settings
NNODES=1 
N_GPUS_PER_NODE=8 
# ============================ Data Configuration ============================
# Data parameters
PROMPT_KEY="prompt" # Prompt field name
TRAIN_BATCH_SIZE=64 # Training batch size
PPO_MINI_BATCH_SIZE=8 # PPO mini-batch size
MAX_PROMPT_LENGTH=2000 # Maximum prompt length
MAX_RESPONSE_LENGTH=6192 # Maximum response length
# Data file paths
TRAIN_FILES="${PARENT_DIR}/../ARPO/rl_datasets/hard_search_1k.parquet"
VALID_FILES=["${PARENT_DIR}/../ARPO/rl_datasets/gaia_test.parquet","${PARENT_DIR}/../ARPO/rl_datasets/hle_test.parquet"]
# ============================ Model Configuration ============================
# Actor model path
ACTOR_MODEL_PATH="<your_14B_model_path>"
# ============================ AEPO Configuration ============================
ENABLE_DYNAMIC_ROLLOUTS=False
ENABLE_ENTROPY_BALANCED_CLIPPING=True
ENABLE_ENTROPY_BALANCED_ADVANTAGE=True
# ============================ Rollout Configuration ==========================
# Rollout settings
ROLLOUT_NAME="vllm" # Use vllm engine
ROLLOUT_MODE="sync_with_tool" # Synchronous mode with tool support
ROLLOUT_N=12 # Number of responses generated per sample
INITIAL_ROLLOUTS=6 # Initial rollout number
BEAM_SIZE=2 # Beam size
BRANCH_PROBABILITY=0.5 # Branch probability
Entropy_weight=0.2
# ============================ Rollout Tools Configuration ==========================
SEARCH_CACHE_PATH="${PARENT_DIR}/../ARPO/search_cache/search_cache.json" # Modify
# ============================ Reward Model Configuration ==========================
# Reward model settings
REWARD_MANAGER="naive" # Reward manager type
CUSTOM_REWARD_FUNCTION_PATH="${PARENT_DIR}/verl_aepo_entropy/verl/utils/reward_score/deep_research.py"
CUSTOM_REWARD_FUNCTION_NAME="compute_score"
# ============================ Training Configuration ============================
# Training parameters
TOTAL_EPOCHS=5 # Total training epochs
SAVE_FREQ=5 # Save frequency
TEST_FREQ=5 # Test frequency
# ============================ Path Configuration ============================
# Save path
SAVE_PATH="<your_checkpoint_save_dir>/rl/${EXPERIMENT_NAME}"
ROLLOUT_SAVE_PATH="${SAVE_PATH}/rollout"
# ============================ WandB Configuration ============================
# WandB settings
WANDB_API_KEY="<your_wandb_key>" # Modify your wandb key
# ============================ Preparation ============================
# Login to WandB (if API key is provided)
if [ "$WANDB_API_KEY" != "" ]; then
 wandb login --relogin $WANDB_API_KEY
 export WANDB_DIR=${SAVE_PATH}
fi
# Create save directory
if [ ! -d "$SAVE_PATH" ]; then
 mkdir -p $SAVE_PATH
fi
# Create rollout save directory
if [ ! -d "$ROLLOUT_SAVE_PATH" ]; then
 mkdir -p $ROLLOUT_SAVE_PATH
fi
# ============================ Start Training ============================
python3 -m verl.trainer.main_ppo \
 --config-path=$CONFIG_PATH \
 --config-name=$CONFIG_NAME \
 algorithm.adv_estimator=grpo \
 algorithm.kl_ctrl.kl_coef=0.0 \
 data.train_files=${TRAIN_FILES} \
 data.val_files=${VALID_FILES} \
 data.prompt_key=${PROMPT_KEY} \
 data.train_batch_size=${TRAIN_BATCH_SIZE} \
 data.max_prompt_length=${MAX_PROMPT_LENGTH} \
 data.max_response_length=${MAX_RESPONSE_LENGTH} \
 actor_rollout_ref.model.path=${ACTOR_MODEL_PATH} \
 actor_rollout_ref.model.enable_gradient_checkpointing=True \
 actor_rollout_ref.model.use_remove_padding=True \
 actor_rollout_ref.actor.enable_entropy_balanced_clipping=${ENABLE_ENTROPY_BALANCED_CLIPPING} \
 actor_rollout_ref.actor.enable_entropy_balanced_advantage=${ENABLE_ENTROPY_BALANCED_ADVANTAGE} \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \
 actor_rollout_ref.actor.use_dynamic_bsz=True \
 actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$((2*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
 actor_rollout_ref.actor.use_kl_loss=True \
 actor_rollout_ref.actor.kl_loss_coef=0.0 \
 actor_rollout_ref.actor.kl_loss_type=low_var_kl \
 actor_rollout_ref.actor.fsdp_config.param_offload=False \
 actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
 actor_rollout_ref.rollout.enable_dynamic_rollouts=${ENABLE_DYNAMIC_ROLLOUTS} \
 actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.name=${ROLLOUT_NAME} \
 actor_rollout_ref.rollout.mode=${ROLLOUT_MODE} \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
 actor_rollout_ref.rollout.n=${ROLLOUT_N} \
 actor_rollout_ref.rollout.initial_rollouts=${INITIAL_ROLLOUTS} \
 actor_rollout_ref.rollout.beam_size=${BEAM_SIZE} \
 actor_rollout_ref.rollout.branch_probability=${BRANCH_PROBABILITY} \
 actor_rollout_ref.rollout.entropy_weight=${Entropy_weight} \
 ++actor_rollout_ref.rollout.tools.tool_instances.search.params.cache_file=${SEARCH_CACHE_PATH} \
 actor_rollout_ref.rollout.multi_turn.enable=${ENABLE_MULTI_TURN} \
 actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
 actor_rollout_ref.ref.fsdp_config.param_offload=True \
 reward_model.reward_manager=${REWARD_MANAGER} \
 custom_reward_function.path=${CUSTOM_REWARD_FUNCTION_PATH} \
 custom_reward_function.name=${CUSTOM_REWARD_FUNCTION_NAME} \
 trainer.critic_warmup=0 \
 trainer.logger="[console, wandb]" \
 trainer.project_name=${PROJECT_NAME} \
 trainer.experiment_name=${EXPERIMENT_NAME} \
 trainer.n_gpus_per_node=${N_GPUS_PER_NODE} \
 trainer.nnodes=${NNODES} \
 trainer.save_freq=${SAVE_FREQ} \
 trainer.test_freq=${TEST_FREQ} \
 trainer.total_epochs=${TOTAL_EPOCHS} \
 trainer.default_local_dir=${SAVE_PATH} \
 trainer.val_before_train=False \
 trainer.rollout_data_dir=${ROLLOUT_SAVE_PATH} \
 hydra.run.dir=${SAVE_PATH}/outputs 2>&1 | tee ${SAVE_PATH}/run.log

You can then run the following script to start training:

cd ./AEPO/scripts/
bash AEPO_Qwen3_14B_DeepResearch.sh

Same as ARPO, for the trained RL checkpoint, you can follow the code below to convert the weights to Hugging Face format:

bash ./ARPO/merge_ckpt/convert_checkpoint_from_verl_to_hf_qwen3.sh

✅ ARPO/AEPO Evaluation

If you have already trained a model, you can refer to the following process for TIR capability evaluation. Of course, you can also download our checkpoint from 🤗ARPO-Huggingface-Collection and 🤗AEPO-Huggingface-Collection for directly testing. This guide walks you through setting up two separate environments:

One for vLLM inference service (vllm_env)
One for evaluation pipeline (evaluation)

1. Setup vLLM Inference Environment

# Step into the vllm_scripts directory
cd evaluation/vllm_scripts
# Create a dedicated conda environment for vLLM
conda create -n vllm_env python=3.10
conda activate vllm_env
# Install dependencies (edit as needed)
pip install -r requirements.txt

Edit the following launch scripts with your own model paths and names:

In vllm_launch_reasoning_model_cuda4-7.sh:

MODEL_PATH="<path/to/your/reasoning_model_checkpoint>"
MODEL_NAME="your_model_name"

For summarization models (choose one):

MODEL_PATH="<path/to/your/summarization_model_checkpoint>"
MODEL_NAME="your_summarization_model_name"

Launch the vLLM services:

# Start the reasoning model
bash vllm_launch_reasoning_model_cuda4-7.sh
# Start the summarization model (choose one)
bash vllm_launch_summarize_model_cuda0-3_<your_model>.sh

2. Setup Evaluation Environment

# Create a separate environment for evaluation
conda create -n evaluation python=3.10
conda activate evaluation
# Install required packages
cd evaluation
pip install -r requirements.txt

3. Configure and Run Evaluation

Edit the infer_local_sds.sh script with the following values:

# Activate your Conda environment manually if 'conda' is not available in shell
source < /path/to/your/conda >/bin/activate
conda activate < your env name >
# Datasets to evaluate — uncomment the ones you want to include:
# Options: aime24, aime25, math500, gsm8k, math, webwalker, hotpotqa, 2wiki, bamboogle, musique, hle, gaia, SimpleQA, xbench
data_names=(
 "hle"
 "gaia"
)
# Required parameters to update:
EXP_NAME="<your_exp_name>" # Name of this experiment run
MODEL_PATH="<your_model_path>" # Path to the reasoning model
OUTPUT_PATH="<your_output_path>" # Directory to save outputs
CONDA_PATH="<your_conda_path>" # Path to your Conda installation
CONDA_ENV="<your_env_name>" # Name of your Conda environment
BING_API_KEY="<your_bing_search_api_key>" # Bing Search API key
BING_ZONE="<your_bing_zone>" # Bing API zone
SUMM_MODEL_PATH="<your_summarization_model_path>" # Path to summarization model checkpoints

For Bing API usage, please refer to Bright Data.

Run the evaluation:

bash evaluation/infer_local_sds.sh

🔸 For Chinese datasets like xbench, we recommend using Jina API for better webpage extraction.

To enable Jina Reader API, modify evaluation/src/tools/search_tool_sds.py (line ~155):
# Change from:
lambda: self.extract_text_from_url(url, use_jina=False, jina_api_key=None)
# To:
lambda: self.extract_text_from_url(url, use_jina=True, jina_api_key="your_jina_api_key")
Get your Jina API key at https://jina.ai/reader

4. Calculate Metrics

After generating inference results, you can use a large model like Qwen2.5-72B-Instruct to evaluate them with more powerful understanding capabilities.

First, use the vLLM environment to start the evaluation model:

bash evaluation/deploy_qwen2.5_72B_instruct.sh

In that script, make sure to update the vllm serve command with your own model path:

# Activate your Conda environment manually if 'conda' is not available in shell
source < /path/to/your/conda >/bin/activate
conda activate < your env name >
vllm serve <your_model_path> \
 --served-model-name Qwen2.5-72B-Instruct \
 --max-model-len 32768 \
 --tensor_parallel_size 4 \
 --gpu-memory-utilization 0.75 \
 --quantization gptq \
 --port 8001

Before running the evaluation script, update the following line in evaluate_passk.sh to specify the output directory:

OUTPUT_DIR="<your_result_directory>"

Then, run the evaluation script to calculate metrics:

bash evaluation/evaluate_passk.sh

📄 Citation

If you find this work helpful, please cite our paper:

@article{dong2025arpo,
 author = {Guanting Dong and
 Hangyu Mao and
 Kai Ma and
 Licheng Bao and
 Yifei Chen and
 Zhongyuan Wang and
 Zhongxia Chen and
 Jiazhen Du and
 Huiyang Wang and
 Fuzheng Zhang and
 Guorui Zhou and
 Yutao Zhu and
 Ji{-}Rong Wen and
 Zhicheng Dou},
 title = {Agentic Reinforced Policy Optimization},
 journal = {CoRR},
 volume = {abs/2507.19849},
 year = {2025},
 url = {https://doi.org/10.48550/arXiv.2507.19849},
 doi = {10.48550/ARXIV.2507.19849},
 eprinttype = {arXiv},
 eprint = {2507.19849},
 timestamp = {2025年8月22日 07:48:19 +0200},
 biburl = {https://dblp.org/rec/journals/corr/abs-2507-19849.bib},
 bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{dong2025aepo,
 author = {Guanting Dong and
 Licheng Bao and
 Zhongyuan Wang and
 Kangzhi Zhao and
 Xiaoxi Li and
 Jiajie Jin and
 Jinghan Yang and
 Hangyu Mao and
 Fuzheng Zhang and
 Kun Gai and
 Guorui Zhou and
 Yutao Zhu and
 Ji{-}Rong Wen and
 Zhicheng Dou},
 title = {Agentic Entropy-Balanced Policy Optimization},
 journal = {CoRR},
 volume = {abs/2510.14545},
 year = {2025},
 url = {https://doi.org/10.48550/arXiv.2510.14545},
 doi = {10.48550/ARXIV.2510.14545},
 eprinttype = {arXiv},
 eprint = {2510.14545},
 timestamp = {2025年11月14日 15:17:45 +0100},
 biburl = {https://dblp.org/rec/journals/corr/abs-2510-14545.bib},
 bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{dong2025tool,
 author = {Guanting Dong and
 Yifei Chen and
 Xiaoxi Li and
 Jiajie Jin and
 Hongjin Qian and
 Yutao Zhu and
 Hangyu Mao and
 Guorui Zhou and
 Zhicheng Dou and
 Ji{-}Rong Wen},
 title = {Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement
 Learning},
 journal = {CoRR},
 volume = {abs/2505.16410},
 year = {2025},
 url = {https://doi.org/10.48550/arXiv.2505.16410},
 doi = {10.48550/ARXIV.2505.16410},
 eprinttype = {arXiv},
 eprint = {2505.16410},
 timestamp = {2025年6月26日 07:49:34 +0200},
 biburl = {https://dblp.org/rec/journals/corr/abs-2505-16410.bib},
 bibsource = {dblp computer science bibliography, https://dblp.org}
}

🤝 Acknowledge

This training implementation builds upon Tool-Star, Llama Factory, verl and ReCall. For evaluation, we rely on WebThinker, HIRA, WebSailor, Search-o1, and FlashRAG. The Python interpreter design references ToRA and ToRL, while our models are trained using Qwen2.5. We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.

📄 License

This project is released under the MIT License.

📞 Contact

For any questions or feedback, please reach out to us at dongguanting@ruc.edu.cn.

Star History

Star History Chart

Folders and files

Latest commit

History

Repository files navigation

✨ Agentic Reinforced Policy Optimization

If you like our project, please give us a star ⭐ on GitHub for the latest update.

📣 Latest News

🔥 Agentic RL Family

📦 Dataset & Model Zoo

Table of Contents

💡 Overview

AEPO (🔥New!)

ARPO

🏃 Quick Start

❄️ Cold-Start SFT Stage (Optional)

1. Environment Setup

2. Fine-Tuning Model

🔥 ARPO/AEPO Stage

1. Environment Setup

2. Preparation

2.1 Data Preparation

2.2 API Key Configuration

3. ARPO RL Training

4. AEPO RL Training

✅ ARPO/AEPO Evaluation

1. Setup vLLM Inference Environment

2. Setup Evaluation Environment

3. Configure and Run Evaluation

4. Calculate Metrics

📄 Citation

🤝 Acknowledge

📄 License

📞 Contact

Star History

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages