Arxiv(ARPO) | Arxiv(AEPO) | π€ Paper(ARPO) | π€ Paper(AEPO) | π€ Models(ARPO) | π€ Models(AEPO)
X@AK | X@ζΊε¨δΉεΏ | WeChat@ζΊε¨δΉεΏ | Zhihu | YouTube | Xiaohongshu |
Note
This project includes the codebase, datasets and chckpoints for two RL algorithms: Agentic Reinforced Policy Optimization (ARPO) and Agentic Entropy-Balanced Policy Optimization (AEPO). We will continue to iterate and expand our Agentic RL series.
- [Jan 26, 2026]: π Our paper Agentic Reinforced Policy Optimization has been accepted at ICLR 2026!
- [Jan 20, 2026]: π Our paper Agentic Entropy-Balanced Policy Optimization has been accepted at WWW 2026 (Oral)!
- [Dec 20, 2025]: πππ We released AEPO-32B and ARPO-32B (based on QwQ-32B), achieving 53.4/12.8 and 51.5/11.2 on GAIA/HLE.
- [Nov 03, 2025]: The brief introduction of AEPO can be found on a series of platforms like X, WeChat .
- [Oct 17, 2025]: π Our AEPO paper is now available on arXiv and Hugging Face daily paper.
- [Oct 16, 2025]: πππ We propose a new algorithm AEPO, which focused on entropy-balanced agentic RL and consistently outperforms ARPO on datasets like GAIA, HLE, and AIME. Full codebase and π€ HF-Models of AEPO released.
- [Aug 11, 2025]: The brief introduction of ARPO can be found on a series of platforms like X, WeChat, Zhihu, YouTube and Xiaohongshu .
- [July 29, 2025]: π₯ We are honored to be featured as π€ HuggingFace Daily Paper #1 and Weekly Paper #1 .
- [July 29, 2025]: π Our ARPO paper is now available on arXiv and Hugging Face daily paper.
- [July 25, 2025]: π₯ We released all our ARPO model checkpoints (3B~14B) and datasets(SFT, RL, Evaluation). Checkout π€ARPO Collection here. We will keep update it!
- [July 25, 2025]: We have implemented extensive tool-call acceleration and memory optimization during RL training in ARPO.(Training Qwen3-14B in 1 node with a batch size of 128 takes only 10 minutes per step!!! we also maintain a dynamic cache mechanism to save your tool call results in real-time!!)
- [July 25, 2025]: πππ Full codebase of ARPO released. ARPO supports multi-tool agentic RL training for the Qwen2.5, 3 and Llama3 models in π€ HF-Models .
π Welcome to try our agentic RL series of algorithms:
Agentic Entropy-Balanced Policy Optimization
Authors: Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhouβ , Yutao Zhu, Ji-Rong Wen, Zhicheng Douβ
TLDR: An agentic RL algorithm designed to balance entropy in both the rollout and policy update phases.
github github arXiv Paper Collection X (formerly Twitter) URL
Agentic Reinforced Policy Optimization
Authors: Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao , Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhouβ , Yutao Zhu, Ji-Rong Wen, Zhicheng Douβ
TLDR: An agentic RL algorithm encourage the policy model to adaptively branch sampling during high-entropy tool-call rounds,
github github arXiv Paper Collection X (formerly Twitter) URL
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
Authors: Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Douβ , Ji-Rong Wen
TLDR: An end-to-end TIR post-training framework that empowers LLMs to autonomously interact with multi-tool environments through Self-Critic RL design
github github arXiv Paper Collection X (formerly Twitter) URL
DeepAgent: A General Reasoning Agent with Scalable Toolsets (New!)
Authors: Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
TLDR: An end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution with brain-inspired memory folding mechanism.
github github arXiv Paper
| Dataset | Download |
|---|---|
| 54K Agentic SFT Data | π€ HuggingFace |
| 1K Agentic Deep Search RL Data | π€ HuggingFace |
| 10K Agentic Reasoning RL Data | π€ HuggingFace |
| Model(ARPO) | Download |
|---|---|
| Qwen3-8B-ARPO-DeepSearch | π€ HuggingFace |
| Qwen3-14B-ARPO-DeepSearch | π€ HuggingFace |
| QwQ-32B-ARPO-DeepSearch | π€ HuggingFace |
| Qwen2.5-3B-ARPO | π€ HuggingFace |
| Qwen2.5-7B-ARPO | π€ HuggingFace |
| Llama3.1-8B-ARPO | π€ HuggingFace |
| Model(AEPO) | Download |
|---|---|
| Qwen3-8B-AEPO-DeepSearch | π€ HuggingFace |
| Qwen3-14B-AEPO-DeepSearch | π€ HuggingFace |
| QwQ-32B-AEPO-DeepSearch | π€ HuggingFace |
| Qwen2.5-7B-AEPO | π€ HuggingFace |
We propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components:
image-
Dynamic Entropy-Balanced Rollout Mechanism that adaptively allocates global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues;
-
Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens (Entropy Clipping-Balanced Mechanism), while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens (Entropy-aware Advantage Estimation).
We propose Agentic Reinforced Policy Optimization (ARPO), an agentic RL algorithm tailored for training multi-turn LLM-based agent. The core principle of ARPO is to encourage the policy model to adaptively branch sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors.
intro-
In figure (left), The initial tokens generated by the LLM after receiving each round of tool-call feedback consistently exhibit a high entropy. This indicates that external tool-call significantly introduces uncertainty into the LLMβs reasoning process.
-
In the figure (right), we validate ARPO's performance across 13 datasets. Notably, Qwen3-14B with ARPO excelled in Pass@5, achieving 61.2% on GAIA and 24.0% on HLE, while requiring only about half the tool calls compared to GRPO during training.
Reproducing ARPO/AEPO requires three steps: cold start fine-tuning (optional), ARPO/AEPO training, and evaluation. Below, we will provide a detailed explanation.
This stage is meant to help you reproduce our experimental results. If your want to RL from scratch, you can skip this stage.
In this step, we will describe how to perform a cold start for the SFT stage using the LLaMA Factory repository. First, set up the environment as follows:
# Clone the ARPO repository (which includes LLaMA-Factory) git clone https://github.com/dongguanting/ARPO cd ARPO/LLaMA-Factory # Create a new conda environment conda create -n sft python=3.10 conda activate sft # Install dependencies pip install -r requirements.txt
-
Download your SFT dataset from π€ARPO-SFT-54K and place it in
LLaMA-Factory-main/data/final_sft_edition9.json. Define the dataset indataset_info.json. -
Configure Training
Update LLaMA-Factory/arpo_train_sft/yaml with the following content:
Training Configuration (click to expand)
### model model_name_or_path: <your_model_path> trust_remote_code: true ### method stage: sft do_train: true finetuning_type: full deepspeed: ../examples/deepspeed/ds_z3_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json] ### dataset dataset_dir: dataset_info dataset: <your_dataset> template: qwen cutoff_len: 15000 max_samples: 1000000 overwrite_cache: true preprocessing_num_workers: 16 ### output output_dir: <your_output_dir> logging_steps: 10 save_steps: 2000 plot_loss: true overwrite_output_dir: true ### train per_device_train_batch_size: 1 gradient_accumulation_steps: 2 learning_rate: 7.0e-6 num_train_epochs: 3.0 lr_scheduler_type: cosine warmup_ratio: 0.1 bf16: true ddp_timeout: 180000000
Also, update the output directory in arpo_train_sft/sft_train.sh:
# Output directory OUTPUT_DIR="<your_output_dir>"
After completing the information, you can fine-tune the model using the following command:
bash arpo_train_sft/sft_train.sh
In this step, we will load the cold-start data for GRPO training. We reference the ReCall and VERL frameworks for RL training.
you can install our additional environment as follow:
#create env conda create -n arpo python==3.10 conda activate arpo # install torch & flash-atten pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124 pip3 install flash-attn --no-build-isolation # install RL basic env cd ARPO # This is our RL env freeze file. You can install it as a supplement or use it for checking. pip install -r requirements.txt
In our paper, we offer two type of train & validation datasets to verify the effectiveness of ARPO:
-
Reasoning and Knowledge Dataset: This dataset is used to test the benchmarks listed in Table 1.
- train_10k.parquet: Contains 10K samples for mathematical and knowledge reasoning.
- test.parquet: Comprises 300 test samples from 8 datasets, including AIME24, AIME25, MATH500, GSM8k, HotpotQA, 2Wiki, Misque, and Bamboogle.
-
Deep Search Dataset: This dataset is used to test the benchmarks listed in Table 2.
- hard_search.parquet: Contains 1K samples, including 800 samples from simpledeepsearch and 200 samples from webdancer.
- gaia_test.parquet/hle_test.parquet: Contains test samples from GAIA and Humanity Last Exam (HLE).
Our search api tool utilizes Bright Data (A third-party Bing API, without the retirement risk of official Bing API). Before starting the training, please replace the API key and zone in the following files: ARPO/scripts/config/ppo_trainer_dr.yaml and ARPO/scripts/config/ppo_trainer.yaml.
Additionally, please also replace the API key and zone in the following file: /verl_arpo_entropy/verl/workers/rollout/tools/config_example.yaml. Below is the instruction on how to do this:
π Click here! Watch the details of tool API configuration YAML
tools: # General tool configuration call_limit: 3 # Maximum number of tool calls allowed per sample max_workers: 64 # Maximum number of threads for concurrent tool execution timeout: 120 # Tool execution timeout (seconds) retry_count: 3 # Number of retry attempts for tool execution failures verbose_logging: true # Enable detailed logging fail_on_error: false # Throw an exception if tool loading fails # Tool instance definitions tool_instances: python: class_path: verl.workers.rollout.tools.python_tool.PythonTool # Tool class path params: # Tool-specific parameters conda_path: /path/to/conda conda_env: verl search: class_path: verl.workers.rollout.tools.search_tool.BingSearchTool params: api_key: <your_API_key> # Replace with your Bright Data API key zone: <your_zone> # Replace with your Bright Data zone max_results: 10 result_length: 1000 location: cn
Make sure to replace <your_API_key> and <your_zone> with your actual Bright Data API key and zone. This configuration ensures that the search tool is properly set up to perform searches during the training process. If you have any questions or need further assistance, feel free to ask!
We have open-sourced a series of ARPO scripts located in the /ARPO/scripts/ directory, which includes configurations for 7B, 8B, and 14B models. Below is an example of how to set up and run training for training ARPO. Make sure to replace placeholders like <your_path_to_ARPO>, <your_model_path>, and <your_checkpoint_save_dir> with your actual paths.
π Click here! Watch the details of train bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )" PARENT_DIR="$(dirname "$SCRIPT_DIR")" cd "$PARENT_DIR" echo "Switched to parent directory: $PARENT_DIR" # ============================ Environment Setup ============================ # Set basic environment variables export PYTHONUNBUFFERED=1 export HYDRA_FULL_ERROR=1 export VLLM_ATTENTION_BACKEND=XFORMERS export VERL_LOGGING_LEVEL=DEBUG export MKL_SERVICE_FORCE_INTEL=1 export MKL_THREADING_LAYER=GNU export RAY_memory_usage_threshold=0.8 export RAY_memory_monitor_refresh_ms=0 # Set Python path export PYTHONPATH="<your_path_to_ARPO>"/verl_arpo_entropy:$PYTHONPATH # ============================ Basic Configuration ============================ # Experiment name and project PROJECT_NAME="reasoning_tasks" # Modify experiment group EXPERIMENT_NAME="ARPO_global_16_init_8_beam_2_random_0_arpo_0.2_entropy" # Modify experiment name # Configuration file path CONFIG_PATH="<your_path_to_ARPO>/scripts/config" # Modify the absolute path of the config folder, relative path is not recommended CONFIG_NAME="ppo_trainer.yaml" # Distributed training settings NNODES=1 N_GPUS_PER_NODE=8 # ============================ Data Configuration ============================ # Data parameters PROMPT_KEY="prompt" # Prompt field name TRAIN_BATCH_SIZE=128 # Training batch size PPO_MINI_BATCH_SIZE=16 # PPO mini-batch size MAX_PROMPT_LENGTH=1536 # Maximum prompt length MAX_RESPONSE_LENGTH=4096 # Maximum response length # Data file paths TRAIN_FILES="<your_path_to_ARPO>/rl_datasets/train.parquet" # Modify training data path VALID_FILES="<your_path_to_ARPO>/rl_datasets/valid.parquet" # Modify validation data path # ============================ Model Configuration ============================ # Actor model path ACTOR_MODEL_PATH="<your_model_path>" # Modify training model path # ============================ Rollout Configuration ========================== # Rollout settings ROLLOUT_NAME="vllm" # Use vllm engine ROLLOUT_MODE="sync_with_tool" # Synchronous mode with tool support ROLLOUT_N=16 # Number of responses generated per sample INITIAL_ROLLOUTS=8 # Initial rollout number BEAM_SIZE=2 # Beam size BRANCH_PROBABILITY=0.5 # Branch probability Entropy_weight=0.2 # ============================ Rollout Tools Configuration ========================== SEARCH_CACHE_PATH="<your_path_to_ARPO>/search_cache/search_cache.json" # Modify # ============================ Reward Model Configuration ========================== # Reward model settings REWARD_MANAGER="naive" # Reward manager type CUSTOM_REWARD_FUNCTION_PATH="<your_path_to_ARPO>/verl_arpo_entropy/verl/utils/reward_score/deep_research.py" # Modify reward function path CUSTOM_REWARD_FUNCTION_NAME="compute_score" # ============================ Training Configuration ============================ # Training parameters TOTAL_EPOCHS=2 # Total training epochs SAVE_FREQ=5 # Save frequency TEST_FREQ=5 # Test frequency # ============================ Path Configuration ============================ # Save path SAVE_PATH="<your_checkpoint_save_dir>/${EXPERIMENT_NAME}" # Modify save path ROLLOUT_SAVE_PATH="${SAVE_PATH}/rollout" # ============================ WandB Configuration ============================ # WandB settings WANDB_API_KEY="<your_wandb_key>" # Modify your wandb key SEARCH_CLASS_PATH="verl.workers.agent.tools.search_tool.BingSearchTool" # ============================ Preparation ============================ # Login to WandB (if API key is provided) if [ "$WANDB_API_KEY" != "" ]; then wandb login --relogin $WANDB_API_KEY export WANDB_DIR=${SAVE_PATH} fi # Create save directory if [ ! -d "$SAVE_PATH" ]; then mkdir -p $SAVE_PATH fi # Create rollout save directory if [ ! -d "$ROLLOUT_SAVE_PATH" ]; then mkdir -p $ROLLOUT_SAVE_PATH fi # ============================ Start Training ============================ python3 -m verl.trainer.main_ppo \ --config-path=$CONFIG_PATH \ --config-name=$CONFIG_NAME \ algorithm.adv_estimator=grpo \ algorithm.kl_ctrl.kl_coef=0.0 \ data.train_files=${TRAIN_FILES} \ data.val_files=${VALID_FILES} \ data.prompt_key=${PROMPT_KEY} \ data.train_batch_size=${TRAIN_BATCH_SIZE} \ data.max_prompt_length=${MAX_PROMPT_LENGTH} \ data.max_response_length=${MAX_RESPONSE_LENGTH} \ actor_rollout_ref.model.path=${ACTOR_MODEL_PATH} \ actor_rollout_ref.model.enable_gradient_checkpointing=True \ actor_rollout_ref.model.use_remove_padding=True \ actor_rollout_ref.actor.optim.lr=1e-6 \ actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \ actor_rollout_ref.actor.use_dynamic_bsz=True \ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$((2*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \ actor_rollout_ref.actor.use_kl_loss=True \ actor_rollout_ref.actor.kl_loss_coef=0.0 \ actor_rollout_ref.actor.kl_loss_type=low_var_kl \ actor_rollout_ref.actor.fsdp_config.param_offload=False \ actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \ actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ actor_rollout_ref.rollout.name=${ROLLOUT_NAME} \ actor_rollout_ref.rollout.mode=${ROLLOUT_MODE} \ actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \ actor_rollout_ref.rollout.n=${ROLLOUT_N} \ actor_rollout_ref.rollout.initial_rollouts=${INITIAL_ROLLOUTS} \ actor_rollout_ref.rollout.beam_size=${BEAM_SIZE} \ actor_rollout_ref.rollout.branch_probability=${BRANCH_PROBABILITY} \ actor_rollout_ref.rollout.entropy_weight=${Entropy_weight} \ actor_rollout_ref.rollout.tools.tool_instances.search.params.cache_file=${SEARCH_CACHE_PATH} \ actor_rollout_ref.rollout.tools.tool_instances.search.class_path=${SEARCH_CLASS_PATH} \ actor_rollout_ref.rollout.multi_turn.enable=${ENABLE_MULTI_TURN} \ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \ actor_rollout_ref.ref.fsdp_config.param_offload=True \ reward_model.reward_manager=${REWARD_MANAGER} \ custom_reward_function.path=${CUSTOM_REWARD_FUNCTION_PATH} \ custom_reward_function.name=${CUSTOM_REWARD_FUNCTION_NAME} \ trainer.critic_warmup=0 \ trainer.logger="[console, wandb]" \ trainer.project_name=${PROJECT_NAME} \ trainer.experiment_name=${EXPERIMENT_NAME} \ trainer.n_gpus_per_node=${N_GPUS_PER_NODE} \ trainer.nnodes=${NNODES} \ trainer.save_freq=${SAVE_FREQ} \ trainer.test_freq=${TEST_FREQ} \ trainer.total_epochs=${TOTAL_EPOCHS} \ trainer.default_local_dir=${SAVE_PATH} \ trainer.val_before_train=False \ trainer.rollout_data_dir=${ROLLOUT_SAVE_PATH} \ hydra.run.dir=${SAVE_PATH}/outputs 2>&1 | tee ${SAVE_PATH}/run.log
You can then run the following script to start training:
cd ./ARPO/scripts/
bash ARPO_7B_Reasoning_1node.shFor the trained RL checkpoint, you can follow the code below to convert the weights to Hugging Face format:
bash ./ARPO/merge_ckpt/convert_checkpoint_from_verl_to_hf_qwen3.sh
We have open-sourced a series of AEPO scripts located in the /AEPO/scripts/ directory, which includes configurations for 7B and 14B models. Below is an example of how to set up and run training for training AEPO. Make sure to replace placeholders like <your_path_to_AEPO>, <your_model_path>, and <your_checkpoint_save_dir> with your actual paths. Note that AEPO reuses the same dataset and search cache from the ARPO folder, so please ensure the related paths are correctly set.
You can modify the hyperparameters in our scripts to enable different modules of AEPO described in our paper:
-
ENABLE_DYNAMIC_ROLLOUTS: Whether to enable the Dynamic Entropy-Balanced Rollout Mechanism, defaults to False
-
ENABLE_ENTROPY_BALANCED_CLIPPING: Whether to enable the Entropy Clipping-Balanced Mechanism.
-
ENABLE_ENTROPY_BALANCED_ADVANTAGE: Whether to enable Entropy-aware Advantage Estimation.
π Click here! Watch the details of train bash
# Switch to the directory of the script SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )" PARENT_DIR="$(dirname "$SCRIPT_DIR")" cd "$PARENT_DIR" echo "Switched to parent directory: $PARENT_DIR" # ============================ Environment Setting ============================ # Set basic environment variables export PYTHONUNBUFFERED=1 export HYDRA_FULL_ERROR=1 export VLLM_ATTENTION_BACKEND=XFORMERS export VERL_LOGGING_LEVEL=DEBUG export MKL_SERVICE_FORCE_INTEL=1 export MKL_THREADING_LAYER=GNU export RAY_memory_usage_threshold=0.8 export RAY_memory_monitor_refresh_ms=0 # Set Python path export PYTHONPATH=${PARENT_DIR}/verl_aepo_entropy:$PYTHONPATH # ============================ Basic Configuration ============================ # Experiment name and project PROJECT_NAME="deep_research" EXPERIMENT_NAME="aepo_qwen3_14b_deepresearch" # Configuration file path CONFIG_PATH="${PARENT_DIR}/scripts/config" # Modify the absolute path of the config folder, relative path is not recommended CONFIG_NAME="ppo_trainer_dr.yaml" # Distributed training settings NNODES=1 N_GPUS_PER_NODE=8 # ============================ Data Configuration ============================ # Data parameters PROMPT_KEY="prompt" # Prompt field name TRAIN_BATCH_SIZE=64 # Training batch size PPO_MINI_BATCH_SIZE=8 # PPO mini-batch size MAX_PROMPT_LENGTH=2000 # Maximum prompt length MAX_RESPONSE_LENGTH=6192 # Maximum response length # Data file paths TRAIN_FILES="${PARENT_DIR}/../ARPO/rl_datasets/hard_search_1k.parquet" VALID_FILES=["${PARENT_DIR}/../ARPO/rl_datasets/gaia_test.parquet","${PARENT_DIR}/../ARPO/rl_datasets/hle_test.parquet"] # ============================ Model Configuration ============================ # Actor model path ACTOR_MODEL_PATH="<your_14B_model_path>" # ============================ AEPO Configuration ============================ ENABLE_DYNAMIC_ROLLOUTS=False ENABLE_ENTROPY_BALANCED_CLIPPING=True ENABLE_ENTROPY_BALANCED_ADVANTAGE=True # ============================ Rollout Configuration ========================== # Rollout settings ROLLOUT_NAME="vllm" # Use vllm engine ROLLOUT_MODE="sync_with_tool" # Synchronous mode with tool support ROLLOUT_N=12 # Number of responses generated per sample INITIAL_ROLLOUTS=6 # Initial rollout number BEAM_SIZE=2 # Beam size BRANCH_PROBABILITY=0.5 # Branch probability Entropy_weight=0.2 # ============================ Rollout Tools Configuration ========================== SEARCH_CACHE_PATH="${PARENT_DIR}/../ARPO/search_cache/search_cache.json" # Modify # ============================ Reward Model Configuration ========================== # Reward model settings REWARD_MANAGER="naive" # Reward manager type CUSTOM_REWARD_FUNCTION_PATH="${PARENT_DIR}/verl_aepo_entropy/verl/utils/reward_score/deep_research.py" CUSTOM_REWARD_FUNCTION_NAME="compute_score" # ============================ Training Configuration ============================ # Training parameters TOTAL_EPOCHS=5 # Total training epochs SAVE_FREQ=5 # Save frequency TEST_FREQ=5 # Test frequency # ============================ Path Configuration ============================ # Save path SAVE_PATH="<your_checkpoint_save_dir>/rl/${EXPERIMENT_NAME}" ROLLOUT_SAVE_PATH="${SAVE_PATH}/rollout" # ============================ WandB Configuration ============================ # WandB settings WANDB_API_KEY="<your_wandb_key>" # Modify your wandb key # ============================ Preparation ============================ # Login to WandB (if API key is provided) if [ "$WANDB_API_KEY" != "" ]; then wandb login --relogin $WANDB_API_KEY export WANDB_DIR=${SAVE_PATH} fi # Create save directory if [ ! -d "$SAVE_PATH" ]; then mkdir -p $SAVE_PATH fi # Create rollout save directory if [ ! -d "$ROLLOUT_SAVE_PATH" ]; then mkdir -p $ROLLOUT_SAVE_PATH fi # ============================ Start Training ============================ python3 -m verl.trainer.main_ppo \ --config-path=$CONFIG_PATH \ --config-name=$CONFIG_NAME \ algorithm.adv_estimator=grpo \ algorithm.kl_ctrl.kl_coef=0.0 \ data.train_files=${TRAIN_FILES} \ data.val_files=${VALID_FILES} \ data.prompt_key=${PROMPT_KEY} \ data.train_batch_size=${TRAIN_BATCH_SIZE} \ data.max_prompt_length=${MAX_PROMPT_LENGTH} \ data.max_response_length=${MAX_RESPONSE_LENGTH} \ actor_rollout_ref.model.path=${ACTOR_MODEL_PATH} \ actor_rollout_ref.model.enable_gradient_checkpointing=True \ actor_rollout_ref.model.use_remove_padding=True \ actor_rollout_ref.actor.enable_entropy_balanced_clipping=${ENABLE_ENTROPY_BALANCED_CLIPPING} \ actor_rollout_ref.actor.enable_entropy_balanced_advantage=${ENABLE_ENTROPY_BALANCED_ADVANTAGE} \ actor_rollout_ref.actor.optim.lr=1e-6 \ actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \ actor_rollout_ref.actor.use_dynamic_bsz=True \ actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$((2*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \ actor_rollout_ref.actor.use_kl_loss=True \ actor_rollout_ref.actor.kl_loss_coef=0.0 \ actor_rollout_ref.actor.kl_loss_type=low_var_kl \ actor_rollout_ref.actor.fsdp_config.param_offload=False \ actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ actor_rollout_ref.rollout.enable_dynamic_rollouts=${ENABLE_DYNAMIC_ROLLOUTS} \ actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \ actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ actor_rollout_ref.rollout.name=${ROLLOUT_NAME} \ actor_rollout_ref.rollout.mode=${ROLLOUT_MODE} \ actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \ actor_rollout_ref.rollout.n=${ROLLOUT_N} \ actor_rollout_ref.rollout.initial_rollouts=${INITIAL_ROLLOUTS} \ actor_rollout_ref.rollout.beam_size=${BEAM_SIZE} \ actor_rollout_ref.rollout.branch_probability=${BRANCH_PROBABILITY} \ actor_rollout_ref.rollout.entropy_weight=${Entropy_weight} \ ++actor_rollout_ref.rollout.tools.tool_instances.search.params.cache_file=${SEARCH_CACHE_PATH} \ actor_rollout_ref.rollout.multi_turn.enable=${ENABLE_MULTI_TURN} \ actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \ actor_rollout_ref.ref.fsdp_config.param_offload=True \ reward_model.reward_manager=${REWARD_MANAGER} \ custom_reward_function.path=${CUSTOM_REWARD_FUNCTION_PATH} \ custom_reward_function.name=${CUSTOM_REWARD_FUNCTION_NAME} \ trainer.critic_warmup=0 \ trainer.logger="[console, wandb]" \ trainer.project_name=${PROJECT_NAME} \ trainer.experiment_name=${EXPERIMENT_NAME} \ trainer.n_gpus_per_node=${N_GPUS_PER_NODE} \ trainer.nnodes=${NNODES} \ trainer.save_freq=${SAVE_FREQ} \ trainer.test_freq=${TEST_FREQ} \ trainer.total_epochs=${TOTAL_EPOCHS} \ trainer.default_local_dir=${SAVE_PATH} \ trainer.val_before_train=False \ trainer.rollout_data_dir=${ROLLOUT_SAVE_PATH} \ hydra.run.dir=${SAVE_PATH}/outputs 2>&1 | tee ${SAVE_PATH}/run.log
You can then run the following script to start training:
cd ./AEPO/scripts/
bash AEPO_Qwen3_14B_DeepResearch.shSame as ARPO, for the trained RL checkpoint, you can follow the code below to convert the weights to Hugging Face format:
bash ./ARPO/merge_ckpt/convert_checkpoint_from_verl_to_hf_qwen3.sh
If you have already trained a model, you can refer to the following process for TIR capability evaluation. Of course, you can also download our checkpoint from π€ARPO-Huggingface-Collection and π€AEPO-Huggingface-Collection for directly testing. This guide walks you through setting up two separate environments:
- One for vLLM inference service (
vllm_env) - One for evaluation pipeline (
evaluation)
# Step into the vllm_scripts directory cd evaluation/vllm_scripts # Create a dedicated conda environment for vLLM conda create -n vllm_env python=3.10 conda activate vllm_env # Install dependencies (edit as needed) pip install -r requirements.txt
Edit the following launch scripts with your own model paths and names:
In vllm_launch_reasoning_model_cuda4-7.sh:
MODEL_PATH="<path/to/your/reasoning_model_checkpoint>" MODEL_NAME="your_model_name"
For summarization models (choose one):
MODEL_PATH="<path/to/your/summarization_model_checkpoint>" MODEL_NAME="your_summarization_model_name"
Launch the vLLM services:
# Start the reasoning model bash vllm_launch_reasoning_model_cuda4-7.sh # Start the summarization model (choose one) bash vllm_launch_summarize_model_cuda0-3_<your_model>.sh
# Create a separate environment for evaluation conda create -n evaluation python=3.10 conda activate evaluation # Install required packages cd evaluation pip install -r requirements.txt
Edit the infer_local_sds.sh script with the following values:
# Activate your Conda environment manually if 'conda' is not available in shell source < /path/to/your/conda >/bin/activate conda activate < your env name > # Datasets to evaluate β uncomment the ones you want to include: # Options: aime24, aime25, math500, gsm8k, math, webwalker, hotpotqa, 2wiki, bamboogle, musique, hle, gaia, SimpleQA, xbench data_names=( "hle" "gaia" ) # Required parameters to update: EXP_NAME="<your_exp_name>" # Name of this experiment run MODEL_PATH="<your_model_path>" # Path to the reasoning model OUTPUT_PATH="<your_output_path>" # Directory to save outputs CONDA_PATH="<your_conda_path>" # Path to your Conda installation CONDA_ENV="<your_env_name>" # Name of your Conda environment BING_API_KEY="<your_bing_search_api_key>" # Bing Search API key BING_ZONE="<your_bing_zone>" # Bing API zone SUMM_MODEL_PATH="<your_summarization_model_path>" # Path to summarization model checkpoints
For Bing API usage, please refer to Bright Data.
Run the evaluation:
bash evaluation/infer_local_sds.sh
πΈ For Chinese datasets like
xbench, we recommend using Jina API for better webpage extraction.To enable Jina Reader API, modify
evaluation/src/tools/search_tool_sds.py(line ~155):# Change from: lambda: self.extract_text_from_url(url, use_jina=False, jina_api_key=None) # To: lambda: self.extract_text_from_url(url, use_jina=True, jina_api_key="your_jina_api_key")Get your Jina API key at https://jina.ai/reader
After generating inference results, you can use a large model like Qwen2.5-72B-Instruct to evaluate them with more powerful understanding capabilities.
First, use the vLLM environment to start the evaluation model:
bash evaluation/deploy_qwen2.5_72B_instruct.sh
In that script, make sure to update the vllm serve command with your own model path:
# Activate your Conda environment manually if 'conda' is not available in shell source < /path/to/your/conda >/bin/activate conda activate < your env name > vllm serve <your_model_path> \ --served-model-name Qwen2.5-72B-Instruct \ --max-model-len 32768 \ --tensor_parallel_size 4 \ --gpu-memory-utilization 0.75 \ --quantization gptq \ --port 8001
Before running the evaluation script, update the following line in evaluate_passk.sh to specify the output directory:
OUTPUT_DIR="<your_result_directory>"Then, run the evaluation script to calculate metrics:
bash evaluation/evaluate_passk.sh
If you find this work helpful, please cite our paper:
@article{dong2025arpo, author = {Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji{-}Rong Wen and Zhicheng Dou}, title = {Agentic Reinforced Policy Optimization}, journal = {CoRR}, volume = {abs/2507.19849}, year = {2025}, url = {https://doi.org/10.48550/arXiv.2507.19849}, doi = {10.48550/ARXIV.2507.19849}, eprinttype = {arXiv}, eprint = {2507.19849}, timestamp = {2025εΉ΄8ζ22ζ₯ 07:48:19 +0200}, biburl = {https://dblp.org/rec/journals/corr/abs-2507-19849.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} } @article{dong2025aepo, author = {Guanting Dong and Licheng Bao and Zhongyuan Wang and Kangzhi Zhao and Xiaoxi Li and Jiajie Jin and Jinghan Yang and Hangyu Mao and Fuzheng Zhang and Kun Gai and Guorui Zhou and Yutao Zhu and Ji{-}Rong Wen and Zhicheng Dou}, title = {Agentic Entropy-Balanced Policy Optimization}, journal = {CoRR}, volume = {abs/2510.14545}, year = {2025}, url = {https://doi.org/10.48550/arXiv.2510.14545}, doi = {10.48550/ARXIV.2510.14545}, eprinttype = {arXiv}, eprint = {2510.14545}, timestamp = {2025εΉ΄11ζ14ζ₯ 15:17:45 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2510-14545.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} } @article{dong2025tool, author = {Guanting Dong and Yifei Chen and Xiaoxi Li and Jiajie Jin and Hongjin Qian and Yutao Zhu and Hangyu Mao and Guorui Zhou and Zhicheng Dou and Ji{-}Rong Wen}, title = {Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning}, journal = {CoRR}, volume = {abs/2505.16410}, year = {2025}, url = {https://doi.org/10.48550/arXiv.2505.16410}, doi = {10.48550/ARXIV.2505.16410}, eprinttype = {arXiv}, eprint = {2505.16410}, timestamp = {2025εΉ΄6ζ26ζ₯ 07:49:34 +0200}, biburl = {https://dblp.org/rec/journals/corr/abs-2505-16410.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
This training implementation builds upon Tool-Star, Llama Factory, verl and ReCall. For evaluation, we rely on WebThinker, HIRA, WebSailor, Search-o1, and FlashRAG. The Python interpreter design references ToRA and ToRL, while our models are trained using Qwen2.5. We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.
This project is released under the MIT License.
For any questions or feedback, please reach out to us at dongguanting@ruc.edu.cn.