CodeScaler
Paper on arXiv | GitHub Code | GitHub Page | Datasets on Hugging Face | CodeScaler on Hugging Face
- We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.
- Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases.
- At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit-test approaches while reducing latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain but also in the general and reasoning domains.
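The summary above names syntax-aware code extraction and validity-preserving reward shaping without giving details here. The following is a minimal sketch of one plausible reading, not the authors' implementation: pull the last fenced block that parses as Python out of a model response, and fall back to a fixed penalty when no valid code is found. The names `extract_code`, `shaped_reward`, `score_fn`, and the value of `INVALID_REWARD` are all assumptions made for illustration.

```python
import ast
import re
from typing import Callable, Optional

TICKS = "`" * 3  # markdown code-fence delimiter, built here to avoid nesting fences
FENCE = re.compile(TICKS + r"(?:python|py)?[ \t]*\n(.*?)" + TICKS, re.DOTALL)

INVALID_REWARD = -1.0  # hypothetical penalty for responses without valid code


def extract_code(response: str) -> Optional[str]:
    """Return the last fenced block that parses as Python, or None."""
    for block in reversed(FENCE.findall(response)):
        try:
            ast.parse(block)  # syntax check only; nothing is executed
            return block
        except (SyntaxError, ValueError):
            continue
    return None


def shaped_reward(response: str, score_fn: Callable[[str], float]) -> float:
    """Score valid code with score_fn; give invalid responses a fixed penalty."""
    code = extract_code(response)
    if code is None:
        return INVALID_REWARD
    return score_fn(code)
```

Here `score_fn` stands in for a call to the reward model. The syntax check relies only on `ast.parse`, so it stays execution-free.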
- [2026-02] We have released the CodeScaler paper on arXiv!
- [2026-02] We have released the code, dataset, and models for CodeScaler!
- CodeScalerPair-51K: We construct high-quality preference data from on-policy training trajectories.
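As a rough illustration of how preference pairs might be built from on-policy trajectories, the sketch below groups sampled solutions by problem and pairs each passing solution with a failing one. The `Rollout` fields, the `passed` label, and the per-problem cap are assumptions for this example. The actual construction of CodeScalerPair-51K may differ.

```python
from itertools import product
from typing import Dict, List, NamedTuple


class Rollout(NamedTuple):
    problem_id: str
    code: str
    passed: bool  # hypothetical verification label for this sampled solution


def build_pairs(rollouts: List[Rollout], max_pairs_per_problem: int = 4) -> List[dict]:
    """Pair passing (chosen) and failing (rejected) solutions of the same problem."""
    by_problem: Dict[str, List[Rollout]] = {}
    for rollout in rollouts:
        by_problem.setdefault(rollout.problem_id, []).append(rollout)

    pairs = []
    for problem_id, group in by_problem.items():
        chosen = [r.code for r in group if r.passed]
        rejected = [r.code for r in group if not r.passed]
        # Problems where every rollout passes (or fails) yield no pairs.
        combos = list(product(chosen, rejected))[:max_pairs_per_problem]
        for good, bad in combos:
            pairs.append({"problem_id": problem_id, "chosen": good, "rejected": bad})
    return pairs
```

Problems with only passing or only failing rollouts contribute nothing, since a preference pair needs one of each.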
We release CodeScaler at three scales: 1.7B, 4B, and 8B.
- CodeScaler-1.7B: A reward model trained on CodeScalerPair-51K from Skywork/Skywork-Reward-V2-Qwen3-1.7B.
- CodeScaler-4B: A reward model trained on CodeScalerPair-51K from Skywork/Skywork-Reward-V2-Qwen3-4B.
- CodeScaler-8B: A reward model trained on CodeScalerPair-51K from Skywork/Skywork-Reward-V2-Qwen3-8B.
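This README does not state the training objective. Reward models trained on preference pairs commonly use a Bradley-Terry loss, so the sketch below assumes that objective purely for illustration. The function name `bradley_terry_loss` is hypothetical.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)) in a numerically stable form."""
    margin = score_chosen - score_rejected
    if margin >= 0:
        # log(1 + exp(-margin)); exp argument is non-positive, so no overflow
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite as -margin + log(1 + exp(margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks toward zero as the chosen solution's score exceeds the rejected one's by a growing margin, and equals `log(2)` when the two scores are tied.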
Step 1: Clone the repository
git clone https://github.com/LARK-AI-Lab/CodeScaler.git
cd CodeScaler
Step 2: Create a conda environment
conda create -n CodeScaler python==3.10.19
conda activate CodeScaler
Step 3: Install dependencies
pip install -r requirements.txt
Step 4: Install FlashAttention
pip install --no-cache-dir \
  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
Tip: You can also install FlashAttention based on your specific PyTorch and CUDA versions for optimal performance.
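When picking a different wheel, it can help to compare its filename tags against your environment. The sketch below parses the tags from the wheel name used in the install command. The naming pattern is inferred from that single filename, so treat it as an assumption. The helpers `parse_wheel_name`, `local_python_tag`, and `wheel_matches` are hypothetical names.

```python
import re
import sys

# Pattern inferred from the wheel filename in the install command (an assumption).
WHEEL_PATTERN = re.compile(
    r"flash_attn-(?P<version>[^+]+)\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
    r"cxx11abi(?P<abi>TRUE|FALSE)-(?P<python>cp\d+)-cp\d+-(?P<platform>.+)\.whl$"
)


def parse_wheel_name(filename: str) -> dict:
    """Split a FlashAttention wheel filename (or URL) into its version tags."""
    match = WHEEL_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unrecognized wheel name: {filename}")
    return match.groupdict()


def local_python_tag() -> str:
    """Return the CPython tag of the running interpreter, e.g. 'cp310'."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def wheel_matches(tags: dict, python_tag: str, torch_version: str, cuda_version: str) -> bool:
    """Check Python tag, PyTorch major.minor, and CUDA major version against the wheel."""
    return (
        tags["python"] == python_tag
        and torch_version.startswith(tags["torch"] + ".")
        and cuda_version.split(".")[0] == tags["cuda"]
    )
```

In an environment with PyTorch installed, you could call `wheel_matches(tags, local_python_tag(), torch.__version__, torch.version.cuda)`. Note that `torch.version.cuda` is `None` on CPU-only builds, which this sketch does not handle.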
Prepare the training and evaluation datasets:
# Prepare training dataset
python data/prepare_deepcoder.py
# Download and prepare evaluation dataset
python data/download_dataset.py
python data/prepare_evaluation.py
Tip: The training dataset is based on the DeepCoder training datasets, and evaluation covers multiple coding benchmarks.
Train Qwen3-8B-Base on the DeepCoder dataset using CodeScaler as the reward model:
# Log in to Weights & Biases for experiment tracking
wandb login
# Start training
bash scripts/train.sh
Tip: Check scripts/train.sh to customize hyperparameters such as learning rate, batch size, and training epochs.
Evaluate your trained model:
# Run evaluation on benchmarks
bash scripts/eval.sh

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = 'LARK-Lab/CodeScaler-8B'
tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()

question = """\
Given an integer array nums and an integer k, return the total number of continuous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input: nums = [1, 1, 1], k = 2
Output: 2
```
"""

# Correct solution using prefix sum approach
program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: subarray starting from index 0
    for num in nums:
        prefix += num
        if prefix - k in freq:
            count += freq[prefix - k]
        freq[prefix] += 1
    return count
"""

# Incorrect solution using sliding window (fails on negative numbers)
program_wrong = """\
def subarraySum(nums, k):
    left = 0
    curr_sum = 0
    count = 0
    for right in range(len(nums)):
        curr_sum += nums[right]
        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1
        if curr_sum == k:
            count += 1
    return count
"""

convs = [
    [
        {
            "content": question,
            "role": "user",
        },
        {
            "role": "assistant",
            "content": program
        }
    ]
    for program in [program_correct, program_wrong]
]

texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]

toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )

scores = outputs.logits.squeeze(-1).cpu().tolist()
print("RM Scores:", scores)
# RM Scores: [6.5424089431762695, -0.0312652587890625]
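Using reward scores for test-time scaling typically means sampling several candidate programs and keeping the highest-scoring one (best-of-N). The sketch below shows that selection step under stated assumptions: `score_batch` is a hypothetical callable that wraps a reward model call like the one above and returns one score per candidate. It is not part of the released code.

```python
from typing import Callable, List, Sequence, Tuple


def best_of_n(
    question: str,
    candidates: Sequence[str],
    score_batch: Callable[[str, Sequence[str]], List[float]],
) -> Tuple[str, float]:
    """Return the candidate with the highest reward score, plus that score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    scores = score_batch(question, candidates)
    if len(scores) != len(candidates):
        raise ValueError("score_batch must return one score per candidate")
    # Index of the maximum score; ties resolve to the earliest candidate.
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]
```

Because selection needs only a forward pass of the reward model per candidate, no candidate program is ever executed.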
If you find our work helpful, please consider citing:
@misc{zhu2026codescalerscalingcodellm,
title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models},
author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
year={2026},
eprint={2602.17684},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.17684},
}