LARK-AI-Lab/CodeScaler

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

CodeScaler Paper on arXiv | GitHub Code | GitHub Page | Datasets on Hugging Face | CodeScaler on Hugging Face

📊 Overview

Overview of models

  • We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.

  • Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases.

  • At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit-test-based approaches while providing a […]× reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench, not only in the code domain but also in the general and reasoning domains.
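The overview mentions syntax-aware code extraction and validity-preserving reward shaping without spelling out either mechanism. The sketch below shows one possible reading, not the authors' implementation: pull the last fenced block that parses as Python out of a response, and pass the reward model score through only when such a block exists. The names `extract_code`, `shaped_reward`, and the floor value `INVALID_REWARD` are hypothetical.

```python
import ast
import re
from typing import Optional

# Matches a fenced block with an optional "python"/"py" language tag.
_FENCE_RE = re.compile(
    r"`{3}(?:python|py)?[ \t]*\n(.*?)`{3}", re.DOTALL | re.IGNORECASE
)

# Hypothetical floor reward for responses without valid code
# (an assumption for illustration, not a value from the paper).
INVALID_REWARD = -1.0


def extract_code(response: str) -> Optional[str]:
    """Return the last non-empty fenced block that parses as Python, else None."""
    for block in reversed(_FENCE_RE.findall(response)):
        if not block.strip():
            continue
        try:
            ast.parse(block)
        except (SyntaxError, ValueError):
            continue
        return block
    return None


def shaped_reward(response: str, rm_score: float) -> float:
    """Keep the RM score only when syntactically valid code was extracted."""
    return rm_score if extract_code(response) is not None else INVALID_REWARD
```

Under this reading, a response that contains no parseable code never receives a high reward, regardless of what the reward model would have scored it.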

News

  • [2026-02] 🎉 We have released the CodeScaler paper on arXiv!

  • [2026-02] 🎉 We have released the code, dataset, and models for CodeScaler!

📚 Datasets

  • CodeScalerPair-51K: We construct high-quality preference data from on-policy training trajectories.
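As a rough illustration of how preference pairs could be built from training rollouts on verified problems, the sketch below pairs each passing solution with failing solutions for the same problem. This is an assumption about the general recipe, not the actual CodeScalerPair-51K pipeline; `Rollout`, `build_pairs`, and `max_pairs_per_problem` are hypothetical names.

```python
from collections import defaultdict
from itertools import islice, product
from typing import Dict, List, NamedTuple


class Rollout(NamedTuple):
    problem_id: str
    solution: str
    passed: bool  # outcome of running the problem's verified tests


def build_pairs(
    rollouts: List[Rollout], max_pairs_per_problem: int = 4
) -> List[Dict[str, str]]:
    """Pair passing (chosen) with failing (rejected) solutions per problem."""
    by_problem = defaultdict(lambda: {True: [], False: []})
    for r in rollouts:
        by_problem[r.problem_id][r.passed].append(r.solution)

    pairs = []
    for pid, groups in by_problem.items():
        # Problems with only passing or only failing rollouts yield no pairs.
        combos = product(groups[True], groups[False])
        for chosen, rejected in islice(combos, max_pairs_per_problem):
            pairs.append(
                {"problem_id": pid, "chosen": chosen, "rejected": rejected}
            )
    return pairs
```

Capping the number of pairs per problem is one simple way to keep easy problems with many rollouts from dominating the dataset.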

🤖 Models

We release CodeScaler at three scales: 1.7B, 4B, and 8B.

  • CodeScaler-1.7B: A reward model initialized from Skywork/Skywork-Reward-V2-Qwen3-1.7B and trained on CodeScalerPair-51K.

  • CodeScaler-4B: A reward model initialized from Skywork/Skywork-Reward-V2-Qwen3-4B and trained on CodeScalerPair-51K.

  • CodeScaler-8B: A reward model initialized from Skywork/Skywork-Reward-V2-Qwen3-8B and trained on CodeScalerPair-51K.
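This README does not state the training objective. A common choice for reward models trained on chosen/rejected pairs is the Bradley-Terry loss, so the sketch below assumes that objective purely for illustration. It computes the loss for a single pair of scalar scores; `bradley_terry_loss` is a hypothetical helper name.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch on the sign of m so that
    # the exponent passed to exp() is never positive.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is close to zero when the chosen solution is scored well above the rejected one, and grows roughly linearly with the margin when the ordering is reversed.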

🚀 Quick Start

⚙️ Environment Setup

Step 1: Clone the repository

git clone https://github.com/LARK-AI-Lab/CodeScaler.git
cd CodeScaler

Step 2: Create a conda environment

conda create -n CodeScaler python==3.10.19
conda activate CodeScaler

Step 3: Install dependencies

pip install -r requirements.txt

Step 4: Install FlashAttention

pip install --no-cache-dir \
 https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/\
flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

💡 Tip: You can also install a FlashAttention build that matches your specific PyTorch and CUDA versions for optimal performance.
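To pick a matching wheel, it helps to know the tags that appear in the wheel filename (Python tag, PyTorch version, CUDA version, C++11 ABI flag). The helper below is a small sketch that prints them; `flash_attn_wheel_hints` is a hypothetical name, and the function falls back gracefully when PyTorch is not installed.

```python
import importlib.util
import sys


def flash_attn_wheel_hints() -> dict:
    """Collect the values that FlashAttention wheel filenames encode."""
    hints = {
        # e.g. "cp310" for Python 3.10
        "python_tag": f"cp{sys.version_info.major}{sys.version_info.minor}",
        "torch": None,
    }
    if importlib.util.find_spec("torch") is None:
        return hints  # PyTorch not installed; only the Python tag is known

    import torch

    hints["torch"] = torch.__version__             # e.g. "2.6.0+cu124"
    hints["cuda"] = torch.version.cuda             # None on CPU-only builds
    hints["cxx11abi"] = torch.compiled_with_cxx11_abi()
    return hints


if __name__ == "__main__":
    print(flash_attn_wheel_hints())
```

Compare the printed values with the `cu12torch2.6cxx11abiFALSE-cp310` portion of the wheel name used in Step 4.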

📦 Data Preparation

Prepare the training and evaluation datasets:

# Prepare training dataset
python data/prepare_deepcoder.py
# Download and prepare evaluation dataset
python data/download_dataset.py
python data/prepare_evaluation.py

💡 Tip: The training dataset is based on the DeepCoder training datasets, and the evaluation covers multiple coding benchmarks.

πŸ‹οΈ Training

Train Qwen3-8B-Base on DeepCoder dataset using CodeScaler as reward model:

# Login to Weights & Biases for experiment tracking
wandb login
# Start training
bash scripts/train.sh

💡 Tip: Check scripts/train.sh to customize hyperparameters such as learning rate, batch size, and training epochs.

📈 Evaluation

Evaluate your trained model:

# Run evaluation on benchmarks
bash scripts/eval.sh

💻 Use CodeScaler for RM Scoring

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = 'LARK-Lab/CodeScaler-8B'
tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()
question = """\
Given an integer array nums and an integer k, return the total number of continuous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input:
nums = [1, 1, 1], k = 2

Output:
2
```
"""
# Correct solution using prefix sum approach
program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: subarray starting from index 0

    for num in nums:
        prefix += num

        if prefix - k in freq:
            count += freq[prefix - k]

        freq[prefix] += 1

    return count
"""
# Incorrect solution using sliding window (fails on negative numbers)
program_wrong = """\
def subarraySum(nums, k):
    left = 0
    curr_sum = 0
    count = 0

    for right in range(len(nums)):
        curr_sum += nums[right]

        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1

        if curr_sum == k:
            count += 1

    return count
"""
# Build one (user question, assistant program) conversation per candidate
convs = [
    [
        {
            "content": question,
            "role": "user",
        },
        {
            "role": "assistant",
            "content": program
        }
    ] for program in [program_correct, program_wrong]
]
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]
toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)
with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )
    # One scalar logit per conversation; higher means a better solution
    scores = outputs.logits.squeeze(-1).cpu().tolist()
print("RM Scores:", scores)
# RM Scores: [6.5424089431762695, -0.0312652587890625]
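For test-time scaling, scores like the ones above can drive a simple best-of-N selection: sample several candidate programs, score each one, and keep the highest-scoring candidate. The sketch below is a minimal illustration under that assumption; `best_of_n` and the `score_fn` callable are hypothetical names, and `score_fn` stands in for the tokenizer and reward model call shown above.

```python
from typing import Callable, List, Sequence, Tuple


def best_of_n(
    candidates: Sequence[str],
    score_fn: Callable[[Sequence[str]], List[float]],
) -> Tuple[str, float]:
    """Return the candidate with the highest reward score, and that score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    scores = score_fn(candidates)
    if len(scores) != len(candidates):
        raise ValueError("score_fn must return one score per candidate")
    # Ties resolve to the earliest candidate.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

With the two programs from the example above, a `score_fn` returning the printed scores would select the prefix-sum solution. No test cases are executed at any point; only reward model forward passes are needed.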

Citation

If you find our work helpful, please consider citing:

@misc{zhu2026codescalerscalingcodellm,
 title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models}, 
 author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
 year={2026},
 eprint={2602.17684},
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2602.17684}, 
}

About

The official repo for "CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models"
