
CodeQwen

Execution-Driven Fine-Tuning for Python Code Generation

Introduction

This project aims to enhance the code generation capabilities of Qwen3-8B on the EvalPlus benchmark by leveraging high-quality data curation and execution-driven optimization.

We employ a three-stage training pipeline:

  1. SFT (Supervised Fine-Tuning): Uses a quality-first hybrid dataset filtered via AST parsing and execution checks.
  2. DPO (Direct Preference Optimization): Aligns the model using execution-validated preference pairs (correct vs. incorrect solutions).
  3. GRPO (Group Relative Policy Optimization): Optimizes the model using a multi-dimensional reward function based on test pass rates and syntax validity.

Our approach grounds the training signal in actual code execution, enabling the 8B model to achieve performance comparable to larger commercial systems.

Methodology

I. SFT (Supervised Fine-Tuning)

  • Hybrid Dataset Curation: A quality-first blended dataset constructed from Magicoder-Evol, Magicoder-OSS, Python-Code-Instructions, CodeAlpaca, and Code-Exercises.
  • Syntax-Aware Code Filtering: Syntax integrity is verified through AST parsing and feature-based heuristics (a minimal sketch follows this list).
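
The syntax filter can be built directly on Python's standard ast module. Below is a minimal sketch under assumed conventions (each sample exposes its code under a "code" key, and the heuristic keeps only samples that define a function or class); it illustrates the idea rather than the repository's exact implementation.

# Minimal sketch of the syntax-aware filter. The sample field names and the
# "must define a function or class" heuristic are illustrative assumptions.
import ast

def passes_syntax_filter(code: str) -> bool:
    """Keep a sample only if its code parses and defines a function or class."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    # Feature-based heuristic: screen out prose fragments and trivial snippets.
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        for node in ast.walk(tree)
    )

samples = [
    {"instruction": "Reverse a string.", "code": "def rev(s):\n    return s[::-1]"},
    {"instruction": "Broken example.", "code": "def rev(s) return s[::-1]"},
]
kept = [s for s in samples if passes_syntax_filter(s["code"])]
print(len(kept))  # 1 -- the second sample fails to parse

Because AST parsing rejects malformed samples without executing any code, this filtering step stays cheap even on large blended datasets.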

II. SFT -> DPO

  • Hybrid Preference Generation: Combines self-generated preference pairs from MBPP programming solutions with parsed and cleaned external code-preference datasets.
  • Execution-Validated Preference Construction: Automatically executes candidate code and runs unit tests to verify correctness (see the sketch after this list).
  • Iterative Multi-Round DPO: Conducts multiple rounds of DPO training with curriculum learning.
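
As a rough illustration of execution-validated preference construction, the sketch below runs each candidate solution together with MBPP-style assert tests in a fresh interpreter and pairs passing solutions (chosen) with failing ones (rejected). The field names, the 5-second timeout, and the all-pairs pairing strategy are illustrative assumptions, and the bare subprocess call stands in for the project's sandboxed execution.

# Sketch of execution-validated preference pair construction (assumptions as noted above).
import os
import subprocess
import sys
import tempfile

def passes_tests(code: str, tests: list[str], timeout: float = 5.0) -> bool:
    """Run a candidate solution plus its assert-based tests in a fresh interpreter."""
    program = code + "\n\n" + "\n".join(tests) + "\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def build_preference_pairs(prompt: str, candidates: list[str], tests: list[str]) -> list[dict]:
    """Pair every passing candidate (chosen) with every failing one (rejected)."""
    passing = [c for c in candidates if passes_tests(c, tests)]
    failing = [c for c in candidates if c not in passing]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good in passing
        for bad in failing
    ]

tests = ["assert add(2, 3) == 5"]
candidates = [
    "def add(a, b):\n    return a + b",   # passes the test
    "def add(a, b):\n    return a - b",   # fails the test
]
pairs = build_preference_pairs("Write a function add(a, b) that returns a + b.", candidates, tests)
print(len(pairs))  # 1 preference pair: correct vs. incorrect solution

Executing each candidate in a separate process keeps crashes and infinite loops from taking down the data-generation script, and the resulting {prompt, chosen, rejected} records are the usual input format for DPO training.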

III. SFT -> GRPO

  • Code Execution-Validated Reward Mechanism: A reward function is constructed by dynamically executing generated code within a secure sandbox (sketched below).
  • Training Optimization: Integrates adaptive early stopping and best-model retention.
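
The sketch below shows one way such a multi-dimensional reward could look: a small fixed credit for syntactically valid code plus a larger term proportional to the fraction of unit tests passed. The 0.2/0.8 weights are illustrative assumptions, and tests run in-process here purely for brevity; the actual pipeline executes generated code inside a secure sandbox.

# Sketch of a multi-dimensional reward: syntax validity plus test pass rate.
import ast

def run_single_test(code: str, test: str) -> bool:
    """Execute the candidate and a single assert-based test; return pass/fail."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate solution
        exec(test, namespace)   # run one assert against it
        return True
    except Exception:
        return False

def compute_reward(code: str, tests: list[str]) -> float:
    """0.2 for valid syntax plus 0.8 * fraction of unit tests passed (assumed weights)."""
    try:
        ast.parse(code)
    except SyntaxError:
        return 0.0  # unparsable completions earn no reward at all
    pass_rate = sum(run_single_test(code, t) for t in tests) / max(len(tests), 1)
    return 0.2 + 0.8 * pass_rate

reward = compute_reward(
    "def add(a, b):\n    return a + b",
    ["assert add(1, 2) == 3", "assert add(0, 0) == 1"],
)
print(round(reward, 2))  # 0.6 -- valid syntax, one of the two tests passes

In GRPO, a scalar reward like this is computed for every completion in a sampled group, and each completion's advantage is measured relative to the group's mean reward, so the relative quality of solutions to the same prompt is what drives the update.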

Repository Structure

  • stage1_sft/: Code for Supervised Fine-Tuning.
  • stage2_dpo/: Code for Direct Preference Optimization.
  • stage3_grpo/: Code for Group Relative Policy Optimization.
  • evalplus_results/: Evaluation results on EvalPlus benchmarks.
  • plot_results/: Scripts for plotting results.

Getting Started

Installation

Clone the repository:

git clone https://github.com/hza2002/CodeQwen.git

Install dependencies (Python 3.10.19):

cd CodeQwen
pip install -r requirements.txt

Usage

Stage 1: Supervised Fine-Tuning (SFT)

Train the Qwen3-8B base model on the curated SFT dataset.

cd stage1_sft && ./run.sh

Stage 2: Direct Preference Optimization (DPO)

Generate preference data and train the DPO model.

cd stage2_dpo && ./run.sh

Stage 3: Group Relative Policy Optimization (GRPO)

Train using the GRPO method.

cd stage3_grpo && ./run.sh

Results

Starting from the Qwen3-8B base model (62.2% on MBPP+), we achieved the following MBPP+ scores:

  • SFT: 65.6%
  • SFT -> GRPO: 66.7%
  • SFT -> DPO: 69.3% (matching Claude-3-Haiku and outperforming DeepSeek-Coder-6.7B)

[Results chart: MBPP+ accuracy of the base model and each fine-tuning stage]
