Execution-Driven Fine-Tuning for Python Code Generation
This project aims to enhance the code generation capabilities of Qwen3-8B on the EvalPlus benchmark by leveraging high-quality data curation and execution-driven optimization.
We employ a three-stage training pipeline:
- SFT (Supervised Fine-Tuning): Uses a quality-first hybrid dataset filtered via AST parsing and execution checks.
- DPO (Direct Preference Optimization): Aligns the model using execution-validated preference pairs (correct vs. incorrect solutions).
- GRPO (Group Relative Policy Optimization): Optimizes the model using a multi-dimensional reward function based on test pass rates and syntax validity.
Our approach grounds the training signal in actual code execution, enabling the 8B model to achieve performance comparable to larger commercial systems.
- Hybrid Dataset Curation: A quality-first blended dataset constructed from Magicoder-Evol, Magicoder-OSS, Python-Code-Instructions, CodeAlpaca, and Code-Exercises.
- Syntax-Aware Code Filtering: Syntax integrity verified through AST parsing and feature-based heuristics (see the filtering sketch after this list).
- Hybrid Preference Generation: Combines preference pairs self-generated from MBPP programming problems with parsed and cleaned external code preference datasets.
- Execution-Validated Preference Construction: Automatically executes code and runs unit tests to verify correctness.
- Iterative Multi-Round DPO: Conducts multiple rounds of training with curriculum learning.
- Code Execution-Validated Reward Mechanism: A reward function constructed by dynamically executing generated code within a secure sandbox.
- Training Optimization: Integrates adaptive early stopping and best model retention.
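To illustrate the syntax-aware filtering step above, here is a minimal sketch of how a sample could be screened with Python's `ast` module. The function name, heuristics, and thresholds are illustrative assumptions, not the repository's exact implementation.

```python
import ast

def passes_syntax_filter(code: str, min_lines: int = 3) -> bool:
    """Illustrative filter: keep a sample only if it parses and defines a function.

    The heuristics (function present, minimum length) are assumptions for this sketch.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False  # drop samples that do not parse
    has_function = any(isinstance(node, ast.FunctionDef) for node in ast.walk(tree))
    return has_function and len(code.strip().splitlines()) >= min_lines

# Example: keep only syntactically valid samples from a raw list of (instruction, code) pairs.
raw_samples = [("Reverse a string.", "def rev(s):\n    return s[::-1]\n")]
clean_samples = [(q, c) for q, c in raw_samples if passes_syntax_filter(c, min_lines=2)]
```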
- stage1_sft/: Code for Supervised Fine-Tuning.
- stage2_dpo/: Code for Direct Preference Optimization.
- stage3_grpo/: Code for Group Relative Policy Optimization.
- evalplus_results/: Evaluation results on EvalPlus benchmarks.
- plot_results/: Scripts for plotting results.
Clone the repository:
git clone https://github.com/hza2002/CodeQwen.git
Install dependencies (Python 3.10.19):
cd CodeQwen
pip install -r requirements.txt
Train the base Qwen model on the SFT dataset.
cd stage1_sft && ./run.sh
Generate preference data and train the DPO model.
cd stage2_dpo && ./run.sh
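The preference data generation behind this stage can be sketched roughly as follows: candidate solutions are executed against their unit tests, and passing and failing solutions for the same prompt are paired as chosen/rejected examples. The function names, pairing strategy, and timeout below are assumptions for illustration, not the repository's exact code.

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(solution: str, test_code: str, timeout: int = 10) -> bool:
    """Run a candidate solution plus its unit tests in a separate Python process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hanging code counts as incorrect
    finally:
        os.unlink(path)

def build_preference_pairs(prompt: str, candidates: list[str], test_code: str) -> list[dict]:
    """Pair passing (chosen) and failing (rejected) solutions for DPO training."""
    passed = [c for c in candidates if passes_tests(c, test_code)]
    failed = [c for c in candidates if c not in passed]
    return [{"prompt": prompt, "chosen": p, "rejected": r} for p, r in zip(passed, failed)]
```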
Train using the GRPO method.
cd stage3_grpo && ./run.sh
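A minimal sketch of the execution-validated reward used in this stage is shown below: syntax validity is checked with ast.parse, and the remaining credit comes from the unit-test pass rate, with generated code executed in a plain subprocess standing in for the repository's secure sandbox. The weights (0.2 / 0.8), the hard penalty for syntax errors, and the helper names are assumptions, not the repository's exact reward definition.

```python
import ast
import os
import subprocess
import sys
import tempfile

def run_single_test(code: str, test: str, timeout: int = 5) -> bool:
    """Execute the generated code plus one assert-style test in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test)
        path = f.name
    try:
        return subprocess.run([sys.executable, path], capture_output=True, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def code_reward(completion: str, tests: list[str]) -> float:
    """Multi-dimensional reward: syntax validity plus unit-test pass rate (weights illustrative)."""
    try:
        ast.parse(completion)
    except SyntaxError:
        return -1.0  # hard penalty: unparseable code gets no further checks
    pass_rate = sum(run_single_test(completion, t) for t in tests) / max(len(tests), 1)
    return 0.2 + 0.8 * pass_rate  # 0.2 for valid syntax, up to 0.8 from passing tests
```

In a GRPO setup, such a reward would be computed for every sampled completion in a group and the rewards normalized within the group to form the relative advantage signal.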
Starting from the Qwen3-8B base model (62.2% on MBPP+), we achieved the following MBPP+ scores:
- SFT: 65.6%
- SFT -> GRPO: 66.7%
- SFT -> DPO: 69.3% (matching Claude-3-haiku and outperforming DeepSeek-Coder-6.7B)