aider
Aider is AI pair programming in your terminal. Aider is on GitHub and Discord.

Refactoring leaderboard

Aider’s refactoring benchmark asks the LLM to refactor 89 large methods from large Python classes. It is a more challenging benchmark that tests the model’s ability to output long chunks of code without skipping sections or making mistakes. It was developed to provoke and measure GPT-4 Turbo’s "lazy coding" habit.
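To make "lazy coding" concrete, here is a minimal sketch of one way to flag model output that skips sections by leaving placeholder comments in place of real code. This is not the check the benchmark itself performs. The patterns and the `find_elisions` helper are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for "elided code" placeholders.
# These are illustrative assumptions, not taken from aider's benchmark.
ELISION_PATTERNS = [
    # A comment that starts with an ellipsis, e.g. "# ..."
    re.compile(r"#\s*\.\.\."),
    # A comment such as "# rest of the code" or "# existing implementation"
    re.compile(
        r"#.*\b(rest of|remaining|existing)\b.*\b(code|method|implementation)\b",
        re.IGNORECASE,
    ),
]


def find_elisions(source: str) -> list[int]:
    """Return 1-based line numbers that look like skipped-code placeholders."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in ELISION_PATTERNS):
            hits.append(lineno)
    return hits


lazy_output = (
    "def process(self):\n"
    "    self.prepare()\n"
    "    # ... rest of the method unchanged ...\n"
    "    return self.result\n"
)
print(find_elisions(lazy_output))
```

A text heuristic like this only catches explicit placeholders. Code that is silently dropped would need a structural comparison of the original and refactored source.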

The refactoring benchmark requires a large context window to work with large source files. Therefore, results are available for fewer models.

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| claude-3-5-sonnet-20241022 | 92.1% | 91.0% | aider --sonnet | diff |
| o1-preview | 75.3% | 57.3% | aider --model o1-preview | diff |
| claude-3-opus-20240229 | 72.3% | 79.5% | aider --opus | diff |
| claude-3.5-sonnet-20240620 | 64.0% | 76.4% | aider --sonnet | diff |
| gpt-4o | 62.9% | 53.9% | aider | diff |
| gpt-4-1106-preview | 50.6% | 39.3% | aider --model gpt-4-1106-preview | udiff |
| gpt-4o-2024-08-06 | 49.4% | 89.9% | aider --model openai/gpt-4o-2024-08-06 | diff |
| gemini/gemini-1.5-pro-latest | 49.4% | 7.9% | aider --model gemini/gemini-1.5-pro-latest | diff-fenced |
| o1-mini | 44.9% | 29.2% | aider --model o1-mini | diff |
| gpt-4-turbo-2024-04-09 (udiff) | 34.1% | 30.7% | aider --gpt-4-turbo | udiff |
| gpt-4-0125-preview | 33.7% | 47.2% | aider --model gpt-4-0125-preview | udiff |
| DeepSeek Coder V2 0724 (deprecated) | 32.6% | 59.6% | aider --model deepseek/deepseek-coder | diff |
| DeepSeek Chat V2.5 | 31.5% | 67.4% | aider --deepseek | diff |
| gpt-4-turbo-2024-04-09 (diff) | 21.4% | 6.8% | aider --model gpt-4-turbo-2024-04-09 | diff |
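As a rough guide to reading the percentage columns, the sketch below shows how a pass count maps to a one-decimal percentage. It assumes a run covers all 89 tasks, which may not hold for every row. The counts 82 and 67 are hypothetical values picked because they reproduce the first two rows under that assumption. They are not published figures.

```python
TOTAL_TASKS = 89  # number of refactoring tasks stated in the benchmark description


def percent(count: int, total: int = TOTAL_TASKS) -> float:
    """Express `count` out of `total` as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


# Hypothetical pass counts, assumed only for illustration.
print(percent(82))
print(percent(67))
```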

By Paul Gauthier, last updated April 12, 2025.
