OpenAutoCoder/live-swe-agent

Live-SWE-agent | The First Live AI Software Agent

📣News | 🏆Leaderboard | 📊Comparison | 🚀Setup | ⚙️Artifacts | 📜Attribution | 🙏Acknowledgements

Live-SWE-agent is the first live, runtime self-evolving software engineering agent that expands and revises its own capabilities on the fly while working on a real-world issue. Our key insight is that software agents are themselves software systems, and modern LLM-based agents already possess the intrinsic capability to extend or modify their own behavior at runtime.
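The idea can be sketched in a few lines (a hypothetical illustration only, not Live-SWE-agent's actual implementation): because the agent's tool set is ordinary mutable program state, the agent can write source code for a capability it lacks mid-task and install it into itself before the next step.

```python
# Minimal, hypothetical sketch of runtime self-evolution:
# the agent's tool registry is plain mutable state, so LLM-written
# code can be compiled and registered as a new tool on the fly.

tools = {}  # name -> callable; the agent's current capabilities

def register_tool(name, source):
    """Compile LLM-emitted source and add it as a new tool."""
    namespace = {}
    exec(source, namespace)  # trust boundary: sandbox this in practice
    tools[name] = namespace[name]

# Mid-task, the model emits code for a capability it lacks...
new_tool_source = '''
def count_todos(text):
    """Count TODO markers in a file's contents."""
    return text.count("TODO")
'''
register_tool("count_todos", new_tool_source)

# ...and can invoke it immediately on the same issue.
print(tools["count_todos"]("# TODO: fix\nok\n# TODO: test"))  # -> 2
```

The registry name and `register_tool` helper are invented for illustration; the real agent evolves itself through its own scaffold, not this API.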

📣 News

  • [Nov 24th, 2025]: Claude Opus 4.5 + Live-SWE-agent scores 79.2% on SWE-bench Verified, leading all current open-source scaffolds and coming very close to Anthropic’s internal, manually engineered scaffold for Opus 4.5!
  • [Nov 20th, 2025]: Gemini 3 Pro + Live-SWE-agent scores 77.4% on SWE-bench Verified, outperforming all available models (including Claude 4.5)!
  • [Nov 17th, 2025]: Live-SWE-agent achieves the new state-of-the-art solve rate of 45.8% on SWE-Bench Pro!
  • [Nov 17th, 2025]: We've released Live-SWE-agent 1.0.0!

🏆 Leaderboard

For software tasks, recent LLMs are often benchmarked using manually engineered, proprietary agent scaffolds, which makes it difficult to compare the true capabilities of different models fairly.

Live-SWE-agent not only demonstrates that a minimal, open, and live scaffold can already outperform proprietary scaffolds, but also offers a unified platform that enables genuinely fair, apples-to-apples comparisons for future model releases.

As shown below, on our leaderboard of recent models (all evaluated with Live-SWE-agent), Claude Opus 4.5 holds the #1 spot by a large margin with a score of 79.2% on SWE-bench Verified.

More model scores are coming soon! For more details, please visit our leaderboard. Feel free to submit your model's evaluation results to help build a more comprehensive and fair benchmarking platform!

📊 Comparison

The graph below compares Live-SWE-agent with state-of-the-art open-source solutions and proprietary commercial agent scaffolds on SWE-bench Verified and SWE-Bench Pro.

🚀 Setup

We built Live-SWE-agent on top of the popular mini-swe-agent framework with very minimal modifications.

To use Live-SWE-agent, simply install mini-swe-agent first using this guide and use the custom Live-SWE-agent config:

mini --config config/livesweagent.yaml # using custom Live-SWE-agent config

See the config folder for more details.
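End to end, the setup sketched above amounts to the following (assuming mini-swe-agent is installed from PyPI, which provides the `mini` CLI; see its install guide if your setup differs):

```shell
# Install the mini-swe-agent framework (provides the `mini` command).
pip install mini-swe-agent

# From this repository's root, run with the Live-SWE-agent config.
mini --config config/livesweagent.yaml
```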

⚙️ Artifacts

You can download the complete trajectories, patches, and results of Live-SWE-agent in our v1.0.0 release:

  • swebench_verified: complete runs on SWE-bench Verified
  • swebench_pro: complete runs on SWE-Bench Pro

You can also obtain them from our 🤗 Hugging Face datasets.

📜 Attribution

@article{livesweagent,
 author = {Xia, Chunqiu Steven and Wang, Zhe and Yang, Yan and Wei, Yuxiang and Zhang, Lingming},
 title = {Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?},
 year = {2025},
 journal = {arXiv preprint},
}

🙏 Acknowledgements
