swe-bench

Here are 85 public repositories matching this topic...

Languages: Python 54, TypeScript 9, Shell 6, Go 2, HTML 2, Rust 2, JavaScript 1, Kotlin 1, TeX 1

Sort: Most stars

smallcloudai / refact

Star 3.6k

AI Agent that handles engineering tasks end-to-end: integrates with developers’ tools, plans, executes, and iterates until it achieves a successful result.

open-source enterprise vscode self-hosted developer-tools on-prem fine-tuning rag ai-agent swe-bench

Updated May 30, 2026
Rust

Human-Agent-Society / CORAL

Star 723

CORAL is a robust, lightweight infrastructure for multi-agent autonomous self-evolution, built for autoresearch. Works with Claude Code, Codex, Cursor, OpenCode, Kiro, and more.

opencode multi-agent code-generation evolutionary-algorithm codex autonomous-agents agent-framework large-language-models llm-agents agentic-ai self-evolving claude-code coding-agent alpha-evolve swe-bench self-evolving-agents autoresearch

Updated Jun 12, 2026
Python

bernstein

Audit-grade multi-agent orchestration for CLI coding agents (Claude Code, Codex, Gemini CLI, +40 more). HMAC-chained audit log, signed agent cards, per-artefact lineage, air-gap deploy. The orchestrator your compliance team will sign off on. https://bernstein.run

python multi-agent ai-agents cli-tool agent-framework llm aider anthropic agentic-ai ai-coding model-context-protocol mcp-server claude-code codex-cli coding-agent swe-bench agent-orchestrator hmac-audit parallel-worktrees deterministic-scheduler

Updated Jun 13, 2026
Python

JARVIS-Xs / SE-Agent

Star 278

SE-Agent is a self-evolution framework for LLM code agents. It enables trajectory-level evolution that exchanges information across reasoning paths via Revision, Recombination, and Refinement, expanding the search space and escaping local optima. On SWE-bench Verified, it achieves SOTA performance.

mcts code-fix swe-agent test-time-scaling claude-code code-agent swe-bench self-evolve

Updated Sep 23, 2025
Python

hwfengcs / DM-Code-Agent

Star 141

Lightweight, auditable Python code agent (~1500 LOC) — ReAct + Planner + Reflexion + Hybrid RAG, with SWE-bench Lite eval and trace replay.

agent mcp rag llm llm-agent react-agent agent-skills agent-evaluation reflexion-agent code-agent swe-bench

Updated Jun 4, 2026
Python

usetig / sage

Star 96

An LLM council that reviews your coding agent's every move

cli devtools developer-tools code-review cursor codex ai-agents code-reviews react-ink coding-assistant anthropic coding-agents gemini-cli coderabbit vibe-coding vibecoding claude-code swe-bench claude-code-hooks llm-council

Updated Apr 28, 2026
TypeScript

logic-star-ai / insights

Star 49

We track and analyze the activity and performance of autonomous code agents in the wild

agents swe-agent swe-bench

Updated Dec 5, 2025
TypeScript

HumphreySun98 / repoagentbench

Star 32

SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: claude-code, aider (Opus 4.7 / GPT-5.5 / Sonnet 4.6 / Gemini 3.1 Pro).

benchmark developer-tools ai-agents aider llm-eval coding-agents agent-evals swe-bench gemini-3-1-pro claude-opus-4-7 gpt-5-5

Updated Apr 30, 2026
Python

shreyash-sharma / provenant

Star 23

Wiki-based retrieval for AI coding agents. × token reduction, +24pp Coverage@5 on SWE-bench Verified.

python ai mcp developer-tools llm retrieval-augmented-generation codebase-indexing swe-bench

Updated May 28, 2026
Python

strands-labs / benchmark-harnesses

Star 21

Strands-based agents and harnesses for agentic benchmarks.

machine-learning ai benchmarks llm genai agentic agentic-ai swe-bench strands-agents terminal-bench strands-labs

Updated Jun 9, 2026
Python

wbopan / retro-harness

Star 20

RHO: Evolving Agents in the Dark — Retrospective Harness Optimization via Self-Preference. Improving LLM agents from unlabeled past trajectories (arXiv:2606.05922).

research self-supervised llm prompt-optimization llm-agents swe-bench agent-optimization

Updated Jun 12, 2026
Python

KRLabsOrg / squeez

Star 18

Squeeze verbose LLM agent tool output down to only the relevant lines

python pytorch lora tool-use llm context-compression coding-agent swe-bench

Updated Apr 27, 2026
Python

vishal-dehurdle / state-harness

Star 12

Runtime safety net for LLM agents. Detects token spirals, kills doomed tasks early, tells you exactly why. Rust core, Python SDK. pip install state-harness

multi-agent circuit-breaker agents cost-control runtime-monitoring rust-python pyo3 lyapunov-stability llm agent-safety swe-bench failure-diagnostics token-spiral

Updated Jun 12, 2026
Python

Vexp-ai / vexp-swe-bench

Star 11

Open benchmark for AI coding agents on SWE-bench Verified. Compare resolution rates, cost, and unique wins.

benchmark mcp developer-tools ai-agents ai-coding claude-code swe-bench context-engineering

Updated May 2, 2026
Shell

verseles / showdown

Star 11

Comprehensive LLM leaderboard aggregating multiple benchmarks into transparent rankings. Open data, community-driven, built with Svelte.

benchmark ai score lmarena swe-bench

Updated Apr 23, 2026
HTML

xmpuspus / ai-workflow-benchmark

Star 10

Benchmark harness measuring AI coding tool+workflow performance, not just model capability. 100 tasks, sigmoid scoring, 12 capability dimensions, gap analysis.

benchmark developer-tools code-generation ai-agents llm-evaluation llm-benchmarking coding-agents ai-coding claude-code swe-bench

Updated May 30, 2026
Python

greynewell / mcpbr

Star 10

Benchmark your MCP server.

python benchmarking machine-learning mcp ml-evaluation llm-evaluation model-context-protocol swe-bench

Updated Apr 28, 2026
Python

agentic-trust-labs / glassbox-ai

Star 8

Lean orchestration platform for enterprise AI, where each decision costs hundreds. State machine core, HITL as a first-class state, corrections that accumulate. The first use case is a coding agent. Open research, early stage.