#

terminal-bench

Here are 19 public repositories matching this topic...

Languages: Python (11), TypeScript (2), C++ (1), Dockerfile (1), Go (1), JavaScript (1)

harbor-framework / harbor

Harbor is a framework for running agent evaluations and creating and using RL environments.

rl-environments evals terminal-bench

Updated Jun 13, 2026
Python

itayinbarr / little-coder

A harness optimized for smaller LLMs

benchmark code-generation tool-use local-llm ollama qwen small-language-models coding-agents ai-coding-assistant coding-agent terminal-bench aider-polygot

Updated Jun 8, 2026
TypeScript

LiberCoders / CLI-Gym

Official Implementation of "CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion"

data datapipeline aicoding codeagent terminal-bench

Updated May 26, 2026
Python

strands-labs / benchmark-harnesses

Strands-based agents and harnesses for agentic benchmarks.

machine-learning ai benchmarks llm genai agentic agentic-ai swe-bench strands-agents terminal-bench strands-labs

Updated Jun 9, 2026
Python

plaume8 / spoox

Spoox CLI - Terminal Agent - SPlit lOOp eXand agent

terminal-based multi-agent-systems agentic-ai terminal-agent terminal-bench agent-cli

Updated May 19, 2026
Python

li-boxuan / Terminal-bench-OpenHands-trajectories

Trajectories for running OpenHands on Terminal Bench

trajectories llm openhands terminal-bench

Updated Jul 25, 2025

scitix / Agent-Sandbox

Fast, Multi-Cloud Sandbox Engine for AI Agents

kubernetes reinforcement-learning sandbox agents e2b agent-sandbox rlvr swe-bench terminal-bench agentic-rl e2b-compatible swe-rex

Updated Jun 11, 2026
Go

HansBug / oc-repl

Codex-style REPL for terminal-agent models trained with camel-ai TerminalToolkit / terminal-bench terminus-2 protocols. Built to drive HansBug/OpenClaw-RL checkpoints.

repl terminus openai-compatible camel-ai qwen3 terminal-agent terminal-bench openclaw-rl

Updated May 12, 2026
Python

sam-siavoshian / Symposium

Multi-agent reasoning MCP server for Claude Code. Spawns parallel research agents to find knowledge LLMs don't have. +23.1% on Terminal Bench 2.0 SWE tasks.

research mcp multi-agent developer-tools ai-agents claude training-data fine-tuning dpo nia llm model-context-protocol mcp-server claude-code terminal-bench research-engine

Updated May 23, 2026
TypeScript

mrazakhan / ContinuumAI

Solving the amnesiac problem for LLM agents. Research series on agents that compound knowledge across sessions — first measurement: +4.6 pp accuracy lift on Terminal-Bench 2.1 with an open-weight executor and a single failure-derived skill file.

reproducible-research ai-agents cost-optimization skill-learning openrouter agent-benchmark terminal-bench open-weight-llm

Updated Jun 11, 2026

tianyi-zhang-02 / monitorstress

Multi-layer audit framework for agent benchmark integrity

llm-agents llm-as-judge agent-evaluation swe-bench reward-hacking terminal-bench benchmark-auditing

Updated Jun 8, 2026
Python

ayush0824 / parse-log-stats

A reproducible Terminal-Bench task that evaluates a Bash script for parsing log files.

python llm agentic-ai terminal-bench

Updated Apr 6, 2026
Dockerfile

ruslandavidenko / terminal-bench-log-summary-audit

Deterministic Linux log analysis benchmark task for AI-agent evaluation using Docker and Terminal-Bench

python linux docker benchmarking automation pytest ai-agents terminal-bench

Updated May 26, 2026
Python

Ajeenckya5 / LLM_Agent_Distillation

Full self-improving long-horizon LLM agent with strategy memory, failure analysis, Grok teacher labels, and QLoRA student distillation.

llama agents knowledge-distillation llm chromadb qlora terminal-bench long-horizon-tasks self-improving-agents agentbench

Updated May 26, 2026
Python

mtepenner / snorkel-tasks

Central repository for Project Terminus task submissions. These environments and Oracle solutions are designed to evaluate state-of-the-art AI agents (like GPT-5.2 and Claude Opus 4.6) on complex, multi-step engineering challenges within a sandboxed terminal.

ai-agents llm-evaluation cli-automation terminal-bench project-terminus snorkel-ai

Updated May 22, 2026
C++

piyushhhxyz / vorflux-swe-benchmarks

Transparent benchmark results for Vorflux — SWE-bench Verified (91%) & Terminal Bench 2 (86%)

evaluation benchmarks ai-agent swe-bench terminal-bench vorflux

Updated May 20, 2026
JavaScript

lemegetonV / terminal-bench-task-workflow

Company-neutral workflow kit for creating, reviewing, calibrating, and packaging Harbor / Terminal-Bench tasks with LLM agents.

harbor llm-agents terminal-bench task-authoring benchmark-tasks

Updated May 18, 2026
Python

sady4850 / hookele-agent

Autonomous coding agent for Terminal-Bench 2.0 — 61.6% ± 1.9 on the official leaderboard with GPT-5.1 Codex Mini.

benchmark openai autonomous-agent ai-agent llm-agent coding-agent terminal-bench

Updated May 20, 2026
Python

basilisk-labs / agentplane-harbor-adapter

⚓ Harbor benchmark adapter for running AgentPlane as a reproducible coding-agent harness with evidence artifacts.

python benchmark evaluation developer-tools harbor ai-agents coding-agents terminal-bench agentplane

Updated May 3, 2026
Python
