Harbor is a framework for running agent evaluations and creating and using RL environments.
Updated Jun 13, 2026 - Python
A harness optimized for smaller LLMs
Official Implementation of "CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion"
Strands-based agents and harnesses for agentic benchmarks.
Spoox CLI - Terminal Agent - SPlit lOOp eXpand agent
Trajectories for running OpenHands on Terminal Bench
Fast, Multi-Cloud Sandbox Engine for AI Agents
Codex-style REPL for terminal-agent models trained with camel-ai TerminalToolkit / terminal-bench terminus-2 protocols. Built to drive HansBug/OpenClaw-RL checkpoints.
Multi-agent reasoning MCP server for Claude Code. Spawns parallel research agents to find knowledge LLMs don't have. +23.1% on Terminal Bench 2.0 SWE tasks.
Solving the amnesiac problem for LLM agents. Research series on agents that compound knowledge across sessions — first measurement: +4.6 pp accuracy lift on Terminal-Bench 2.1 with an open-weight executor and a single failure-derived skill file.
Multi-layer audit framework for agent benchmark integrity
Reproducible Terminal-Bench task that evaluates a Bash script for parsing log files.
Deterministic Linux log analysis benchmark task for AI-agent evaluation using Docker and Terminal-Bench
Full self-improving long-horizon LLM agent with strategy memory, failure analysis, Grok teacher labels, and QLoRA student distillation.
Central repository for Project Terminus task submissions. These environments and Oracle solutions are designed to evaluate state-of-the-art AI agents (like GPT-5.2 and Claude Opus 4.6) on complex, multi-step engineering challenges within a sandboxed terminal.
Transparent benchmark results for Vorflux — SWE-bench Verified (91%) & Terminal Bench 2 (86%)
Company-neutral workflow kit for creating, reviewing, calibrating, and packaging Harbor / Terminal-Bench tasks with LLM agents.
Autonomous coding agent for Terminal-Bench 2.0 — 61.6% ± 1.9 on the official leaderboard with GPT-5.1 Codex Mini.
⚓ Harbor benchmark adapter for running AgentPlane as a reproducible coding-agent harness with evidence artifacts.