agent-eval

Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).

mcp skill-discovery opentelemetry ai-evaluation gemini-cli claude-code plugin-testing cross-cli agent-eval invocation-rate

Updated Jun 15, 2026
Python

zendodx / evalkit-framework

Star 0

🚀 An open-source automated AI evaluation framework based on Java

java framework java-8 eval eval-framework ai-eval agent-eval

Updated Jun 11, 2026
Java

ttxs69 / awesome-coding-agent-eval

Star 0

A curated list of benchmarks, harnesses, leaderboards, and tools for evaluating AI coding agents.

benchmark leaderboard evaluation awesome-list codex ai-agent llm aider claude-code coding-agent swe-bench agent-eval ai-coding-agent-benchmark coding-agent-benchmark

Updated Jun 8, 2026

rogerchappel / ledgerpet

Star 0

Local-first synthetic finance anomaly trainer for agent evals.

cli synthetic-data local-first agent-eval finance-ops

Updated Jun 13, 2026
JavaScript

stevenchouai / agent-scorecard

Star 0

Trace-first evaluation harness for deciding whether AI agents deserve more tokens, permissions, and trust

python evaluation roi ai-agents proof-chain agent-eval

Updated May 16, 2026
Python

hermes-labs-ai / agent-convergence-scorer

Star 0

agent-convergence-scorer is a CLI and Python library that scores how lexically similar N agent or LLM outputs are: exact-match rate, Jaccard token overlap, divergence point, and a composite convergence score over any list of runs. An eval primitive for measuring reproducibility and fan-out collapse. Lexical, not semantic. Zero deps.

cli benchmark consistency evaluation similarity multi-agent convergence reproducibility agents jaccard divergence llm llm-evaluation ai-reliability eval-harness agent-eval

Updated Jun 7, 2026
Python
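
Two of the lexical measures named in the agent-convergence-scorer description, exact-match rate and Jaccard token overlap, can be illustrated with a minimal sketch. This is not the library's code or API: the function names, the lowercase whitespace tokenizer, and the pairwise averaging are all assumptions made for illustration.

```python
from itertools import combinations


def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization (an assumption, not the tool's tokenizer)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap |A & B| / |A | B| of the two token sets."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty outputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def exact_match_rate(runs: list[str]) -> float:
    """Fraction of run pairs whose outputs are string-identical."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # zero or one run: nothing to disagree with
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_jaccard(runs: list[str]) -> float:
    """Average Jaccard overlap over all pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs from three runs of the same prompt.
runs = ["the cat sat", "the cat sat", "the dog sat"]
print(exact_match_rate(runs))
print(mean_jaccard(runs))
```

In this example only one of the three pairs matches exactly, while the token overlap stays higher because the third run differs by a single word. Both measures are purely lexical, so paraphrases with the same meaning would still score as divergent.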

pingwest-ai / agent-eval

Star 0

Open-source evaluation of general-purpose AI agents on real-world tasks with verifiable outcomes: identical prompts, objectively resolved results, and fully public scoring rubrics. By PingWest / 硅星人

benchmark evaluation ai-agents llm llm-evaluation deep-research agent-eval

Updated Jun 13, 2026

Viprasol-Tech / agentcheck

Star 0

Regression testing for AI agents — snapshot tool-calls, diff in CI, fail on regressions. A GitHub Action. By Viprasol Tech.

testing typescript ci snapshot-testing regression-testing ai-agents github-action llm llmops agent-eval

Updated Jun 7, 2026
TypeScript
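
The snapshot-and-diff idea in the agentcheck description can be sketched roughly as follows. The project itself is a TypeScript GitHub Action; this Python sketch is not its implementation, and the record shape (`tool`, `args`) and the function name `diff_tool_calls` are hypothetical.

```python
import json


def diff_tool_calls(snapshot: list[dict], current: list[dict]) -> list[str]:
    """Return human-readable differences between two tool-call sequences."""
    problems = []
    if len(snapshot) != len(current):
        problems.append(f"call count changed: {len(snapshot)} -> {len(current)}")
    # Compare calls position by position; zip stops at the shorter sequence.
    for i, (old, new) in enumerate(zip(snapshot, current)):
        if old != new:
            problems.append(
                f"call {i} changed: {json.dumps(old, sort_keys=True)} -> "
                f"{json.dumps(new, sort_keys=True)}"
            )
    return problems


# Hypothetical recorded snapshot and a fresh run of the same agent task.
snapshot = [{"tool": "search", "args": {"q": "weather"}}]
current = [{"tool": "search", "args": {"q": "weather today"}}]

problems = diff_tool_calls(snapshot, current)
for p in problems:
    print(p)

# A CI job would typically exit with this code so a regression fails the build.
exit_code = 1 if problems else 0
```

A strict equality check like this treats any change as a regression. A real harness would probably need a way to accept intended changes by updating the stored snapshot.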

mizcausevic-dev / agent-eval-arena

Star 0

Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI gates for model promotion.

express typescript platform-engineering regression-detection ml-ops ai-platform ai-governance llm-eval agent-eval ci-gate

Updated Jun 11, 2026
TypeScript

zyy5114 / AgentEvalKit

Star 0

Lightweight CI-native regression and behavior-aware evaluation toolkit for black-box agent workflows.

python cli json-schema tooling regression-testing github-actions llm-evals agent-eval

Updated May 9, 2026
Python

jeremylongshore / j-rig-skill-binary-eval

Star 0

Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score every change yes/no across 7 layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients.

mcp regression-testing skill-evaluation ai-evaluation llm-eval claude-code plugin-testing eval-harness agent-eval binary-criteria

Updated Jun 15, 2026
TypeScript
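
The yes/no scoring across seven layers described for j-rig-skill-binary-eval can be pictured with a small sketch. The layer names come from the description above; the all-layers-must-pass rule, the data shape, and the function name `change_passes` are assumptions, not the project's actual logic.

```python
# Layer names as listed in the repository description.
LAYERS = [
    "package integrity",
    "trigger quality",
    "functional quality",
    "regression protection",
    "baseline value",
    "model variance",
    "rollout safety",
]


def change_passes(results: dict[str, bool]) -> bool:
    """Accept a change only if every layer has a recorded 'yes'.

    Assumption: a single 'no' rejects the change. The source says scores
    are binary, but does not state how layers are combined.
    """
    missing = [layer for layer in LAYERS if layer not in results]
    if missing:
        raise ValueError(f"no verdict recorded for: {', '.join(missing)}")
    return all(results[layer] for layer in LAYERS)


# Hypothetical verdicts for one change: everything passes except one layer.
verdicts = {layer: True for layer in LAYERS}
verdicts["model variance"] = False
print(change_passes(verdicts))
```

Keeping each verdict a plain boolean means there is no partial credit to average away a failing layer, which matches the "never gradients" stance in the description.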

agent-eval

Here are 19 public repositories matching this topic...

zozo123 / meta-harness-on-islo

0-co / company

zozo123 / meta-harness-on-islo-page

linny006 / agent-eval-harness

arthursoares / openclaw-llm-bench

gojiplus / understudy

fitchmultz / agent-eval

tushariitr-19 / assay

jeremylongshore / intent-eval-lab

zendodx / evalkit-framework

ttxs69 / awesome-coding-agent-eval

rogerchappel / ledgerpet

stevenchouai / agent-scorecard

hermes-labs-ai / agent-convergence-scorer

pingwest-ai / agent-eval

Viprasol-Tech / agentcheck

mizcausevic-dev / agent-eval-arena

zyy5114 / AgentEvalKit

jeremylongshore / j-rig-skill-binary-eval