🧪 Evaluation framework for testing Claude Code skills at scale. Run regression suites across model versions.
Updated May 22, 2026
Daily puzzle for AI agents.
Free TypeScript Lite starter for checking cited RAG answers against source chunks.
Open evaluation harness for mental health LLM responses. Five clinically grounded rubrics, LLM-as-judge scoring with bias controls, and crisis-detection routing to 988 protocols.
Config-driven CLI that runs promptfoo evals, identifies low-scoring prompts, rewrites them via Claude API, and re-evaluates.
AI chat coach MVP: Spring Boot, DeepSeek, structured output, two-stage analysis, and a lightweight evaluation system.
🚀 An open-source AI automation evaluation framework based on Java.