The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
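This blurb matches Langfuse's description; to illustrate the OpenAI SDK integration it mentions, here is a minimal sketch using the documented drop-in wrapper. It assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and OPENAI_API_KEY are set in the environment:

```python
# Minimal sketch: swapping `import openai` for Langfuse's drop-in wrapper is
# enough to trace chat completions; credentials come from env vars
# (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, OPENAI_API_KEY).
from langfuse.openai import openai

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)  # the call appears as a trace in Langfuse
```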
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
The LLM Evaluation Framework
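This tagline matches deepeval; assuming that project, a pytest-style check looks roughly like the sketch below (the threshold and strings are illustrative):

```python
# Sketch of a deepeval pytest-style test; metric threshold and texts are illustrative.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Such tests are typically run through the project's pytest wrapper, e.g. `deepeval test run test_shoes.py`.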
Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
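The declarative config described here matches promptfoo's `promptfooconfig.yaml`; assuming that tool, a minimal config comparing two models might look like:

```yaml
# Illustrative promptfooconfig.yaml: one prompt, two providers, one assertion.
prompts:
  - "Summarize in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o
tests:
  - vars:
      text: "LLM evaluation compares model outputs against expectations."
    assert:
      - type: contains
        value: "evaluation"
```

Evaluations then run from the command line (e.g. `npx promptfoo@latest eval`) or inside CI.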
AI Observability & Evaluation
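This tagline matches Arize Phoenix; assuming that project (with its OpenTelemetry helper package installed), a local session plus tracing wiring is a few lines:

```python
# Sketch assuming Arize Phoenix: start the local UI and register an
# OpenTelemetry tracer provider that exports spans to it.
import phoenix as px
from phoenix.otel import register

session = px.launch_app()     # serves the Phoenix UI locally
tracer_provider = register()  # OpenTelemetry spans now flow to Phoenix
print(session.url)
```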
The LLM vulnerability scanner
ReLE Chinese LLM capability benchmark (continuously updated): currently covers 291 large models, including commercial models such as chatgpt, gpt-5, o4-mini, Google's gemini-2.5, Claude4, Zhipu GLM-Z1, ERNIE Bot, qwen-max, Baichuan, iFlytek Spark, SenseTime SenseChat, and minimax, as well as open-source models such as kimi-k2, ernie4.5, minimax-M1, DeepSeek-R1-0528, deepseek-v3.1, qwen3-2507, llama4, phi-4, GLM4.5, gemma3, and mistral. Beyond the leaderboard, it also provides a defect library of more than 2 million LLM failure cases so the community can analyze and improve large models.
🐢 Open-Source Evaluation & Testing library for LLM Agents
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
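This blurb matches Helicone, whose "one line of code" is routing the OpenAI client through its gateway; a sketch assuming a HELICONE_API_KEY environment variable:

```python
# Sketch assuming Helicone's proxy integration: point the OpenAI client at
# Helicone's gateway and authenticate with your Helicone key.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # the "one line" change
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
```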
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
A practical guide to LLMs: from the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Evaluation and Tracking for LLM Experiments and AI Agents
Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.
Build, enrich, and transform datasets using AI models with no code
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
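For a sense of the asset format, a minimal `.prompty` file pairs YAML frontmatter (model settings) with a templated prompt body; the field values below are illustrative, sketched from the Prompty docs, and exact configuration keys vary by provider:

```
---
name: Basic Greeting
description: A minimal example prompt asset
model:
  api: chat
  configuration:
    type: azure_openai
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
    azure_deployment: gpt-4o
---
system:
You are a concise, helpful assistant.

user:
{{question}}
```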
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection