The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
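This blurb matches Langfuse's description; to illustrate the OpenAI SDK integration it mentions, here is a minimal sketch using the documented drop-in wrapper. It assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and OPENAI_API_KEY are set in the environment:

```python
# Minimal sketch: swapping `import openai` for Langfuse's drop-in wrapper is
# enough to trace chat completions; credentials come from env vars
# (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, OPENAI_API_KEY).
from langfuse.openai import openai

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)  # the call appears as a trace in Langfuse
```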
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
The LLM Evaluation Framework
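This tagline matches deepeval; assuming that project, a pytest-style check looks roughly like the sketch below (the threshold and strings are illustrative):

```python
# Sketch of a deepeval pytest-style test; metric threshold and texts are illustrative.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Such tests are typically run through the project's pytest wrapper, e.g. `deepeval test run test_shoes.py`.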
Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
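The declarative config described here matches promptfoo's `promptfooconfig.yaml`; assuming that tool, a minimal config comparing two models might look like:

```yaml
# Illustrative promptfooconfig.yaml: one prompt, two providers, one assertion.
prompts:
  - "Summarize in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o
tests:
  - vars:
      text: "LLM evaluation compares model outputs against expectations."
    assert:
      - type: contains
        value: "evaluation"
```

Evaluations then run from the command line (e.g. `npx promptfoo@latest eval`) or inside CI.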
AI Observability & Evaluation
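This tagline matches Arize Phoenix; assuming that project (with its OpenTelemetry helper package installed), a local session plus tracing wiring is a few lines:

```python
# Sketch assuming Arize Phoenix: start the local UI and register an
# OpenTelemetry tracer provider that exports spans to it.
import phoenix as px
from phoenix.otel import register

session = px.launch_app()     # serves the Phoenix UI locally
tracer_provider = register()  # OpenTelemetry spans now flow to Phoenix
print(session.url)
```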
The LLM vulnerability scanner
ReLE Chinese LLM capability benchmark (continuously updated): currently covers 291 large models, including commercial models such as chatgpt, gpt-5, o4-mini, Google's gemini-2.5, Claude4, Zhipu GLM-Z1, ERNIE Bot, qwen-max, Baichuan, iFlytek Spark, SenseTime SenseChat, and minimax, as well as open-source models such as kimi-k2, ernie4.5, minimax-M1, DeepSeek-R1-0528, deepseek-v3.1, qwen3-2507, llama4, phi-4, GLM4.5, gemma3, and mistral. Beyond the leaderboard, it also provides a defect library of more than 2 million LLM failure cases so the community can analyze and improve large models.
🐢 Open-Source Evaluation & Testing library for LLM Agents
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
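This blurb matches Helicone, whose "one line of code" is routing the OpenAI client through its gateway; a sketch assuming a HELICONE_API_KEY environment variable:

```python
# Sketch assuming Helicone's proxy integration: point the OpenAI client at
# Helicone's gateway and authenticate with your Helicone key.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # the "one line" change
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
```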
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
A practical guide to LLMs: from the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Evaluation and Tracking for LLM Experiments and AI Agents
Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.
Build, enrich, and transform datasets using AI models with no code
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
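For a sense of the asset format, a minimal `.prompty` file pairs YAML frontmatter (model settings) with a templated prompt body; the field values below are illustrative, sketched from the Prompty docs, and exact configuration keys vary by provider:

```
---
name: Basic Greeting
description: A minimal example prompt asset
model:
  api: chat
  configuration:
    type: azure_openai
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
    azure_deployment: gpt-4o
---
system:
You are a concise, helpful assistant.

user:
{{question}}
```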
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection