evaluation

Star

Here are 1,646 public repositories matching this topic...

Language: All

Filter by language

All 1,646 Python 721 Jupyter Notebook 235 JavaScript 70 Java 62 TypeScript 55 HTML 51 C++ 46 R 27 Go 23 C# 22

Sort: Most stars

Sort options

Most stars Fewest stars Most forks Fewest forks Recently updated Least recently updated

mlflow

mlflow / mlflow

Star 22.5k

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

open-source machine-learning ai apache-spark evaluation ml openai agents observability model-management mlops mlflow agentops prompt-engineering ai-governance langchain llmops llm-evaluation

Updated Oct 17, 2025
Python

langfuse

langfuse / langfuse

Star 17.2k

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai observability autogen large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

Updated Oct 17, 2025
TypeScript

explodinggradients / ragas

Star 11.1k

Supercharge Your LLM Application Evaluations 🚀

evaluation llm llmops

Updated Oct 17, 2025
Python

mrgloom / awesome-semantic-segmentation

Star 10.7k

🤘 awesome-semantic-segmentation

benchmark evaluation deeplearning semantic-segmentation

Updated May 8, 2021

promptfoo / promptfoo

Star 8.7k

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Oct 17, 2025
TypeScript

oumi-ai / oumi

Star 8.5k

Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!

evaluation inference llama fine-tuning sft dpo slms llms vlms gpt-oss gpt-oss-120b gpt-oss-20b

Updated Oct 16, 2025
Python

Tencent / WeKnora

Star 6.5k

LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using RAG paradigm.

agent golang multi-tenant ai chatbot evaluation embeddings openai question-answering chatbots knowledge-base semantic-search reranking multimodel rag vector-search llm generative-ai agentic ollama

Updated Oct 16, 2025
Go

open-compass / opencompass

Star 6.2k

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, etc) over 100+ datasets.

benchmark evaluation openai llm chatgpt large-language-model llama2 llama3

Updated Oct 17, 2025
Python

coze-dev / coze-loop

Star 5k

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.