mmlu

Code and data accompanying the article "The impact of quantising a small open source LLM". This repository explores how quantisation affects performance, VRAM usage, and inference speed in Qwen3 1.7B.

open-source ai quantization llm generative-ai mmlu

Updated Jul 5, 2025
Python

NahuelGiudizi / llm-evaluation

Star 2

Enterprise-grade LLM evaluation framework | Multi-model benchmarking, honest dashboards, system profiling | Academic metrics: MMLU, TruthfulQA, HellaSwag | Zero fake data | PyPI: llm-benchmark-toolkit | Blog: https://dev.to/nahuelgiudizi/building-an-honest-llm-evaluation-framework-from-fake-metrics-to-real-benchmarks-2b90

visualization python benchmarking machine-learning performance-testing academic-metrics mmlu ollama llm-evaluation truthfulqa hellaswag

Updated Dec 5, 2025
Python

RobotStudyCompanion / Benchmark_LM

Star 2

Benchmark suite for open-source language models on the edge. Evaluates inference efficiency, MMLU accuracy, and LLM-rated teaching effectiveness.

python raspberry-pi benchmark language-models reproducibility edge-computing social-robots educational-robotics llm mmlu ollama teaching-effectiveness arso2026

Updated Jun 11, 2026
Python

sergeyklay / factly

Star 2

CLI tool to evaluate LLM factuality on MMLU benchmark.

cli benchmark openai factuality ai-evaluation llm prompt-engineering chatgpt mmlu llm-evaluation

Updated Nov 26, 2025
Python

RenaudGaudron / MMLU_benchmark

Star 1

An easy-to-use and standardised framework for evaluating Large Language Models (LLMs) on the Massive Multitask Language Understanding (MMLU) dataset. Currently supported: Hugging Face transformer models and Bedrock models.

open-source benchmark ai llm generative-ai mmlu

Updated Jul 12, 2025
Python

chengjun-xu / ai-eval-platform

Star 1

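Large-model evaluation platform: three backend routes (local/API/HuggingFace/OpenCompass), with support for data production (Self-Instruct/Evol-Instruct), long-tail scenario generation, weakness mining, regression analysis, contamination detection, and bad-case attribution. Extensible benchmark system and LLM-as-Judge automatic scoring.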

python flask humaneval ai-evaluation gsm8k mmlu llm-evaluation benchmark-platform rag-evaluation llm-as-judge opencompass llm-benchmark data-contamination-detection

Updated Jun 7, 2026
Python

Shuichi346 / llm-benchmark-script

Star 0

A tool to evaluate and compare local LLMs running on Ollama or LM Studio under identical conditions using deepeval's public benchmarks (MMLU, TruthfulQA, GSM8K).

python macos benchmark quantization model-evaluation apple-silicon llm gsm8k local-llm mmlu ollama lmstudio truthfulqa deepeval

Updated Mar 14, 2026
Python

abhigupta2909 / LLMPerformanceLab

Star 0

Performance analysis of LLMs: CPU and GPU usage, execution time, and energy usage

javascript mysql java spring-boot reactjs flask-restful humaneval llms mmlu ollama-api

Updated Apr 1, 2024
Java

North-Shore-AI / datasets_ex

Star 0

Dataset management library for ML experiments—loaders for SciFact, FEVER, GSM8K, HumanEval, MMLU, TruthfulQA, HellaSwag; git-like versioning with lineage tracking; transformation pipelines; quality validation with schema checks and duplicate detection; GenStage streaming for large datasets. Built for reproducible AI research.

machine-learning streaming elixir otp data-validation beam fever versioning benchmarks data-management datasets reproducibility genstage data-quality humaneval gsm8k mmlu scifact nshkr-crucible north-shore-ai

Updated Apr 23, 2026
Elixir

AndrewHeller17 / Effect-of-Emotional-Framing-on-LLM-Performance

Star 0

Evaluated the impact of emotional prompt framing on LLM reasoning accuracy across industry benchmarks (MMLU, GPQA) using controlled experimental conditions.

python nlp machine-learning research llm chatgpt mmlu gpqa

Updated Mar 3, 2026
Jupyter Notebook

Here are 27 public repositories matching this topic...

baichuan-inc / Baichuan-7B

baichuan-inc / Baichuan2

baichuan-inc / Baichuan-13B

microsoft / MMLU-CF

ExplainableML / in-context-impersonation

vignesh2027 / LLM-Evaluation-Framework

SS47816 / AGI-Elo

notwitcheer / llm-bench-rig

mbzuai-nlp / UrduMMLU

he-yufeng / LiteBench

RenaudGaudron / llm-quantisation-performance-study

NahuelGiudizi / llm-evaluation

RobotStudyCompanion / Benchmark_LM

sergeyklay / factly

RenaudGaudron / MMLU_benchmark

chengjun-xu / ai-eval-platform

Shuichi346 / llm-benchmark-script

abhigupta2909 / LLMPerformanceLab

North-Shore-AI / datasets_ex

AndrewHeller17 / Effect-of-Emotional-Framing-on-LLM-Performance
