
Thopterek/ChessBenchmark


Benchmarking LLMs with chess

Done in a two-person team by myself (@Thopterek) and @itsiros.


Some of the results

Related research

Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation

Large Language Models as General Pattern Machines

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Validity Challenges in Machine Learning Benchmarks

GAIA: A Benchmark for General AI Assistants

A critical review of large language models: Sensitivity, bias, and the path toward specialized AI

Measuring General Intelligence with Generated Games

Maia-2: A Unified Model for Human-AI Alignment in Chess

Judge AI: Assessing Large Language Models in Judicial Decision-Making

Aligning Superhuman AI with Human Behavior: Chess as a Model System

OK, I can partly explain the LLM chess weirdness now

ChessGPT: Bridging Policy Learning and Language Modeling

About

Aleph Alpha and LEVEL3, LLM benchmark

Topics

python benchmark llm llms-benchmarking
