[ACL 25] MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

This repository is the official codebase of our paper "MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models".

The proposed MiniLongBench is a low-cost benchmark for evaluating the Long Context Understanding (LCU) capabilities of LLMs, featuring a compact yet diverse test set of only 237 samples spanning 6 major task categories and 21 distinct tasks.

In an empirical analysis of over 60 LLMs, MiniLongBench reduces the average evaluation cost to only 4.5% of that of the original LongBench while maintaining an average rank correlation coefficient of 0.97 with LongBench results.
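As a rough illustration of what a rank correlation between LongBench and MiniLongBench results measures, the sketch below computes Spearman's coefficient in plain Python. It assumes the rank correlation is Spearman's, which this README does not state, and the scores are made-up numbers for five hypothetical LLMs, not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for five hypothetical LLMs (NOT results from the paper).
full_scores = [41.2, 35.7, 48.9, 29.3, 44.0]
mini_scores = [40.1, 36.5, 45.2, 30.8, 45.5]
rho = spearman(full_scores, mini_scores)
print(round(rho, 3))  # 0.9
```

A value close to 1 means the compact benchmark orders the LLMs almost exactly as the full benchmark does, even if the absolute scores differ.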

🎉 News

2025-07 - We won the Outstanding Paper Award at ACL 2025🎉🎉🎉🎉🎉!

2025-05 - We released the MiniLongBench dataset on [Baidu Drive], [Google Drive], and [Hugging Face]. 👈🎉Please try it!

2025-05 - Our paper "MiniLongBench" has been accepted to the ACL'25 main track! [Paper] 👈🎉Please read it!

⚙️ Environment Setup

Create a conda environment and install the packages listed in requirements.txt.

conda create -n MiniLongBench python=3.11
conda activate MiniLongBench
pip install -r requirements.txt

To reproduce the construction of MiniLongBench, please install an adapted version of [py-irt].

pip install poetry
git clone https://github.com/linggm3/py-irt.git
cd py-irt
poetry install

🧪 Testing on MiniLongBench

Obtain LLM's output on MiniLongBench

Download MiniLongBench from [Baidu Drive], [Google Drive], or [Hugging Face].
Obtain LLM responses using [OpenCompass]:
- Evaluate the LLM on all 237 test samples in MiniLongBench.
- Save the outputs in the format shown in pred_data/example.

Calculate scores across all test samples

To generate and store the evaluation scores for all 237 test samples:

python minilongbench_scorer.py
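For readers unfamiliar with per-sample scoring, here is a minimal sketch of one common metric for QA-style tasks, token-level F1. Whether minilongbench_metrics.py uses this exact metric is an assumption, and the record fields `pred` and `answers` are hypothetical names; the real layout is whatever pred_data/example contains.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two lowercased, whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_sample(prediction, references):
    """Score one sample as the best F1 over its acceptable references."""
    return max(token_f1(prediction, ref) for ref in references)


# Hypothetical records; field names are assumptions, not the repo's format.
records = [
    {"pred": "Paris is the capital", "answers": ["Paris"]},
    {"pred": "blue whale", "answers": ["the blue whale", "whale"]},
]
scores = [score_sample(r["pred"], r["answers"]) for r in records]
mean_score = 100 * sum(scores) / len(scores)
print(round(mean_score, 1))  # 60.0
```

Other task categories (summarization, classification, code completion) would typically need different metrics, so treat this only as an example of the per-sample scoring step.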

Calculate scores on MiniLongBench

There are two evaluation methods for MiniLongBench.

Predict the scores of LLMs on the full LongBench benchmark (eval_new_llm_by_pred.ipynb): This notebook shows how to obtain MiniLongBench scores by predicting an LLM's scores on the full LongBench benchmark.
Directly calculate the scores of LLMs on MiniLongBench (eval_new_llm_directly.ipynb): This notebook shows how to obtain MiniLongBench scores directly from the LLM's results on the 237 test samples.
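The difference between the two methods can be sketched as follows. This is an illustration only: it assumes a one-parameter logistic (Rasch) item response model with a scalar ability per LLM and a scalar difficulty per sample, which may differ from the model used in the notebooks, and all difficulties and scores below are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def direct_score(mini_scores):
    """Direct method: average the per-sample scores on the mini set."""
    return sum(mini_scores) / len(mini_scores)


def fit_ability(mini_scores, mini_difficulty, steps=500, lr=0.5):
    """Estimate a scalar ability by gradient ascent on the Rasch log-likelihood."""
    theta = 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood w.r.t. theta is sum(observed - predicted).
        grad = sum(s - sigmoid(theta - b) for s, b in zip(mini_scores, mini_difficulty))
        theta += lr * grad / len(mini_scores)
    return theta


def predicted_full_score(theta, full_difficulty):
    """Prediction method: average predicted score over every full-benchmark sample."""
    return sum(sigmoid(theta - b) for b in full_difficulty) / len(full_difficulty)


# Made-up difficulties for a six-sample "full" benchmark (hypothetical values).
full_difficulty = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
mini_indices = [0, 2, 4]  # samples assumed to be kept in the mini set
mini_difficulty = [full_difficulty[i] for i in mini_indices]
mini_scores = [0.9, 0.6, 0.3]  # a new LLM's scores on the mini set, in [0, 1]

theta = fit_ability(mini_scores, mini_difficulty)
print("direct:", direct_score(mini_scores))
print("predicted full:", predicted_full_score(theta, full_difficulty))
```

The direct method needs only the new LLM's scores, while the prediction method additionally relies on sample parameters learned beforehand from other LLMs.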

🛠️ Reproducing MiniLongBench

Representation Learning

representation_learning.ipynb demonstrates how to load LongBench's evaluation data, perform data preprocessing, and learn representations for both the LLMs and test samples.
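To make "learning representations for both the LLMs and test samples" concrete, the sketch below fits the simplest item response model (Rasch) to a small score matrix, giving one number per LLM and one per sample. It is an assumption that the notebook follows an IRT-style approach (suggested by irt.py and the py-irt dependency); the actual representations may be multi-dimensional, and the score matrix here is made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_rasch(scores, steps=2000, lr=0.1):
    """Fit one ability per LLM and one difficulty per sample by gradient ascent.

    scores[i][j] is the score in [0, 1] of LLM i on test sample j.
    """
    n_llms, n_items = len(scores), len(scores[0])
    ability = [0.0] * n_llms
    difficulty = [0.0] * n_items
    for _ in range(steps):
        # Residuals between observed scores and the model's predictions.
        resid = [[scores[i][j] - sigmoid(ability[i] - difficulty[j])
                  for j in range(n_items)] for i in range(n_llms)]
        for i in range(n_llms):
            ability[i] += lr * sum(resid[i])
        for j in range(n_items):
            difficulty[j] -= lr * sum(resid[i][j] for i in range(n_llms))
        # Pin the mean difficulty to zero; shifting both sides keeps predictions unchanged.
        mean_b = sum(difficulty) / n_items
        difficulty = [b - mean_b for b in difficulty]
        ability = [a - mean_b for a in ability]
    return ability, difficulty


# Made-up scores of three hypothetical LLMs on four hypothetical samples.
scores = [
    [0.9, 0.7, 0.5, 0.2],
    [0.8, 0.5, 0.3, 0.1],
    [0.6, 0.4, 0.2, 0.1],
]
ability, difficulty = fit_rasch(scores)
print("abilities:", [round(a, 2) for a in ability])
print("difficulties:", [round(b, 2) for b in difficulty])
```

In this toy setup, stronger LLMs receive higher abilities and samples that most LLMs fail receive higher difficulties; the learned sample parameters are what a later clustering step would operate on.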

Sample Clustering

sample_clustering.ipynb demonstrates how to cluster the representations of test samples and extract cluster centers as representative test samples.
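The idea of picking cluster centers as representative test samples can be sketched with plain k-means. The choice of k-means is an assumption (this README does not name the clustering algorithm), and the two-dimensional points below are made-up stand-ins for learned sample representations.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns the final list of centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep it if empty.
        centroids = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids


def representatives(points, centroids):
    """For each centroid, return the index of the closest actual sample."""
    return [
        min(range(len(points)), key=lambda i: squared_distance(points[i], c))
        for c in centroids
    ]


# Made-up 2-D representations forming two well-separated groups.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
          [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids = kmeans(points, k=2)
reps = representatives(points, centroids)
print("representative sample indices:", reps)
```

Because a centroid is usually not an actual sample, the sketch maps each centroid to its nearest sample; running it with k equal to the target benchmark size would yield that many representative samples.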

Evaluation

There are two evaluation methods for MiniLongBench.

Predict the scores of LLMs on the full LongBench benchmark (eval_by_pred.ipynb).
Directly calculate the scores of LLMs on MiniLongBench (eval_directly.ipynb).
