MilkThink-Lab/MiniLongBench

[ACL 25] MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

This repository is the official codebase of our paper "MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models".

The proposed MiniLongBench is a low-cost benchmark for evaluating the Long Context Understanding (LCU) capabilities of LLMs, featuring a compact yet diverse test set of only 237 samples spanning 6 major task categories and 21 distinct tasks.

Through empirical analysis of over 60 LLMs, MiniLongBench reduces the average evaluation cost to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results.
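As an illustration of the rank correlation mentioned above, the following is a minimal sketch of Spearman's coefficient between two sets of model scores. The score lists are made-up placeholders, not results from the paper, and the helper names are hypothetical rather than taken from this repository.

```python
def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical scores of five LLMs (illustrative numbers only).
full_scores = [41.2, 35.0, 48.7, 29.9, 44.1]  # e.g. on the full benchmark
mini_scores = [40.0, 36.5, 44.0, 31.0, 45.3]  # e.g. on the compact benchmark

rho = spearman(full_scores, mini_scores)
print(f"Spearman rank correlation: {rho:.2f}")
```

A value close to 1 means the compact benchmark orders the models almost exactly as the full benchmark does.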

🎉 News

2025-07 - We won the Outstanding Paper Award at ACL 2025 🎉!

2025-05 - We released the MiniLongBench dataset on [Baidu Drive], [Google Drive], and [Hugging Face]. 👈🎉 Please try it!

2025-05 - Our paper "MiniLongBench" has been accepted to the ACL'25 main track! [Paper] 👈🎉 Please read it!

βš™οΈ Environment Setup

Create a Python virtual environment and install the packages listed in requirements.txt.

conda create -n MiniLongBench python=3.11
conda activate MiniLongBench
pip install -r requirements.txt

To reproduce the construction of MiniLongBench, please install an adapted version of [py-irt].

pip install poetry
git clone https://github.com/linggm3/py-irt.git
cd py-irt
poetry install

🧪 Testing on MiniLongBench

Obtain the LLM's outputs on MiniLongBench

  1. Download MiniLongBench [Baidu Drive] [Google Drive] [Hugging Face].

  2. Obtain LLM responses with [OpenCompass]:

    • Evaluate the LLM on all 237 test samples in MiniLongBench.
    • Save the outputs in the same format as pred_data/example.

Calculate scores across all test samples

To generate and store the evaluation scores on all 237 test samples:

python minilongbench_scorer.py

Calculate scores on MiniLongBench

There are two evaluation methods for MiniLongBench.

  1. Predict the scores of LLMs on the full LongBench benchmark (eval_new_llm_by_pred.ipynb): This notebook shows how to obtain MiniLongBench scores by predicting the scores of LLMs on the full LongBench benchmark.

  2. Directly calculate the scores of LLMs on MiniLongBench (eval_new_llm_directly.ipynb): This notebook shows how to obtain MiniLongBench scores directly.
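To make the difference between the two methods concrete, here is a minimal sketch that is not taken from the notebooks. It assumes a one-dimensional two-parameter IRT model in which every sample already has a learned discrimination `a` and difficulty `b`; the sample ids, parameter values, and function names are all hypothetical. The direct score is the mean over the evaluated samples, while the predicted score first estimates the LLM's ability from those samples and then averages the modelled success probability over the whole benchmark.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_ability(responses, params, steps=2000, lr=0.5):
    """Estimate a scalar ability by gradient ascent on the 2PL log-likelihood.

    responses: {sample_id: score in [0, 1]} observed on the small test set.
    params:    {sample_id: (a, b)} sample parameters, assumed already learned.
    """
    theta = 0.0
    for _ in range(steps):
        grad = 0.0
        for sample_id, y in responses.items():
            a, b = params[sample_id]
            grad += a * (y - sigmoid(a * (theta - b)))
        theta += lr * grad / len(responses)
    return theta


def predict_full_score(theta, params):
    """Mean modelled success probability over every sample in `params`."""
    return sum(sigmoid(a * (theta - b)) for a, b in params.values()) / len(params)


# Hypothetical parameters for a six-sample "full" benchmark.
params = {
    "s0": (1.0, -1.0), "s1": (0.9, -0.5), "s2": (1.2, 0.0),
    "s3": (1.1, 0.5), "s4": (0.8, 1.5), "s5": (1.0, 2.0),
}
# Hypothetical scores of two LLMs on a three-sample subset.
strong = {"s0": 1.0, "s2": 1.0, "s4": 0.0}
weak = {"s0": 1.0, "s2": 0.0, "s4": 0.0}

direct_strong = sum(strong.values()) / len(strong)  # method 2: direct mean
theta_strong = fit_ability(strong, params)
theta_weak = fit_ability(weak, params)
pred_strong = predict_full_score(theta_strong, params)  # method 1: prediction
pred_weak = predict_full_score(theta_weak, params)
print(direct_strong, pred_strong, pred_weak)
```

The notebooks may use a richer model (for example multi-dimensional representations), so treat this only as an outline of the idea.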

πŸ› οΈ Reproducing the MiniLongBench

Representation Learning

representation_learning.ipynb demonstrates how to load LongBench's evaluation data, perform data preprocessing, and learn representations for both the LLMs and test samples.
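The notebook relies on the adapted py-irt package. As a rough, dependency-free illustration of what "learning representations for both the LLMs and test samples" can mean, the sketch below fits a small logistic factorization `p[i][j] = sigmoid(u_i . v_j + b_j)` to a binary LLM-by-sample correctness matrix. The model form, the toy matrix, and all names are assumptions made for this example; they are not the repository's implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(U, V, b, i, j):
    return sum(u * v for u, v in zip(U[i], V[j])) + b[j]


def learn_representations(Y, dim=2, epochs=300, lr=0.1, seed=0):
    """Fit LLM vectors U, sample vectors V and sample biases b by SGD."""
    rng = random.Random(seed)
    n_llms, n_samples = len(Y), len(Y[0])
    U = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_llms)]
    V = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_samples)]
    b = [0.0] * n_samples
    for _ in range(epochs):
        for i in range(n_llms):
            for j in range(n_samples):
                # Gradient of the log-likelihood with respect to the logit.
                err = Y[i][j] - sigmoid(logit(U, V, b, i, j))
                for k in range(dim):
                    u_ik, v_jk = U[i][k], V[j][k]
                    U[i][k] += lr * err * v_jk
                    V[j][k] += lr * err * u_ik
                b[j] += lr * err
    return U, V, b


def mean_nll(Y, U, V, b):
    """Average negative log-likelihood of Y under the fitted model."""
    total = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            p = min(max(sigmoid(logit(U, V, b, i, j)), 1e-12), 1 - 1e-12)
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / (len(Y) * len(Y[0]))


# Toy matrix: rows are LLMs, columns are test samples, 1 = answered correctly.
Y = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
U, V, b = learn_representations(Y)
final_nll = mean_nll(Y, U, V, b)
print(f"mean negative log-likelihood after training: {final_nll:.3f}")
```

The rows of `V` (optionally together with `b`) then serve as the sample representations that a later clustering step can group.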

Sample Clustering

sample_clustering.ipynb demonstrates how to cluster the representations of test samples and extract cluster centers as representative test samples.
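The idea of clustering sample representations and keeping one representative per cluster can be sketched as follows. This example uses plain k-means and picks the real sample closest to each centroid; the clustering algorithm, the toy points, and the function names are assumptions for illustration, and the notebook may differ in all of them.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means with centers initialised from k randomly chosen points."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def representatives(points, centers):
    """Index of the actual sample closest to each cluster center."""
    return [
        min(range(len(points)), key=lambda i: sq_dist(points[i], center))
        for center in centers
    ]


# Toy 2-D sample representations forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans(points, k=2)
reps = representatives(points, centers)
print("representative sample indices:", sorted(reps))
```

Choosing an existing sample instead of the centroid itself matters here, because only real test samples can be given to an LLM.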

Evaluation

There are two evaluation methods for MiniLongBench.

  1. Predict the scores of LLMs on the full LongBench benchmark (eval_by_pred.ipynb).
  2. Directly calculate the scores of LLMs on MiniLongBench (eval_directly.ipynb).
