Commit 1a297b5

update benchmarks post
1 parent 99033d4 commit 1a297b5

1 file changed: +15 -10 lines changed

‎_blog/misc/25_data_science_benchmarks.md‎

Lines changed: 15 additions & 10 deletions
@@ -4,33 +4,38 @@ title: Data science benchmarks for AI systems
 category: blog
 ---
 
+
+Some benchmarks focusing on getting insight directly from data using LLMs / LLM agents (requires the models to interact with the data through code). I find these really compelling, as they are really hard tasks, useful for real-world applications, and also an extensible stepping stone to accelerate scientific discovery.
+
 **ScienceAgentBench** ([chen...huan sun, 2024](https://arxiv.org/abs/2410.05080)) - 102 scientific coding tasks (from 44 papers in 4 disciplines validated by 9 subject-matter experts)
 
 - target output for every task is a self-contained Python file
 - each task has (a) task instruction, (b) dataset info, (c) expert-provided info and (d) a groundtruth annotated program
-![Screenshot 2025-06-19 at 2.19.17 PM](../../_notes/assets/Screenshot%202025-06-19%20at%202.19.17%E2%80%AFPM.png)
 
-<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%202.19.17%E2%80%AFPM.png" class="noninverted medium_image"/>
+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%202.19.17%E2%80%AFPM.png" class="noninverted full_image"/>
 
 **AutoSDT**: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists ([li...huan sun, 2025](https://arxiv.org/abs/2506.08140)) - 5k scientific coding tasks automatically scraped from github repos for papers (as a sanity check, they manually verified that a subset were reasonable)
-![Screenshot 2025-06-19 at 2.22.52 PM](../../_notes/assets/Screenshot%202025-06-19%20at%202.22.52%E2%80%AFPM.png)
 
-**DiscoveryBench**: Towards Data-Driven Discovery with Large Language Models ([majumder...clark, 2024](https://arxiv.org/abs/2407.01725)) - 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from papers
+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%202.22.52%E2%80%AFPM.png" class="noninverted full_image"/>
 
+**DiscoveryBench**: Towards Data-Driven Discovery with Large Language Models ([majumder...clark, 2024](https://arxiv.org/abs/2407.01725)) - 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from papers
 - each task has datasets, metadata, natural-language discovery goal
-![Screenshot 2025-06-19 at 2.18.31 PM](../../_notes/assets/Screenshot%202025-06-19%20at%202.18.31%E2%80%AFPM.png)
+
+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%202.18.31%E2%80%AFPM.png" class="noninverted full_image"/>
 
 **BLADE**: Benchmarking Language Model Agents for Data-Driven Science ([gu...althoff, 2024](https://arxiv.org/pdf/2408.09667)) - 12 tasks, each has a (fairly open-ended) research question, dataset, and groundtruth expert-conducted analysis
-![Screenshot 2025-06-19 at 4.22.04 PM](../../_notes/assets/Screenshot%202025-06-19%20at%204.22.04%E2%80%AFPM.png)
+
+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%204.22.04%E2%80%AFPM.png" class="noninverted full_image"/>
 
 **Mlagentbench**: Benchmarking LLMs As AI Research Agents ([huang, vora, liang, & leskovec, 2023](https://arxiv.org/abs/2310.03302v2)) - 13 prediction tasks, e.g. CIFAR-10, BabyLM, kaggle (evaluate via test prediction perf.)
 
-![Screenshot 2025-06-19 at 4.02.49 PM](../../_notes/assets/Screenshot%202025-06-19%20at%204.02.49%E2%80%AFPM.png)
+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%204.02.49%E2%80%AFPM.png" class="noninverted full_image"/>
 
 **IDA-Bench**: Evaluating LLMs on Interactive Guided Data Analysis ([li...jordan, 2025](https://arxiv.org/pdf/2505.18223)) - scraped 25 notebooks from recent kaggle competitions, parse into goal + reference insights that incorporate domain knowledge
-
 - paper emphasizes interactive setting: evaluates by using the instruction materials to build a knowledgeable user simulator and then tests data science agents' ability to help the user simulator improve predictive performance
-![Screenshot 2025-06-19 at 4.39.46 PM](../../_notes/assets/Screenshot%202025-06-19%20at%204.39.46%E2%80%AFPM.png)
+
+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%204.39.46%E2%80%AFPM.png" class="noninverted full_image"/>
 
 **InfiAgent-DABench**: Evaluating Agents on Data Analysis Tasks ([hu...wu, 2024](https://arxiv.org/abs/2401.05507)) - 257 precise (relatively easy) questions that can be answered from 1 of 52 csv datasets
-![Screenshot 2025-06-19 at 3.53.53 PM](../../_notes/assets/Screenshot%202025-06-19%20at%203.53.53%E2%80%AFPM.png)
+
+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%203.53.53%E2%80%AFPM.png" class="noninverted full_image"/>
