Commit 1a297b5

update benchmarks post

1 parent 99033d4 commit 1a297b5

File tree

1 file changed

+15

-10

lines changed

_blog/misc
- 25_data_science_benchmarks.md

`‎_blog/misc/25_data_science_benchmarks.md‎`

Lines changed: 15 additions & 10 deletions

Original file line number	Diff line number	Diff line change
`@@ -4,33 +4,38 @@ title: Data science benchmarks for AI systems`
`4`	`4`	`category: blog`
`5`	`5`	`---`
`6`	`6`
	`7`	`+`
	`8`	`+Some benchmarks focusing on getting insight directly from data using LLMs / LLM agents (requires the models to interact with the data through code). I find these really compelling, as they are really hard tasks, useful for real-world applications, and also an extensible stepping stone to accelerate scientific discovery.`
	`9`	`+`
`7`	`10`	`ScienceAgentBench ([chen...huan sun, 2024](https://arxiv.org/abs/2410.05080)) - 102 scientific coding tasks (from 44 papers in 4 disciplines validated by 9 subject-matter experts)`
`8`	`11`
`9`	`12`	`- target output for every task is a self-contained Python file`
`10`	`13`	`- each task has (a) task instruction, (b) dataset info, (c) expert-provided info and (d) a groundtruth annotated program`
`11`		`-![Screenshot 2025年06月19日 at 2.19.17 PM](../../_notes/assets/Screenshot%202025-06-19%20at%202.19.17%E2%80%AFPM.png)`
`12`	`14`
`13`		`-<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%202.19.17%E2%80%AFPM.png" class="noninverted medium_image"/>`
	`15`	`+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%202.19.17%E2%80%AFPM.png" class="noninverted full_image"/>`
`14`	`16`
`15`	`17`	`AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists ([li...huan sun, 2025](https://arxiv.org/abs/2506.08140)) - 5k scientific coding tasks automatically scraped from github repos for papers (as a sanity check, they manually verified that a subset were reasonable)`
`16`		`-![Screenshot 2025年06月19日 at 2.22.52 PM](../../_notes/assets/Screenshot%202025-06-19%20at%202.22.52%E2%80%AFPM.png)`
`17`	`18`
`18`		`-DiscoveryBench: Towards Data-Driven Discovery with Large Language Models ([majumder...clark, 2024](https://arxiv.org/abs/2407.01725)) - 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from papers`
	`19`	`+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%202.22.52%E2%80%AFPM.png" class="noninverted full_image"/>`
`19`	`20`
	`21`	`+DiscoveryBench: Towards Data-Driven Discovery with Large Language Models ([majumder...clark, 2024](https://arxiv.org/abs/2407.01725)) - 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from papers`
`20`	`22`	`- each task has datasets, metadata, natural-language discovery goal`
`21`		`-![Screenshot 2025年06月19日 at 2.18.31 PM](../../_notes/assets/Screenshot%202025-06-19%20at%202.18.31%E2%80%AFPM.png)`
	`23`	`+`
	`24`	`+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%202.18.31%E2%80%AFPM.png" class="noninverted full_image"/>`
`22`	`25`
`23`	`26`	`BLADE: Benchmarking Language Model Agents for Data-Driven Science ([gu...althoff, 2024](https://arxiv.org/pdf/2408.09667)) - 12 tasks, each has a (fairly open-ended) research question, dataset, and groundtruth expert-conducted analysis`
`24`		`-![Screenshot 2025年06月19日 at 4.22.04 PM](../../_notes/assets/Screenshot%202025-06-19%20at%204.22.04%E2%80%AFPM.png)`
	`27`	`+`
	`28`	`+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%204.22.04%E2%80%AFPM.png" class="noninverted full_image"/>`
`25`	`29`
`26`	`30`	`Mlagentbench: Benchmarking LLMs As AI Research Agents ([huang, vora, liang, & leskovec, 2023](https://arxiv.org/abs/2310.03302v2)) - 13 prediction tasks, e.g. CIFAR-10, BabyLM, kaggle (evaluate via test prediction perf.)`
`27`	`31`
`28`		`-![Screenshot 2025年06月19日 at 4.02.49 PM](../../_notes/assets/Screenshot%202025-06-19%20at%204.02.49%E2%80%AFPM.png)`
	`32`	`+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%204.02.49%E2%80%AFPM.png" class="noninverted full_image"/>`
`29`	`33`
`30`	`34`	`IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis ([li...jordan, 2025](https://arxiv.org/pdf/2505.18223)) - scraped 25 notebooks from recent kaggle competitions, parse into goal + reference insights that incorporate domain knowledge`
`31`		`-`
`32`	`35`	`- paper emphasizes interactive setting: evaluates by using the instruction materials to build a knowledgeable user simulator and then tests data science agents' ability to help the user simulator improve predictive performance`
`33`		`-![Screenshot 2025年06月19日 at 4.39.46 PM](../../_notes/assets/Screenshot%202025-06-19%20at%204.39.46%E2%80%AFPM.png)`
	`36`	`+`
	`37`	`+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%204.39.46%E2%80%AFPM.png" class="noninverted full_image"/>`
`34`	`38`
`35`	`39`	`InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks ([hu...wu, 2024](https://arxiv.org/abs/2401.05507)) - 257 precise (relatively easy) questions that can be answered from 1 of 52 csv datasets`
`36`		`-![Screenshot 2025年06月19日 at 3.53.53 PM](../../_notes/assets/Screenshot%202025-06-19%20at%203.53.53%E2%80%AFPM.png)`
	`40`	`+`
	`41`	`+<img src="{{ site.baseurl }}/notes/assets/Screenshot%202025-06-19%20at%203.53.53%E2%80%AFPM.png" class="noninverted full_image"/>`
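
Most of this diff repeats one mechanical edit: a markdown image link under `../../_notes/assets/` is dropped in favor of an `<img>` tag rooted at `{{ site.baseurl }}/notes/assets/` with the `noninverted full_image` classes. A minimal sketch of that rewrite, assuming it could be scripted, is below. The function name `md_image_to_img_tag` and the regex are hypothetical and not part of the commit, and the sketch only rewrites the markdown form (it does not add the blank lines the commit also inserts).

```python
import re

# Hypothetical pattern: a markdown image whose path points into ../../_notes/assets/.
# Group 1 captures the (already URL-encoded) file name.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\.\./\.\./_notes/assets/([^)\s]+)\)")


def md_image_to_img_tag(line: str, css_class: str = "noninverted full_image") -> str:
    """Rewrite markdown images under _notes/assets as <img> tags, as in this diff."""

    def repl(match: re.Match) -> str:
        # Plain string concatenation keeps the Liquid braces {{ ... }} literal.
        return (
            '<img src="{{ site.baseurl }}/notes/assets/'
            + match.group(1)
            + f'" class="{css_class}"/>'
        )

    # Lines without a matching markdown image are returned unchanged.
    return MD_IMAGE.sub(repl, line)


if __name__ == "__main__":
    old = (
        "![Screenshot 2025年06月19日 at 4.22.04 PM]"
        "(../../_notes/assets/Screenshot%202025-06-19%20at%204.22.04%E2%80%AFPM.png)"
    )
    print(md_image_to_img_tag(old))
```

Using a replacement function rather than a replacement string sidesteps backslash and group-reference escaping in `re.sub`, which matters here because the target text contains literal braces and percent-encoded characters.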

0 commit comments