One of the most hyped ideas in agent research is:
"Let the model write its own tools / skills."
But so far it is mostly wasted effort. In this research, self-generated skills produced no meaningful improvement over baseline.
In some cases, they made performance worse.
Today's models simply cannot reliably create useful reusable procedural abstractions.
This matters because a huge part of current agent research assumes models can recursively improve by generating better skills/tools. This benchmark suggests that assumption is premature.
SkillBench chart showing self-generated skills did not meaningfully improve performance
Human-made skills work A LOT better
When Skills were carefully written by humans, performance jumped +16.2 percentage points on average.
But here's what's even more surprising:
Domain variance was extreme
- Some domains saw small gains (~4-5 pp)
- Others saw enormous gains (~50+ pp)
SkillBench chart showing high domain variance for human-made skills
Skills don't help equally across fields. They disproportionately help in structured, procedural domains.
Smaller models + skills ≥ bigger models without skills
A smaller model with curated Skills matched or exceeded a larger model without Skills.
This is huge for cost optimization:
- Local agents
- Edge deployment
- Open-source models
Too many skills can hurt
Overly broad or verbose skill libraries degraded performance. Focused, minimal skill modules performed better.
SkillBench result showing too many skills can degrade performance
Pick your skills carefully. 2-3 skills work better than 4+ skills.
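One way to act on this is to cap how many skills an agent loads per task. The snippet below is only a rough sketch of that idea, not something from the paper: the `Skill` class, the `select_skills` function, the keyword-overlap scoring, and the example skill names are all hypothetical, and a real setup would likely use a better relevance signal (embeddings, manual routing rules, and so on).

```python
from dataclasses import dataclass

# Hypothetical stopword list; only here to keep the toy scoring from matching on filler words.
STOPWORDS = {"the", "and", "for", "with", "from"}


@dataclass(frozen=True)
class Skill:
    """Hypothetical container for one skill file."""
    name: str
    description: str
    body: str


def _tokens(text: str) -> set[str]:
    """Lowercase words with punctuation stripped, minus short words and stopwords."""
    words = (w.strip(".,:;!?()").lower() for w in text.split())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def select_skills(task: str, library: list[Skill], max_skills: int = 3) -> list[Skill]:
    """Return at most `max_skills` skills, ranked by keyword overlap with the task.

    Skills with zero overlap are dropped entirely, so an unrelated task loads nothing.
    """
    task_tokens = _tokens(task)
    scored = []
    for skill in library:
        overlap = len(task_tokens & _tokens(skill.description))
        if overlap > 0:
            scored.append((overlap, skill))
    # sorted() is stable, so ties keep their original library order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [skill for _, skill in ranked[:max_skills]]


# Made-up library for illustration only.
library = [
    Skill("csv-parsing", "Parse CSV files and validate columns", "..."),
    Skill("invoice-totals", "Compute invoice totals and tax", "..."),
    Skill("git-rebase", "Rebase branches safely", "..."),
    Skill("sql-tuning", "Tune slow SQL queries", "..."),
    Skill("pdf-export", "Export invoice reports to PDF", "..."),
]

task = "Parse the CSV invoice and compute totals"
chosen = select_skills(task, library, max_skills=2)
print([s.name for s in chosen])
```

The point of the design is the hard cap plus the zero-overlap filter: the agent only ever sees a small, focused set of skills instead of the whole library.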
Here is my takeaway
If this paper is right (and I think it is, mostly because of my own experience with skill files):
- Scaling alone isn't enough
- Autonomy narratives are premature
- Skill architecture design is now a first-class research problem
Read the full paper: https://arxiv.org/pdf/2602.12670