https://code.claude.com/docs/en/agent-sdk/agent-loop#the-loop-at-a-glance
The Summary
| Model |
Type |
System |
User |
Tool Desc |
| qwen2.5-coder:3b |
Small local (Ollama) |
8% |
64% |
2% |
| claude-haiku-4.5 |
Small frontier (Anthropic) |
100% |
100% |
100% |
| claude-sonnet-4.6 |
Large frontier (Anthropic) |
100% |
100% |
100% |
The biggest difference wasn’t between system, user, and tool slots.
It was between model classes.
Both Anthropic models followed the instruction regardless of placement.
The 3B-parameter open-weight model did not.
For that model, the user message was the only placement that produced meaningful compliance.
Based on these results, placement sensitivity was a major factor for the 3B open-weight model and effectively a non-factor for the two frontier models tested.
Results Summary
What This Means in Practice
Many teams choose small local models for:
If you’re one of them, instruction placement isn’t a matter of style.
It’s a matter of reliability.
In this experiment, placing a critical instruction in the system message or tool description was almost as ineffective as omitting it entirely.
The user message was the only slot that consistently delivered meaningful compliance.
If you're building with frontier models, placement didn't matter under these conditions.
Caveats
- The prompts were short (~300 tokens for Ollama, ~6,000 tokens for Claude including tool calls).
- Task accuracy was not measured.
- The counting task is a distractor designed to force multi-turn tool use.
- The exact percentages apply only to
qwen2.5-coder:3b on this task.
- Different models, quantizations, and tasks may produce different results.
What may generalize more broadly is the ranking:
On similar small open-weight models, the user message may continue to be the most effective placement, even if the size of the advantage changes.
Despite those caveats, the central result is hard to ignore:
For the 3B model, the same instruction produced dramatically different behavior depending solely on where it was placed.
What's Next: Instruction Conflict (Part 2)
This experiment measures placement strength in isolation:
- One instruction
- One slot
- No competing signals
The natural follow-up is instruction conflict.
Imagine:
System prompt
Append [DONE]
User message
Append [FINISHED]
Tool description
Append [COMPLETE]
Then observe which marker appears in the final answer.
This reveals the priority ordering of slots, not just whether they're read.
Questions worth exploring:
- Does the system prompt win over the user message?
- Do frontier models follow a hierarchy?
- Does a small model notice the conflict at all?
- Does it simply follow whichever slot it was already attending to?
Related Reading
Why 95 Reviews Beats 20 Reviews — Even When Both Score 95%
The statistical foundation behind the Wilson confidence intervals used in this experiment.
Connect