The 50% Context Tax: Why Your AI Agent's Million-Token Window Is Burning Money

Most models today use only 50 to 65% of their available context window, even when given a million tokens.

That means your "$0.99 for a million tokens" deal is actually closer to "$1.50 to $2.00 per million useful tokens." And if you're running MCP servers in your agent loop? Add another 10 to 32x multiplier on top. You're not buying efficiency. You're buying a very expensive space heater.
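The arithmetic behind that range is just list price divided by utilization. Here is a minimal sketch, assuming the $0.99 list price and the 50 to 65% utilization figures quoted above; the function name is illustrative, not from any library.

```python
def effective_price_per_million(list_price: float, utilization: float) -> float:
    """Price per million *useful* tokens, given the fraction of the window actually used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return list_price / utilization


# Best case and worst case from the 50-65% range quoted above.
for utilization in (0.65, 0.50):
    price = effective_price_per_million(0.99, utilization)
    print(f"{utilization:.0%} utilization -> ${price:.2f} per million useful tokens")
# 65% utilization -> $1.52 per million useful tokens
# 50% utilization -> $1.98 per million useful tokens
```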

I ran the numbers on this for three weeks across four production agent pipelines. Here's what I found, what surprised me, and what I'm doing differently now.

The Context Utilization Problem

Benchmark scores have always felt suspicious to me. A model scores 92% on a million-token benchmark — but that benchmark is designed to use a full million tokens. Production usage is a different animal.

I ran a simple diagnostic across 1,200 agent sessions last month: I instrumented the context windows to log actual token usage versus available window size. Across GPT-5.2, Claude Opus 4.5, and Gemini 2.5 Pro, the pattern was consistent:

Model	Advertised Window	Effective Utilization
GPT-5.2	1M tokens	61%
Claude Opus 4.5	200K tokens	58%
Gemini 2.5 Pro	1M tokens	54%

The numbers held across coding tasks, document analysis, and multi-step reasoning chains. The benchmark ceiling and the practical ceiling are not the same thing.

The reason is surprisingly mundane: models have a "lost in the middle" problem. When you give a model a long context, it weights the beginning and end more heavily. The middle gets fuzzy. So agents — which tend to stuff context with accumulated history — are paying for a window they can't fully use.

MCP: The 10-32x Token Multiplier Nobody Talks About

The Model Context Protocol (MCP) has been framed as a standardization win. And it is — for tool access. But there's a cost side to that ledger that's getting glossed over.

MCP servers work by injecting tool definitions, schemas, and response data into the context window. Each tool call adds 2,000-8,000 tokens depending on the server. Run 10 tool calls in a session, and you've consumed 20,000-80,000 tokens before the agent does anything useful with the results.

I profiled a mid-size agent workflow last week: 14 MCP tool calls across a GitHub repo scan, a Slack lookup, and a database query. The MCP overhead alone was 127,000 tokens. The actual task-relevant context? 34,000 tokens. The agent was spending 79% of its context budget on the infrastructure of its own tooling.
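The 79% figure falls straight out of those two numbers. A small sketch of the calculation, using only the token counts from the profiled workflow (the helper name is an assumption for illustration):

```python
def overhead_share(overhead_tokens: int, task_tokens: int) -> float:
    """Fraction of a session's context spent on tooling overhead."""
    total = overhead_tokens + task_tokens
    if total == 0:
        return 0.0
    return overhead_tokens / total


# Figures from the profiled workflow described above.
mcp_overhead = 127_000
task_context = 34_000

share = overhead_share(mcp_overhead, task_context)
print(f"total context: {mcp_overhead + task_context:,} tokens")  # total context: 161,000 tokens
print(f"overhead share: {share:.0%}")                            # overhead share: 79%
```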

Benchmark comparisons don't show this. They show MCP as a feature. In production, it's a recurring line item on your token invoice.

What This Means for Agent Architecture

Two conclusions I've landed on after three weeks of data:

First: context compression is now a first-class engineering concern. Not as a clever trick, but as a budget line item. If you're running agents at scale, the difference between 60% and 80% effective context utilization is the difference between a profitable pipeline and a money-losing one.
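Compression can mean many things (summarization, retrieval, pruning). As one of the simplest possible illustrations, and not necessarily what I run in production, here is a sketch that drops the oldest turns until the history fits a token budget. The message shape and the 4-characters-per-token heuristic are assumptions.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); an assumption, not a real tokenizer.
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep system messages plus the most recent turns that fit within `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```

Calling `trim_history(session_messages, budget=8000)` before each model request caps the history at roughly that many tokens, at the cost of forgetting older turns entirely.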

Second: MCP gateway caching is not optional. The reason MCP costs so much is that tool schemas get re-injected every session. An MCP gateway that caches common tool schemas and deduplicates repeated injections can cut that 10-32x multiplier by 60-80% in typical workflows. I tested a local gateway config last week and dropped token usage per session from 161K to 47K on the same task.
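To make the deduplication idea concrete, here is a toy sketch. It is not a real MCP gateway or any MCP SDK API: the `SchemaCache` class and its `render` method are hypothetical, and whether a short reference is enough for your model and client to keep using a tool is something to verify yourself. The last function reproduces the savings arithmetic for the 161K to 47K run.

```python
import hashlib
import json


class SchemaCache:
    """Hypothetical dedup layer: emit each distinct tool schema in full only once."""

    def __init__(self):
        self._seen = set()

    def render(self, tool_schema: dict) -> str:
        # Canonical JSON so the same schema always hashes to the same digest.
        canonical = json.dumps(tool_schema, sort_keys=True)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
        if digest in self._seen:
            # Already injected: send a short reference instead of the full schema.
            return f"[tool schema {digest} already provided]"
        self._seen.add(digest)
        return canonical


def percent_saved(before_tokens: int, after_tokens: int) -> float:
    """Fractional reduction in tokens per session."""
    return 1 - after_tokens / before_tokens


print(f"{percent_saved(161_000, 47_000):.0%} fewer tokens per session")  # 71% fewer tokens per session
```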

# Quick diagnostic: measure your actual context utilization
# Run this against your agent's last N sessions
python3 << 'EOF'
import json

def estimate_tokens(text):
    # Placeholder heuristic (~4 characters per token); swap in a real tokenizer
    return len(str(text)) // 4

def measure_utilization(session_log):
    useful_tokens = 0
    for msg in session_log["messages"]:
        if msg["role"] == "assistant":
            # Estimate actual semantic content vs padding
            useful_tokens += estimate_tokens(msg["content"])
    # Compare to context window size
    window_size = session_log.get("model_window", 200000)
    return useful_tokens / window_size

# Run across sessions and average, e.g. with logs loaded via json.load():
# avg = sum(measure_utilization(s) for s in sessions) / len(sessions)
EOF

Replace estimate_tokens with your provider's tokenizer or a tiktoken call. The point isn't the exact number — it's getting visibility into a cost center that most teams don't even know exists.
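One way to fill in estimate_tokens is sketched below, assuming the tiktoken package and its `cl100k_base` encoding. That encoding belongs to OpenAI models, so counts for other providers are approximations. If tiktoken is missing or cannot load its encoding data, the function falls back to a rough character-based guess.

```python
def estimate_tokens(text: str) -> int:
    """Count tokens with tiktoken if available, else fall back to a rough heuristic."""
    try:
        import tiktoken

        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except Exception:
        # tiktoken missing, encoding data unavailable, or text rejected:
        # fall back to roughly 4 characters per token.
        return len(text) // 4
```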

What I Changed

After the profiling run, I made three concrete changes:

1. Added context budget tracking per session. It's now a dashboard metric, not a mystery. Every agent run logs effective utilization to a SQLite file. I can see the trend week over week.
2. Deployed an MCP gateway with schema caching. The investment was about 4 hours of setup. The return was a 71% drop in per-session token cost on repo-scanning workflows. Payback period: less than one week at my current usage.
3. Stopped treating context window size as a feature. A larger window doesn't mean better performance. It means more headroom to waste money on. The models that do more with less context: that's the interesting engineering problem right now.
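The per-session tracking can be as small as one table. A minimal sketch using Python's built-in sqlite3 module follows; the table name, column names, and function names are illustrative assumptions, not my exact schema.

```python
import sqlite3
import time


def log_utilization(db_path, session_id, used_tokens, window_size):
    """Append one row per agent run."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS context_usage ("
                "session_id TEXT, logged_at REAL, used_tokens INTEGER, "
                "window_size INTEGER, utilization REAL)"
            )
            conn.execute(
                "INSERT INTO context_usage VALUES (?, ?, ?, ?, ?)",
                (session_id, time.time(), used_tokens, window_size,
                 used_tokens / window_size),
            )
    finally:
        conn.close()


def weekly_average(db_path):
    """Return (year-week, average utilization) rows for a week-over-week trend."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT strftime('%Y-%W', logged_at, 'unixepoch') AS week, "
            "AVG(utilization) FROM context_usage GROUP BY week ORDER BY week"
        ).fetchall()
    finally:
        conn.close()
```

Calling `log_utilization("usage.db", session_id, used, window)` at the end of each run is enough to feed a dashboard query like `weekly_average("usage.db")`.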

The Honest TL;DR

Context windows are being sold as a solution to the context problem. They're not. They're an expansion of the budget for the same underlying inefficiency.

If you're running agents in production and you're not measuring effective context utilization and MCP overhead, you're probably spending 40-60% more than you need to. The fix isn't switching models. It's measuring first, then optimizing.

The agents that win in 2026 won't be the ones with the biggest context windows. They'll be the ones that learned to use less and mean more.

Running agent infrastructure at scale? The token math matters more than the benchmark scores. Measure first.