WTF Do You Spend $300B On?: The Oracle-OpenAI Deal Decoded

DEV Community

So I did what any reasonable person would do. I grabbed a spreadsheet and started doing the math. And what I found is instructive (or terrifying) for any executive planning an AI initiative.

Let's Start With the GPU Math

When you hear "$300 billion for AI," your brain immediately goes to GPUs. Those beautiful, expensive chips that make the magic happen and may or may not be a critical pillar of the US GDP. So let's price this out.

OpenAI is getting 4.5 gigawatts of compute capacity from Oracle. In GPU terms, that's roughly 2 million high-end processors. Let's do the math:

 H100 GPUs (the good stuff): $30,000 each
 × 2 million GPUs
 = $60 billion
 Over 5 years = $12 billion per year

But wait, you need more than just GPUs. You need racks, networking, cooling, buildings. Let's be generous and double it:

 GPUs + Infrastructure: $24 billion per year
 Oracle contract: $60 billion per year
 Missing money: $36 billion per year
 Over 5 years: $180 billion unaccounted for

That's a $180 billion gap.
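If you'd rather check my spreadsheet than trust it, here is a minimal Python sketch of the same arithmetic. Every constant is one of the rough estimates above (GPU count, unit price, the "double it" infrastructure multiplier), and the function name `gpu_gap` is just something I made up for illustration.

```python
# Back-of-the-envelope figures from the article; all are rough estimates.
GPU_COUNT = 2_000_000        # ~2 million high-end GPUs for 4.5 GW
GPU_UNIT_COST = 30_000       # dollars per H100
CONTRACT_TOTAL = 300e9       # dollars over the whole contract
YEARS = 5
INFRA_MULTIPLIER = 2         # "be generous and double it"


def gpu_gap(gpu_count=GPU_COUNT, unit_cost=GPU_UNIT_COST,
            contract_total=CONTRACT_TOTAL, years=YEARS,
            infra_multiplier=INFRA_MULTIPLIER):
    """Return yearly hardware spend, yearly contract value, and the gap."""
    hardware_per_year = gpu_count * unit_cost * infra_multiplier / years
    contract_per_year = contract_total / years
    gap_per_year = contract_per_year - hardware_per_year
    return {
        "hardware_per_year": hardware_per_year,
        "contract_per_year": contract_per_year,
        "gap_per_year": gap_per_year,
        "gap_total": gap_per_year * years,
    }


if __name__ == "__main__":
    for name, dollars in gpu_gap().items():
        print(f"{name}: ${dollars / 1e9:.0f}B")
```

Change `INFRA_MULTIPLIER` to 3 or 4 if you think doubling is too stingy. The gap shrinks, but it does not go away.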

Where the hell is the rest going?

The Uncomfortable Answer

I've been lucky enough to work in several hyperscale clouds, and the engineers who've built these systems always come back and tell you something along the lines of: "You think the GPUs are the expensive part? Oh, you sweet summer child."

It sounds bizarre, but it's true: The actual training is the cheap part.

Before you can train a single parameter, you need to:

1. Collect the Data (15% of compute = $9 billion/year)

Crawl massive data sets. Multiple times a day.
Parse, oh let's be conservative, 3.5 petabytes of HTML, JavaScript, and garbage
Store it all somewhere accessible

An SEO company I am friendly with once told me: "We spent $8 million just crawling our target domains. That was month one."

2. Deduplicate Everything (20% of compute = $12 billion/year)

The internet is 90% copies. The same article on 50 sites. The same Stack Overflow answer everywhere. You need to find near-duplicates across trillions of tokens.

But here's the kicker: You can't just check exact matches. "The quick brown fox" and "A quick, brown fox" need to be caught. Now do that for every possible pair in your trillion-token dataset.

Computational complexity: O(n²)
Actual cost: Your entire Q3 budget

(Ok it's not ACTUALLY your Q3 budget, but ... it's a lot.)
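To make "near-duplicate" concrete, here is a small Python sketch using word shingles, Jaccard similarity, and MinHash signatures. This is one common family of techniques, swapped in by me for illustration; I am not claiming it is what OpenAI or anyone else actually runs. The function names, the shingle size `k=2`, and the 64 hash functions are all arbitrary choices of mine.

```python
import hashlib
import re


def shingles(text, k=2):
    """Lowercase, drop punctuation, and return the set of k-word shingles."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: shared shingles over all shingles."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def _hash(shingle, seed):
    """A seeded 64-bit hash; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"),
                             digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingle_set, num_hashes=64):
    """Keep the minimum hash per seed; a fixed-size fingerprint of the set."""
    return [min(_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = shingles("The quick brown fox")
    b = shingles("A quick, brown fox")
    print("exact Jaccard:", jaccard(a, b))
    print("MinHash estimate:",
          estimate_similarity(minhash_signature(a), minhash_signature(b)))
```

The two fox sentences share 2 of their 4 distinct two-word shingles, so the exact Jaccard similarity is 0.5. The point of the signatures is that they are small and fixed-size, so in a real system you can bucket them (locality-sensitive hashing) and only compare documents that land in the same bucket, instead of paying for every possible pair.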

3. Filter for Quality (25% of compute = $15 billion/year)

Run classification models on every single document:

Is this text coherent?
Is it factual?
Is it toxic?
Is it AI-generated? (AI to catch AI? Recursion stack overflow error.)
Does it contain PII?
Is it copyright-safe?

Each document needs MANY classification passes (let's call it 5-10). At a trillion documents, that's 5 to 10 trillion model inference calls before you've trained anything.
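Here is a toy Python sketch of why the call count explodes. The two checks are deliberately silly stand-ins for real classifier models (a word count for "coherent," an email regex for "PII"), and every name in it is hypothetical. What matters is the counter: every check on every document is one more call.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def is_coherent(doc):
    """Toy stand-in for a coherence model: at least five words."""
    return len(doc.split()) >= 5


def is_pii_free(doc):
    """Toy stand-in for a PII model: reject anything with an email address."""
    return EMAIL.search(doc) is None


CHECKS = [is_coherent, is_pii_free]


def filter_corpus(docs, checks=CHECKS):
    """Run every check on every document; return kept docs and call count."""
    kept, calls = [], 0
    for doc in docs:
        results = []
        for check in checks:
            calls += 1
            results.append(check(doc))
        if all(results):
            kept.append(doc)
    return kept, calls


def projected_calls(num_docs, passes_per_doc):
    """Total classifier calls if every document gets every pass."""
    return num_docs * passes_per_doc


if __name__ == "__main__":
    sample = [
        "too short",
        "this sentence has more than five words",
        "contact me at someone@example.com for the full story",
    ]
    kept, calls = filter_corpus(sample)
    print(f"kept {len(kept)} of {len(sample)} docs using {calls} calls")
    print(f"at scale: {projected_calls(10**12, 5):,} "
          f"to {projected_calls(10**12, 10):,} calls")
```

A real pipeline would probably stop at the first failed check to save calls, but the order of magnitude is the same.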

4. The Experimentation Graveyard (20% of compute = $12 billion/year)

Now you've got all your data - ready to go! But wait... Most training runs fail.

Wrong learning rate? $5 million down the drain
Bad data mixture? Another $10 million gone
Model diverged at hour 672? There goes $15 million
Discovered a bug after training? Start over. $20 million.

Industry estimate: For every GPT-4 that ships, there were 10-20 failed attempts. If the final run cost $100 million, the failures cost $1-2 billion.
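That estimate quietly assumes each failed attempt costs about as much as the run that shipped. Here is a tiny Python sketch that makes the assumption explicit; the `cost_fraction` knob is my addition, not part of the industry estimate.

```python
def failed_run_cost(final_run_cost, failed_attempts, cost_fraction=1.0):
    """Total spent on failures, if each costs cost_fraction of the final run."""
    return final_run_cost * failed_attempts * cost_fraction


if __name__ == "__main__":
    final = 100e6  # the $100 million final run from the estimate above
    for attempts in (10, 20):
        total = failed_run_cost(final, attempts)
        print(f"{attempts} failures: ${total / 1e9:.0f}B")
```

If most failures die early (say `cost_fraction=0.25`), the bill drops, but it still dwarfs the run that actually shipped.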

The Shocking Final Tally

When you add it all up, here's a back-of-the-envelope breakdown of where that $60 billion per year actually goes:

What Everyone Thinks	Cost	Reality
Training GPT-5, 6, 7	$60B/year	Nope
What It Actually Is
Data collection & storage	$9B/year
Deduplication	$12B/year
Quality filtering	$15B/year
Failed experiments	$12B/year
Actual model training	$12B/year

Only 20% goes to actual training. The other 80% is data prep and failures.
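For anyone who wants to poke at the split, here is the tally as a short Python sketch. The dollar figures are my estimates from the sections above (in billions per year); the grouping into "data prep," "failures," and "training" is how I'm bucketing them here.

```python
# Estimated yearly spend in billions of dollars, from the tally above.
BUDGET = {
    "Data collection & storage": 9,
    "Deduplication": 12,
    "Quality filtering": 15,
    "Failed experiments": 12,
    "Actual model training": 12,
}

DATA_PREP = ("Data collection & storage", "Deduplication", "Quality filtering")


def shares(budget=BUDGET):
    """Return each bucket's share of the total yearly spend."""
    total = sum(budget.values())
    data_prep = sum(budget[name] for name in DATA_PREP)
    return {
        "total": total,
        "data_prep": data_prep / total,
        "failures": budget["Failed experiments"] / total,
        "training": budget["Actual model training"] / total,
    }


if __name__ == "__main__":
    result = shares()
    print(f"total: ${result['total']}B/year")
    for bucket in ("data_prep", "failures", "training"):
        print(f"{bucket}: {result[bucket]:.0%}")
```

Swap in your own numbers and see how far you have to bend them before training becomes the biggest line item.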

And YES I am probably wrong by about ... 25%? But even if I'm off by a factor of 2, it's still a HUGE portion of the budget.

This Is Every Organization (Including Yours)

Before you feel superior about AI "waste," let me tell you about every enterprise AI project I've seen (all values anonymized to protect the innocent :):

Major Bank ML Initiative:

Total budget: $2 million
Data preparation: $1.5 million
Failed experiments: $400k
Actual model training: $100k
Models that made it to production: 1

Retail Company's "AI Transformation":

Year 1 spend: $5 million
Data lake setup: $2 million
Data cleaning: $1.5 million
"POCs" that went nowhere: $1 million
Actual AI in production: $500k

Transportation Giant's "AI-Powered Logistics":

$3 million just cleaning GPS data from their fleet
$2.5 million merging incompatible routing systems
$2 million on failed predictive maintenance attempts
$2 million on smart routing POCs that never shipped
Only $500k on the AI that's actually optimizing routes

See the pattern? Everyone spends 80% on data prep. Everyone.
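Here is a quick Python sketch that works out how much of each budget reached the model that actually shipped. The figures are the anonymized ones above. One assumption to flag: I never gave a total for the transportation example, so the sketch uses the sum of its five line items ($10 million).

```python
# Dollar figures from the three anonymized examples above.
# The transportation total is assumed to be the sum of its listed items.
PROJECTS = {
    "Major bank": {"total": 2_000_000, "model_spend": 100_000},
    "Retail company": {"total": 5_000_000, "model_spend": 500_000},
    "Transportation giant": {"total": 10_000_000, "model_spend": 500_000},
}


def model_share(project):
    """Fraction of the budget spent on the AI that reached production."""
    return project["model_spend"] / project["total"]


if __name__ == "__main__":
    for name, project in PROJECTS.items():
        share = model_share(project)
        print(f"{name}: {share:.0%} on the model, "
              f"{1 - share:.0%} on everything else")
```

The exact split between data prep and dead-end experiments varies by project, but the slice that reaches the working model is small in all three.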

Why Your AI Project Will Follow the Same Pattern

The uncomfortable truth is that this ratio—80% data prep, 20% actual AI—is physics, not inefficiency.

Here's why your organization will hit the same wall:

Your Data Is Garbage

It's in 47 different systems
With 12 different schemas
Half of it is duplicated
A quarter is wrong
Nobody knows where the good stuff is

Your Experiments Will Fail

Your first model won't work
Neither will your second
Or your third
The one that works will break in production
You'll start over

You'll Discover Hidden Costs

"We need to label the data" - $
"We need to clean the data" - $$
"We need more data" - $$$
"We need different data" - $$$$
"We need to start over" - Everything

What This Means for Your AI Project

For Enterprises

Stop budgeting EXCLUSIVELY for model training. Start budgeting for end-to-end pipelines. Whatever you're allocating for AI, flip the ratio:

Current plan: 80% models, 20% data
Reality: 20% models, 80% data

For Startups

You can't compete with OpenAI's compute budget. But here's the secret: You don't need to. They're spending $45 billion on data prep because they're training on "all human knowledge." You can probably get by with a focused dataset and $50k.

For Engineers

Job security! While everyone's learning prompt engineering, the real demand is for people who can build data pipelines that don't break when you throw petabytes at them.

The $300 Billion Reality Check

The Oracle-OpenAI deal isn't insane. It's insanely honest.

For the first time, a major AI company is admitting what it really costs to build these systems. Not the sexy GPU costs. Not the brilliant model architectures. The grinding, exhausting, expensive reality of data preparation.

Every organization rushing to build AI is about to learn what OpenAI already knows:

Your data is messier than you think
Cleaning it costs more than training
Most of your experiments will fail
The infrastructure you need isn't what you expect

The Bottom Line

This deal reveals an uncomfortable truth: We don't ONLY have an AI problem. We have a data problem.

In AI, the model is the celebrity, but data prep is the "below the line" people who make the movie. And just like in Hollywood, the crew costs more than the star.

Oracle and OpenAI aren't JUST building the future of AI. They're building industrial-scale data refineries that occasionally produce intelligence as a byproduct.

And at $300 billion, they're showing us what it really costs to turn the internet's garbage into gold, unless you're smart about where and how you process that data.

There's Another Way

This is exactly why we built Expanso. While OpenAI LIKELY spends $45 billion a year (or more!) moving and cleaning data in centralized data centers, we're building intelligent pipelines that process data where it lives.

Think about it: Why move petabytes to the cloud when you can:

Process data at the edge, where it's created
Move only insights, not raw data
Skip 80% of the deduplication (because you never centralized the duplicates)
Reduce data prep costs by 10x through distributed processing
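To show the general idea (and only the idea), here is a minimal Python sketch of "deduplicate locally, ship a summary." It is not Expanso's API or anyone's production code; the function names are made up, and it only catches exact duplicates via SHA-256 digests.

```python
import hashlib


def summarize_at_edge(records):
    """Deduplicate on-site and return a compact summary, not raw records."""
    digests = {hashlib.sha256(r.encode("utf-8")).hexdigest() for r in records}
    return {
        "raw_count": len(records),
        "unique_count": len(digests),
        "digests": sorted(digests),
    }


def merge_summaries(summaries):
    """Central step: combine per-site digests without moving the raw data."""
    all_digests = set()
    for summary in summaries:
        all_digests.update(summary["digests"])
    return {
        "raw_count": sum(s["raw_count"] for s in summaries),
        "unique_count": len(all_digests),
    }


if __name__ == "__main__":
    site_a = summarize_at_edge(["alpha", "beta", "alpha"])
    site_b = summarize_at_edge(["beta", "gamma"])
    print(merge_summaries([site_a, site_b]))
```

Each site sends a list of fixed-size digests instead of its raw records, and the central step can still spot the record that both sites share. How much that saves in practice depends entirely on how big and how repetitive your records are.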

The question isn't whether they need to spend that much. Money will always fill the hole. The question is whether there's a more efficient way to allocate the capital, and at Expanso we like to think there is.

The future isn't about who can afford the biggest data center. It's about who can be smartest about never needing one.

What percentage of your AI budget goes to data prep? Are you seeing the same 80/20 split? Drop me a note—I'm collecting data on the real economics of AI, and I promise I'll deduplicate it properly.

In addition to being CEO of Expanso, I'm currently writing a book, based on what I have seen, about the real-world challenges of data preparation for machine learning, focusing on operations, compliance, and cost. I'd love to hear your thoughts!

Originally published at WTF Do You Spend $300B On?: The Oracle-OpenAI Deal Decoded.