
We are no longer afraid that AI doesn’t work; we see it work every day with our own eyes. No, the most persistent fear is that this honeymoon period is temporary. Right now, in a fight for market share, the token dealers sell us inference at a loss, hoping to hook us. Once the market is saturated, they’ll start charging what it actually costs, and at that point all these cool new toys become prohibitively expensive.
I understand where the sentiment comes from. A $200 Claude Max account is a bargain compared to paying per token through the API. Opus is an expensive model, and it’s crazy to see how quickly one can burn through $100 of credits. I have a simple OpenClaw bot that keeps me up to date on the stock market. If I let it run on Opus for a year, it would burn through the price of a second-hand car.
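Here’s the back-of-the-envelope math. The per-token prices are roughly Opus list pricing; the traffic numbers are invented for illustration, not my bot’s actual usage.

```python
# Back-of-the-envelope: what an always-on Opus bot could cost per year.
# Prices are roughly Opus list pricing; the traffic numbers are made up.

PRICE_IN = 15 / 1_000_000   # $ per input token (assumed)
PRICE_OUT = 75 / 1_000_000  # $ per output token (assumed)

calls_per_day = 500          # hypothetical: market updates, tool calls, retries
tokens_in_per_call = 3_000   # hypothetical: prompt + context + tool results
tokens_out_per_call = 800    # hypothetical: analysis + tool arguments

daily = calls_per_day * (tokens_in_per_call * PRICE_IN + tokens_out_per_call * PRICE_OUT)
print(f"~${daily:,.2f}/day, ~${daily * 365:,.0f}/year")
# With these assumptions: about $52.50/day, roughly $19k/year. Second-hand car territory.
```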
So, this all-you-can-eat buffet offered by Anthropic and OpenAI is too good to be true. They are definitely losing money on some of their accounts. That must mean things are about to get more expensive, right?
I don’t think so. AI inference today is the worst it will ever be. It is also the most expensive it will ever be. I see 5 forces compounding towards dirt-cheap AI.
1. Hardware gets, like, really cheap
Moore’s Law is often pronounced dead, but it remains alive and kicking when it comes to GPUs. The cost of a transistor still halves roughly every 3 years, and there are reasons to believe the cost of GPU compute will fall much faster.
NVIDIA’s H100 delivers roughly 3x the inference throughput of an A100 at a similar price point. The B200 will do the same to the H100.
This isn’t speculative. It’s the most predictable trend in technology.
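A toy projection of what that generation-over-generation jump does to the price of a token. Both the starting cost and the 3x factor are assumptions, not measurements.

```python
# Rough projection of inference cost per GPU generation, assuming each
# generation delivers ~3x the throughput of the previous one at a similar
# price (the A100 -> H100 -> B200 pattern). Numbers are illustrative only.

cost_per_million_tokens = 10.0  # hypothetical starting cost, in dollars
generations = ["A100-era", "H100-era", "B200-era", "next-gen"]

for gen in generations:
    print(f"{gen:10s}  ~${cost_per_million_tokens:.2f} per million tokens")
    cost_per_million_tokens /= 3  # same price, ~3x throughput

# Three generation jumps turn $10 per million tokens into roughly $0.37.
```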
Timeframe: the coming few years.
2. Models git gud
The previous generation of large language models was amazing. But it was not efficient. A dense 70 billion parameter model would activate all those parameters for every token, regardless of whether we asked “what’s 2+2” or “explain quantum mechanics.” This is like driving a Formula 1 car to the grocery store.
Modern models no longer spend the same amount of computation on every question. Mixture-of-experts architectures activate only a fraction of their weights per token, and reasoning models can dial their thinking effort up or down. Simple tasks use very little processing, harder reasoning uses more, and the cost of an answer now scales with the difficulty of the problem.
Most real-world questions are simple, which means that for a lot of tasks, smaller models can compete with the big boys. At a discount.
My OpenClaw stock market guru has been running on Haiku 3.5 for a while now, and I haven’t had to top up my credits.
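The mechanism that makes this cheap is routing: try the small model by default and only reach for Opus when the question warrants it. Here’s a minimal sketch using the Anthropic Python SDK; the model IDs are placeholders and the difficulty check is deliberately crude.

```python
# Minimal routing sketch: send a prompt to a cheap model first, and only
# escalate to Opus when the task looks hard. Model IDs are placeholders;
# the "is this hard?" heuristic is naive on purpose.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CHEAP_MODEL = "claude-3-5-haiku-latest"   # placeholder ID
EXPENSIVE_MODEL = "claude-opus-4-latest"  # placeholder ID

def looks_hard(prompt: str) -> bool:
    # Naive difficulty heuristic: long prompts or explicit reasoning requests.
    return len(prompt) > 2_000 or any(
        word in prompt.lower() for word in ("prove", "architecture", "trade-off")
    )

def ask(prompt: str) -> str:
    model = EXPENSIVE_MODEL if looks_hard(prompt) else CHEAP_MODEL
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

print(ask("What's 2+2?"))  # goes to the cheap model
```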
Timeframe: run a cheap, good-enough model today.
3. It’s not all talk
The industry is obsessed with latency: how fast can a token be generated? That’s because the killer app for LLMs is still ChatGPT. If we want it to feel like a conversation partner, we don’t want to wait two minutes for a response.
But most valuable AI workloads aren’t chat. They’re batch processing. Document analysis. Code review. Data transformation. Report generation. These tasks don’t care whether the result arrives in 500ms or 5 minutes, as long as it’s correct. A model that takes 30 seconds to process a document but costs a tenth of what Opus does is the better deal.
Slow inference is cheap inference. Or, put differently: real-time comes at a premium. Batch work can run on an older machine, or you can pack more of it onto a modern one.
Using high-end, blazingly fast real-time models for batch processing is, again, the supermarket Ferrari.
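Numbers make the point better than metaphors. A rough sketch of an overnight document run; every figure in it is an assumption picked for illustration.

```python
# Illustrative cost comparison for an overnight batch job: the document
# counts, token sizes, and per-token prices are all assumptions, not quotes.

docs = 10_000
tokens_in_per_doc = 4_000
tokens_out_per_doc = 500

def batch_cost(price_in_per_m: float, price_out_per_m: float) -> float:
    return docs * (
        tokens_in_per_doc * price_in_per_m / 1_000_000
        + tokens_out_per_doc * price_out_per_m / 1_000_000
    )

opus_like = batch_cost(15.0, 75.0)   # premium real-time model (assumed pricing)
small_slow = batch_cost(0.8, 4.0)    # smaller, slower model (assumed pricing)
print(f"premium: ~${opus_like:,.0f}, small & slow: ~${small_slow:,.0f}")
# With these assumptions: ~$975 vs ~$52 for the same overnight job.
```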
Timeframe: today; use slower models for any asynchronous task.
4. Agents don’t need supercomputers
Last night, I ran a tool-calling benchmark on my laptop. The idea was to see how well AI agents run on consumer-grade hardware, which means running them on a CPU rather than a GPU.
For simple tasks, a 0.5B-parameter model — roughly 400MB on disk — correctly called tools at sub-second latency. All on CPU.
The implications for cost are simple. If a basic AI agent can run on a laptop, then the infrastructure cost is “commodity hardware.” Sure, none of those models reached Opus’ level of reasoning, but that is not necessary for most tasks. It’s perfectly possible to call Opus for planning and then hand off the individual tasks to smaller, cheaper, local agents. You drive your Ferrari on race day and take your Prius to the supermarket.
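A sketch of that split, assuming a hosted Opus endpoint for planning and a small local model served behind an OpenAI-compatible endpoint (llama.cpp or Ollama style) for execution. The URL, the model names, and the “plan is a list of lines” format are all assumptions.

```python
# Sketch of the Ferrari/Prius split: a big hosted model writes the plan,
# a small local model (behind an assumed OpenAI-compatible endpoint on
# localhost) executes each step on CPU.
import anthropic
import requests

client = anthropic.Anthropic()

def plan_with_opus(goal: str) -> list[str]:
    response = client.messages.create(
        model="claude-opus-4-latest",  # placeholder ID
        max_tokens=512,
        messages=[{"role": "user", "content": f"Break this into small steps, one per line: {goal}"}],
    )
    return [line for line in response.content[0].text.splitlines() if line.strip()]

def execute_locally(step: str) -> str:
    # Local CPU-only model behind an OpenAI-compatible endpoint (assumed URL).
    reply = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "qwen2.5-0.5b-instruct",  # placeholder local model
            "messages": [{"role": "user", "content": step}],
        },
        timeout=120,
    )
    return reply.json()["choices"][0]["message"]["content"]

for step in plan_with_opus("Summarize today's market movers and email me the result"):
    print(execute_locally(step))
```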
Timeframe: tomorrow-ish. The models exist. The hardware exists. The orchestration tools are not there yet.
5. The GPU flood is coming
Hyperscalers — AWS, Azure, Google Cloud — depreciate their GPUs over 5-6 years for accounting purposes. But they don’t actually run those chips for 6 years. The competitive pressure to offer the latest hardware pushes them to replace GPUs earlier. An A100 purchased in 2022 will be replaced by a B200 or its successor by 2025-2026.
Those chips don’t end up in a landfill. They end up on the industry equivalent of eBay. And there will be a lot of them.
The current wave of GPU buildout is unprecedented. It’s not unreasonable to suggest hyperscalers will have deployed millions of high-end GPUs by 2030. When the next generation arrives, a wave of perfectly fine previous-generation GPUs will be offered at fire-sale prices.
An A100 running inference in 2029 will be just as capable as it was in 2023. It will just be way cheaper to buy.
The hyperscaler arms race will give us more GPUs than we can handle.
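What would a fire-sale A100 actually cost to run? A rough amortization sketch; every number in it is a guess, and the throughput in particular depends entirely on what you serve.

```python
# What a fire-sale A100 could mean for cost per token. Resale price, power
# draw, electricity cost, and throughput are all assumptions.

resale_price = 4_000          # $ for a used A100 (assumed)
lifetime_years = 3            # how long you run it after buying (assumed)
power_kw = 0.4                # rough board power under load (assumed)
electricity = 0.15            # $ per kWh (assumed)
tokens_per_second = 1_500     # batched throughput for a small model (assumed)

hours = lifetime_years * 365 * 24
total_cost = resale_price + hours * power_kw * electricity
total_tokens = tokens_per_second * hours * 3600
print(f"~${total_cost / (total_tokens / 1e6):.4f} per million tokens")
# With these assumptions: a few cents per million tokens, before utilization losses.
```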
Timeframe: 2028-2029.
Each of these forces puts a dent in the cost of inference. But together, they compound.
Better, cheaper hardware gives us more tokens for less. Rightsized models spend those tokens more efficiently. Batch workloads soak up cheap off-peak capacity, small agents run on commodity hardware, and a flood of secondhand GPUs puts a ceiling on what anyone can charge.
Today, my Jim Cramer-bot can rack up a hefty inference bill.
Part of that bill is the real cost of compute, and part of it is Sam Altman’s margin. But as costs fall, self-hosting improves, and competition intensifies, inference will become a utility.
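To see why “compound” is the right word, multiply the forces together. The individual reduction factors below are guesses, not forecasts; the point is that independent reductions multiply rather than add.

```python
# The compounding effect, with made-up per-force cost reductions.

forces = {
    "cheaper hardware per generation": 1 / 3,
    "rightsized / routed models": 1 / 4,
    "batch instead of real-time": 1 / 2,
    "local agents on commodity CPUs": 1 / 2,
    "secondhand GPU market": 1 / 2,
}

remaining = 1.0
for factor in forces.values():
    remaining *= factor

print(f"cost remaining: {remaining:.1%} of today's")  # ~1.0% with these guesses
```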
By 2030, the cost of inference will most likely be a rounding error.
