I think what we'll eventually see is frontier models getting priced dramatically more expensive (or rate limited), and more people getting pickier about what they send to frontier models vs cheaper, less powerful ones. This is already happening to some extent, with Opus being opt-in and much more restricted than Sonnet within Claude Code.
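To make "pickier" concrete, here's a toy sketch of the sort of routing I mean. The model names, prices, and the is_hard() heuristic are all invented for illustration; a real setup would use a proper difficulty classifier or explicit task labels.

    # Toy cost-aware router: send easy prompts to a cheap model and
    # reserve the expensive frontier model for hard ones.
    # Model names and prices are invented for illustration.
    PRICE_PER_MTOK = {"cheap-model": 0.25, "frontier-model": 15.00}

    def is_hard(prompt: str) -> bool:
        # Placeholder heuristic; a real router might use a small
        # classifier model or explicit task labels instead.
        return len(prompt) > 2000 or "prove" in prompt.lower()

    def pick_model(prompt: str) -> str:
        return "frontier-model" if is_hard(prompt) else "cheap-model"

    for p in ["Summarise this paragraph.", "Prove this invariant holds."]:
        m = pick_model(p)
        print(m, PRICE_PER_MTOK[m])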
An unknown to me: are the less powerful models cheaper to serve, proportional to how much less capable they are than frontier models? One possible explanation for why e.g. OpenAI was eager to retire GPT-4 is that those older models are still money losers.
Everything I've seen makes me suspect that models have been getting steadily more efficient to serve.
The strongest evidence is that the models I can run on my own laptop got massively better over the last three years, despite me keeping the same M2 64GB machine without upgrading it.
Compare the original LLaMA from 2023 to gpt-oss-20b from this year - same hardware, huge difference.
The next clue is the continuing drop in API prices - at least prior to the reasoning rush of the last few months.
One more clue: o3. OpenAI's o3 had an 80% price drop a few months ago, which I believe was due to them finding further efficiencies in serving that model at the same quality.
My hunch is that there are still efficiencies to be wrung out here. We'll know that hunch has stopped holding if API prices stop falling over time.
Why do you think OpenAI wanted to get rid of GPT-4 etc so aggressively?
I suppose there's a distinction here: I can see why new, less capable models would be more efficient. But maybe the older frontier models are less efficient to serve?
Definitely less efficient to serve. They used to charge $60/million input tokens for GPT-3 Da Vinci. They charge $1.25/million for GPT-5.
Plus I believe they have to keep each model in GPU memory to serve it, which means that any GPU serving an older model is unavailable to serve the newer ones.
In most use cases the main cost is input, not output. Agentic workflows, on the other hand, do eat up a ton of tokens across multiple calls. That can usually be optimized, but few people bother.
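A rough back-of-envelope for why agentic runs get expensive: in a naive loop the whole conversation is re-sent as input on every call, so input tokens grow roughly quadratically with the number of steps. All numbers below are invented, just to show the shape.

    # Naive agent loop: each call re-sends the entire history as input.
    # All token counts here are made-up illustrative numbers.
    system_and_tools = 2_000   # system prompt + tool definitions
    step_output      = 500     # tokens the model emits per step
    steps            = 20

    total_input = 0
    history = system_and_tools
    for _ in range(steps):
        total_input += history      # whole history re-sent as input
        history += step_output      # model output appended for next call

    total_output = steps * step_output
    print(total_input, total_output)  # 135000 input vs 10000 output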
The price of a token doesn't necessarily reflect the true cost of running a model.
After Claude Opus 4 was released, the price of OpenAI's o3 tokens was slashed practically overnight.[0] If you think that happened because inference costs went down, I have a bridge to sell you.
Generally I'm skeptical of the idea that any of the major providers are selling inference at a loss. Obviously they're losing money when you include the cost of research and training, but every indication I've seen is that they're not keen to sell $1 for 80 cents.
If you want a hint at the real costs of inference, look to the companies that sell access to hosted open source models. They don't have any research costs to cover, so their priority is to serve as inexpensively as possible while still turning a profit.
Cost to run a million tokens through GPT-3 Da Vinci in 2022: $60
Cost to run a million tokens through GPT-5 today: $1.25
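For scale, that's roughly a 48x drop in list price per input token; a trivial check using the quoted figures:

    # Quoted list prices, per million input tokens
    davinci_2022 = 60.00
    gpt5_today = 1.25
    print(davinci_2022 / gpt5_today)  # 48.0, i.e. roughly 48x cheaper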