The second-order effect that not many people talk about is price: the fact that model scaling at this pace has also come with falling prices is amazing.
I think this is just as important to distribution of AI as model intelligence is.
AFAIK there are no fundamental "laws" that prevent price from continuing to fall, at least in step with Moore's law (or whatever the current AI/Nvidia chip development cycle is called right now): each new generation of hardware is significantly faster and cheaper than the last. So will we see a ChatGPT-5 model at half the price in a year? (Yes, I know thinking models cost more, but I mean on a per-token basis.)
You are vastly underestimating the price decline. To cherry-pick one article: in the first two years after GPT-3.5, inference price for the same amount of intelligence has decreased 10x per year, according to an analysis by Andreessen Horowitz: https://a16z.com/llmflation-llm-inference-cost/. So even in a stark slowdown scenario, we could still see a 1000x decrease in the next 5 years.
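For a rough sense of the compounding, here is a back-of-the-envelope sketch in Python (the 10x/year figure is the a16z estimate above; everything else is an illustrative assumption, not a prediction):

    # Back-of-the-envelope compounding of inference price declines.
    # The 10x/year figure is the a16z estimate cited above; the rest
    # of the numbers are illustrative assumptions.
    observed_rate = 10                  # price drop factor per year
    years = 5

    full_pace = observed_rate ** years  # 10^5 = 100,000x cheaper
    implied = 1000 ** (1 / years)       # annual factor for "only" 1000x in 5 years

    print(f"At 10x/year for {years} years: {full_pace:,}x cheaper")
    print(f"A 1000x drop over {years} years needs only ~{implied:.1f}x/year")

In other words, the "stark slowdown" scenario already assumes the annual rate falls from 10x to roughly 4x.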
Price deflation is not tied to Moore's law right now, because much of the performance gain comes from model optimization, high-bandwidth memory supply chains, and electrical capacity build-out, not FLOP density.
True! I just know that model optimization gains are much less guaranteed than, say, FLOP density, even though model optimization has so far provided way more gains than hardware advancements.
Part of me is optimistic that when the AI bubble bursts the excess data center capacity is going to be another force driving the cost of inference down.
> I just know that model optimization gains are much less guaranteed than, say, FLOP density, even though model optimization has so far provided way more gains than hardware advancements.
Performance gained from model improvements has outpaced performance gained from hardware improvements for decades.
Strange - the model is marked as "Trains on data" ("To our knowledge, this provider may use your prompts and completions to train new models. This provider is disabled, but it can be re-enabled by changing your data policy.").
This is usually not the case for paid models -- is OpenRouter just marking this model incorrectly, or does DeepSeek actually train on submitted data?
I don't know why they need to claim to be open. Their job is to connect you to providers on the basis of price and various metrics they track. Open or closed would make no difference to me.
I always interpreted it as "open" as in "open market".
It's a frictionless marketplace connecting inference providers and customers, creating a more competitive market. Or a more open market, if you play a bit fast and loose with terminology.
It's in the name. Why not name themselves ModelRouter or something similar?
If they lead the market, they'll extract value in lots of ways that an open company could at least be compelled not to. Plus there won't be competition.
They're probably selling your data to LLM companies and you don't even see what they're doing.
Without competition, they'll raise their rates.
If they were open, you could potentially run the offering on-prem. You could bolt on new providers or use it internally for your own routing.
I think it's just called OpenRouter because the founder previously started OpenSea (an NFT marketplace), and also probably to sound a bit similar to OpenAI. It's like companies calling their products "natural" or "organic" or "artisan" when they can get away with it, just a marketing strategy of using words that conjure up vaguely positive connotations in your mind.
They can't raise their prices much because providers have the upper hand, so users will always be able to go directly to the source. I use OpenRouter as well as OpenAI, Anthropic, Google, etc.
They trained a lightweight "indexer" to mimic the full attention distribution while keeping only the top-k (k=2048) most important tokens, so that as the context window grows, the expensive query-key attention step scales with k instead of with the full length. Total compute still grows linearly in the graph, because the indexer has to scan the entire context window, which is O(L), but it does so very roughly and cheaply, which is what speeds things up.
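For intuition, here is a heavily simplified sketch of that two-stage idea in Python/PyTorch. The shapes, the indexer projections, and the scoring function are my assumptions for illustration, not DeepSeek's actual implementation:

    import torch

    def sparse_attention_single_query(q, K, V, idx_q, idx_K, k=2048):
        """Toy two-stage sparse attention for one query token.

        q:            (d,)   query for the current token
        K, V:         (L, d) keys/values over the full context
        idx_q, idx_K: cheap "indexer" projections used only for scoring (assumed)
        """
        # Stage 1: the cheap indexer scans the whole context. Still O(L),
        # but with a tiny per-token cost compared to full attention.
        index_scores = idx_K @ idx_q                      # (L,)
        top = torch.topk(index_scores, min(k, K.shape[0])).indices

        # Stage 2: exact attention only over the selected top-k tokens,
        # so the expensive part is O(k) per query instead of O(L).
        scores = (K[top] @ q) / (q.shape[0] ** 0.5)       # (k,)
        weights = torch.softmax(scores, dim=-1)
        return weights @ V[top]                           # (d,)

That is why the cost curve still grows linearly with context length, just with a much smaller slope: the O(L) part is the cheap indexer scan, while the heavy attention work is capped at k tokens per query.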
One huge problem with these "cheap" models is that they can end up being more expensive in the typical agent workflow if the provider does not support caching.
Input and output costs are peanuts compared to the order-of-magnitude (or more) larger volume of tokens that hit the cache.
At that point you might as well use GPT-5. It will be the same price or cheaper, and more capable.
> One huge problem with these "cheap" models is that they can end up being more expensive in the typical agent workflow if the provider does not support caching.
DeepSeek supports caching and cache hits are a tenth of the cost.
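To make the agent-workflow point concrete, here is a toy cost comparison. The prices and token counts are made-up placeholders, not any provider's actual rates; the only real figure is the roughly 10x cache-hit discount mentioned above:

    # Toy agent loop where every turn resends the accumulated context.
    # Prices and token counts are placeholders; the 10x cache-hit
    # discount is the figure mentioned above.
    input_price = 1.0           # $ per 1M input tokens (placeholder)
    cache_hit_price = input_price / 10
    context_tokens = 50_000     # context resent on every turn (placeholder)
    turns = 20

    no_cache = turns * context_tokens * input_price / 1e6
    with_cache = (context_tokens * input_price                 # first turn misses
                  + (turns - 1) * context_tokens * cache_hit_price) / 1e6

    print(f"without caching: ${no_cache:.2f}")
    print(f"with caching:    ${with_cache:.2f}")

Once the context gets long, whether the provider actually serves the cache matters far more than the headline per-token price.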
First you complained about the lack of caching. When you were informed that the model supports caching, instead of admitting your error you switched to an unrelated complaint. I hope that you do not use similar strategies for discussion in your personal and work life.
Caching is not a function of the model but of the provider; any model can be cached. The provider serving the model decides whether to cache it. OpenRouter is not a provider but a middleman between providers, so some of its DeepSeek providers might offer caching and some might not. If you route to just any of them, you might run into this issue. Likewise, some providers might use your data for training and some might not. You have to look at the list, and you can cherry-pick ones that won't train on your data and that also provide caching.
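For what it's worth, here is a sketch of pinning that down via OpenRouter's provider-routing preferences. The field names ("order", "allow_fallbacks", "data_collection") are how I remember their provider-routing docs, so treat them as assumptions and check the current docs before relying on this:

    import requests

    # Sketch only: restrict routing to providers you have vetted for
    # caching support and a no-training data policy. Field names follow
    # OpenRouter's provider-routing docs as I recall them; verify first.
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
        json={
            "model": "deepseek/deepseek-chat",
            "messages": [{"role": "user", "content": "hello"}],
            "provider": {
                "order": ["DeepSeek"],      # providers you trust, in preference order
                "allow_fallbacks": False,   # don't silently route elsewhere
                "data_collection": "deny",  # skip providers that may train on prompts
            },
        },
    )
    print(resp.json())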
Interesting that models still evolve fast enough that dedicated model-specific hardware isn't a big contender right now. We're still seeing major scaling gains on mostly generic platforms.
You guys rock!
I'm very curious how this will perform on real-world data, where small nuances matter.
Also, have you tested it beyond the 128K context window?