The real news is that non-thinking output is now roughly 4x more expensive, which they of course carefully avoid mentioning in the blog post, comparing only the thinking prices.
How cute they are with their phrasing:
> $2.50 / 1M output tokens (*down from $3.50 output)
Which should really read: "up from $0.60 (non-thinking) / down from $3.50 (thinking)".
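(For the arithmetic behind the "roughly 4x": $2.50 / $0.60 ≈ 4.2x on output.)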
I have LLM fatigue, so I'm not paying attention to headlines... but LLMs are thinking now? That used to be a goalpost: "AI can't do {x} because it's not thinking." Now it's part of a pricing chart?
"Thinking" means spamming a bunch of stream-of-consciousness bs before it actually generates the final answer. It's kind of like the old trick of prompting to "think step by step". Seeding the context full of relevant questions and concepts improves the quality of the final generation, even though it's rarely a direct conclusion of the so-called thinking before it.
Gmail was in beta for what, half a decade? Did you never use it during that time? They've been using these "Preview" models in their non-technical, user-facing Gemini app and products for months now. Google itself has been using them in production, on its main apps. And gemini-1.5-pro is two months from deprecation, with no production alternative.
They told everyone to build their stuff on top of it, and then jacked up the price by 4x. Just pointing to some fine print doesn't change that.
Correct, though pretty much anything end-user-facing is latency-sensitive; voice is a tiny percentage of that. No one likes waiting, and the involvement of an LLM doesn't change that from a user's PoV.
I wonder if you can hide the latency, especially for voice?
What I have in mind is to start the voice response with a non-thinking model, which can produce a sentence or two in a fraction of a second. That will take the voice model a few seconds to read out; in that time, you use a thinking model to start working on the rest of the response.
In a sense, very similar to how everyone knows to stall in an interview by starting with 'this is a very good question...', and using that time to think some more.
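Something like this, as a minimal sketch assuming the google-genai Python SDK (the opener prompt and model names are illustrative, and the actual TTS plumbing is left out):

```python
import asyncio

from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

NO_THINKING = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_budget=0)
)

async def answer(question: str) -> str:
    # Start the slow, thinking-enabled request immediately...
    deep = asyncio.create_task(
        client.aio.models.generate_content(
            model="gemini-2.5-flash", contents=question
        )
    )
    # ...while a thinking_budget=0 call produces the quick opener,
    # the "this is a very good question..." of the machine world.
    opener = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=(
            "In one short spoken sentence, acknowledge the following "
            f"question without answering it yet: {question}"
        ),
        config=NO_THINKING,
    )
    # Hand opener.text to the TTS engine here; by the time it has been
    # read out, the thinking response is (hopefully) done, so most of
    # its latency is hidden from the listener.
    full = await deep
    return f"{opener.text} {full.text}"

print(asyncio.run(answer("Why is the sky blue?")))
```

The gamble is that speaking the opener takes longer than the thinking call; if it doesn't, you're back to an audible pause, just a later one.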
Not at all. Non-thinking flash is... flash with the thinking budget set to 0 (which you can still run that way, just at the new 2x input / 4x output pricing). Flash-lite is far weaker, unusable for the overwhelming majority of flash's use cases; a quick glance at the benchmarks shows this.
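For the record, the thinking budget is a literal request parameter; a quick sketch with the google-genai Python SDK (the prompt is illustrative):

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# "Non-thinking flash" is just 2.5 Flash with its thinking budget forced
# to 0; it still bills at the new 2x input / 4x output rates.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize this support ticket in one sentence: ...",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)
print(response.text)
```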