Gemini flash models have the least hype, but in my experience in production have...

Nihilartikel · 2025-04-17T22:17:35 1744928255

100% agree. I had Gemini flash 2 chew through thousands of points of nasty unstructured client data and it did a 'better than human intern' level conversion into clean structured output for about $30 of API usage. I am sold. 2.5 pro experimental is a different league though for coding. I'm leveraging it for massive refactoring now and it is almost magical.

jdthedisciple · 2025-04-17T22:37:08 1744929428

> thousands of points of nasty unstructured client data

What I always wonder in these kinds of cases is: What makes you confident the AI actually did a good job since presumably you haven't looked at the thousands of client data yourself?

For all you know it made up 50% of the result.

mediaman · 2025-04-18T03:20:36 1744946436

This was solved a hundred years ago.

It's the same problem factories have: they produce a lot of parts, and it's very expensive to put a full operator or more on a machine to do 100% part inspection. And the machines aren't perfect, so we can't just trust that they work.

So starting in the 1920s Walter Shewhart and Edward Deming came up with Statistical Process Control. We accept the quality of the product produced based on the variance we see of samples, and how they measure against upper and lower control limits.

Based on that, we can estimate a "good parts rate" (which later got used in ideas like Six Sigma to describe the probability of bad parts being passed).

The software industry was built on determinism, but now software engineers will need to learn the statistical methods created by engineers who have forever lived in the stochastic world of making physical products.

thawawaycold · 2025-04-18T06:43:46 1744958626

I hope you're being sarcastic. SPC is necessary because mechanical parts have physical tolerances and manufacturing processes are affected by unavoidable statistical variations; it is beyond idiotic to be provided with a machine that can execute deterministic, repeatable processes and then throw that all into the gutter for mere convenience, justifying that simply because "the time is ripe for SWE to learn statistics"

int_19h · 2025-04-18T07:22:06 1744960926

We don't know how to implement a "deterministic, repeatable process" that can look at a bug in a repo and implement a fix end-to-end.

thawawaycold · 2025-04-18T07:35:21 1744961721

that is not what OP was talking about though.

rorytbyrne · 2025-04-18T09:23:29 1744968209

LLMs are literally stochastic, so the point is the same no matter what the example application is.

warkdarrior · 2025-04-18T18:26:55 1745000815

Humans are literally stochastic, so the point is the same no matter what the example application is.

perching_aix · 2025-04-18T18:09:49 1744999789

The deterministic, repeatable process of human (and now machine) judgement and semantic processing?

tominous · 2025-04-17T23:12:47 1744931567

In my case I had hundreds of invoices in a not-very-consistent PDF format which I had contemporaneously tracked in spreadsheets. After data extraction (pdftotext + OpenAI API), I cross-checked against the spreadsheets, and for any discrepancies I reviewed the original PDFs and old bank statements.

The main issue I had was it was surprisingly hard to get the model to consistently strip commas from dollar values, which broke the csv output I asked for. I gave up on prompt engineering it to perfection, and just looped around it with a regex check.

Otherwise, accuracy was extremely good and it surfaced a few errors in my spreadsheets over the years.

jofzar · 2025-04-18T01:04:29 1744938269

I hope there is a future where csv comma's don't screw up data. I know it will never happen but it's a nightmare.

Everyone has a story of a csv formatting nightmare

Nihilartikel · 2025-04-18T00:33:30 1744936410

For what it's worth, I did check over many hundreds of them. Formatted things for side by side comparison and ordered by some heuristics of data nastiness.

It wasn't a one shot deal at all. I found the ambiguous modalities in the data and hand corrected examples to include in the prompt. After about 10 corrections and some exposition about the cases it seemed to misundestand, it got really good. Edit: not too different from a feedback loop with an intern ;)

summerlight · 2025-04-17T23:16:29 1744931789

Though the same logic can be applied to everywhere, right? Even if it's done by human interns, you need to audit everything to be 100% confident or just have some trust on them.

andrei_says_ · 2025-04-18T18:34:54 1745001294

Not the same logic because interns can make meaning out of the data - that’s built-in error correction.

They also remember what they did - if you spot one misunderstanding, there’s a chance they’ll be able to check all similar scenarios.

Comparing the mechanics of an LLM to human intelligence shows deep misunderstanding of one, the other, or both - if done in good faith of course.

summerlight · 2025-04-18T22:14:26 1745014466

Not sure why you're trying to conflate intellectual capability problems into this and complicate the argument? The problem layout is the same. You delegate the works to someone so you cannot understand all the details. This makes a fundamental tension between trust and confidence. Their parameters might be different due to intellectual capability, but whoever you're going to delegate, you cannot evade this trade-off.

BTW, not sure if you have experiences of delegating some works to human interns or new grads and being rewarded by disastrous results? I've done that multiple times and don't trust anyone too much. This is why we typically develop review processes, guardrails etc etc.

FooBarWidget · 2025-04-18T05:05:27 1744952727

You can use AI to verify its own work. Last time I split a C++ header file into header + implementation file. I noticed some code got rewritten in a wrong manner, so I asked it to compare the new implementation file against the original header file, but to do so one method at a time. For each method, say whether the code is exactly the same and has the same behavior, ignoring superficial syntax changes and renames. Took me a few times to get the prompt right, though.

golergka · 2025-04-17T23:08:48 1744931328

Many types of data have very easily checkable aggregates. Think accounting books.

jofzar · 2025-04-18T01:02:21 1744938141

It also depends on what you are using the data for, if it's for non (precise) data based decisions then it's fine. Specially if you looking for "vibe" based decisions before then dedicating time to "actually" process the data for confirmation.

30$ to get an view into data that would take at least x many hours of someone's time is actually super cheap, specially if the decision of that result is then to invest or not invest the x many hours to confirm it.

pamplemoose · 2025-04-17T23:11:01 1744931461

You take a sample and check

visarga · 2025-04-18T04:50:55 1744951855

In my professional opinion they can extract data at 85-95% accuracy.

tcgv · 2025-04-18T16:20:06 1744993206

> I'm leveraging it for massive refactoring now and it is almost magical.

Can you share more about your strategy for "massive refactoring" with Gemini?

Like the steps in general for processing your codebase, and even your main goals for the refactoring.

roygbiv2 · 2025-04-18T04:15:00 1744949700

Isn't it better to get gemini to create a tool to format the data? Or was it in such a state that that would have been impossible?

cdelsolar · 2025-04-18T02:45:23 1744944323

what tool are you using 2.5-pro-exp through? Cline? Or the browser directly?

Nihilartikel · 2025-04-18T03:49:20 1744948160

For 2.5 pro exp I've been attaching files into AIStudio in the browser in some cases. In others, I have been using vscode's Gemini Code Assist which I believe recently started using 2.5 Pro. Though at one point I noticed that it was acting noticeably dumber, and over in the corner, sure enough it warned that it had reverted to 2.0 due to heavy traffic.

For the bulk data processing I just used the python API and Jupyter notebooks to build things out, since it was a one-time effort.

manmal · 2025-04-18T06:10:54 1744956654

Copilot experimental (need VSCode Insiders) has it. I‘ve thought about trying aider —-watch-files though, also works with multiple files.

statements · 2025-04-17T19:49:51 1744919391

Absolutely agree. Granted, it is task dependent. But when it comes to classification and attribute extraction, I've been using 2.0 Flash with huge access across massive datasets. It would not be even viable cost wise with other models.

sethkim · 2025-04-17T21:49:23 1744926563

How "huge" are these datasets? Did you build your own tooling to accomplish this?

bhl · 2025-04-17T23:00:34 1744930834

It's cheap but also lazy. It sometimes generates empty strings or empty arrays for tool calls, and then I just re-route the request to a stronger model for the tool call.

I've spent a lot of time on prompts and tool-calls to get Flash models to reason and execute well. When I give the same context to stronger models like 4o or Gemini 2.5 Pro, it's able to get to the same answers in less steps but at higher token cost.

Which is to be expected: more guardrails for smaller, weaker models. But then it's a tradeoff; no easy way to pick which models to use.

Instead of SQL optimization, it's now model optimization.

spruce_tips · 2025-04-17T19:43:25 1744919005

i have a high volume task i wrote an eval for and was pleasantly surprised at 2.0 flash's cost to value ratio especially compared to gpt4.1-mini/nano

accuracy | input price | output price

Gemini Flash 2.0 Lite: 67% | $0.075 | $0.30

Gemini Flash 2.0: 93% | $0.10 | $0.40

GPT-4.1-mini: 93% | $0.40 | $1.60

GPT-4.1-nano: 43% | $0.10 | $0.40

excited to to try out 2.5 flash

jay_kyburz · 2025-04-17T20:42:33 1744922553

Can I ask a serious question. What task are you writing where its ok to get 7% error rate. I can't get my head around how this can be used.

16bytes · 2025-04-17T21:53:53 1744926833

There are tons of AI/ML use-cases where 7% is acceptable.

Historically speaking, if you had a 15% word error rate in speech recognition, it would generally be considered useful. 7% would be performing well, and <5% would be near the top of the market.

Typically, your error rate just needs to be below the usefulness threshold and in many cases the cost of errors is pretty small.

omneity · 2025-04-17T20:48:55 1744922935

In my case, I have workloads like this where it’s possible to verify the correctness of the result after inference, so any success rate is better than 0 as it’s possible to identify the “good ones”.

nonethewiser · 2025-04-18T00:10:30 1744935030

Aren’t you basically just saying you are able to measure the error rate? I mean that’s good, but already a given in this scenario where hes reporting the 7% error rate.

jsnell · 2025-04-18T01:02:54 1744938174

No. If you're able to verify correctness of individual items of work, you can accept the 93% of verified items as-is and send the remaining 7% to some more expensive slow path.

That's very different from just knowing the aggregate error rate.

yjftsjthsd-h · 2025-04-18T20:31:23 1745008283

No, it's anything that's harder to write than verify. A simple example is a logic puzzle; it's hard to come up with a solution, but once you have a possible answer it's really easy to check it. In fact, it can be easier to vet multiple answers and tell the machine to try again than solve it once manually.

spruce_tips · 2025-04-17T20:46:45 1744922805

low stakes text classification but it's something that needs to be done and couldnt be done in reasonable time frames or at reasonable price points by humans

muzani · 2025-04-18T06:06:03 1744956363

I expect some manual correction after the work is done. I actually mentally counted all the times I pressed backspace while writing this paragraph, and it comes down to 45. I'm not counting the next paragraph or changing the number.

Humans make a ton of errors as well. I didn't even notice how many I was making here until I started counting it. AI is super useful to just write get a first draft out, not for the final work.

sroussey · 2025-04-18T20:15:41 1745007341

You could be OCRing a page that includes a summation line, then add up all the numbers and check against the sum.

dist-epoch · 2025-04-17T21:00:18 1744923618

[flagged]

wavewrangler · 2025-04-17T22:15:06 1744928106

Yeah, general propaganda and psyops are actually more effective around 12% - 15%, we find it is more accurate to the user base, thus is questioned less for standing out more /s

ghurtado · 2025-04-17T20:40:14 1744922414

I know it's a single data point, but yesterday I showed it a diagram of my fairly complex micropython program, (including RP2 specific features, DMA and PIO) and it was able to describe in detail not just the structure of the program, but also exactly what it does and how it does it. This is before seeing a single like of code, just going by boxes and arrows.

The other AIs I have shown the same diagram to, have all struggled to make sense of it.

ramesh31 · 2025-04-17T22:11:15 1744927875

>”Google is silently winning the AI race.”

It’s not surprising. What was surprising honestly was how they were caught off guard by OpenAI. It feels like in 2022 just about all the big players had a GPT-3 level system in the works internally, but SamA and co. knew they had a winning hand at the time, and just showed their cards first.

wkat4242 · 2025-04-17T22:50:47 1744930247

True and their first mover advantage still works pretty well. Despite "ChatGPT" being a really uncool name in terms of marketing. People remember it because they were the first to wow them.

kaoD · 2025-04-18T20:01:39 1745006499

How is ChatGPT bad in terms of marketing? It's recognizable and rolls off the tongue in many many many languages.

Gemini is what sucks from a marketing perspective. Generic-ass name.

simonw · 2025-04-18T20:12:57 1745007177

Generative Pre-trained Transformer is a horrible term to have an acronym for.

kaoD · 2025-04-18T20:25:14 1745007914

Do you think the mass market thinks GPT is an acronym? It's just a name. Currently synonymous with AI.

Ask anyone outside the tech bubble about "Gemini" though. You'll get astrology.

wkat4242 · 2025-04-18T20:32:22 1745008342

True I guess they treat it just like SMS.

I still think they'd have taken off more if they'd given it a catchy name from the start and made the interface a bit more consumer friendly.

golergka · 2025-04-17T23:11:15 1744931475

It feels more authentically engineer-coded.

redbell · 2025-04-17T21:35:08 1744925708

> Google is silently winning the AI race

Yep, I agree! This convinced me: https://news.ycombinator.com/item?id=43661235

Layvier · 2025-04-17T19:38:42 1744918722

Absolutely. So many use cases for it, and it's so cheap/fast/reliable

SparkyMcUnicorn · 2025-04-17T19:46:25 1744919185

And stellar OCR performance. Flash 2.0 is cheaper and more accurate than AWS Textract, Google Document AI, etc.

Not only in benchmarks[0], but in my own production usage.

[0] https://getomni.ai/ocr-benchmark

danielbln · 2025-04-17T19:45:02 1744919102

I want to use these almost too cheap to meter models like Flash more, what are some interesting use cases for those?

rvz · 2025-04-17T20:12:29 1744920749

Google always has been winning the AI race as soon as DeepMind was properly put to use to develop their AI models, instead of the ones that built Bard (Google AI team).

russellbeattie · 2025-04-17T22:12:28 1744927948

I have to say, I never doubted it would happen. They've been at the forefront of AI and ML for well over a decade. Their scientists were the authors of the "Attention is all you need" paper, among thousands of others. A Google Scholar search produces endless results. There just seemed to be a disconnect between the research and product areas of the company. I think they've got that worked out now.

They're getting their ass kicked in court though, which might be making them much less aggressive than they would be otherwise, or at least quieter about it.

no_wizard · 2025-04-17T22:35:33 1744929333

I remember everyone saying its a two horse race between Google and OpenAI, then DeepSeek happened.

Never count out the possibility of a dark horse competitor ripping the sod right out from under

nonethewiser · 2025-04-18T00:13:58 1744935238

How is deepseak doing though? It seemed like they probably just ingested ChatGPT. https://www.forbes.com/sites/torconstantino/2025/03/03/deeps...

Still impressive but would really put a cap on expectations for them.

FooBarWidget · 2025-04-18T05:10:23 1744953023

Everybody else also trains on ChatGPT data, have you never heard of public ChatGPT conversation data sets? Yes they trained on ChatGPT data. No it's not "just".

gs17 · 2025-04-18T01:31:10 1744939870

They supposedly have a new R2 model coming within a month.

42lux · 2025-04-17T19:44:00 1744919040

The API is free, and it's great for everyday tasks. So yes there is no better bang for the buck.

drusepth · 2025-04-17T19:47:06 1744919226

Wait, the API is free? I thought you had to use their web interface for it to be free. How do you use the API for free?

dcre · 2025-04-17T19:50:40 1744919440

You can get an API key and they don't bill you. Free tier rate limits for some models (even decent ones like Gemini 2.0 Flash) are quite high.

https://ai.google.dev/gemini-api/docs/pricing

https://ai.google.dev/gemini-api/docs/rate-limits#free-tier

NoahZuniga · 2025-04-17T21:01:05 1744923665

The rate limits I've encountered with free api keys has been way lower than the limits advertised.

jmacd · 2025-04-17T23:35:14 1744932914

I agree. I found it unusable for anything but casual usage due to the rate limiting. I wonder if I am just missing something?

tempthrow · 2025-04-18T08:08:46 1744963726

I think it's the small TPM limits. I'll be way under the 10-30 requests per minute while using Cline, but it appears that the input tokens count towards the rate limit so I'll find myself limited to one message a minute if I let the conversation go on for too long, ironically due to Gemini's long context window. AFAIK Cline doesn't currently offer an option to limit the context explosion to lower than model capacity.

nolok · 2025-04-18T06:48:36 1744958916

I'm pretty sure that's a google maps' level of free where once in control they will massively bill it

dcre · 2025-04-18T12:44:10 1744980250

There is no reason to expect the other entrants in the market to drop out and give them monopoly power. The paid tier is also among the cheapest. People say it’s because they built their own their inference hardware and are genuinely able to serve it cheaper.

spruce_tips · 2025-04-17T19:49:35 1744919375

create an api key and dont set up billing. pretty low rate limits and they use your data

midasz · 2025-04-17T20:03:42 1744920222

I use Gemini 2.5 pro experimental via openrouter in my openwebui for free. Was using sonnet 3.7 but I don't notice much difference so just default to the free thing now.

mlboss · 2025-04-17T19:49:34 1744919374

using aistudio.google.com

GaggiX · 2025-04-17T20:19:34 1744921174

Flash models are really good even for an end user because how fast and good performance they have.

xnx · 2025-04-17T19:52:42 1744919562

Shhhh. You're going to give away the secret weapon!

paulcole · 2025-04-17T23:17:06 1744931826

> Google is silently winning the AI race.

It’s not clear to me what either the “race” or “winning” is.

I use ChatGPT for 99% of my personal and professional use. I’ve just gotten used to the interface and quirks. It’s a good consumer product that I like to pay $20/month for and use. My work doesn’t require much in the way of monthly tokens but I just pay for the OpenAI API and use that.

Is that winning? Becoming the de facto “AI” tool for consumers?

Or is the race to become what’s used by developers inside of apps and software?

The race isn’t to have the best model (I don’t think) because it seems like the 3rd best model is very very good for many people’s uses.

belter · 2025-04-17T19:37:38 1744918658

> Google is silently winning the AI race.

That is what we keep hearing here...The last Gemini I cancelled the account, and can't help notice the new one they are offering for free...

arnaudsm · 2025-04-17T19:40:10 1744918810

Sorry I was talking of B2B APIs for my YC startup. Gemini is still far behind for consumers indeed.

JeremyNT · 2025-04-17T20:44:58 1744922698

I use Gemini almost exclusively as a normal user. What am I missing out on that they are far behind on?

It seems shockingly good and I've watched it get much better up to 2.5 Pro.

arnaudsm · 2025-04-17T21:04:00 1744923840

Mostly brand recognition and the earlier Geminis had more refusals.

As a consumer, I also really miss the Advanced voice mode of ChatGPT, which is the most transformative tech in my daily life. It's the only frontier model with true audio-to-audio.

jorvi · 2025-04-18T00:45:42 1744937142

> and the earlier Geminis had more refusals.

Its more so that almost every company is running a classifier on their web chat's output.

It isn't actually the model refusing, but rather if the classifier hits a threshold, it'll swap the model's out with "Sorry, let's talk about something else."

This is most apparent with DeepSeek. If you use their web chat with V3 and then jailbreak it, you'll get uncensored output but it is then swapped with "Let's talk about something else" halfway through the output. And if you ask the model, it has no idea its previous output got swapped and you can even ask it build on its previous answer. But if you use the API, you can push it pretty far with a simple jailbreak.

These classifiers are virtually always ran on a separate track, meaning you cannot jailbreak them.

If you use an API, you only have to deal with the inherent training data bias, neutering by tuning and neutering by pre-prompt. The last two are, depending on the model, fairly trivial to overcome.

I still think the first big AI company that has the guts to say "our LLM is like a pen and brush, what you write or draw with it is on you" and publishes a completely unneutered model will be the one to take a huge slice of marketshare. If I had to bet on anyone doing that, it would be xAI with Grok. And by not neutering it, the model will perform better in SFW tasks too.

Jensson · 2025-04-18T11:24:32 1744975472

> and the earlier Geminis had more refusals.

You can turn off those, Google lets you decide how much it censors you can completely turn it off.

It has separate sliders for sexually explicit, hate, dangerous and harassment. It is by far the best at this, since sometimes you want those refusals/filters.

whistle650 · 2025-04-18T02:11:26 1744942286

Have you tried the Gemini Live audio-to-audio in the free Gemini iOS app? I find it feels far more natural than ChatGPT Advanced Voice Mode.

wavewrangler · 2025-04-17T22:21:12 1744928472

What do you mean miss? You don’t have the budget to keep something you truly miss for $20? What am in missing here / I don’t mean to criticize I am just curious is all. I would reword but I have to go

what_ever · 2025-04-17T23:54:35 1744934075

What is true audio-to-audio in this case?

int_19h · 2025-04-18T07:24:19 1744961059

They used to be, but not anymore, not since Gemini Pro 2.5. Their "deep research" offering is the best available on the market right now, IMO - better than both ChatGPT and Claude.

Fairburn · 2025-04-17T19:42:34 1744918954

Sorry, but no. Gemini isn't the fastest horse, yet. And it's use within their ecosystem means it isn't geared to the masses outside of their bubble. They are not leading the race but they are a contender.

gambiting · 2025-04-17T20:04:57 1744920297

In my experience they are as dumb as a bag of bricks. The other day I asked "can you edit a picture if I upload one"

And it replied "sure, here is a picture of a photo editing prompt:"

https://g.co/gemini/share/5e298e7d7613

It's like "baby's first AI". The only good thing about it is that it's free.

ghurtado · 2025-04-17T20:45:47 1744922747

> in my experience they are as dumb as a bag of bricks

In my experience, anyone that describes LLMs using terms of actual human intelligence is bound to struggle using the tool.

Sometimes I wonder if these people enjoy feeling "smarter" when the LLM fails to give them what they want.

mdp2021 · 2025-04-17T21:14:24 1744924464

If those people are a subset of those who demand actual intelligence, they will very often feel frustrated.

JFingleton · 2025-04-17T20:24:00 1744921440

Prompt engineering is a thing.

Learning how to "speak llm" will give you great results. There's loads of online resources that will teach you. Think of it like learning a new API.

gambiting · 2025-04-18T05:44:42 1744955082

This was using Gemini on my phone - which both Samsung and Google advertise as "just talk to it".

abletonlive · 2025-04-18T00:34:00 1744936440

for now. one would hope that this is a transitory moment in llms and that we can just use intuition in the future.

asadotzler · 2025-04-18T00:36:05 1744936565

LLM's whole thing is language. They make great translators and perform all kinds of other language tasks well, but somehow they can't interpret my English language prompts unless I go to school to learn how to speak LLM-flavored English?

WTF?

th0ma5 · 2025-04-18T16:26:51 1744993611

You have the right perspective. All of these people hand waving away the core issue here don't realize their own biases. Some of the best these things tout as much as 97% accuracy on tasks but if a person was completely randomly wrong at 3% of what they say you'd call an ambulance and no doctor would be able to diagnose their condition (the kinds of errors that people make with brain injuries are a major diagnostic tool and the kinds of errors are known for major types of common injuries ... Conversely there is no way to tell within an LLM system if any specific token is actually correct or not and its incorrectness is not even categorizable.)

pplante · 2025-04-18T01:24:03 1744939443

I like to think of my interactions with an LLM like I'm explaining a request to a junior engineer or non engineering person. You have to be more verbose to someone who has zero context in order for them to execute a task correctly. The LLM only has the context you provided so they fail hard like a junior engineer would at a complicated task with no experience.

pplante · 2025-04-18T01:24:03 1744939443

I like to think of my interactions with an LLM like I'm explaining a request to a junior engineer or non engineering person. You have to be more verbose to someone who has zero context in order for them to execute a task correctly. The LLM only has the context you provided so they fail hard like a junior engineer would at a complicated task with no experience.

int_19h · 2025-04-18T07:25:28 1744961128

It's a natural language processor, yes. It's not AGI. It has numerous limitations that have to be recognized and worked around to make use of it. Doesn't mean that it's not useful, though.

JFingleton · 2025-04-18T07:03:17 1744959797

They are not humans - so yeah I can totally see having to "go to school" to learn how to interact with them.

nowittyusername · 2025-04-17T20:47:04 1744922824

Its because google hasn't realized the value of training the model on information about its own capabilities and metadata. My biggest pet peeve about google and the way they train these models.