There's a thru-line to commentary from experienced programmers on working with L...

darepublic · 2025-05-05T18:51:19 1746471079

So in my interactions with gpt, o3 and o4 mini, I am the organic middle man that copy and pastes code into the repl and reports on the output back to gpt if anything should be the problem. And for me, past a certain point, even if you continually report back problems it doesn't get any better in its new suggestions. It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process. Maybe the llms you are using are better than the ones I tried this with?

Specifically I was researching a lesser known kafka-mqtt connector: https://docs.lenses.io/latest/connectors/kafka-connectors/si..., and o1 was hallucinating the configuration needed to support dynamic topics. The docs said one thing, and I even mentioned it to o1 that the docs contradicted with it. But it would stick to its guns. If I mentioned that the code wouldn't compile it would start suggesting very implausible scenarios -- did you spell this correctly? Responses like that indicate you've reached a dead end. I'm curious how/if the "structured LLM interactions" you mention overcome this.

diggan · 2025-05-05T19:52:06 1746474726

> And for me, past a certain point, even if you continually report back problems it doesn't get any better in its new suggestions. It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process.

It sucks, but the trick is to always restart the conversations/chat with a new message. I never go beyond one reply, and also copy-paste a bunch. Got tired of copy-pasting, wrote something like a prompting manager (https://github.com/victorb/prompta) to make it easier, and not having to neatly format code blocks and so on.

Basically make one message, if they get the reply wrong, iterate on the prompt itself and start fresh, always. Don't try to correct by adding another message, but update initial prompt to make it clearer/steer more.

But I've noticed that every model degrades really quickly past the initial reply, no matter what length of each individual message. The companies seem to continue to increase the theoretical and practical context limits, but the quality degrades a lot faster even within the context limits, and they don't seem to try to address that (nor have a way of measuring it).

the_sleaze_ · 2025-05-06T03:50:26 1746503426

This is my experience as well, as has been for over a year now.

LLMs are so incredibly transformative when they're incredibly transformative. And when they aren't it's much better to fall back on the years of hard won experience I have - the sooner the better. For example I'll switch between projects and languages and even with explicit instruction to move to a strongly typed language they'll stick to dynamic answers. It's an odd experience to re-find my skills every once in a while. "Oh yeah, I'm pretty good at reading docs myself".

With all the incredible leaps in LLMs being reported (especially here on HN) I really haven't seen much of a difference in quite a while.

PantaloonFlames · 2025-05-06T14:13:57 1746540837

Interesting. This is another problem aider does not experience. It works on a git repo. If you switch repos, it changes context.

I’m not affiliated with aider. I just use it.

I have a bet that many of the pitfalls people experience at the moment are due to mismatched tools or immature tools.

emporas · 2025-05-06T05:57:08 1746511028

In other words don't use the context window. Treat it as a command line with input/output, in which the purpose of the command is to extract information signal or knowledge manipulation or data mining and so on.

Also special care has to be given to the number of tokens. Even with one-question/one-answer, 5 hundred to 1 thousand tokens can be focused at once by our artificial overlords. After that they start losing their marbles. There are exceptions to that rule with the reasoning models, but in essence they are not that different.

The difference of using the tool correctly versus not, might be that instead of getting 99.9% accuracy, the user gets just 98%. Probably that doesn't sound that big of a difference to some people. The difference is that it works 10 times better in the first case.

namaria · 2025-05-06T08:40:33 1746520833

People keep throwing these 95%+ accuracy rates for LLMs in these discussions, but that is nonsense. It's closer to 70%. It's quite terrible. I use LLMs but I never trust them beyond just doing some initial search if I am stumped, and when it unblocks me I immediately put it down again. It's not transformative, it's merely replacing google because search there has sucked for a while.

diggan · 2025-05-06T11:38:52 1746531532

95% accuracy VS 70% accuracy, both numbers are pulled out of someone's ass and serves little to the discussion at hand. How did you measure that, or rather since you didn't, what's the point of sharing this hypothetical 25% difference?

And how funny that you comment seems to land perfectly together with this about people having very different experiences with using LLMs:

> I am still trying to sort out why experiences are so divergent. I've had much more positive LLM experiences while coding than many other people seem to, even as someone who's deeply skeptical of what's being promised about them. I don't know how to reconcile the two.

https://news.ycombinator.com/item?id=43898532

emporas · 2025-05-06T15:50:39 1746546639

It works very well (99.9%), when the problem resides at a familiar territory of the user's knowledge. When i know enough about a problem, i know how to decompose it into smaller pieces, and all (most?) smaller pieces have been already solved countless of times.

When a problem is far outside of my understanding, A.I. leads me towards a wrong path more often than not. Accuracy is terrible, because i don't know how to decompose the problem.

Jargon plays a crucial role there. LLM's need to guided using as much correct jargon of the problem as possible.

I have done this for decades on people. I read a book at some point that the most sure way for people to like you, is to speak to them in words they usually use themselves. No matter the concepts they are hearing with their ears, if the words belong belong in their familiar vocabulary they are more than happy to discuss anything.

So when i meet someone, i always try to absorb as much of their vocabulary as possible, as quickly as possible, and then i use it to describe ideas i am interested in. People understand much better like that.

Anyway, the same holds true for LLM's, they need to hear the words of the problem, expressed in that particular jargon. So when a programmer wants to use a library, he needs to absorb the jargon used in that particular library. It is only then that accuracy rates hit many nines.

namaria · 2025-05-06T13:19:43 1746537583

I will walk around the gratuitous rudeness and state the obvious:

No, the pretend above 95% accuracy is not as good as the up to 50% rates of hallucinations reported by OpenAI for example.

The difference in experiences is easily explainable in my opinion. Much like some people swear by mediums and psychics and other easily see through it: it's easy to see what you want to see when a nearly random experience lands you a good outcome.

I don't appreciate your insinuation that I am making up numbers and I though it shouldn't go unanswered but do not mistake this for a conversation. I am not in the habit of engaging with such demeaning language.

diggan · 2025-05-06T14:39:41 1746542381

> gratuitous rudeness

It is "Gratuitous rudeness" to say these numbers without any sort of sourcing/backing are pulled from someone's ass? Then I guess so be it, but I'm also not a fan of people speaking about absolute numbers as some sort of truth, when there isn't any clear way of coming up with those numbers in the first place.

Just like there are "extremists" claiming LLMs will save us all, clearly others fall on the other extreme and it's impossible to have a somewhat balanced conversation with either of these two groups.

thewebguyd · 2025-05-06T15:55:49 1746546949

This has largely been my experience as well, at least with GH Copilot. I mostly use it as a better Google now, because even with context of my existing code, it can't adhere to style at all. Hell, it can't even get docker compose files right, using various versions and incorrect parameters all the time.

I also noticed language matters a lot as well. It's pretty good with Python, pandas, matplotlib, etc. But ask it to write some PowerShell and it regularly hallucinates modules that don't exist, more than any other language I've tried to use it with.

And good luck if you're working with a stack that's not flavor of the month with plenty of online information available. ERP systems with its documentation living behind a paywall, so it's not in the training data - you know, real world enterprise CRUD use cases where I'd want to use it the most, it's the least helpful.

namaria · 2025-05-07T09:21:29 1746609689

To be fair, I find ChatGPT useful for Elixir, which is pretty niche. The great error messages (if a bit verbose) and the atomic nature of functions in a functional language goes with the grain of LLMs I think.

Still, at most I get it to help me with snippets. I wouldn't want it to just generate lots of code, for one it's pretty easy to write Elixir...

tstrimple · 2025-05-06T06:59:16 1746514756

I think “don’t use the context window” might be too simple. It can be incredibly useful. But avoid getting in a context window loop. When iterations stop showing useful progress toward the goal, it’s time to abandon the context. LLMs tend to circle back to the same dead end solution path at some point. It also helps to jump between LLMs to get a feel for how they perform on different problem spaces.

emporas · 2025-05-06T07:18:50 1746515930

Depends on the use case. For programming where every small detail might have huge implications, 98% accuracy vs 99.99% is a ginormous difference.

Other tasks can be more forgiving, like writing which i do all the time, then i load 3000 tokens in the context window pretty frequently. Small details in the accuracy do not matter so much for most people, for everyday casual tasks like rewriting text, summarizing etc.

In general, be wary of how much context you load into the chat, performance degrades faster than you can imagine. Ok, the aphorism i started with, was a little simplistic.

tstrimple · 2025-05-06T07:38:49 1746517129

Oh sure. Context windows have been less useful for me on programming tasks than other things. When working iteratively against a CSV file for instance, it can be very useful. I’ve used something very similar to the following before:

“Okay, now add another column which is the price of the VM based on current Azure costs and the CPU and Memory requirements listed.”

“This seems to only use a few Azure VM SKU. Use the full list.”

“Can you remove the burstable SKU?”

Though I will say simple error fixes with context windows with programming issues are resolved fine. On more than one occasion when I copy and paste an incorrect solution. Providing the LLM with the error is enough to fix the problem. But if it goes beyond that, it’s best to abandon the context.

PantaloonFlames · 2025-05-06T14:09:26 1746540566

Aider, the tool, does exactly the opposite, in my experience.

It really works, for me. It iterates by itself and fixes the problem.

diggan · 2025-05-06T14:36:41 1746542201

I'm in the middle of test-driving Aider and I'm seeing exactly the same problem, the longer a conversation goes on, the worse the quality of replies... Currently, I'm doing something like this to prevent it from loading previous context:

    rm -r .aider.chat.history.md .aider.tags.cache.v4 || true && aider --architect --model deepseek/deepseek-reasoner --editor-model deepseek/deepseek-chat --no-restore-chat-history

Which clears the history, then I basically re-run that whenever the model gets something wrong (so I can update/improve the prompt and try again).

mikeqq2024 · 2025-05-09T03:55:49 1746762949

why not use /clear and /reset command?

Grimblewald · 2025-05-06T03:16:17 1746501377

I refused to stop being a middle man, because I can often catch really bad implementation early and can course correct. E.g. a function which solves a problem with a series of nested loops that can be done several orders of magnitude faster by using vectorised operations offered by common packages like numpy.

Even with all the coding agent magik people harp on about, I've never seen something that can write clean good quality code reliably. I'd prefer to tell an LLM what a functions purpose is, what kind of information and data structures it can expect and what it should output, see what it produces, provide feedback, and get a rather workable often perfect function in return.

If i get it to write the whole thing in one go, I cannot imagine the pain of having to find out where the fuckery is that slows everything down, without diving deep with profilers etc. all for a problem I could have solved by just playing middle man and keeping a close eye on how things are building up and being in charge of ensuring the overarching vision is achieved as required.

mr_toad · 2025-05-05T20:31:17 1746477077

> If I mentioned that the code wouldn't compile it would start suggesting very implausible scenarios

I have to chuckle at that because it reminds me of a typical response on technical forums long before LLMs were invented.

Maybe the LLM has actually learned from those responses and is imitating them.

-__---____-ZXyw · 2025-05-05T22:27:05 1746484025

It seems no discussion of LLMs on HN these days is complete without a commenter wryly observing how that one specific issue someone is pointing to with an LLM is also, funnily enough, an issue they've seen with humans. The implication always seems to be that this somehow bolsters the idea that LLMs are therefore in some sense and to some degree human-like.

Humans not being infallible superintelligences does not mean that the thing that LLMs are doing is the same thing we do when we think, create, reason, etc. I would like to imagine that most serious people who use LLMs know this, but sometimes it's hard to be sure.

Is there a name for the "humans stupid --> LLMs smart" fallacy?

raincole · 2025-05-05T23:22:52 1746487372

> Is there a name for the "humans stupid --> LLMs smart" fallacy?

No one is saying "humans stupid --> LLMs smart". That's absolutely not the commenter above you said. Your whole comment is strawman fallacy.

steveklabnik · 2025-05-06T00:25:59 1746491159

> The implication always seems to be that this somehow bolsters the idea that LLMs are therefore in some sense and to some degree human-like.

Nah, it's something else: it's that LLMs are being held to a higher standard than humans. Humans are fallible, and that's okay. The work they do is still useful. LLMs do not have to be perfect either to be useful.

The question of how good they are absolutely matters. But some error isn't immediately disqualifying.

-__---____-ZXyw · 2025-05-06T10:24:51 1746527091

I agree that LLMs are useful, in many ways, but think that people are in fact often making the stronger claim which I refer to in your quote from my original point. If the argument were put forward simply to highlight that LLMs, while fallible, are still useful, I would see no issue.

Yes, humans and LLMs are fallible, and both useful.

I'm not saying the comment I responded was an egregious case of the "fallacy" I'm wondering about, but I am saying that I feel like it's brewing. I imagine you've seen the argument that goes:

Anne: LLMs are human-like in some real, serious, scientific sense (they do some subset of reasoning, thinking, creating, and it's not just similar, it is intelligence)

Billy: No they aren't, look at XYZ (examples of "non-intelligence", according to the commenter).

Anne: Aha! Now we have you! I know humans who do XYZ! QED

I don't like Billy's argument and don't make it myself, but the rejoinder which I feel we're seeing often from Anne here seems absurd, no?

mwcampbell · 2025-05-06T03:17:42 1746501462

I think it's natural for programmers to hold LLMs to a higher standard, because we're used to software being deterministic, and we aim to make it reliable.

johnnyanmac · 2025-05-06T11:32:41 1746531161

well they try to copy humans and humans on the internet are very different creatures from humans in face to face interaction. So I see the angle.

It is sad that, inadvertendly or not, LLMs may have picked up on the traits of the loudest humans. abrasive, never admitting fault, always trying to bring something up that sounds plausible but falls under scrutiny. Only thing it holds back on is resorting to insults when cornered.

owebmaster · 2025-05-05T22:36:34 1746484594

> the idea that LLMs are therefore in some sense and to some degree human-like.

This is 100% true, isn't it? It is based on the corpus of humankind knowledge and interaction, it is only expected that it would "repeat" human patterns. It also makes sense that the way to evolve the results we get from it is to mimic human organization, politics, sociology in the a new layer on top of LLMs to surpass current bottlenecks, just like they were used to evolve human societies.

beanedinjkl · 2025-05-06T06:26:44 1746512804

>It is based on the corpus of humankind knowledge and interaction

Something being based on X or using it as source material doesen't guarantee any kind of similarity though. My program can also contains the entire text of wikipedia, and only ever outputs the number 5.

klank · 2025-05-06T19:02:42 1746558162

I'd love a further description of how you can have a program with the entire text of wikipedia that only ever outputs 5. It is not immediately obvious to me how that is possible.

Assuming the text of wikipedia is meaningfully used in the program, of course. A definition of "meaningful" I will propose is code which survives an optimization loop into the final resulting machine code and isn't hidden behind some arbitrary conditional. That seems reasonable as a definition of a program "containing" something.

nikita2206 · 2025-05-05T19:31:02 1746473462

You can have agent search the web for documentation and then provide it to the LLM. That is how Context7 is currently very popular in the AI user crowd.

entropie · 2025-05-05T19:38:54 1746473934

I used o4 to generate nixos config files from the pasted modules source files. At first it did outdated config stuff, but with context files it worked very good.

dingnuts · 2025-05-05T19:40:26 1746474026

Kagi Assistant can do this too but I find it's mostly useful because the traditional search function can find the pages the LLM loaded into its context before it started to output bullshit.

It's nice when the LLM outputs bullshit, which is frequent.

ericmcer · 2025-05-06T13:54:55 1746539695

Seriously Cursor (using Claude 3.5) does this all the time. It ends up with a pile of junk because it will introduce errors while fixing something, then go in a loop trying to fix the errors it created and slap more garbage on those.

Because it’s directly editing code in the IDE instead of me transferring sections of code from a chat window the large amount of bad code it writes it much more apparent.

jimbokun · 2025-05-05T19:42:01 1746474121

I wonder if LLMs have been seen claiming “THERE’S A BUG IN THE COMPILER!”

A stage every developer goes through early in their development.

bayarearefugee · 2025-05-06T04:08:57 1746504537

Gemini 2.5 got into as close to a heated argument with me as possible about the existence of a function in the kotlin coroutines library that was never part of the library (but does exist as a 5 year old PR still visible in github that was never merged in).

It initially suggested I use the function as part of a solution, suggesting it was part of the base library and could be imported as such. When I told it that function didn't exist within the library it got obstinate and argued back and forth with me to the point where it told me it couldn't help me with that issue anymore but would love to help me with other things. It was surprisingly insistent that I must be importing the wrong library version or doing something else wrong.

When I got rid of that chat's context and asked it about the existence of that function more directly without the LLM first suggesting its use to me, it replied correctly that the function doesn't exist in the library but the concept being easy to implement... the joys(?) of using an LLM and having it go in wildly different directions depending upon the starting point.

I'm used to the opposite situation where an LLM will slide into sycophantic agreeable hallucinations so it was in a way kind of refreshing for Gemini to not do this, but on the other hand for it to be so confidently and provably wrong (while also standing its ground on its wrongness) got me unreasonably pissed off at it in a way that I don't experience when an LLM is wrong in the other direction.

namaria · 2025-05-06T08:45:24 1746521124

That we're getting either sycophantic or intransigent hallucinations point to two fundamental limitations: there's no getting rid of hallucinations, and there's a trade-off in observed agreement "behavior".

Also, the recurring theme of "just wipe out context and re-start" places a hard ceiling on how complex an issue the LLM can be useful for.

cma · 2025-05-06T18:32:18 1746556338

> It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process.

Question is would you rather find out it got stuck in a loop with 3 minutes with a coding agent or 40 minutes copy pasting. It can also get out of loops more often by being able to use tools to look up definitions with grep, ctags or language server tools, though you can copy paste commands for that too it will be much slower.

zoogeny · 2025-05-05T19:42:31 1746474151

For several moments in the article I had to struggle to continue. He is literally saying "as an experienced LLM user I have no experience with the latest tools". He gives a rationale as to why he hasn't used the latest tools which is basically that he doesn't believe they will help and doesn't want to pay the cost to find out.

I think if you are going to claim you have an opinion based on experience you should probably, at the least, experience the thing you are trying to state your opinion on. It's probably not enough to imagine the experience you would have and then go with that.

AlexCoventry · 2025-05-05T18:45:44 1746470744

He does partially address this elsewhere in the blog post. It seems that he's mostly concerned about surprise costs:

> On paper, coding agents should be able to address my complaints with LLM-generated code reliability since it inherently double-checks itself and it’s able to incorporate the context of an entire code project. However, I have also heard the horror stories of people spending hundreds of dollars by accident and not get anything that solves their coding problems. There’s a fine line between experimenting with code generation and gambling with code generation.

minimaxir · 2025-05-05T18:51:03 1746471063

Less surprise costs, more wasting money and not getting proportionate value out of it.

usrbinbash · 2025-05-06T08:19:53 1746519593

> But agenty LLM configurations aren't just the LLM; they're also code that structures the LLM interactions. When the LLM behind a coding agent hallucinates a function, the program doesn't compile, the agent notices it, and the LLM iterates.

This describes the simplest, and most benign case of code assistents messing up. This isn't the problem.

The problem is when the code does compile, but contains logical errors, security f_ckups, performance dragdowns, or missed functionality. Because none of those will be caught by something as obvious as a compiler error.

And no, "let the AI write tests" wont catch them either, because that's not a solution, that's just kicking the can dwn the road...because if we cannot trust the AI to write correct code, why would we assume that it can write correct tests for that code?

What will ultimately catch those, is the poor sod in the data center, who, at 03:00 AM has to ring the on-call engineer out of his bed, because the production server went SNAFU.

And when the oncall then has to rely on "AI" to fix the mess, because he didn't actually write the code himself, and really doesn't even know the codebase any more (or even worse: Doesn't even understand the libraries and language used at all, because he is completely reliant on the LLM doing that for him), companies, and their customers, will be in real trouble. It will be the IT equivalent of attorneys showing up in court with papers containing case references that were hallucinated by some LLM.

stefan_ · 2025-05-05T18:16:10 1746468970

Have you tried it? In my experience they just go off on a hallucination loop, or blow up the code base with terrible re-implementations.

Similarly Claude 3.5 was stuck on TensorRT 8, and not even pointing it at the documentation for the updated 10 APIs for RAG could ever get it to correctly use the new APIs (not that they were very complex; bind tensors, execute, retrieve results). The whole concept of the self-reinforcing Agent loop is more of a fantasy. I think someone else likened it to a lawnmower that will run rampage over your flower bed at the first hiccup.

tptacek · 2025-05-05T18:21:08 1746469268

Yes, they're part of my daily toolset. And yes, they can spin out. I just hit the "reject" button when they do, and revise my prompt. Or, sometimes, I just take over and fill in some of the structure of the problem I'm trying to solve myself.

I don't know about "self-reinforcing". I'm just saying: coding agents compile and lint the code they're running, and when they hallucinate interfaces, they notice. The same way any developer who has ever used ChatGPT knows that you can paste most errors into the web page and it will often (maybe even usually) come up with an apposite fix. I don't understand how anybody expects to convince LLM users this doesn't work; it obviously does work.

steveklabnik · 2025-05-05T19:26:17 1746473177

> I don't understand how anybody expects to convince LLM users this doesn't work; it obviously does work.

This is really one of the hugest divides I've seen in the discourse about this: anti-LLM people saying very obviously untrue things, which is uh, kind of hilarious in a meta way.

https://bsky.app/profile/caseynewton.bsky.social/post/3lo4td... is an instance of this from a few days ago.

I am still trying to sort out why experiences are so divergent. I've had much more positive LLM experiences while coding than many other people seem to, even as someone who's deeply skeptical of what's being promised about them. I don't know how to reconcile the two.

dml2135 · 2025-05-06T01:17:12 1746494232

> I am still trying to sort out why experiences are so divergent. I've had much more positive LLM experiences while coding than many other people seem to, even as someone who's deeply skeptical of what's being promised about them. I don't know how to reconcile the two.

I am also trying to sort this out, but I'm probably someone you'd consider to be on the other, "anti-LLM" side.

I wonder if part of this is simply level of patience, or, similarly, having a work environment that's chill enough to allow for experimentation?

From my admittedly short attempts to use agentic coding so far, I usually give up pretty quickly because I experience, as others in the thread described, the agent just spinning its wheels or going off and mangling the codebase like a lawnmower.

Now, I could totally see a scenario where if I spent time tweaking prompts, writing rule files, and experimenting with different models, I could improve that output significantly. But this is being sold to me as a productivity tool. I've got code to write, and I'm pretty sure I can write it fairly quickly myself, and I simply don't have time at my start up to muck around with babysitting an AI all day -- I have human junior engineers that need babysitting.

I feel like I need to be a lot more inspired that the current models can actually improve my productivity in order to spend the time required to get there. Maybe that's a chicken-or-egg problem, but that's how it is.

steveklabnik · 2025-05-06T16:38:48 1746549528

> I'm probably someone you'd consider to be on the other, "anti-LLM" side.

I think if you're trying stuff, you're not, otherwise, you wouldn't even use them. What I'd say is more that you're having a bad time, whereas I'm not.

> I wonder if part of this is simply level of patience, or, similarly, having a work environment that's chill enough to allow for experimentation?

Maybe? I don't feel like I've had to had a ton of patience. But maybe I'm just discounting that, or chiller or something, as you allude to.

namaria · 2025-05-06T08:54:59 1746521699

> Now, I could totally see a scenario where if I spent time tweaking prompts, writing rule files, and experimenting with different models, I could improve that output significantly.

I think this is it. Some people are willing to invest the time in writing natural language code for the LLM.

> I spent time tweaking prompts, writing rule files, and experimenting with different models, I could improve that output significantly. But this is being sold to me as a productivity tool. I've got code to write, and I'm pretty sure I can write it fairly quickly myself, and I simply don't have time at my start up to muck around with babysitting an AI all day -- I have human junior engineers that need babysitting.

I agree and this is the divide I think: skeptical people think this is a flimsy patch that will eventually collapse. I for one can't see how trying to maintain ever growing files in natural language won't lead to a huge cognitive load quite soon and I bet we're about to hear people discussing how to use LLMs to do that.

magicalist · 2025-05-05T21:04:34 1746479074

> This is really one of the hugest divides I've seen in the discourse about this: anti-LLM people saying very obviously untrue things, which is uh, kind of hilarious in a meta way.

> https://bsky.app/profile/caseynewton.bsky.social/post/3lo4td... is an instance of this from a few days ago.

Not sure why this is so surprising? ChatGPT search was only released in November last year, was a different mode, and it sucked. Search in o3 and o4-mini came out like three weeks ago. Otherwise you were using completely different products from Perplexity or Kagi, which aren't widespread yet.

Casey Newton even half acknowledges that timing ("But it has had integrated web search since last year"...even while in the next comment criticising criticisms using the things "you half-remember from when ChatGPT launched in 2022").

If you give the original poster the benefit of the doubt, you can sort of see what they're saying, too. An LLM, on its own, is not a search engine and can not scan the web for information. The information encoded in them might be ok, but is not complete, and does not encompass the full body of the published human thought it was trained on. Trusting an offline LLM with an informational search is sometimes a really bad idea ("who are all the presidents that did X").

The fact that they're incorrect when they say that LLM's can't trigger search doesn't seem that "hilarious" to me, at least. The OP post maybe should have been less strident, but it also seems like a really bad idea to gatekeep anybody wanting to weigh in on something if their knowledge of product roadmaps is more than six months out of date (which I guarantee is all of us for at least some subject we are invested in).

steveklabnik · 2025-05-05T21:31:56 1746480716

> ChatGPT search was only released in November last year

It is entirely possible that I simply got involved at a particular moment that was crazy lucky: it's only been a couple of weeks. I don't closely keep up with when things are released, I had just asked ChatGPT something where it did a web search, and then immediately read a "it cannot do search" claim right after.

> An LLM, on its own, is not a search engine and can not scan the web for information.

In a narrow sense, this is true, but that's not the claim: the claim is "You cannot use it as a search engine, or as a substitute for searching." That is pretty demonstrably incorrect, given that many people use it as such.

> Trusting an offline LLM with an informational search is sometimes a really bad idea ("who are all the presidents that did X").

I fully agree with this, but it's also the case with search engines. They also do not always "encompass the fully body of the published human thought" either, or always provide answers that are comprehensible.

I recently was looking for examples of accomplishing things with a certain software architecture. I did a bunch of searches, which led me to a bunch of StackOverflow and blog posts. Virtually all of those posts gave vague examples which did not really answer my question with anything other than platitudes. I decided to ask ChatGPT about it instead. It was able to not only answer my question in depth, but provide specific examples, tailored to my questions, which the previous hours of reading search results had not afforded me. I was further able to interrogate it about various tradeoffs. It was legitimately more useful than a search engine.

Of course, sometimes it is not that good, and a web search wins. That's fine too. But suggesting that it's never useful for a task is just contrary to my actual experience.

> The fact that they're incorrect when they say that LLM's can't trigger search doesn't seem that "hilarious" to me, at least.

It's not them, it's the overall state of the discourse. I find it ironic that the fallibility of LLMs is used to suggest they're worthless compared to a human, when humans are also fallible. OP did not directly say this, but others often do, and it's the combination that's amusing to me.

It's also frustrating to me, because it feels impossible to have reasonable discussions about this topic. It's full of enthusiastic cheerleaders that misrepresent what these things can do, and enthusiastic haters that misrepresent what these things can do. My own feelings are all over the map here, but it feels impossible to have reasonable discussions about it due to the polarization, and I find that frustrating.

AlexCoventry · 2025-05-05T22:20:07 1746483607

If you've only been using AI for a couple of weeks, that's quite likely a factor. AI services have been improving incredibly quickly, and many people have a bad impression of the whole field from a time when it was super promising, but basically unusable. I was pretty dismissive until a couple of months ago, myself.

I think the other reason people are hostile to the field is that they're scared it's going to make them economically redundant, because a tsunami of cheap, skilled labor is now towering over us. It's loss-aversion bias, basically. Many people are more focused on that risk than on the amazing things we're able to do with all that labor.

foobarqux · 2025-05-05T21:39:11 1746481151

These are mostly value judgments and people are using words that mean different things to different people but I would point that LLM boosters have been saying the same thing for each product release: "now it works, you are just using the last-gen model/technique which doesn't really work (even though I said the same thing for that model/technique and every one before that). Moreover there still hasn't been significant, objectively observable impact: no explosion in products, no massive acceleration of feature releases, no major layoffs attributed to AI (to which the response every time is that it was just released and you will see the effects in a few months).

Finally, if it really were really true that some people know the special sauce of how to use LLMs to make a massive difference in productivity but many people didn't know how to do that then you could make millions or tens of millions per year as a consultant training everyone at big companies. In other words if you really believed what you were saying you should pick up the money on the ground.

steveklabnik · 2025-05-05T21:42:59 1746481379

> using words that mean different things to different people

This might be a good explanation for the disconnect!

> I would point that LLM boosters have been saying the same thing

I certainly 100% agree that lots of LLM boosters are way over-selling what they can accomplish as well.

> In other words if you really believed what you were saying you should pick up the money on the ground.

I mean, I'm doing that in the sense that I am using them. I also am not saying that I "know the special sauce of how to use LLMs to make a massive difference in productivity," but what I will say is, my productivity is genuinely higher with LLM assistance than without. I don't necessarily believe that means it's replicable, one of the things I'm curious about it "is it something special about my setup or what I'm doing or the technologies I'm using or anything else that makes me have a good time with this stuff when other smart people seem to only have a bad time?" Because I don't think that the detractors are just lying. But there is a clear disconnect, and I don't know why.

foobarqux · 2025-05-05T21:58:43 1746482323

There is so much tacit understanding by both LLM-boosters and LLM-skeptics that only becomes apparent when you look at the explicit details of how they are trying to use the tools. That's why I've asked in the past for examples of recording of real-time development that would capture all the nuance explicitly. Cherry-picked chat logs are second best but even then I haven't been particularly impressed by the few examples I've seen.

> I mean, I'm doing that in the sense that I am using them.

My point is whatever you are doing is worth millions of dollars less than teaching the non-believers how to do it if you could figure out how (actually probably even if you couldn't but sold snake-oil).

diggan · 2025-05-05T20:07:46 1746475666

> I am still trying to sort out why experiences are so divergent. I've had much more positive LLM experiences while coding than many other people seem to, even as someone who's deeply skeptical of what's being promised about them. I don't know how to reconcile the two.

As with many topics, I feel like you can divide people in a couple of groups. You have people who try it, have their mind blown by it, so they over-hype it. Then the polar-opposite, people who are overly dismissive and cement themselves into a really defensive position. Both groups are relatively annoying, inaccurate, and too extremist. Then another group of people might try it out, find some value, integrate it somewhat and maybe got a little productivity-boost and moves on with their day. Then a bunch of other groupings in-between.

Problem is that the people in the middle tend to not make a lot of noise about it, and the extremists (on both ends) tend to be very vocal about their preference, in their ways. So you end up perceiving something as very polarizing. There are many accurate and true drawbacks with LLMs as well, but it also ends up poisoning the entire concept/conversation/ecosystem for some people, and they tend to be noisy as well.

Then the whole experience depends a lot on your setup, how you use it, what you expect, what you've learned and so many much more, and some folks are very quick to judge a whole ecosystem without giving parts of it an honest try. It took me a long time to try Aider, Cursor and others, and even now after I've tried them out, I feel like there are probably better ways to use this new category of tooling we have available.

In the end I think reality is a bit less black/white for most folks, common sentiment I see and hear is that LLMs are probably not hellfire ending humanity nor is it digital-Jesus coming to save us all.

steveklabnik · 2025-05-05T20:16:55 1746476215

> I feel like you can divide people in a couple of groups.

This is probably a big chunk of it. I was pretty anti-LLM until recently, when I joked that I wanted to become an informed hater, so I spent some more time with things. It's put me significantly more in the middle than either extremely pro or extremely anti. It's also hard to talk about anything that's not purely anti in the spaces I seemingly run in, so that also contributes to my relative quiet about it. I'm sure others are in a similar boat.

> for most folks, common sentiment I see and hear is that LLMs are probably not hellfire ending humanity nor is it digital-Jesus coming to save us all.

Especially around non-programmers, this is the vibe I get as well. They also tend to see the inaccuracies as much less significant than programmers seem to, that is, they assume they're checking the output already, or see it as a starting point, or that humans also make mistakes, and so don't get so immediately "this is useless" about it.

stefan_ · 2025-05-05T21:08:55 1746479335

> anti-LLM people saying very obviously untrue things, which is uh, kind of hilarious in a meta way.

tptacek shifted the goal posts from "correct a hallucination" to "solve a copy pasted error" (very different things!) and just a comment later theres someone assassinating me as an "anti-LLM person" saying "very obviously untrue things", "kind of hilarious". And you call yourself "charitable". It's a joke.

steveklabnik · 2025-05-05T21:21:55 1746480115

EDIT: wait, I think you're tptacek's parent. I was not talking about your post, I was talking about the post I linked to. I'm leaving my reply here but there's some serious confusion going on.

> theres someone assassinating me as an "anti-LLM person"

Is this not true? That's the vibe the comment gives off. I'm happy to not say that in the future if that's not correct, and if so, additionally, I apologize.

I myself was pretty anti-LLM until the last month or so. My opinions have shifted recently, and I've been trying to sort through my feelings about it. I'm not entirely enthusiastically pro, and have some pretty big reservations myself, but I'm more in the middle than where I was previously, which was firmly anti.

> "very obviously untrue things"

At the time I saw the post, I had just tabbed away from a ChatGPT session where it had relied on searching the web for some info, so the contrast was very stark.

> "kind of hilarious"

I do think that when people say that LLMs occasionally hallucinate things, and are therefore worthless, when others make false claims about them for the purpose of suggesting we shouldn't use them, that it is kind of funny. You didn't directly say this in your post, only handwaved towards it, but I'm talking about the discourse in general, not you specifically.

> And you call yourself "charitable"

I am trying to be charitable. A lot of people reached for some variant of "this person is stupid," and I do not think that's the case, or the good way to understand what people mean when they say things. A mistake is a mistake. I am actively not trying to simply dismiss arguments on either side of here, but take them seriously.

Bjartr · 2025-05-06T12:15:49 1746533749

> I am still trying to sort out why experiences are so divergent

I suspect part of it is that there still isn't much established social context for how to interact with an LLM, and best practices are still being actively discovered, at least compared to tools like search engines or word processors.

Search engines somewhat have this problem, but there's some social context around search engine skill, colloquially "google-fu", if it's even explicitly mentioned.

At some point, being able to get the results from a search engine stopped being entirely about the quality of the engine and instead became more about the skill of the user.

I imagine, as the UX for AI systems stabilizes, and as knowledge of the "right way", to use them diffuses through culture, experiences will be less divergent.

thewebguyd · 2025-05-06T16:28:15 1746548895

> I suspect part of it is that there still isn't much established social context for how to interact with an LLM, and best practices are still being actively discovered, at least compared to tools like search engines or word processors.

Likely, but I think another big reason for diverging experience is that natural language is ambiguous and human conversation leaves out a lot of explicit details because it can be inferred or assumed when using natural language.

I can't speak for others, but it's difficult for me to describe programming ideas and concepts using natural language - but, that's why we have programming languages for this. A language that is limited in vocabulary, and explicit in conveying your meaning.

Natural language is anything but, and it can be difficult to be exact. You can instinctively leave out all kinds of details using natural language, whereas leaving out those details in a programming language would cause a compiler error.

I've never really understood the push toward programming with natural language, even before LLMs. It's just not a good fit. And much like how you an pass specific parameters into Google, I think we'll end up getting to a place where LLMs have their own DSL for prompting to make it easier to get the result you want.

daxfohl · 2025-05-05T21:17:37 1746479857

So is the real engineering work in the agents rather than in the LLM itself then? Or do they have to be paired together correctly? How do you go about choosing an LLM/agent pair efficiently?

steveklabnik · 2025-05-05T21:34:20 1746480860

> How do you go about choosing an LLM/agent pair efficiently?

I googled "how do I use ai with VS: Code" and it pointed me at Cline. I've then swapped between their various backends, and just played around with it. I'm still far too new to this to have strong options about LLM/agent pairs, or even largely between which LLMs, other than "the free ChatGPT agent was far worse than the $20/month one at the task I threw it at." As in, choosing worse algorithms that are less idiomatic for the exact same task.

daxfohl · 2025-05-05T22:00:01 1746482401

I also wonder how hard it would be to create your own agent that remembers your preferences and other stuff that you can make sure stays in the LLM context.

...Maybe a good first LLM assisted project.

dcre · 2025-05-05T22:02:52 1746482572

No need to write your own whole thing (though it is a good exercise) — the existing tools all support ways of customizing the prompting with preferences and conventions, whether globally or per-project.

dcre · 2025-05-05T22:00:30 1746482430

The Aider leaderboards are quite helpful, performance there seems to match people's subjective experience pretty well.

https://aider.chat/docs/leaderboards/

Regarding choosing a tool, they're pretty lightweight to try and they're all converging in structure and capabilities anyway.

zoogeny · 2025-05-05T19:58:24 1746475104

I think it is pretty simple: people tried it a few times a few months ago in a limited setting, formed an opinion based on those limited experiences and cannot imagine a world where they are wrong.

That might sound snarky, but it probably works out for people in 99% of cases. AI and LLMs are advancing at a pace that is so different from any other technology that people aren't yet trained to re-evaluate their assumptions at the high rate necessary to form accurate new opinions. There are too many tools coming (and going, to be fair).

HN (and certain parts of other social media) is a bubble of early adopters. We're on the front lines seeing the war in realtime and shaking our heads at what's being reported in the papers back home.

steveklabnik · 2025-05-05T20:13:35 1746476015

Yeah, I try to stay away from reaching for these sorts of explanations, because it feels uncharitable. I saw a lot of very smart people repost the quoted post! They're not the kind who "cannot imagine a world where they are wrong."

But at the same time, the pace of advancement is very fast, and so not having recently re-evaluated things is significantly more likely while also being more charitable, I think.

zoogeny · 2025-05-05T20:25:51 1746476751

My language is inflammatory for certain, but I believe it is true. I don't think most minds are capable of having to reevaluate their opinions as quickly as AI is demanding. There is some evidence that stress is strongly correlated to uncertainty. AI is complicated, the tools are complicated, the trade-off are complicated. So that leaves a few options: live in uncertainty/stress, expend the energy to reevaluate or choose to believe in certainty based on past experience.

If someone is embracing uncertainty or expending the time/energy/money to reevaluate then they don't post such confidently wrong ideas on social media.

aerhardt · 2025-05-05T20:08:36 1746475716

> an instance of this from a few days ago.

Bro I've been using LLMs for search since before it even had search capabilities...

"LLMs not being for search" has been an argument from the naysayers for a while now, but very often when I use an LLM I am looking for the answer to something - if that isn't [information] search, then what is?

Whether they hallucinate or outright bullshit sometimes is immaterial. For many information retrieval tasks they are infinitely better than Google and have been since GPT3.

steveklabnik · 2025-05-05T20:11:48 1746475908

I think this is related, but I'm more interested in the factual aspects than the subjective ones. That is, I don't disagree there's also arguments over "are LLMs good for the same things search engines for" but it's more of the more objective "they do not search the web" part. We need to have agreement on the objective aspects before we can have meaningful discussion of the subjective, in my mind.

giovannibonetti · 2025-05-05T18:42:23 1746470543

> I think someone else likened it to a lawnmower that will run rampage over your flower bed at the first hiccup

This reminds me of a scene from the recent animation movie "Wallace and Gromit: Vengeance Most Fowl" where Wallace actually uses a robot (Norbot) to do gardening tasks, and rampages over Gromit's flower bed.

https://youtu.be/_Ha3fyDIXnc

zoogeny · 2025-05-05T20:06:21 1746475581

I mean, I have. I use them every day. You often see them literally saying "Oh there is a linter error, let me go fix it" and then a new code generation pass happens. In the worst case, it does exactly what you are saying, gets stuck in a loop. It eventually gets to the point where it says "let me try just once more" and then gives up.

And when that happens I review the code and if it is bad then I "git revert". And if it is 90% of the way there I fix it up and move on.

The question shouldn't be "are they infallible tools of perfection". It should be "do I get value equal to or greater than the time/money I spend". And if you use git appropriately you lose at most five minutes on a agent looping. And that happens a couple of times a week.

And be honest with yourself, is getting stuck in a loop fighting a compiler, type-checker or lint something you have ever experienced in your pre-LLM days?

mountainriver · 2025-05-05T18:44:03 1746470643

Have you tried it? More than once?

I’m getting massive productivity gains with Cursor and Gemini 2.5 or Claude 3.7.

One-shotting whole features into my rust codebase.

stefan_ · 2025-05-05T22:02:55 1746482575

I use it all the time, multiple times daily. But the discussion is not being very honest, particularly for all the things that are being bolted on (agent mode, MCP). Like just upstream people dunk on others for pointing out that maybe giving the model an API call to read webpages isn't quite turning LLM into search engines. Just like letting it run shell commands has not made it into a full blown agent engineer.

I tried it again just now with Claude 3.7 in Cursors Agent/Compose (they change this stuff weekly). Write a simple C++ TensorRT app that loads an engine and runs inference 100 times for a benchmark, use this file to source a toolchain. It generated code with the old API & a CMake file and (warning light turns on) a build script. The compile fails because of the old API, but this time it managed to fix it to use the new API.

But now the linking fails, because it overwrote the TRT/CUDA directories in the CMakeLists with some home cooked logic (there was nothing to do, the toolchain script sets up the environment fully and just find_package would work).

And this is where we go off the rails; it messes with the build script and CMakeLists more, but still it can not link. It thinks hey it looks like we are cross-compiling and creates a second build script "cross-compile.sh" that tries to use the compiler directly, but of course that misses things that the find_package in CMake would setup and so fails with include errors.

It pretends its a 1970 ./configure script and creates source files "test_nvinfer.cpp" and "test_cudart.cpp" that are supposed to test for the presence of those libraries, then tries to compile them directly; again its missing directories and obviously fails.

Next we create a mashup build script "cross-compile-direct.sh". Not sure anymore what this one tried to achieve, didn't work.

Finally, and this is my favorite agent action yet, it decides fuck it, if the library won't link, why don't we just mock out all the actual TensorRT/CUDA functionality and print fake benchmark numbers to demonstrate LLMs can average a number in C++. So it writes, builds ands runs a "benchmark_mock.cpp" that subs out all the useful functionality for random data from std::mt19937. This naturally works, so the agent declares success and happily updates the README.md with all the crap it added and stops.

This is what running the lawnmower over the flower bed means; you have 5 more useless source files and a bunch more shell scripts and a bunch of crap in a README that were all generated to try and fail to fix a problem it could not figure out, and this loop can keep going and generate more nonsense ad infinitum.

(Why could it not figure out the linking error? We come back to the shitty bolted on integrations; it doesn't actually query the environment, search for files or look at what link directories are being used, as one would investigating a linking error. It could of course, but the balance in these integrations is 99% LLM and 1% tool use, and even context from the tool use often doesn't help)

tptacek · 2025-05-05T23:40:23 1746488423

It's really weird for me to see people talking about using LLMs in coding situations in a frame where "agents" (we're not even at MCP yet!) are somehow an extra. People discussing the applicability of LLMs to programming, and drawing conclusions (even if only for themselves) about how well it works, should be experienced with a coding agent.

mountainriver · 2025-05-06T17:24:50 1746552290

LLMs aren't good at deep ML problems yet. We know this, its what MLE-bench is for. That doesn't mean they aren't good at other coding problems

dboreham · 2025-05-05T18:27:55 1746469675

Someone gave me the tip to add "all source files should build without error", which you'd think would be implicit, but it seems not.

tptacek · 2025-05-05T18:55:19 1746471319

There's definitely a skill to using them well (I am not yet expert); my only frustration is with people who (like me) haven't refined the skill but have also concluded that there's no benefit to the tool. No, really, in this case, you're mostly just not holding it right.

The tools will get better, but what I see happening with people who are good at using them (and from my own code, even in my degraded LLM usage), we have an existence proof of the value of the tools.

janalsncm · 2025-05-05T22:15:19 1746483319

There’s an argument that library authors should consider implementing those hallucinated functions, not because it’ll be easier for LLMs but because the hallucination is a statement about what an average user might expect to be there.

I really dislike libraries that have their own bespoke ways of doing things for no especially good reason. Don’t try to be cute. I don’t want to remember your specific API, I want an intuitive API so I spend less time looking up syntax and more time solving the actual problem.

dayvigo · 2025-05-05T22:22:15 1746483735

There's also an argument that developers of new software, including libraries, should consider making an earnest attempt to do The Right Thing instead of re-implementing old, flawed designs and APIs for familiarity's sake. We have enough regression to the mean already.

The more LLMs are entrenched and required, the less we're able to do The Right Thing in the future. Time will be frozen, and we'll be stuck with the current mean forever. LLMs are notoriously bad at understanding anything that isn't mappable in some way to pre-existing constructs.

> for no especially good reason

That's a major qualifier.

nicce · 2025-05-05T23:14:13 1746486853

Polars has their own LLM customised for the docs:

https://docs.pola.rs/api/python/stable/reference/

I would say that this is pretty good approach to combat the previous.

vunderba · 2025-05-05T18:04:37 1746468277

That sort of "REPL" system is why I really liked when they integrated a Python VM into ChatGPT - it wasn't perfect, but it could at least catch itself when the code didn't execute properly.

tptacek · 2025-05-05T18:07:43 1746468463

Sure. But it's 2025 and however you want to get this feature, be it as something integrated into VSCode (Cursor, Windsurf, Copilot), or a command line Python thing (aider), or a command line Node thing (OpenAI codex and Claude Code), with a specific frontier coding model or with an abstracted multi-model thingy, even as an Emacs library, it's available now.

I see people getting LLMs to generate code in isolation and like pasting it into a text editor and trying it, and then getting frustrated, and it's like, that's not how you're supposed to be doing it anymore. That's 2024 praxis.

gnatolf · 2025-05-05T19:58:54 1746475134

The churn of staying on top of this means to me that we'll also chew through experts of specific times much faster. Gone are the day of established, trusted top performers, as every other week somebody creates a newer, better way of doing things. Everybody is going to drop off the hot tech at some point. Very exhausting.

Philpax · 2025-05-05T20:52:32 1746478352

The answer is simple: have your AI stay on top of this for you.

Mostly joke, but also not joke: https://news.smol.ai/

arthurcolle · 2025-05-05T18:27:13 1746469633

I like using Jupyter Console as a primary interpreter, and then dropping into SQLite/duckdb to save data

Easy to to script/autogenerate code and build out pipelines this way

red_hare · 2025-05-05T19:27:16 1746473236

It is a little crazy how fast this has changed in the past year. I got VSCode's agent mode to write, run, and read the output of unit tests the other day and boy it's a game changer.

surgical_fire · 2025-05-05T19:13:10 1746472390

This has been my experience with any LLM I use as a code assistant. Currently I mostly use Claude 3.5, although I sometimes use Deepseek or Gemini.

The more prominent and widely used a language/library/framework, and the more "common" what you are attempting, the more accurate LLMs tends to be. The more you deviate from mainstream paths, the more you will hit such problems.

Which is why I find them them most useful to help me build things when I am very familiar with the subject matter, because at that point I can quickly spot misconceptions, errors, bugs, etc.

It's when it hits the sweet spot of being a productivity tool, really improving the speed with which I write code (and sometimes improving the quality of what I write, for sometimes incorporating good practices I was unaware of).

steveklabnik · 2025-05-05T19:21:35 1746472895

> The more prominent and widely used a language/library/framework, and the more "common" what you are attempting, the more accurate LLMs tends to be. The more you deviate from mainstream paths, the more you will hit such problems.

One very interesting variant of this: I've been experimenting with LLMs in a react-router based project. There's an interesting development history where there's another project called Remix, and later versions of react-router effectively ate it, that is, in December of last year, react-router 7 is effectively also Remix v3 https://remix.run/blog/merging-remix-and-react-router

Sometimes, the LLM will be like "oh, I didn't realize you were using remix" and start importing from it, when I in fact want the same imports, but from react-router.

All of this happened so recently, it doesn't surprise me that it's a bit wonky at this, but it's also kind of amusing.

joshuacc · 2025-05-06T00:35:28 1746491728

I ran into this as well, but now I have given standing instructions for the llm to pull the latest RR docs anytime it needs to work with RR. That has solved the entire issue.

zoogeny · 2025-05-05T19:50:38 1746474638

In addition to choosing languages, patterns and frameworks that the LLM is likely to be well trained in, I also just ask it how it wants to do things.

For example, I don't like ORMs. There are reasons which aren't super important but I tend to prefer SQL directly or a simple query builder pattern. But I did a chain of messages with LLMs asking which would be better for LLM based development. The LLM made a compelling case as to why an ORM with a schema that generated a typed client would be better if I expected LLM coding agents to write a significant amount of the business logic that accessed the DB.

My dislike of ORMs is something I hold lightly. If I was writing 100% of the code myself then I would have breezed past that decision. But with the agentic code assistants as my partners, I can make decisions that make their job easier from their point of view.

satvikpendem · 2025-05-05T21:22:53 1746480173

Cursor also can read and store documentation so it's always up to date [0]. Surprised that many people I talk to about Cursor don't know about this, it's one of its biggest strengths compared to other tools.

[0] https://docs.cursor.com/context/@-symbols/@-docs

protocolture · 2025-05-05T23:24:06 1746487446

>Although pandas is the standard for manipulating tabular data in Python and has been around since 2008, I’ve been using the relatively new polars library exclusively, and I’ve noticed that LLMs tend to hallucinate polars functions as if they were pandas functions which requires documentation deep dives to confirm which became annoying.

Funnily enough I was trying to deal with some lesser used parts of pandas via LLM and it kept sending me back through a deprecated function for everything. It was quite frustrating.

__mharrison__ · 2025-05-06T10:33:38 1746527618

This is because the training data for pandas code is not great. It is a lot of non programmers banging keys until it works or a bunch of newbie focused blog posts that endorse bad practices.

protocolture · 2025-05-07T00:16:41 1746577001

That tracks

namaria · 2025-05-06T08:22:02 1746519722

> the program doesn't compile

How does this even make sense when the "agent" is generating Python. There are several ways it can generate code that runs and even does the thing and still has severe issues.

datpuz · 2025-05-06T10:32:25 1746527545

Are you implying that you can actually let agents run loose to autonomously fix things without just creating a mess? Because that's not a thing that you can really do in real life, at least not for anything but the most trivial tasks.

__mharrison__ · 2025-05-06T10:28:51 1746527331

When there's is an AI that writes Polars code correctly, please let me know.

beepbooptheory · 2025-05-05T19:19:58 1746472798

How much money do you spend a day working like this?

steveklabnik · 2025-05-05T19:22:19 1746472939

I haven't spent many days or full days, but when I've toyed with this, it ends up at about $10/hour or maybe a bit less.

aerhardt · 2025-05-05T18:33:30 1746470010

> the program doesn't compile

The issue you are addressing refers specifically to Python, which is not compiled... Are you referring to this workflow in another language, or by "compile" do you mean something else, such as using static checkers or tests?

Also, what tooling do you use to implement this workflow? Cursor, aider, something else?

dragonwriter · 2025-05-05T18:42:07 1746470527

Python is, in fact, compiled (to bytecode, not native code); while this is mostly invisible, syntax errors will cause it to fail to compile, but the circumstances described (hallucinating a function) will not, because function calls are resolved by runtime lookup, not at compile time.

aerhardt · 2025-05-05T18:48:20 1746470900

I get that, and in that sense most languages are compiled, but generally speaking, I've always understood "compiled" as compiled-ahead-of-time - Python certainly doesn't do that and the official docs call it an interpreted language.

In the context we are talking about (hallucinating Polars methods), if I'm not mistaken the compilation step won't catch that, Python will actually throw the error at runtime post-compilation.

So my question still stands on what OP means by "won't compile".

dragonwriter · 2025-05-05T19:01:15 1746471675

> I get that, and in that sense most languages are compiled, but generally speaking, I've always understood "compiled" as compiled-ahead-of-time

Python is AOT compiled to bytecode, but if a compiled version of a module is not available when needed it will be compiled and the compiled version saved for next use. In the normal usage pattern, this is mostly invisible to the user except in first vs. subsequent run startup speed, unless you check the file system and see all the .pyc compilation artifacts.

You can do AOT compilation to bytecode outside of a compile-as-needed-then-execute cycle, but there is rarely a good reason to do so explicitly for the average user (the main use case is on package installation, but that's usually handled by package manager settings).

But, relevant to the specific issue here, (edit: calling) a hallucinated function would lead to a runtime failure not a compilation failure, since function calls aren't resolved at compile time, but by lookup by name at runtime.

(Edit: A sibling comment points out that importing a hallucinated function would cause a compilation failure, and that's a good point.)

Scarblac · 2025-05-05T19:11:35 1746472295

When a module is first loaded, it's compiled to bytecode and then executed. Importing a non existing function will throw an error right away.

It's not a compilation error but it does feel like one, somewhat. It happens at more or less the same time.

graemep · 2025-05-06T09:39:43 1746524383

Yes, that works if you do:

    from foo import nonexistent

but if you do:

    import foo

    foo.nonexistent()

You get a run time error.

A lot of the time with Python you will be importing classes so it will be mostly run time errors.

mountainriver · 2025-05-05T18:42:10 1746470530

Yes but it gets feedback from the IDE. Cursor is the best here