
> And for me, past a certain point, even if you continually report back problems it doesn't get any better in its new suggestions. It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process.

It sucks, but the trick is to always restart the conversation/chat with a new message. I never go beyond one reply, and I copy-paste a lot. I got tired of copy-pasting, so I wrote something like a prompting manager (https://github.com/victorb/prompta) to make it easier and to avoid having to neatly format code blocks and so on.

Basically, make one message; if the model gets the reply wrong, iterate on the prompt itself and start fresh, always. Don't try to correct it by adding another message; update the initial prompt to make it clearer/steer it more.
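
Roughly, the loop looks like this (a minimal sketch assuming the OpenAI Python SDK; the model name and prompt file are placeholders, nothing specific to prompta):

    # One fresh message per attempt: the model never sees its own earlier mistakes.
    # Assumes the OpenAI Python SDK; model name and prompt path are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def one_shot(prompt: str, model: str = "gpt-4o") -> str:
        # Send ONLY the (revised) prompt, with no prior replies attached.
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # Edit prompt.md between attempts instead of replying in-thread.
    print(one_shot(open("prompt.md").read()))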

But I've noticed that every model degrades really quickly past the initial reply, no matter the length of each individual message. The companies keep increasing the theoretical and practical context limits, but quality degrades well before those limits are reached, and they don't seem to be trying to address that (nor do they have a way of measuring it).



This is my experience as well, and it has been for over a year now.

LLMs are so incredibly transformative when they're incredibly transformative. And when they aren't, it's much better to fall back on the years of hard-won experience I have - the sooner the better. For example, I'll switch between projects and languages, and even with explicit instructions to move to a strongly typed language they'll stick to answers for the dynamic one. It's an odd experience to re-find my skills every once in a while. "Oh yeah, I'm pretty good at reading docs myself".

With all the incredible leaps in LLMs being reported (especially here on HN) I really haven't seen much of a difference in quite a while.


Interesting. This is another problem aider does not experience. It works on a git repo. If you switch repos, it changes context.

I’m not affiliated with aider. I just use it.

My bet is that many of the pitfalls people are experiencing at the moment are due to mismatched or immature tools.


In other words, don't use the context window. Treat it like a command line with input/output, where the purpose of each command is to extract information, manipulate knowledge, mine data, and so on.

Also, special care has to be given to the number of tokens. Even with one question/one answer, our artificial overlords can only really focus on about 500 to 1,000 tokens at a time. After that they start losing their marbles. The reasoning models are exceptions to that rule, but in essence they are not that different.
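
A rough way to keep an eye on that, assuming OpenAI-style tokenization via the tiktoken library (counts differ per model, so treat the threshold as a ballpark):

    # Sketch: check that a one-shot prompt stays in the ~500-1,000 token band.
    # Assumes OpenAI's tiktoken; "gpt-4" and prompt.md are placeholders.
    import tiktoken

    def token_count(text: str, model: str = "gpt-4") -> int:
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))

    prompt = open("prompt.md").read()
    n = token_count(prompt)
    if n > 1000:
        print(f"Prompt is {n} tokens -- consider trimming before sending.")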

The difference between using the tool correctly and not might be that instead of 99.9% accuracy, the user gets just 98%. That probably doesn't sound like a big difference to some people, but the error rate is 20 times lower in the first case (0.1% vs 2%).


People keep throwing these 95%+ accuracy rates for LLMs around in these discussions, but that is nonsense. It's closer to 70%, which is quite terrible. I use LLMs, but I never trust them beyond doing some initial searching when I'm stumped, and as soon as they unblock me I put them down again. It's not transformative; it's merely replacing Google, because search there has sucked for a while.


95% accuracy vs 70% accuracy: both numbers are pulled out of someone's ass and serve little purpose in the discussion at hand. How did you measure that? Or rather, since you didn't, what's the point of sharing this hypothetical 25-point difference?

And how funny that your comment seems to land right alongside this one about people having very different experiences with LLMs:

> I am still trying to sort out why experiences are so divergent. I've had much more positive LLM experiences while coding than many other people seem to, even as someone who's deeply skeptical of what's being promised about them. I don't know how to reconcile the two.

https://news.ycombinator.com/item?id=43898532


It works very well (99.9%) when the problem resides in territory familiar to the user. When I know enough about a problem, I know how to decompose it into smaller pieces, and all (most?) of those smaller pieces have already been solved countless times.

When a problem is far outside my understanding, AI leads me down the wrong path more often than not. Accuracy is terrible, because I don't know how to decompose the problem.

Jargon plays a crucial role there. LLMs need to be guided using as much of the problem's correct jargon as possible.

I have done this with people for decades. I read in a book at some point that the surest way to get people to like you is to speak to them in the words they usually use themselves. No matter what concepts they are hearing, if the words belong to their familiar vocabulary they are more than happy to discuss anything.

So when I meet someone, I always try to absorb as much of their vocabulary as possible, as quickly as possible, and then I use it to describe the ideas I am interested in. People understand much better that way.

Anyway, the same holds true for LLMs: they need to hear the words of the problem expressed in its particular jargon. So when programmers want to use a library, they need to absorb the jargon used in that particular library. It is only then that accuracy rates hit many nines.


I will step around the gratuitous rudeness and state the obvious:

No, the pretend accuracy of above 95% does not square with the hallucination rates of up to 50% reported by OpenAI itself, for example.

The difference in experiences is easily explainable, in my opinion. Much like some people swear by mediums and psychics while others easily see through them: it's easy to see what you want to see when a nearly random experience lands you a good outcome.

I don't appreciate your insinuation that I am making up numbers, and I thought it shouldn't go unanswered, but do not mistake this for a conversation. I am not in the habit of engaging with such demeaning language.


> gratuitous rudeness

It is "gratuitous rudeness" to say that numbers presented without any sort of sourcing/backing are pulled from someone's ass? Then I guess so be it, but I'm also not a fan of people presenting absolute numbers as some sort of truth when there isn't any clear way of coming up with those numbers in the first place.

Just like there are "extremists" claiming LLMs will save us all, clearly others fall on the other extreme and it's impossible to have a somewhat balanced conversation with either of these two groups.


This has largely been my experience as well, at least with GH Copilot. I mostly use it as a better Google now, because even with the context of my existing code, it can't adhere to the style at all. Hell, it can't even get Docker Compose files right, mixing schema versions and using incorrect parameters all the time.

I've also noticed that the language matters a lot. It's pretty good with Python, pandas, matplotlib, etc., but ask it to write some PowerShell and it regularly hallucinates modules that don't exist, more than with any other language I've tried.

And good luck if you're working with a stack that's not the flavor of the month with plenty of information available online. ERP systems with documentation that lives behind a paywall, so it's not in the training data - you know, the real-world enterprise CRUD use cases where I'd want to use it the most are exactly where it's the least helpful.


To be fair, I find ChatGPT useful for Elixir, which is pretty niche. The great error messages (if a bit verbose) and the atomic nature of functions in a functional language go with the grain of LLMs, I think.

Still, at most I get it to help me with snippets. I wouldn't want it to just generate lots of code; for one thing, Elixir is pretty easy to write anyway...


I think “don’t use the context window” might be too simple. It can be incredibly useful. But avoid getting into a context-window loop. When iterations stop showing useful progress toward the goal, it’s time to abandon the context. LLMs tend to circle back to the same dead-end solution path at some point. It also helps to jump between LLMs to get a feel for how they perform on different problem spaces.


Depends on the use case. For programming, where every small detail might have huge implications, 98% accuracy vs 99.99% is a ginormous difference.

Other tasks can be more forgiving, like writing, which I do all the time; there I load 3,000 tokens into the context window pretty frequently. Small details in accuracy don't matter so much for most people on everyday casual tasks like rewriting text, summarizing, etc.

In general, be wary of how much context you load into the chat; performance degrades faster than you can imagine. OK, the aphorism I started with was a little simplistic.


Oh sure. Context windows have been less useful for me on programming tasks than for other things. When working iteratively against a CSV file, for instance, they can be very useful. I’ve used something very similar to the following before (rough sketch of the resulting transformation below):

“Okay, now add another column which is the price of the VM based on current Azure costs and the CPU and Memory requirements listed.”

“This seems to only use a few Azure VM SKU. Use the full list.”

“Can you remove the burstable SKU?”
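
(Roughly the kind of transformation those prompts ask for, expressed as a pandas sketch; the SKU table, prices, and column names are made up for illustration, not real Azure data.)

    # Sketch only: hypothetical SKUs/prices/columns, not real Azure pricing.
    import pandas as pd

    vms = pd.read_csv("vms.csv")  # assumed columns: name, cpus, memory_gb

    # Hypothetical non-burstable SKU list: (sku, vcpus, memory_gb, hourly_usd)
    skus = pd.DataFrame(
        [("D2s_v5", 2, 8, 0.10), ("D4s_v5", 4, 16, 0.19), ("D8s_v5", 8, 32, 0.38)],
        columns=["sku", "vcpus", "memory_gb", "hourly_usd"],
    )

    def cheapest_price(row):
        # Cheapest SKU that satisfies the row's CPU and memory requirements.
        fits = skus[(skus.vcpus >= row.cpus) & (skus.memory_gb >= row.memory_gb)]
        return fits["hourly_usd"].min() if not fits.empty else None

    vms["price_per_hour"] = vms.apply(cheapest_price, axis=1)
    vms.to_csv("vms_priced.csv", index=False)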

Though I will say that simple error fixes within a context window are handled fine for programming issues. On more than one occasion, when I’ve copied and pasted an incorrect solution, providing the LLM with the error has been enough to fix the problem. But if it goes beyond that, it’s best to abandon the context.


Aider, the tool, does exactly the opposite, in my experience.

It really works, for me. It iterates by itself and fixes the problem.


I'm in the middle of test-driving Aider and I'm seeing exactly the same problem: the longer a conversation goes on, the worse the quality of the replies... Currently, I'm doing something like this to prevent it from loading previous context:

    rm -r .aider.chat.history.md .aider.tags.cache.v4 || true && aider --architect --model deepseek/deepseek-reasoner --editor-model deepseek/deepseek-chat --no-restore-chat-history
That clears the history; then I basically re-run it whenever the model gets something wrong (so I can update/improve the prompt and try again).


Why not use the /clear and /reset commands?



