Three or four weeks ago I was posting how LLMs were useful for one-off questions but I wouldn't trust them on my codebase. Then I spent my week's holiday messing around with them for some personal projects. I am now a fairly committed Roo user. There are lots of problems, but there is incredible value here. Try it and see if you're still a hold-out.
I spent a good part of yesterday attempting to use ChatGPT to help me choose an appropriate API gateway. Over and over it suggested things that literally do not exist, and the only reason I could tell was that I had spent a good amount of time in the actual documentation. This has been my experience roughly 80% of the time when trying to use an LLM. I would like to know the magical prompt engineering technique that makes it stop confidently hallucinating about literally everything.
I mirror the GP's sentiment. My initial attempts using a chat-like interface were poor. Then some months ago, due to many HN comments, I decided to give Aider a try. I had put my kid to bed and it was 10:45pm. My goal was "Let me just figure out how to install Aider and play with it for a few minutes - I'll do the real coding tomorrow." Fifteen minutes later, not only had I installed it, but my script was done. There was one bug I had to fix myself. It was production quality code, too.
I was hooked. Even though I was done, I decided to add logging, command line arguments, etc. An hour later, it was a production grade script, with a very nice interface and excellent logging.
Oh, and this was a one-off script. I'll run it once and never again. Now all my one-off scripts have excellent logging, because it's almost free.
There was no going back. For small scripts that I've always wanted to write, AI is the way to go. That script had literally been in my head for years. It was not a challenging task - but it had always been low on my priority list. How many ideas do you have in your head that you'll never get around to because of lack of time? Well, now you can do 5x more of those than you would have without AI.
I was at the "script epiphany" stage a few months ago and I got cool Bash scripts (with far more bells and whistles I would normally implement) just by iterating with Claude via its web interface.
Right now I'm at the "Gemini (with Aider) is pretty good for knock-offs of the already existing functionality" stage (in a Go/HTMX codebase).
I'm yet to get to the "wow, that thing can add brand new functionality using code I'm happy with just by clever context management and prompting" stage; but I'm definitely looking forward to it.
I'm having a very good experience with ChatGPT at the moment. I'm mostly using it for little tasks where I don't remember the exact library functions. Examples:
"C++ question: how do I get the unqualified local system time and turn into an ISO time string?"
"Python question: how do I serialize a C struct over a TCP socket with asyncio?"
"JS question: how do I dynamically show/hide an HTML element?" (I obviously don't write a lot of JS :-D)
ChatGPT gave me the correct answers on the first try. I have been a sceptic, but I'm now totally sold on AI assisted coding, at least as a replacement for Google and StackOverflow. For me there is no point anymore in wading through all the blog spam and SEO crap just to find a piece of information. Stack Overflow is still occasionally useful, but the writing is on the wall...
EDIT: Important caveat: stay critical! I have been playing around asking ChatGPT more complex questions where I actually know the correct answer, or where I can immediately spot mistakes. It sometimes gives me answers that would look correct to a non-expert, but are hilariously wrong.
The problem with this approach is that you might lose important context which is present in the documentation but doesn’t surface through the LLM. As an example, I just asked GPT-4o how to access the Nth character in a string in Go. Predictably, it answered str[n]. This is a wildly dangerous suggestion because it works correctly for ASCII but not for other UTF-8 characters. Sure, if you know about this and prompt it further it tells you about this limitation, but that’s not what 99% of people will do.
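To make the pitfall concrete, here's a minimal Go sketch (the string literal is just an illustrative example): indexing the string directly yields a single byte, while converting to a []rune first yields the Nth code point.

    package main

    import "fmt"

    func main() {
        s := "héllo" // 'é' is encoded as two bytes in UTF-8

        // Byte indexing: s[1] is the first byte of 'é' (0xC3), not a character.
        fmt.Println(s[1]) // prints 195

        // Converting to a []rune indexes by code point instead.
        r := []rune(s)
        fmt.Println(string(r[1])) // prints é
    }

(Even []rune indexes by code point rather than by grapheme cluster, so combining characters and emoji can still surprise you.)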
> The problem with this approach is that you might lose important context which is present in the documentation but doesn’t surface through the LLM.
Oh, I'm definitely aware of that! I mostly do this with things I have already done, but can't remember all the details. If the LLM shows me something new, I check the official documentation. I'm not into vibe coding :) I still want to understand every line of code I write.
Did you use search grounding? o3 or o4-mini-high with search grounding (which will usually come on by default for questions like this) is usually the best option.
Sure, this was exactly how I felt three weeks ago, and I could have written that comment myself. The agentic approach, where it works out it made something up by looking at the errors the type checker generates, is what makes the difference.
this is kind of a weird position to take. you're the captain, you're the person reviewing the code the LLM (agent or not) generates, you're the one asking for the code you want, you're in charge of deciding how much effort to put in to things, and especially which things are most worth your effort.
all this agent stuff sounded stupid to me until I tried it out in the last few weeks, and personally, it's been great - I give a not-that-detailed explanation for what I want, point it at the existing code and get back a patch to review once I'm done making my coffee. sometimes it's fine to just apply, sometimes I don't like a variable name or whatever, sometimes it doesn't fit in with the other stuff so I get it to try again, sometimes (<< 10% of the time) it's crap. the experience is pretty much like being a senior dev with a bunch of very eager juniors who read very fast.
anyway, obviously do whatever you want, but deriding something you've not looked into isn't a hugely thoughtful process for adapting to a changing world.
If I have to review all the code it's writing, I'd rather write it myself (maybe with the help of an LLM).
> anyway, obviously do whatever you want, but deriding something you've not looked into isn't a hugely thoughtful process for adapting to a changing world.
I have tried it. Not sure I want to be part of such a world, unfortunately.
> the experience is pretty much like being a senior dev with a bunch of very eager juniors who read very fast.
I... don't want that. Juniors just slow me down because I have to check what they did and fix their mistakes.
(this is in the context of professional software development, not making scripts, tinkering etc)
> I... don't want that. Juniors just slow me down because I have to check what they did and fix their mistakes.
> (this is in the context of professional software development, not making scripts, tinkering etc)
I understand the sentiment. A few months ago they wanted us to move fast and dumped 4 new people with very little real-world coding experience on us (originally 2 developers). Not fun, and very stressful.
However, keep in mind that in many workplaces, handling junior devs poorly means one of two things:
1. If you have some abstruse domain expertise, and it's OK that only 1-2 people work on it, you'll be relegated to doing that. Sadly, most workplaces don't have such tasks.
2. You'll be fantastic in your output. Your managers will like you. But they will not promote you. After some point, they expect you to be a leverage multiplier - if you can get others to code really well, the overall team productivity will exceed that of any superstar (and no, I don't believe 10x programmers exist in most workplaces).
I wonder if we'll start to see artisanal benchmarks. You -- and I -- have preferred models for certain tasks. There's a world in which we start to see how things score on the "simonw chattiness index", and come to rely on smaller, more specific benchmarks, I think.
Yeah, I think personalized evals will definitely be a thing. Between reviewing way too much Arena and WildChat and now having seen lots of live traces firsthand, I can say there's a wide range of LLM usage (and preferences) out there that really doesn't match my own tastes or requirements, lol.
For the past year or two, I've had my own personal 25-question vibe check I've used to kick the tires on new models, but I think the future is something both a little more rigorous and a little more automated (something like an LLM jury with UltraFeedback-style criteria based on your own real-world exchanges, then BTL-ranked)? A future project...
I think it's more likely that we move away from benchmarks and towards more of a traditional reviewer model. People will find LLM influencers whose takes they agree with and follow them to keep up with new models.
I am starting to feel like hallucination is a fundamentally unsolvable problem with the current architecture, and is going to keep squeezing the benchmarks until something changes.
At this point I don't need smarter general models for my work, I need models that don't hallucinate, that are faster/cheaper, and that have better taste in specific domains. I think that's where we're going to see improvements moving forward.
If you could actually teach these models things, not just in the current context but as lasting learning over time, that would alleviate a lot of the issues with hallucination. Imagine being able to say "that method doesn't exist, don't recommend it again", give it the documentation, and have it absorb that information permanently; that would fundamentally change how we interact with these models. But can that work for models hosted for everyone to use at once?
There's an almost infinite number of things that can be hallucinated, though. You can't maintain a list of scientific papers or legal cases that don't exist! Hallucinations (almost certainly) aren't specific falsehoods that need to be erased...
The level of hallucinations with o3 is no different from the level of hallucinations from most (all?) human sources in my experience. Yes, you definitely need to cross check, but yes, you need to do that for literally everything else, so it feels a bit redundant to keep preaching that as if it’s a failing of the model and not just an inherent property of all free sharing of information between two parties.
and sometimes before starting ... at two weeks out, they'd had to change the salary offer (down) because they had screwed up the salary calculation, expressed surprise when I said I planned to use the unlimited vacation policy to take a fixed four weeks a year (they felt it was a lot), changed the offer from employee to contractor, referred me to their accountant for what was really the simplest of accounting queries, sent me an equity calculator with an assumption of a $10bn sale price, and some other weird stuff. I really should have known better, and I only lasted a few months -- my old company reached out to check I was happy in the new role, and had me back within a fortnight of checking in.
Providers are exceptionally easy to switch. There's no moat for enterprise-level usage. There's no "market share" to gobble up because I can change a line in my config, run the eval suite, and switch immediately to another provider.
This is marginally less true for embedding models and things you've fine-tuned, but only marginally.
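As a rough illustration of the "change a line in my config" point (all names here are hypothetical, not any vendor's actual SDK), the provider can sit behind a small interface selected from config, so switching means editing one field and rerunning the eval suite:

    package llm

    import "fmt"

    // Config is a hypothetical app config; Provider is the one line you change.
    type Config struct {
        Provider string // e.g. "vendor-a" or "vendor-b"
        Model    string
        APIKey   string
    }

    // ChatClient is the only surface the rest of the codebase sees.
    type ChatClient interface {
        Complete(prompt string) (string, error)
    }

    // NewClient picks an implementation based on config alone.
    func NewClient(cfg Config) (ChatClient, error) {
        switch cfg.Provider {
        case "vendor-a":
            return &vendorA{model: cfg.Model, key: cfg.APIKey}, nil
        case "vendor-b":
            return &vendorB{model: cfg.Model, key: cfg.APIKey}, nil
        default:
            return nil, fmt.Errorf("unknown provider %q", cfg.Provider)
        }
    }

    // The concrete types would wrap each vendor's HTTP API; stubbed here.
    type vendorA struct{ model, key string }
    type vendorB struct{ model, key string }

    func (a *vendorA) Complete(prompt string) (string, error) { return "", nil }
    func (b *vendorB) Complete(prompt string) (string, error) { return "", nil }

Embeddings are the stickier case mentioned above, since switching embedding models usually means re-embedding your corpus.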
I mean sure, but also, I feel like the ability to query an LLM for something is an invaluable resource I never had before and has made knowledge acquisition immeasurably easier for me. I definitely search the web much, much less when I'm trying to learn something.
Both statements are false, but even if they were true: if AI can do your job, costs 1% of what you cost, and works 100x faster, there's no reason to pay you.
At what point do you think this becomes ridiculous? Like, are we angry they're not still supporting PowerPC? Would three more years have made a difference to you? 5 more? 10 more? What's the magical number that would have made you happy here?
Maybe a given number of years isn't what the yardstick should be, but rather whether the hardware can still be reasonably used.
For example, I have a 3rd gen Intel Xeon that runs circles around regular newish processors in brute processing force (think compiling and such). Yet MS no longer officially supports it for Win11. I know you can circumvent the TPM requirement, which I do, so I'm still using it, but this just shows how arbitrary this limit is.
In Apple's case, at least they can say it's a different architecture and whatnot.
Intel Macs can run Windows (not that it would help) or Linux; a distro like Mint should have good support for most of the hardware, and it actually runs better on older models. There is nothing Apple needs to do.
The moment they stop support, they should release all documentation for the hardware and let enthusiasts reuse it. Planned obsolescence and electronic waste could be avoided.
That’s not how it works. The cost to maintaining support for old hardware isn’t merely money for more engineers. It’s the opportunity cost of slowing down forward progress for new things.
Intel laptops are sooooo slow. So extremely painfully slow. They’re quite bad. I’m largely a Windows user, but my god, old Intel laptops are bloody awful. Leaving behind old and bad things isn’t bad.
Besides, an older Intel MacBook will continue to work in its current form. It doesn’t need another 10 years of updates.
They really aren't slow, but their performance and battery life are greatly eclipsed by Apple ARM. I could live pretty comfortably on a Lenovo P51 (something like a 2017 MBP) if I had to, under Linux or FreeBSD. Also, a non-negligible amount of performance was lost to the security gaffes and the microcode and OS mitigations for them.
If that's the case, Apple could easily offer a fantastic trade-in deal on existing Intel machines (say, bought in the last 5 years), to get people moved to Apple Silicon. Do you think they can't afford it? I feel bad for the person buying an Intel Mac Mini in 2022.
Apple could afford to randomly select ten thousand people via lottery and give them a million dollars. Most American HN commenters could afford to donate 80% of their salary and still live a comfortable life. And a lot, I bet, could donate 90%.
A person or company being able to afford something is not a compelling argument.
I know, but Apple offers trade-in credit anyway when you "sell" your old laptop back to them. Offering an additional credit as good will to recent Intel owners would do nothing but help their reputation and get old systems off the street. It doesn't affect me... I migrated to Apple Silicon years ago.
It’s just a simple value prop. Is the cost of increasing the payout worth the brand reputation?
Well we can say with confidence what Apple determined the answer to be. Only a few dorks on HN will care that a 5 year old laptop won’t get the new macOS update.
5 years is admittedly a bit short. But the M1 was a quite frankly revolutionary upgrade. So it’s a one off.