> Alternatively, the LLM will use its tiny context window to build true spaghetti that even it cannot fix anymore.
And this is (probably) what is happening to the Claude Code product itself. The harness has regressed and is increasingly unstable. I get lots of weird glitches:
- When I scroll back in the conversation, I keep seeing the same sections repeated; because of this, I can't actually see the earlier parts of the conversation.
- The whole CLI UI glitches out such that you can't even make sense of what you are seeing. This is usually fixed by resizing the terminal window.
- The previous edit in the conversation history gets lost when I escape it to provide direction.
- The CLI sometimes consumes huge amounts of memory (more than 10GB per window, multiplied by the number of windows I'm working in).
- Etc.
90-95% of all projects suffer this fate, and it didn't start with LLMs. These projects include major commercial successes, such as a certain popular desktop operating system, and this is essentially the standard state for many web services.
The projects that keep things simple and bare-necessity are either ones that have scaled to enormous size (and complexity had to be removed for them to work) or ones that had strong, philosophically opinionated guardians; they are quite rare in practice.
This sounds like someone who has never had to write serious software.
> 1. You don't have to be an LLM expert to get good, consistent results with LLMs.
You don't get good, consistent results with LLMs, expert or not.
> 2. You don't have to write technical specs. The LLM does that for you. You just tell it "I want the next-tab button to wrap back to the first one" and it generates a technical plan. Natural language is fine.
Try this: have Claude write a section in your specs titled "Performance Optimizations" and see the gibberish it comes up with. Fluffy lists with no actually useful, project-specific content. This is a severe problem with LLM-driven speccing that I have encountered countless times. I now rarely allow them to touch the specs document.
> 3. Software that seems to work only to fail down the line in production is already how software works today. With LLMs you can paste the stacktrace or user bug email and it will fix it.
And pretty soon you have a big ball of mud. But I guess if the rate of bugs accelerates, the LLMs can also "fix" them faster.
> This is why vibe-coding works. Instead of simulating how an app will run in your head looking at its code, you run the app and tell the LLM what isn't working correctly. The app spec is derived iteratively through a UX feedback loop.
I should tell you about the markdown viewer with specific features that I have wanted to build using only LLM vibe-coding, and how none of the models have been able to do it.
> This sounds like someone who has never had to write serious software.
Why the insult? You never know who you're talking to on HN.
Your points have to do with process failures, not intractable LLM limitations, and most of them already apply to human-conceived software.
Your "Performance Optimizations" bit exemplifies this since you baked in the assumption that it will have no connection with your project. Well, why not? You need to figure out how to use your source code and relevant data as ground truth when working with LLMs.
A markdown viewer is on the simpler side of things I've built with LLMs, so this too suggests that you have a weak process. A common mistake is to expect LLMs to one-shot everything (the spec, the plan, or the actual impl). Instead, you should use LLMs to review-revise-cycle one of those until it's refined, ideally the spec/plan, since the impl is derived from it. You will get much better and more consistent results.
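To make that concrete, here's a minimal sketch of what I mean by a review-revise cycle, in Python (ask_llm and refine_plan are hypothetical names; wire the former up to whatever model client or CLI you actually use):

    # Minimal sketch of a review-revise cycle on a plan file.
    # ask_llm() is a hypothetical stand-in for your model client.
    from pathlib import Path

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("wire up your model client/CLI here")

    def refine_plan(path: str, rounds: int = 3) -> None:
        plan = Path(path).read_text()
        for _ in range(rounds):
            # Pass 1: critique only, no rewriting.
            critique = ask_llm(
                "Review this plan for gaps, risks, and vague steps. "
                "List concrete problems only:\n\n" + plan
            )
            # Pass 2: revise the plan to address the critique.
            plan = ask_llm(
                "Revise the plan below to address the critique. "
                "Return only the revised plan.\n\n"
                "PLAN:\n" + plan + "\n\nCRITIQUE:\n" + critique
            )
        Path(path).write_text(plan)

The point is that each round forces the model to attack its own draft instead of one-shotting it; the impl is then derived from the final plan.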
I recommend finding an engineer you respect/trust who has found a way to build good software with LLMs, and then tap them for their process.
Thanks for your response. I did not mean to insult; my mild jab was meant to draw attention to the idea that using LLMs for serious production software is a whole different game than using them for casual software.
You said
> Your "Performance Optimizations" bit exemplifies this since you baked in the assumption that it will have no connection with your project. Well, why not?
OK, I am talking from experience. Using LLMs for speccing is almost useless above a certain complexity level; what you get is an assemblage of the most average points you can imagine, the kinds of things almost every project in the category you are working in will address without any thought. Ask it to spec auth for a specific design, and all you'll get is cookie-based login, input validation, password hashing, and so on, which you don't need an LLM for. Nothing like an actual in-depth design. Even asking them to update specs based on discussions is hit or miss.
> A markdown viewer is on the simpler side of things I've built with LLMs, so this too suggests that you have a weak process. A common mistake is to expect LLMs to one-shot everything (the spec, the plan, or the actual impl). Instead, you should use LLMs to review-revise-cycle one of those until it's refined, ideally the spec/plan, since the impl is derived from it. You will get much better and more consistent results.
But what you are describing is NOT vibe-coding. I have no doubt I could build the viewer I want (which, by the way, is not your usual plain-vanilla markdown viewer, but one with some very specific features) with LLM assistance. My point is: if you can't even vibe-code your way to this specific viewer, how are you supposed to vibe-code serious software?
Indeed, the declining quality of Claude Code is, I suspect, testament to the fact that vibe-coding any sufficiently complex piece of software does not work in the long run.
Oh, I see. I'll grant whatever you take vibe-code to mean since that seems to be the hang-up -- vibe-code probably suggests there's no process at all.
My point is that the implementing phase is basically unsupervised, and all the work goes into the planning phase.
Yet I've noticed that over time, I'm not even needed in the planning phase, because a simple revision loop on a plan file produces a really good plan. My role is mostly to decide what the agents should do next and to drive the revision loop by hand (mostly because it's the best place for me to follow what's happening).
I've been getting really good results, though I've also developed a simple process that ensures that the LLMs aren't relying on their internal model but rather on external resources, which is critical.
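For illustration, the grounding part of that process looks something like this (a minimal sketch with hypothetical names; the essential bit is that real files get inlined into the prompt, so the model works from ground truth rather than its training priors):

    # Minimal sketch: make the model answer from real sources, not memory.
    # ask_llm() would again be a hypothetical stand-in for your model client.
    from pathlib import Path

    def grounded_prompt(question: str, files: list[str]) -> str:
        # Inline the actual file contents as ground truth.
        sources = "\n\n".join(
            "--- " + p + " ---\n" + Path(p).read_text() for p in files
        )
        return (
            "Answer using ONLY the sources below. If the answer is not "
            "in them, say so instead of guessing.\n\n"
            + sources + "\n\nQUESTION: " + question
        )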
Again, I've lost count of how many times I've had an in-depth architectural discussion with ChatGPT, with it giving me its final mark of approval ("This is excellent"), only for me to discover a flaw in my approach, or a radically simpler and better approach, go back to it with my findings, and have it proclaim, "Yeah, this is a much better approach."
These LLMs are in many cases sycophantic confirmation machines. Yes, they are useful to some extent in helping you refine your ideas and think of edge cases. But they are nowhere close to actually thinking better and faster. Faster in the wrong direction is not just slow; you are actually going backward.
Interesting article, but I'm perplexed by the original headline on the New York Times. The double past tense is grammatically incorrect, and yet it is repeated in the first paragraph.
I see this grammar a lot now, and it always bothers me. Is it accepted usage now?
I found this post late, after the NYT changed the title, and the HN title is "didn't use to", so I was confused about what the problem was; I figured the post you replied to had simply misspelled "didn't used to".
Anyway, I was curious myself and found this [0], which, looking through the comments, mentions what I thought the problem was: spelling. Others say it's intentional. But then this one [1] has far more discussion on the topic, and a comment under the question (from `a retired English grammarian`) says, "How you want to spell /'dɪdən yustə/ is what causes all the problems." [1]
"didn't used to" simply doesn't look right to me (I can't recall ever having come across it), but the /'dɪdən yustə/ explanation is the simplest way to make sense of correctness.
Another interesting discussion [2] (this seems to be a popular topic) has a few ESL teachers who say they teach "didn't use to" and taught "didn't used to" as being wrong, and who were surprised to find that "didn't used to" is quite common, its correctness depending on which rules a person chooses to acknowledge. There are apparently rules that make "didn't used to" correct; they are talked about in the other threads, but more so in this one.
> they have encyclopedic knowledge at a superficial level, the approximate judgement and maturity of a teenager, and the short-term memory of a parakeet. If I ask for something, I get the statistical average opinion of a bunch of goons, unconstrained by context or common sense or taste.
Love this paragraph; it's exactly how I feel about LLMs. Unless you really know what you are doing, they will produce very sub-optimal code, architecturally speaking. I feel like strong acumen for proper software architecture is one of the main things that defines the most competent engineers, along with naming things properly. LLMs are a long, long way from having architectural taste.
I’ve tried that. I’ve experimented with a whole council of 13 personas, including many famous developers. It’s definitely different. But it hasn’t performed significantly better in my tests.
If you only do spot checks, that is woefully inadequate. I have lost count of the number of times when, poring over code a SOTA LLM has produced, I notice a lot of subtle but major issues (and many glaring ones as well), issues a cursory look is unlikely to pick up on. And if you are spending more time going over the code, how is that the massive speed improvement you make it seem to be?
And what do you even mean by 10x the amount of work? I keep saying that anybody who starts to spout these sorts of anecdotes absolutely does NOT understand real-world, production-level, serious software engineering.
Is the model doing 10x the amount of simplification, refactoring, and code pruning an effective senior-level software engineer and architect would do? Is it doing 10x the detailed and agonizing architectural (re)work that a strong developer with honed architectural instincts would do?
And if you tell me it's all about accepting the LLM being in the driver's seat and embracing vibe coding: it absolutely does NOT work for anything exceeding a moderate level of complexity. I have tried that several times. To this day, no model has been able to write a simple markdown viewer with certain specific features I have wanted for a long time. I really doubt the stories people tell about creating whole compilers with vibe coding.
If all you see and appreciate is that it is pumping out 10x the features and 10x more code, you are missing the whole point. In my experience, you are actually producing a ton of sh*t, sorry.
Honestly, this is more a question about the scope of the application and the potential threat vectors.
If the GP is creating software that will never leave their machine(s) and is for personal usage only, I'd argue the code quality likely doesn't matter. If it's some enterprise production software that hundreds to millions of users depend on, software that manages sensitive data, etc., then I would argue code quality should asymptotically approach perfection.
However, I have many moons of programming under my belt. I would honestly say that I am not sure what good code even is. Good to who? Good for what? Good how?
I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
I apply the Herbie Hancock philosophy when defining good code. When once asked what jazz music is, Herbie responded, "I can't describe it in words, but I know it when I hear it."
> I apply the Herbie Hancock philosophy when defining good code. When once asked what jazz music is, Herbie responded, "I can't describe it in words, but I know it when I hear it."
That’s the problem. If we had an objective measure of good code, we could just use that instead of code reviews, style guides, and all the other things we do to maintain code quality.
> I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
Not if you have more than a few years of experience.
But what your point is missing is the reason that software keeps working in the first place, or stays in a good enough state that development doesn’t grind to a halt.
There are people working on those code bases who are constantly at war with the crappy code. At every place I’ve worked over my career, there have been people quietly and not so quietly chipping away at the horrors. My concern is that with AI those people will be overwhelmed.
They can use AI too, but in my experience, the tactical tornadoes get more of a speed boost than the people who care about maintainability.
I had a long reply to your comment, then decided it was not truly worth reading. However, I do have one question remaining:
> the tactical tornadoes get more of a speed boost than the people who care about maintainability.
Why are these not the same people? In my job, I am handed a shovel. Whatever grave I dig, I must lie in. Is that not common? Seriously, I am not being facetious. I've had the same job for almost a decade.
That’s because you’ve been there a decade. It’s very common for people to skip jobs every 2 years so that they never end up seeing the long term consequences of their actions.
The other common pattern I’ve seen goes something like this.
Product asks Tactical Tornado if they can build something. TT says sure, it will take 6 weeks. TT doesn’t push back or ask questions; he builds exactly what product asks for in an enormous feature branch.
At the end of 6 weeks he tries to merge it and he gets pushback from one or more of the maintainability people.
Then he tells management that he’s being blocked. The feature is already done and it works. Also, the concerns other engineers have can’t be addressed because “those are product requirements.” He’ll revisit it later to improve on it. He never does, because he’s on to the next feature.
Here’s the thing: a good engineer would have worked with product to tweak the feature up front so that it’s maintainable, performant, etc.
This guy uses product requirements (many of which aren’t actually requirements) and deadlines to shove his slop through.
At some companies management will catch on and he’ll get pushed out. At other companies he’ll be praised as a high performer for years.
Way better than random Indian dev output. I seriously don't know what everyone around here is doing. All I see are complaints, while I produce the output of ten devs. Clean code, solid design.
Spend a few hours writing context files. Spend the rest of the week sipping bourbon.
A better example might be why we build stairs with a standard riser height and tread run. If you've ever accidentally tripped on an unusual or non-standard stair, you already know this.
Users don't need to think about how to use them; they are ubiquitous and familiar, and therefore intuitive and automatic.
If every set of stairs (or, worse, every stair in a set) were radically different, every time you approached some stairs you would have to think carefully about how to use them so you don't fall.
Your point is true, but the one I was replying to was focusing on the aesthetic aspect. For them, the sameness of UIs, while functional, makes for a drab experience.
My point is that I don't find this to be the case. Rather, consistent UIs, while functional, are also beautiful to me. The constituents of the UI can be designed with aesthetic taste, but the way it is all put together, consistently and functionally, has a beauty all its own.
It seems just fine to me. This is what Anthropic needs to do if they want to survive. I'm always looking out for someone to integrate an actually good harness with a good model. Once that happens, I'm jumping ship if Anthropic keeps playing these tricks.
It's almost unusable for me now. A simple prompt to merge 3 sub-100-line files of simple Node code, on Sonnet 4.6, uses up 20% of my 5-hour quota in a new/fresh session.
To be fair, my comment was a bit harsher before the update. The way they handle development and communication, and how they treat customers, isn't fine. I've seen some angry people post and comment in ways that truly deserved the label "hostile."
The whole product, including the infrastructure and Claude Code's own code, appears to be vibe coded.