I think the argument here ignores a critical fact: a huge factor in Claude Code's popularity is the Claude Max plans. These plans give you potentially thousands of dollars' worth of tokens for a flat, capped $200 a month.
Speaking for myself, I long for the day I can dump the comparatively garbage experience of Claude Code for something more enjoyable and OSS like OpenCode. But the fact is that it is simply not economically viable to do so.
So the PMF is not really for Claude Code alone -- it is for Claude Code + Claude Max.
I don't really see why evals are assumed to be exclusively the domain of data scientists. In my experience, SWEs-turned-AI-Engineers are much better suited to building agents. Some struggle more than others, but "evals as automated tests" is, imo, such an obvious mental model, and one that good SWEs adapt to so readily, that data scientists have no real role on many "agent" projects.
I'm not saying this is good or bad, just that it's what I'm observing in practice.
For context, I'm a SWE-turned-AI Engineer, so I may be biased :)
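To make the "evals as automated tests" mental model concrete, here's a minimal sketch. Everything here is made up for illustration: `run_agent` is a hypothetical stand-in for whatever agent pipeline is under test, and the cases/labels are invented.

```python
def run_agent(task: str) -> str:
    # Placeholder: in practice this would call your LLM/agent pipeline.
    return "REFUND_APPROVED" if "refund" in task.lower() else "ESCALATE"

# Each eval case pairs an input with an expected outcome,
# exactly like a unit-test fixture.
EVAL_CASES = [
    ("Customer requests a refund for a duplicate charge", "REFUND_APPROVED"),
    ("Customer is threatening legal action", "ESCALATE"),
]

def test_agent_evals():
    failures = []
    for task, expected in EVAL_CASES:
        got = run_agent(task)
        if got != expected:
            failures.append((task, expected, got))
    # Report a pass rate instead of hard-failing on one case --
    # evals are statistical, unlike classic unit tests.
    pass_rate = 1 - len(failures) / len(EVAL_CASES)
    assert pass_rate >= 0.9, failures

test_agent_evals()
```

The one real difference from unit tests is that last bit: you assert on an aggregate pass rate rather than demanding every case pass deterministically.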
I think there's a lot of methodological expertise that goes into collecting good eval data. In many cases you need human labelers with the right expertise, well-designed tasks, and well-defined constructs, and you need to hit interrater-agreement targets and troubleshoot when you don't. Good label data is a prerequisite to the stuff that can probably be automated by the AI agent (improving the system to optimize a metric measured against ground-truth labels). Data scientists and research scientists are more likely to have this skillset, and it takes time to pick up and learn the nuances.
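To illustrate the interrater-agreement point: a common metric is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. A from-scratch sketch (in practice you'd likely reach for `sklearn.metrics.cohen_kappa_score`; the ratings below are invented):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both raters independently pick
    # the same label, given each rater's label frequencies.
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

rater_a = ["yes", "no", "yes", "yes", "no", "yes"]
rater_b = ["yes", "no", "no",  "yes", "no", "yes"]
print(round(cohens_kappa(rater_a, rater_b), 3))  # -> 0.667
```

Here the raters agree on 5 of 6 items (83%), but kappa is only ~0.67 once chance agreement is subtracted out. Diagnosing *why* kappa is low (bad rubric? ambiguous construct? undertrained raters?) is exactly the methodological skill being described.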
As someone who works with real licensed engineers (electrical, civil), I wish we would use the term "agentic software engineering" to describe this. Omitting "software" here betrays a very SWE-centric mindset.
Agents are coming for the other engineering disciplines as well.
It is for this reason that I usually keep an "adr" folder in my repo to capture Architecture Decision Record documents in markdown. These allow the agent to get the "why" when it needs to. Useful for humans too.
The challenge is really crafting your main agent prompt such that the agent only reads the ADRs when absolutely necessary. Otherwise they muddy the context for simple inside-the-box tasks.
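For anyone curious, my ADRs follow the common Nygard-style template. A hypothetical example (file name and contents made up):

```markdown
<!-- adr/0007-use-postgres-advisory-locks.md -->
# ADR-0007: Use Postgres advisory locks for job scheduling

## Status
Accepted

## Context
We need to guarantee only one worker processes a given job. We
already run Postgres, and a dedicated lock service felt heavy.

## Decision
Use `pg_advisory_lock` keyed on the job ID instead of introducing
Redis or ZooKeeper.

## Consequences
No new infrastructure, but locks are tied to a single database
connection, so workers must hold their connection for the job's
duration.
```

The Context and Consequences sections are the payload: they're the "why" an agent (or a new teammate) can't recover from the code alone.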
I don't doubt your sincerity. But this represents an absolutely bonkers disparity compared to the reality I'm experiencing.
I'm not sure what to say. It's like someone claiming that automobiles don't improve personal mobility. There are a lot of logical reasons to be against the mass adoption of automobiles, but "lack of effectiveness as a form of personal mobility" is not one of them.
Hearing things like this does give me a little hope though, as I think it means the total collapse of the software engineering industry is probably still a few years away, if so many companies are still so far behind the curve.
> It's like someone claiming that automobiles don't improve personal mobility.
I prefer walking or cycling and often walk about 8km a day around town, for both mobility and exercise. (Other people's) automobiles make my experience worse, not better.
I'm sure there's an analogy somewhere.
(Sure, automobiles improve the speed of mobility, if that's the only thing you care about...)
I don't think I'm asking for something unreasonable: I'll believe this actually speeds up software creation when one of my vendors starts getting me software faster. That's not some crazy Luddism on my part, I don't think?
Sorry to be so blunt, but it's not surprising that you aren't able to get much value from these tools, considering you don't use them much.
Getting value from LLMs / agents is a skill like any other. If you don't practice it deliberately, you will likely be bad at it. It would be a mistake to confuse lack of personal skill for lack of tool capability. But I see people make this mistake all the time.
I love Claude Code and use it all day, every day for work. I would self identify as an unofficial Claude Code evangelist amongst my coworkers and friends.
But Claude Code is buggy as hell. Flicker is still present. Plugin/skill configuration is an absolute shitshow. The docs are (very) outdated/incomplete. The docs are also poorly organized, embarrassingly so. I know Claude Code's feature set quite well, and I still have a hard time navigating their docs to find a particular thing sometimes. Did you know Claude Code supports "rules" (similar to the original Cursor Rules)? Find where they are documented, and tell me that's intuitive and discoverable. I'm sorry, but with an unlimited token (and I assume, by now, personnel) budget, there is no excuse for the literal inventors of Claude Code to have documentation this bad.
I seriously wish they would spend some more cycles on quality rather than continuing to push so many new features. I love new features, but when I can't even install a plugin properly (without manual file system manipulation) because the configuration system is so bugged, inscrutable, and incompletely documented, I think it's obvious that a rebalancing is needed. But then again, why bother if you're winning anyway?
Side note: comparing it to Gemini CLI is simply cruel. No one should ever have to use or think about Gemini CLI.
Are you me? Patriot is amazing and I will never stop recommending it no matter how many dumbfounded looks I get.
I have a framed "Structural Dynamics of Flow" poster on my wall in my home office, visible on Teams calls. Only 1 person has ever recognized the reference.
You need to give it the tools to check its own work, and remove yourself from that inner low-level error resolution loop.
If you're building a web app, give it a script that (re)starts the full stack, along with Playwright MCP or Chrome DevTools MCP or agent-browser CLI or something similar. Then add instructions to CLAUDE.md on how and when to use these tools. As in: "IMPORTANT: You must always validate your change end-to-end using Playwright MCP, with screenshot evidence, before reporting back to me that you are finished."
You can take this further with hooks to more forcefully enforce this behavior, but it's usually not necessary ime.
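If you do want the forceful version, it looks roughly like this: a `Stop` hook in `.claude/settings.json` that runs a check script whenever the agent tries to finish. This is a sketch from memory (`./scripts/verify-e2e.sh` is a hypothetical script, and the hooks schema has evolved, so check the current docs before copying):

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "./scripts/verify-e2e.sh"
          }
        ]
      }
    ]
  }
}
```

The idea is that a non-zero exit from the script blocks the agent from stopping, which is a harder guarantee than a CLAUDE.md instruction it might ignore.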