I think it depends - the actual thing to measure is keeping a developer in flow state. Frequent errors break this, and so does latency. To be brief: yes, accuracy comes first.
Quality is measured in 2 main ways:
1) End-to-end: user query -> task resolution. These are Aider-style benchmarks answering the question of actual task completion.
2) Apply quality: syntax correctness, character diff, etc.
The error rate for large vs fast is around 2%. If you're doing code edits that are extremely complex or in obscure languages, large is the better option. There's also an auto option to route to the model we think is best for a task.
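To make the apply-quality side concrete, here's a rough sketch of how syntax correctness plus a character-level diff could be scored (illustrative only, not the actual eval harness), assuming Python and the standard-library ast and difflib modules:

```python
import ast
import difflib

def apply_quality(intended: str, applied: str) -> dict:
    """Score one applied edit: does the result still parse, and how close is it
    character-for-character to the intended file?"""
    # Syntax correctness: the applied file should still be valid Python.
    try:
        ast.parse(applied)
        syntax_ok = True
    except SyntaxError:
        syntax_ok = False

    # Character-level similarity between the intended and the applied file.
    char_similarity = difflib.SequenceMatcher(None, intended, applied).ratio()

    return {"syntax_ok": syntax_ok, "char_similarity": char_similarity}

# Example: a fast apply model dropped a colon while merging an edit.
intended = "def add(a, b):\n    return a + b\n"
applied = "def add(a, b)\n    return a + b\n"
print(apply_quality(intended, applied))  # syntax_ok=False, similarity ~0.98
```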
I don't believe anyone can be in some kind of "flow state" while waiting on LLM responses. I think it's funny that we complained for years about C and others being slow to compile, and now folks are fine waiting seconds++ every time they want to change something.
This is gonna sound like some chad hype shit, but I've tried just working on 2 different projects simultaneously and have had some incredible extended flow sessions. It felt like the old days of multitabling poker.
I had tried doing it with different features in different worktrees in the same codebase but found flow much harder there.
Lately I am also just spending a lot more time reworking code manually to keep the code in good shape. Still getting a ton of value out of the LLM doing a lot of work, but not exactly spending lots of time just waiting for it because I am dropping back down to manual mode frequently.
Flow state is 100% a thing, it's just impossible with LLMs (at least, for me). I can't be blocked waiting on things during a flow state or my mind starts wandering to other places.
- Multiple repos or independent changes in a monorepo
- First round of changes: idgaf about anything beyond the public interface and unit tests (rough sketch of what that looks like after this list)
- I review the public interface and make changes if needed
- I review the unit tests it wrote to see that, at least from the outside, it looks alright
- Here I either:
- make more unit tests (features, edge cases) and make it write the code for them
- polish what it generated
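Rough sketch of what that first-round review covers (the RateLimiter example and all names here are hypothetical, purely illustrative): I'm looking at the class signature, the docstrings, and the test, not the body the LLM filled in.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Public interface under review: allow at most `limit` calls per `window_s` seconds per key."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._calls: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Return True if this call is within the limit for `key`."""
        # Implementation detail: in the workflow above this body is whatever the
        # LLM generated; it isn't reviewed in the first pass.
        now = time.monotonic()
        calls = self._calls[key]
        while calls and now - calls[0] > self.window_s:
            calls.popleft()
        if len(calls) >= self.limit:
            return False
        calls.append(now)
        return True

# The unit test reviewed alongside it: does the behavior look right from the outside?
def test_allows_up_to_limit_then_blocks():
    rl = RateLimiter(limit=2, window_s=60.0)
    assert rl.allow("user-1")
    assert rl.allow("user-1")
    assert not rl.allow("user-1")
```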
Oh, it for sure is. But I use Amazon Q almost exclusively. One thing that gets me out of this state: when I have to do the math on "should I just do it myself" vs "keep refining the prompt/context until this thing finally gets it right".
Sometimes it splits edits to a single file into way too many fs_write(s) and often gets stuck, unable to apply the edits. It's also so conservative with using your machine's resources: it kept trying to run the test suite with a single worker. Like, come on, I paid for 32 cores, I will be using 32 cores.
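For what it's worth, when the agent is driving pytest, the multi-core part is usually one flag away, assuming the pytest-xdist plugin is installed (my example, not something Q does out of the box; tests/ is a placeholder path):

```python
# Run the suite across all 32 cores instead of a single worker.
# Assumes pytest plus the pytest-xdist plugin (`pip install pytest-xdist`).
import pytest

# Equivalent to `pytest -n 32 tests/` from the shell; "-n auto" would let
# xdist spawn one worker per available core instead of hard-coding 32.
raise SystemExit(pytest.main(["-n", "32", "tests/"]))
```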
Flow state has been redefined now that we are all using Claude Code. If I can stay focused on tests, reviewing code, etc while CC is doing its thing, we are good. The kloc/s doesn't matter as much.
if LLMs are ever able to write the kind of code I write for work, I'm going to move to management. spending 100% of my time reviewing AI slop and writing tests is the opposite of what I want. I want to define behavior quickly and have AI do the boring parts; you're letting the computer do the fun bit and spending your entire life doing the shit part, and paying for the privilege.
I realize this sounds harsh, but I assume anyone who is pushing for developers to basically take on all the shit work of a tech lead stuck managing a bunch of incompetent developers is not an actual developer, and is either an incompetent one who hopes LLMs will cover for them or someone looking to reduce their dependency on developers.
Fortunately for me, I think we'll be well into the Matrix before my job can be done adequately by AI so I have the luxury of using it as a tool here and there where it makes sense rather than spending most of my time trying to avoid the damage a firehose of hallucinations will do to my codebase.
Time really is a flat circle. My software career started with me archaically flipping characters in a file I vaguely understood with long pauses waiting on magic compilers to give me my actual output.
Now it's dying in the same place. Thankfully I got to spend the brunt of my career working through the fun, intermediate years.
However, it's kind of a trope for me at this point that people assume a negative opinion of using generative AI in the development process is due to a lack of experience using it.
> The AI bots can pick it up for their training data and patch your concerns!
This is borderline mystical AI speak to me. I know what you mean, and no, it doesn't work like that. An "AI bot" does not read an HN post of me articulating the reasons I am not enthused about generative AI development and "patch my concerns".
Ah, thanks for the explanation. I actually was confused a bit. For what it's worth, I had a second paragraph mentioning Poe's law that I deleted because I was concerned you would take it as a personal attack.
I should have left it in; knowing you were being sarcastic, I think you'd have appreciated me being confused about whether you were being satirical or not.
> the actual thing to measure is keeping a developer in flow state.
Personally, I find flow state hard to achieve when I constantly have to switch modes to debugging LLM output or an edit error that I missed.
When the majority of the time is spent waiting for the main LLM to think, I will always wait a few extra seconds for a better edit rather than risk having to spend multiple cycles playing find-the-bug because something didn't get applied correctly somewhere.
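Back-of-the-envelope, that tradeoff looks something like this (all numbers hypothetical, just to show the shape of it):

```python
# Toy expected-cost comparison: fast-but-flakier apply vs slower-but-safer apply.
# Every number below is made up for illustration.

def expected_seconds(apply_s: float, error_rate: float, debug_s: float) -> float:
    """Expected wall-clock cost per edit: apply latency plus the chance that a
    missed mis-apply turns into a manual find-the-bug cycle."""
    return apply_s + error_rate * debug_s

fast = expected_seconds(apply_s=1.0, error_rate=0.04, debug_s=300.0)   # 1 + 12 = 13.0 s
large = expected_seconds(apply_s=3.0, error_rate=0.02, debug_s=300.0)  # 3 + 6  =  9.0 s

print(f"fast apply:  {fast:.1f} s expected per edit")
print(f"large apply: {large:.1f} s expected per edit")
```

A few extra seconds of apply latency stops mattering the moment even a small fraction of silent mis-applies turns into minutes of debugging.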
Glad to hear quality comes first! Then I assume you have some public benchmarks like the ones you mention that are reproducible? I could only find this graph https://docs.morphllm.com/guides/apply but there is no mention of what it refers to, what data it used etc.