Today, I wrote a full YouTube subtitle downloader in Dart: 52 minutes from first googling anything about it to a full implementation with tests, custom-formatting any of the 5 obscure formats the subtitles could be in to my exact whims. Full coverage of every validation error via mock network responses.
I then wrote a web AudioWorklet for playing PCM in 3 minutes, which conformed to the same interface as my Mac/iOS/Android versions, e.g. setting the sample rate, a feedback callback, etc. I have no idea what an AudioWorklet is.
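For context on what such a worklet involves, here is a minimal sketch of a PCM-playing AudioWorklet processor. This is hypothetical illustration, not the author's code; all names (`PcmPlayerProcessor`, `'pcm-player'`) are made up. The main thread posts Int16 PCM chunks over the worklet's MessagePort, and the processor drains them into the output buffer:

```javascript
// Convert signed 16-bit PCM samples to Float32 in [-1, 1],
// the format the Web Audio output buffer expects.
function int16ToFloat32(int16) {
  const out = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    out[i] = int16[i] / 32768;
  }
  return out;
}

// Guarded so the conversion helper also runs outside a worklet scope.
if (typeof AudioWorkletProcessor !== 'undefined') {
  class PcmPlayerProcessor extends AudioWorkletProcessor {
    constructor() {
      super();
      this.queue = [];   // pending Float32Array chunks
      this.offset = 0;   // read position within queue[0]
      // Main thread posts ArrayBuffers of Int16 PCM here.
      this.port.onmessage = (e) => {
        this.queue.push(int16ToFloat32(new Int16Array(e.data)));
      };
    }

    process(inputs, outputs) {
      const channel = outputs[0][0];
      for (let i = 0; i < channel.length; i++) {
        if (this.queue.length === 0) { channel[i] = 0; continue; } // underrun: emit silence
        channel[i] = this.queue[0][this.offset++];
        if (this.offset >= this.queue[0].length) {
          this.queue.shift();
          this.offset = 0;
        }
      }
      return true; // keep the processor alive
    }
  }
  registerProcessor('pcm-player', PcmPlayerProcessor);
}
```

The main-thread side would load this file via `audioContext.audioWorklet.addModule(...)`, create an `AudioWorkletNode(audioContext, 'pcm-player')`, and push chunks through `node.port.postMessage(...)`.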
Two days ago, I stubbed out my implementation of OpenAI's WebSocket-based realtime API: 1,400 LOC over 2 days, mostly by hand while grokking and testing the API. In 32 minutes, I had a brand-spanking-new batch of code with a clean, event-based architecture and 86% test coverage. 1.8 KLOC with tests.
In all of these cases, the most I needed to do was drop in code files and say "nope, wrong" a couple of times to Sonnet, and say "why are you violating my service contract and only providing an example solution" to o1.
Not Llama 3.1 405B specifically; I haven't gone to the trouble of running it. But things turned some sort of significant corner over the last 3 months, between o1 and Sonnet 3.5. Mistakes are rare. It's believable 405B is on that scale; IIRC it went punch for punch with the original 3.5 Sonnet.
But I find it hard to believe a Google L3, or a third of L4s (read: new hires, or those who have survived 3 years), are that productive and sending code out for review at even a fifth of that volume, much less on demand.
So insane-sounding? Yes.
Out there? Probably. I work for myself now, so I don't have to have a complex negotiation with my boss about what I can use and how. And I only saw this starting ~2 weeks ago, with the full o1 release.
Most software is not one-off little utilities/scripts, greenfield small projects, etc. That's where LLMs excel: when you don't have much context and they can regurgitate solutions.
It's less to do with junior/senior/etc. and more to do with the types of problems you are tackling.
This is a 30 KLOC, 6-platform Flutter app that, in this user story, is doing VoIP audio and running 3 audio models on-device, including in your browser. A near-replica of the Google Assistant audio pipeline, except all on-device.
It's a real system, not kindergarten "look at the React Claude Artifacts did, the button makes a POST request!"
The 1,500 LOC of WebSocket / session-management code it refactored and tested touches nearly every part of the system (i.e. persisting messages, placing search requests, placing network requests to run a chat flow).
Also, it's worth saying this bit a bit louder: the "just throwing files in" I mention is key.
With that, the distinction is that the quality curve you observed runs in reverse: with o1's thinking, and whatever Sonnet's magic is, there's a higher payoff from working in a larger codebase.
For example, here, it knew exactly what to do for the web because it already saw the patterns iOS/Android/macOS shared.
The bend I saw in the curve came from being ultra lazy one night and seeing what would happen if it just had all the darn files.
> This is a 30KLOC 6 platform flutter app […] It's a real system, not kindergarten "look at the React Claude Artifacts did, the button makes a POST request!"
This is powerful and significant but I think we need to ground ourselves on what a skilled programmer means when he talks about solving problems.
That is, honestly ask: What is the level of skill a programmer requires to build what you've described? Mostly, imo, the difficulties in building it are in platform and API details not in any fundamental engineering problem that has to be solved. What makes web and Android programming so annoying is all the abstractions and frameworks and cruft that you end up having to navigate. Once you've navigated it, you haven't really solved anything, you've just dealt with obstacles other programmers have put in your way. The solutions are mostly boilerplate-like and the code I write is glue.
I think the definition of "junior engineer" or "simple app" will be defined by what LLMs can produce and so, in a way, unfortunately, the goal posts and skill ceiling will keep shifting.
On the other hand, say we watch a presentation by the Lead Programmer at Naughty Dog, "Parallelizing the Naughty Dog Engine Using Fibers"[^0], and ask the same questions: what level of skill is required to solve the problems he's describing (solutions worth millions of dollars, because his product has to sell that much to have a good ROI)?
"I have a million LOC game engine for which I need to make a scheduler with no memory management for multithreaded job synchronization for the PS4."
A lot of these guys, if you've talked to them, are often frustrated that LLMs simply can't help them make headway with, or debug, these hard problems where novel hardware-constrained solutions are needed.
It's been pretty hard, but if you reduce it to "Were you using a framework, or writing one that needs to push the absolute limits of performance?"...
...I guess the first?...
...But not really?
I'm not writing GPU kernels or operating-system task schedulers, but I am going to some pretty significant lengths to be running, e.g., a local LLM, an embedding model, Whisper, a model for voice activity detection, and a model for speaker counting, all while syncing state with 3 WebSockets. Simultaneously. In this case, Android and iOS are no Valhalla of vapid-stackoverflow-copy-pasta-with-no-hardware-constraints, as you might imagine.
And the novelty is: 6 years ago, I would have targeted iOS and prayed. Now I'm on every platform at top-tier speeds. All that boring, tedious, scribe-like stuff that 90% of us spend 80% of our time on is gone.
I'm not sure there are very many people at all who get to solve novel hardware-constrained problems these days; I'm quite pleased to brush shoulders with someone who brushes shoulders with them :)
Thus, this smacks more of no-true-Scotsman than something I can chew on. Productivity gains are productivity gains, and these are no small productivity gains in a highly demanding situation.
> Thus, smacks more of no-true-scotsman than something I can chew on.
I wasn't making a judgement about you or your work, after all I don't know you. I was commenting within the context of an app that you described for which an LLM was useful, relative to the hard problems we'll need help with if we want to advance technology (that is, make computers do more powerful things and do them faster). I have no idea if you're a true Scotsman or not.
Regardless: over the coming years we'll find out who the true Scotsmen were, as they'll be hired to do the stuff LLMs can't.
The challenging projects I've worked on are challenging not because slamming out code to meet requirements is hard (or takes long).
It's challenging because working to get a stable set of requirements requires a lot of communication with end users, stakeholders, etc. Then, predicting what they actually mean when implementing said requirements. Then, demoing the software and translating their non-technical understanding and comments back into new requirements (rinse and repeat).
If a tool can help program some of those requirements faster, as long as it meets security and functional standards, and is maintainable, it's not a big deal whether a junior dev is working with Stack Exchange or Claude, IMO. But I do want that dev to understand the code being committed, because otherwise security bugs and future maintenance headaches creep in.
I've definitely noticed the opposite on larger codebases. It's able to do magical things on smaller ones but really starts to fall apart as I scale up.
I think most software outside of the few Silicon Valleys of the world is in fact a bunch of dirty hacks put together.
I fully believe recursive application of current SOTA LLMs plus some deployment framework can replace most software engineers who work in the cornfields.
I don't understand what you guys are doing. For me, Sonnet is great when I'm starting with a framework or project, but as soon as I start doing complicated things, it's just wrong all the time. Subtly wrong, which is much worse, because it looks correct but isn't.
yt-dlp has a subtitle option.
To quote the documentation:
"--write-sub          Write subtitle file
--write-auto-sub     Write automatically generated subtitle file (YouTube only)
--all-subs           Download all the available subtitles of the video
--list-subs          List all available subtitles for the video
--sub-format FORMAT  Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"
--sub-lang LANGS     Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags"
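For reference, a typical invocation combining the flags quoted above looks roughly like this (the URL is a placeholder; `--skip-download` fetches only the subtitles, not the video):

```shell
# See which subtitle tracks exist for a video
yt-dlp --list-subs "https://www.youtube.com/watch?v=VIDEO_ID"

# Grab the English subtitles as SRT, without downloading the video itself
yt-dlp --write-sub --sub-lang en --sub-format srt --skip-download \
  "https://www.youtube.com/watch?v=VIDEO_ID"
```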
This is a key point, and one of the reasons why I think LLMs will fall short of expectations. Take the saying "code is a liability," plus the fact that with LLMs you are able to create so much more code than you normally would:
The logical conclusion is that projects will balloon with code, pushing LLMs to their limits, and this massive amount of code is going to contain more bugs and be more costly to maintain.
Anecdotally, most devs are supposedly using some form of AI for writing code, and yet the software I use isn't magically getting better (I'm not seeing an increased rate of features or less buggy software).
My biggest challenges in building a new application and maintaining an existing one are a lack of good unit tests and functional tests, questionable code coverage, a lack of documentation, and excessively byzantine build and test environments.
Cranking out yet more code, though, is not difficult (and junior programmers are cheap). LLMs truly do produce code like a (very bad) junior programmer: when writing unit tests, they take the easiest path and make something that passes but won't catch serious regressions. Sometimes I've simply reprompted with "Please write that code in a more proper, Pythonic way". When it comes to financial calculations around dates, date intervals, rounding, and so on, they often get things just ever-so-slightly wrong, which makes them basically useless for financial or payroll types of applications.
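To make the "ever-so-slightly wrong" class of money bug concrete, here is a hypothetical illustration (not from any codebase mentioned here): naive binary floating-point arithmetic drifts on decimal amounts, which is why financial code conventionally keeps money in integer cents and only formats at the edges.

```javascript
// Naive: binary floating point cannot represent most decimal fractions,
// so sums drift by tiny amounts that matter in payroll/finance.
const naiveTotal = 0.1 + 0.2; // 0.30000000000000004, NOT 0.3

// Safer convention: do all arithmetic in integer cents...
function addCents(...cents) {
  return cents.reduce((a, b) => a + b, 0);
}

// ...and convert to a display string only at the boundary.
function formatCents(cents) {
  return (cents / 100).toFixed(2);
}

const total = addCents(10, 20); // 30 cents, exactly
```

The integer-cents pattern sidesteps the drift entirely for addition and subtraction; division (e.g. splitting a bill) still needs an explicit rounding policy.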
It also doesn't help much with the main bulk of my (paid) work these days, which is migrating apps from some old platform like vintage C#/x86, or some vendor pile of Google Apps Script, Jotform, Zapier, and so on, into a more maintainable and testable framework that doesn't depend on subscription cloud services. So far I can't find a way to make LLMs productive at that at all; perhaps that's a good thing, since clients still see fit to pay decently for this work.
I don't understand - why does the existence of a CLI tool mean we're risking a grey goo situation if an LLM helps produce Dart code for my production Flutter app?
My guess is you're thinking I'm writing duplicative code for the hell of it, instead of just using the CLI tool - no. I can't run arbitrary binaries, at all, on at least 4 of the 6 platforms.
Apologies if it looked like I was singling out your comment. It was more that those comments brought the idea to mind that sheer code generation without skilled thought directing it may lead to unintended negative outcomes.
Something I noticed is that as the threshold for doing something like "write software to do X" decreases, the tendency for people to go search for an existing product and download it tends to zero.
There is a point where in some sense it is less effort to just write the thing yourself. This is the argument against micro-libraries as seen in NPM as well, but my point is that the threshold of complexity for "write it yourself" instead of reaching for a premade thing changes over time.
As languages, compilers, tab-complete, refactoring, and AI assistance get better and better, eventually we'll reach a point where the human race as a whole will be spitting out code at an unimaginable rate.
I agree with you that they are improving. Not being a programmer, I can't tell if the code itself has improved, but as a user who uses ChatGPT or Google Gemini to build scripts or TradingView indicators, I am seeing some big improvements. Many times, wording things in better detail and restricting it from going off on tangents results in working code.
I learned over 20 years ago to _never_ post your code online for programmers to critique. Never. Unless you are an absolute pro (at which point you wouldn't be asking for a review anyway), never do it.
I'm happy to provide whatever you ask for. With utmost deference, I'm sure you didn't mean anything by it and were just rushed, but just in case... I'd just ask that you engage with charity[^3] and clarity[^2] :)
I'd also like to point out[^1] that I gave it my original code, in toto. So of course you'll see, e.g., comments. Not sure what else contributed to your analysis; that's where some clarity could help me help you.
[^1](https://news.ycombinator.com/item?id=42421900) "I stubbed out my implementation of OpenAI's web socket based realtime API, 1400 LOC over 2 days, mostly by hand while grokking and testing the API. In 32 minutes, I had a brand spanking new batch of code, clean, event-based architecture, 86% test coverage. 1.8 KLOC with tests."
[^3](https://news.ycombinator.com/newsguidelines.html) "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."
Wrong? Shill? Dilettante?
No.
I'm still digesting it myself. But it's real.