I've been working on this problem coming from the program synthesis school of thought over at https://promptless.ai (which you would have no clue about just from looking at the website, because it's targeted at tech writers).
I'm quite fond of the idea of incremental mutation of agent trajectories to move/embody some of the reasoning steps from LLM tokens into a program. Imagine you have a long agent transcript/trajectory and you have a magic wand to replace a run of messages with "and now I'll call this script which gives me exactly the information I need," and then you see if the rewritten trajectory is stable.
To give credit where it's due, it's an overly complicated restatement of what Manny Silva has been saying with docs-as-tests https://www.docsastests.com/. Once you describe some user flow to humans (your "docs"), you can "compile" or translate part or all of those steps into deterministic test programs that perform and validate state transitions. Ideally you compile an agent trajectory all the way.
So: working with coding agents, you've cranked up the defect rate in exchange for speed, so let's try testing all important flows. The first thing you try is: ok, I've got these user guides, I guess I'll have the agent follow along and try to do it. And that works! But it's a little expensive and slow.
So I go, ok I'll have the agent do it once, and if it finds a trajectory through a product that works, we can reflect on that transcript and make some helper scripts to automate some or all of those state transitions, then store these next to our docs.
And then you say, ok if I ship a product change, can I have my coding agent update those testing scripts to save the expense and time of re-running the original follow-along? Also an obvious thing to do, and you can totally build it yourself with Claude Code in a GitHub Action. But I think there is a lot of complexity in how you go about doing this, and in what kind of incremental computation you can do to keep the LLM costs of all this under a couple hundred bucks a month for teams shipping 20 changes a day with 200 pages of docs.
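To make "incremental computation" a bit more concrete, here is a minimal sketch of one way to do it: fingerprint each docs page together with the product files it depends on, and only pay for an agent re-run when that fingerprint changes. Everything here is an assumption for illustration (the `deps` mapping from pages to source files, the function names, the in-memory cache); it is not how Promptless or Doc Detective actually work.

```python
import hashlib


def fingerprint(*parts: str) -> str:
    """Hash a page's text plus the contents of everything it depends on."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
    return digest.hexdigest()


def stale_pages(
    docs: dict[str, str],        # page name -> page text
    deps: dict[str, list[str]],  # page name -> product files it exercises (hypothetical mapping)
    sources: dict[str, str],     # product file -> current contents
    cache: dict[str, str],       # page name -> fingerprint from the last successful run
) -> dict[str, str]:
    """Return {page: new_fingerprint} for pages whose inputs changed."""
    stale = {}
    for page, text in docs.items():
        dep_contents = [sources[path] for path in sorted(deps.get(page, []))]
        current = fingerprint(text, *dep_contents)
        if cache.get(page) != current:
            stale[page] = current
    return stale


docs = {"quickstart": "Click Sign up", "billing": "Open Billing"}
deps = {"quickstart": ["signup.py"], "billing": ["billing.py"]}
sources = {"signup.py": "v1", "billing.py": "v1"}
cache: dict[str, str] = {}

# First run: nothing is cached, so every page needs the expensive follow-along.
cache.update(stale_pages(docs, deps, sources, cache))

# A product change touches only billing.py, so only the billing page is re-run.
sources["billing.py"] = "v2"
to_rerun = stale_pages(docs, deps, sources, cache)
```

The hard part in practice is the `deps` mapping, which this sketch simply assumes you already have; the cache should also only be updated after a re-run actually succeeds.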
The most polished open source "compiler/translator" I've seen exploring these ideas so far is Doc Detective (https://doc-detective.com) by Manny.
This is the best articulation of this viewpoint I have seen so far so I tip my hat to your writing skills.
I've been sitting with this post for a bit and something about the framing keeps bugging me.
I keep coming back to one question: are you planning to add Co-authored-by trailers to these commits?
The pitch is: don't send me code, send me the prompt you used to produce it. And on the surface that sounds like a lighter ask — a prompt is a small thing, one sentence maybe, versus a whole diff. But that's the sleight of hand. The prompt that produced a working patch is almost never one prompt. My PR is the compressed output of all of that work.
So "just send me the prompt" is doing two things at once. It's reframing the contributor's work as a small artifact (one prompt, how hard could that be), and it's reassigning the result of that work to whoever re-runs it. If I send you the actual thing that produced the patch — the whole transcript, the rejected attempts, the corrections — that's not a smaller contribution than a PR, it's a more complete one. And you're going to land the commit under your name.
This is the same shape of argument the model labs made about training data. Individual contribution is "too insignificant" to be worth crediting, but the aggregate is valuable and belongs to whoever assembled it. I don't love it there and I don't love it here.
The other thing I keep thinking about: to anyone who hasn't used these tools seriously, an engineer landing a high volume of clean commits under their own name looks like a very productive author. That gap between how it looks and what's actually happening is going to close eventually, but in the meantime it's a real incentive to set things up this way.
None of this is an argument against the workflow. Review is expensive, untrusted code is risky, I get it. I just want to make an argument for Co-authored-by being part of the deal.
I'm curious if evals of the DAEMONs and replays for debugging are on the roadmap?
I looked but did not see any facility for collecting/managing evals in the Charlie docs.
Docs drift might sound easy for agents, but after working on it at https://promptless.ai for about two years, it's been trickier than just "make some skills". We've got an agent that watches PRs and suggests docs changes. Getting the suggestions good enough that doc owners would actually accept them took a fair bit of evals: avoiding an AI-sounding voice, matching existing content, and even the "simple" act of deciding whether a given PR warrants a docs change at all.
I have benefited greatly from evals catching things (especially as models change) to the point where I'm loath to go back.
Promptless (YC) | Founding Docs Practice Lead | San Francisco (Onsite) | Full-time | $140k–$200k + equity
Promptless builds AI agents that automatically update customer-facing documentation. Startups, CNCF projects, and Fortune 500 companies use us. YC-backed with a seed round from top VCs and angels.
This is a one-of-one role. You'll own the documentation practice at Promptless: onboarding customers onto the platform, building the methodology for AI-assisted documentation, and growing our reputation as the company that makes docs teams more effective. Think practice lead at a top consulting firm, except the domain is docs and the leverage is AI.
You should have deep experience in technical documentation (developer docs, API references, support content), be comfortable reading code, and be excited about pushing the boundaries of LLM-assisted writing. You'll be building this function from scratch, so an entrepreneurial mindset is key.
Ha, this is funny (also sad for me, because I failed to explain it clearly on the website) because you have described exactly what it does as an example of what it can't do.
The core loop is more like a truffle-hunting pig than a ghostwriter. Promptless watches for signal that your product is behaving differently from the live documentation. It watches PRs opened/merging, Slack threads, support tickets. Then like a pig alerting on a truffle it shows up like "hey, this section over here doesn't match what the code/product does anymore."
Now of course we'll also generate a first draft of a suggested fix, but I want to say 40% of tech writers just like knowing when things changed.
It's a proper union-find algorithm, where every suggestion links back to the source that triggered it, but multiple sources do get linked up to just a single canonical suggestion. So you don't get duplicate alerts if people keep talking for weeks about a fix going out in the next release.
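For anyone who hasn't run into union-find before, here is a minimal sketch of the idea applied to alerts. It is a textbook version with path compression only, not the production code, and the source IDs (`pr:101`, `slack:t1`, and so on) and the class name are made up for illustration.

```python
from collections import defaultdict


class SuggestionIndex:
    """Union-find over source IDs; each disjoint set is one canonical suggestion."""

    def __init__(self) -> None:
        self._parent: dict[str, str] = {}

    def find(self, source: str) -> str:
        """Return the canonical representative for a source, registering it if new."""
        self._parent.setdefault(source, source)
        root = source
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the walk directly at the root.
        while self._parent[source] != root:
            self._parent[source], source = root, self._parent[source]
        return root

    def link(self, a: str, b: str) -> str:
        """Record that two sources are about the same docs issue."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a
        return root_a

    def groups(self) -> dict[str, list[str]]:
        """Map each canonical source to every source linked to it."""
        out: dict[str, list[str]] = defaultdict(list)
        for source in self._parent:
            out[self.find(source)].append(source)
        return {root: sorted(members) for root, members in out.items()}


index = SuggestionIndex()
index.link("pr:101", "slack:t1")      # a Slack thread discussing the PR
index.link("slack:t1", "ticket:88")   # a support ticket about the same fix
index.find("pr:202")                  # an unrelated PR stays on its own
alerts = index.groups()               # two suggestions, not four
```

Each key in `alerts` is one suggestion and its value is the list of sources that triggered it, so weeks of follow-up chatter attach to the existing suggestion instead of producing a new alert.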
Obviously I've got some more work to do on the website again but c'est la vie.
It's already happened to me. I've started to have dreams where instead of some sort of interpersonal struggle the entire dream is just a chatbot UI viewport and I'm arguing with an LLM streaming the responses in. Which is super trippy when I become aware it's a dream. In the old days I'd dream about playing chess against myself and lose, which was quite a bizarre feeling because my brain was running both players. But that's totally normal compared to having my brain pretend to be an LLM inside a dream.
The literal writing of the code was hard. This revisionism about how we were all secretly shakespeare typing monkey scammers pulling the wool over the eyes of the economy drives me nuts. Choosing which words to put in the editor, how to express all these ideas in a limited syntax. That was the big skill.
Sure, writing a program that makes a machine kind of do something was easy. Lots of people can do that. But then you ship a mobile app to a billion users and discover that people are genuinely wired differently.
different cultures, different mental models, different expectations
Now you have to accommodate and express all of that complexity in a language whose only reader is a machine that tolerates zero ambiguity. And you have to do it in a way that other engineers can read, reason about, and build on top of without the whole thing exploding. That's not requirements gathering. It's literally writing it down.
You're doing the thing where you read code like a fish breathes water and conclude it was easy to write. You can read a Nobel Prize novel in a weekend too. The readability is the achievement, not evidence it was trivial.
I have been programming for 20 years, more than half of my life. Some people forget how they felt as beginners; I did not. Which I know, since I teach programming as well.
As a beginner, syntax is the hard thing: remembering how to write a thing. As a beginner you don't even think about structure, or how to write maintainable or testable code; you're just happy it eventually works for you. Depending on the beginner's character, they might fall into the trap of thinking that more complicated code using more advanced language features is the sign of a genius programmer.
When you're getting better you realize that writing the code is indeed the easy part and that you should avoid writing code that is too clever unless it is well localized and neatly tucked away. Writing clever code is something that does not impress you at all; quite the opposite, as it is usually just unnecessary bragging. The hard part after all isn't writing clever code, it is finding good abstractions, staying consistent, writing maintainable and testable code without being too smart. It is understanding and then solving the real world problem in an elegant way. It isn't writing what the customer thinks they want, it is writing what they truly need.
That does not mean actually writing the code isn't specialized work that requires skill. But it just isn't the hardest part of the job. Just like knowing how to use the tools isn't the hardest part for a car mechanic. Making sure that the car drives reliably, that you chose the right parts, and that you did it fast and efficiently is.
I think we’re agreeing and likely have different understandings of what’s meant by “code wasn’t the hard part”. What you’re describing is what I’m calling the hardest part: building a system. To me that’s different from “coding”. This isn’t engaging in revisionist history. It’s why I’m referencing a book written 51 years ago, almost two decades before I was born,[0] and referring to a joke about systems design from ‘99[1].
Edit: This also isn’t to say that _millions_ around the world aren’t employed just to write code. This isn’t to say LLMs aren’t hugely disruptive. This isn’t even to say they aren’t also good at the hard parts. It’s just to say there’s a difference between coding and systems design and one is harder than the other in most cases in most jobs.
>That’s not evidence the task was easy. That’s evidence it was so hard...
Are humans starting to adopt LLM patterns or was this ironically written with an LLM?
That said, I'm surprised you didn't bring up Marx in your essay in the later sections. I vaguely remember he had some thoughts about derivation of value from labor vs "ideas/capital". Whether or not you agree, this debate is reminiscent of that just moved up one level to white-collar workers.
You've described PMs running circles around you and you still can't see it. They didn't need to praise you or pressure you. They seem to have all caught on that your button is letting you feel smarter than them. You did their job, did a bunch of physical typing they would otherwise have to do themselves, and walked away thinking you won.
Meanwhile they're pulling the same or greater comp, working half the hours, and "drinking beers with important people" is an accepted part of their job. The status hierarchy you're describing where they suck isn't real. It's a useful fiction that keeps you grinding while they harvested your output.
Everyone becoming a PM is a good thing precisely because PMs don't work as hard. Wouldn't a job be more pleasant if you could meet expectations by lunch? Imagine how psychologically freeing that would be. Dreadful future my ass.
Considering every time they left not a single thing changed, as though they were never there, because I was the one actually organizing the projects, I doubt they were running circles around me. Likely dicking around with Jira for 5 hours to siphon money from our company instead of actually organizing the project.
> Meanwhile they're pulling the same or greater comp, working half the hours, and "drinking beers with important people" is an accepted part of their job
You took the words right out of my mouth. Almost like it's a made up job and not the real work that needs to get done.
That's what we call a Staff level engineer. Proven ability to learn, implement and validate is basically the "it factor" businesses are looking for.
If you are thinking about this from an academic angle then sure, it sounds weird to say "Two Staff jobs in a row from the University of LinkedIn" as a degree. But I submit this as basically the certificate you desire.
No, this is not at all being a staff engineer. One is about delivering high-impact projects toward a business's needs, with all the soft/political things that involves, and the other is about implementing and validating cutting-edge research, with all the deep academic and technical knowledge and work that that involves. They're incredibly different skillsets, and many people doing one would easily fail in the other.
Counter argument: people invest in bonds. Quite a lot of bonds in fact.
Picking up pennies in front of a steam roller and counterparty risk seem to be perennial favorites of youth, but I hazard to guess only a minority in the market have flesh yet untouched by fire.