Reacting to this and the previous comment: I've been pursuing iterative human-computer interaction via GPT as perhaps not the ultimate solution, but an extant solution to the problem Knuth posed with literate programming before we had the machinery to do it. In HCI, the spectrum runs from humans to machines and assumes the humans are as tolerant of the machines as the machines are of the humans (Postel's law), a frame Peter Pirolli described here: https://www.efsa.europa.eu/sites/default/files/event/180918-...
Which is to say: with iterative human-computer interaction (HCI) back-ended by a GPT (API) that can learn from the conversation, and perhaps enhanced by retrieval augmented generation (RAG) over both code AND documentation (AKA prompt engineering), results beyond the average intern-engineer pair are not easily achievable yet, but they become increasingly probable as both humans and computers learn iteratively from interacting with this emergent technology.
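To make the RAG half concrete, here's a sketch of the shape I mean, with a deliberately crude keyword-overlap retriever standing in for a real embedding index (the file types, chunk size, and scoring are placeholders I picked for illustration, not anyone's production setup):

    from pathlib import Path

    def retrieve(question: str, repo: str, k: int = 3) -> list[str]:
        # Score every code/doc file by naive keyword overlap with the question.
        q = set(question.lower().split())
        scored = []
        for p in Path(repo).rglob("*"):
            if p.is_file() and p.suffix in {".py", ".md", ".rst"}:
                text = p.read_text(errors="ignore")
                score = len(q & set(text.lower().split()))
                scored.append((score, f"# {p}\n{text[:2000]}"))
        scored.sort(key=lambda s: s[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]

    def build_prompt(question: str, repo: str) -> str:
        # Prepend the retrieved code and documentation to the actual request.
        context = "\n\n".join(retrieve(question, repo))
        return ("Use this code and documentation as context:\n"
                f"{context}\n\nTask: {question}")

In practice you'd swap the keyword overlap for embeddings, but the pipeline keeps the same shape: retrieve the project's own code and docs, stuff them into the prompt, then ask.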
The key is realizing that the computer can generate code, but that code will frequently be bad, if not hallucinatory in its compilability and perhaps its computability. Therefore the human MUST play a DevOps, SRE, or tech writer role, pairing with the computer to produce better code, faster and cheaper.
Subtract either the computer or the human and you wind up with the same old, same old. What I think we want is GPT-backed metaprogramming that produces white box tests, precisely because it can see into the design and prove the code works before the code is ever shared with the human.
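For what it's worth, here is roughly the loop I have in mind, with `generate()` as a stand-in for whatever model API you use (the prompts and the three-round cap are arbitrary choices of mine): only hand the code to the human once its own white box tests pass in a throwaway sandbox.

    import subprocess
    import tempfile
    from pathlib import Path

    def generate(prompt: str) -> str:
        """Placeholder: call your model of choice and return its text."""
        raise NotImplementedError

    def paired_codegen(task: str, max_rounds: int = 3):
        # Ask for code, then for white box tests against it, and only surface
        # the code to the human after the tests pass in a sandbox.
        code = generate(f"Write a Python module (it will be saved as mod.py) that:\n{task}")
        for _ in range(max_rounds):
            tests = generate(
                "Here is mod.py:\n" + code +
                "\nWrite pytest white box tests for it (import it as `mod`) that "
                "exercise its internal branches and edge cases."
            )
            with tempfile.TemporaryDirectory() as tmp:
                Path(tmp, "mod.py").write_text(code)
                Path(tmp, "test_mod.py").write_text(tests)
                run = subprocess.run(["pytest", "-q"], cwd=tmp,
                                     capture_output=True, text=True)
            if run.returncode == 0:
                return code, tests  # "proven", at least as far as its own tests go
            # Feed the failures back instead of showing broken code to the human.
            code = generate(
                "These tests failed:\n" + run.stdout + "\nFix mod.py:\n" + code
            )
        return None, None  # give up honestly rather than hand over unproven code

The tests themselves can be wrong, so "prove" is overstating it, but at least nothing reaches the human without having been run.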
I don't know about you, but I'd trust AI a lot further if anything it generated were provable BEFORE it reached my cursor, not after.
The same is true here today.
Why doesn't every GPT interaction on the planet, when it generates code, simply generate white box tests proving that the code "works" and produces "expected results" to reach consensus with the human in its "pairing"?
I'm still guessing. I've posed this question to every team I've interacted with since this emerged, which includes many names you'd recognize.
Not trivial, but increasingly straightforward given the tools and the talent.
> Why doesn't every GPT interaction on the planet, when it generates code, simply generate white box tests proving that the code "works" and produces "expected results" to reach consensus with the human in its "pairing"? I'm still guessing. I've posed this question to every team I've interacted with since this emerged, which includes many names you'd recognize.
My guess would be that it's simply rare in the training data to have white box tests right there next to the new snippet of code, rather than a lack of capability. Even when code does have tests, it's usually in other modules or source code files, written in separate passes, and not the next logical thing to write at any given point in an interactive chatbot assistant session. (Although Claude-3.5-sonnet seems to be getting there with its mania for refactoring & improvement...)
When I ask GPT-4 or Claude-3 to write down a bunch of examples and unit-test them and think of edge-cases, they are usually happy to oblige. For example, my latex2unicode.py mega-prompt is composed almost 100% of edge cases that GPT-4 came up with when I asked it to think of any confusing or uncertain LaTeX constructs: https://github.com/gwern/gwern.net/blob/f5a215157504008ddbc8... There's no reason they couldn't do this themselves and come up with autonomous test cases, run it in an environment, change the test cases and/or code, come up with a finalized test suite, and add that to existing code to enrich the sample. They just haven't yet.
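A toy version of that autonomous loop, with `ask()` standing in for whichever model you call (the prompts and round count here are just illustrative, no particular vendor API assumed):

    import subprocess
    import tempfile
    from pathlib import Path

    def ask(prompt: str) -> str:
        """Placeholder for a GPT-4/Claude call."""
        raise NotImplementedError

    def grow_test_suite(module_source: str, rounds: int = 3) -> str:
        # Repeatedly ask the model for confusing edge cases, turn them into
        # tests, run them, and keep whatever the code already survives.
        suite = ""
        for _ in range(rounds):
            cases = ask(
                "Here is a module (saved as mod.py):\n" + module_source +
                "\nList confusing or uncertain inputs and write pytest tests for "
                "them (import the module as `mod`). Avoid duplicating these "
                "existing tests:\n" + suite
            )
            with tempfile.TemporaryDirectory() as tmp:
                Path(tmp, "mod.py").write_text(module_source)
                Path(tmp, "test_mod.py").write_text(suite + "\n" + cases)
                run = subprocess.run(["pytest", "-q"], cwd=tmp,
                                     capture_output=True, text=True)
            if run.returncode == 0:
                suite += "\n" + cases  # the code handles these: keep as regression tests
            else:
                print(run.stdout)      # a failing edge case is a bug report or a spec question
        return suite

The finalized suite is then exactly the "tests right next to the code" sample that's currently missing from the training data.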