Reacting to this and the previous comment: I've been pursuing iterative human-computer interaction via GPT as perhaps not the ultimate solution, but an extant solution to the problem Knuth posed with literate programming before we had the machinery to do it. In HCI, the spectrum runs from humans to machines and assumes the humans are as tolerant of the machines as the machines are of the humans (Postel's law), a frame Peter Pirolli described here: https://www.efsa.europa.eu/sites/default/files/event/180918-...
Which is to say: with iterative human-computer interaction (HCI) back-ended by a GPT (API) that can learn from the conversation, and perhaps enhanced by retrieval augmented generation (RAG) over both code AND documentation (AKA prompt engineering), results beyond the average intern-engineer pair are not easily achievable yet, but they become increasingly probable as both humans and computers learn iteratively from interacting with this emergent technology.
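To make the RAG half concrete, here's a sketch of the shape I mean, with a deliberately crude keyword-overlap retriever standing in for a real embedding index (the file types, chunk size, and scoring are placeholders I picked for illustration, not anyone's production setup):

    from pathlib import Path

    def retrieve(question: str, repo: str, k: int = 3) -> list[str]:
        # Score every code/doc file by naive keyword overlap with the question.
        q = set(question.lower().split())
        scored = []
        for p in Path(repo).rglob("*"):
            if p.is_file() and p.suffix in {".py", ".md", ".rst"}:
                text = p.read_text(errors="ignore")
                score = len(q & set(text.lower().split()))
                scored.append((score, f"# {p}\n{text[:2000]}"))
        scored.sort(key=lambda s: s[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]

    def build_prompt(question: str, repo: str) -> str:
        # Prepend the retrieved code and documentation to the actual request.
        context = "\n\n".join(retrieve(question, repo))
        return ("Use this code and documentation as context:\n"
                f"{context}\n\nTask: {question}")

In practice you'd swap the keyword overlap for embeddings, but the pipeline keeps the same shape: retrieve the project's own code and docs, stuff them into the prompt, then ask.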
The key is realizing that the computer can generate code, but that code will frequently be bad, if not hallucinatory in its compilability and perhaps its computability. Therefore the human MUST play a DevOps, SRE, or tech writer role, pairing with the computer to produce better code, faster and cheaper.
Subtract either the computer or the human and you wind up with the same old, same old. What I think we want is GPT-backed metaprogramming that produces white box tests, precisely because it can see into the design and prove the code works before the code is ever shared with the human.
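For what it's worth, here is roughly the loop I have in mind, with `generate()` as a stand-in for whatever model API you use (the prompts and the three-round cap are arbitrary choices of mine): only hand the code to the human once its own white box tests pass in a throwaway sandbox.

    import subprocess
    import tempfile
    from pathlib import Path

    def generate(prompt: str) -> str:
        """Placeholder: call your model of choice and return its text."""
        raise NotImplementedError

    def paired_codegen(task: str, max_rounds: int = 3):
        # Ask for code, then for white box tests against it, and only surface
        # the code to the human after the tests pass in a sandbox.
        code = generate(f"Write a Python module (it will be saved as mod.py) that:\n{task}")
        for _ in range(max_rounds):
            tests = generate(
                "Here is mod.py:\n" + code +
                "\nWrite pytest white box tests for it (import it as `mod`) that "
                "exercise its internal branches and edge cases."
            )
            with tempfile.TemporaryDirectory() as tmp:
                Path(tmp, "mod.py").write_text(code)
                Path(tmp, "test_mod.py").write_text(tests)
                run = subprocess.run(["pytest", "-q"], cwd=tmp,
                                     capture_output=True, text=True)
            if run.returncode == 0:
                return code, tests  # "proven", at least as far as its own tests go
            # Feed the failures back instead of showing broken code to the human.
            code = generate(
                "These tests failed:\n" + run.stdout + "\nFix mod.py:\n" + code
            )
        return None, None  # give up honestly rather than hand over unproven code

The tests themselves can be wrong, so "prove" is overstating it, but at least nothing reaches the human without having been run.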
I don't know about you, but I'd trust AI a lot further if anything it generated were provable BEFORE it reached my cursor, not after.
The same is true here today.
Why doesn't every GPT interaction on the planet, when it generates code, simply generate white box tests proving that the code "works" and produces "expected results" to reach consensus with the human in its "pairing"?
I'm still guessing. I've posed this question to every team I've interacted with since this emerged, which includes many names you'd recognize.
Not trivial, but increasingly straightforward given the tools and the talent.
> Why doesn't every GPT interaction on the planet, when it generates code, simply generate white box tests proving that the code "works" and produces "expected results" to reach consensus with the human in its "pairing"? I'm still guessing. I've posed this question to every team I've interacted with since this emerged, which includes many names you'd recognize.
My guess would be that it's simply rare in the training data to have white box tests right there next to the new snippet of code, rather than a lack of capability. Even when code does have tests, it's usually in other modules or source code files, written in separate passes, and not the next logical thing to write at any given point in an interactive chatbot assistant session. (Although Claude-3.5-sonnet seems to be getting there with its mania for refactoring & improvement...)
When I ask GPT-4 or Claude-3 to write down a bunch of examples and unit-test them and think of edge-cases, they are usually happy to oblige. For example, my latex2unicode.py mega-prompt is composed almost 100% of edge cases that GPT-4 came up with when I asked it to think of any confusing or uncertain LaTeX constructs: https://github.com/gwern/gwern.net/blob/f5a215157504008ddbc8... There's no reason they couldn't do this themselves and come up with autonomous test cases, run it in an environment, change the test cases and/or code, come up with a finalized test suite, and add that to existing code to enrich the sample. They just haven't yet.
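A toy version of that autonomous loop, with `ask()` standing in for whichever model you call (the prompts and round count here are just illustrative, no particular vendor API assumed):

    import subprocess
    import tempfile
    from pathlib import Path

    def ask(prompt: str) -> str:
        """Placeholder for a GPT-4/Claude call."""
        raise NotImplementedError

    def grow_test_suite(module_source: str, rounds: int = 3) -> str:
        # Repeatedly ask the model for confusing edge cases, turn them into
        # tests, run them, and keep whatever the code already survives.
        suite = ""
        for _ in range(rounds):
            cases = ask(
                "Here is a module (saved as mod.py):\n" + module_source +
                "\nList confusing or uncertain inputs and write pytest tests for "
                "them (import the module as `mod`). Avoid duplicating these "
                "existing tests:\n" + suite
            )
            with tempfile.TemporaryDirectory() as tmp:
                Path(tmp, "mod.py").write_text(module_source)
                Path(tmp, "test_mod.py").write_text(suite + "\n" + cases)
                run = subprocess.run(["pytest", "-q"], cwd=tmp,
                                     capture_output=True, text=True)
            if run.returncode == 0:
                suite += "\n" + cases  # the code handles these: keep as regression tests
            else:
                print(run.stdout)      # a failing edge case is a bug report or a spec question
        return suite

The finalized suite is then exactly the "tests right next to the code" sample that's currently missing from the training data.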