
Doesn't asking an LLM "to write a Python validator" suffer from the same 99.9% (or whatever the error rate is for validators written by Claude) problem?


The difference is that you're asking it to perform one intellectual task (write a program) instead of 100 menial tasks (parse a file). To the LLM the two are the same level of complexity, so doing less work means fewer opportunities for error.

Also, the LLM is more likely to fail spectacularly by hallucinating APIs when writing a script, and more likely to fail subtly on parsing tasks.


In addition to what you say, it can also be easier for an (appropriately skilled) human to verify a small program than to verify voluminous parsing output. Plus, as you say, there's the semi-automated "verification" of a very-wrong program failing to execute.
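
For concreteness, here's a minimal sketch of the kind of validator a human could review at a glance. The file format, column names, and rules are made up for illustration:

    # check_records.py - validate that every row in a CSV has the expected fields
    # (hypothetical example: the file layout and rules below are invented)
    import csv
    import sys

    REQUIRED_COLUMNS = {"id", "email", "created_at"}

    def validate(path):
        errors = []
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            # Header check: bail out early if expected columns are missing
            missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
            if missing:
                errors.append(f"missing columns: {sorted(missing)}")
                return errors
            # Row checks: flag empty ids and obviously malformed emails
            for lineno, row in enumerate(reader, start=2):
                if not row["id"].strip():
                    errors.append(f"line {lineno}: empty id")
                if "@" not in row["email"]:
                    errors.append(f"line {lineno}: bad email {row['email']!r}")
        return errors

    if __name__ == "__main__":
        problems = validate(sys.argv[1])
        for p in problems:
            print(p)
        sys.exit(1 if problems else 0)

A script that small can be reviewed in a minute, and if the LLM had hallucinated an API it would blow up on the first run instead of silently producing wrong parsing output.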


All tests have this problem. We still write them for the same reasons we do double-entry bookkeeping.



