If a model often produces information that is blatantly wrong, then you need to check ALL of its outputs. If you're going to have to double-check all information that it provides, you might as well skip using it entirely and search for the information directly.
You're missing the part where searching for the information directly might take hours or even days.
You're looking for black-and-white correctness, while the real world mostly cares about efficiency.
I can spend 2 days scouring documentation and forum posts and experimenting to get ffmpeg or Matplotlib to produce exactly the results I want. Or I can just ask ChatGPT and check whether its code works, and if it doesn't, spend maybe 10 minutes correcting the code or refining the prompt until it does.
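To make that concrete, here's a hypothetical example of the kind of generated code this applies to: a small helper to parse ffmpeg-style "HH:MM:SS.mmm" timestamps into seconds (the function name and task are illustrative, not from any real transcript). Verifying it against a couple of known inputs takes seconds, even when writing it from scratch after reading the docs would take much longer.

```python
# Hypothetical ChatGPT-generated helper: parse an ffmpeg-style
# "HH:MM:SS.mmm" timestamp into a number of seconds.
def parse_timestamp(ts: str) -> float:
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# Checking the output is cheap: a few known inputs tell you
# immediately whether the generated code is right.
assert parse_timestamp("00:00:01.500") == 1.5
assert parse_timestamp("01:02:03.000") == 3723.0
```

The asymmetry is the whole point: producing this correctly might require reading format documentation, but verifying it needs only a hand-computed example or two.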
You're also missing the part where verifying the correctness of an output is very often orders of magnitude faster than producing that output in the first place.