One approach to fixing factual errors is to use two rounds of LLM interaction. I forgot the name of the paper.
Say you ask "What is the height of Everest?"
1. generate an answer with the LLM in closed-book mode: "The height of Everest is 8723m" = candidate_answer
2. search your references with candidate_answer, find: "At 8,849 meters (29,032 feet), Everest is considered the tallest point on Earth" = search_snippet
3. do a second pass to rewrite the answer with the LLM using search_snippet in the prompt
Basically, even the incorrect phrase candidate_answer is very good at matching the correct answer in a search engine. It acts like a template tuned to extract the desired fact. The search step can also flag cases where the claimed fact has no supporting references at all.
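The three steps above can be sketched as a small pipeline. The LLM and search calls (llm_generate, llm_rewrite, search_references) are hypothetical stand-ins, stubbed here with canned strings so the flow runs end to end:

```python
def llm_generate(prompt: str) -> str:
    # Pass 1: hypothetical closed-book LLM call.
    # Stubbed with the (wrong) draft answer from the example.
    return "The height of Everest is 8723m"

def search_references(query: str) -> str:
    # Hypothetical search over trusted references, queried with the draft.
    # Stubbed with the snippet from the example.
    return ("At 8,849 meters (29,032 feet), Everest is considered "
            "the tallest point on Earth")

def llm_rewrite(prompt: str) -> str:
    # Pass 2: hypothetical LLM call that rewrites the draft against
    # the retrieved snippet. Stubbed with the corrected answer.
    return "The height of Everest is 8,849 meters (29,032 feet)."

def answer_with_grounding(question: str) -> str:
    candidate_answer = llm_generate(question)             # step 1: closed book
    search_snippet = search_references(candidate_answer)  # step 2: retrieve
    rewrite_prompt = (                                    # step 3: rewrite
        f"Question: {question}\n"
        f"Draft answer: {candidate_answer}\n"
        f"Reference: {search_snippet}\n"
        "Rewrite the draft answer so it agrees with the reference."
    )
    return llm_rewrite(rewrite_prompt)

print(answer_with_grounding("What is the height of Everest?"))
```

A real implementation would also need the "no hits" branch: if search_references returns nothing for candidate_answer, the claim is flagged as unverifiable rather than rewritten.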
How would you apply this approach to the article's example, the response to the question "Who is Daragh O Brien from Castlebridge", where I count at least 15 separate statements of fact?
Should we research all of them and try again with a big table of hits and misses from the first attempt? Seems like a lot of work.
Also: is the generated response really "very good at matching the correct answer"? I suppose it works because the search engine's language processing cancels out the useless parts the AI generated (sort of a "human ABI", analogous to the C ABI?), but a more direct query (e.g. "height of Everest") would likely be just as effective.
Yes, this is the essential issue. Any system competent to fully ground and check the factual statements in a stream of arbitrary text will be phenomenally more complex than the original LLM, and will usually be able to answer queries directly, at which point one wonders what the LLM is adding. At best, if we can somehow identify all the factual statements that need cross-referencing and offload them to a knowledge base (dubious), we are left with the mad-libs connective flow the LLM has created, which approximates the essay style of a human writer. I'm not certain that has much practical value beyond allowing a form of undetectable plagiarism to be published as though it were free-form writing.