The researchers found that certain artifacts associated with LLM generations could indicate whether a model is hallucinating. Their results showed that the distributions of these artifacts differ between hallucinated and non-hallucinated generations. Using these artifacts, they trained binary classifiers to separate hallucinated from non-hallucinated generations. They also found that the tokens preceding a hallucination can signal it before it occurs.
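To make the summary concrete, here is a minimal sketch of what "training a binary classifier on generation artifacts" could look like. The feature choice is my guess, assuming the artifacts are simple statistics over token-level log-probabilities (the paper may well use internal activations or other signals), and synthetic data stands in for labeled generations:

```python
# Minimal sketch of the classification setup, NOT the paper's actual method.
# Assumptions: "artifacts" are summary statistics of token-level log-probabilities,
# and generations are already labeled as hallucinated (1) or not (0).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def artifact_features(token_logprobs: np.ndarray) -> np.ndarray:
    """Summarize a generation's token log-probabilities into a fixed-size vector."""
    return np.array([
        token_logprobs.mean(),          # average confidence
        token_logprobs.min(),           # worst-case token
        token_logprobs.std(),           # how uneven the confidence is
        (token_logprobs < -4.0).mean(), # fraction of very low-probability tokens
    ])

def fake_generation(hallucinated: bool, length: int = 50) -> np.ndarray:
    # Synthetic stand-in: hallucinated generations drawn from a slightly
    # lower-confidence distribution, mirroring the claim that the artifact
    # distributions differ between the two classes.
    shift = -1.0 if hallucinated else 0.0
    return rng.normal(loc=-2.0 + shift, scale=1.5, size=length)

labels = rng.integers(0, 2, size=2000)
X = np.stack([artifact_features(fake_generation(bool(y))) for y in labels])

X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```

The point of the sketch is just that once you have per-generation features whose distributions differ by class, an off-the-shelf classifier is enough to exploit them.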
I didn't read the paper, but it seems they're trying to fix an ML model with another ML model. I'm not sure that's a good idea, but I digress. Besides, how do they decide what counts as a hallucination and what doesn't (cf. the similar debate around disinformation)?