This article is silly. It takes LLMs that were trained on statistical token -> token probabilities without any concept of truth or accuracy, and implies that LLMs are therefore hyped bullshit incapable of truth or accuracy.
LLMs can easily be trained on (token -> .. -> token, factual_accuracy) annotated data; we just need to build a "fact check" data set for common token sequences. That data set will be expensive to build, but we could always build a fact-check app that gamifies the process to help build it instead of just paying turks.
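To illustrate the idea, one such annotated record could look something like the sketch below. Nothing here is an existing dataset format; the class name and fields are made up.

```python
# Hypothetical shape of one "fact check" record: a common token sequence paired
# with crowd-sourced accuracy ratings. Purely illustrative, not a real schema.
from dataclasses import dataclass

@dataclass
class FactCheckedSequence:
    tokens: list[str]      # the common token sequence being rated
    accuracy_mean: float   # mean of reviewer ratings, 0.0 (false) to 1.0 (true)
    accuracy_std: float    # disagreement across reviewers
    n_reviews: int         # how many reviewers rated it

example = FactCheckedSequence(
    tokens=["Water", "boils", "at", "100", "degrees", "Celsius", "at", "sea", "level"],
    accuracy_mean=0.98,
    accuracy_std=0.03,
    n_reviews=42,
)
```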
I'm not sure if this would work, as you cannot always define the factual accuracy on a token-by-token basis.
However, I wonder whether having the LLM answer questions based on the knowledge baked into the model is the way forward at all: you quickly run into issues with up-to-date information, as you'd have to do a possibly expensive retraining at least every few weeks - and you also have no way of controlling exactly which information is stored, or how the model combines it.
I think an interesting alternative approach could be to split the bot up into two "agents" and one backend database:
- One model which receives user input and translates it into a well-defined, machine-readable representation, e.g. a series of SQL or SPARQL queries, a series of API calls, or whatever.
- Then have a wholly non-AI backend that validates and executes those queries and delivers a result, in machine-readable format as well.
- Finally, have a second model (or the first model with a different fine-tuning/prompt) which can translate the query result back into a free-text sentence, which is then delivered to the user.
This way, you keep tighter control over what exactly the model is answering, and you can also update the underlying knowledge base independently of the model.
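As a rough sketch of that split: the SQLite backend, function names, and canned query below are all placeholders, and `parse_to_query`/`verbalize` just stand in for the two models.

```python
import sqlite3

def parse_to_query(user_input: str) -> str:
    """Model 1 (placeholder): translate free text into a constrained query.
    In practice this would be an LLM fine-tuned to emit only whitelisted SQL."""
    # e.g. "How tall is the Eiffel Tower?" ->
    return "SELECT height_m FROM landmarks WHERE name = 'Eiffel Tower';"

def validate(query: str) -> bool:
    """Non-AI backend step: reject anything that isn't a read-only SELECT."""
    return query.strip().upper().startswith("SELECT")

def execute(query: str, db_path: str = "knowledge.db") -> list[tuple]:
    """Run the validated query against the independently updatable knowledge base."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(query).fetchall()

def verbalize(user_input: str, rows: list[tuple]) -> str:
    """Model 2 (placeholder): turn the structured result back into a sentence."""
    return f"According to the knowledge base, the answer to '{user_input}' is {rows[0][0]}."

def answer(user_input: str) -> str:
    query = parse_to_query(user_input)
    if not validate(query):
        return "Sorry, I can't answer that."
    return verbalize(user_input, execute(query))
```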
That sounds like the recent blog post by Stephen Wolfram [1], where he proposes having two models cooperating - one that "knows" facts (has computational knowledge, in that post), and the other that controls the language interpretation and decides which facts are relevant. It's interesting to go further and speculate that our "mind" is made up of cooperating sub-systems like this.
It wouldn't be on a token-by-token basis, but rather on "common" token-sequence tuples. Generate a histogram of common 5-8 grams, filter out the ones that are just linguistic filler and keep the ones that are "factual," then feed those "factual" n-grams to reviewers (via captcha/turk/game) to get a distribution of accuracy scores. You then condition the model on those scores, so it can produce "accuracy" estimates for token sequences, and further condition on high accuracy scores during generation.
For example (butchered for clarity): (["The", "2022", "election", "was", "stolen"], accuracy=normal(0.0001, 0.1)).
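A back-of-the-envelope version of that pipeline might look like this; the filler filter and the score aggregation are crude stand-ins, and the thresholds are made up:

```python
from collections import Counter
from statistics import mean, stdev

FILLER = {"the", "a", "an", "of", "to", "and", "in", "is", "was", "that"}

def common_ngrams(tokens, n_min=5, n_max=8, top_k=10_000):
    """Histogram of 5-8 grams over a token stream; keep only the most frequent."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def looks_factual(ngram):
    """Crude filter: drop n-grams that are mostly linguistic filler."""
    filler_ratio = sum(tok.lower() in FILLER for tok in ngram) / len(ngram)
    return filler_ratio < 0.5

def aggregate_reviews(scores):
    """Collapse per-reviewer ratings into the (mean, std) the model is conditioned on."""
    return mean(scores), (stdev(scores) if len(scores) > 1 else 0.0)

# e.g. reviewer scores collected via captcha/turk/game for one n-gram:
print(aggregate_reviews([0.0, 0.0, 0.01, 0.0]))  # -> (0.0025, 0.005)
```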
Frankly, I think you're massively underestimating the difficulty of what you're describing (assuming what you're describing would actually work, and I'm not sure it would). To do what you're suggesting, you'd need a human to evaluate every known fact and feed it into one of these LLMs. Good luck with that.
And that's ignoring the issue of maintaining that model as those facts change.
That's not true at all. Right now there's a tension in ML between hoovering up as much data as possible (since more data, even if it's bad, tends to improve the model in general even if it poisons certain parts of it) and curating/annotating data sets to improve answer quality and consistency.
Data set annotation is expensive and time-consuming, and it's a hard sell to shareholders to spend a billion dollars on data set annotation when there's still more data that can be cheaply obtained, and the liability or cost of producing incorrect answers is intangible or low. After Google's massive share-price hit from Bard's exoplanet flub, though, I don't think the costs are quite as intangible as they were, and I expect accuracy will become a new push in models over the next few years.
I actually made three statements in my comment and you only disputed the first one.
Yours is a good counter-argument to my first statement, and it is a valid rebuttal, I concede the point!
But I maintain that the scale of the problem, and the difficulty of maintenance, will make manual curation impossible, while automated "fact generation" is a technology that doesn't exist today.
Eh, when we have trillion-dollar companies whose core product is AI commonly used for question/answer tasks, and query accuracy directly impacts the bottom line, I don't think spending 10 billion a year on annotation is a big ask. It might not be perfect, but as long as your query responses are significantly more accurate than the competition's, you'll end up as Google to their AltaVista.
Additionally, you don't have to manually curate the entire data set, only a large enough chunk to predict accuracy annotations for the stuff you haven't curated yet. Facts that change over time shouldn't be hard to identify if you periodically do cross-validation on existing annotations and pick out the ones where the predicted accuracy differs significantly from the annotation.
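Something like the following could flag annotations that have drifted; `predict_accuracy` is a stand-in for whatever accuracy estimate the trained model exposes, and the threshold is arbitrary:

```python
from typing import Callable

def find_stale_annotations(
    annotations: dict[tuple[str, ...], float],
    predict_accuracy: Callable[[tuple[str, ...]], float],
    threshold: float = 0.3,
) -> list[tuple[str, ...]]:
    """Flag n-grams whose human-annotated accuracy now disagrees sharply with the
    model's predicted accuracy -- candidates for re-review, since the underlying
    fact may have changed."""
    return [ngram for ngram, human_score in annotations.items()
            if abs(predict_accuracy(ngram) - human_score) > threshold]
```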
There are a lot of process and engineering questions still to be answered in terms of building high quality AI products, but I don't think there are hidden dragons that are going to lead to another AI winter. There will be hiccups and snafus, but AI is the real deal.