> I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on
They probably just use publicly-available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.
Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff. We may stumble upon an even stranger scenario where AI-generated content is more conducive to training than human content is.
> Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff.
Which studies show this? https://arxiv.org/abs/2305.17493 shows the exact opposite and my (layman's) understanding of statistics and epistemology lines up entirely with this finding.
Like, how could this even theoretically work? In the best case scenario wouldn't training on synthetic training data make LLMs overconfident / overfit the data once faced with new (human) input to respond to?
I don't have any exact references, but multiple finetuning datasets have used curated GPT-3/4 conversations as training data. It's less that they're outright superior to human data and more that they're less bad and far more abundant.
> Like, how could this even theoretically work?
I'm not really an expert on it either, but my understanding is that it works the same way curating human data works: you sift out the garbage (nonsense, impolite, or incoherent AI responses) and only include the exemplary conversations in your training set.
It feels kinda like the "monkeys on typewriters writing Shakespeare" parable. If you have enough well-trained AIs generating conversations, eventually some of them will be indistinguishable from human data and usable for training.
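To make the curation idea concrete, here's a toy sketch of filtering synthetic conversations by a quality score. Everything here (the heuristic, the threshold, the sample data) is invented for illustration; real pipelines use trained reward models, classifiers, and human review rather than a hand-rolled scoring function.

```python
# Toy sketch of curating synthetic conversations for finetuning.
# The quality heuristic is made up for illustration only.

def quality_score(conversation: list[str]) -> float:
    """Crude heuristic: favor substantive, non-repetitive turns."""
    if not conversation:
        return 0.0
    lengths = [len(turn.split()) for turn in conversation]
    avg_len = sum(lengths) / len(lengths)
    unique_ratio = len(set(conversation)) / len(conversation)
    # Longer average turns (capped) times fraction of unique turns.
    return min(avg_len / 20.0, 1.0) * unique_ratio

def curate(conversations: list[list[str]],
           threshold: float = 0.5) -> list[list[str]]:
    """Keep only conversations scoring above the threshold."""
    return [c for c in conversations if quality_score(c) >= threshold]

synthetic = [
    ["ok", "ok", "ok"],  # repetitive junk, gets filtered out
    ["Explain how TCP handshakes work.",
     "A TCP connection starts with a three-way handshake: the client "
     "sends SYN, the server replies SYN-ACK, and the client sends ACK. "
     "This synchronizes sequence numbers before any data is exchanged."],
]
kept = curate(synthetic)  # only the substantive conversation survives
```

The point isn't the specific heuristic; it's that curation turns a firehose of mostly-bad generations into a smaller, usable training set, which is exactly what the "monkeys on typewriters" framing suggests.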
> They probably just use publicly-available resources like The Pile
I'd be very surprised if the big orgs don't have in-house efforts that far exceed The Pile. Hell, we know Google paid Reddit a pile of money for data, and other orgs are also willing to pay.
GPT-Neo and Llama were both trained on The Pile, and both of those were fairly influential releases. That's not to say they don't also use other resources, but I see no reason not to use The Pile; it's enormous.
It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.
Yeah, I hadn't thought about their abandoned effort to scan every book and archived newspaper in the world in a while, but I bet they're regretting now that they didn't finish. A non-trivial amount of that physical media has been tossed or degraded by underfunded libraries since then. And it's more valuable to them now than it ever was.
The problem with AI-generated content is not necessarily that it's bad; rather, it's that it contains no novel information. To learn something, you must not already know it. If it's AI-generated, the AI already knows it.
We might also say the same thing about spelling and grammar checkers. The difference will be in the quality of oversight of the tool. The "AI generated drivel" has minimum oversight.
Example: I have a huge number of perplexity.ai search/research threads, but the ones I share with my colleagues are a product of selection bias. Some of my threads are quite useless, much like a web search that was a dud. Those do not get shared.
Likewise, if I use an LLM to draft passages or even act as something like an overgrown thesaurus, I do find I have to make large changes. But some of the material stays intact. Is it AI, or not AI? It's a bit of both. Sometimes my editing is heavy-handed, other times less so, but in all cases I checked the output.
You are assuming that you and AI are the same sort of thing.
I do not think we are at that point yet. In the meantime, the idea that we might get to intelligence by feeding in more data might get choked out by poisoned data.
I have a suspicion that there's a bit more to it than just more data though.