> I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on
They probably just use publicly-available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.
Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff. We may stumble upon an even stranger scenario where AI-generated content is more conducive to training than human content is.
> Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff.
Which studies show this? https://arxiv.org/abs/2305.17493 shows the exact opposite and my (layman's) understanding of statistics and epistemology lines up entirely with this finding.
Like, how could this even theoretically work? In the best case scenario wouldn't training on synthetic training data make LLMs overconfident / overfit the data once faced with new (human) input to respond to?
I don't have any exact references, but multiple finetuning datasets have used curated GPT-3/4 conversations as training data. It's less that they're outright superior to human data and more that they're less bad and far more abundant.
> Like, how could this even theoretically work?
I'm not really an expert on it either, but my understanding is that it works the same way curating human data works: you sift out the garbage (nonsense, impolite, or incoherent AI responses) and only include the exemplary conversations in your training set.
It feels kinda like the "monkeys on typewriters writing Shakespeare" parable. If you have enough well-trained AIs generating conversations, eventually some of them will be indistinguishable from human data and usable for training.
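To make the curation idea concrete, here's a toy sketch of filtering synthetic conversations by a quality score. Everything here (the heuristic, the threshold, the sample data) is invented for illustration; real pipelines use trained reward models, classifiers, and human review rather than a hand-rolled scoring function.

```python
# Toy sketch of curating synthetic conversations for finetuning.
# The quality heuristic is made up for illustration only.

def quality_score(conversation: list[str]) -> float:
    """Crude heuristic: favor substantive, non-repetitive turns."""
    if not conversation:
        return 0.0
    lengths = [len(turn.split()) for turn in conversation]
    avg_len = sum(lengths) / len(lengths)
    unique_ratio = len(set(conversation)) / len(conversation)
    # Longer average turns (capped) times fraction of unique turns.
    return min(avg_len / 20.0, 1.0) * unique_ratio

def curate(conversations: list[list[str]],
           threshold: float = 0.5) -> list[list[str]]:
    """Keep only conversations scoring above the threshold."""
    return [c for c in conversations if quality_score(c) >= threshold]

synthetic = [
    ["ok", "ok", "ok"],  # repetitive junk, gets filtered out
    ["Explain how TCP handshakes work.",
     "A TCP connection starts with a three-way handshake: the client "
     "sends SYN, the server replies SYN-ACK, and the client sends ACK. "
     "This synchronizes sequence numbers before any data is exchanged."],
]
kept = curate(synthetic)  # only the substantive conversation survives
```

The point isn't the specific heuristic; it's that curation turns a firehose of mostly-bad generations into a smaller, usable training set, which is exactly what the "monkeys on typewriters" framing suggests.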
> They probably just use publicly-available resources like The Pile
I'd be very surprised if the big orgs don't have in-house efforts that far exceed The Pile. Hell, we know Google paid Reddit a pile of money for data, and other orgs are also willing to pay.
GPT-Neo and Llama were both trained on The Pile, and both of those were fairly influential releases. That's not to say they don't also use other resources, but I see no reason not to use The Pile; it's enormous.
It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.
Yeah, I hadn't thought about their abandoned effort to scan every book and archived newspaper in the world in a while, but I bet they're regretting now that they didn't finish. A non-trivial amount of that physical media has been tossed or degraded by underfunded libraries since then. And it's more valuable to them now than it ever was.
The problem with AI-generated content is not necessarily that it's bad; rather, it's that it contains no novel information. To learn something, you must not already know it. If it's AI-generated, the AI already knows it.
We might also say the same thing about spelling and grammar checkers. The difference will be in the quality of oversight of the tool. The "AI generated drivel" has minimum oversight.
Example: I have a huge number of perplexity.ai search/research threads, but the ones I share with my colleagues are a product of selection bias. Some of my threads are quite useless, much like a web search that was a dud. Those do not get shared.
Likewise, if I use an LLM to draft passages or even act as something like an overgrown thesaurus, I do find I have to make large changes. But some of the material stays intact. Is it AI, or not AI? It's a bit of both. Sometimes my editing is heavy-handed, other times less so, but in all cases I checked the output.
You are assuming that you and AI are the same sort of thing.
I do not think we are at that point yet. In the meantime, the idea that we might get to intelligence by feeding in more data might get choked out by poisoned data.
I have a suspicion that there's a bit more to it than just more data though.