Hacker News

> I am yet to see any reasoned argument for why it is far more difficult and will take far longer.

Language models specifically are trained on data, and have historically been improved by increasing the size of the model (its parameter count) and the amount and/or quality of training data.
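To make the size-and-data relationship concrete, here is a minimal sketch of a Chinchilla-style scaling law, loss = E + A/N^alpha + B/D^beta. The constants are the approximate fits reported by Hoffmann et al. (2022); treat them as ballpark illustration, not a prediction for any particular model.

```python
# Illustrative Chinchilla-style scaling law: loss = E + A/N^alpha + B/D^beta.
# Constants are the approximate published fits; they are assumptions here,
# used only to show the shape of the curve.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params, n_tokens):
    # Irreducible loss E, plus terms that shrink with model size and data.
    return E + A / n_params**alpha + B / n_tokens**beta

N = 70e9  # fix model size at 70B parameters
for D in (1e12, 1e13, 1e14):  # 1T, 10T, 100T training tokens
    print(f"{D:.0e} tokens -> loss {loss(N, D):.4f}")
```

Because the data term is a power law, each 10x increase in tokens buys a smaller absolute loss reduction than the last, which is why running out of fresh text bites so hard.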

We are basically out of new, non-synthetic text to train models on, and it is extremely hard to come up with a novel architecture that outperforms transformers.

Those are some simple reasons why it will be far more difficult to improve general language models.

There are also papers showing that training models on synthetic data causes “model collapse” and greatly reduces output quality by magnifying errors already present in the model, so it’s not a problem we can easily sidestep.
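The collapse dynamic can be shown with a toy model: repeatedly fit a distribution to samples drawn from the previous generation's fit. This hypothetical Gaussian setup is far simpler than any real LLM training loop, but it captures the mechanism the papers describe, where each refit loses tail information and errors compound.

```python
# Toy "model collapse": each generation trains (fits a Gaussian) on
# synthetic samples drawn from the previous generation's model.
# The MLE variance estimate is biased low, so variance decays over
# generations and the distribution's tails disappear.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0  # generation-0 "real data" distribution
n = 50                # samples per generation (small n = fast collapse)

for generation in range(1000):
    samples = rng.normal(mu, sigma, size=n)    # generate "synthetic data"
    mu, sigma = samples.mean(), samples.std()  # refit the "model" on it

print(f"std after 1000 generations: {sigma:.6f}")
```

After many generations the fitted standard deviation falls far below the original 1.0, even though every individual refit looks locally reasonable.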

It’s an easy mistake to see something like ChatGPT go from not existing to suddenly existing and assume a major breakthrough happened, but behind the scenes there were roughly 50 years of R&D leading up to it. It’s not as if a single breakthrough opened the gates.

A general intelligence for CS is like the elixir of life for medicine.



>We are basically out of new, non-synthetic text to train models

This is not even remotely true.

There is an astronomical amount of data siloed by publishers, professional journals etc. that is yet to be tapped.

OpenAI is making inroads by making deals with these content owners for access to all that juicy data.


Even assuming there is a ton of data that companies are just now getting access to, the logarithmic curve of LLM improvements is clearly visible (granted, our LLM evaluation frameworks are not very good).


>>>There is an astronomical amount of data siloed by publishers, professional journals etc. that is yet to be tapped.

You seem to think these models haven't already been trained on pirated versions of this content, for some reason.


Yep, Books3 is the dataset LLaMA was famously trained on before it was taken down.

That’s not even considering AI crawlers, or all the copyrighted text on archive.org.



