Hacker News

> I am yet to see any reasoned argument for why it is far more difficult and will take far longer.

Language models specifically are trained on data, and have historically been improved by increasing the size of the model (its parameter count) and the amount and/or quality of training data.
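To make the size-and-data relationship concrete, here is a minimal sketch of a Chinchilla-style scaling law, loss = E + A/N^alpha + B/D^beta. The constants are the approximate fits reported by Hoffmann et al. (2022); treat them as ballpark illustration, not a prediction for any particular model.

```python
# Illustrative Chinchilla-style scaling law: loss = E + A/N^alpha + B/D^beta.
# Constants are the approximate published fits; they are assumptions here,
# used only to show the shape of the curve.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params, n_tokens):
    # Irreducible loss E, plus terms that shrink with model size and data.
    return E + A / n_params**alpha + B / n_tokens**beta

N = 70e9  # fix model size at 70B parameters
for D in (1e12, 1e13, 1e14):  # 1T, 10T, 100T training tokens
    print(f"{D:.0e} tokens -> loss {loss(N, D):.4f}")
```

Because the data term is a power law, each 10x increase in tokens buys a smaller absolute loss reduction than the last, which is why running out of fresh text bites so hard.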

We are basically out of new, non-synthetic text to train models on, and it is extremely hard to come up with a novel architecture that outperforms transformers.

Those are some simple reasons why it will be far more difficult to improve general language models.

There are also papers showing that training models on synthetic data causes “model collapse” and greatly reduces output quality by magnifying errors already present in the model, so it’s not a problem we can easily sidestep.
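The collapse dynamic can be shown with a toy model: repeatedly fit a distribution to samples drawn from the previous generation's fit. This hypothetical Gaussian setup is far simpler than any real LLM training loop, but it captures the mechanism the papers describe, where each refit loses tail information and errors compound.

```python
# Toy "model collapse": each generation trains (fits a Gaussian) on
# synthetic samples drawn from the previous generation's model.
# The MLE variance estimate is biased low, so variance decays over
# generations and the distribution's tails disappear.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0  # generation-0 "real data" distribution
n = 50                # samples per generation (small n = fast collapse)

for generation in range(1000):
    samples = rng.normal(mu, sigma, size=n)    # generate "synthetic data"
    mu, sigma = samples.mean(), samples.std()  # refit the "model" on it

print(f"std after 1000 generations: {sigma:.6f}")
```

After many generations the fitted standard deviation falls far below the original 1.0, even though every individual refit looks locally reasonable.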

It’s an easy mistake to see something like ChatGPT go from not existing to suddenly existing and assume a major breakthrough happened, but behind the scenes there were roughly 50 years of R&D leading up to it. It’s not as if a single breakthrough opened the gates.

A general intelligence for CS is like the elixir of life for medicine.



>We are basically out of new, non-synthetic text to train models

This is not even remotely true.

There is an astronomical amount of data siloed by publishers, professional journals etc. that is yet to be tapped.

OpenAI is making inroads by making deals with these content owners for access to all that juicy data.


Even assuming there is a ton of data that companies are just now getting access to, the logarithmic curve of LLM improvements is clearly visible (granted, our LLM evaluation frameworks are not very good).


>>>There is an astronomical amount of data siloed by publishers, professional journals etc. that is yet to be tapped.

You seem to think these models haven't already been trained on pirated versions of this content, for some reason.


Yep, Books3 is the dataset LLaMA was famously trained on before it was taken down.

That’s not even considering AI crawlers, or all the copyrighted text on archive.org.



