
No training data, no open source. Don't fall for the company PR.


As long as it works, who cares about the training data? Obviously they can't open it, for many reasons. Licensing is one of them.


I don't care if it's open source or not. I care when people call something open source when it isn't.

Do you care about binary blobs in the kernel? No. Are binary blobs in the kernel open source? No.

But it is tedious to go through the same discussion every 10 years, with a relentless industry that wants to dupe people.

If there wasn't a benefit in it for them, they would not call it open source.


For some reason people think of models as software, and assume 'open source' should mean something similar. There are fundamental differences: 1) Models aren't reproducible even given everything: data, hardware, methodology. 2) They aren't even verifiable, i.e. given a model and a dataset, it's impossible to say whether the model was trained on that data. 3) Except for toys, models are trained on copyrighted data. Some of it is private, like users' chats. 4) Besides data, there is a lot of human input after pretraining.

This means that, even given everything, you have two options: 1) train a similar model yourself, or 2) trust the model provider. In software you can get a script and run it, or get the code and compile it into exactly the same binaries.
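To make the software side of that comparison concrete, here is a minimal sketch of how a rebuilt binary is usually checked against a published one: hash both and compare. The byte strings are placeholders standing in for real build artifacts, and `is_reproduced` is a hypothetical helper name, not part of any build tool.

```python
import hashlib

def digest(artifact: bytes) -> str:
    """SHA-256 digest of a build artifact's raw bytes."""
    return hashlib.sha256(artifact).hexdigest()

def is_reproduced(published: bytes, rebuilt: bytes) -> bool:
    """True only if the rebuilt artifact is bit-for-bit identical."""
    return digest(published) == digest(rebuilt)

# Placeholder contents; in practice these would be read from the two files.
published = b"placeholder binary contents"
rebuilt_ok = bytes(published)          # an identical rebuild
rebuilt_bad = published + b"\x00"      # a rebuild that differs by one byte

print(is_reproduced(published, rebuilt_ok))   # True
print(is_reproduced(published, rebuilt_bad))  # False
```

There is no equivalent one-line check for model weights, which is the point being made above.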

Naturally, 'open source' has a different meaning here. Some are trying to monopolize it, as if they know the 'truth'. Others simply ignore it. Eventually we'll settle on something.


A decent training pipeline will be able to reproduce models with equivalent aggregate performance (as measured by the evaluation metrics), and with a high degree of similarity in behavior on specific inputs, though not identical behavior. It will not produce the exact same weights. But that is not a critical bar to reach, and many software builds also fail that bar.
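A toy illustration of that claim, as a sketch only: the same pipeline run with two different seeds on the same data. Everything here (the 2-D dataset, the logistic-regression "model", the seeds, the hyperparameters) is made up for the example and says nothing about how any real model is trained.

```python
import math
import random

def make_data(n, seed):
    """Toy 2-D points labeled by the sign of x0 + x1, with a small margin."""
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        x0, x1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if abs(x0 + x1) < 0.1:  # skip points too close to the boundary
            continue
        data.append(((x0, x1), 1 if x0 + x1 > 0 else 0))
    return data

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(data, seed, epochs=50, lr=0.5):
    """Logistic regression via SGD; the seed sets the init and shuffle order."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    b = rng.uniform(-1, 1)
    order = list(data)
    for _ in range(epochs):
        rng.shuffle(order)
        for (x0, x1), y in order:
            err = sigmoid(w[0] * x0 + w[1] * x1 + b) - y
            w[0] -= lr * err * x0
            w[1] -= lr * err * x1
            b -= lr * err
    return w, b

def accuracy(model, data):
    """Fraction of points whose predicted label matches the true label."""
    w, b = model
    hits = sum((w[0] * x0 + w[1] * x1 + b > 0) == (y == 1)
               for (x0, x1), y in data)
    return hits / len(data)

train_set = make_data(200, seed=0)
test_set = make_data(200, seed=1)

model_a = train(train_set, seed=123)  # same data, same code,
model_b = train(train_set, seed=456)  # different random seed

acc_a = accuracy(model_a, test_set)
acc_b = accuracy(model_b, test_set)
print("weights identical:", model_a == model_b)
print("accuracy a / b:", acc_a, acc_b)
```

The two runs should end up with different weights but very similar held-out accuracy, which is the sense of "reproducible" described above.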



