
The math underpinning an AI model exists independently of the hardware it's realized on. I can train a model on one GPU and someone else can replicate my results with a different GPU running different drivers, down to small numerical differences that hopefully have no major effect.
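The "small numerical differences" come from floating-point arithmetic not being associative: different hardware or drivers may reduce sums in a different order, changing low-order bits. A minimal sketch of why exact bitwise replication is the wrong target (the shapes, seed, and tolerances here are illustrative assumptions, not anything from the comment):

```python
import numpy as np

# Floating-point addition is not associative, so evaluating the same
# mathematical expression in a different order -- as a different GPU or
# driver might -- can change low-order bits while leaving the result
# equal within a tolerance.
rng = np.random.default_rng(42)
A = rng.standard_normal((64, 64)).astype(np.float32)
v = rng.standard_normal(64).astype(np.float32)

out_a = (A @ A) @ v  # one evaluation order
out_b = A @ (A @ v)  # mathematically identical, numerically not guaranteed

# Bit patterns may differ between the two orders...
print(np.array_equal(out_a, out_b))
# ...but the results agree within a sensible tolerance, which is the
# practical criterion for "replicating" a result across hardware.
print(np.allclose(out_a, out_b, rtol=1e-4, atol=1e-2))
```

This is why replication claims across GPUs are stated up to tolerance (`allclose`-style checks) rather than bit-for-bit equality.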

Data isn't fungible in the same way: I can't just replace one dataset with another and expect to replicate the results when the data generation and curation are the primary novel contribution of the research.

There's also a larger accountability picture: just as scientific papers that don't publish their data are inherently harder to check for statistical errors or outright fraud, open-weight closed-data models demand a lot of uncomfortable trust. How much contamination is there for the major AI benchmarks? How much copyrighted data was used? How can we be sure the training process was conducted as the authors describe, with no deviations from malfeasance or simple mistakes?


