How could they publish the terabytes of training data? A million RAR files?
Honestly, would that part even be useful? What I want to know is how they did the training, so I can repro it with my own training data, right?
I mean, isn't that the future? Somebody figures out how to do P2P distributed training, and then groups can crawl the web and train their own open-source models?