> As an academic mathematician who has spent my entire life collaborating openly on research problems and sharing my ideas with other people, it frustrates me that I am not even able to give you a coherent description of some basic facts about this dataset, for example, its size. However, there is a good reason for the secrecy. Language models train on large databases of knowledge, so the moment you make a database of maths questions public, the language models will train on it.
Well, yes and no. This is only true because we are talking about closed models from closed companies like so-called "OpenAI".
But if all models were truly open, then we could simply verify what they had been trained on, and make experiments with models that we could be sure had never seen the dataset.
Decades ago Microsoft (in the words of Ballmer and Gates) famously accused open source of being a "cancer" because of the cascading nature of the GPL.
But it's the opposite. In software, and in knowledge in general, the true disease is secrecy.
> But if all models were truly open, then we could simply verify what they had been trained on
How do you verify what a particular open model was trained on if you haven’t trained it yourself? Typically, for open models, you only get the architecture and the trained weights. How can you reliably verify what the model was trained on from this?
Even if they provide the training set (which is not typically the case), you still have to take their word for it—that’s not really "verification."
> Even if they provide the training set (which is not typically the case), you still have to take their word for it—that’s not really "verification."
If they've done it right, you can re-run the training and get the same weights. And maybe you could spot-check parts of it without running the full training (e.g. if there are glitch tokens in the weights, you'd look for where they came from in the training data, and if they weren't there at all that would be a red flag). Is it possible to release the wrong training set (or the wrong instructions) and hope you don't get caught? Sure, but demanding that it be published and available to check raises the bar and makes it much more risky to cheat.
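The two checks described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any lab's actual verification pipeline: hash the weights file so two independent training runs can be compared byte-for-byte, and grep the published corpus for a suspect (e.g. glitch) token whose absence would be a red flag.

```python
import hashlib

def file_sha256(path):
    """Hash a weights file so a re-trained checkpoint can be
    compared against the released one byte-for-byte."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def token_in_corpus(token, corpus_paths):
    """Spot-check: does a suspect token appear anywhere in the
    published training data? If the model knows it but the data
    doesn't contain it, something is off."""
    for path in corpus_paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                if token in line:
                    return True
    return False
```

The hash comparison only works if training is fully deterministic (fixed seeds, fixed data order, deterministic kernels), which is hard at scale; the spot-check works regardless, which is why it is the more practical of the two.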
The OP said "truly open", not "open model" or any of the other BS out there. If you are truly open, you share the training corpora as well, or at least a comprehensive description of what they are and where to get them.
Lots of AI researchers have shown that you can both credit and discredit "open models" when you are given the dataset and the training steps.
Many lauded papers drew the ire of Reddit's ML community or Twitter when people couldn't reproduce the model or the results.
If you are given the training set, the weights, the steps required, and enough compute, you can do it.
Having enough compute, and getting people to actually release the steps, are the main impediments.
For my research I always release all of my code, the order of the execution steps, and of course the training set. I also report confidence intervals based on my runs, so people can reproduce them and see whether they get similar intervals.