> As an academic mathematician who has spent my entire life collaborating openly on research problems and sharing my ideas with other people, it frustrates me that I am not even able to give you a coherent description of some basic facts about this dataset, for example, its size. However, there is a good reason for the secrecy. Language models train on large databases of knowledge, so the moment you make a database of maths questions public, the language models will train on it.
Well, yes and no. This is only true because we are talking about closed models from closed companies like so-called "OpenAI".
But if all models were truly open, then we could simply verify what they had been trained on, and make experiments with models that we could be sure had never seen the dataset.
Decades ago Microsoft (in the words of Ballmer and Gates) famously accused open source of being a "cancer" because of the cascading nature of the GPL.
But it's the opposite. In software, and in knowledge in general, the true disease is secrecy.
> But if all models were truly open, then we could simply verify what they had been trained on
How do you verify what a particular open model was trained on if you haven’t trained it yourself? Typically, for open models, you only get the architecture and the trained weights. How can you reliably verify what the model was trained on from this?
Even if they provide the training set (which is not typically the case), you still have to take their word for it—that’s not really "verification."
> Even if they provide the training set (which is not typically the case), you still have to take their word for it—that’s not really "verification."
If they've done it right, you can re-run the training and get the same weights. And maybe you could spot-check parts of it without running the full training (e.g. if there are glitch tokens in the weights, you'd look for where they came from in the training data, and if they weren't there at all that would be a red flag). Is it possible to release the wrong training set (or the wrong instructions) and hope you don't get caught? Sure, but demanding that it be published and available to check raises the bar and makes it much more risky to cheat.
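The two checks described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any lab's actual verification pipeline: hash the weights file so two independent training runs can be compared byte-for-byte, and grep the published corpus for a suspect (e.g. glitch) token whose absence would be a red flag.

```python
import hashlib

def file_sha256(path):
    """Hash a weights file so a re-trained checkpoint can be
    compared against the released one byte-for-byte."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def token_in_corpus(token, corpus_paths):
    """Spot-check: does a suspect token appear anywhere in the
    published training data? If the model knows it but the data
    doesn't contain it, something is off."""
    for path in corpus_paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                if token in line:
                    return True
    return False
```

The hash comparison only works if training is fully deterministic (fixed seeds, fixed data order, deterministic kernels), which is hard at scale; the spot-check works regardless, which is why it is the more practical of the two.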
The OP said "truly open", not "open model" or any of the other BS out there. If you are truly open, you share the training corpora as well, or at least a comprehensive description of what they are and where to get them.
Lots of AI researchers have shown that you can both credit and discredit "open models" when you are given the dataset and the training steps.
Many lauded papers drew the ire of Reddit's ML community or Twitter when people couldn't reproduce the model or the results.
If you are given the training set, the weights, the steps required, and enough compute, you can do it.
Having enough compute, and getting people to actually release the steps, are the main impediments.
For my research I always release all of my code, the order of the execution steps, and of course the training set. I also report confidence intervals based on my runs, so people can reproduce them and see whether they get similar intervals.