I’ve been wondering how Creative Commons applies in ‘big data’-ish use cases. Can a dataset distributed under CC BY-SA be analyzed, or possibly used as part of the training input for an ML model? What if a product is built on top of a model that learned from a CC-licensed dataset? Products are rarely distributed under CC; how far do ShareAlike & Attribution reach, by letter and by spirit?
Should there be (or does there exist) a type of license for data, different from the ones typically used for software source code (MIT, GPL) and the ones typically used for creative work (CC), that encourages innovation but gives something back to the dataset's creator or maintainer?
Those are reasonable questions. At work, we release lots of data under the OGL (Open Government Licence), which is CC-compatible.
For my personal stuff, if you'd like a different license, I'm happy for you to pay me for a more restrictive one. But if you build an ML model using my open data, I expect that model to be released under a similar licence.
Didn’t know about OGL; it does look suitable for this purpose.
To (partially) answer myself: contrary to what I implied, CC-BY does cover this base if (for example) the creator of the dataset accepts a note in the product’s “About” documentation as sufficient attribution.