> Microsoft is selling AI services based on training data they don't own and didn't acquire rights to, nobody writing the licenses of the code it's using [...]
I agree, but it still uses resources and those don't come for free (hardware, electricity, cooling, maintenance staff, housing, etc.)
It's really difficult to assign monetary value to all these aspects and weighing them against each other in a fair manner.
The consent issue is a difficult legal aspect as well. Github's ToS Section D.4 clearly states they retain the rights to process your content and
parse it into a search index or otherwise analyze it on our servers
It can be argued that using the content to train an AI model falls under "analysing it on our servers". Also
It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service
If CoPilot is part of their service, it's in their right to distribute the content, e.g. by means of CoPilot as a processed part of the model.
GPL and other licences don't place restriction on the usage as training data. It's currently a very murky legal grey area. Licences need to adapt to this new form of usage pattern.
I think copilot is pretty clearly copyright violation and in violation of licenses of "public" code. People uploading code to github are bound to the licenses just the same as anyone, unless you're the legitimate owner of all of the copyright in a codebase, you can't give change the license provisions by accepting a ToS.
I don't think it's really that murky, these models contain and have been shown to reproduce copyrighted code with the right prompting, it's not a grey area it's just obfuscated theft.
what's the difference between allowing you to search github and find a code snippet, and having a fancy autocomplete system search github and find a code snippet for you?
seems to me anyone agreeing to the ToS should expect their code to show up on other peoples screens as search results
really the question is a matter of degree, is copying your nested for-loop iterating through a row oriented matrix really a unique piece of code protected by copyright? Or does the copyright apply to the file you've written as a whole, leaving room for me to accidentally use words in the same order? clearly there is a tipping point between writing code that looks like yours and using the code you've written outside the terms of your license, we will have to wait for courts to decide where that line is for all ML, not just co-pilot
When I look at github code, it's only stored in my brain and personal notes, not packaged into a product as a trained ML model.
When I reproduce code based on something I looked up, I do indeed have to be careful not to explicitly copy sizable chunks, somethings are obvious and the only way to do things, but not everything.
What users and copyright holders expect from humans does not automatically apply to marginally similar situations with computers and ML applications. For example: if I'm walking down the street I don't mind at all if someone recognizes me or a stranger remembers seeing me later, I'm actually rather bothered if someone (or the state) is running facial recognition software and recording every time it see me or anyone else walking down the street.
I agree, but it still uses resources and those don't come for free (hardware, electricity, cooling, maintenance staff, housing, etc.)
It's really difficult to assign monetary value to all these aspects and weighing them against each other in a fair manner.
The consent issue is a difficult legal aspect as well. Github's ToS Section D.4 clearly states they retain the rights to process your content and
It can be argued that using the content to train an AI model falls under "analysing it on our servers". Also If CoPilot is part of their service, it's in their right to distribute the content, e.g. by means of CoPilot as a processed part of the model.GPL and other licences don't place restriction on the usage as training data. It's currently a very murky legal grey area. Licences need to adapt to this new form of usage pattern.