This doesn’t at all address the primary issue, which is one of licensing.
Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”
If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.
It does address it, although not that clearly. This happens all the time with news media. They will post a picture and say they got permission from X person, but X person actually didn't even own the copyright in the first place. That doesn't make any of it okay, but it does mean that the organization has legal cover in this case and the worst that will happen is that they'll have to take the content down. In GitHub's case, if that same code snippet is found in other repos that have different licensing, then it's difficult to really prove who owns the copyright; it's a legal issue between the original copyright owner and the person who redistributed the work. They can submit a DMCA takedown notice for the other repos. But it's pretty unlikely GitHub gets into any legal trouble as long as they can prove that they got the snippet from someone else.
That code seems to appear in thousands of repositories on GitHub, I’m sure some of them haven’t copied the license.
The vast majority of people who would use a matrix transform function they got from code completion (or from a GitHub or Stack Overflow search) probably don’t care what the license is. They’ll just paste in the code. To many developers, publicly viewable code is in the public domain. Copilot just shortens the search by a few seconds.
Microsoft should try to do better (I’m not sure how), but the sad fact is that trying to enforce a license on a code fragment is like dropping dollar bills on the sidewalk with a note pinned to them saying “do not buy candy with this dollar”.
What’s the most GitHub could reasonably be expected to do? If multiple licenses are found for the same code, then maybe it should be flagged for review, or the most restrictive license applied.
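To make that suggestion concrete, here is a rough Python sketch of what "flag for review or apply the most restrictive license" could look like. Everything in it is an assumption for illustration: the `RESTRICTIVENESS` ranking is made up (and is not legal advice), `fingerprint` and `review_licenses` are hypothetical names, and matching is done on a crude whitespace-insensitive hash rather than anything GitHub is known to use.

```python
import hashlib
from collections import defaultdict

# Hypothetical ranking, higher means more restrictive. Purely illustrative.
RESTRICTIVENESS = {"MIT": 1, "Apache-2.0": 2, "LGPL-3.0": 3, "GPL-3.0": 4, "UNKNOWN": 5}


def fingerprint(code: str) -> str:
    """Hash a snippet after dropping blank lines and leading/trailing whitespace."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def review_licenses(snippets):
    """Take (repo, license, code) tuples and report snippets seen under several licenses."""
    seen = defaultdict(set)
    for repo, license_id, code in snippets:
        seen[fingerprint(code)].add((repo, license_id))

    flagged = {}
    for digest, sources in seen.items():
        licenses = {lic for _, lic in sources}
        if len(licenses) > 1:
            # Licenses missing from the table are treated as the most restrictive.
            strictest = max(
                licenses,
                key=lambda lic: RESTRICTIVENESS.get(lic, RESTRICTIVENESS["UNKNOWN"]),
            )
            flagged[digest] = {
                "licenses": sorted(licenses),
                "assume": strictest,
                "repos": sorted(repo for repo, _ in sources),
            }
    return flagged
```

This only tells you that a conflict exists and which license would be the conservative assumption. It says nothing about who actually holds the copyright, so a flagged snippet would still need a human to look at it.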
Do we want that though? I personally believe copyright as implemented today is harmful. The fact that code largely is able to dodge this could be seen as arguing we should be laxer with copyright, rather than arguing for strict enforcement of copyright on code.
That would only work if the original was uploaded to GitHub before the copies. Like, somebody could copy from GitLab or BitBucket. And git histories don’t always help if they’re not copied over.
But copyright law doesn't really care about how you prevent infringement, just that it doesn't happen. Isn't it up to Github to come up with a way to do it, or otherwise not do it at all?
GitHub just needs to show they have taken reasonable precautions, and if a conflict is identified, that they remediate it without undue delay.
It’s not a binary choice between doing it perfectly and not doing it at all. The law looks at intent and so doesn’t punish mistakes or errors so long as you aren’t being malicious or reckless or negligent.
> No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider
So the act of hosting copyrighted content is not actually a copyright violation for Github. They're not obligated to preemptively determine who the original copyright owner of some piece of code is, as they're not the judge of that in the first place. Even if you complain that someone stole your code, how is Github supposed to know who's lying? Copyright is a legal issue between the copyright holder and the copyright infringer. So the only thing Github is required to do is to respond to DMCA takedown notices.
Yes. GitHub can get away with "oh well, we're all learning" because if the code is violating copyright, it's the user who is infringing directly by publishing it, not GitHub via Copilot. Either the user would have to bring a case against GitHub demonstrating liability (good luck) or the copyright holder would have to bring a case against GitHub demonstrating copyright violation (again, good luck). Otherwise this is entirely between the copyright holder and the Copilot user, legally speaking.
Of course, if someone does manage to set a precedent that including copyrighted works in AI training data without an explicit license to do so is infringement, GitHub Copilot would be screwed and at best have to start over with a blank slate if they can't be grandfathered. But this would affect almost all products based on the recent advancements in AI, and they're backed by fairly large companies (after all, GitHub is owned by Microsoft, a lot of the other AI stuff traces back to Alphabet, and there are a lot of startups funded by huge and influential VC companies). Given the US's history of business-friendly legislation, I doubt we'll see copyright laws being enforced against training data unless someone upsets Disney.
Do you think that as part of this Github discovered that essentially everyone was in violation of copyright? That copyright of material without public knowledge or review (which exists in music, but not most code), is basically unenforceable?
Then they decided to wade in and build a house of cards where the cards are everyone else’s code, just waiting for the grenade pin puller and we’ve potentially witnessed the moment?
That’s the only thing that makes sense to me here. They don’t care because opening the issue will bring down everyone else with them.
Yeah, so if a news agency publishes a picture without knowing where it came from, the originator can sue them for violating copyright.
There is no “I don’t know who owns the IP” defense: the image has a copyright, a person owns that copyright, and publishing the image without licensing or purchasing the copyright is a violation. The fine is something like $100k per offense for a business.
FWIW, this in consequence means you can't legally use Copilot without becoming liable for copyright violations, because it's essentially a black box and you have no insight into where the code it generates originated, and even if it isn't a 1-to-1 copy it might be a "derivative work".
This is why I'm gnashing my teeth whenever I hear companies being fine with their employees using Copilot for public-facing code. In terms of liability, this is like going back from package managers to copying code snippets off blogs and forum posts.
Why this restriction on public-facing code? Are you OK with Copilot being used for "private"/closed source code? I get that it would be less likely to be noticed if the code is not published, but (if I understand right) is even worse for license reasons.
I don't advocate people use Copilot for anything but hobby toy projects.
I have lower expectations of the rigor with which companies police their internal codebases, though. Seeing Copilot banned for internal use too is a pleasant surprise. Companies tend to be a lot more "liberal" in what kind of legal liabilities they accept for their internal tooling in my experience.
Turn the parties in this argument around and see if you think it still holds.
J. Random Hacker acquires and uses a copy of some of GitHub's, or Microsoft's source. When sued, the defense says that the code was not taken directly from GH/MS, just copied from a newsgroup where it had been posted. Does this get J. off the hook?
Was J using automated methods based on false claims of ownership by the newsgroup posters, with no direct knowledge of the violation? If so J should not be punished.
I may be misinformed but my understanding of copyright is that it protects the 'expression' of something (like an algorithm or recipe) so someone can rewrite a copyrighted chunk of code into another language and be free of the original copyright, while also able to assert their own copyright on their new expression.
If that is true then one way to get around copyright restrictions on existing code is to create a new language.
Fascinating idea. Copilot could do the translations internally and also work towards widening the pool of suggestions to all languages instead of the individual language a user is using (but then again, they might be writing in the "new" language already).
> Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”
If you do something, it's ultimately you who has to make sure that it is not against the law. "I didn't know" is never a good defense. If you pay with counterfeit cash, it is you who will be arrested, even if you didn't know it was counterfeit. If you use code from somewhere else (no matter if it's by copy/pasting or by using Copilot), it is you who has to make certain that it doesn't infringe on any copyright.
Just because a tool can (accidentally) make you break the law doesn't mean the tool is to blame (cf. BitTorrent, Tor, Kali Linux, ...)
BitTorrent (and, to a larger degree, eDonkey) did and still do that. Who tells you that what you're downloading is indeed what you think it is? You can click on a magnet link that claims to download a Debian ISO just to find out later that it's something else entirely. To make matters worse, BitTorrent even uploads to potentially hundreds of other clients while you're still downloading, so while downloading something might not be illegal in your jurisdiction, uploading/distributing most certainly is, and you can get into lots of trouble for uploading (parts of) a copyrighted work to hundreds or thousands of other users.
> You can click on a magnet link that claims to download a Debian ISO just to find out later that it's something else entirely
This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo.
On the web that sort of thing is actually common, but BitTorrent? I have never downloaded a torrent to find it was something other than what I expected. Never have I seen a movie masquerading as a Debian ISO. That's nothing more than a joke people use to make light of their (deliberate) copyright infringement.
Furthermore, is there even any BitTorrent client that will recommend copyrighted content to you, rather than merely download what you tell it to? I've not seen one. Search engines, in my browser, do that sort of recommendation, but BitTorrent clients do what I tell them to. Including seeding to others, which is optional but recommended for obvious reasons.
If you actually care, then simply configure your client to leech. Every client I've ever used or heard of supports this.
But more to the point, getting tricked into seeding a copyrighted movie by a torrent masquerading as a Debian ISO isn't something that actually happens. That's absurd FUD.
> "This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo."
No-one cares whether you download an open-sourced photo of a cat or a copyrighted photo of a dog.
BitTorrent is certainly not a good example to follow, but I do think that copilot is more wrong.
They should definitely include disclaimers and make seeding opt-in (though I don't know how safe you are legally when you download a Lion King copy labeled Debian.iso). That said, they don't have the information necessary to tell whether what you're doing is legal or not.
Copilot _has_ that information. The model spits out code that it read. They could disallow publishing or commercially using code generated by it while they're sorting it out, but they made the decision not to.
AI is hard, but the model is clearly handing out literal copies of GPL code. Github knows this and they still don't tell you about it when you click install.
It doesn't matter if the information is there or not, since an algorithm cannot commit a copyright violation. There is at least one human involved, and the human is the one who is responsible.
A car has all the information that it's going faster than the speed limit, or that it just ran a red light. But in the end it's the driver who is responsible. It's not the tool (car, Copilot) that commits the illegal act, it's the user using that tool
So your point is that removing the speedometer from your car and then claiming "I didn't know I was driving too fast!" will make it somehow not your responsibility?
It is still your responsibility to know and obey the traffic laws, the same as it is your responsibility to obey the copyright laws....
Indeed, and people always (rightfully) complain loudly against the outlawing of these tools, and in many cases they have been successful. Yet here it's the opposite for some weird reason.
I don't know whether the "Numerical Recipes" publisher actively defends their copyright of the code in the books but it would be an interesting test case.
> Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”
I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute combined with your own mental routine of "i've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem. Maybe the solution is Copilot accompanying each generation with a URL containing all of the run's weights and traces, so that a court can unlock the URL by court order to investigate copyright infringement.
> If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.
In general you're not liable for this. While you still will likely have to go to court with the original copyright holder's work, all the damages you pay can be attributed to whoever defrauded or misrepresented ownership over that work. (I am not your lawyer)
> > Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”
> I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute combined with your own mental routine of "i've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem.
Aren't you moving the goalposts? This is not 3 lines, but instead a 1-to-1 reproduction of a complex function that definitely has enough invention height to be copyrightable.
With high probability, what's happened here is that this code is an important piece of code infrastructure, in that it's copied into a fair number of places. Which means humans are copying it without attribution, or downstream of someone who did, while the relevant license is not propagated anywhere near as reliably.
It doesn't change the licensing issue, but it does mean people are already copying and using copyrighted code without respecting the original license, with no AI involved.
There should be a way to reverse engineer code LLMs to see which core bits of memorized code they build on. Another complex option is a combination of provenance tracking and semantic hashing on all functions in code used for training. Another option (non-technical) is a rethinking of IP.
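As a rough illustration of the "provenance tracking plus semantic hashing" option, here is a minimal Python sketch. It is only an assumption about how such a thing might be built: `semantic_hash` and `ProvenanceIndex` are hypothetical names, the "semantic" part is just renaming locally bound identifiers in the AST so that formatting, comments, and variable names don't affect the hash, and it only handles Python source.

```python
import ast
import hashlib
from collections import defaultdict


class _Canonicalize(ast.NodeTransformer):
    """Rename functions, arguments, and locally assigned variables to placeholders."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = "f"
        self.generic_visit(node)  # arguments are visited before the body
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        # Leave builtins and globals (len, sum, ...) alone; rename only bound names.
        if isinstance(node.ctx, ast.Store) or node.id in self.names:
            node.id = self._alias(node.id)
        return node


def semantic_hash(source: str) -> str:
    """Hash the canonicalized AST, ignoring layout, comments, and local names."""
    tree = _Canonicalize().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


class ProvenanceIndex:
    """Map semantic hashes of training functions to (repo, license) records."""

    def __init__(self):
        self._sources = defaultdict(list)

    def add(self, source, repo, license_id):
        self._sources[semantic_hash(source)].append((repo, license_id))

    def lookup(self, generated):
        return list(self._sources.get(semantic_hash(generated), []))
```

With something like this, a generated function could be checked against the index before it is suggested. A renamed and reformatted copy would still match, but anything restructured more heavily would slip through, so this is a lower bound on what gets caught rather than a real derivative-work test.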
>With high probability, what's happened here is that this code is an important piece of code infrastructure, in that it's copied into a fair number of places. Which means humans are copying it without attribution, or downstream of someone who did, while the relevant license is not propagated anywhere near as reliably.
The original poster said it was in a private repository.
>It doesn't change the licensing issue, but it does mean people are already copying and using copyrighted code without respecting the original license, with no AI involved.
I don't get the argument. Many people are copying/pirating MS windows/MS office. What do you think MS would say to a company they caught with unlicensed copies and they used the excuse "the PCs came preinstalled with Windows and we didn't check if there was a valid license"?
The training set for C was algol and a bunch of other languages.
AI could be used to create languages based on design criteria and constraints like C was, but it does bring up the question of why one of the constraints should be character encodings from human languages if the final generated language would never be used by humans...
I mainly think it's funny watching all of these Randian objectivists reusing every excuse used by every craftsman that was excised from working life... machines need a machinist, they don't have souls or creativity, etc.
Industry always saw open source as a way to cut cost. ML trained from open source has the capability to eliminate a giant sink of labor cost. They will use it to do so. Then they will use all of the arguments that people have parroted on this site for years to excuse it.
I'm a pessimist about the outcomes of this and other trends along with any potential responses to them.
The problem here is that Copilot exploits a loophole that allows it to produce derivative works without a license. Copilot is not sophisticated enough to structure source code generally; it is overtrained. What is an overtrained neural network but memcpy?
The problem isn't even that this technology will eventually replace programmers: the problem is that it produces parts of the training set VERBATIM, sans copyright.
No, I am pretty optimistic that we will quickly come to a solution when we start using this to void all Microsoft/GitHub copyright.