I am not a lawyer, but I am capable of summarizing the thoughts of lawyers, so my take is that in general, fair use allows AI to be trained on copyrighted material, and humans who use this AI are not responsible for minor copyright infringement that happens accidentally as a result. However, this has not been tested in court in detail, so the consensus could change, and if you were extremely risk-averse you might want to avoid Copilot.
A key quote from the second link:
Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
Personally, I think law should allow Copilot. As a human, I am allowed to read copyrighted code and learn from it. An AI should be allowed to do the same thing. And nobody cares if my ten-line "how to invert a binary tree" snippet is the same as someone else's. Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
> As a human, I am allowed to read copyrighted code and learn from it.
Of course not. Reading some copyrighted code can have you entirely excluded from some jobs - you can't become a Wine contributor if it can be shown you ever read Windows source code, and most likely vice versa.
Likewise, you can't ever write GPL VST 2 audio plug-ins if you ever had access to the official Steinberg VST2 SDK. Etc etc...
Did people forget why black-box reverse engineering of software ever came to be?
> Of course not. Reading some copyrighted code can have you entirely excluded from some jobs
That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.
Those projects could hire people familiar with competitor code and assign them to competing projects if they wanted. The contributors could, in theory, write new code without using proprietary knowledge from their other companies. In practice, that's actually really difficult to do and even more difficult to prove in court, so companies choose the safe option and avoid hiring anyone with that knowledge altogether.
Now the question is whether or not GitHub's AI can be argued to have proprietary knowledge contained within. If your goal is to avoid any possibility that any court could argue that GitHub copilot funneled proprietary code (accessible to GitHub copilot) into your project, then you'd want to forbid contributors from using CoPilot.
In this case, though, we have a machine learning model that is trained on some code and is not merely learning abstract concepts to apply generally in different domains; instead, it can use that knowledge to produce code that looks pretty much the same as the training material, given a context that fits the training material.
If humans did that, it would be hard to argue they didn't outright copy the source.
When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?
If yes, why doesn't parsing the source into an AST and then rendering it back also insulate you from having to abide by the copyright?
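To make that thought experiment concrete, here is a minimal Python sketch (assuming CPython 3.9+, where ast.unparse exists) of exactly that round trip: parse a snippet into an AST and render it back. The snippet is a made-up stand-in for someone else's code; the point is that the output is essentially the original minus comments and formatting, which is why a round trip through a syntax tree obviously shouldn't launder the copyright.

    import ast
    import textwrap

    # A stand-in for someone else's copyrighted snippet (hypothetical example).
    original = textwrap.dedent("""\
        def invert(node):
            if node is None:
                return None
            node.left, node.right = invert(node.right), invert(node.left)
            return node
        """)

    tree = ast.parse(original)        # source -> abstract syntax tree
    regenerated = ast.unparse(tree)   # syntax tree -> source again (Python 3.9+)

    # Prints essentially the same code; only formatting and comments are lost.
    print(regenerated)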
>When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?
You've hit the nail on the head here. If this is okay, then neural nets are simply machines for laundering IP. We don't worry about people memorizing proprietary source code and "accidentally" using it because it's virtually impossible for a human to do that without realizing it. But it's trivial for a neural net to do it, so comparisons to humans applying their knowledge are flawed.
That's a really good observation. Perhaps it highlights an essential difference between two modes of thought - a fuzzy, intuitive, statistical mode based on previously seen examples, and a reasoned, analytical calculating mode which depends on a precise model of the system. Plausibly, the landscape of valid musical compositions is more continuous than the landscape of valid source code, and therefore more amenable to fuzzy, example-based generation; it's entirely possible to blend two songs and make a third song. Such an activity is nonsensical with source code, and so humans don't even try. We probably do apply that sort of learning to short snippets (idioms), but source code diverges too rapidly for it to be useful beyond that horizon.
This is not such a big problem in reality because the output of Copilot can be filtered to exclude snippets too similar to the training data, or to any corpus of code you want to avoid. It's much easier to guarantee clean output than to train the model in the first place.
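For what it's worth, here is a rough sketch of how such a filter could work. This is not how GitHub actually does it; it's just an illustration using token shingles and Jaccard overlap, with a made-up threshold and made-up variable names.

    import re

    def shingles(code, n=8):
        # Break code into overlapping n-token windows ("shingles").
        tokens = re.findall(r"\w+|[^\w\s]", code)
        return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

    def too_similar(suggestion, known_snippets, threshold=0.6):
        # Reject a suggestion that shares too many shingles with any known snippet.
        s = shingles(suggestion)
        for known in known_snippets:
            k = shingles(known)
            if s and k and len(s & k) / len(s | k) >= threshold:
                return True
        return False

    # Usage sketch: drop generated suggestions that overlap a blocklisted corpus.
    # clean = [s for s in model_outputs if not too_similar(s, gpl_corpus)]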
> then you'd want to forbid contributors from using CoPilot
I mean, if you used CoPilot on one computer, stared at it intensely for an hour, closed that computer, and then typed out the code on the other computer that you were contributing from, you technically didn't use it for the contribution; you just used CoPilot for your education.
Intellectual property is itself a flawed concept in many ways. It's like asking someone to do physics research but forbidding them from using anything that Einstein wrote.
Intellectual property itself is silly. How can a thought be the property of someone?
Secrecy is the solution if you don't want others to learn from you (like Coca-Cola does).
It's not a natural right; we supposedly do it to stimulate innovation by offering a reward, and in order to get things into the public domain -- obviously Disney (and the politicians that kowtowed to them) ruined that for the world.
Patents should have shrunk along with product lifecycles, and copyright should last a similar period; maybe 10-14 years.
It's not silly, it's an evolved and pragmatic solution to the question of how society can incentivize creative work. More or less every society has developed some notion of IP and there's little appetite in wider society to debate it - the idea of abolishing IP laws is deeply fringe and only really surfaces in forums like this one.
Does it have flaws and can it be improved upon? Sure. I think society underweights what improvements to the patent system in particular could do. But such ideas are so niche they are hardly even written down, let alone debated at large. Society has bigger issues on its mind.
Like any evolved system IP law encounters new challenges over time and will be expected to evolve again, which it will surely do. A simple fix for Copilot is surely to just exclude all non-Apache2/BSD/MIT licensed code. Although there might technically still be advertising clause related issues, in practice hardly anyone cares enough to go to court over that.
If you read the video with a view to reproducing it then you created a derivative work, ie copyright infringement.
If you just used it for inspiration, that's fine; if the way it was coded is a result of technical constraints, that's fine too; if the code is generic it's not distinctive enough to acquire copyright in the first place.
>That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.
and they made those decisions based on the need to be able to argue in court that code was not copied.
>then you'd want to forbid contributors from using CoPilot
Right, whether Copilot spits out a ten-line function verbatim is not really what will be the problem. The problem is that a human programmer still needs to run Copilot, and they will be the one shown in the codebase as the author of the code (they could of course add a comment saying 'I got this bit from Copilot', but that would be cumbersome and would hardly work as proof anyway). And I suppose the issue would cover not just proprietary code but any code with an incompatible license.
> >That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.
> and they made those decisions based on the need to be able to argue in court that code was not copied.
Yeah, but only to make it easier for them to argue it; the letter of the law doesn't require it. You could argue that "Sure, I read Windows source code once -- but that was years ago and I can't remember shit of it, so anything I wrote now is my own invention." That might be harder to get the court to accept as a fact, but it's not a prima facie legal impossibility.
>That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.
Okay, so it's not law, it's just a policy compelled by preceding legal judgements. Case law, perhaps.
In general, you're absolutely allowed to learn programming techniques from anywhere. You can contribute software almost anywhere even if you've read Windows source code. Re-using everything you've learned, in your own creative work, is part of fair use.
Your example is the very specific scenario where you're attempting to replicate an entire program, API, etc., to identical specifications. That's obviously not fair use. You're not dealing with little bits and pieces, you're dealing with an entire finished product.
> Your example is the very specific scenario where you're attempting to replicate an entire program, API, etc., to identical specifications. That's obviously not fair use. You're not dealing with little bits and pieces, you're dealing with an entire finished product.
No - Google's 9 lines of sorting algorithm (iirc) copied from Oracle's implementation were not considered fair use in the Google / Oracle debacle.
Likewise SCO claimed that 80 copied lines (in the entirety of the Linux source code) were a copyright violation, even if we never had a legal answer to this.
Nope, those lines were specifically excluded from the prior judgment - and the Supreme Court did not issue another judgment on them:
> With respect to Oracle’s claim for relief for copyright infringement, judgment is entered in favor of Google and against Oracle except as follows: the rangeCheck code in TimSort.java and ComparableTimSort.java, and the eight decompiled files (seven “Impl.java” files and one “ACL” file), as to which judgment for Oracle and against Google is entered in the amount of zero dollars (as per the parties’ stipulation).
The fair use ruling was about Google's API reimplementation.
It becomes a whole different case with a 1:1 copy of code.
And don't forget fair use works in the US, not necessarily in the rest of the world.
But I'm happy about all the new GPL programs created by Copilot
That Supreme Court ruling doesn't appear to address the claims of actual copied code (the rangeCheck function), only the more nebulous API copyright claims.
This is true, but there's also a murkier middle option. I used to work for a company that made a lot of money from its software patents but I was in a division that worked heavily in open-source code. We were forbidden to contribute to the high-value patented code because it was impossible to know whether we were "tainted" by knowledge of GPL code.
Same here. I worked at a NAS storage (NFS) vendor and this was a common practice. Could not look at server implementation in Linux kernel and open source NFS client team could not look at proprietary server code.
No you are not, guaranteed (I think, not a lawyer).
At least from a copyright point of view.
TL;DR: Being in the right and having an easy defense in a lawsuit are not the same thing.
BUT separating them makes defending any copyright or patent lawsuit against them much easier. It also prevents any employee from "copying GPL (or similar) code verbatim from memory"(1) (or, even worse, from the clipboard). Sure, the employee "should" not do it, but by separating them you can be more confident they don't, which in turn makes it easier to defend in court, especially wrt. "independent creation".
There is also patent law shenanigans.
(1): Which is what GitHub Copilot is sometimes doing IMHO.
This model doesn't learn and abstract: it just pattern matches and replicates; that's why it was shown exactly replicating regions of code--long enough to not be "de minimis" and recognizable enough to include the comments--that happen to be popular... which would be fine, as long as the license on said code were also being replicated. It just isn't reasonable to try to pretend Copilot--or GPT-3 in general--is some kind of general purpose AI worthy of being compared with the fair use rights of a human learning techniques: this is a machine learning model that likes to copy/paste not just tiny bits of code but entire functions out of other peoples' projects, and most of what makes it fancy is that it is good at adapting what it copies to the surrounding conditions.
Have you used Copilot? I have not, but I have trained a GPT2 model on open source projects (https://doesnotexist.codes/). It does not just pattern match and replicate. It can be cajoled into reproducing some memorized snippets, but this is not the norm; in my experience the vast majority of what it generates is novel. The exceptions are extremely popular snippets that are repeated many many times in the training data, like license boilerplate.
Perhaps Copilot behaves very differently from my own model, but I strongly suspect that the examples that have been going around twitter are outliers. Github's study agrees: https://docs.github.com/en/github/copilot/research-recitatio... (though of course this should be replicated independently).
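For anyone curious what "trained a GPT2 model on open source projects" involves in practice, here is a very rough sketch using the Hugging Face transformers and datasets libraries; the corpus path and hyperparameters are placeholders, and this is an illustration of the general recipe, not the commenter's actual setup.

    from datasets import load_dataset
    from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                              GPT2TokenizerFast, Trainer, TrainingArguments)

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Hypothetical corpus: one text file per source file scraped from open source repos.
    dataset = load_dataset("text", data_files={"train": "corpus/*.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM, no masking

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="gpt2-code", num_train_epochs=1,
                               per_device_train_batch_size=4),
        train_dataset=tokenized["train"],
        data_collator=collator,
    )
    trainer.train()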
So, to verify, your claim is that GPT-3, when trained on a corpus of human text, isn't merely stringing together a bunch of high-probability sequences of symbol constructs--which is how every article I have ever read on how it functions describes the technology--but is instead managing to build a model of the human world and of the mechanism of narration required to describe it, which it then uses to write new prose? That is a claim you must make in order to argue that GPT-3 works like a human engineer learning a model of computers, libraries, and engineering principles from which it can then write code, instead of merely using pattern recognition as I stated. As someone who spent years studying graduate linguistics and cognitive science (though admittedly 15-20 years ago, so I certainly haven't studied this model: I have only read about it occasionally in passing), I frankly think you are just trying to conflate levels of understanding in order to make GPT-3 sound more magical than it is :/.
What? I don't think I made any claim of the sort. I'm claiming that it does more than mere regurgitation and has done some amount of abstraction, not that it has human-level understanding. As an example, GPT-3 learned some arithmetic and can solve basic math problems not in its training set. This is beyond pattern matching and replication, IMO.
I'm not really sure why we should consider Copilot legally different from a fancy pen – if you use it to write infringing code then that's infringement by the user, not the pen. This leaves the practical question of how often it will do so, and my impression is that it's not often.
It's not really comparable to a pen, because a pen by itself doesn't copy someone else's code or written words. It's more like copying code from GitHub, or like writing a script that does that automatically. You have to be actively cautious that the material you are copying is not violating any copyrights. The problem is that Copilot has enough sophistication to, for example, change variable names and make it very hard to do content matching. What I can guarantee is that it won't be able to generate novel code from scratch that performs a particular function (source: I have a PhD in ML). This brute-force way of modeling computer programs (using a language model) is just not sophisticated enough to reason about and generate high-level concepts, at least today.
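To illustrate the renaming point: an exact-match or diff-based check is defeated by trivial renames, but normalizing identifiers first catches them. A small sketch using Python's tokenize module (illustrative only; real clone detectors are more sophisticated):

    import io
    import keyword
    import tokenize

    def fingerprint(code):
        # Replace every user-chosen identifier with a placeholder so that
        # renamed copies compare equal; drop comments and whitespace tokens.
        skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
        out = []
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.type in skip:
                continue
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                out.append("ID")
            else:
                out.append(tok.string)
        return " ".join(out)

    a = "def area(w, h):\n    return w * h\n"
    b = "def surface(x, y):\n    return x * y\n"   # same code, variables renamed

    print(fingerprint(a) == fingerprint(b))   # True: identical apart from names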
The argument I was responding to--made by the user crazygringo--was that GPT-3 trained on the Windows source code is fine to use nigh indiscriminately, because Copilot supposedly abstracts knowledge like a human engineer. I argued that it doesn't do that: GPT-3 is a pattern recognizer that not only in theory likes to memorize and regurgitate things, but has been shown to do so in practice. You then responded to my argument claiming that GPT-3 in fact... what? Are you actually defending crazygringo's argument or not? Note carefully that crazygringo explicitly stated that copying little bits and pieces of a project is supposedly fair use, continuing the--as far as I understand, incorrect--assertion by lacker (the person who started this thread) that copying someone's binary tree implementation would be fair use, as the two of them seem to believe you have to copy essentially an entire combined work (whatever that means to them) for something to be infringing. Honestly, it now seems like you skipped into the middle of a complex argument to make a pedantic point: either you agree that GPT-3 is like a human that is allowed to, as crazygringo insists, read and learn from anything and then use that knowledge in any way it sees fit, or you agree with me that GPT-3 is a fancy pattern recognizer that can and will generate copyright infringements if used to solve certain problems. Given your new statements about Copilot being a "fancy pen" that can in fact be used incorrectly--something crazygringo seems to claim isn't possible--you frankly sound like you agree with my arguments!!
I think a crucial distinction to be made here, and with most 'AI' technologies (and I suspect this isn't news to many people here) is that – yes – they are building abstractions. They are not simply regurgitating. But – no – those abstractions are not identical (and very often not remotely similar) to human abstractions.
That's the very reason why AI technologies can be useful in augmenting human intelligence; they see problems in a different light, can find alternate solutions, and generally just don't think like we do. There are many paths to a correct result and they needn't be isomorphic. Think of how a mathematical theorem may be proved in multiple ways, but the core logical implication of the proof within the larger context is still the same.
Statistical modelling doesn't imply that GPT-3 is merely regurgitating. There are regularities among different examples, i.e. abstractions, that can be learned to improve its ability to predict novel inputs. There is certainly a question of how much Copilot is just reproducing input it has seen, but simply noting that it's a statistical model doesn't prove the case that all it can do is regurgitate.
One way to look at these models is to say that they take raw input, convert it into a feature space, manipulate it, then output back as raw text. A nice example of this is neural style transfer, where the learnt features can distinguish content from style, so that the content can be remixed with a different style in feature space. I could certainly imagine evaluating the quality of those features on a scale spanning from rote-copying all the way up to human understanding, depending on the quality of the model.
Imagine for a second a model of the human brain that consists of three parts. 1) a vector of trillion inputs, 2) a black box, and 3) a vector of trillion outputs. At this level of abstraction, the human brain "pattern matches and replicates" just the same, except it is better at it.
Human brains are at least minimally recurrent, and are trained on data sets that are much wider and more complex than what we are handing GPT-3. I have done all of these standard thought experiments, and I developed and trained my own neural networks back before there were libraries that allowed people to "dabble" in machine learning: if you consider the implications of humans being able to execute Turing-complete thoughts, it should become obvious that the human brain isn't merely doing pattern-anything... it sometimes does, but you can't just conflate the two and then call it a day.
The human brain isn't Turing-complete as that would require infinite memory. I'm not saying that GPT-3 is even close, but it is in the same category. I tried playing chess against it. According to chess.com, move 10 was its first mistake, move 16 was its first blunder, and past move 20 it tried to make illegal moves. Try playing chess without a chessboard and not making an illegal move. It is difficult. Clearly it does understand chess enough not to make illegal moves as long as its working memory allows it to remember the game state.
Hmm... but a finite state machine with an infinite tape is Turing complete too. If you're allowed to write symbols out and read them back in, you've invalidated the "proof" that humans aren't just doing pattern matching.
How so? The page you link offers three definitions[1], and all of them require an infinite tape.
You could argue that a stack is missing in my simplified model of the human brain, which would be correct. I used the simple model in allusion to the Chinese room thought experiment which doesn't require anything more than a dictionary.
Turing completeness applies to models of computation, not hardware. Otherwise, nothing would be Turing-complete because infinite memory doesn't exist in the real world. Just read the first sentence of what you linked to:
In computability theory, several closely related terms are used to describe the computational power of a computational system (such as an abstract machine or programming language)
Human thought isn't anything like GPT thought - humans can spend a variable amount of time thinking about what to learn from "training data" and can use explicit logic to reason about it. GPT is more like a form of lossy compression than that.
This is called prompt engineering. If you find a popular, frequently repeated code snippet and then fashion a prompt that is tailored to that snippet then yes the NN will recite it verbatim like a poem.
But that doesn't mean it's the only thing it does or even that it does it frequently. It's like calling a human a parrot because he completed a line from a famous poem when the previous speaker left it unfinished.
The same argument was brought up with GPT too and has been long debunked. The authors (and others) checked samples against the training corpus and it only rarely copies unless you prod it to.
I don't know if I agree with your argument about GPT-3, but I think our disagreement is beside the point: if your human parrot did that, they would--not just in theory but in actual fact! see all the cases of this in the music industry--get sued for it, even if they claim they didn't mean to and it was merely a really entrenched memory.
The point is that many of the examples you see are intentional, through prompt engineering. The pilot asked the copilot to violate copyright, the copilot complied. Don't blame the copilot.
There also are cases where this happens unintentionally, but those are not the norm.
Transformers do learn and abstract. Not as well as humans, but for whatever definition of innovation or creativity you wanna run with, these GPT models have it. It's not magic, it's math, but these programs are approximating the human function of media synthesis across narrowly limited domains.
These aren't your crazy uncle's Markov chain chatbots. They're sophisticated Bayesian models trained to approximate the functions that produced the content used in training.
The model and attention mechanism produces Bayesian properties, but transformers as a whole contain non-Bayesian aspects, depending on how rigorous you want to be in defining Bayesian.
In my experience open source has now become so prevalent that I think some young developers could be completely caught out if the pendulum swings the other way.
Semi-related, the GNU/Linux copypasta is now more familiar to some than the GNU project in general - this is a shame to me as I view the copypasta to be mocking people who worked very hard to achieve what GNU has achieved asking for some credit.
Yeah... but they didn't say it was the law that got you excluded from working on some projects for reading copyrighted code. It's corporate policy that does that - it's not a law, but they do it based on who owns the copyright. Not everything that impacts you is a law.
They said
> Reading some copyrighted code can have you entirely excluded from some jobs
And they're right. It's because of corporate policies. They never said it was because of a law - you imagined that out of nothing.
No that’s not true. I did not edit my posts after reading their reply, and the false accusation was that I changed my comment after it was replied to.
I didn’t challenge whether the question was in good faith, but I’ll just note that the relevant discussion of copyright got dropped in favor of an ad-hominem attack.
My question of which “it” was being referred to is a legitimate question that I believe clarified the intent of my comment, and I added it to make clear I was talking about what @lacker said, not what @jcelerier wrote.
> Edit - I’m adding another point as an edit to show another way to communicate. Would any of your points been lost had you done something similar?
This doesn’t answer my question of why an edit should not be made before I see any replies, nor of why any edit is “poor form” and according to whom. I made my edit immediately. I’m well aware of the practice of calling out edits with a note, I’ve done it many times. I don’t feel the need to call out every typo or clarification with an explicit note, especially when edited very soon after the original comment.
Thanks? Edits exist before you finish replying too, right? Maybe point that out to @chrisseaton, whose incorrect assumption was that I edited in response to what he wrote.
It's dependent on jurisdiction. Black box reverse engineering is only required in certain countries. If I remember correctly, most of Europe doesn't require it.
> > As a human, I am allowed to read copyrighted code and learn from it.
> Of course not. Reading some copyrighted code can make you entirely excluded from some jobs - you can't become a wine contributor if it can be shown you ever read Windows source code and most likely conversely.
You can of course read the code. The consequences are thus increased limitations, like you say.
What you mention is not an absolute restriction from reading copyrighted material. You perhaps have to cease other activities as a result.
If you've ever read a book or interacted with any product, you've learned from copyrighted material.
You've extrapolated "some organizations don't allow you to contribute if you've learned from the code of their direct competitor" to "You're not allowed to learn from copyrighted code", which is absurd.
> Reading some copyrighted code can have you entirely excluded from some jobs - you can't become a wine contributor if it can be shown you ever read Windows source code and most likely conversely.
If that's the case, it should be easy to kill a project like wine - just send every core contributor an email containing some Windows code.
Nobody could verify whether that thing is really Windows code or a fake. Not without the sender self-identifying as a well-known top MS employee with access to it. In that case the sender would be doing something illegal and against MS's interests.
The result would be Wine having a chance to redo the snippet of code in a totally new and different way, and MS being forced to show part of its private code, which would also expose them to patent trolls.
Would be a win-win situation for Wine and a lose-lose situation for MS.
It's very clearly visible on the Wine wiki that people who have ever seen Microsoft Windows source code cannot contribute to Wine due to copyright restrictions.
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
That's the first time I've heard copilot get described as copying little bits of code from the Internet. Copilot aggregates all github source code, removes licences from the code, and regurgitates the code without licenses.
Furthermore, both github and the programmers using copilot know this. Look at any one of these threads written by programmers about copilot. Using copilot is knowingly stealing the source code of others without attribution. Using copilot is literally humans stealing source code from others. Copilot was written for the purpose of taking other's code.
It's not "literally" stealing, because it doesn't deprive anyone of the use the source code. Those two points were somehow extremely obvious to everyone here as long as it was music and movies we were talking about.
And Github themselves have stated that only 0.1% of the Copilot output contains chunks taken verbatim from the learning set. Of those, the vast majority are likely to be boilerplate so generic it's silly to claim ownership, and maybe sometimes impossible to avoid.
It is actually true: in the UK at least, the legal definition of theft includes depriving the owner of the property in question.
The copyright lobby hedge the term as "copyright theft" (i.e. not actual theft) in order to shift the societal understanding. Which appears to have worked.
This is not a value judgement on copyright infringement. Just that technically it doesn't meet the legal definition of theft.
cf. The rather amusing satire of the "you wouldn't steal a handbag" campaign in the UK, which ran "you wouldn't download a bear!"
Actually, it's not theft in the US either; it's intellectual property rights infringement. The way you're defining it, memes would be theft. There is also a thing called fair use when you don't use a significant portion of a copyrighted work, which is why memes and small bits of code aren't infringement when you use them in a different context.
Oh, then today I learned! I didn't realise they were different. Just looked it up in a "plain English dictionary of law" and the distinction seems subtle but important. Rather than "with the intention of depriving the owner", the US one says "with the intention of converting it to their use", which seems broad enough to cover exploiting a copy, rather than the original (or only, in the physical realm...)
Oh Idunno, it "depends on what the meaning of 'is' is"...
> Rather than "with the intention of depriving the owner", the US one says "with the intention of converting it to their use", which seems broad enough to cover exploiting a copy
...or rather, on the meaning of "converting". I've always thought of that as "changing", i.e. "it used to be one thing, and now it's something else". But copying IP only adds a use of it; it doesn't fundamentally change it in this sense: it is still available for the original proprietor's use. Is that really "converted"?
At least for the ordinary-English usage of the word, I think it could be argued that it isn't. But then maybe this isn't just English; maybe the word "converting" also has some term-of-trade definition in that dictionary?
The US definition seems more robust, as otherwise, I could somehow steal something you built (e.g. a farm) and then generously allow you to continue using it, perhaps for a fee. You would therefore not be deprived of it but I would still be the new owner or user.
It seems unlikely this distinction would ever matter in a real court though.
The issue isn't an AI reading copyrighted code, the issue is an AI regurgitating the lines of copyrighted code verbatim. To be clear, humans aren't allowed to do this either.
And sure, nobody cares about your stupid binary tree, but do they care about GNU and the Linux kernel? Imagine someone trained an AI to specifically output Linux code, and used it to reproduce a working OS. Is that fair?
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
This is silly. Copilot is not reading by itself; someone pushed buttons telling it to read and write. If I clone the entirety of GitHub without the licenses, I am telling a robot to do it; that doesn't make it right.
> As a human, I am allowed to read copyrighted code and learn from it. An AI should be allowed to do the same thing.
You're taking the "learning" metaphor too literally. Machine learning models do not learn. They can and do encode their training material into their weights and biases, too. That's what Copilot was doing, regurgitating parts of its training data line for line.
To me, that is not much different from transforming a copyrighted piece of work with, say, compression, a lossy codec, or cropping. There are plenty of people who can learn to play Metallica songs really well, but if they copied specific aspects of their work it would be copyright infringement as well.
A human being can literally learn. We can understand abstract principles from one copyrighted work and apply them to another without actually infringing its copyright. An ML model does not understand; it is a statistical model. It is inherently a derivative work, and it often encodes the copyrighted work it was trained on into the model itself.
That function appears in hundreds, if not thousands of GitHub repos. It's plausible that it's the most famous block of code ever. Are all those repos guilty of copyright infringement?
The only ways this argument could be less alarming to me is if people were bothered that it was writing the same Hello World as somebody else, or that it was naming variables "foo" and "bar".
Let's wait until we have a bulletproof, egregious, and inexcusable case of it lifting code before we panic.
I don't think it should be infringement. You can't copyright an algorithm, and Carmack's function is like six lines of code. If you just read Carmack's source code, then rewrote the same algorithm with different variable names, it would clearly not be infringement. Is it really so bad if you keep his precise variable names, comments, and indentation? How does it hurt Carmack to reproduce this tiny snippet of code exactly, rather than with a small rewrite?
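For context, the snippet in question is the fast inverse square root from Quake III. The algorithm itself (reinterpret the float's bits as an integer, subtract from a magic constant, then do one Newton-Raphson step) can be re-expressed freely; here is a rough Python rendering of that same algorithm, which is exactly the kind of "rewrite with different names" described above.

    import struct

    def fast_inverse_sqrt(x):
        # Approximate 1/sqrt(x) with the classic bit-level trick plus one Newton step.
        i = struct.unpack("<I", struct.pack("<f", x))[0]   # float bits as a uint32
        i = 0x5F3759DF - (i >> 1)                          # the well-known magic constant
        y = struct.unpack("<f", struct.pack("<I", i))[0]   # back to a float guess
        return y * (1.5 - 0.5 * x * y * y)                 # one Newton-Raphson refinement

    print(fast_inverse_sqrt(4.0))   # ~0.499, versus the exact 0.5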
Sorry but it is not a robot publishing the "lifted" code but a human. So the copyright will very much apply. That's an argument like saying CTRL+C/CTRL+V is OK because it is a "computer doing it".
Plus it is not "minor infringement" but code is being lifted verbatim - e.g. as has been demonstrated by the Quake square root code.
Perhaps the final judgment would say "AI cannot infringe on copyright provided that only other AIs consume the result of the first AIs work".
And suddenly there is a world of robots composing, writing and painting for other robots. With us humans left out.
There should be a /s at the end, but the legal world sometimes produces such convolutions. See, for example, the interpretation of the Commerce Clause in Gonzales v. Raich.
As far as IP protections go, they're similar, but the incentives are so different that you get songwriters going to court over bits of melodies that might be worth millions. Outside of quantitative trading, it's hard to find an example of 10 lines of code that are worth millions and couldn't easily be replaced with another implementation.
This is a stupid argument that the Twitter author made. Saving music digitally is reading by robot, so recording music that wasn't digital into a digital format is fair use.
It should be obvious that if the robot is simply scraping web sites and reproducing their text verbatim (without permission and without giving credit) that would be an infringement.
There are a lot of shades of gray between that and the other extreme, which is where it is scraping millions of sites, learning from them, and producing something that isn't all that similar to any of them. Both ends of the spectrum, and everywhere in between, are things that humans can do, but as machines get more capable this is getting trickier and trickier to sort out.
In this case, it sounds like it might be closer to the first example, since significant parts of the code will be verbatim.
Ultimately, I am hoping that such things cause us to completely rethink copyright law. The blurriness of it all is becoming too much to make laws around. We just need better mechanisms to reward people for creating valuable IP that they allow people to freely use as they please.
Copilot is lifting entire functions from GPL code. Legal technicality aside, I know I'd be upset if I GPL'ed some code and someone stole large parts of it.
Why would you GPL the code in the first place if you didn't want other people using it? It's perfectly within the license for someone to do basically whatever they want with GPL code as long as they're not redistributing it. That includes using it for the internal operations of a Fortune 500 company, using it to run a dictatorship, or building a SaaS business on top of it. If you don't want people to "steal" your code, the GPL isn't the right license.
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
So wait, if I write my own AI, let's call it cp, and train it on gnu-gcc.tar.gz with the goal of creating a commercial-compiler.tar.gz, then I can license the result any way I want? After all, most of the work was done by the computer.
Just when I thought tweetstorms couldn't get any worse, here's one where every tweet is a quote-tweet of the author. I don't even understand how I'm supposed to read this.
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
Surely there's a limit to this. If I use a machine to produce something that just happens to exactly match a copyrighted work, now it's not infringement because of the method I used to produce it? That seems nonsensical, but maybe there's precedent for this too? (I have no idea what I'm talking about.)
That quote is basically entirely nonsensical. 'copyright' hasn't decided anything (nor has any legislative body nor the courts). All that's happened is that OpenAI has put forward an argument that using large quantities of media scraped from the internet as training data is fair use. This argument for the most part does not rely on the human vs machine distinction (in fact it leans on the idea that the process is not so different from a human learning). The main place this comes up is the final test of damage to the original in terms of lost market share where it's argued that because it's a machine consuming the content there's no loss of audience to the creator (which is probably better phrased as the people training the neural net weren't going to pay for it anyway). A lot does ride on the idea that the neural net, if 'well designed', does not generally regurgitate its training data verbatim, which is in fairly hot dispute at the moment. OpenAI somewhat punts on this situation and basically says the output may infringe copyright in this case, but the copyright holder should sue whoever's generating and using the output from the net, not the person who trained and distributed the net.
Surely it could be argued that there is a loss of audience to the author. At the moment some people will read the author's code directly in order to find out how to solve a problem. In the future at least some of those people will just ask copilot to solve the problem for them.
It all comes down to this: this has not been tested in court. The above opinion, or for that matter any opinion from any lawyer or not-a-lawyer, is just that, an opinion.
As a business it is your responsibility to determine if this code-copying is worth a risk to your business.
Based on my experience, I'm pretty sure all corporate lawyers will disallow such code copying until it has been tested in court. It's just a matter of who will be the guinea pig.
Fair use for training and "independent creation" are one thing; an AI "remembering and mostly verbatim copying code over" is another.
Many of the current machine learning applications try to teach the AI to understand the concepts behind its training data and use that understanding to do whatever it is trained to do.
But most (all?) fail to properly reach that goal in any more complicated case, at least the kinds of models which are used for things like Copilot (GPT-3?).
Instead, what these models learn can be described as a combination of some abstract understanding and verbatim snippets of input data of varying size.
As such, while they sometimes generate "new" things based on "understanding", they also sometimes just copy things they have seen before!! (Like in the Quake code example, where it even copied over some of the not-so-"proper" comments expressing the programmer's frustration.)
It's like a human who doesn't understand programming or English or even Latin letters, but has a photographic memory and tries to create something which seems to make sense by recombining existing verbatim snippets, sometimes tweaking them.
I.e. if the snippets are small enough and tweaked enough, it's covered by fair use and the like, BUT the person doing it doesn't know about this distinction, so if a large remembered snippet matches the context verbatim, it will get put in, effectively copying code of a size which likely doesn't fall under fair use.
Also, this is a well-known problem, at least it was when I covered topics including ML ~5 years ago. Good examples included extracting whole sequences of paragraphs of a book out of such a network, or (more strikingly) extracting things like people's contact data based on their names, or credit card information (in the case of systems trained on mails).
So the fact that Copilot is basically guaranteed to sometimes copy not-so-small snippets of code, and potentially comments, in a way not really appropriate wrt. copyright should have been well known to the ML specialists in charge of this project.
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
This is a ridiculous conclusion. The ultimate destination of the robot actions' product is its user, i.e. a human. It is a clear corollary of the transitive law. Therefore, all human-focused legal concepts, including infringement, are applicable in such cases.
The absurdness of the conclusion cited above can be easily illustrated as follows. Suppose that a person owns or rents an advanced robot (say, like Boston Dynamics' Spot, but better). He/she then programs it to break into someone's house and steal something valuable. All goes according to plan; the robot delivers the stolen goods to the rendezvous point and, if rented, gets returned. Now, according to the conclusion's logic, since a robot has done the actual "work", "it's fair use". Nonsense, right?
Just to clarify: I like the concept of GitHub Copilot (even though I have not yet tried this particular product). It offers various benefits, from pedagogical ones to adopting software engineering best practices to improving engineering productivity. However, I think that the IP and legal aspects of this approach and of this specific product should be carefully studied and resolved in a consistent manner (e.g., by preventing the model or system from outputting exact source code snippets).
An AI isn't learning from it. It's effectively copying prior work when it solves a problem. There is no novel out-of-bounds data generation by modern AI approaches.
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
Reading by a robot doesn't count. But injecting a robot between copyright material and a product doesn't magically strip the copyright from whatever it produces.
> I am capable of summarizing the thoughts of lawyers
Then, I'm sorry, but you seem to have done a pretty bad job of summarizing that paper?
My own take on summarizing its conclusion would be:
"If the program can pass the Turing Test, then it should be legally liable, just like a human would."
(Yes, emphasis on the should here, but the way that you're presenting that quote might make readers think that the paper's conclusion is the opposite one!)
----
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
Some of the examples from that paper are A&M Records, Inc. v. Napster (no comment) and White Smith Music Publ’g Co. v. Apollo Co. (piano rolls), and in that latter case we pretty much (?) got the whole Copyright Act of 1909 the very next year, where these “mechanical reproductions” were subjected to a statutory compulsory license.
So at the very least there should be concern about how Copilot might eventually be considered by the courts to facilitate copyright infringement, and might have to provide the source of its "insights"?
(Attribution being the bare minimum that most of the software licenses require.)
In Germany, there is no fair use exception to copyright. Also, there is no IP in most software principles: e.g. writing a specific loop, which even a (weak) AI could suggest, would probably be too simple to be protected.
What could be valid is a right to not mimic collections, but that would mean you cannot clone the Copilot, as input is mapped to a non-trivial collection of outputs.
Hit the nail on the head, at least as far as my concerns go.
Security is a constant issue with humans writing code. Do we really want an "AI" that understands neither code nor security spitting out snippets of code to be pasted into network services?
If Copilot ever becomes truly popular it's going to be an absolute security nightmare both from the code it suggests (just bad code, GIGO as you say), but also because adversaries will be gaming it by posting bad code for it to pick up and "learn" from.
Well, maybe the interpretation will change if the right people are pissed off.
At this point, how hard would it be to produce a structurally similar "content-aware continuation/fill" for audio producers, film makers, etc, which suggests audio snippets or film snippets, trained from copyrighted source material?
If prompted by a black screen with some white dots, the video tool could suggest a sequence of frames beginning with text streaming into the distance "A long time ago in a galaxy far far away ..." and continue from there.
Normally we don't try to train models to regurgitate their inputs, but if we actually tried, I'm sure one could be made to reproduce the White Album or Thriller or whatever else.
> nobody cares if my ten-line "how to invert a binary tree" snippet is the same as someone else's.
Maybe nobody cares about that, but the problem is that GitHub's automated tool is not telling you whether the code it shows you is actually an exact copy of existing code, or how much of that existing code is being copied, or whether the existing code is licensed, or, if it is licensed, whether your copying is in accordance with the license or not. And without that information you can't possibly know whether what you are doing is legal or ethical. Sure, you could try to guess, but that sort of thing is not supposed to rely on guessing.
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
Of course people are hurt, namely the original creators who spent years of work and whose work is potentially laundered, depending on how good this IP grabbing AI will get.
If it gets really good, some smug and well connected loser (e.g. the type who posts pictures of himself with a microphone on GitHub) will click a button, steal other people's hard work and start a "new" project that supersedes the old one.
So what happens when someone makes a transformer network that can read fanfics and animate them live, trained on the whole collection of MPAA movies? I mean, it's inevitable. Given the history of the MPAA, I don't think they're gonna lie down and just take it. I feel like we're on a slippery slope to provoking the "IP lords" into brutally draconian measures that will make the Disney copyright extensions look like a tax deferral.
You sure about that? You aren't allowed to read it outside of the license attached to it. Downloading pirated source code, reading it, and then typing it out from memory doesn't magically give you a right to use it in any way. I would argue the licenses attached to most copyrighted code are being violated the moment the code is scraped and replicated without permission.
There might be a slippery slope here: suppose there's a GPL version of product X in the training set. I'm building a proprietary competitor. Then let's say copilot makes it a little bit easier and cheaper for me to build my product.
Now suppose it's 10 years from now and it's trivial to build a proprietary competitor.
>Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
Glad to hear this. My warez group from here on out will only release binaries written collaboratively with a neural net trained on the best proprietary software available
> As a human, I am allowed to read copyrighted code and learn from it. An AI should be allowed to do the same thing.
This is a very false equivalency. AI and humans are different. First, AI is at best a slave, and likely a slave of capital. Second, scale makes a difference.
I have a question: how do we define the AI? In an extreme case, could I say copy-paste is AI if I implement it with hundreds of thousands of NN layers and it outputs a piece of the original work (e.g. 30s of music out of 3 minutes, with small modifications)?
> As a human, I am allowed to read copyrighted code and learn from it.
Plenty of examples show that it didn't learn that much and just copies literal parts of code. At my school that would have been grounds for plagiarism, which wasn't treated lightly.
> Infringement is for humans only; when computers do it, it’s fair use.
But ultimately the human is OK-ing the code and committing it, basically as his own work most of the time. I'm reasonably sure that this may matter to courts.
There are a lot of sibling commenters disagreeing with this take, but I think they miss that ultimately this comes down to how legal experts interpret the tech, rather than what tech experts think the law should be.
This is, imo, unfortunate, as often the legal interpretation is based on a gross misunderstanding of how the tech works, but this is the way.
I don't think copilot should be legal according to my own interpretation but in this (rare) case I feel the "IANAL" tag applies not because I lack (legal) knowledge, but rather because I have (tech) knowledge that is likely absent from actual decision making on legal outcomes (therefore leading to different legal outcomes than how I would see things working).
Autonomous programming will be explored. Potentially, Copilot is a proof of concept, an early step in that direction. If it is, the corrections made by Copilot users will be applied to the development of the future of unattended programming. Whether it is or not, it's close enough that any legal outcomes experienced by Copilot users will contribute to the definition of liability boundaries relevant to the future of autonomous programming. Copilot users are numerous enough that the risk of ending up under the foot of a copyright owner with the means and will to crush a user is low, but no one should take such a risk to use a novelty like Copilot in production code.
> As a human, I am allowed to read copyrighted code and learn from it. An AI should be allowed to do the same thing.
This is a non-sequitur. Why should it?
> And nobody cares if my ten-line "how to invert a binary tree" snippet is the same as someone else's.
Are you going to make up a rule for every length and type of code? What about twenty line? If ten lines are fine then surely twenty would be? How about pictures? If some code is then surely a picture or two wouldn't hurt? Let's just tweak the AI slightly so it regurgitates more code verbatim -- or do courts have to examine any change made to the AI and okay them?
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
The Windows source code can be found on the internet. As a human you're allowed to read that if you have it. Try making an AI that copies bits of that into your code and release that on the internet.
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
Quite the opposite. We all get a tiny bit better with good information like this. This is what the internet should be for, evolving, learning from past mistakes, information availability.
If the discussion was “I clicked this button and got someone’s entire chat platform” that would be different. Words and sentences aren’t copyrighted, books are, so when exactly does a collection of words become a book?
There is nuance, and the linked page has none. But that’s fine, that guy is free to pull his content off GitHub. This seems like a useful feature for other people who want to make things first and foremost.
> Words and sentences aren’t copyrighted, books are, so when exactly does a collection of words become a book?
If that were true, then 20 people could each steal a single chapter from a book, and one of the people could combine those 20 chapters into a new copyright-free book. That's clearly false.
https://twitter.com/luis_in_brief/status/1410242882523459585...
And this is a longer article about how IP and AI interact:
https://ilr.law.uiowa.edu/print/volume-101-issue-2/copyright...