
Lots of questions:

  - does the code generated by the AI belong to me or to GitHub?
  - what license does the generated code fall under?
  - if generated code becomes the reason for infringement, who gets the blame or faces legal action?
  - how can anyone prove the code was actually generated by Copilot and not the project owner?
  - if a project member does not agree with the usage of Copilot, what should we do as a team?
  - can Copilot copy code from other projects and use that excerpt code?
    - if yes, *WHY* ?!
    - who is going to deal with legalese for something he or she was not responsible for in the first place?
    - what about conflicts of interest?
  - can GitHub guarantee that Copilot won't use proprietary code excerpts in FOSS-ed projects that could lead to new "Google vs Oracle" API cases?


In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler.

On the training question specifically, you can find OpenAI's position, as submitted to the USPTO here: https://www.uspto.gov/sites/default/files/documents/OpenAI_R...

We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!


You should look into:

https://breckyunits.com/the-intellectual-freedom-amendment.h...

Great achievements like this only hammer home how illogical copyright and patent laws are.

Ideas are always shared creations, by definition. If you have an “original idea”, all you really have is noise! If your idea means anything to anyone, then by definition it is built on other ideas, it is a shared creation.

We need to ditch the term “IP”, it’s a lie.

Hopefully we can do that before it’s too late.


> Ideas are always shared creations, by definition. If you have an “original idea”, all you really have is noise! If your idea means anything to anyone, then by definition it is built on other ideas, it is a shared creation.

Copyright doesn't protect "ideas" it protects "works". If an artist spends a decade of his life painting a masterpiece, and then some asshole sells it on printed T-shirts, then copyright law protects the artist.

Likewise, an engineer who writes code should not have to worry about some asshole (or some for-profit AI) copy and pasting it into other peoples' projects. No copyright protections for code will just disincentivize open source.

Software patents are completely bullshit though, because they monopolize ideas which 99.999% of the time are derived from the ideas other people freely contributed to society (aka "standing on the shoulders of giants"). Those have to go, and I do not feel bad at all about labeling all patent-holders greedy assholes.

But copyright is fine and very important. Nothing is perfect, but it does work very well.


Copyrights are complete bullshit too though. Take your two examples. First, the artist is presumably using paints and mediums developed, arguably over thousands of years, at great cost. So even though she is assembling the leaf nodes of the tree, the vast majority of the “work” was created by others. Shared creation.

Same goes for an engineer. Binary notation is at the root of all code, and in the intermediate nodes you have Boolean logic and microcode and ISAs and assembly and compilers and high-level languages and character sets. The engineer who assembles some copy-and-pasteable leaf nodes is by definition building a shared creation to which they’ve contributed the least.


The basis of copyright isn’t that the sum product is 100% original. That would be insane, since nothing we do is ever original; ultimately it will always be a component of nature. The point is that your creation is protected for a set amount of time, and then it too eventually becomes a component for future works.


> the artist I assume is using paints and mediums developed arguably over thousands of years, at great cost.

And they went to the store and paid money for those things.


And they handed the cashier money and then got to do whatever they wanted with those things. Now they want to sell their painting to the cashier AND control what the cashier does with it for the rest of the cashier's life. They want to make the cashier a slave to a million masters.


Remind me when GitHub handed anyone any money for the code they used?


I'm sure natfriedman will be thrilled to abolish IP and also apply this to the Windows source code. We can expect it on GitHub any minute!


I used to work at Microsoft and occasionally would email Satya the same idealistic pitch. I know they have to be more conservative, but some of us have to envision where the math can take us, shout out loud about it, and hope they steer well. In my first week at MS, I was heckled for installing Ubuntu on my Windows machine. When I left, Windows was shipping with Ubuntu. What may seem impossible today can become real if enough people push the ball forward together. I even hold out hope that someday BG will see the truth and help reduce the ovarian lottery by legalizing intellectual freedom.


Talking about the ovarian lottery seems strange in a thread about an AI tool that will turn into a paid service.

No one will see the light at Microsoft. The "open" source babble is marketing and recruiting oriented, and some OSS projects infiltrated by Microsoft suffer and stagnate.


All I know is that if a lawsuit comes around for a company who tried to use this, Github et al won't accept an ounce of liability.


You can't abolish IP without completely restructuring the economic system (which I'm all for, BTW). But just abolishing IP and keeping everything the same is kind of myopic. Not saying that's what you're advocating for, but I've run into this sentiment before.


Sure, but I usually take "abolish X" to mean "let's agree on an end goal of abolishing X and then work rapidly to transition to that world." In that sense, the person is usually not advocating the simple case of changing one law, but the broader case of examining the legal system and making the appropriate changes to realize a functioning world in which we can "abolish X".


I agree that it would be a huge upheaval. Absolutely massive change to society. But I have every confidence that we have a world filled with good, bright people who can steer us through the transition. Step one now is just educating people that these laws are marketed dishonestly, are inequitable, and are counterproductive to the progress of ideas. As the market starts to get wind that the truth is spreading, I believe it will start to prepare for the transition.


In practical terms, IP could be referred to as unique advantages. What is the purpose of an organization that has no unique qualities?

In general, what is IP and how it's enforced are two separate things. Just because we've used copyright and patents to "protect" an organization's unique advantages, doesn't mean we need to keep using them in the same way. Or maybe it's the best we can do for now. That's why BSD style licences are so great.


> training ML systems on public data is fair use

Uh, I very much doubt that. Is there any actual precedent on this?

> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

But apparently not eager enough to have this discussion with the community before deciding to train your proprietary for-profit system on billions of lines of code that undoubtedly are not all under CC0 or similar no-attribution-required licenses.

I don't see attribution anywhere. To me, this just looks like yet another case of appropriating the public commons.


@Nat, these questions (all of them, not just the 2 you answered) are critical for anyone who is considering using this system. Please answer them?

I for one wouldn't touch this with a 10000' pole until I know the answers to these (very reasonable) questions.


How do you guarantee it doesn't copy a GPL-ed function line-by-line?


Yup, this isn't a theoretical concern, but a major practical one. GPT models are known for memorizing their training data: https://towardsdatascience.com/openai-gpt-leaking-your-data-...

Edit: Github mentions the issue here: https://docs.github.com/en/github/copilot/research-recitatio... and here: https://copilot.github.com/#faq-does-github-copilot-recite-c... though they neatly ignore the issue of licensing :)


That second link says the following:

> We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set

That's kind of a useless stat when you consider that the code it generates makes use of your existing variable/class/function names when adapting the code it finds.

I'm not a lawyer, but I'm pretty sure I can't just bypass GPL by renaming some variables.


It's not just about regurgitating training data during a beam search, it's also about being a derivative work, which it clearly is in my opinion.


> GPT models are known for memorizing their training data

Hash each function, store the hashes as a blacklist. Then you can ask the model to regenerate the function until it is copyright safe.
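A minimal sketch of that idea (all names here are hypothetical; nothing suggests Copilot actually does this): normalize each training function, hash it into a blocklist, and reject any suggestion whose hash is already in the set.

```python
import hashlib

def normalize(code: str) -> str:
    # Crude normalization: drop blank lines and surrounding whitespace,
    # so trivial reformatting doesn't defeat the check.
    return "\n".join(ln.strip() for ln in code.splitlines() if ln.strip())

def fingerprint(code: str) -> str:
    return hashlib.sha256(normalize(code).encode()).hexdigest()

# Hypothetical blocklist: one hash per function in the training corpus.
blocklist = {fingerprint("def add(a, b):\n    return a + b")}

def is_copyright_safe(suggestion: str) -> bool:
    # The model would be asked to resample until this returns True.
    return fingerprint(suggestion) not in blocklist
```

Of course, exact hashing only catches verbatim (or near-verbatim) copies; as the replies below note, renamed variables or partial copies slip straight through.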


What if it copies only a few lines, but not an entire function? Or the function name is different, but the code inside is the same?


If we could answer those questions definitively, we could also put lawyers out of a job. There’s always going to be a legal gray area around situations like this.


Matching on the abstract syntax tree might be sufficient, but might be complex to implement.
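A toy illustration of the AST idea (nothing GitHub has described, just a sketch): parse both snippets, blank out every identifier, and compare the dumped trees, so that code differing only in names compares equal.

```python
import ast

def skeleton(source: str) -> str:
    # Rename every identifier to "_" so that two snippets differing
    # only in variable/function names produce identical dumps.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.name = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
    return ast.dump(tree)

# Same logic, different names: identical skeletons.
a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
b = "def summe(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc"
```

This catches renamed copies within one language, but a production version would need per-language parsers and some tolerance for reordered or lightly restructured code.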


You can probably tokenize the names so they become irrelevant, and ignore non-functional whitespace, so that a canonical form C remains. Maybe one can hash all the training data D such that hash(C) is in hash(D). Some sort of Bloom filter...
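The Bloom-filter part of that sketch could look like this (purely illustrative): a compact membership test with no false negatives and a small false-positive rate, without having to ship the training data itself.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter over normalized code snippets."""

    def __init__(self, bits: int = 1 << 20, k: int = 4):
        self.size, self.k = bits, k
        self.bits = bytearray(bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

If the canonical form of a suggestion is in the filter, it gets flagged; a false positive just costs one unnecessary regeneration.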


Surprised not to see more mention of this. It would make sense for an AI to "copy" existing solutions. In the real world, we use clean-room design to avoid this.

In the AI world, unless all GPL (etc.) code is excluded from the training data, it's inevitable that some will be "copied" into other code.

Where lawyers decide what "copy" means.


It's not just about copying verbatim. They clearly use GPL code during training to create a derivative work.

Then you have the issue of attribution with more permissive licenses.


How do you know that when you write a simplish function, for example, it is not identical to some GPL code somewhere? "Line by line" code does not exist anywhere in the neural network. It doesn't store or reference data in that way. Every character of code is in some sense "synthesized".

If anything, this exposes the fragility of our concept of "copyright" in the realm of computer programs and source code. It has always been ridiculous. GPL is just another license that leverages the copyright framework (the enforcement of GPL cannot exist outside such a framework, after all), so in such weird "edge cases" GPL is bound to look stupid, just like any other scheme.

Remember that GPL also forbids "derivative" works from being relicensed (under a less "permissive" license). It is safe to say that you write code close enough to be considered "derivative" of some GPL code somewhere pretty much every day, and you can't possibly prove that you didn't cheat. So the whole framework collapses in the end anyways.


> How do you know that when you write a simplish function for example, it is not identical to some GPL code somewhere?

I don't, but then I didn't go first look at the GPL code, memorize it completely, do some brain math, and then write it out character by character.


I truly don't think they can guarantee that. Which is a massive concern.


(1) That may be so, but you are not training the models on public data like sports results. You are training it on copyright protected creations of humans that often took years to write.

So your point (1) is a distraction, and quite an offensive one to thousands of open source developers, who trusted GitHub with their creations.


   (1) training ML systems on public data is fair use 

This one is tricky considering that kNN is also a ML system.


kNN needs to hold on to a complete copy of the dataset itself, unlike a neural net, where it's all mangled.


What about privacy? Does the AI send code to GitHub? This reminds me of Kite.


Yes, under "How does GitHub Copilot work?":

> [...] The GitHub Copilot editor extension sends your comments and code to the GitHub Copilot service, which then uses OpenAI Codex to synthesize and suggest individual lines and whole functions.


Fair use doesn't exist in every country, so it's US only?


Yes, my partner likes to remind me we don't have it here in Australia. You could never write a search engine here. You can't write code that scrapes websites.


It exists in the EU also (and it's much more powerful here).


The EU doesn't have a copyright-related fair use. Quite the opposite; that's why we are getting upload filters.


False. In Spain you have it under "uso legitimo".


Spain is only part of the EU, not the EU.


One exception makes the blanket "the EU doesn't have it" incorrect.

The EU doesn't enforce it on the member states, yes. But some (maybe all) countries in the EU do have it.


> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

Another question is this: suppose I work solo on a project; I have enabled Copilot and, after a period of time, reached a 50-50 development split with it. One day the "hit by a bus" factor takes place; who owns the project after this incident?


Your estate? The compiler comparison upthread seems perfectly valid. If you work on a solo project in C# and die, Microsoft doesn’t automatically own your project because you used Visual Studio to produce it.


> the output belongs to the operator, just like with a compiler.

No it really is not that easy, as with compilers it depends on who owned the source and which license(s) they applied on it.

Or would you say I can compile the Linux kernel and the output belongs to me, as compiler operator, and I can do whatever I want with it without worrying about the GPL at all?


> training ML systems on public data is fair use

So, to be clear, I am allowed to take leaked Windows source code and train an ML model on it?


Or, take leaked Windows source code, run it through a compiler, and own it!


What does "public" mean? Do you mean "public domain", or something else?


Unfortunately, in ML "public data" typically means available to the public. Even if it's pirated, like much of the data available in the Books3 dataset, which is a big part of some other very prominent datasets.


So basically YouTube all over again? I.e., bootstrap and become popular using whatever widely available media (pirated by crowdsourced piracy), and then many years later, when it gets popular and dominant, it has to turn around, "do things right", and guard copyrights.


Fair Use is an affirmative defense (i.e. you must be sued and go to court to use it; once you're there, the judge/jury will determine if it applies). But taking in code with any sort of restrictive license (even if it's just attribution) and creating a model using it is definitely creating a derivative work. You should remember, this is why nobody at Ximian was able to look at the (openly viewable, but restrictively licensed) .NET code.

Looking at the four factors for fair use, it looks like Copilot will have these issues:

  - The model developed will be for a proprietary, commercial product
  - Even if each is a small part of the model, all the training data for that model are fully incorporated into it
  - There is a substantial likelihood of money loss ("I can just use Copilot to recreate what a top-tier programmer could generate; why should I pay them?")

I have no doubt that Microsoft has enough lawyers to keep any litigation tied up for years, if not decades. But your contention that this is "okay because it's fair use" based on a position paper by an organization supported by your employer... I find that reasoning dubious at best.


It is the end of copyright then. NNs are great at memorizing text. So I just train a large NN to memorize a repository, and the code it outputs during "inferencing" is fair use?

You can get past the GPL, LGPL, and other licenses this way. Microsoft can finally copy the Linux kernel and get around the GPL :-).


> - under what license the generated code falls under?

Is it even copyrighted? Generally my understanding is that, to be copyrightable, it has to be the output of a human creative process; this doesn't seem to qualify (I am not a lawyer).

See also, monkeys can't hold copyright: https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...


> Is it even copyrighted?

Isn't it subject to the licenses of the code the model was created from, since the learning is basically just an automated transformation of that code? The output would still carry the original license; otherwise I could just run a minifier, or some other more elaborate code transformation, on some FOSS project, for example the Linux kernel, and relicense it under whatever I want.

Does not sound right to me, but IANAL and I also did not really look at how this specific model/s is/are generated.

If I did some AI on existing code, I'd be quite cautious and group by compatible licence classes, asking users what their project's licence is and then only using the compatible parts of the models. Anything else seems not really ethical, and rather uncharted territory in law, to me. That may not mean much, as IANAL and I'm just some random voice on the internet, but FWIW I at least tried to understand quite a few FOSS licences to decide what I can use in projects and what not.

Does anybody know of relevant cases about AI and the input data the models were trained on, ideally in US or European jurisdictions?


This is a great point. If I recall correctly, prior to Microsoft's acquisition of Xamarin, Mono had to go out of its way to avoid accepting contributions from anyone who'd looked at the (public but non-FOSS) source code of .NET, for fear that they might reproduce some of what they'd seen rather than genuinely reverse engineering.

Is this not subject to the same concern, but at a much greater scale? What happens when a large entity with a legal department discovers an instance of Copilot-generated copyright infringement? Is the project owner liable, is GitHub/Microsoft liable, or would a court ultimately tell the infringee to deal with it and eat whatever losses occur as a result?

In any case, I hope that GitHub is at least limiting any training data to a sensible whitelist of licenses (MIT, BSD, Apache, and similar). Otherwise, I think it would probably be too much risk to use this for anything important/revenue-generating.


> In any case, I hope that GitHub is at least limiting any training data to a sensible whitelist of licenses (MIT, BSD, Apache, and similar). Otherwise, I think it would probably be too much risk to use this for anything important/revenue-generating.

I'm going to assume that there is no sensible whitelist of licenses until someone at GitHub is willing to go on the record that this is the case.


> I hope that GitHub is at least limiting any training data to a sensible whitelist of licenses (MIT, BSD, Apache, and similar)

Yes, and even those licences require preservation of the original copyright attribution and licence. MIT gives some wiggle room with the phrase "substantial portions", so it might just be MIT and WTFPL


Interesting to see since Nat was a founder of Xamarin


(Not a lawyer, and only at all familiar with US law, definitely uncharted territory)

No, I don't believe it is, at least to the extent that the model isn't just copy and pasting code directly.

Creating the model implicates copyright law; that's creating a derivative work. It's probably fair use (transformative, not competing in the marketplace, etc.), but whether or not it is fair use is GitHub's problem and liability, and only if they didn't have a valid license (which they should have for any open source inputs, since they're not distributing the model).

I think the output of the model is just straight up not copyrighted though. A license is a grant of rights, you don't need to be granted rights to use code that is not copyrighted. Remember you don't sue for a license violation (that's not illegal), you sue for copyright infringement. You can't violate a copyright that doesn't exist in the first place.

Sometimes a "license" is interpreted as a contract rather than a license, in which you agreed to terms and conditions to use the code. But that didn't happen here, you didn't agree to terms and conditions, you weren't even told them, there was no meeting of minds, so that can't be held against you. The "worst case" here (which I doubt is the case - since I doubt this AI implicates any contract-like licenses), is that github violated a contract they agreed to, but I don't think that implicates you, you aren't a party to the contract, there was no meeting of minds, you have a code snippet free of copyright received from github...


So if I make AI that takes copyrighted material in one side, jumbles it about, and spits out the same copyrighted material on the other side, I have successfully laundered someone else's work as my own?

Wouldn't GitHub potentially be responsible for the infringement by distributing the copyrighted material knowing that it would be published?


I exempted copied segments at the start of my previous post for a reason: that reason is that I don't really know. I doubt it works, because judges tend to frown on absurd outcomes.


Where does copying end though? Suppose an AI "retypes" it, not only with some variable renaming but with transformations that can't be described as a few simple code rewrites (neural nets are really not transparent and can do weird stuff). It wouldn't seem like a copy when just comparing parts of it, but it effectively would be one, as it was an automated translation.


Probably, copying ends when the original creative elements are unrecognizable. Renaming variables actually goes a long way to that, also having different or standardized (and therefore not creative) whitespace conventions, not copying high level structure of files, etc.

The functional parts of code are not copyrightable, only the non functional creative elements.

(Not a lawyer...)


> The functional parts of code are not copyrightable, only the non functional creative elements.

1. Depends heavily on the jurisdiction (e.g., software patents are a thing in America but not really in most European jurisdictions)

2. A change to a copyrightable work, creative or not, would still mean that you created a derived work where you'd hold some additional rights, depending on the original license, but not that it would now be only in your creative possession. E.g., check §5 of https://www.gnu.org/licenses/gpl-3.0.en.html

3. What do you think of when saying "functional parts"? Some basic code structure like an `if () {} else {}` -> sure, but anything algorithm-like can be seen as copyrightable, and whatever (creative or not) transformation you apply, at its base it is a derived work; that's just the definition of a derived work.

Now, would that matter in court? That depends not only on 1., but also very much on the specific case. Most trivial cases would probably be thrown out as not substantial enough, or die before even reaching any court, unless an org invests enough lawyer power or sues in a court favourable to its case (OLG Hamburg, anyone?).

But that actually scares me a bit when thinking about it in this context, because it seems like, assuming you're right, this would significantly erode the power of copyleft licenses like the (A)GPL.

Especially if a non-transparent (e.g., AI) "code laundry", let's call it, were deemed a lawful way to strip out copyright. Being non-transparent, it wouldn't be immediately clear whether a change is creative or not, to use the criterion for copyright you used. This would break basically the whole FOSS community, and with all its major projects (Linux, coreutils, Ansible, Git, WordPress, just to name a few) basically 80% of core infrastructure.


If the model is a derivative work, why wouldn’t works generated using the model also be derivative works?


Because a derivative work must "as a whole, represent an original work of authorship".

https://www.law.cornell.edu/uscode/text/17/101

(Not a lawyer...)


In the US, yes. Elsewhere, not necessarily.


It is the output of human creative processes, just not yours. Like an automated Stack Overflow snippet engine.


>Generally my understanding is that to be copyrightable it has to be the output of a human creative process

https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...


You should read the FAQ at the bottom of the page; I think it answers all of your questions: https://copilot.github.com/#faqs


> You should read the FAQ at the bottom of the page; I think it answers all of your questions: https://copilot.github.com/#faqs

Read it all, and the questions still stand. Could you, or anyone on your team, point me to where the questions are answered?

In particular, the FAQ doesn't assure that the "training set from publicly available data" doesn't contain license or patent violations, nor if that code is considered tainted for a particular use.


From the faq:

> GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.

I'm guessing this covers it. I'm not sure if someone posting their code online, but explicitly saying you're not allowed to look at it, getting ingested into this system with billions of other inputs could somehow make you liable in court for some kind of infringement.


That doesn't cover it, since that is a technical answer for a non-technical question. The same questions remain.


That doesn't cover patent violations, license violations, or compatibility between licenses, which would be the most numerous and non-trivial cases.


How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?

Does everyone in this thread contact their lawyers after cutting and pasting a mergesort example from Stackoverflow that they've modified to fit their needs? Seems folks are reaching a bit.


For that very reason, many companies have policies that forbid copying code from online (especially from StackOverflow).


That mitigates copyright concerns, but patent infringement can occur even if the idea was independently rediscovered.


I was answering a specific question, "How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?" The answer is that many companies have forbidden that specific action in order to remove the risk from that action.

You are expanding the discussion, which is great, but that doesn't apply in answer to that specific question.

There are answers in response to your question, however. For example, many companies use software for scanning and composition analysis that determines the provenance and licensing requirements of software. Then, remediation steps are taken.


Not sure what you're getting at. Are you suggesting that independent discovery is a defense against patents? Or are you clear that it isn't a defense, but just arguing that something from the internet is more likely to be patented than something independently invented in-house? Maybe that's true, but it doesn't really answer the question of

> How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?

The only real answer is a patent search.


There are different ways to handle risk: avoidance, reduction, transferral, acceptance. I was answering a specific question as to how people manage risk in a given situation; in answer, I related how companies reduce that risk. I was not talking about general cases of defending against patent risk, but the specific case of reducing the risk of adding externally found code to a product.

My answer described literally what many companies do today. It was not a theoretical pie in the sky answer or a discussion about patent IP.

To restate, the real-world answer I gave for, "How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?" is often "Do not take code from the Internet."


I think a patent violation with Copilot is exactly the same scenario as if you violated a patent yourself without knowing it.


Sounds like using Copilot can introduce GPL'd code into your project and make your project bound by the GPL as a result...

0.1% is a lot when you use 100 suggestions a day.


The most important question, whether you own the code, is sort of maybe vaguely answered under “How will GitHub Copilot get better over time?”

> You can use the code anywhere, but you do so at your own risk.

Something more explicit than this would be nice. Is there a specific license?

EDIT: also, there’s multiple sections to a FAQ, notice the drop down... under “Do I need to credit GitHub Copilot for helping me write code?”, the answer is also no.

Until a specific license (or explicit lack there-of) is provided, I can’t use this except to mess around.


None of the questions and answers in this section hold information about how the generated code affects licensing. None of the links in this section contain information about licensing, either.


I don't see the answer to a single one of their questions on that page - did you link to where you intended?

Edit: you have to click the things on the left, I didn't realize they were tabs.


Sorry Nat, but I don't think it really answers anything. I would argue that using GPL code during training makes Copilot a derivative work of said code. I mean, if you look at how a language model works, then it's pretty straightforward. The term "code synthesizer" alone insinuates as much. I think this will probably ultimately be tested in court.


This page has a looping back button hijack for me


Does Copilot phone home?


When you sign up for the waitlist it asks permission for additional telemetry, so yes. Also, the "how it works" image seems to show that the actual model is on GitHub's servers.


Yes, and with the code you're writing/generating.


This obviously sucks.

Can't companies write code that runs on customers' premises these days? Are they too afraid somebody will extract their deep learning model? I have no other explanation.

And the irony is that these companies are effectively transferring their own fears to their customers.


It's a large and GPU-hungry model.


Some of your questions aren't easy to answer. Maybe the first two were OK to ask; others would probably require lawyers, and maybe even courts, to decide. This is a pretty cool new product just being shared on an online discussion forum. If you are serious about using it for a company, talk to your lawyers, get in touch with GitHub's people, and maybe hash out these very specific details on the side. Your comment came off as super negative to me.


> This is a pretty cool new product just being shared on an online discussion forum.

This is not one lone developer with a passion promoting their cool side-project. It's GitHub, which is an established brand and therefore already has a leg up, promoting their new project for active use.

I think in this case, it's very relevant to post these kinds of questions here, since other people will very probably have similar questions.


I think these are very important questions.

The commenter isn't interrogating some indy programmer. This is a product of a subsidiary of Microsoft, who I guarantee has already had a lawyer, or several, consider these questions.


No, they are all entirely reasonable questions. Yeah, they might require lawyers to answer - tough shit. Understanding the legal landscape that ones' product lives in is part of a company's responsibility.


Regardless of tone, I thought it was chock full of great questions that raised all kinds of important issues, and I’m really curious to hear the answers.



