Generative AI profits off your code. Make them pay for it (paytotrain.ai)
40 points by evashang on Nov 25, 2022 | 54 comments


If you put your code on GitHub, it's bound by the TOS, which states (https://docs.github.com/en/site-policy/github-terms/github-t...):

> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

Doesn't that contradict the purpose of this website? Is this performance art?

Also, why is this website so secretive? Why not publish the license on the website?

> Who is PayToTrain created by and why?

> PayToTrain is created by a small group of developers and attorneys who are passionate about open source software and ensuring that developers are properly compensated for their work. The website and service are provided completely free of charge.

Edit: PayToTrain looks like an undisclosed ad and/or project from legalist . com.


I agree the Humans Only Clause does not prevent Microsoft from training Copilot on code from GitHub, due to GitHub's terms of service, but I think it does prevent, say, Salesforce from training its CodeGen model.

So if the clause is widely adopted, it may be good for Microsoft and bad for Salesforce. If you want to reward Microsoft and punish Salesforce, it may be a good idea.


It shouldn't even hinge on a TOS.

If Microsoft loses this case, it actually means Microsoft wins and we all lose.

Who has a large enough corpus of training data? Only institutional copyright holders.

This is probably going to play out like Oracle vs. Google when Google suddenly realized that they should lose and intentionally threw the case.

I'm so worried about this case. Treating copyrighted training data as fair use, and letting models learn as a child might learn from a book or movie, is the best way to proceed. It widens the playing field for both development and disruption.


I would happily contribute all of my public source code under whatever license to a dataset that required models to also be open sourced. I am not OK with Microsoft creating a derivative work (Copilot's model) off of GPL code and not releasing the weights under the GPL.


I tend to agree, but to play devil's advocate, if we were talking about a biological neural network (person) training themselves by looking at GPL code, the GPL would of course not apply to code they release later in general.


> if we were talking about a biological neural network (person) training themselves by looking at GPL code...

The thing is, said person both reads far less code than a non-biological neural network, and emits its derivations based on many inputs beyond the code it ingested via its high-resolution, multifocal, adaptive light sensors. Including, but not limited to: experiences, communication with other biological neural networks, human-machine code translators (compilers), daily unpredictable hormone fluctuations, and infinitely many other inputs, all of which affected its cranial processes and its daily living circumstances and choices.

IOW, a neural network is neither a person, nor does it learn, derive, or emit the same way.

This is like claiming that a Furby is a person, just because it can babble and blink.


I'm not saying all neural networks are people; I'm saying people are a subset of neural networks by definition. We don't have any idea how consciousness works, and our brains are essentially still black boxes.

In a similar way, no one really understands intuitively how these ML models actually work (we treat them as mostly black boxes in practice), in contrast to looking at an equation, for instance. I have played with some of these text generation models, and frankly we are already at the point where deciding whether or not they pass the Turing test depends on the details rather than the spirit of the rules for the test. It may not be a coincidence that NNs designed to replicate our own brain structure also replicate important aspects of our cognition.


These are not living beings though. They are programs. No one is arguing against humans learning.

> Neural networks approximate the function represented by your data.


The amount of data (not just code) that we would need to get sign off on is prohibitively large. If you account for all the stakeholders, then this won't be easy at all.

Meanwhile, the institutions will leap ahead of us. Models and annotated data sets will forever be out of reach. Open source equivalents will be severely behind the status quo.


Strongly disagree: institutions have an advantage in already having the data, but they do not have the capability of creating more.

An intentional open source dataset could target new domains that there was no institutional will to pursue. I strongly believe that the open source community's capabilities far exceed that of any single large corporation.


> This is probably going to play out like Oracle vs. Google when Google suddenly realized that they should lose and intentionally threw the case.

What are you talking about here?


> I'm so worried about this case. Treating copyrighted training data as fair use, and letting models learn as a child might learn from a book or movie, is the best way to proceed. It widens the playing field for both development and disruption.

Which also opens the floodgates for bleeding GPL and other copylefted code into the proprietary realm. Very convenient, yes.

I’m never OK with someone making a derivation engine which offers my GPL code to a closed-source codebase.


> Oracle vs. Google when Google suddenly realized that they should lose and intentionally threw the case.

But Google won?


You're casting doubt on this with a frontal assault. I read your post and wanted to check out the 'show' that was implied, but immediately my eyes fell upon this:

Add our “Humans Only Clause” to your MIT license. Your code is still open source — for human developers only.

Sorely disappointed that there is no entertainment involved. That's actually a pretty cool idea, though.

So GitHub doesn't have (I could be wrong) a default license grant or an overriding licensing agreement. Your project, your license. If you change the license of your project, that is entirely your choice.

As to the question of whether we should be generous to our corporate masters, or take this opportunity to stick it to the man and get rewarded for our mind products and compensated for being geeks: society does owe us something, does it not? /G

It's worth having a discussion about it, imo.


Yeah, it's "my content," but I don't necessarily have the ability to grant them this, because people uploading code to GitHub don't always have the right to license the code they upload.


The only way to add or read the license is to give them read and write access to all public repos in your GitHub account. Strange.


I can touch on this: the read and write access is for future iterations to be able to automatically add/append the license text to files. Adding the licenses one by one is tedious.


A lot of people learned to code from completely free materials on the internet. In fact, it's actually amazing how many free learning resources there are on the internet. That's how we create a good internet ecosystem: not by making everyone pay for every little thing.

Teaching AI how to code is a continuation of building this ecosystem because people will use these generative AI coding tools, lowering the bar to code.


But if the AI has been trained on open source licensed code and occasionally can produce verbatim copies of said code, shouldn't the weights of the AI then also become open source under the same license?

You could argue that training is a form of compilation, and the weights are a derivative work.


Regardless of any legal issues surrounding this, an important consideration is that modifying any well known license will effectively mean your work will never get used by any large organization. No one wants to spend n hours of legal services vetting a custom license. They want to say "GPLv2, MIT, BSD is cool, everything else is banned".


It certainly raises some interesting questions. After a model has been trained, is there really any surefire way to prove that these models are profiting from your individual code? How is this different from, say, search indexing in Google? Imagine Wikipedia wants to sue Google for stealing their content. Google essentially keeps a mirror of Wikipedia and uses that data to serve up better search results (sometimes). But is there legal ground to stand on in such a situation? It seems hard to prove that Company X made Y dollars off your individual code or text, therefore Company X owes you money.

In any case it raises some other questions about intellectual property as a whole. If you can sue an AI model for profiting off your intellectual property, why can't you sue a human for the same? Say you read a book one day, and are so inspired that you go ahead and write a new book. Imagine you publish that new book and sell millions of copies. Are you obligated to pay royalties to the author who gave you inspiration? It seems to me that unless you're plagiarizing large chunks of the original work verbatim, you probably shouldn't owe the original author much of anything. LLMs do plagiarize, but they do so somewhat inconsistently due to their non-deterministic output (just like humans!).
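(As a toy illustration of that non-determinism, with made-up token scores rather than any real model: sampling from a temperature-scaled distribution yields different completions on different runs.)

  import numpy as np

  def sample_token(logits, temperature=0.8, rng=np.random.default_rng()):
      # Softmax over temperature-scaled logits; higher temperature
      # flattens the distribution and increases output variety.
      scaled = np.asarray(logits, dtype=float) / temperature
      probs = np.exp(scaled - scaled.max())
      probs /= probs.sum()
      return int(rng.choice(len(probs), p=probs))

  # Made-up next-token scores: the same "prompt" picks different
  # tokens across runs, so verbatim reproduction is inconsistent.
  print([sample_token([2.0, 1.0, 0.5]) for _ in range(10)])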


In some cases the model will spit out verbatim copies of code. E.g. https://twitter.com/docsparse/status/1581461734665367554?lan...

So in this case the weights are not even a derivative work but just a compressed copy of the original code.
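Spotting that kind of regurgitation doesn't take anything fancy, either. A minimal sketch, assuming you have the original source and the generated text saved locally (the filenames here are hypothetical):

  import difflib

  def longest_verbatim_span(original: str, generated: str) -> str:
      # Longest contiguous run of characters shared by both texts.
      sm = difflib.SequenceMatcher(None, original, generated, autojunk=False)
      m = sm.find_longest_match(0, len(original), 0, len(generated))
      return original[m.a:m.a + m.size]

  # Hypothetical filenames standing in for the tweet's example.
  original = open("sparse_matrix_transpose.c").read()
  generated = open("copilot_suggestion.c").read()
  span = longest_verbatim_span(original, generated)
  if len(span) > 200:  # arbitrary threshold for "clearly copied"
      print(f"{len(span)} chars reproduced verbatim:\n{span}")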


I see that your lawyers have reviewed the licensing terms. Great!

However, have they reviewed Microsoft's claims that their use of code for Copilot is open source? And if they have, is there somewhere I can read that analysis?

Now, Microsoft could be wrong on that claim, but until someone convinces me otherwise I'm going to assume their lawyers did their due diligence and they're correct. If Microsoft is correct about that, it doesn't matter what you put in your license, and thus this is useless.


I don't know how I missed this but I can't edit my comment now, I meant to say

"their use of code for Copilot is open source" -> "their use of code for Copilot is __fair use__"


Yeah, this is about as effective as making a FB post saying "I hereby revoke Meta's permission to use my photos". It's a surprisingly common theme on /r/oldpeoplefacebook.


A lot of web pages already have a copyright notice; why doesn't that stop AIs from training on the contents of web pages? Can you train AIs on patents? Or on binary executables? Or from cameras that observe the public? Or from public legal document repositories? Or from LexisNexis data? Or your personal health-tracking wearable data? What exactly is OK or not OK? AIs can train on just about anything. I have long had a suspicion that most sites with large user populations use their collected data to train a stock market algorithm (in addition to advertising algorithms). What rules apply to government use of AI trained on all the data governments collect?


> Who is PayToTrain created by and why?

> PayToTrain is created by a small group of developers and attorneys who are passionate about open source software and ensuring that developers are properly compensated for their work. The website and service are provided completely free of charge.

That doesn't answer the obvious implied questions of how much anyone should trust this effort -- such as whether it will sell them out to the infringers, either individually or as a "self-regulation" model that the infringers can point to in upcoming legal battles. And I'd be surprised if the attorneys didn't realize that.


This is a great idea, but I don't understand why I need to give this service full write access to my public repos. Shouldn't full read access suffice?


We wanted to be able, in the future, to automatically add/append the license text to files. Adding the licenses one by one would be tedious, so coming up with a way to bulk-add the licenses seemed essential.
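As a rough sketch of the kind of bulk-add script we have in mind (the path and clause text are placeholders, not our actual implementation; a real version would pick the right comment syntax per language):

  from pathlib import Path

  # Elided here; would hold the full Humans Only Clause text.
  CLAUSE = "Humans Only Clause: ..."

  def prepend_clause(repo_dir: str) -> None:
      # Walk a locally cloned repo and add the clause as a comment
      # header to every Python file that doesn't already carry it.
      for path in Path(repo_dir).rglob("*.py"):
          text = path.read_text(encoding="utf-8", errors="ignore")
          if "Humans Only Clause" in text:
              continue  # already annotated, skip
          header = "".join(f"# {line}\n" for line in CLAUSE.splitlines())
          path.write_text(header + text, encoding="utf-8")

  prepend_clause("/path/to/cloned/repo")  # hypothetical path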


Fair enough, but I have some heavily used open source projects on GitHub, so I can't take the risk. It would be great if there were a read-only option.

Also, some of my open source projects have an explicit clause which states that 'The source of this code shall not be misrepresented' so if chunks of my code showed up in some AI-generated code without attribution, it would be a pretty clear-cut case.


This is a very clear-cut case of a copyright violation by GitHub Copilot:

https://twitter.com/docsparse/status/1581461734665367554?lan...

In this instance the weights of the AI system just contain an obfuscated copy of the original source code.


As others have said before, it is fairly likely that the copyright violation happened before Copilot learned from it, namely by people copying this code and publishing it under a different license.


Thanks for taking the initiative. There is very little awareness of this and of its implications.

I am not a developer, but I understand how generative AI leverages your work to make things easier for someone else.

A similar thing needs to be done for images too.


> I am not a developer, but I understand how generative AI leverages your work to make things easier for someone else.

To me this sounds like it's antithetical to open source software because the point of making software open source is so that other people can leverage your work. It shouldn't matter if it's done through generative AI or through a human's brain.


> the point of making software open source is so that other people can leverage your work

The point is that other people can leverage your work under the terms you distribute it under. For the vast majority of open source licenses, that means giving attribution and including the copyright notice and license when distributing the source code or its derivatives. For others, it means all of that plus releasing derivatives under the same license.

If developers wanted to distribute their code under licenses with different terms, they would have, but they didn't.


But generative models don't spit out existing code; they generate new code that (sometimes) happens to be the same as existing code. Which is the same as what a human being does, just that an AI is much better at seeing a larger amount of existing work. There's no part of the model that stores a specific piece of code; it just happens to reproduce the same thing.

People often write code that looks like existing code that they've seen even if they're not aware of it, it's a blurry line. I see it as just banning AI from doing the same thing as humans just because it's better at it.

An argument could be made that it's fair for an AI to not attribute the code it outputs too. The human-human reason for attribution is "I wrote this code by doing X amount of work, since you're using it and it'll save you time, I should fairly be given attribution". But then the AI is also writing out the code that it's prompted for, it's just faster at doing it.

Why not create a tool that instead runs the AI-generated output through a check that provides proper attribution? Then you'd also catch human-written code that doesn't attribute the original author.
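A minimal sketch of such a check, assuming a local corpus of attributed files: fingerprint the corpus with hashed token n-grams, then look up any generated snippet (the corpus path and snippet are hypothetical).

  import hashlib
  from pathlib import Path

  N = 8  # tokens per fingerprint window

  def fingerprints(text: str):
      tokens = text.split()
      for i in range(len(tokens) - N + 1):
          window = " ".join(tokens[i:i + N])
          yield hashlib.sha1(window.encode()).hexdigest()

  # Index every window of a (hypothetical) attributed corpus.
  index = {}
  for path in Path("corpus").rglob("*.py"):
      for fp in fingerprints(path.read_text(errors="ignore")):
          index.setdefault(fp, path.name)

  def attribute(snippet: str):
      # Files sharing at least one whole window with the snippet.
      return sorted({index[fp] for fp in fingerprints(snippet) if fp in index})

  print(attribute("hypothetical AI-generated snippet goes here"))

Something like winnowing over normalized tokens would be more robust, but the idea is the same.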


Why shouldn't it matter? Many open source licenses require attribution, so it is reasonable to think one point of making software open source is to get attribution. Generative AI prevents getting attribution, so it matters whether it is through generative AI or human brain.


I think the issue is the work being usable by people for free versus for a fee paid to Microsoft, when it was (mostly) released under open source licenses meant to keep the software free and not a boon only to huge corporations. This seems like a sneaky way of getting around the licensing. Maybe we need a GPLv4 that covers usage by AI models.


Interesting.

I would be fine with an AI being trained on my code, provided that the weights of said AI would then be published under the same license as my code.

Is there a license for that?


Anyone have the text of the clause? I clicked the "get started" link, but it wanted to log in with my GitHub. I did not see it in the FAQ.


"Use of the Software by any person to train, teach, prompt, populate, or otherwise further or facilitate any so-called generative artificial intelligence, generative algorithm, generative adversarial network, generative model, or similar or related activity (or to attempt to perform any of the foregoing acts or activity), whether in connection with any so-called machine learning, deep learning, neural network, or similar or related framework, system, or model or otherwise, is strictly prohibited and beyond the limited scope of this license, absent prior payment to licensor of the licensing fee of the amount of ____"


TL;DR our lawyers wrote a clause to protect open source code from being used by generative AI companies for profit. You can find it here: paytotrain.ai

A legal grey area exists as to whether publicly available creations (code or art) can be used to train datasets for generative AI projects without infringing their creators' underlying copyrights. Other types of claims, such as violation of license agreements and DMCA violations, require proof of damages to substantiate.

The legal solution we’ve identified is to add a specific damages amount to the license itself — a licensing fee. The failure to pay such a fee would cause the creator to suffer damages in the amount of the fee. By embedding a licensing fee into a traditional open source license, a creator can solve the proof-of-damages issue that could otherwise limit a claim under the DMCA or for breach of contract, and limit the fee to generative AI companies.

That’s why we built the Humans Only Clause. If you don’t want your code used by Copilot in this way, the Humans Only Clause can help strengthen your protections against use for training purposes. It’s a simple addition to your existing open source license that keeps your code free to use and open source for other developers, but prevents use without attribution by generative AI companies.

You can access the Humans Only Clause and insert it into your GitHub repo by going to PayToTrain.ai — we also built a payments form where you can set your own licensing fee depending on how valuable you believe your repo to be. If we get enough people using this clause, there’s a good chance we can assemble a separate class for a future class action, where each user gets significantly higher damages than what’s available statutorily under existing DMCA lawsuits.

On a philosophical level, we believe that the open source community is based on principles of taking and giving back to the collective. AI-based programming assistants strip away any attribution while drawing from the underlying contributions of the community. We want the open source community to continue to be open source, but we don’t want big companies to profit on our code.

If you’re interested, check it out: paytotrain.ai. We’d love to hear what you think.


> AI-based programming assistants strip away any attribution while drawing from the underlying contributions of the community ... we believe that the open source community is based on principles of taking and giving back to the collective

How is this behavior different from that of 90% of human coders? Most SW devs scream if you ask them to pay for something, whether it's apps, code, TV, movies, etc. But they will happily try to build startups on top of a vast mountain of free code. I really don't care if my code gets sucked up by the AI vacuum; humans have been doing that quite well for a while now.


Thank you. Could you paste the Humans Only Clause here so we can read it?

Here was my attempt to write a clause prohibiting language model training/inference:

https://bugfix-66.com/7a82559a13b39c7fa404320c14f47ce0c304fa...

  3. Use in source or binary form for the construction or operation
     of predictive software generation systems is prohibited.

How does the Humans Only Clause fix the flaws in my attempt?

The Humans Only Clause adds an explicit licensing fee, and what else?

How is the clause worded?


"Use of the Software by any person to train, teach, prompt, populate, or otherwise further or facilitate any so-called generative artificial intelligence, generative algorithm, generative adversarial network, generative model, or similar or related activity (or to attempt to perform any of the foregoing acts or activity), whether in connection with any so-called machine learning, deep learning, neural network, or similar or related framework, system, or model or otherwise, is strictly prohibited and beyond the limited scope of this license, absent prior payment to licensor of the licensing fee of the amount of ____"


This seems to prohibit benign activities like importing the code into an IDE that contains auto-complete.


A necessary evil. You point out a cost. The benefits outweigh that cost.


That clause appears to violate Item #6 of the OSI's open source definition:

> The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

It also seems to violate freedom 0 of the FSF's four essential freedoms that define free software:

> The freedom to run the program as you wish, for any purpose (freedom 0).

I'm not sure this can be used by open source projects if they want to remain open source projects.


Thank you. I see the Humans Only Clause is much more explicit about what is prohibited than the No-AI 3-Clause License, and furthermore directly states a licensing fee.


That wouldn't license code. That's just license text that happens to be in a string. Not sure why this bugfix site is being used instead of just pastebin with text.


I'm presenting the license text in a creative and unusual way that real hackers might enjoy.

If that confused you, or you consider it "obnoxious", then you are not the target audience.

That's ok. Hacker News is not 100% hackers!


I think it is more likely you are just baldly shilling your website. Which is a shame, because I think the site isn't a bad idea - I like the minimal interface and the thought you've put into hints for many not-quite-right solutions. I've seen quality links you've posted in the past with more interesting, subtle, and relevant bugs - but this isn't one of them.


I have a different Hacker News account for every one of my projects. It happens that the No-AI 3-Clause License became part of the BUGFIX-66 project. I'm sorry that upsets you, but I'm sure you'll get over it. Happy Thanksgiving.


It's more obnoxious than creative and unusual.



