Courts (at least in the US) have already ruled that use of ingested data for training is transformative. There are lots of details to figure out, but the genie is out of the bottle.
Sure it’s a big hill to climb in rethinking IP laws to align with a societal desire that generating IP continue to be a viable economic work product, but that is what’s necessary.
> Courts (at least in the US) have already ruled that use of ingested data for training is transformative
Yes, the training of the model itself is (or should be) a transformative act so you can train a model on whatever you have legal access to view.
However, that doesn't mean that the output of the model is automatically not infringing. If the model is prompted to create a copy of some copyrighted work, that is (or should be) still a violation.
Just like memorizing a book isn't infringement but reproducing a book from memory is.
The fact that GitHub’s Copilot has an enterprise feature that matches model output against code having certain licenses - in order to prevent you from using it, with a notification - suggests the model outputs are at least potentially infringing.
If MS were compelled to reveal how these completions are generated, there’s at least a possibility that they directly use public repositories to source text chunks that their “model” suggested were relevant (quoted as it could be more than just a model, like vector or search databases or some other orchestration across multiple workloads).
> suggests the model outputs are at least potentially infringing.
The only thing it suggests is that they recognize that a subset of users worry about it. Whether or not GitHub worries about it any further isn’t suggested.
Don’t think about it from an actual “rights” perspective. Think about the entire copyright issue as a “too big to fail” issue.
Not really, no. If you’re specifically referring to, say, GPL or BSD or other Open Source licenses, it’s a bit more unsettled, but software licensing as a whole has several decades of case law at this point.
No more so than regurgitating an entire book. While it could technically be possible in the case of certain repos that are ubiquitous on the internet (and therefore overrepresented in training data to the point that they are "regurgitated" verbatim, in whole), it is extremely unlikely and would only occur after deliberate prompting. Discovery in the NYT suit against OpenAI showed that the NYT was only able to get partial results after deliberately prompting the model with portions of the very text they were trying to force it to regurgitate.
So. Yes, technically possible. But impossible by accident. Furthermore, when you make this argument you reveal that you don't understand how these models work. They do not simply compress all the data they were trained on into a tiny storable version. They are effectively matrices of weights that let you do the math to predict the most likely next token (read: a few characters of text) given some input.
So the model does not "contain" code. It "contains" a way of doing calculations for predicting what text comes next.
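A minimal sketch of what "a way of doing calculations for predicting what text comes next" means. This is a toy, nothing like a real transformer: the vocabulary, the weight matrix, and the input vector below are all invented for illustration. The point is only that prediction is matrix math over scores, not lookup in a stored archive of text:

```python
import math

# Invented toy vocabulary for illustration only.
VOCAB = ["def", " add", "(a", ", b", "):", " return", " a", " +", " b"]

def softmax(scores):
    # Turn raw scores into a probability distribution.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next(weights, context_vec):
    # One matrix-vector multiply: scores[i] = sum_j weights[i][j] * context_vec[j].
    scores = [sum(w * x for w, x in zip(row, context_vec)) for row in weights]
    probs = softmax(scores)
    return VOCAB[probs.index(max(probs))]

# A hand-made weight matrix that happens to favor token index 5 (" return").
weights = [[0.0] * 3 for _ in range(len(VOCAB))]
weights[5] = [1.0, 1.0, 1.0]
print(predict_next(weights, [0.2, 0.5, 0.3]))  # -> " return"
```

The "knowledge" lives entirely in the weights; no training text is stored anywhere in this structure.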
Finally, suppose the model does spit out not an entire work, but a handful of lines of code that appear in some codebase.
This does not constitute copyright infringement, as the lines in question a) represent a tiny portion of the whole work (and copyright only protects against the reproduction of whole works or significant portions of a work), and b) there are a limited number of ways to accomplish a certain function, and it is not only possible but inevitable that two devs working independently could arrive at the same implementation. Therefore using an identical implementation of a part of a work (which is what this case would be) is no more illegal than the use of a certain chord progression, melodic phrasing, or drum rhythm. Courts have ruled on this thoroughly.
Yes, that is one of those works that is over-represented in the training data, as I explained in the part of the comment you clearly did not comprehend.
I don't see why verbatim or not should matter at all.
How complex does a mechanical transformation have to be to not be considered plagiarism, copyright infringement or parasitism?
If somebody writes a GPL-licensed program, is it enough to change all variable and function names to get rid of those pesky users' rights? Do you have to change the order of functions? Do you have to convert it to a different language? Surely nobody would claim c2rust is transformative even though the resulting code can be wildly different if you apply enough mechanical transformations.
All LLMs do is make the mechanical transformations 1) probabilistic 2) opaque 3) all at once 4) using multiple projects as a source.
> How complex does a mechanical transformation have to be to not be considered plagiarism, copyright infringement or parasitism?
Legally speaking, this varies from domain to domain. But consider, for example, extracting facts from several biology textbooks and then delivering those facts to the user in the characteristic ChatGPT tone, distinguishable from the style of each source textbook. You can then be quite assured that courts will not find that you have infringed copyright.
> Sure it’s a big hill to climb in rethinking IP laws to align with a societal desire that generating IP continue to be a viable economic work product, but that is what’s necessary.
Well, AI can perhaps solve the problem it created here: generated IP with AI is much cheaper than with humans, so it will be viable even at lower payoffs.
Less cynical: you can use trade secrets to protect your IP. You can host your software and only let customers interact with it remotely, like what Google (mostly) does.
Of course, this is a very software-centric view. You can't 'protect' eg books or music in this way.
Eh, they figured out how to copyright photographs, where the human only provides a few bits (setting up the scene, deciding when to pull the trigger etc); so stretching a few bits of human input to cover the whole output of an AI should also be doable with sufficiently well paid lawyers.
Tell that to Reddit. They’re AI-translating user posts and serving them up as separate Google search results. I don’t remember if Reddit claims copyright on user-submitted content, or on its AI translations, but I don’t think Reddit is paying ad share like X is, either, so it kind of doesn’t matter to the user, as they’re (still) not getting paid, even as Reddit collects money for every ad shown/clicked. Even if OP did write it, an AI translated the version shown.
reddit is a user-hostile company; it has been forever and always will be. It takes rights over your content, farms data about you, sells that data, does invasive things in its mobile apps, uses creepware cookies, etc.
Excerpt from the user agreement:
When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit. For example, this license includes the right to use Your Content to train AI and machine learning models, as further described in our Public Content Policy. You also agree that we may remove metadata associated with Your Content, and you irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.
People put their heads in the sand over reddit for some reason, but it's worse than FAANG.
With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed.
To a certain reading, this is user-centric: it’s increasing the size of the audience pool beyond that of shared language speakers and readers to the entire literate human race. This is an important point to acknowledge, because every silver lining has its cloud.
It's not about being mandatory. It's about having a privileged position and using it to extract value from people.
The nature of network effects is such that once a site gets as big as reddit (or facebook or tiktok or whichever), it's nearly impossible for competition to take over in the same design space.
Many communities (both small and large) are only present on specific platforms (sometimes only one) and if you want to participate you have to accept their terms or exclude yourself socially.
If they have a privileged position, it is earned and freely given. No one is obligated to use the site. The issue is more one of the commons being enclosed and encircled by corporate interests, and then branded and commodified. Once the deed is done, there is no reason for folks to leave, because everyone they know is there.
Most communities on Reddit that I’d care to be a part of have additional places to gather, but I do take your point that there are few good alternatives to r/jailbreak, for example.
The host always sets its own rules. How else could anything actually get done? The coordination problem is hard enough as it is. It’s a wonder that social media exists at all.
Gatekeepers will always exist adjacent to the point of entry, otherwise every site turns extremist and becomes overrun with scammers and spammers.
> Courts (at least in the US) have already ruled that use of ingested data for training is transformative.
If you have code that happens to be identical to someone else's code or implements someone's proprietary algorithm, you're going to lose in court even if you claim an "AI" gave it to you.
AI is training on private Github repos and coughing them up. I've had it regurgitate a very well written piece of code to do a particular computational geometry algorithm. It presented perfect, idiomatic Python with perfect tests that caught all the degenerate cases. That was obviously proprietary code--no amount of searching came up with anything even remotely close (it's why I asked the AI, after all).
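The proprietary code in question obviously isn't shown, but for a sense of the genre, here is a sketch of a classic computational geometry routine where degenerate inputs (duplicate points, collinear points, fewer than three points) are exactly what tests need to catch. This is a generic textbook implementation of Andrew's monotone chain convex hull, not the code described above:

```python
def cross(o, a, b):
    # Cross product of OA x OB: >0 counter-clockwise, <0 clockwise, 0 collinear.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain. Handles the degenerate cases:
    duplicates (deduped), collinear runs (dropped), and tiny inputs."""
    pts = sorted(set(points))
    if len(pts) <= 1:
        return pts

    def half_hull(seq):
        hull = []
        for p in seq:
            # <= 0 pops collinear points as well as right turns.
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    return lower[:-1] + upper[:-1]

print(convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))
# -> [(0, 0), (2, 0), (2, 2), (0, 2)]
```

Note how much of the code exists purely to survive the degenerate cases; that is where independently written implementations tend to diverge, and where tests earn their keep.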
>If you have code that happens to be identical to someone else's code or implements someone's proprietary algorithm, you're going to lose in court even if you claim an "AI" gave it to you.
Not for a dozen lines here or there, even if they could be found and identified in a massive code base. That’s like quoting a paragraph of a book in another book: non-infringing.
For the second half of your comment, it sounds like you’re saying you got results that were too good to be AI. That’s a bit “no true Scotsman”, at least without more detail. But implementing an algorithm, even a complex one, is very much something an LLM can do. Algorithms are much better defined and scoped than general natural language, and LLMs do a reasonable job of translating natural language to programming languages; an algorithm is a narrow subset of that task type with better-defined context and syntax.
> Not for a dozen lines here or there, even if it could be found and identified in a massive code base. That’s like quoting a paragraph of a book in another book, non infringing.
It's potentially non-infringing in a book if you quote it in a plausible way, and properly.
If you copy&paste a paragraph from another book into yours, it's infringing, and a career-ending scandal. There's plenty of precedent on that.
Just like if you manually copied a function out of some GPL code and pasted it into your own.
What will happen when company A implements algorithm X based on AI output, company B does the same and company A claims that it is proprietary code and takes company B to court?
It cannot do anything on its own, it's just a (very complex, probabilistic) mechanical transformation (including interpolation) of training data and a prompt.
Advertising autocomplete as AI was a genius move because people start humanizing it and look for human-centric patterns.
Thinking A"I" can do anything on its own is like seeing faces in rocks on Mars.
The idea that something that can't handle simple algorithms (e.g. counting the number of times a letter occurs in a word) could magically churn out far more advanced algorithms complete with tests is… well, it's a bit of a stretch.
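For contrast, the letter-counting task that token-based models famously stumble on is a one-liner in actual code (a sketch, nothing more; models fail at it because they see tokens rather than characters, not because the algorithm is hard):

```python
def count_letter(word, letter):
    # Count case-insensitive occurrences of a single letter in a word.
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # -> 3
```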
LLMs aren't good at rote memorization. They can't even get quotations of humans right.
It's easier for the LLM to rewrite an idiomatic computational geometry algorithm from scratch in a language it understands well like Python. Entire computational geometry textbooks and research papers are in its knowledge base. It doesn't have to copy some proprietary implementation.
That seems a real stretch. GPT-5 just invented new math, for reference. What you are saying would be equivalent to saying that this math was obviously in some paper the mathematician did not know about. Maybe true, but it's a far reach.
It invented "new math" as much as I invented "new food" when I was cooking yesterday. It did a series of quite complicated calculations that would take a well trained human several hours or even days to do - still impressive, but no it's not new maths.
Obviously not ChatGPT. But ChatGPT isn't the sharpest stick on the block by a significant margin. It is a mistake to judge what AIs can do based on what ChatGPT does.
This would be the first time ever that an LLM has discovered new knowledge, but the far reach is that the information does appear in the training data?
They've been doing it for a while. Gemini has also discovered new math and new algorithms.
There is an entire research field of scientific discovery using LLMs together with sub-disciplines for the various specialization. LLMs routinely discover new things.
I hadn't heard of that, so I did some searching and the single source for the claim I can find is a Google white paper. That doesn't automatically mean it's false, of course, but it is curious that the only people ostensibly showing LLMs discover new things are the companies offering the LLMs.
Citation needed, and I call bullshit. Unless you mean that they hallucinate useless algorithms that do not work, which they do.
LLMs do not have an internal model for manipulating mathematical objects. They cannot, by design, come up with new algorithms unless they are very nearly the same as some other algorithm. I'm a computer science researcher and have not heard of a single algorithm created by LLM.
This article is about the same thing I mentioned in a sibling comment. I personally don't find an unreplicated Google white paper to be compelling evidence.
The AI coming up with it? When Google claimed their Wizard of Oz show at the Las Vegas Sphere was AI-generated, a ton of VFX artists spoke up to say they'd spent months of human labor working on it. Forgive me for not giving the benefit of the doubt to a company that has a vested interest in making their AI seem more powerful, and a track record of lying to do so.
Maybe I should clarify: society, in general, supports the idea that writers, artists, filmmakers, coders, etc. (everyone who creates IP) should have a place in the economy. Basically just that it should be possible to make a living and have a career at it. It can be spun different ways, and those differences are important, but this is the basic thing.
This doesn’t seem like a disputable statement to me. The view that actors’ likenesses, authors’ words, all of it should be up for grabs once written or put anywhere in public is not a widely held one.
Once that’s established, it all comes down to implementation details.
If you mean the ruling has absolutely no applicability when it comes to using the model then, no, that is incorrect:
Judge Alsup, in his ruling, specifically likened the process to reading text and then using the knowledge to write something else. That’s training and use.
A compression algorithm doesn't transform the data; it stores it in a different format. Storing a story in a txt file vs. a Word file doesn't transform the data.
An llm is looking at the shape of words and ideas over scale and using that to provide answers.
No, a compression algorithm does transform the data, particularly a lossy one. The pixels stored in the output are not the input pixels; they're new pixels. That's why you can't un-compress a JPEG: it's a new image that just happens to look like the original. But it might not even look like it; some JPEGs are so deep-fried they become their own form of art, which is very popular in meme culture.
The only difference, really, is we know how a JPEG algorithm works. If I wanted to, I could painstakingly make a jpeg by hand. We don't know how LLMs work.
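Real JPEG is a DCT-plus-quantization pipeline, but the lossy step can be sketched with a much simpler quantizer. This toy example (not JPEG, just an illustration of the same principle) shows why decompressed pixels are new pixels rather than the originals:

```python
def compress(pixels, step=16):
    # Lossy "compression": quantize each 0-255 value to a coarser grid.
    # Information is discarded here; the originals are unrecoverable.
    return [p // step for p in pixels]

def decompress(levels, step=16):
    # Reconstruct pixels near, but generally not equal to, the originals.
    return [lvl * step + step // 2 for lvl in levels]

original = [3, 200, 255, 17, 100]
restored = decompress(compress(original))
print(restored)               # -> [8, 200, 248, 24, 104]
print(restored == original)   # -> False: a new image that resembles the old
```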