Courts (at least in the US) have already ruled that use of ingested data for training is transformative. There are lots of details to figure out, but the genie is out of the bottle.
Sure it’s a big hill to climb in rethinking IP laws to align with a societal desire that generating IP continue to be a viable economic work product, but that is what’s necessary.
> Courts (at least in the US) have already ruled that use of ingested data for training is transformative
Yes, the training of the model itself is (or should be) a transformative act so you can train a model on whatever you have legal access to view.
However, that doesn't mean that the output of the model is automatically not infringing. If the model is prompted to create a copy of some copyrighted work, that is (or should be) still a violation.
Just like memorizing a book isn't infringement but reproducing a book from memory is.
The fact that GitHub’s Copilot has an enterprise feature that matches model output against code having certain licenses - in order to prevent you from using it, with a notification - suggests the model outputs are at least potentially infringing.
If MS were compelled to reveal how these completions are generated, there’s at least a possibility that they directly use public repositories to source text chunks that their “model” suggested were relevant (quoted as it could be more than just a model, like vector or search databases or some other orchestration across multiple workloads).
> suggests the model outputs are at least potentially infringing.
The only thing it suggests is that they recognize that a subset of users worry about it. Whether or not GitHub worries about it any further isn’t suggested.
Don’t think about it from an actual “rights” perspective. Think about the entire copyright issue as a “too big to fail” issue.
Not really, no. If you’re specifically referring to, say, GPL or BSD or other Open Source licenses, it’s a bit more unsettled, but software licensing as a whole has several decades of case law at this point.
No more so than regurgitating an entire book. While it could technically be possible in the case of certain repos that are ubiquitous on the internet (and therefore overrepresented in training data to the point that they are "regurgitated" verbatim, in whole), it is extremely unlikely and would only occur after deliberate prompting. Discovery in the NYT suit against OpenAI showed that the NYT was only able to get partial results after deliberately prompting the model with portions of the very text they were trying to force it to regurgitate.
So. Yes, technically possible. But impossible by accident. Furthermore, when you make this argument you reveal that you don't understand how these models work. They do not simply compress all the data they were trained on into a tiny storable version. They are effectively matrices of weights that let you do the math to predict the most likely next token (read: a few characters of text) given some input.
So the model does not "contain" code. It "contains" a way of doing calculations for predicting what text comes next.
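A minimal sketch of what "a way of doing calculations for predicting what text comes next" means. This is a toy, nothing like a real transformer: the vocabulary, the weight matrix, and the input vector below are all invented for illustration. The point is only that prediction is matrix math over scores, not lookup in a stored archive of text:

```python
import math

# Invented toy vocabulary for illustration only.
VOCAB = ["def", " add", "(a", ", b", "):", " return", " a", " +", " b"]

def softmax(scores):
    # Turn raw scores into a probability distribution.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next(weights, context_vec):
    # One matrix-vector multiply: scores[i] = sum_j weights[i][j] * context_vec[j].
    scores = [sum(w * x for w, x in zip(row, context_vec)) for row in weights]
    probs = softmax(scores)
    return VOCAB[probs.index(max(probs))]

# A hand-made weight matrix that happens to favor token index 5 (" return").
weights = [[0.0] * 3 for _ in range(len(VOCAB))]
weights[5] = [1.0, 1.0, 1.0]
print(predict_next(weights, [0.2, 0.5, 0.3]))  # -> " return"
```

The "knowledge" lives entirely in the weights; no training text is stored anywhere in this structure.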
Finally, suppose the model does spit out not an entire work, but a handful of lines of code that appear in some codebase.
This does not constitute copyright infringement, as the lines in question a) represent a tiny portion of the whole work (and copyright only protects against the reproduction of whole works or significant portions of a work), and b) there are a limited number of ways to accomplish a certain function, and it is not only possible but inevitable that two devs working independently could arrive at the same implementation. Therefore using an identical implementation of a part of a work (which is what this case would be) is no more illegal than the use of a certain chord progression, melodic phrasing, or drum rhythm. Courts have ruled on this thoroughly.
Yes, that is one of those works that is over-represented in the training data, as I explained in the part of the comment you clearly did not comprehend.
I don't see why verbatim or not should matter at all.
How complex does a mechanical transformation have to be to not be considered plagiarism, copyright infringement or parasitism?
If somebody writes a GPL-licensed program, is it enough to change all variable and function names to get rid of those pesky users' rights? Do you have to change the order of functions? Do you have to convert it to a different language? Surely nobody would claim c2rust is transformative even though the resulting code can be wildly different if you apply enough mechanical transformations.
All LLMs do is make the mechanical transformations 1) probabilistic 2) opaque 3) all at once 4) using multiple projects as a source.
> How complex does a mechanical transformation have to be to not be considered plagiarism, copyright infringement or parasitism?
Legally speaking, this varies from domain to domain. But consider, for example, extracting facts from several biology textbooks and then delivering those facts to the user in the characteristic ChatGPT tone, distinguishable from the style of each source textbook. You can then be quite assured that courts will not find that you have infringed copyright.
> Sure it’s a big hill to climb in rethinking IP laws to align with a societal desire that generating IP continue to be a viable economic work product, but that is what’s necessary.
Well, AI can perhaps solve the problem it created here: generated IP with AI is much cheaper than with humans, so it will be viable even at lower payoffs.
Less cynical: you can use trade secrets to protect your IP. You can host your software and only let customers interact with it remotely, like what Google (mostly) does.
Of course, this is a very software-centric view. You can't 'protect' eg books or music in this way.
Eh, they figured out how to copyright photographs, where the human only provides a few bits (setting up the scene, deciding when to pull the trigger etc); so stretching a few bits of human input to cover the whole output of an AI should also be doable with sufficiently well paid lawyers.
Tell that to Reddit. They’re AI-translating user posts and serving them up as separate Google search results. I don’t remember if Reddit claims copyright on user-submitted content, or on its AI translations, but I don’t think Reddit is paying ad share like X is, either, so it kind of doesn’t matter to the user, as they’re (still) not getting paid, even as Reddit collects money for every ad shown/clicked. Even if OP did write it, an AI translated the version shown.
reddit is a user-hostile company; it has been forever and always will be. It takes rights over your content, farms data about you, sells that data, does invasive things in its mobile apps, uses creepware cookies, etc.
Excerpt from the user agreement:
When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit. For example, this license includes the right to use Your Content to train AI and machine learning models, as further described in our Public Content Policy. You also agree that we may remove metadata associated with Your Content, and you irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.
People put their heads in the sand over reddit for some reason, but it's worse than FAANG.
With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed.
To a certain reading, this is user-centric: it’s increasing the size of the audience pool beyond that of shared language speakers and readers to the entire literate human race. This is an important point to acknowledge, because every silver lining has its cloud.
It's not about being mandatory. It's about having a privileged position and using it to extract value from people.
The nature of network effects is such that once a site gets as big as reddit (or facebook or tiktok or whichever), it's nearly impossible for competition to take over in the same design space.
Many communities (both small and large) are only present on specific platforms (sometimes only one) and if you want to participate you have to accept their terms or exclude yourself socially.
If they have a privileged position, it is earned and freely given. No one is obligated to use the site. The issue is more one of the commons being enclosed and encircled by corporate interests, and then branded and commodified. Once the deed is done, there is no reason for folks to leave, because everyone they know is there.
Most communities on Reddit that I’d care to be a part of have additional places to gather, but I do take your point that there are few good alternatives to r/jailbreak, for example.
The host always sets its own rules. How else could anything actually get done? The coordination problem is hard enough as it is. It’s a wonder that social media exists at all.
Gatekeepers will always exist adjacent to the point of entry, otherwise every site turns extremist and becomes overrun with scammers and spammers.
> Courts (at least in the US) have already ruled that use of ingested data for training is transformative.
If you have code that happens to be identical to someone else's code or implements someone's proprietary algorithm, you're going to lose in court even if you claim an "AI" gave it to you.
AI is training on private Github repos and coughing them up. I've had it regurgitate a very well written piece of code to do a particular computational geometry algorithm. It presented perfect, idiomatic Python with perfect tests that caught all the degenerate cases. That was obviously proprietary code--no amount of searching came up with anything even remotely close (it's why I asked the AI, after all).
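The proprietary code in question obviously isn't shown, but for a sense of the genre, here is a sketch of a classic computational geometry routine where degenerate inputs (duplicate points, collinear points, fewer than three points) are exactly what tests need to catch. This is a generic textbook implementation of Andrew's monotone chain convex hull, not the code described above:

```python
def cross(o, a, b):
    # Cross product of OA x OB: >0 counter-clockwise, <0 clockwise, 0 collinear.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain. Handles the degenerate cases:
    duplicates (deduped), collinear runs (dropped), and tiny inputs."""
    pts = sorted(set(points))
    if len(pts) <= 1:
        return pts

    def half_hull(seq):
        hull = []
        for p in seq:
            # <= 0 pops collinear points as well as right turns.
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    return lower[:-1] + upper[:-1]

print(convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))
# -> [(0, 0), (2, 0), (2, 2), (0, 2)]
```

Note how much of the code exists purely to survive the degenerate cases; that is where independently written implementations tend to diverge, and where tests earn their keep.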
>If you have code that happens to be identical to someone else's code or implements someone's proprietary algorithm, you're going to lose in court even if you claim an "AI" gave it to you.
Not for a dozen lines here or there, even if they could be found and identified in a massive code base. That’s like quoting a paragraph of a book in another book: non-infringing.
For the second half of your comment, it sounds like you’re saying you got results that were too good to be AI. That’s a bit “no true Scotsman”, at least without more detail. But implementing an algorithm, even a complex one, is very much something an LLM can do. Algorithms are much better defined and scoped than general natural language, and LLMs do a reasonable job of translating natural language to programming languages; an algorithm is a narrow subset of that task type with better-defined context and syntax.
> Not for a dozen lines here or there, even if it could be found and identified in a massive code base. That’s like quoting a paragraph of a book in another book, non infringing.
It's potentially non-infringing in a book if you quote it in a plausible way, and properly.
If you copy&paste a paragraph from another book into yours, it's infringing, and a career-ending scandal. There's plenty of precedent on that.
Just like if you manually copied a function out of some GPL code and pasted it into your own.
What will happen when company A implements algorithm X based on AI output, company B does the same and company A claims that it is proprietary code and takes company B to court?
It cannot do anything on its own, it's just a (very complex, probabilistic) mechanical transformation (including interpolation) of training data and a prompt.
Advertising autocomplete as AI was a genius move because people start humanizing it and look for human-centric patterns.
Thinking A"I" can do anything on its own is like seeing faces in rocks on Mars.
The idea that something that can't handle simple algorithms (e.g. counting the number of times a letter occurs in a word) could magically churn out far more advanced algorithms complete with tests is… well, it's a bit of a stretch.
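For contrast, the letter-counting task that token-based models famously stumble on is a one-liner in actual code (a sketch, nothing more; models fail at it because they see tokens rather than characters, not because the algorithm is hard):

```python
def count_letter(word, letter):
    # Count case-insensitive occurrences of a single letter in a word.
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # -> 3
```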
LLMs aren't good at rote memorization. They can't even get quotations of humans right.
It's easier for the LLM to rewrite an idiomatic computational geometry algorithm from scratch in a language it understands well like Python. Entire computational geometry textbooks and research papers are in its knowledge base. It doesn't have to copy some proprietary implementation.
That seems a real stretch. GPT-5 just invented new math, for reference. What you are saying would be equivalent to saying that this math was obviously in some paper the mathematician did not know about. Maybe true, but it's a far reach.
It invented "new math" as much as I invented "new food" when I was cooking yesterday. It did a series of quite complicated calculations that would take a well trained human several hours or even days to do - still impressive, but no it's not new maths.
Obviously not ChatGPT. But ChatGPT isn't the sharpest stick on the block by a significant margin. It is a mistake to judge what AIs can do based on what ChatGPT does.
This would be the first time ever that an LLM has discovered new knowledge, but the far reach is that the information does appear in the training data?
They've been doing it for a while. Gemini has also discovered new math and new algorithms.
There is an entire research field of scientific discovery using LLMs together with sub-disciplines for the various specialization. LLMs routinely discover new things.
I hadn't heard of that, so I did some searching and the single source for the claim I can find is a Google white paper. That doesn't automatically mean it's false, of course, but it is curious that the only people ostensibly showing LLMs discover new things are the companies offering the LLMs.
Citation needed, and I call bullshit. Unless you mean that they hallucinate useless algorithms that do not work, which they do.
LLMs do not have an internal model for manipulating mathematical objects. They cannot, by design, come up with new algorithms unless they are very nearly the same as some other algorithm. I'm a computer science researcher and have not heard of a single algorithm created by LLM.
This article is about the same thing I mentioned in a sibling comment. I personally don't find an unreplicated Google white paper to be compelling evidence.
The AI coming up with it? When Google claimed their Wizard of Oz show at the Las Vegas Sphere was AI-generated, a ton of VFX artists spoke up to say they'd spent months of human labor working on it. Forgive me for not giving the benefit of the doubt to a company that has a vested interest in making their AI seem more powerful, and a track record of lying to do so.
Maybe I should clarify: society, in general, supports the idea that writers, artists, filmmakers, coders, etc. (everyone who creates IP) should have a place in the economy. Basically just that it should be possible to make a living and have a career at it. It can be spun different ways, and those differences are important, but this is the basic thing.
This doesn’t seem like a disputable statement to me. The view that actors’ likenesses, authors’ words, all of it should be up for grabs once written or put anywhere in public is not a widely held one.
Once that’s established, it all comes down to implementation details.
If you mean the ruling has absolutely no applicability when it comes to using the model then, no, that is incorrect:
Judge Alsup, in his ruling, specifically likened the process to reading text and then using the knowledge to write something else. That’s training and use.
A compression algorithm doesn't transform the data; it stores it in a different format. Storing a story in a txt file vs. a Word file doesn't transform the data.
An llm is looking at the shape of words and ideas over scale and using that to provide answers.
No, a compression algorithm does transform the data, particularly a lossy one. The pixels stored in the output are not the input pixels; they're new pixels. That's why you can't un-compress a JPEG: it's a new image that just happens to look like the original. But it might not even look like it; some JPEGs are so deep-fried they become their own form of art, which is very popular in meme culture.
The only difference, really, is we know how a JPEG algorithm works. If I wanted to, I could painstakingly make a jpeg by hand. We don't know how LLMs work.
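Real JPEG is a DCT-plus-quantization pipeline, but the lossy step can be sketched with a much simpler quantizer. This toy example (not JPEG, just an illustration of the same principle) shows why decompressed pixels are new pixels rather than the originals:

```python
def compress(pixels, step=16):
    # Lossy "compression": quantize each 0-255 value to a coarser grid.
    # Information is discarded here; the originals are unrecoverable.
    return [p // step for p in pixels]

def decompress(levels, step=16):
    # Reconstruct pixels near, but generally not equal to, the originals.
    return [lvl * step + step // 2 for lvl in levels]

original = [3, 200, 255, 17, 100]
restored = decompress(compress(original))
print(restored)               # -> [8, 200, 248, 24, 104]
print(restored == original)   # -> False: a new image that resembles the old
```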