
So, the law has this concept of 'de minimis' infringement, where if you take a very small amount - like, way smaller than even a fair use - the courts don't care. If you're taking a handful of word probabilities from every book ever written, then the portion taken from each work is very, very low, so courts aren't likely to care.

If you're only training on a handful of works, then you're taking more from each of them, meaning it's not de minimis.

For the record, I got this legal theory from Cory Doctorow[0], but I'm skeptical. It's very plausible, but at the same time, we also thought sampling in music was de minimis until the Sixth Circuit said otherwise. Copyright law is extremely malleable in the presence of moneyed interests, sometimes even without Congressional intervention!

[0] who is NOT pro-AI, he just thinks labor law is a better bulwark against it than copyright



You don't even need to go this far.

The word probabilities are transformative use, a form of fair use, and aren't an issue.

The specific output at each point in time is what would be judged to be fair use or copyright infringing.

I'd argue the user is responsible for ensuring they don't use the output in a copyright-infringing manner (e.g. distributing it for profit), since they fed the inputs into the model that led to that output. In the same way, you can't sue Microsoft because someone typed up copyrighted works in Microsoft Word and then distributed them for profit.

De minimis is still helpful here; not all infringements are noteworthy.


MS Word does not actively collect and process texts from all available sources, and it does not offer them in recombined form. MS Word is passive, whereas the whole point of an LLM is to produce output using a model trained on ingested data. It is actively processing vast amounts of text with the intent of making it available for others to use, and the T&C state that the user owns the copyright to outputs based on the works of other copyright owners. LLMs give the user a CCL (Collateralised Copyright Liability, a bit like a CDO) without a way of tracing the sources used to train the model.


Legally, copyright is only concerned with the specific end work: a standalone object, unique or not, that is being scrutinized, if this analogy helps.

The process involved in obtaining that end work is irrelevant to any copyright case. A claim can be made against the model's weights (not viable, as they're fair use), or against a specific one-off output (less clear), but the system can't be looked at as a whole.


I don't think that's accurate. The US Copyright Office last year issued guidance that basically said anything generated with AI can't be copyrighted, as human authorship/creation is required for copyright. Works can incorporate AI-generated content, but then those parts aren't covered by copyright.

https://www.federalregister.gov/documents/2023/03/16/2023-05...

So I think the law, at least as currently interpreted, does care about the process.

Though maybe you meant whether a new work infringes an existing copyright? This guidance is clearly about new copyrights.


These are two sides of the same coin, and what I'm saying still stands. This is talking about who you attribute authorship to when copyrighting a specific work. Basically, on the application form, the author must be a human. The reason it's worth them clarifying is that they've received applications attributing authorship to AIs, and since legal persons that aren't human do exist (such as companies), they're just making it clear the author has to be human.

Who created the work? The user who instructed the AI (it's a tool); you can't attribute it to the AI. That would be the equivalent of crediting Photoshop as co-author on your work.


Couldn't you just generate it with AI then say you wrote it? How could anyone prove you wrong?


That's what you're supposed to do. No need to hide it either :).


First, I agree with nearly everything that you wrote. Very thoughtful post! However, I have some issues with the last sentence.

    > Collateralised Copyright Liability
Is this a real legal / finance term or did you make it up?

Also, I do not follow your leap to compare LLMs to CDOs (collateralised debt obligations). And do you specifically mean CDOs, or any kind of mortgage / commercial loan structured finance deal?


My analogy is based on the fact that nobody could see what was inside CDOs, nor did they want to; all they wanted was to pass them on to the next sucker. It was all fun until it all blew up. LLM operators behave in the same way with copyrighted material. For context, read https://nymag.com/news/business/55687/


    > nobody could see what was inside CDOs
Absolutely not true. Where did you get that idea? When pricing the bonds from a CDO, you get to see the initial collateral. As a bond owner, you receive monthly updates about any portfolio changes. Weirdly, CDOs frequently have more collateral transparency than commercial or residential mortgage deals.


OpenAI is outputting the partially copyright-infringing works of their LLM for profit. How does that square?


You, the user, are inputting variables into their probability algorithm that results in the copyrighted work. It's just a tool.


Let's say a torrent website asks the user through an LLM interface what kind of copyrighted content they want to download, then offers them links based on that, and makes money off of it.

The user is "inputting variables into their probability algorithm that's resulting in the copyright work".


Theoretically, a torrent website that does not distribute the copyrighted files themselves in any way should be legal, unless there's a specific law against it (I'm unaware of any, but I may be wrong).

Rights holders tend to argue for conspiracy to commit copyright infringement, which is a tenuous case to make unless they can prove that was actually the site's intention. I think in most cases it's ISP/hosting terms and conditions and legal costs that lead to these sites' demise.

Your example of the model asking specifically "what copyrighted content would you like to download?" kinda implies that conspiracy to commit copyright infringement would be a valid charge.


How is it any different than training a model on content protected under an NDA and allowing access to users via a web-portal?

What does OpenAI have that lets them get away with it, while our hypothetical Mr. Smartass doing the same process to get around an NDA can't?


Well, if OpenAI signed an NDA beforehand not to disclose certain training data it used, and users can then actually access this data, then yes, it would be problematic for OpenAI under the terms of the NDA it signed.


Yes, a tool that they charge me money to use.


Just like any other tool that can be used to plagiarize: Photoshop, Word, etc.


You raise an interesting point. If more professional lawyers agreed with you, then why have we not seen a lawsuit from publishers against OpenAI?



There are some lawsuits, especially in the very reflexively copyright-pilled industries. However, a good chunk of publishers aren't suing, for self-interested reasons. There are a lot of people in the creative industry who see a machine that can cut artists out of the copyright bargain completely and are shouting "omg piracy is based now", because LLMs can spit out content faster and for free.


Is converting an audio signal into the frequency domain, pruning all inaudible frequencies, and then Huffman encoding it transformative?
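
(For concreteness, that pipeline is roughly what MP3 does. Here's a toy sketch in Python, assuming numpy is available; a flat amplitude threshold stands in for a real psychoacoustic model, and zlib, whose DEFLATE format includes Huffman coding, stands in for MP3's Huffman tables:)

    import zlib
    import numpy as np

    def toy_encode(samples, threshold=0.01):
        spectrum = np.fft.rfft(samples)              # time -> frequency domain
        quiet = np.abs(spectrum) < threshold * np.abs(spectrum).max()
        spectrum[quiet] = 0                          # prune "inaudible" bins
        floats = spectrum.view(np.float64)           # interleaved real/imag parts
        quantized = np.round(floats / (np.abs(floats).max() / 32767))
        return zlib.compress(quantized.astype(np.int16).tobytes())

    # One second of a 440 Hz tone is 88200 bytes as raw 16-bit PCM.
    tone = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
    print(len(toy_encode(tone)), "bytes after the toy codec")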


Well, if the end result is something completely different, such as an algorithm for determining which music is popular or which song is playing, then yes, it's transformative.

A merely compressed version of a song, intended to be used in the same way as the original copyrighted work, would be copyright infringement.


If your training process ingests the entire text of the book, and trains with a large context size, you're getting more than just "a handful of word probabilities" from that book.


If you've trained a 16-bit ten-billion-parameter model on ten trillion tokens, then the mean training token changes 2/125 of a bit of the weights, and a 60k-word novel (~75k tokens) contributes about 1200 bits.

It's up to you if that counts as "a handful" or not.
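
(The back-of-envelope arithmetic behind those numbers; the big assumption is that the weight bits are spread evenly over the training tokens, when memorized passages surely get more than their share:)

    weight_bits = 10e9 * 16          # ten billion 16-bit parameters
    tokens = 10e12                   # ten trillion training tokens

    bits_per_token = weight_bits / tokens
    print(bits_per_token)            # 0.016, i.e. 2/125 of a bit
    print(bits_per_token * 75_000)   # a ~75k-token novel: 1200.0 bits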


I think it’s questionable whether you can actually use this bit count to represent the amount of information from the book. Those 1200 bits represent the way in which this particular book is different from everything else the model has ingested. Similarly, if you read an entire book yourself, your brain will just store the salient bits, not the entire text, unless you have a photographic memory.

If we take math or computer science for example: some very important algorithms can be compressed to a few bits of information if you (or a model) have a thorough understanding of the surrounding theory to go with it. Would it not amount to IP infringement if a model regurgitates the relevant information from a patent application, even if it is represented by under a kilobyte of information?


I agree with what I think you're saying, so I'm not sure I've understood you.

I think this is all still compatible with saying that ingesting an entire book is still:

> If you're taking a handful of word probabilities from every book ever written, then the portion taken from each work is very, very low

(Though I wouldn't want to bet either way on the "so courts aren't likely to care" that follows from that quote: my not-legally-trained reading of the rules leaves me confused about how traditional search engines aren't a copyright violation.)


If I invent an amazing lossless compression algorithm such that adding an entire 60k word novel to my blob only increases its size by 1.2 kilobits, does that mean I'm not infringing copyright if I release that model?


How is that relevant? If some LLM were able to regurgitate a 60k word novel verbatim on demand, sure, the copyright situation would be different. But last I checked they can’t, not 60k, 6k, or even 600 words. Perhaps they can do 60 words of some well-known passages from the Bible or other similar ubiquitous copyright-free works.


So the fact that it's a lossy compression algorithm makes it ok?


"It's lossy" is in isolation much too vague to say if it's OK or not.

A compression algorithm which loses 1 bit of real data is obviously not going to protect you from copyright infringement claims; something that reduces all inputs to a single bit is obviously fine.

So, for example, what the NYT is suing over is that it (or so it is claimed) allows the model to regenerate entire articles, which is not OK.

But to claim that it is a copyright infringement to "compress" a Harry Potter novel to 1200 bits is to say that this:

> Harry Potter discovers he is a wizard and attends Hogwarts, where he battles dark forces, including the evil Voldemort, to save the wizarding world.

… which is just under 1200 bits, is an unlawful thing to post (and for the purpose of the hypothetical, imagine that quotation as a zero-context tweet, rather than what it actually is here: fair use, because it appears in a discussion about copyright infringement of novels).
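
(The count checks out, assuming 8 bits per ASCII character:)

    summary = ("Harry Potter discovers he is a wizard and attends Hogwarts, "
               "where he battles dark forces, including the evil Voldemort, "
               "to save the wizarding world.")
    print(len(summary) * 8)          # 1184: just under 1200 bits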

I think anyone who suggests suing over this to a lawyer would discover that lawyers can in fact laugh.

Now, there's also the question of whether it's legal to train a model on all of the Harry Potter fan wikis, which almost certainly have a huge overlap with the contents of the novels and thus strengthen these same probabilities. Some people accuse OpenAI et al of "copyright laundering", and I think ingesting derivative works such as fan sites would be a better description of "copyright laundering" than the specific things they're formally accused of in the lawsuits.


To be fair, OP raises an important question that I hope smart legal minds are pondering. In my view, they aren't looking for a "programmer answers a legal question" response. The right court might well agree with their premise. What the damages or restrictions might be, I cannot speculate. Any IP lawyers here who want to share some thoughts?


Yup, that's fair.

As my not-legally-trained reading of the rules leaves me confused about how traditional search engines aren't a copyright violation, I don't trust my own beliefs about the law.


xz can compress the text of Harry Potter by a factor of 30:1. Does that mean I can also distribute compressed copies of copyrighted works and that's okay?
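
(Python's lzma module produces the same container format xz uses, so a claim like that is easy to test yourself; "novel.txt" is a placeholder for any text you have rights to:)

    import lzma

    raw = open("novel.txt", "rb").read()      # placeholder path
    packed = lzma.compress(raw, preset=9)     # xz-compatible LZMA stream
    print(f"{len(raw) / len(packed):.1f}:1 compression ratio")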


Can you get that book out of an LLM?

Because that's the distinction being argued here: it's "a handful"[0] of probabilities, not the complete work.

[0] I'm not sold on the phrasing "a handful", but I don't care enough to argue terminology; the term "handful" feels like it's being used in a sorites paradox kind of way: https://en.wikipedia.org/wiki/Sorites_paradox


Incredibly poor analogy. If an LLM were able to regurgitate Harry Potter on demand like xz can, the copyright situation would be much more black and white. But they can’t, and it’s not even close.


You can't get Harry Potter out of the LLM; that's the difference.


I think with some AI you could reproduce artworks of obscure indie artists who are working right now.

If you were a director at a game company and needed art in that style, it would be cheaper to have the AI do it instead of buying from the artist.

I think this is currently an open question.


I recently read an article, which I annoyingly can't find again, about an art director at a company that decided to hire some prompters. They got some art, told them to completely change it, got other art, told them to make smaller changes... and then got nothing useful, as the prompters couldn't tell the AI "like that, but make this change". AI art may get there in a few years, or maybe a decade or two, but it's not there yet. (End of that article: they fired the prompters after a few days.)

An AI-enhanced Photoshop, however, could do wonders, as the base capabilities seem to be mostly there. I haven't used any of the newer AI stuff myself, but https://www.shruggingface.com/blog/how-i-used-stable-diffusi... makes it pretty clear the building blocks are largely in place. So my guess is the main disconnect is in making the machines understand natural-language instructions for how to change the art.


> we also thought sampling in music was de minimis

I would think that if I can recognize exactly what song it comes from, it's not de minimis.


When I was younger, I was told that the Beastie Boys album Paul's Boutique was the straw that broke the camel's back! I have no idea if this is true, but that album has a batshit crazy amount of recognizable samples. I doubt very much that the Beastie Boys paid anything for the rights to sample.



