Hacker News

I still find it very (depressingly) hilarious how everybody sees this as a lawsuit about whether training on copyrighted content is legal or not.

Literally, the NYT claimed that OpenAI maintained a database of NYT's works and would just verbatim surface the content. This is not an AI issue, it's settled copyright law.



NYT's statement of claims is here:

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

And OpenAI's defense is here:

https://fingfx.thomsonreuters.com/gfx/legaldocs/byvrkxbmgpe/...

OpenAI's defense, which I am not in a position to verify, is that ChatGPT cannot be used to reproduce NYT's articles verbatim without a great deal of prompt engineering; that the verbatim reproductions NYT submitted are the work of a third-party expert who tried more than 10,000 prompts, many of which fed portions of NYT articles into ChatGPT in order to get it to respond with verbatim reproductions of those articles; and that no one uses ChatGPT in this manner. Furthermore, the verbatim reproductions that ChatGPT did end up producing after these 10,000 prompts are available on numerous public websites unaffiliated with NYT.


This feels like an extremely weak defense to me, because if it works it is essentially a copyright backdoor.

1. There is little doubt that OpenAI has NYT articles in its dataset.

2. They claim it's OK because all of the IP is mixed together and effectively laundered.

3. But in fact if you put in a secret code you can get the IP back nearly verbatim.

4. So if OpenAI wins this case we could just trade prompts that regurgitate the articles back without ever visiting NYT.


> So if OpenAI wins this case we could just trade prompts that regurgitate the articles back without ever visiting NYT.

This seems like the inverse of the old "book cipher" scheme to "avoid" copyright infringement.

If you want to distribute something you're not allowed to, first you find some public data (e.g. a public domain book), then you xor it against the thing you want to distribute. The result is gibberish. Then you distribute the gibberish and the name of the book to use as a key and anyone can use them to recover the original. The "theory" is that neither the gibberish nor the public domain book can be used to recover the original work alone, so neither is infringing by itself, and any given party is only distributing one of them. Obviously this doesn't work and the person distributing the gibberish rather than the public domain book is going to end up in court.
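The book-cipher scheme described above can be sketched in a few lines of Python (the "works" here are obviously placeholder strings, not real texts):

```python
from itertools import cycle

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # XOR each byte of data against the key, cycling the key to match length
    return bytes(d ^ k for d, k in zip(data, cycle(key)))

secret = b"the copyrighted work you want to distribute"
public = b"some public domain text anyone can download"

gibberish = xor_bytes(secret, public)  # unreadable noise on its own
restored = xor_bytes(gibberish, public)  # XOR is its own inverse
assert restored == secret
```

Neither `gibberish` nor `public` resembles the original on its own, which is exactly the fig leaf the scheme relies on; combining them is trivial, which is why courts see through it.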

So then which side of the fence is ChatGPT on, and which side is the text you have to feed it to get it to emit the article? Well, producing the latter requires access to both ChatGPT and the original article.

Notice also that this fails in the same way. The people distributing the text that can be combined with the LLM to reproduce the article are the ones with the clear intention to infringe the copyright. Moreover, you can't produce the prompt that would get ChatGPT to do that unless you already have access to the article, so people without a subscription can't use ChatGPT that way. And, rather importantly, the scheme is completely vacuous. If you already have access to the article needed to generate the relevant prompt and you want to distribute it to someone else, you don't have to give them some prompt they can feed to ChatGPT, you can just give them the text of the article.


I agree. If you gzip a NYT article and print it out, very few people would be able to read the article. But it can still be decoded ("prompt engineering" as OpenAI calls it).
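The analogy is easy to demonstrate: the compressed bytes are unreadable on a printout, yet trivially decodable (the article text here is a stand-in, not a real NYT piece):

```python
import gzip

article = b"All the News That's Fit to Print. " * 50  # stand-in for an article

blob = gzip.compress(article)
assert blob != article                   # printed out, this is gibberish to a human
assert gzip.decompress(blob) == article  # yet the original is fully recoverable
```

The point being that "most people can't read it directly" has never been a defense to distributing a recoverable copy.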


Copyright maximalism in the 21st century can be summed up as: when an individual makes a single copy of a song and gives it to a friend, that's piracy. When a corporation makes subtly different copies of thousands of works and sells them to customers, that's just fair use.


NYT is a corporation.

Corporation vs individual is a distraction. It’s some people (wrongly, in my view) prioritising production over consumption. If this were Altman personally producing an AI, the same people would rally to him.

The corporate/individual framing needlessly inflames the debate when it’s really one about power and money.


I don't think it's "production over consumption". At least I don't like that framing. For me it's about supporting production. The humans that write news articles every day can't produce that valuable work if they don't get fairly compensated for it. It's not that the AI produces more, it's that the AI destabilizes production. It makes it impossible to produce.


> It's not that the AI produces more

We're not debating whether they do. "Humans that write news articles" are producing. That contrasts with "an individual mak[ing] a single copy of a song and giv[ing] it to a friend." We don't put journalists in jail for plagiarism.


> We don't put journalists in jail for plagiarism.

I'm guessing you're imagining a scenario here where a journalist has copied an entire article verbatim and republished it in their newspaper. That would actually be both copyright infringement AND plagiarism. Newspapers just rarely enforce that right.

These two things aren't on a scale. They are independent infractions.


No, they wouldn't, because Altman would still be stealing other people's actual work.


> they wouldn't, because Altman would still be stealing other people's actual work

OpenAI is "stealing other people's [sic] actual work." The people rallying to it clearly don't care that much about it now. They wouldn't care whether it's a corporation or Sam Altman per se doing it.


I'd say there's some merit to that defense. Imagine for example if a website generated itself based on a sequence in Pi - technically all of the NYT is in that 'dataset' and if you tell it to start at the right digit it will spit back any NYT article. In a more realistic sense though you can make it spit back anything you want and the NYT article is just a consequence of that behavior - finding the right 'secret code' to get a verbatim article is not something you can easily just do.
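The "it's all in pi somewhere" idea is at least partly testable: famous digit runs really do show up, just at offsets you could never find without already having the target. A sketch using Gibbons' streaming spigot algorithm (the algorithm choice is mine, not from the thread):

```python
from itertools import islice

def pi_digits():
    """Gibbons' unbounded spigot algorithm: yields 3, 1, 4, 1, 5, ..."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

digits = "".join(str(d) for d in islice(pi_digits(), 800))
# The Feynman point: the first run of six 9s begins at decimal place 762
assert digits.find("999999") == 762
```

A six-digit run takes ~762 digits to appear; finding the offset of an entire article would take astronomically more digits than anyone could compute, which is why the pi "website" only works if someone who already has the article hands you the offset.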

ChatGPT is somewhere in between: you can't just ask it for a specific NYT article and have it spit it back at you verbatim (NYT acknowledges as much; it took them ~10k prompts to do it), but with enough hints and guesses you can coax it into producing one (along with pretty much anything else you want). The question then becomes whether that's closer to the Pi example (ChatGPT is basically just spitting the prompt back at you), or whether it's easy enough to do that it's similar to ChatGPT just hosting the article.

Edit: I suppose I'd add that this is also a separate question from the training itself: training on copyrighted material may or may not be legal regardless of whether the model can spit the training material back out verbatim.


You're getting lost in the technology here. Copyright is not about producing the exact sequence of bytes, nor is it about "hosting an article". Copyright is an intellectual property right to the creative work, not the exact reproduction that is seen on some website, but the creative work itself.

The law does not care about your weird edge cases. What matters is what should be and how we can make it so.


You're ignoring the point of what I'm saying though, which is that the required prompt is relevant to determining if ChatGPT itself is the thing violating the copyright. I can probably get ChatGPT to produce any sequence of tokens I want given enough time, that doesn't mean ChatGPT is violating every copyright in existence, somewhere you have to draw the line.


I'm not ignoring it. I'm saying that the axis you're contemplating this problem on isn't correct. It's not about if you can get "any sequence of tokens" or the edit distance between those tokens and the actual tokens of the copyrighted work. The law is not (and should not be) an algorithm with a definite mathematical answer at some fixed point in a continuum.

Pi is not copyrighted, because that would be silly, but if you were to find the exact bytes in there that reproduce the next Marvel movie and you started sharing that offset, that would probably be copyright infringement. The fact that neither of those numbers was part of the original work, or copyrightable in isolation, or that "technically everything is present in pi", is immaterial. It's obvious to any non-pedantic human being that you're infringing on the creative work.


You're still missing the key point. Say that website exists: if someone were to find the exact point in pi that is the next Marvel movie and started sharing the location, is the copyright violation committed by the creator of the pi website or by the person who found and is sharing the location?

If I give you a prompt that's just the contents of a NYT article and me telling ChatGPT to say it back to me, is ChatGPT committing the copyright violation by producing the article or am I by creating and sharing the prompt?


I will say it again. I am not missing the point, I am refusing the point. The point you are bringing across is not a useful point in matters of law.

There is no reasonable way for us to deliberate on your made-up scenarios, because in matters of law the details matter. The website hosting pi could very well be taking part in the copyright infringement; it could also very well not. Our way of weighing those details is the process of the law.

You place the question of pi in a vacuum, asking me if it should be illegal "in principle", but that's not law. The intent, appearance, skill of counsel, even the judge and jury, will matter if a case comes up. You cannot separate the idealized question from the messy details of the fleshy humans.


Yes, it's almost like it's a complicated legal question and the content of the required prompt to produce a copyright-infringing response would be something that would interest the judge and jury.

You're saying "it's complicated and lots of factors would come into play", which is the same thing I'm saying. The fact that it spits out copyright-violating text does not necessarily mean ChatGPT is the one at fault, it's messy.


>Yes, it's almost like it's a complicated legal question and the content of the required prompt to produce a copyright-infringing response would be something that would interest the judge and jury.

In what way? You don't seem to know what is decided by a jury or what is decided by a judge. Specifically, what do you think the prompt evidences that makes it relevant?

> The fact that it spits out copyright-violating text does not necessarily mean ChatGPT is the one at fault, it's messy.

Actually, that's exactly what it means. There is no defense to copyright infringement of the nature you are discussing. OpenAI is responsible for what it ingests, and the fact that use of its tool can result in these outcomes is solely the responsibility of OpenAI; your misunderstandings otherwise are dense and apparently impenetrable.


How is that person missing the point? You are making a legal argument and apparently without any consideration for the actual law...


They're missing my point because I'm not saying it is or isn't, I'm saying that it's messy and things like the required prompt may sway the judge and/or jury one way or the other. If you provide ChatGPT an entire copyrighted text in the prompt and then go "ah-ha, the response violated my copyright", a judge and/or jury probably won't be very impressed with you. If instead you just ask ChatGPT "please produce chapter 1 of my latest book" and it does, then ChatGPT is not looking so great.


Judge or jury one way or the other on what? You literally have no idea what you are talking about, no idea how a lawsuit works, and apparently no idea what is decided by a judge vs. what is decided by a jury, yet you are constantly expounding on legal issues as if it contributed anything besides furthering the ignorance of people who don't know better than to dismiss your posts.

Your hypothetical is asinine and completely removed from what is at issue in this lawsuit.

And of course now, reading other posters responding to you in this thread, I'm not the only one pointing out how you are only contributing your own misunderstandings.


The difference between "I can probably generate everything" and "I can definitely produce this copyrighted work" is substantial and in fact the core argument in the case.


Can you really say it can "definitely produce this copyrighted work" if NYT had to try thousands of prompts, some of which included parts of the articles they wanted it to produce? That's my point. I really don't know the answer, but it's not as simple as "they asked it to produce the article and it did"; they tested thousands of combinations.


Did it? Then yes, you can say it can "definitely produce this copyrighted work."

I'm not sure how that could even be controversial. Either it does or doesn't. In this case, it does.


So if I go on ChatGPT, copy in a chapter from a book and then ask it to repeat the chapter back to me, is ChatGPT violating the copyright of the book I just fed it?


That's not an issue in this lawsuit


If it is outputting verbatim copies of works it has ingested, it is doing copyright infringement. It's really not that difficult.


I think the difference here is that a human intentionally built a dataset containing that information, whereas Pi is an irrational number which is a consequence of our mathematics and number system and wasn't intentionally crafted to give you NYT articles.


Well that depends on what you're trying to prove. If you think it's a copyright violation to include the articles in the dataset _at all_ then it doesn't even matter if ChatGPT can produce NYT articles, it's a violation either way. If including the articles in the dataset is not in-and-of-itself a copyright violation then things get complicated when talking about what prompt is required to produce a copyright-violating result.


1. Anyone can get all of NYT's articles for free, along with CNN and every other major news site; this isn't in dispute. It's all available here in a single 93-terabyte compressed file:

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-05/inde...

2. I did not see any defense of this nature.

3. Yes and this is the big deal. If the secret code needed to reproduce copyrighted material involves large portions of that copyrighted material already then that's quite a bit different than just verbatim reproductions out of thin air.

4. Yes, if OpenAI wins this case then you could feed into ChatGPT large portions of NYT articles and OpenAI could possibly respond by regurgitating similar such portions of NYT articles in response.


> OpenAI's defense, which I am not in a position to verify, is that ChatGPT can not be used to reproduce NYT's articles verbatim without a great deal of prompt engineering.

That's really stupid. It's akin to claiming that I can serve pirated copyrighted content from my server, just as long as it's served from a really convoluted path. If you can get to it through any path, it's infringing. The path literally doesn't matter; it's a total red herring.

> Furthermore the verbatim reproductions that ChatGPT did end up producing after these 10000 prompts are available on numerous public websites unaffiliated with NYT.

Also stupid. So it's only piracy if you download it from the original source, and somehow not-piracy if you download it from another pirate? Or every single commercially released movie is fair game to distribute, because they're already being served up on numerous pirate BitTorrent sites?


Everybody seems to be focused on whether or not OpenAI copied the data in training, but my understanding of copyright is that if a person went into a clean room and wrote a new article from scratch, without having read any NYT, that just so happened to be exactly the same as an existing NYT article, it would still be a copyright violation.

As soon as OpenAI repeats a set of words verbatim, it violates copyright.

The courts should examine how much an occasional verbatim regurgitation would damage NYT's business. (I would guess not much.)


No, this is untrue. Independent creation is an affirmative defense against copyright infringement. You'd never convince a jury that you independently wrote the exact same article as a New York Times article, but in principle you can argue that you independently wrote, say, a song, or even reimplemented the WIN32 API without ever having read or familiarized yourself with the original source code:

https://github.com/wine-mirror/wine

https://harvardlawreview.org/print/vol-128/creating-around-c...


Thanks for the clarification!


> but my understanding of copyright is that if a person went into a clean room and wrote a new article from scratch, without having read any NYT, that just so happened to be exactly the same as an existing NYT article, it would still be a copyright violation.

It would not be. Independent creation is a complete defense against copyright infringement.

Patents, however, do work this way.


> is that ChatGPT can not be used to reproduce NYT's articles verbatim without a great deal of prompt engineering

You aren't allowed to infringe copyrights just because you make it difficult to do so. OpenAI's system should not be making verbatim copies at all.


It's probably worth considering how the thing actually works.

LLMs are sort of like a fancy compression dictionary that can be used to compress text, except that we kind of use them in reverse. Instead of compressing likely text into smaller bitstrings, they generate likely text. But you could also use them for compression of text because if you take some text, there is highly likely a much shorter prompt + seed that would generate the same text, provided that it's ordinary text with a common probability distribution.

Which is basically what the lawyers are doing. Keep trying combinations until it generates the text you want.

But the ability to do that isn't really that surprising. If you feed a copyrighted article to gzip, it will give you a much shorter string that you can then feed back to gunzip to get back the article. That doesn't mean gunzip has some flaw or ill intent. It also doesn't imply that the article is even stored inside of the compression library, rather than there just being a shorter string that can be used to represent it because it contains predictable patterns.
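The "shorter string exists only because the text is predictable" point can be demonstrated directly: gzip compresses patterned text dramatically, but cannot shorten random bytes at all (sample text is made up):

```python
import gzip
import os

predictable = b"the quick brown fox jumps over the lazy dog. " * 60
random_data = os.urandom(len(predictable))

# Patterned text: compressed output is a small fraction of the input
assert len(gzip.compress(predictable)) < len(predictable) // 5

# Random bytes: no predictable structure, so gzip can only add overhead
assert len(gzip.compress(random_data)) >= len(random_data)
```

In the same way, a short prompt + seed can stand in for an article only because ordinary prose is highly predictable to a model trained on similar text; the article isn't "stored" any more than it is stored inside gzip.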

It's not implausible that an LLM could generate a verbatim article it was never even trained on if you pushed on it hard enough, especially if it was trained on writing in a similar style and other coverage of the same event.


> It's not implausible that an LLM could generate a verbatim article it was never even trained on if you pushed on it hard enough, especially if it was trained on writing in a similar style and other coverage of the same event.

That'd be a coincidence, not a verbatim copy. Copyright law doesn't prohibit independent creation. This defense isn't available to OpenAI because there is no dispute OpenAI ingested the NYTimes articles in the first place. There is no plausible way OpenAI could say they never had access to the articles they are producing verbatim copies of.

Rather than sneeringly explain away how LLMs work without any eye toward the laws at issue, maybe you should do yourself the favor of learning about them so you can spare us this incessant "no let me explain how they work, it's fine I swear!" shtick.


> That'd be a coincidence, not a verbatim copy.

It would be both. Or to put it a different way, how would you distinguish one from the other?

> This defense isn't available to OpenAI because there is no dispute OpenAI ingested the NYTimes articles in the first place.

The question remains whether ingesting the article is the reason it gets output in response to a given prompt, when it could have happened either way.

And in cases where you don't know, emitting some text is not conclusive evidence that it was in the training data. Most of the text emitted by LLMs isn't verbatim from the training data.

> Rather than sneeringly explain away how LLMs work without any eye towards the laws at issue, maybe you should do yourself the favor of learning about them so you can spare us this incessent "no let me explain how they work, it's fine I swear!" shtick.

This is a case of first impression. We don't really know what they're going to do yet. But "there exists some input that causes it to output the article" isn't any kind of offensive novelty; lots of boring existing stuff does that when the input itself is based on the article.


>It would be both. Or to put it a different way, how would you distinguish one from the other?

No, it's not both. Have you engaged in any effort to understand the law here? Copyright doesn't prohibit independent creation. I'm not sure how much more simply I can put that for you. In one scenario there is copying, in the other there isn't. The facts make it clear: when something is copied, it is illegal.

>The question remains whether ingesting the article is the reason it gets output in response to a given prompt, when it could have happened either way.

This can't actually be serious? This isn't credible. You are saying there is no difference between ingesting it and outputting the results vs not ingesting it and outputting the results. Anything to back this up at all?

>This is a case of first impression. We don't really know what they're going to do yet. But "there exists some input that causes it to output the article" isn't any kind of offensive novelty; lots of boring existing stuff does that when the input itself is based on the article.

"First impression" (something you claim) doesn't mean ignore existing copyright law. One side is arguing this isn't first impression at all, it's just rote copying.

> But "there exists some input that causes it to output the article" isn't any kind of offensive novelty

You said it's novel; I called it plain copying.

>lots of boring existing stuff does that when the input itself is based on the article.

You are saying it's first impression... not me.


> and that no one uses ChatGPT in this manner

Someone did though and was able to get verbatim reproductions of NYT articles out of it.

> Furthermore the verbatim reproductions that ChatGPT did end up producing after these 10000 prompts are available on numerous public websites unaffiliated with NYT.

So what? NYT as a copyright holder might have no issue with those unaffiliated sites but have an issue with OpenAI.


This is a problem with copyright law. There is no way for an end user to determine the copyright status of anything on the Internet, you can only make an educated guess.


It's pretty simple in the US. A work has a copyright regardless of whether it's registered or a notice is placed on it. Registration provides an easier means of asserting your copyright, but you have a copyright as soon as you create the work. If I wrote a handwritten note about OpenAI on a cocktail napkin, I would have copyright over that work barring some challenge to whether it's a "creative work" or not.

It doesn't matter what the medium is, or how it's shared. The internet makes this challenging in that it's essentially a shared technical means of disseminating the work, but the work remains copyrighted no matter how publicly available it might or might not be. It's just a matter of the rights-holder asserting their right, which is something NYT does with their paywall all the time.


As read, you are asserting that the public domain doesn't exist.

Not only that, if something is available on the Internet, and still under copyright, you have no way of knowing whether the website is authorized to distribute it or not.


> ChatGPT can not be used to reproduce NYT's articles verbatim without a great deal of prompt engineering

If I have a copy of a movie, I am infringing copyright. Why is OpenAI special?

Because it's Microsoft /s


First sentence of second paragraph of the lawsuit: “Defendants’ unlawful use of The Times’s work to create artificial intelligence products that compete with it threatens The Times’s ability to provide that service.” First sentence of p7: “The Times objected after it discovered that Defendants were using Times content without permission to develop their models and tools.”

I think it’s ultimately about whether training on copyrighted content is legal or not.

Here are some other quotes from the lawsuit that approach it from a different angle: “These tools also wrongly attribute false information to The Times.” “By providing Times content without The Times’s permission or authorization, Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue.”

Even if the first argument fails, if the second argument wins, it still boils down to not being able to train on copyrighted content unless it is possible to train on copyrighted data without ultimately quoting that content or attributing anything to the author of that content. My (uneducated) guess is that’s not possible.


> I think it’s ultimately about whether training on copyrighted content is legal or not.

It is.

The bulk of the complaint is a narrative; it's meant to be a persuasive story that seeks to put OpenAI in a bad light. You don't really get to the specific causes of action until page 60 (paragraphs 158-180). A sample of the specific allegations that comprise the elements of each cause of action are:

160. By building training datasets containing millions of copies of Times Works, including by scraping copyrighted Times Works from The Times’s websites and reproducing such works from third-party datasets, the OpenAI Defendants have directly infringed The Times’s exclusive rights in its copyrighted works.

161. By storing, processing, and reproducing the training datasets containing millions of copies of Times Works to train the GPT models on Microsoft’s supercomputing platform, Microsoft and the OpenAI Defendants have jointly directly infringed The Times’s exclusive rights in its copyrighted works.

162. On information and belief, by storing, processing, and reproducing the GPT models trained on Times Works, which GPT models themselves have memorized, on Microsoft’s supercomputing platform, Microsoft and the OpenAI Defendants have jointly directly infringed The Times’s exclusive rights in its copyrighted works.

163. By disseminating generative output containing copies and derivatives of Times Works through the ChatGPT offerings, the OpenAI Defendants have directly infringed The Times’s exclusive rights in its copyrighted works.


IMO the first argument is invalid, however, the second one is a completely valid argument.


> "Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue."

News flash: you can read newspaper articles at the library.


Yes, and libraries pay for that access. They also don't obfuscate the origin or remove the advertising. Don't equate libraries with what OpenAI does.


> News flash: you can read newspaper articles at the library.

Reading an article != selling a product that redistributes the article.


And it's no coincidence that the NYTimes isn't suing OpenAI for reading newspaper articles at the library...


I haven’t checked in on this case for a while, but aren’t there also many organizations that want OpenAI to win this case so that the concept of fair use is upheld?


If OpenAI's use of publishers' content required no permission, then why did it consummate deals with all the publishers mentioned in the article, as well as seek one with the NYT?

"Ask forgiveness, not permission" is supposed to be the Silicon Valley motto. But that's not what's happening here. OpenAI is asking for permission. As with all the other publishers, OpenAI will have to pay. NYT reserves the right to set the price as high as it wishes. No doubt the price will be enough to cover NYT's costs from this litigation. OpenAI will pay it.

How much has OpenAI spent on this litigation?



