The New York Times Has Spent $10.8M in Its Legal Battle with OpenAI So Far

lesuorac · 2025-02-05T18:25:12 1738779912

I still find it very (depressingly) hilarious how everybody sees this as a lawsuit about if training on copyrighted context is legal or not.

Literally, the NYT claimed that OpenAI maintained a database of NYT's works and would just verbatim surface the content. This is not an AI issue, it's settled copyright law.

Maxatar · 2025-02-05T18:46:27 1738781187

NYT's statement of claims is here:

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

And OpenAI's defense is here:

https://fingfx.thomsonreuters.com/gfx/legaldocs/byvrkxbmgpe/...

OpenAI's defense, which I am not in a position to verify, is that ChatGPT can not be used to reproduce NYT's articles verbatim without a great deal of prompt engineering. That the submissions NYT provided of verbatim reproductions are the result of a third party expert who tried more than 10000 prompts that included feeding into ChatGPT portions of NYT articles in order to get ChatGPT to respond with verbatim reproductions of NYT articles and that no one uses ChatGPT in this manner. Furthermore the verbatim reproductions that ChatGPT did end up producing after these 10000 prompts are available on numerous public websites unaffiliated with NYT.

TrainedMonkey · 2025-02-05T19:06:45 1738782405

This feels like an extremely weak defense to me, because if it works it is essentially a copyright backdoor.

1. There is little doubt that OpenAI has NYT articles in it's dataset.

2. They claim it's OK because all of the IP is mixed together and effectively laundered.

3. But in fact if you put in a secret code you can get the IP back nearly verbatim.

4. So if OpenAI wins this case we could just trade prompts that regurgitate the articles back without ever visiting NYT.

AnthonyMouse · 2025-02-05T19:34:45 1738784085

> So if OpenAI wins this case we could just trade prompts that regurgitate the articles back without ever visiting NYT.

This seems like the inverse of the old "book cipher" scheme to "avoid" copyright infringement.

If you want to distribute something you're not allowed to, first you find some public data (e.g. a public domain book), then you xor it against the thing you want to distribute. The result is gibberish. Then you distribute the gibberish and the name of the book to use as a key and anyone can use them to recover the original. The "theory" is that neither the gibberish nor the public domain book can be used to recover the original work alone, so neither is infringing by itself, and any given party is only distributing one of them. Obviously this doesn't work and the person distributing the gibberish rather than the public domain book is going to end up in court.

So then which side of the fence is ChatGPT and which side is the text you have to feed it to get it to emit the article? Well, it's the latter that you need access to both the existing ChatGPT and the original article in order to produce.

Notice also that this fails in the same way. The people distributing the text that can be combined with the LLM to reproduce the article are the ones with the clear intention to infringe the copyright. Moreover, you can't produce the prompt that would get ChatGPT to do that unless you already have access to the article, so people without a subscription can't use ChatGPT that way. And, rather importantly, the scheme is completely vacuous. If you already have access to the article needed to generate the relevant prompt and you want to distribute it to someone else, you don't have to give them some prompt they can feed to ChatGPT, you can just give them the text of the article.

jrockway · 2025-02-05T19:12:22 1738782742

I agree. If you gzip a NYT article and print it out, very few people would be able to read the article. But it can still be decoded ("prompt engineering" as OpenAI calls it).

sidewndr46 · 2025-02-05T19:13:21 1738782801

Copyright maximalism in the 21st century can be summed up as: When an individual makes a single copy of a song and gives it to a friend, that's piracy. When a corporation makes subtlety different copies of thousands of works and sells them to customers, that's just fair use

JumpCrisscross · 2025-02-05T19:16:07 1738782967

NYT is a corporation.

Corporation vs individual is a distraction. It’s some people (wrongly, in my view) prioritising production over consumption. If this were Altman personally producing an AI, the same people would rally to him.

The corporate/individual framing needlessly inflames the debate when it’s really one about power and money.

delusional · 2025-02-05T19:40:22 1738784422

I don't think it's "production over consumption". At least I don't like that framing. For me it's about supporting production. The humans that write news articles every day can't produce that valuable work if they don't get fairly compensated for it. It's not that the AI produces more, it's that the AI destabilizes production. It makes it impossible to produce.

JumpCrisscross · 2025-02-05T23:16:51 1738797411

> It's not that the AI produces more

We're not debating whether they do. "Humans that write news articles" are producing. That contrasts with "an individual mak[ing] a single copy of a song and giv[ing] it to a friend." We don't put journalists in jail for plagiarism.

delusional · 2025-02-06T17:12:14 1738861934

> We don't put journalists in jail for plagiarism.

I'm guessing you're imagining a scenario here were a journalist has copied an entire article verbatim and republished it in their newspaper. That would actually be both copyright infringement AND plagiarism. Newspapers just rarely enforce that right.

These two things aren't on a scale. They are independent infractions.

freejazz · 2025-02-05T20:39:58 1738787998

No, they wouldn't, because Altman would still be stealing other people's actual work.

JumpCrisscross · 2025-02-05T23:19:07 1738797547

> they wouldn't, because Altman would still be stealing other people's actual work

OpenAI is "stealing other people's [sic] actual work." The people rallying to it clearly don't care that much about it now. They wouldn't care whether it's a corporation or Sam Altman per se doing it.

DSMan195276 · 2025-02-05T19:26:38 1738783598

I'd say there's some merit to that defense. Imagine for example if a website generated itself based on a sequence in Pi - technically all of the NYT is in that 'dataset' and if you tell it to start at the right digit it will spit back any NYT article. In a more realistic sense though you can make it spit back anything you want and the NYT article is just a consequence of that behavior - finding the right 'secret code' to get a verbatim article is not something you can easily just do.

ChatGPT is somewhere in-between - You can't just ask it for a specific NYT article and have it spit it back at you verbatim (NYT acknowledges as such, it took them ~10k prompts to do it), but with enough hints and guesses you can coax it into producing one (along with pretty much anything else you want). The question then becomes whether that's closer to the Pi example (ChatGPT is basically just spitting the prompt back at you), or if it's easy enough to do that it similar to ChatGPT just hosting the article.

Edit: I suppose I'd add, this is also a separate question from the training, training on copyrighted material may or may not be legal regardless of whether the model can verbatim spit the training material back out.

delusional · 2025-02-05T19:34:47 1738784087

You're getting lost in the technology here. Copyright is not about producing the exact sequence of bytes, nor is it about "hosting an article". Copyright is an intellectual property right to the creative work, not the exact reproduction that is seen on some website, but the creative work itself.

The law doesn't not care about your weird edge cases. What matters is what should be and how we can make it so.

DSMan195276 · 2025-02-05T19:53:17 1738785197

You're ignoring the point of what I'm saying though, which is that the required prompt is relevant to determining if ChatGPT itself is the thing violating the copyright. I can probably get ChatGPT to produce any sequence of tokens I want given enough time, that doesn't mean ChatGPT is violating every copyright in existence, somewhere you have to draw the line.

delusional · 2025-02-05T21:11:01 1738789861

I'm not ignoring it. I'm saying that the axis you're contemplating this problem on isn't correct. It's not about if you can get "any sequence of tokens" or the edit distance between those tokens and the actual tokens of the copyrighted work. The law is not (and should not be) an algorithm with a definite mathematical answer at some fixed point in a continuum.

PI is not copyrighted, because that would be silly, but if you were to find the exact bytes in there to reproduce the next Marvel movie and you started sharing that offset, that would probably be copyright infringement. The fact that neither of those numbers were part of the original work, or copyrightable in isolation, or that "technically everything is present in pi", is immaterial. It's obvious to any non-pedantic human being that you're infringing on the creative work.

DSMan195276 · 2025-02-06T00:00:17 1738800017

You're still missing the key point. Say that website exists, if someone were to find the exact point in Pi that is the next marvel movie and started sharing the location, is the copyright violation committed by the creator of the Pi website or by the person who found and is sharing the location?

If I give you a prompt that's just the contents of a NYT article and me telling ChatGPT to say it back to me, is ChatGPT committing the copyright violation by producing the article or am I by creating and sharing the prompt?

delusional · 2025-02-06T17:01:25 1738861285

I will say it again. I am not missing the point, I am refusing the point. The point you are bringing across is not a useful point in matters of law.

There is no reasonable way for us to deliberate on your made up scenarios, because in matters of law the details matter. The website hosting pi could very well be taking part in the copyright infringement, it could also very well not. Our way of weighing those details is the process of the law.

You place the question of PI in a vacuum, asking me if it should be illegal "in principle", but that's not law. The intent, appearance, skill of council, even the judge and jury, will matter if a case had to come up. You cannot separate the idealized question from the messy details of the fleshy humans.

DSMan195276 · 2025-02-07T12:15:36 1738930536

Yes, it's almost like it's a complicated legal question and the content of the required prompt to produce a copyright-infringing response would be something that would interest the judge and jury.

You're saying "it's complicated and lots of factors would come into play", which is the same thing I'm saying. The fact that it spits out copyright-violating text does not necessarily mean ChatGPT is the one at fault, it's messy.

freejazz · 2025-02-07T17:00:19 1738947619

>Yes, it's almost like it's a complicated legal question and the content of the required prompt to produce a copyright-infringing response would be something that would interest the judge and jury.

In what way? You don't seem to know what is decided by a jury or what is decided by a judge. Specifically, what do you think the prompt evidences that it is relevant?

> The fact that it spits out copyright-violating text does not necessarily mean ChatGPT is the one at fault, it's messy.

Actually, that's exactly what it means. There is no defense to copyright infringement of the nature you are discussing. OpenAi is responsible for what it ingests, and the fact that use of its tool can result in these outcomes is solely the responsibility of OpenAI and your misunderstandings otherwise are dense and apparently impenetrable.

freejazz · 2025-02-06T15:33:54 1738856034

How is that person missing the point? You are making a legal argument and apparently without any consideration for the actual law...

DSMan195276 · 2025-02-07T13:35:21 1738935321

They're missing my point because I'm not saying it is or isn't, I'm saying that it's messy and things like the required prompt may sway the judge and/or jury one way or the other. If you provide ChatGPT an entire copyrighted text in the prompt and then go "ah-ha, the response violated my copyright", a judge and/or jury probably won't be very impressed with you. If instead you just ask ChatGPT "please produce chapter 1 of my latest book" and it does, then ChatGPT is not looking so great.

freejazz · 2025-02-07T16:56:23 1738947383

Judge or jury one way or the other on what? You literally have no idea what you are talking about, have any idea how a lawsuit works, and apparently what is decided by a judge vs what is decided by a jury, and you are constantly espousing on legal issues as if it is contributing to anything but furthering other people's ignorance when they don't know better to dismiss your posts.

Your hypothetical is asinine and completely removed from what is at issue in this lawsuit.

And of course now, reading other posters responding to you in this thread, I'm not the only one pointing out how you are only contributing your own misunderstandings.

stonogo · 2025-02-05T20:03:20 1738785800

The difference between "I can probably generate everything" and "I can definitely produce this copyrighted work" is substantial and in fact the core argument in the case.

DSMan195276 · 2025-02-05T20:25:23 1738787123

Can you really say it can "definitely produce this copyrighted work" if NYT had to try thousands of prompts, some of which included parts of the articles they wanted it to produce? That's my point. I really don't know the answer, but it's not as simple as "they asked it to to produce the article and it did", they tested thousands of combinations.

freejazz · 2025-02-06T15:34:40 1738856080

Did it? Then yes. You can say it "definitely produce this copyrighted work"

I'm not sure how that could even be controversial. Either it does or doesn't. In this case, it does.

DSMan195276 · 2025-02-07T12:17:50 1738930670

So if I go on ChatGPT, copy in a chapter from a book and then ask it to repeat the chapter back to me, is ChatGPT violating the copyright of the book I just fed it?

freejazz · 2025-02-07T16:56:42 1738947402

That's not an issue in this lawsuit

freejazz · 2025-02-06T15:33:22 1738856002

If it is outputting verbatim copies of works it has ingested, it is doing copyright infringement. It's really not that difficult.

throwway120385 · 2025-02-05T19:34:26 1738784066

I think the difference here is that a human intentionally built a dataset containing that information, whereas Pi is an irrational number which is a consequence of our mathematics and number system and wasn't intentionally crafted to give you NYT articles.

DSMan195276 · 2025-02-05T20:28:39 1738787319

Well that depends on what you're trying to prove. If you think it's a copyright violation to include the articles in the dataset _at all_ then it doesn't even matter if ChatGPT can produce NYT articles, it's a violation either way. If including the articles in the dataset is not in-and-of-itself a copyright violation then things get complicated when talking about what prompt is required to produce a copyright-violating result.

Maxatar · 2025-02-05T19:18:31 1738783111

1. Anyone can get all of NYT's articles for free along with CNN and every other major news site, this isn't in dispute, it's available here in a single 93 terabyte compressed file:

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-05/inde...

2. I did not see any defense of this nature.

3. Yes and this is the big deal. If the secret code needed to reproduce copyrighted material involves large portions of that copyrighted material already then that's quite a bit different than just verbatim reproductions out of thin air.

4. Yes, if OpenAI wins this case then you could feed into ChatGPT large portions of NYT articles and OpenAI could possibly respond by regurgitating similar such portions of NYT articles in response.

palmotea · 2025-02-05T22:10:27 1738793427

> OpenAI's defense, which I am not in a position to verify, is that ChatGPT can not be used to reproduce NYT's articles verbatim without a great deal of prompt engineering.

That's really stupid. It's akin to claiming that I can serve pirated copyrighted content from my server, just as long as it's served from a really convoluted path. If you can get to it through any path, it's infringing. The path literally doesn't matter; it's a total red herring.

> Furthermore the verbatim reproductions that ChatGPT did end up producing after these 10000 prompts are available on numerous public websites unaffiliated with NYT.

Also stupid. So it's only piracy if you download it from the original source, and somehow not-piracy if you download it from another pirate? Or every single commercially released movie is fair game to distribute, because they're already being served up on numerous pirate BitTorrent sites?

jay_kyburz · 2025-02-05T20:33:04 1738787584

Everybody seems to be focused on whether or not the OpenAI copied the data in training, but my understanding of copyright is that if a person when into a clean room and wrote an new article from scratch, without having read any NYT, that just so happened to be exactly the same as an existing NYT article, it would still be a copyright violation.

As soon as OpenAI repeats a set of words verbatim, it violates copyright.

The courts should examine how much damage an occasional verbatim regurgitation would damage NYTs business. (I would guess not much)

Maxatar · 2025-02-05T21:06:19 1738789579

No this is untrue. Independent creation is an affirmative defense against copyright infringement. You'd never convince a jury that you independently wrote the exact same article as a New York Times article, but in principle you can argue that you independently wrote say... a song, or even reimplemented the WIN32 API without ever having read or familiarized yourself with the original source code:

https://github.com/wine-mirror/wine

https://harvardlawreview.org/print/vol-128/creating-around-c...

jay_kyburz · 2025-02-06T01:59:25 1738807165

Thanks for the clarification!

nyssos · 2025-02-05T20:59:25 1738789165

> but my understanding of copyright is that if a person when into a clean room and wrote an new article from scratch, without having read any NYT, that just so happened to be exactly the same as an existing NYT article, it would still be a copyright violation.

It would not be. Independent creation is a complete defense against copyright infringement.

Patents, however, do work this way.

freejazz · 2025-02-05T19:56:44 1738785404

> is that ChatGPT can not be used to reproduce NYT's articles verbatim without a great deal of prompt engineering

You aren't allowed to infringe copyrights just because you make it difficult to do so. OpenAI's system should not be making verbatim copies at all.

AnthonyMouse · 2025-02-05T20:16:26 1738786586

It's probably worth considering how the thing actually works.

LLMs are sort of like a fancy compression dictionary that can be used to compress text, except that we kind of use them in reverse. Instead of compressing likely text into smaller bitstrings, they generate likely text. But you could also use them for compression of text because if you take some text, there is highly likely a much shorter prompt + seed that would generate the same text, provided that it's ordinary text with a common probability distribution.

Which is basically what the lawyers are doing. Keep trying combinations until it generates the text you want.

But the ability to do that isn't really that surprising. If you feed a copyrighted article to gzip, it will give you a much shorter string that you can then feed back to gunzip to get back the article. That doesn't mean gunzip has some flaw or ill intent. It also doesn't imply that the article is even stored inside of the compression library, rather than there just being a shorter string that can be used to represent it because it contains predictable patterns.

It's not implausible that an LLM could generate a verbatim article it was never even trained on if you pushed on it hard enough, especially if it was trained on writing in a similar style and other coverage of the same event.

freejazz · 2025-02-05T20:42:54 1738788174

> It's not implausible that an LLM could generate a verbatim article it was never even trained on if you pushed on it hard enough, especially if it was trained on writing in a similar style and other coverage of the same event.

That'd be a coincidence, not a verbatim copy. Copyright law doesn't prohibit independent creation. This defense isn't available to OpenAI because there is no dispute OpenAI ingested the NYTimes articles in the first place. There is no plausible way OpenAI could say they never had access to the articles they are producing verbatim copies of.

Rather than sneeringly explain away how LLMs work without any eye towards the laws at issue, maybe you should do yourself the favor of learning about them so you can spare us this incessent "no let me explain how they work, it's fine I swear!" shtick.

AnthonyMouse · 2025-02-05T21:08:15 1738789695

> That'd be a coincidence, not a verbatim copy.

It would be both. Or to put it a different way, how would you distinguish one from the other?

> This defense isn't available to OpenAI because there is no dispute OpenAI ingested the NYTimes articles in the first place.

The question remains whether ingesting the article is the reason it gets output in response to a given prompt, when it could have happened either way.

And in cases where you don't know, emitting some text is not conclusive evidence that it was in the training data. Most of the text emitted by LLMs isn't verbatim from the training data.

> Rather than sneeringly explain away how LLMs work without any eye towards the laws at issue, maybe you should do yourself the favor of learning about them so you can spare us this incessent "no let me explain how they work, it's fine I swear!" shtick.

This is a case of first impression. We don't really know what they're going to do yet. But "there exists some input that causes it to output the article" isn't any kind of offensive novelty; lots of boring existing stuff does that when the input itself is based on the article.

freejazz · 2025-02-06T15:31:33 1738855893

>It would be both. Or to put it a different way, how would you distinguish one from the other?

No, it's not both. Have you engaged in any effort to understand the law here? Copyright doesn't prohibit independent creation. I'm not sure how much more simple I can make that for you. In one scenario there is copying, in the other there isn't. The facts make it clear, when something is copied it is illegal.

>The question remains whether ingesting the article is the reason it gets output in response to a given prompt, when it could have happened either way.

This can't actually be serious? This isn't credible. You are saying there is no difference between ingesting it and outputting the results vs not ingesting it and outputting the results. Anything to back this up at all?

>This is a case of first impression. We don't really know what they're going to do yet. But "there exists some input that causes it to output the article" isn't any kind of offensive novelty; lots of boring existing stuff does that when the input itself is based on the article.

"First impression" (something you claim) doesn't mean ignore existing copyright law. One side is arguing this isn't first impression at all, it's just rote copying.

> But "there exists some input that causes it to output the article" isn't any kind of offensive novelty

You said its novel, I called it plain copying.

>lots of boring existing stuff does that when the input itself is based on the article.

You are saying its first impression... not me.

hiatus · 2025-02-05T19:02:43 1738782163

> and that no one uses ChatGPT in this manner

Someone did though and was able to get verbatim reproductions of NYT articles out of it.

> Furthermore the verbatim reproductions that ChatGPT did end up producing after these 10000 prompts are available on numerous public websites unaffiliated with NYT.

So what? NYT as a copyright holder might have no issue with those unaffiliated sites but have an issue with OpenAI.

amanaplanacanal · 2025-02-05T19:18:27 1738783107

This is a problem with copyright law. There is no way for an end user to determine the copyright status of anything on the Internet, you can only make an educated guess.

throwway120385 · 2025-02-05T19:37:44 1738784264

It's pretty simple in the US. A work has a copyright regardless of whether it's registered or a notice placed on the work. Registration provides easier means of asserting your copyright but you have a copyright as soon as you create the work. If I wrote a handwritten note about OpenAI on a cocktail napkin then I have copyright over that work barring some challenge to whether it's a "creative work" or not. It doesn't matter what the medium is, or how it's shared. The internet makes this challenging in that it's essentially a shared technical means of disseminating the work, but the work remains copyrighted no matter how publicly available it might or might not be. It's just a matter of the rights-holder asserting their right. Which is something NYT does with their paywall all the time.

amanaplanacanal · 2025-02-05T21:53:32 1738792412

As read you are asserting that public domain doesn't exist.

Not only that, if something is available on the Internet, and still under copyright, you have no way of knowing whether the website is authorized to distribute it or not.

hulitu · 2025-02-06T21:25:39 1738877139

> ChatGPT can not be used to reproduce NYT's articles verbatim without a great deal of prompt engineering

If i have a copy of a movie I am infridging copyright. Why is Open Ai special ?

Because it's Microsoft /s

jtbayly · 2025-02-05T19:07:48 1738782468

First sentence of second paragraph of the lawsuit: “Defendants’ unlawful use of The Times’s work to create artificial intelligence products that compete with it threatens The Times’s ability to provide that service.” First sentence of p7: “The Times objected after it discovered that Defendants were using Times content without permission to develop their models and tools.”

I think it’s ultimately about whether training on copyrighted content is legal or not.

Here are some other quotes from the lawsuit that approach it from a different angle: “These tools also wrongly attribute false information to The Times.” “By providing Times content without The Times’s permission or authorization, Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue.”

Even if the first argument fails, if the second argument wins, it still boils down to not being able to train on copyrighted content unless it is possible to train on copyrighted data without ultimately quoting that content or attributing anything to the author of that content. My (uneducated) guess is that’s not possible.

otterley · 2025-02-05T20:00:28 1738785628

> I think it’s ultimately about whether training on copyrighted content is legal or not.

It is.

The bulk of the complaint is a narrative; it's meant to be a persuasive story that seeks to put OpenAI in a bad light. You don't really get to the specific causes of action until page 60 (paragraphs 158-180). A sample of the specific allegations that comprise the elements of each cause of action are:

160. By building training datasets containing millions of copies of Times Works, including by scraping copyrighted Times Works from The Times’s websites and reproducing such works from third-party datasets, the OpenAI Defendants have directly infringed The Times’s exclusive rights in its copyrighted works.

161. By storing, processing, and reproducing the training datasets containing millions of copies of Times Works to train the GPT models on Microsoft’s supercomputing platform, Microsoft and the OpenAI Defendants have jointly directly infringed The Times’s exclusive rights in its copyrighted works.

162. On information and belief, by storing, processing, and reproducing the GPT models trained on Times Works, which GPT models themselves have memorized, on Microsoft’s supercomputing platform, Microsoft and the OpenAI Defendants have jointly directly infringed The Times’s exclusive rights in its copyrighted works.

163. By disseminating generative output containing copies and derivatives of Times Works through the ChatGPT offerings, the OpenAI Defendants have directly infringed The Times’s exclusive rights in its copyrighted works.

avbanks · 2025-02-05T19:11:18 1738782678

IMO the first argument is invalid, however, the second one is a completely valid argument.

pro14 · 2025-02-05T19:21:06 1738783266

> "Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue."

News flash: you can read newspaper articles at the library.

rrobukef · 2025-02-05T19:39:14 1738784354

Yes, and libraries pay for that access. They also don't obfuscate the origin or remove the advertising. Don't equate libraries with what OpenAI does.

pton_xd · 2025-02-05T19:44:18 1738784658

> News flash: you can read newspaper articles at the library.

Reading an article != selling a product that redistributes the article.

freejazz · 2025-02-05T20:44:32 1738788272

And its no coincidence that the NYTimes isn't suing OpenAI for reading newspaper articles at the library...

blackeyeblitzar · 2025-02-05T19:19:26 1738783166

I haven’t checked in on this case for a while, but aren’t there also many organizations that want OpenAI to win this case so that the concept of fair use is upheld?

1vuio0pswjnm7 · 2025-02-05T22:18:07 1738793887

If OpenAI's use of publishers' content required no permission then why did it consummate deals with all the publishers mentioned in the article, as well as seeking one with the NYT.

"Ask forgiveness, not permission" is supposed to be the Silicon Valley motto. But that's not what happening here. OpenAI is asking for permission. As with all the other publishers, OpenAI will have to pay. NYT reserves the right to set the price as high as it wishes. No doubt the price will be enough cover NYT's costs from this litigation. OpenAI will pay it.

How much has OpenAI spent on this litigation.

rustc · 2025-02-05T18:15:08 1738779308

I hope they don't settle early and we finally get an answer to whether training AI on copyrighted content is fair use or not.

seydor · 2025-02-05T19:53:33 1738785213

at this point we won't need to know because most AIs will be trained on generated data

n0rdy · 2025-02-05T19:02:49 1738782169

I like following the OpenAI vs. NYT case, as it's a great example of the controversial situation:

- OpenAI created their models by parsing the internet by disregarding the copyrights, licenses, etc., or looking for a law loopholes

- by doing that, OpenAI (alongside others) developed a new progressive tool that is shaping the world, and seems to be the next “internet”-like (impact-wise) thing

- NYT is not happy about that, as their content is their main asset

- less democratic countries, can apply even less ethical practices for data mining, as the copyright laws don't work there, so one might claim that it's a question of national defense, considering the fact that AI is actively used in the miltech these days

- while the ethical part is less controversial (imho, as I'm with NYT there), the legal one is more complicated: the laws might simply say nothing about this use case (think GPL vs. AGPL license), so the world might need new ones.

And so on...

screye · 2025-02-05T18:20:02 1738779602

I can't imagine a scenario where pre-training on someone else's works is fair-use, but distilling from a proprietary LLM isn't.

pkamb · 2025-02-05T18:35:41 1738780541

Is anyone building a public domain repository / AI training ground for old newspapers? Anything before 1930 has no restrictions. Newspapers.com has pretty good content but the interface and search is extremely lacking. Google News was abandoned a decade ago. This seems like something where AI could really help, for once. Not in training chatbots or whatever but actually just providing great search for articles in books, newspapers, and magazines.

bikeshaving · 2025-02-05T19:12:11 1738782731

There’s also a fascinating proposal I read somewhere where you create a training set with a knowledge cutoff of 1900 or 1930 and see if the resulting AI could predict the future or independently discover later scientific breakthroughs.

themarbz · 2025-02-05T18:42:44 1738780964

I'm imagining a model trained on pre-1930 data that only speaks "old-timey English"...

adonovan · 2025-02-05T19:10:18 1738782618

> How do I make an HTML view of a SQL database?

Well old chap, you'll need a shoeshine box full of vacuum tubes and some brass flanges...

TomatoCo · 2025-02-05T22:45:35 1738795535

I'm imagining a text-to-speech model that only speaks in the Transatlantic Accent

tombert · 2025-02-05T18:50:33 1738781433

I think that idea is capital. I'm really hoping that chatbots starting using 23-skidoo in conversations.

mrweasel · 2025-02-05T19:48:06 1738784886

The interface and search could probably be solved without the use of AI. Seems like mostly an OCR problem. Both ElasticSearch and Sphinx are already really good, and I'm sure that there are other open source or commercial search engines available, or hire ex-Google engineers, Google doesn't seem interested in search anymore.

pkamb · 2025-02-05T20:03:19 1738785799

Newspapers have nearly identical newswire columns printed in 100+ newspapers, but with slightly different headlines and content. Or OCR breaking due to words being physically next to each other but in separate stories. The Newspapers.com search has fine OCR but is difficult and time consuming to use because of those issues. Seems like something "AI" could solve easily.

ViktorRay · 2025-02-05T18:22:56 1738779776

Would anyone here be able to explain to me where this money is going? Are the lawyers working for the New York Times really this expensive? If so these lawyers must be getting massive amounts of money...

echoangle · 2025-02-05T18:35:23 1738780523

Is 10 million a year a lot for lawyers? I thought a partner at a large law firm might get $ 500k or more per year, so paying a few lawyers and the assistants for all of them can get expensive quickly.

nyssos · 2025-02-05T21:03:28 1738789408

> I thought a partner at a large law firm might get $ 500k or more per year

Easily.

light_triad · 2025-02-05T18:37:54 1738780674

$1000-$2000+/h is not uncommon for top lawyers

nimish · 2025-02-05T19:50:23 1738785023

NYT will lose:

Copyright only protects the actual text. LLMs have weights, not exact copies. In any case, saying "if I put in some input and get copyrighted output" is tantamount to copyright violations; if I use a generative tool and generate copyrighted info is it the tools fault?

An LLM is a dump of effectively arbitrary numbers that, when hooked up to a command line, uses one of the world's most awful programming languages to evaluate and execute.

OpenAI at most broke an EULA or some technicality on copyright w.r.t. local ephemeral copies. What's the damage to the NYT though?

kopecs · 2025-02-05T20:11:38 1738786298

> Copyright only protects the actual text. LLMs have weights, not exact copies.

Following this logic a lossily compressed image is completely unprotected by copyright.

> In any case, saying "if I put in some input and get copyrighted output" is tantamount to copyright violations; if I use a generative tool and generate copyrighted info is it the tools fault?

Do you not think this is obviously fact-specific? If I gzip a bunch of (copyrighted) files, then obviously that doesn't somehow make distributing them not infringement. If I now replace the tool = ungzip + input = files combination with tool = (ungzip and files) and input = (selection mechanism over files) do you think that in the second case distributing the tool is not infringement? I don't mean to say that any of these is precisely the same as the LLM case, but I think your argument is clearly overbroad.

> OpenAI at most broke an EULA or some technicality on copyright w.r.t. local ephemeral copies. What's the damage to the NYT though?

One obvious damage claim (if you are skeptical of market harm wrt newspaper/oneline sub sales) is that they were entitled to the FMV of licensing costs of the articles, which is not so hard to value: OpenAI has entered such agreements with AP and others. [0]

[0]: https://apnews.com/article/openai-chatgpt-associated-press-a...

dkjaudyeqooe · 2025-02-05T20:11:42 1738786302

Wrong. I can sample a sound off a record, convert it to any format, manipulate it until it's unrecognizable and I'll still have to pay royalties to the original copyright holder.

Even a translation of original text into another language is copyright infringement.

The real question is if LLMs are fair use, and on the basis of the standard tests for fair use, it seems quite doubtful.

dragonwriter · 2025-02-05T20:21:24 1738786884

> Copyright only protects the actual text.

Copyright protects against both derived works and copies in any form, including lossy or inaccurate copies that do not reach the originality level to be derived works, not just “exact copies”.

But that doesn't really matter, here, because OpenAI isn't being sued for producing and distributing an LLM (against a mere LLM distributor, NYT would have a much weaker case), they are being sued for providing a service which takes in copyrighted works and spits out copies, both exact and not, that are well within the established coverage of what is a copyright violation that does not fall within exceptions like fair use. and when they control the whole path in between original and copy, then the path in between is largely immaterial.

Its not an “is training AI on copyright protected works fair use” case, its an “is producing copies well within the established parameters of commercial copright violation rendered fair use by sticking an LLM in the middle of the process as part of the mechanism of copying” case.

otterley · 2025-02-05T20:02:35 1738785755

To train the model, OpenAI had to make a copy of NYT's works in order to do it. (Running a scraper to dump websites onto your local storage is making a copy.) NYT's first theory is that the act of copying is a prima facie copyright violation.

gotoeleven · 2025-02-05T19:21:54 1738783314

Are they paying the lawyers with government money? I'm seriously asking. Why is the government paying 10s of millions of dollars/year to the New York Times? How can they still claim to be a news organization without having disclosed this? If the government is paying the NYT, then don't their productions belong in the public domain?

https://x.com/stillgray/status/1887191056074350690

delecti · 2025-02-05T20:02:13 1738785733

That suggestion seems rather conspiratorial. Do you have any reason to think that's the case, or are you just throwing it out as a wild possibility?

Also, has anything changed WRT Ian Miles Cheong's credibility? He's been a far-right grifter for years, I wouldn't trust any data he puts out without a corroborating source.

dkjaudyeqooe · 2025-02-05T20:20:46 1738786846

(1) Apparently (2) Why don't you ask the NYT or the departments in question (eg through a FOI request)? (3) The NYT sells things, also to the government, so what? Why should they disclose this? (4) No, not by default. Depends on the circumstances.

You make insinuations without a shred of relevant evidence. Your tinfoil hat is on too tight.

dkjaudyeqooe · 2025-02-06T19:16:50 1738869410

If you're wondering what this nonsense is about, it's a conspiracy theory straight from Trump's mouth: false claims that Democrats paid for positive coverage. The truth? Subscriptions. What a surprise.

https://www.nytimes.com/2025/02/06/business/trump-politico-u...

SebFender · 2025-02-06T01:33:37 1738805617

"OpenAI asserts that training AI models using publicly accessible content, including material from The New York Times, is protected under longstanding fair use principles."

Incredible.

The foundation of fair use is a transformative and non-consumptive use of copyrighted material.

tester756 · 2025-02-05T18:19:34 1738779574

Why is it THAT expensive?

tintor · 2025-02-05T18:23:06 1738779786

Lawyer time

$10.8 ~ 135 days * 8 hours * $1000/h x10

tester756 · 2025-02-05T18:29:28 1738780168

1000/h?

why they're this expensive?

echoangle · 2025-02-05T18:36:26 1738780586

Because the times is trying to get millions or billions in compensation if they win, why would you lower your odds of winning by getting cheaper lawyers just to save a few 100 k?

songshu · 2025-02-05T18:42:35 1738780955

They have, let’s not call it a union to not upset people, but let’s say a collective agreement that they won’t work for capital. They will only work for other lawyers. So there’s no Walmart Law or other enterprise selling legal services for cheap.

samvher · 2025-02-05T19:13:12 1738782792

I'm trying to parse the idea of "a collective agreement" but can't fully wrap my head around how that would work.

It seems to me more like the lack of a "Walmart Law" is a result of e.g. lack of economies of scale and other economic structure, rather than some collective agreement. (If it was profitable to break out of that agreement and start a "Walmart Law", it seems we'd see that happen pretty quickly?)

But if you know more about this and I'm off the mark I'd love to learn

songshu · 2025-02-05T19:45:39 1738784739

I looked it up. Rule 5.4 of the American Bar Association.

echoangle · 2025-02-05T21:13:13 1738789993

https://www.americanbar.org/groups/professional_responsibili...

HDThoreaun · 2025-02-05T19:59:39 1738785579

Every fortune 500 company has a team of lawyers working for it.

jcranmer · 2025-02-05T19:51:06 1738785066

They're not.

Median lawyer rates are more like $200-300/h, with variations depending on locality--a lawyer in NYC is going to be much more expensive than a lawyer in middle-of-nowhere, Kentucky.

As for why they're expensive, part of the answer is because legal training (i.e., law school) is expensive, and lawyers have to pay their student debt.

tptacek · 2025-02-05T23:55:49 1738799749

My understanding of the price/scarcity dynamics here are that people want lawyers with significant experience, there are a relatively small number of openings every year for BigLaw associates that provide access to that kind of experience, and they hire almost exclusively from the T14 schools.

If you want a random lawyer, you can get them very cheap. But if you're doing M&A or serious corp litigation, you're looking at a much smaller and more expensive pool of candidates. I was shocked by how much our M&A lawyers cost.

All this is to say: the median might not be telling most of the story here.

nyssos · 2025-02-05T21:05:50 1738789550

Lawyer rates are extremely high variance, and the NYT is not hiring anywhere near the median. White shoe firms break $1000/hr routinely.

hlynurd · 2025-02-05T18:31:47 1738780307

Because people are willing to pay that amount

jfkrrorj · 2025-02-05T18:42:00 1738780920

That is pretty cheap for NYC lawyer.

seydor · 2025-02-05T19:54:30 1738785270

an AI lawyer could work 24/365

user3939382 · 2025-02-05T19:15:05 1738782905

My ideal solution would be to public domain anything NYT has written in the past, turn it over to archive.org, and dismantle NYT so it’s no longer an issue in the future.