
I don't understand why it's even a question that Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from their LLaMA paper [Touvron et al., 2023]:

> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.

Following that reference:

> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).

(Presser, 2020) refers to https://twitter.com/theshawwn/status/1320282149329784833. (Which funnily refers to this DMCA policy: https://the-eye.eu/dmca.mp4)

Furthermore, they state they trained on GitHub, web pages, and arXiv, all of which contain copyrighted content.

Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that was trained using copyrighted material? That it was trained on copyrighted material is certain.

[Touvron et al., 2023] https://arxiv.org/pdf/2302.13971

[Gao et al., 2020] https://arxiv.org/pdf/2101.00027



Critically, by torrenting they also directly distributed the copyrighted material itself. That is a standalone infringement, separate from any argument about trained LLMs.


They could have leeched only and refrained from sharing any part of the copyrighted data. If I were to commit something as risky as this, that is what I would do.


Then it would need to be determined whether that is the case or not. Did every single machine they used have the configuration for only leeching and no seeding? The company is liable for what its employees do on the job. If even one employee was also seeding ... that could be a very interesting case.


> Did every single machine they used have the configuration for only leeching and no seeding?

I would certainly assume so. It's incredibly obvious that's what you would want to do from a legal standpoint.

> If only one employee was also seeding ... that could be a very interesting case.

The torrenting wouldn't be done casually by employees acting on their own. And it's not like multiple employees are doing it simultaneously, unsupervised, on their personal computers.

This is part of an official project. They'd spin up a machine just to download the torrent, being careful to disable seeding.

This is Meta. They have lawyers involved and advising. This isn't a teenager who doesn't fully understand how torrenting works.


Did you not read the article? There are quotes from Meta employees doing exactly what you claim they wouldn't do.

> This is part of an official project. They'd spin up a machine just to download the torrent, being careful to disable seeding.

From the article:

> "Torrenting from a corporate laptop doesn’t feel right," Nikolay Bashlykov, a Meta research engineer, wrote in an April 2023 message, adding a smiley emoji. In the same message, he expressed "concern about using Meta IP addresses 'to load through torrents pirate content.'"

You also claim they would be "careful to disable seeding", but we know they did in fact seed (and anyone who uses private trackers knows they couldn't get away with leeching for very long before being kicked off):

> Meta also allegedly modified settings "so that the smallest amount of seeding possible could occur," a Meta executive in charge of project management, Michael Clark, said in a deposition.


Seeding can be trivially faked to trackers.

https://github.com/slundi/RatioUp

https://github.com/anthonyraymond/joal

http://ratiomaster.net/

The smallest amount of seeding possible would be metadata, presumably not subject to copyright.
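The reason faking works is that the BitTorrent tracker protocol (BEP 3) has the client self-report its upload total in each announce; the tracker has no way to verify the number. A minimal sketch of the announce query a client builds (the info hash and peer ID below are made up, and nothing is actually sent over the network):

```python
# Sketch of a BitTorrent tracker announce query (per BEP 3).
# The key point: "uploaded" is whatever the client claims it is,
# which is why ratio-faking tools like the ones linked above work.
from urllib.parse import urlencode

def build_announce_query(info_hash: bytes, peer_id: bytes,
                         uploaded: int, downloaded: int, left: int,
                         port: int = 6881) -> str:
    """Build the query string a client sends to a tracker's announce URL."""
    return urlencode({
        "info_hash": info_hash,   # 20-byte SHA-1 of the torrent's info dict
        "peer_id": peer_id,       # 20-byte ID chosen by the client itself
        "port": port,
        "uploaded": uploaded,     # bytes uploaded -- purely the client's claim
        "downloaded": downloaded,
        "left": left,             # bytes remaining; 0 means "I'm seeding"
    })

# A client that never uploaded a byte can still announce any ratio it likes:
query = build_announce_query(b"\x12" * 20, b"-FAKE01-000000000000",
                             uploaded=10**9, downloaded=10**9, left=0)
print(query)
```

Private trackers add heuristics (cross-checking totals between peers, flagging impossible speeds), but the protocol itself takes the client's word for it.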


And punishing them in the normal manner would be an incredibly small slap on the wrist, and would do absolutely nothing to help us find out how a fair-use defense for training AI on copyrighted material will play out in court.


Isn't there a "fruit of the poisonous tree" kind of thing? Sounds to me quite similar to the situation where you murder your parent and stand to inherit from them: inheriting stuff isn't illegal, yet I think most jurisdictions would not allow you to keep it in this case.

There should be a problem with stuff obtained through illegal means, even if having that stuff is in principle legal. In this case, copyrighted material.

Obviously they would argue that having the data is only a consequence of the download part, and that part is legal. What I see is that these situations are always complicated, and if you're rich enough, you get to litigate the complications and come out with a slap on the wrist or maybe even clean hands, while if you are an ordinary citizen, you can't afford to delve into the complexities and get punished.

These days I'm starting to give up on the whole concept of the legal system being fair. They're not even pretending anymore.


There are two different things when it comes to discussing training LLMs on "copyright"-protected data, and I almost never see people differentiate.

1.) Training on copyrighted material that is publicly available. You write a poem and publish it online for the world to read. That is your IP; no one else can take it and sell it, but they are free to read it and be inspired by it. The legality of training on this is in the courts, but so far seems to be going in favor of LLMs.

2.) Training on copyrighted material that is not publicly available. These are pretty much pirated works, or works obtained through a backdoor to avoid paying for them. Your poem is behind a paywall and you never got paid, yet the poem is known by the LLM. This is just straight-up illegal, as you legally must pay to view the work. However, there might be conditions here too, like paying for access to an archive and then training on everything in it.


I never gave my poem to Facebook. My site is for humans. And there was absolutely no problem with that website being public, until Facebook et al. wanted to move the goalposts.. again. Remember when companies started to claim that their abuse is on you, because you failed to publish the correct headers/robots.txt and their bot needs to be told the rules in a specific language? And now we get the same attempt at making such a distinction again, just this time it's our fault for .. having a public website in the first place (should have operated a paywall, duh!)


3.) The company making an unauthorized copy of your work and storing it permanently in a giant corporate library of their own making which they refer to over and over.

This is distinct from (1) where the content is streamed or only ephemeral/incidental copies are made.


The very idea that LLMs are "inspired" by copyright material is so far beyond absurd I just don't know what reality you people live in. They are ingesting copyright material in order to re-use it. Yeah they remix it to add their own (incredibly annoying) tone but that's what they're doing.


good distinction

IMO there's a hack around this:

authors can state that they allow public use unless the work is used for training LLMs. Then all training work would fall under (2), because it would be done against the terms of the copyright holder.


I think they would need to have some explicit contract every time they want to sell the book, though. I don't think I am bound by some random terms someone writes into a book I'm buying. Those are probably only binding if a reasonable person would notice them before the sale.


If you arrive at the point of being able to buy that book, it has passed through the publisher's hands, and I would think that means the publisher was OK with those terms, so limiting the usage of the text may in fact be effective. If it was self-published, then even more so.


But the license restriction would have to apply both to the publisher and the customer.

If I go to the bookstore, buy the book, make a scan, and train an LLM with it, how would you enforce your license as an author? The customer never knew that he shouldn’t have been allowed to train LLMs.

Edit: I think I misunderstood the original comment, I thought the idea was to sell books and restrict use for LLM training. If we’re only talking about stuff that’s publicly released, the restriction should be possible.


Whether you make a scan of it or not, the license applies to the IP, I guess (IANAL).

Whether the shop makes a scan should not affect you as the buyer of the actual book. What does the scan have to do with you?

Whether or not the author learns about that scan, and perhaps about the training of some LLM using the scan, does not change the legality of it.


But the license doesn’t apply to me as a customer if I can’t be expected to even notice it. If I buy a book in a bookstore, no one would assume that training LLMs on it would be explicitly forbidden. And adding a note to the book would probably not be binding because no one is expected to read the legal notice in a book.


Ah, I assumed that the clauses regarding use for training an LLM are printed inside the book somewhere.


It would still be unenforceable because there's no consideration.

There is nothing of value that the license gives me that I wouldn't already have if the contract didn't exist. I can already read the book, merely by having it in front of me.


How does that give you the right to train an LLM on it?

Or are we talking about training an LLM on it and never releasing that LLM to anyone ever? Then I guess it wouldn't matter. But if that LLM is released to anyone, shouldn't the author of the book have a say on it?


> How does that give you the right to train an LLM on it?

Fair use gives me that right, not a contract or license.


Whether that falls under fair use is highly debatable.


It's going through the courts right now. We'll probably have an answer in a year or two.


I felt for a long time that it should be fair use. If an LLM can abstract what it learns from the copyrighted work, then that seems "fair" because that's what humans do.

But ... as I've thought about it more, it doesn't really feel just to me. The kind of value reaped from the works seems to suggest that the creator is due some portion of that value. Also, in practice, there's just an absolutely enormous amount of knowledge that can be consumed from the public domain. And if Meta, OpenAI and friends licensed even a small handful of the long-term archives of some globally-read newspapers, they could get very broad and deep knowledge about the events, trends, and terminology of the last century to fill in a lot of gaps.


I'm not sure there's any legal distinction though.

Is a book publicly available? No, you have to purchase it. But once you do, you're legally allowed to let your friends and family and so forth read it too. As long as you don't sell copies of it (the "copy" part of "copyright"), or meaningfully take away the ability for the publisher to make money from sales (so you can't post it for the whole world to see on the internet).

And sure, there are lots of ToS for digital works, but are they actually enforceable? ToS can say you're not allowed to let anyone else read the book you purchased. But no court is going to say you can't lend your Kindle to your friend for them to read it too. Many ToS clauses are flat-out illegal.

Meta will argue that training on books is no different from reading all the books at a friend's house. That as long as Meta isn't reselling or making publicly available the original text, they're in the clear.


I don't know what the deal is in the relevant jurisdictions, but in Swedish copyright law, the provenance of the original matters ("lovlig förlaga").

This means that it's not legal to download a rip of e.g. a CD that was uploaded without consent, even if you own a copy.

(This exception to the general right to make copies for private use was added in 2005 to make downloading illegal -- previously, only uploading was infringing.)

I would assume just the act of downloading this content was illegal in the relevant US jurisdictions as well.


I believe the most famous cases in the US have only gone after the people sharing, seeding, or uploading content. My ISP couldn't care less what I download from Usenet, but they will definitely care when I start seeding.


But they are making unauthorized copies: their training data set is analogous to a private collection of duplicates.

What do you think copyright law(suits) would do if a regular person made copies of every book and movie and song they saw, placing the duplicate media in a room of their house?


Trained on doesn't mean significant inclusion in the final state.

Is it truly a violation of copyright when a user hacks out bits and pieces of easily restyled raw data points from a model to look samey? What about if it takes two models? It might be time to accept that humans are just cooked in their ability to discern attempts at direct plagiarism, just as it is hard to discern the Sky voice from the Her voice.



