The mere notion that you can copyright or trademark combinations of words to procure legal ownership of those specific combinations is patently absurd. That ownership grants you the ability to forcefully take the property of other people who repeat those same words, via a formally codified and enforced system of wage theft (court-awarded "damages" to the author), and it should be an affront to the sensibilities of any person who's ever had to work for a wage.
People have written books. Those books were copied into something called the "Books3 section of ThePile". Now those people are trying to sell their books, but nobody needs to buy them, because one can download ThePile.
This alone is awful and wrong, and we all know this.
Now comes Facebook, and builds an LLM. The LLM can now write somewhat nice-sounding books in no time. The people who wrote those books, and put all the work, time, etc. into them, are angry. And they are right.
It is very telling how many words and arguments are used here to make Facebook's behavior sound like a good thing, or a necessity.
There are at least 3 distinct things in your post.
1. Authors' works were (are?) illegally distributed in 'Books3': This seems to clearly be a copyright infringement. That being said, I'd be shocked if someone could prove this copyright infringement made an appreciable impact on an author's income. Something tangible, not a hand-wavey "but they would have bought my book". I know that I don't go out downloading books from places like ThePile if I want to read something. I'd wager most people don't.
2. Facebook (et al.) acquired illegal collections of books vs legally acquiring the books: If that is true (seems likely) then they should suffer punishment for the acquisition. That being said, it's the same cudgel that'd be used to sue individuals into oblivion, so the end outcomes might be less constructive for the rest of humanity. I do feel like corporations involved in mass copyright infringement should be held accountable though.
3. Facebook (et al.) trained foundational models on collections of books they acquired (disconnect #2 from #3 here): I'd argue strongly that the foundational model training is not a copyright violation. It's not storing copyrighted works and distributing them. It is using them to model language patterns and token frequencies that could be used to create an approximation of a copyrighted work if the training was poor and you prompt it properly. There are plenty of experts in this matter who could discuss this in depth, but the essence of the fight boils down to whether you believe that the copyrighted works are being distributed via these trained models or not. Now, if people want to change copyright law such that there are specific laws around how ML models are trained on copyrighted works, then perhaps this problem gets resolved in one direction or the other. Until then, all parties are just talking past each other and hoping the courts eventually agree with their arguments.
> Now those people are trying to sell their books, but nobody needs to buy them, because one can download ThePile.
Approximately nobody gets their reading material from ThePile. People buy books from book shops or from Amazon, borrow them from a library or from a friend, or buy them from a second-hand shop.
> The LLM can now write somewhat nice-sounding books in no time.
That is again not a thing. People don’t read longform LLM output instead of published books.
They're angry because the LLM decreases the expected value of a leisure hobby they've invested a lot of time into. Yes, I get it, writing can be good as an art form and art is special and should be cherished.
Also, for what it's worth, Facebook (which should be shut down for unrelated reasons) isn't the only one doing this.