
I find that thing where people say "I'm not going to publish anything creative ever again, it'll just be used to train AI" so depressing.

It feels like such a dismal excuse to avoid adding any value to the world.




You calling it a "dismal excuse" is an emotional hand-wavy dismissal that doesn't actually answer the fear or solve the problem.

Moreover, "the world" isn't a thing that you can add value to. You can add value to other people by sharing works with friends, or you can add value to AIs that are going to be used to replace you without getting compensated for it. One of those should be obviously bad.


The world is already flooded with more "artistic value" than you could ever experience in your life. This market has been over-saturated forever, and it's all free of charge because "building a portfolio" for a chance of getting paid later is the go-to strategy. It was only a matter of time before a new predator (AI data collection) arrived to exploit this situation. A handful of artists reacting to it shouldn't be alarming. Good for them - not because more free publishing would hurt them, but because their time is better invested elsewhere than in whatever marginal benefit posting the 50th free artwork might bring.


It's not an excuse, it is a reality. Why spend your personal time and effort for someone else with a deeper pocket to automatically extract value from your work?

There is certainly a line where, if you're popular enough and have significant Google juice, you'll still get organic traffic, but many small bloggers can go their entire posting history without getting more than a smattering of hits, and now ChatGPT is taking away even that.


"Why spend your personal time and effort for someone else with a deeper pocket to automatically extract value from your work."

That's the exact attitude I'm talking about.

Because creating things is good! Because it's good to put value out there into the world, even if someone else might also use it.


I think it can significantly change the harm-benefit calculus. (But I'd love to be wrong.)

In the past, I could be fairly confident that if someone else uses my work (and I want them to do that! that's the point of sharing!) the good that it causes will outweigh the bad that it causes. It's not like I'm helping people make missiles.

But now it's entirely possible (especially if my content is unpopular, such that LLMs make a larger proportion of its readers) that the bad outweighs the good, given the negative effects that LLMs have had and continue to have on our world.


Missiles save lives too.


> Why spend your personal time and effort for someone else with a deeper pocket to automatically extract value from your work?

People releasing their code under MIT or BSD licenses might be able to give good answers to this.


Good answers like "it looks cool on my CV that big company XYZ uses my MIT-licensed script"


It's extremely dishonest to compare someone voluntarily releasing their work under a permissive license with someone who is involuntarily having their content and effort stolen by an organization training an AI.


And I think it's erroneous to say having publicly disseminated content being read in an LLM training process is "stealing."

If I read a publicly distributed, copyrighted blog post of yours, learn something, then use that knowledge later on, did I steal your content?

If an author distributes something in public, the public is allowed to read it and learn from it, whether with their eyes, a screen reader, AI agent, or whatever. Any copyright violation occurs if they attempt to republish your content, not in the reading of it.

However, scraping illegally obtained non-public material - such as books an author is trying to sell or blog posts behind a paywall - could well be a violation unless access is obtained legally.


> And I think it's erroneous to say having publicly disseminated content being read in an LLM training process is "stealing."

It's clearly theft-adjacent. You're free to use "piracy" if you want, as long as it's clear that it's illegal and morally on the level of theft.

> If I read a publicly distributed, copyrighted blog post of yours, learn something, then use that knowledge later on, did I steal your content?

It's also extremely dishonest to compare AI to humans like this. AI are not people - morally, socially, biologically, or legally. What a human does with a piece of content is utterly irrelevant to the process of training an AI.

> If an author distributes something in public, the public is allowed to read it and learn from it, whether with their eyes, a screen reader, AI agent, or whatever.

Again - very dishonest to conflate a pre-trained AI agent (such as OpenAI's Operator) with the training process.

> Any copyright violation occurs if they attempt to republish your content, not in the reading of it.

OK, this is just factually incorrect. It is a violation of copyright law to make copies of copyrighted content, with very limited and case-by-case fair use exceptions - the claim that violation only happens in the republishing case is completely false.

This entire defense is a mix of deceptive and flat-out factually incorrect statements.


Your repeated use of the word 'dishonest' seems odd to me. I infer you think I'm making arguments disingenuously and without believing in them and/or manipulating the truth. I can reassure you this is not the case. I sincerely believe you are making your own arguments honestly also, and am engaging with them in that spirit.

> as long as it's clear that it's illegal and morally on the level of theft.

It's not clear. I do not consider training an LLM on publicly disseminated text to be "morally on the level of theft." Stealing my car, or even a pen off my desk, is a much more reprehensible action than slurping everything I've shared in public into an LLM training process, purely IMHO.

> It's also extremely dishonest to compare AI to humans like this. AI are not people - morally, socially, biologically, or legally. What a human does with a piece of content is utterly irrelevant to the process of training an AI.

People or corporations (which are usually treated as person-like) operate training processes and are morally and legally responsible for them. I believe training an LLM is "a human/corporation doing something" with a piece of content.

> Again - very dishonest to conflate a pre-trained AI agent (such as OpenAI's Operator) with the training process.

Again, I am being honest. Whether I let an AI agent read your blog post or whether I write a program to read it into an LLM fine-tuning process seems immaterial to me. I am open to being convinced otherwise, of course.

> It is a violation of copyright law to make copies of copyrighted content, with very limited and case-by-case fair use exceptions

One of those exceptions (in many jurisdictions) is making temporary copies of data to use in a computational process. For example, browser caching, buffering, or transient storage during compression/decompression.

While many of the "pile"-style permanently stored and redistributed datasets are more than likely violating copyright, that's not inherent to the process of training an LLM, which is the topic of this thread. If copyright holders want to go after anyone and have success doing so, I believe they should go after those redistributing their content in such datasets, not those merely training LLMs, which is not, in and of itself, a violation of any law I can identify.


> It's not clear. I do not consider training an LLM on publicly disseminated text to be "morally on the level of theft." Stealing my car, or even a pen off my desk, is a much more reprehensible action than slurping everything I've shared in public into an LLM training process, purely IMHO.

The theft is that of effort, in the exact same (or a worse) sense as pirating media or stealing IP from a company.

It takes effort to write. That effort is stolen during the training process - the LLM could not exist without the work of the authors whose content it is trained on, and the same LLM can then be used to replace those authors and automate away their work (and jobs). Which is worse - to have your car stolen (which is very bad, I'm not arguing that it isn't), or to lose your job and not be able to afford anything?

Alternatively, if you believe that it's not bad to take someone's effort without their consent and without compensating them for it, then you shouldn't object to your employer withholding wages from you, or a client refusing to pay you, on the same principle.

> People or corporations (which are usually treated as person-like) operate training processes and are morally and legally responsible for them. I believe training an LLM is "a human/corporation doing something" with a piece of content.

That's not reasonable, and most people do not share your opinion (including the relevant group, which is the authors of the content being trained on). That's equivalent to saying that a human writing a program to perfectly reproduce a copyrighted work (e.g. print out the complete text of Harry Potter) is a human "doing something" with that copyrighted work (in the same class as reading Harry Potter).

> Whether I let an AI agent read your blog post or whether I write a program to read it into an LLM fine tuning process seems immaterial to me.

Those are categorically different. The vast majority of the population (again, including those writing the works that are being trained on without their consent) will agree that they are categorically different and incomparable, and they are logically, legally, and morally distinct.

> One of those exceptions (in many jurisdictions) is making temporary copies of data to use in a computational process. For example, browser caching, buffering, or transient storage during compression/decompression.

To use in specific computational processes for which you do not store the output because the output is subject to the same copyright laws. The implicit premise when you talk about training is that you're going to save the trained model, so this obviously doesn't apply, in the same sense that if you take a copyrighted work and transcode it, the transcoded output is subject to the exact same set of copyright laws as the original.

> not those merely training LLMs which is not, in and of itself, violating any laws I can establish.

That's the "law is morality" fallacy. Morally, this is clearly wrong, the point of the copyright system is to prevent exactly things like this from happening. The courts have not yet decided whether training an LLM is "copying" a copyrighted work, but if they do, then it's clearly illegal.


I appreciate your arguments and know they are in good faith. I think we would have an edifying debate in person!

I'm not going to reply to everything as I think our viewpoints are tricky to reconcile, since we find different things to be moral/immoral. That's fine, but it might not be productive. However, I acknowledge your position and know it reflects much popular sentiment; I cannot dispute that.

> if you believe that it's not bad to take someone's effort without their consent and without compensating them for it, then you shouldn't object to your employer withholding wages from you

I think this gets to the crux of our difference. Employment is an explicit contract that binds two parties to honor their obligations. If someone posts a blog post openly, busks in the street, or does some graffiti art, I don't think observers have any obligations beyond an implicit idea of "experience this in any way you like as long as it's legal". Whether you prefer 'legal' or 'moral' there, it brings us back to the problem that we disagree on the morality/legality of the core issue. Given the constraints of this venue, not to mention our time, I'm happy to recognize this difference and leave it unsettled.

That's the "law is morality" fallacy. Morally, this is clearly wrong, the point of the copyright system is to prevent exactly things like this from happening. The courts have not yet decided whether training an LLM is "copying" a copyrighted work, but if they do, then it's clearly illegal.

If that should come to pass, I agree. However, your suggested fallacy then cuts the other way: a legal precedent being set would not change my opinion that the practice is not immoral. That is a point on which we clearly differ, and one that would be fascinating to debate in a more appropriate venue; I might even be won over, but no argument has managed it so far.


*value to a private corporation that'll keep all of the profits, not pay for the environmental impact and then lobby lawmakers to stay untouchable.

You can still write, paint, compose and whatnot to create "value" – just don't put it on the Internet for scraping.
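
For anyone who wants to keep writing online anyway, the usual first step is asking the training crawlers to stay away via robots.txt. A minimal sketch, using the user-agent tokens the major crawlers document (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's training pipeline); compliance is entirely voluntary, so treat it as a polite request rather than protection:

  # robots.txt - asks known AI training crawlers not to fetch anything.
  # Honoring this file is voluntary; a scraper can simply ignore it.
  User-agent: GPTBot
  Disallow: /

  User-agent: CCBot
  Disallow: /

  User-agent: Google-Extended
  Disallow: /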


To me, the fact that a blog post would be used to train AI is a good thing. Hell yes I want my writing to inform the future zeitgeist! I guess it helps that the things I want to write about are novel things no-one has ever written about. I could see how AI would demoralize me if I were otherwise employed writing Generic Politics Blog #84773. But as someone who writes original unique content, I'm like, hell yes, the more readers the merrier, whether they be human or AI or some unholy combination!



