> And I think it's erroneous to say having publicly disseminated content being read in an LLM training process is "stealing."

It's clearly theft-adjacent. You're free to use "piracy" if you want, as long as it's clear that it's illegal and morally on the level of theft.

> If I read a publicly distributed, copyrighted blog post of yours, learn something, then use that knowledge later on, did I steal your content?

It's also extremely dishonest to compare AI to humans like this. AIs are not people - morally, socially, biologically, or legally. What a human does with a piece of content is utterly irrelevant to the process of training an AI.

> If an author distributes something in public, the public is allowed to read it and learn from it, whether with their eyes, a screen reader, AI agent, or whatever.

Again - very dishonest to conflate a pre-trained AI agent (such as OpenAI's Operator) with the training process.

> Any copyright violation occurs if they attempt to republish your content, not in the reading of it.

OK, this is just factually incorrect. It is a violation of copyright law to make copies of copyrighted content, with very limited and case-by-case fair use exceptions - the claim that violation only happens in the republishing case is completely false.

This entire defense is a mix of deceptive and flat-out factually incorrect statements.



Your repeated use of the word 'dishonest' seems odd to me. I infer you think I'm making arguments disingenuously, without believing in them, or manipulating the truth. I can assure you this is not the case. I sincerely believe you are making your own arguments honestly also, and am engaging with them in that spirit.

> as long as it's clear that it's illegal and morally on the level of theft.

It's not clear. I do not consider training an LLM on publicly disseminated text to be "morally on the level of theft." Stealing my car, or even a pen off my desk, is a much more reprehensible action than slurping everything I've shared in public into an LLM training process, purely IMHO.

> It's also extremely dishonest to compare AI to humans like this. AIs are not people - morally, socially, biologically, or legally. What a human does with a piece of content is utterly irrelevant to the process of training an AI.

People or corporations (which are usually treated as person-like) operate training processes and are morally and legally responsible for them. I believe training an LLM is "a human/corporation doing something" with a piece of content.

> Again - very dishonest to conflate a pre-trained AI agent (such as OpenAI's Operator) with the training process.

Again, I am being honest. Whether I let an AI agent read your blog post or whether I write a program to read it into an LLM fine-tuning process seems immaterial to me. I am open to being convinced otherwise, of course.

> It is a violation of copyright law to make copies of copyrighted content, with very limited and case-by-case fair use exceptions

One of those exceptions (in many jurisdictions) is making temporary copies of data to use in a computational process. For example, browser caching, buffering, or transient storage during compression/decompression.
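
For a concrete (if simplified) example of what I mean by a transient copy, here is a minimal Python sketch - the filename is hypothetical, and the copies exist only in memory for the duration of the computation:

    import zlib

    # Reading makes a transient in-memory copy of the published text.
    with open("blog_post.txt", "rb") as f:  # hypothetical filename
        data = f.read()

    # Compressing and decompressing make further transient copies as
    # part of the computation; none of them is ever written to disk.
    assert zlib.decompress(zlib.compress(data)) == data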

While many of the "pile"-style permanently stored and redistributed datasets more than likely involve copyright violation, that's not inherent to the process of training an LLM, which is the topic of this thread. I believe that if copyright holders want to go after anyone, and have success in doing so, they should go after those redistributing their content in such datasets, not those merely training LLMs, which is not, in and of itself, violating any laws I can establish.


> It's not clear. I do not consider training an LLM on publicly disseminated text to be "morally on the level of theft." Stealing my car, or even a pen off my desk, is a much more reprehensible action than slurping everything I've shared in public into an LLM training process, purely IMHO.

The theft is that of effort, in the exact same (or a worse) sense as pirating media or stealing IP from a company.

It takes effort to write. That effort is being stolen by an LLM during the training process - the LLM cannot possibly exist without the work done by the authors whose content it is trained on, and the LLM can also be used to replace those authors, automating away their ability to do that work (and hold those jobs). Which is worse - to have your car stolen (which is very bad, I'm not arguing that it isn't), or to lose your job and be unable to afford anything?

Alternatively, if you believe that it's not bad to take someone's effort without their consent and without compensating them for it, then you shouldn't object to your employer withholding wages from you, or a client refusing to pay you, on the same principle.

> People or corporations (which are usually treated as person-like) operate training processes and are morally and legally responsible for them. I believe training an LLM is "a human/corporation doing something" with a piece of content.

That's not reasonable, and most people do not share your opinion (including the relevant group, which is the authors of the content being trained on). That's equivalent to saying that a human writing a program to perfectly reproduce a copyrighted work (e.g. print out the complete text of Harry Potter) is a human "doing something" with that copyrighted work (in the same class as reading Harry Potter).
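
To make that concrete, here is the hypothetical program in miniature (the embedded text is elided, obviously):

    # Hypothetical: a program whose only purpose is to emit a verbatim,
    # stored copy of a copyrighted text. Running it is republication,
    # not "reading" - regardless of how the text got embedded.
    HARRY_POTTER_FULL_TEXT = "..."  # imagine the complete text here

    print(HARRY_POTTER_FULL_TEXT)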

> Whether I let an AI agent read your blog post or whether I write a program to read it into an LLM fine-tuning process seems immaterial to me.

Those are categorically different. The vast majority of the population (again, including those whose works are being trained on without their consent) would agree that they are incomparable: logically, legally, and morally distinct.

> One of those exceptions (in many jurisdictions) is making temporary copies of data to use in a computational process. For example, browser caching, buffering, or transient storage during compression/decompression.

To use in specific computational processes whose output you do not store, because the output is subject to the same copyright laws. The implicit premise when you talk about training is that you're going to save the trained model, so this exception obviously doesn't apply - in the same sense that if you take a copyrighted work and transcode it, the transcoded output is subject to the exact same copyright as the original.
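
The difference is persistence. Here is the same idea sketched in code, with hypothetical filenames - the transient-copy exception covers the first pattern, not the second:

    import zlib

    with open("novel.txt", "rb") as f:  # hypothetical filename
        original = f.read()

    # Transient: this compressed copy lives only in memory and is
    # discarded when the program exits.
    in_memory = zlib.compress(original)

    # Persistent: writing the re-encoded work to disk creates a lasting
    # artifact - a "transcoded" copy carrying the same copyright.
    with open("novel.txt.z", "wb") as f:
        f.write(in_memory)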

> not those merely training LLMs, which is not, in and of itself, violating any laws I can establish.

That's the "law is morality" fallacy. Morally, this is clearly wrong, the point of the copyright system is to prevent exactly things like this from happening. The courts have not yet decided whether training an LLM is "copying" a copyrighted work, but if they do, then it's clearly illegal.


I appreciate your arguments and know they are in good faith. I think we would have an edifying debate in person!

I'm not going to reply to everything as I think our viewpoints are tricky to reconcile, since we find different things to be moral/immoral. That's fine, but it might not be productive. However, I acknowledge your position and know it reflects much popular sentiment; I cannot dispute that.

> if you believe that it's not bad to take someone's effort without their consent and without compensating them for it, then you shouldn't object to your employer withholding wages from you

I think this gets to the crux of our difference. Employment is an explicit contract that binds two parties to honor their obligations. If someone posts a blog post openly, busks in the street, or does some graffiti art, I don't think observers have any obligations beyond an implicit idea of "experience this in any way you like as long as it's legal". Whether you prefer 'legal' or 'moral' there, it brings us back to the problem that we disagree on the morality/legality of the core issue. Given the constraints of this venue, not to mention our time, I'm happy to recognize this difference and leave it unsettled.

That's the "law is morality" fallacy. Morally, this is clearly wrong, the point of the copyright system is to prevent exactly things like this from happening. The courts have not yet decided whether training an LLM is "copying" a copyrighted work, but if they do, then it's clearly illegal.

If that should come to pass, I agree. However, your suggested fallacy then comes into play the other way around: merely because a legal precedent may be set does not change my opinion that it is not immoral. That is a point on which we clearly differ, and one I think would be fascinating to debate in a more appropriate venue; I may even be won over, though I have not been by any arguments so far.



