
Clever, but the law is not a machine or an algorithm. Intent matters.

Training an LLM with the intent of contravening an NDA is just plain <intent to contravene an NDA>. Everyone would still get sued anyway.



It is a classic geek fallacy to think you can hack the law with logic tricks.


Indeed it is. Obligatory xkcd - https://xkcd.com/1494/


But then training a commercial model is done with the intent to not pay the original authors, how is that different?


> done with the intent to not pay the original authors

No one building this software wants to “steal from creators”, and the legal precedent for using copyrighted works for the purpose of training is clear with the NYT case against OpenAI.

It’s why things like the recent deal with Reddit to train on their data (which Reddit owns and users give up when using the platform) are becoming so important, same with Twitter/X


> no one building this software wants to “steal from creators”

> It’s why things like the recent deal[s ...] are becoming so important

Sorry but I don't follow. Is it one or the other?

If they didn't want to steal from the original authors, why do they not-steal Reddit now? What happens with the smaller creators that are not Reddit? When is OpenAI meeting with me to discuss compensation?

To me your post felt something like "I'm not robbing you, Small State Without Defense that I just invaded, I just want to have your petroleum, but I'm paying Big State for theirs cause they can kick my ass".

Aren't the recent deals actually implying that everything so far has actually been done with the intent of not compensating their source data creators? If that was not the case, they wouldn't need any deals now, they'd just continue happily doing whatever they've been doing which is oh so clearly lawful.

What did I miss?


The law is slow and is always playing catch-up in terms of prosecution. It’s not clear today because this kind of copyright question has never been an issue before. Usually it’s just outright stealing of protected content; no one ever imagined “training” to be a protected use case. Humans “train” on copyrighted works all the time, ideally copyrighted works they purchased for said purpose… The same will start to apply for AI: you have to have rights to the data for that purpose, hence these deals getting made. In the meantime it’s ask for forgiveness, not permission, and companies like Google (less so OpenAI) are ready to go with data governance that lets them remove data subject to copyright requests and keep the rest of the model working fine.

Let’s also be clear that making deals with Reddit isn’t stealing from creators: it’s not a platform where you own what you type in. Same on here, this is all public domain with no assumed rights to the text. If you write a book and OpenAI trains on it and starts telling it to kids at bedtime, you 100% will have a legal claim in the future, but the companies already have protections in place to prevent exactly that. For example, if you own your website you can request that the data not be crawled, but ultimately if your text is publicly available anyone is allowed to read it, and whether anyone is allowed to train AI on it is an open question that companies are trying to get ahead on.


That seems even worse: they had intent to steal and now they're trying to make sure it is properly legislated so nobody else can do it, thus reducing competition.

GPT can't get retroactively untrained on stolen data.


Google actually can “untrain” afaik. My limited understanding is that they have good controls over their data and its sources, because they know it could be important in the future. GPT, not sure.

I’m not sure what you mean by “steal” because it’s a relative term now, me reading your book isn’t stealing if I paid for it and it inspires me to write my own novel about a totally new story. And if you posted your book online, as of right now the legal precedent is you didn’t make any claims to it (anyone could read it for free) so that’s fair game to train on, just like the text I’m writing now also has no protections.

Nearly all Reddit history up to a certain date is available for download online now; only once they changed their policies did they start having tighter controls on how their data could be used.


Chutzpah. And that the companies doing it are multi-billion dollar companies who can afford the finest legal representation money can buy.

Whether the brazenness with which they are doing this will work out for them is currently playing out in the courts.


It’s not done with the intent to infringe copyright.


It would appear that it explicitly IS done with this intent. We are told that an LLM is a living being that merely learns and then creates, yet we are aware that its outputs regurgitate combinations of its inputs.



