The law is slow and always playing catch-up when it comes to prosecution. It’s not clear today because this kind of copyright question has never come up before. Usually the issue is outright stealing of protected content; no one ever imagined “training” as a protected use case. Humans “train” on copyrighted works all the time, ideally works they purchased for that purpose, and the same will start to apply to AI: you have to have rights to the data for that purpose, hence these deals getting made. In the meantime it’s ask for forgiveness, not permission, and companies like Google (OpenAI less so) are ready with data governance that lets them remove data hit by copyright requests and keep the rest of the model working fine.
Let’s also be clear that making deals with Reddit isn’t stealing from creators. It’s not a platform where you own what you type in; same on here, this is all public domain with no assumed rights to the text. If you write a book and OpenAI trains on it and starts telling it to kids at bedtime, you 100% will have a legal claim in the future, but the companies already have protections in place to prevent exactly that. For example, if you own your website you can request that it not be crawled, but ultimately if your text is publicly available anyone is allowed to read it. Whether anyone is allowed to train AI on it is an open question that companies are trying to get ahead of.
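To make the "request the data not be crawled" part concrete, here is a rough sketch of how a robots.txt opt-out is read, using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URL, and the helper name `may_crawl` are made up for illustration, and `GPTBot` is assumed to be the user-agent token of the crawler you want to block. Keep in mind robots.txt is only a request: it works if the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an author's own site: ask one AI crawler to
# stay out and allow everyone else. "GPTBot" is an assumed user-agent token.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/my-book/chapter-1"  # illustrative URL
    print("GPTBot allowed: ", may_crawl(ROBOTS_TXT, "GPTBot", page))
    print("Browser allowed:", may_crawl(ROBOTS_TXT, "Mozilla/5.0", page))
```

A well-behaved crawler runs a check like this before fetching a page. Nothing here stops a crawler that ignores the file, which is why the training question stays open.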
That seems even worse: they had intent to steal and now they're trying to make sure it is properly legislated so nobody else can do it, thus reducing competition.
GPT can't get retroactively untrained on stolen data.
Google actually can “untrain” afaik. My limited understanding is that they have good controls over their data and its sources, because they know it could be important in the future. GPT, not sure.
I’m not sure what you mean by “steal”, because it’s a relative term now. Me reading your book isn’t stealing if I paid for it and it inspires me to write my own novel with a totally new story. And if you posted your book online, as of right now the legal precedent is that you didn’t make any claims to it (anyone could read it for free), so it’s fair game to train on, just like the text I’m writing now also has no protections.
Nearly all of Reddit’s history up to a certain date is available for download online right now. It was only when they changed their policies that they started having tighter controls on how their data could be used.