> It is unclear if the Intercept ruling will embolden other publications to consider DMCA litigation; few publications have followed in their footsteps so far. As time goes on, there is concern that new suits against OpenAI would be vulnerable to statute of limitations restrictions, particularly if news publishers want to cite the training data sets underlying ChatGPT. But the ruling is one signal that Loevy & Loevy is narrowing in on a specific DMCA claim that can actually stand up in court.
> Like The Intercept, Raw Story and AlterNet are asking for $2,500 in damages for each instance that OpenAI allegedly removed DMCA-protected information in its training data sets. If damages are calculated based on each individual article allegedly used to train ChatGPT, it could quickly balloon to tens of thousands of violations.
Tens of thousands of violations at $2,500 each would amount to tens of millions of dollars in damages. I am not familiar with this field; does anyone have a sense of how the total cost of retraining (without these alleged DMCA violations) would compare to these damages?
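For a rough sense of scale, here is a back-of-the-envelope sketch in Python. Only the $2,500 per-violation figure comes from the complaints; the violation counts and the retraining cost are hypothetical placeholders for illustration, not reported numbers.

    # Back-of-the-envelope: statutory damages vs. a hypothetical retraining cost.
    # Only the $2,500 figure comes from the filings; everything else is assumed.

    DAMAGES_PER_VIOLATION = 2_500  # per the Raw Story / AlterNet complaints

    for alleged_violations in (10_000, 50_000, 100_000):  # hypothetical counts
        total = alleged_violations * DAMAGES_PER_VIOLATION
        print(f"{alleged_violations:>7,} violations -> ${total:>13,}")

    # Hypothetical cost of a full retraining run, purely an order-of-magnitude guess:
    assumed_retraining_cost = 100_000_000
    print(f"Assumed retraining cost: ${assumed_retraining_cost:,}")

At those hypothetical counts the damages run from $25M to $250M, i.e. the same order of magnitude as a large training run, which is why the comparison seems worth asking about.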
If you're going to retrain your model because of this ruling, wouldn't it make sense to remove all DMCA-protected content from your training data, instead of just the content you were most recently sued over (especially if it sets precedent)?
But all content is DMCA protected. Avoiding copyrighted content means not having content at all, since all material is automatically copyrighted. One would be limited to licensed content, which is another minefield.
The apparent loophole is between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.
In almost all cases before gen AI, scraping was found to be legal unless the bot accepted terms of service, in which case the bot is bound by the ToS. The biggest and clearest example is [1]. People have been scraping the internet for as long as the internet has existed.
It would make sense from a legal standpoint, but I don't think they could do that without massively regressing their models' performance, to the point that it would jeopardize their viability as a company.
> I guess I should have used the phrase "common sense stealing in any other context" to be more precise?
Clearly not common sense stealing. The Intercept was not deprived of their content. If OpenAI had sneaked into their office and server farm and taken all the hard drives and paper copies containing the content, that would be "common sense stealing".
Very much common sense copyright violation though.
Copyright means you're not allowed to copy something without permission.
It's that simple. There is no "Yes but you still have your book" argument, because copyright is a claim on commercial value, not a claim on instantiation.
There's some minimal wiggle room for fair use, but clearly making an electronic copy and creating a condensed electronic version of the content - no matter how abstracted - and using it for profit is not fair use.
If the AI produces chunks of training set nearly verbatim when prompted, it looks like copying.
> And if so, why isn't someone learning from said work not considered copying in their brain?
Well, their brain, while learning, is not someone's published work product, for one thing. This should be obvious.
But their brain can violate copyright by producing work as the output of that learning, and be guilty of plagiarism, etc. If I memorise a passage of your copyrighted book when I am a child, and then write it in my book when I am an adult, I've infringed.
The fact that most jurisdictions don't consider the work of an AI to be copyrightable does not mean it cannot ever be infringing.
The output of a model can be a copyright violation. In fact, even if the model was never trained on copyrighted content, if I provided copyrighted text and then told the model to regurgitate it verbatim, that would be a violation.
That does not make the model itself a copyright violation.
This is sort of like the argument against a blank tape levy or a tape copier tax, which is a reasonable argument in the context of the hardware.
But an LLM doesn't just enable direct duplication; it (well, its model) contains it.
If software had a meaningful distribution cost or per-unit sale cost, a blank tape tax would be very appropriate for LLM sales.
But instead OpenAI is operating a for-pay duplication service where authors don't get a share of the proceeds -- it is doing the very thing that copyright laws were designed to dissuade by giving authors a time-limited right to control the profits from reproducing copies of their work.
Yeah, good point. What's the difference between spidering content and training a model? It's almost like accessing pages of content like a search engine does... if the information is publicly available?
A product from a company is not a person. An LLM is not a brain.
If you transcode a CD to mp3 and build a business around selling those files without the author's permission, you'd be in big legal trouble.
Tech products that "accidentally" reproduce materials without the owners' permission (e.g. someone uploading La La Land to YouTube) have processes to remove them by simply filling out a form. Can you do that with ChatGPT?
It's legal for you to possess a single joint. It's not legal for you to possess a warehouse of 400 tons of weed.
The line between legal and not legal is sometimes based on scale; being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it.
> Are you describing what the law is or what you feel the law should be?
I am stating what is, right now.
I thought the weed example made that clear.
Let me clarify: the state of things, as they stand, is that the entire justice system, legislation and courts included, takes scale into account when looking at the line dividing "legal" from "illegal".
There is literally no defense of "If it is legal at qty x1, it is legal at any qty".
Excellent. Then the next question is: where (in which jurisdiction) are you describing the law? And what are your sources? Not about the weed, I don't care about that. Particularly the "being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it".
The reason why I'm asking is because you are drawing a parallel between criminal law and (I guess?) copyright infringement. The drug possession limits in many jurisdictions are explicitly written into the law. These are not some grand principle of law but the result of explicit legislative intent. The people writing the law wanted to punish drug peddlers without punishing end users. (Or they wanted to punish them less severely or differently.) Are the copyright limits you are thinking about similarly written down? Do you have case references one can read?
I made it clear in both my responses that scale matters, and that there is precedent in law, in almost all countries I can think of right now, for scale mattering.
I did not make the point that there is a written law specifically for copyright violations at scale (although many jurisdictions do have exemptions at small scale written into law).
I will try to clarify once again: there is no defence in law that because something is allowed at qty X1, it must be allowed at any qty.
This is the defence that was originally posted and that I replied to; it is not valid because courts regularly consider the scale of an activity when determining the line between allowed and not allowed.
That might be the point. If your business model is built on reselling something you’ve built on stuff you’ve taken without payment or permission, maybe the business isn’t viable.
I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore and it happened to have protected content in it from before the ruling”. If so, you’ve essentially won all of humanity’s output, as you can already scrape the new primary information (scientific articles and other datasets designed for researchers to freely access), and whatever junk the content mills output is just going to be poor summarizations of that primary information.
Other factors that help this effort of an old model + new public-facing data being complete are that other forms of media, like storytelling and music, have already converged onto certain prevailing patterns. For stories we expect a certain style of plot development and complain when it’s missing or not as we expect. For music, most anything being listened to is lyrics no one is deeply reading into, put over the same old chord progressions we’ve always had. For art, there are just too few of us who actually go out of our way to get familiar with novel art versus the vast bulk of the world’s present-day artistic effort, which goes towards product advertisement, which once again follows certain patterns people have been publishing in psychological journals for decades now.
In a sense we’ve already put out enough data and made enough of our world formulaic that I believe we’ve already set up a perfect singularity in terms of what can be generated for the average person who looks at a screen today. And because of that I think even a lack of any new training on such content wouldn’t hurt OpenAI at all.
> I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore and it happened to have protected content in it from before the ruling”
I'm not a lawyer, but I know enough to be pretty confident that that wouldn't work. The law is about intent. Coming up with "one weird trick" to work around a potential court ruling is unlikely to impress a judge.
They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good faith intent.
The latter one is possible with RAG solutions like ChatGPT Search, which do already provide sources! :)
But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed to, IMO. (Attribution: Humanity)
But who knows. Maybe it can be done for more fact-like stuff.
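As a rough illustration of how attribution could be carried through retrieval (rather than baked into pretraining), here is a minimal sketch in Python. The corpus, the keyword-overlap retrieval, and the prompt format are all hypothetical placeholders, not anything ChatGPT Search actually does.

    # Minimal sketch of retrieval-augmented generation with source attribution.
    # Everything here (corpus, scoring, prompt format) is made up for illustration.
    from dataclasses import dataclass

    @dataclass
    class Document:
        text: str
        source: str  # publisher name or URL kept alongside the content

    CORPUS = [
        Document("Example article text about topic A.", "example-publisher.com/a"),
        Document("Example article text about topic B.", "example-publisher.com/b"),
    ]

    def retrieve(query: str, corpus: list[Document], k: int = 1) -> list[Document]:
        # Naive keyword overlap; a real system would use embeddings.
        def score(doc: Document) -> int:
            return len(set(query.lower().split()) & set(doc.text.lower().split()))
        return sorted(corpus, key=score, reverse=True)[:k]

    def build_prompt(query: str) -> str:
        docs = retrieve(query, CORPUS)
        context = "\n".join(f"[{d.source}] {d.text}" for d in docs)
        # The model is asked to answer *and* cite the bracketed sources it used.
        return f"Context:\n{context}\n\nQuestion: {query}\nCite the sources you used."

    print(build_prompt("What is topic A?"))

This only attributes the retrieved facts; it says nothing about the "how language works" part learned during pretraining, which is exactly the part that seems too fundamental to attribute.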
On this point, I'm sure there is more than enough publicly and freely usable content to "learn how language works". There is no need to hoover up private or license-unclear content if that is your goal.
I would actually love it if that were true. It would reduce a lot of legal headaches for sure. But if that were true, why were previous GPT versions not as good at understanding language? I can only conclude that it's because that's not actually true. There isn't enough digital public domain material to train an LLM to understand language competently.
Perhaps old texts in physical form, then? It would cost a lot to digitize that, wouldn't it? And it wouldn't really be accessible to AI hobbyists. Unless the digitization is publicly funded or something.
(A big part of this is also how insanely long copyright lasts (nearly a hundred years!), which keeps most of the Internet's material from being public domain in the first place, but I won't belabour that point here.)
Edit:
Fair enough, I can see your point. "Surely it is cheaper to digitize old texts or buy a license to Google Books than to potentially lose a court case? Either OpenAI really likes risking it to save a bit of money, or they really wanted facts not contained in old texts."
And yeah, I guess that's true. I could say "but facts aren't copyrightable" (which was supported by the judge's decision from the TFA), but then that's a different debate about whether or not people should be able to own facts. Which does have some inroads (e.g. a right against being summarized because it removes the reason to read original news articles).
> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.
All of that and more, all at the same time.
Attribution at inference level is bound to work more or less the same way as humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.
RAG obviously can work for this, as well as other solutions involving retrieving, finding or confirming sources. That's just like when a human actually looks up the source when citing something - and has similar caveats and costs.
Only half-serious, but: I wonder if they can dance with the publishers around this issue long enough for most of the contested text to become part of public court records, and then claim they're now training off that. <trollface>
Re-training can be done, but (and it is not a small "but") the models already exist and can be used locally, suggesting that the milk has been spilled for too long at this point. Separately, neutering them effectively lowers their value compared to their non-neutered counterparts.
The onus is on the person collecting massive amounts of data and circumventing DMCA protections to ensure they're not doing anything illegal.
"well someone snuck in some DMCA content" when sharing family photos and doesn't suddenly make it legal to share that DMCA protected content with your photos...
Enron energy traders called power plants and asked them to shut down during high load times. They encouraged the plant personnel to fabricate the reason for shut down. This created an electricity shortage, which forced rolling blackouts. Energy prices shot up, making for massive profits for Enron. There were also headline stories about elderly people suffering without air conditioning. The Enron traders joked about this in their phone calls to one another.
Of course, we only know about this because their phone calls were recorded. If I recall correctly, none of the traders indicated that they were aware their calls were recorded.
It's true, however the main security measure of Qubes-Whonix is not Kicksecure but hardware virtualization, which isolates the Tor Browser VM (anon-whonix) from the VM establishing the Tor connection (sys-whonix). Kicksecure is of secondary importance.
I have heard this argument before, and it made sense to me until I became a patient. First of all, if the 3rd party (insurance company) was so incentivized to guard against fraud, why would they repeatedly lose documents that had been submitted to them?
> For starters, she said bluntly, “we know everything is going to get denied.” It’s almost a given, she said, that the insurer will lose the first batch of records. “We often have to send records two or three times before they finally admit they actually received them. … They play all of these kinds of delaying games.”
Insurance companies' costs come less from the first $10k of patients' spending and much more from the next $10M. Very few, very expensive patients make up the bulk of the cost. This article (and other great ProPublica reporting) demonstrates some of the ways that insurance companies cut these costs, ultimately by refusing to pay (or delaying payment) for necessary care.
The espionage act famously does not allow for whistleblowing as a defense. Tulsi Gabbard [1], Rashida Tlaib [2], Ron Wyden and Ro Khanna [3] (and perhaps others) have tried to introduce legislation to change this. For example, Reality Winner was unable to make any public interest arguments in her defense [4]. Ed Snowden has repeatedly said he would happily return to the US to face trial if he were allowed to make a public interest defense. For example, in a 2019 NPR interview [5]:
> My ultimate goal will always be to return to the United States. And I've actually had conversations with the government, last in the Obama administration, about what that would look like, and they said, "You should come and face trial." I said, "Sure. Sign me up. Under one condition: I have to be able to tell the jury why I did what I did, and the jury has to decide: Was this justified or unjustified." This is called a public interest defense and is allowed under pretty much every crime someone can be charged for. Even murder, for example, has defenses. It can be self-defense and so on so forth, it could be manslaughter instead of first-degree murder. But in the case of telling a journalist the truth about how the government was breaking the law, the government says there can be no defense. There can be no justification for why you did it. The only thing the jury gets to consider is did you tell the journalists something you were not allowed to tell them. If yes, it doesn't matter why you did it. You go to jail. And I have said, as soon as you guys say for whistleblowers it is the jury who decides if it was right or wrong to expose the government's own lawbreaking, I'll be in court the next day.
Great videos. First video, about 22:50: "and we had a new challenge, OK? For the last three or four years our, our distribution has run sold out. So anytime we go and get a sale and we go and create inventory, we know it's going to be bought."
Meanwhile, from the DOJ release: "According to evidence presented at trial, Shah, Agarwal, and Purdy sold advertising inventory the company did not have to Outcome’s clients, then under-delivered on its advertising campaigns. Despite these under-deliveries, the company still invoiced its clients as if it had delivered in full."
This story sounds all too familiar to me. I have been denied coverage and subsequently saddled with debt. I know others who have experienced similar. I wonder how widespread this is.
Reading about the insurance company's internal process is deeply upsetting. It is like a kangaroo court, where the denial of benefits is predetermined. Documentation to support this conclusion is fabricated. Documentation that does not support the conclusion is buried.
It’s absolutely disgusting that healthcare is so tied to whether one can work or not. “Oh, you can still work, so you don’t need this”. Fuck right off.
Your suggestion of staying home if sick is one I agree with. Unfortunately I cannot make others do this. In fact, where I live in the USA I come across people who are visibly/audibly ill in my daily life: air travel, public transport, grocery store, and so on. I would prefer if these people stayed home, but can I make them? I can wear a mask, and I do (typically N95 in public places). If others think I look crazy, that is a small price to pay to avoid a few weeks of being sick or infecting someone else. Particularly my vulnerable family members.
Thanks for the arcgis link. It is interesting to compare to the Sentinel 1A data from the study[0]. For example there is one existing (ground based) measurement East of Mont Belvieu (P050), but most of the displacement in the satellite data appears just to the West, centered on Mont Belvieu. This is by eye only, so I may be mistaken in comparing the locations.
The ground based measurement for sensor P050 reports up-down displacement of -0.07 cm per year between 2017 and 2020.
It is difficult to determine the exact value from a shaded image, but the satellite data show that just to the West of this ground based measurement (about centered on Mont Belvieu), displacement was -1.91 to -0.85 cm per year between 2016 and 2020 (see figure 3b).
The arcgis site has useful data that could be used to better compare trends for the same dates [1]. I did not look at every year, but it looks like 50+ ground based measurements per year. The study's methods are a bit beyond me, but section 3 describes processing a total of 89 Single Look Complex (SLC) images from 2016 to 2020. I could not find any mention of exact dates.