You can also get “the whole work” by asking another human to recite the lyrics to a song or draw the Finder logo from memory.
What’s really happening is that AI models have much better memory than humans and are more precise in their output. It would be stupid to try and “dumb down” AI models because they’re better at remembering some licensed content.
> What’s really happening is that AI models have much better memory than humans and are more precise in their output.
And yet, presumably we agree that a simple file server that serves up exact copies of copyrighted work does constitute copyright infringement. What's the difference? You could also say "what's really happening is that the file server has much better memory than humans." Duh!
It sounds like you're saying that, because an AI model is a very convoluted and sometimes inaccurate way to implement a computer system that sometimes serves up exact copies of copyrighted works, it's not copyright infringement when that computer system does serve up an exact copy of a copyrighted work. I'm not quite understanding the argument.
1) I’m generally copyleft and would argue that copyright as we know it is nonsensical in the digital era and needs to be entirely rebuilt to make any sense at all. And that, as is, it harms the commons more than it protects creators. So yeah, a file server has always been a game changer, just like the printing press was, and we’re far behind as a society, legally.
2) When a computer system does reproduce copyrighted content verbatim, it’s infringement the same as if a human did it from memory. That wasn’t my point. My point was that use of copyrighted content to train the model is fair use because it’s no different from a human consuming the content and committing it to memory.
On one hand, copyleft licenses are a creative form of copyright to enforce the wishes of the author to allow derivative works under the condition they are also distributed freely. Let's call it weak copyleft, the pragmatic variety.
OTOH there are the copyright abolitionists, who are offended by the notion that their freedom to copy and modify code on their own hard disk could be restricted by a mere "license"; to them, the notion of intellectual property is poppycock. The Strong Copylefties consider the GPL a necessary evil, a way to use their enemies' tools against them, to spread their ideals of free culture amidst a corporatist hellscape.
Or it could merely be someone taking the naïve reading of "anti-copyright".
So if I hear a song on the radio and it inspires my commercial purposes, then what?
Point being: whether a work is used commercially is not relevant. It's common that we think it is, but it's not. I first read about an LRU cache in my operating systems textbook and later used the concept in a commercial work. I have not committed copyright infringement.
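To make that concrete, here's a from-scratch sketch (Python, purely illustrative, the names are my own): an LRU cache built from the *idea* in the textbook without copying a line of its text. The concept isn't what copyright protects; the expression is.

```python
from collections import OrderedDict

class LRUCache:
    """The textbook *concept*, independently expressed."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # touching a key makes it most recent
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
```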
I am a product of the impressions left by massive heaps of copyrighted content. One song on the radio is just a rhetorical device.
If OpenAI borrowed all of humanity's media from a library and used it to train an AI model, then that seems 100% ethical to me.
Now if you ask the model to recite the script of Breaking Bad and it does so perfectly, and I think that grants me some copyright claim over it, then we're going to have problems. It's just not the model's or the tool's problem.
You're lost in the weeds. I know that's the point; it's why the whole song-on-the-radio thought experiment got brought up. The question was: if an AI model trains on public radio waves and hears a copyrighted song, is that infringement? My position is no, it's not, because the radio station had a license to broadcast that song on the radio.
Similarly, if all the books used to train a model are available in a library, then so long as someone borrows them, they can be used to train the model.
The question was directed at you. I don’t know why you’re repeating it back to me like I didn't know what I was asking…
The file server is only infringing when it serves those files. Photoshop itself isn't infringing just because someone recreates a famous art piece in it; it's the end user that is infringing. The difference between server-driven and user-driven creation shifts responsibility.
Photoshop doesn’t provide you with copyrighted materials to work with when prompted or require ingesting other copyrighted works to work optimally. Language models do both.
The courts are going to rule in favor of these authors if they have a basic understanding of what’s happening.
> Photoshop doesn’t provide you with copyrighted materials to work with when prompted or require ingesting other copyrighted works to work optimally. Language models do both.
They actually are not that different from Photoshop. Regarding providing you with copyrighted materials: if you instruct it to generate something someone else has already copyrighted, perhaps by using a feature meant to reproduce an existing art style, you will generate an infringing work.
As for "ingesting of other copyrighted works to work optimally", you don't know what goes into designing and building Photoshop - how many third-party datasets or copyrighted assets, which get embedded deep into the application in a form that the end-user cannot discover or consume. You don't know, and it doesn't matter, because Adobe using copyrighted materials in building Photoshop does not propagate copyright claims to you the user/renter of Photoshop. Same can be argued about LLMs - copyrighted inputs from training set get sufficiently blended when turned into weights that you, the end user, should be shielded from any IP claims related to the training data.
(Yes, the last point enables "copyright laundering", but I'm not convinced this is a problem - not compared to regulatory environment trying to prevent it.)
My point is simply that these models should not ingest copyrighted materials without paying the authors or publishers. Litigating against end users who are using LLMs that have ingested pirated copyrighted content would be so complex as to not be worth it except for large businesses.
If I want to ingest/read a book I need to pay money but if an LLM does it they’re free to pirate the book? Why?
And why is it that I pay OpenAI to generate data based on books it stole, when I had to pay money for the same book?
I know what I am actually paying for is the model, obviously, but it just feels extra wrong to be paying a company for a service it built using pirated content.
> Photoshop doesn’t provide you with copyrighted materials to work with when prompted or require ingesting other copyrighted works to work optimally. Language models do both.
Photoshop now has generative AI features that leverage language models as well as training on imagery, so this is literally false when discussing Photoshop as it currently exists.
What about a file server which hosts only encrypted files, which are unusable garbage on their own, that happen to turn into exact copies of copyrighted work when supplied with the right decryption key? That's user-driven creation, right?
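In code form (a toy sketch, using a one-time pad for simplicity rather than any real file-encryption scheme): the server stores bytes that are statistically indistinguishable from random noise, yet one XOR away from an exact copy.

```python
import os

def encrypt(plaintext: bytes) -> tuple[bytes, bytes]:
    # One-time pad: XOR with a random key as long as the text.
    key = os.urandom(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return ciphertext, key

def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    return bytes(c ^ k for c, k in zip(ciphertext, key))

work = b"Call me Ishmael. Some years ago..."  # stand-in for a copyrighted work
ciphertext, key = encrypt(work)

# The server hosts only `ciphertext`: pure noise on its own.
# Supply the right `key` and the exact work reappears.
assert decrypt(ciphertext, key) == work
```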
I would say it’s reproduction of an exact copy that represents copyright infringement, not dissemination of the digital brain that remembered it.
If and when someone tries to profit off an AI work that would be copyright infringement had a human made it, it should be copyright infringement when an AI does it too.
I'm not disagreeing with AI model training fair use, but this isn't the argument for it.
"(new tech) does the same thing as humans, just better" has never been a valid defense. It's like saying a human could explain the plot of a movie and draw the scenes, therefore it's okay to bring a camera into a theatre and record a movie and distribute it. Or that a human can hear a conversation and remember what was said, so there's no distinction between that and recording the conversation using a phone.
But an AI model doesn't record the original work verbatim with the goal of directly reproducing the original.
Aside: and you can bring a camera to a theater and record a movie and use it in a transformative work. And a human could still be liable for damages if their hand-drawn performance of Star Wars detracted from Disney's revenue. I'm not saying I agree, just stating the status quo.
Training a model uses the work only to calibrate weights that govern entirely independent output. The fact that it can recall exactly in some cases is a secondary effect of the technology.
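Rough sketch of what I mean (Python; a character-bigram "model" standing in for a real LLM, a huge simplification but the shape is the same): training keeps only statistics, and verbatim recall falls out as a side effect when the data is tiny or heavily duplicated.

```python
from collections import defaultdict
import random

def train(text: str):
    # "Training": accumulate bigram counts. The text itself is discarded;
    # only the derived statistics (the "weights") are kept.
    weights = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        weights[a][b] += 1
    return weights

def generate(weights, seed: str, n: int) -> str:
    out = seed
    for _ in range(n):
        nxt = weights.get(out[-1])
        if not nxt:
            break
        chars, counts = zip(*nxt.items())
        out += random.choices(chars, weights=counts)[0]
    return out

w = train("the quick brown fox jumps over the lazy dog")
print(generate(w, "th", 30))
# Trained on one tiny text, the weights can regurgitate it nearly verbatim;
# trained on a large, varied corpus, the same mechanism mostly yields novel
# combinations. Memorization is an emergent effect, not a stored copy.
```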
Anyway my argument is that “ability to reproduce verbatim a copyrighted work is not a valid characteristic when determining whether something consumed the work fairly”.
> Anyway my argument is that “ability to reproduce verbatim a copyrighted work is not a valid characteristic when determining whether something consumed the work fairly”.
I agree with this. I only disagree with the assertion that AI or $newtech "doing something humans already do but better" has any legal importance. There are many existing laws which apply only when using technology. It's legal to drink and run, but not drink and drive, even though they both get you from point A to point B and cars just do it faster.
Being paid is not what makes it a performance. Having an audience, and the purpose of the recitation, are what make it a performance.
If I pay a babysitter to look after my kid and they sing the child a song to get them to sleep, it’s not an infringing performance.
Even if you are paying ChatGPT to answer your questions, if you ask it to tell you the lyrics of a song and it does so, that is not necessarily infringing.
If I am preparing a legal brief for a copyright case, and I pay a paralegal to transcribe the lyrics of a song, and they do so and send them to me in an email… is that copyright infringement? It seems very unlikely.
I just can’t come to any position on LLMs other than that the users of the LLM have to be held responsible for how they choose to use the output, not the LLM provider.
LLMs need to be aware of the content of copyrighted works in order to fully and comprehensively communicate with humans who are immersed in that same content.
That's only half of it, the half that's been litigated via Xerox and Betamax: no, the manufacturer is not liable for what end users do with their product.
But what Xerox and Sony didn't do to build their machines is pirate everything they could get their hands on as a part of the manufacturing process.
Who says OpenAI pirated it? Unless the content was pirated in the first place, simply showing it to an LLM is just like letting your friend borrow your book.
When Google crawls websites to build a search index, we don’t expect Google to pay royalties… all these analogies at least demonstrate that copyright is impossible to apply consistently and our notions about what’s fair are wholly subjective.
Edit: I guess even if they are, SCOTUS recently decided that even transformative works can infringe if they compete commercially. So the question is not "did the reciter make money performing?", it's "did the reciter's performance detract financially from the original artist?"
Song covers are also a special case with something called a "compulsory license", where the copyright owner is required to license it to you; it can't be denied. You just do it and pay them preset royalty rates.
That is only if it is recorded. You can't get a "compulsory license" if you are going to perform that cover in front of an audience. That you have to secure from the rights holder.
I personally think our law regarding covers has been heavily influenced by record labels and is wrong about them, in practice. I’ve already heard the original enough times that if I want to hear a string quartet perform Viva La Vida it’s because it’s a new refreshing piece of art. The whole compulsory license thing indicates the law got it wrong. Anyway…
The song is distinct from an artist's performance of that song, but let's move outside of music to books etc.
Suppose A creates an epic poem, B does a poetry reading aka a performance of A’s poem. C records it, then plays that recording back in public. D reinterprets the poem making a new public performance. B can successfully sue C but not necessarily D if it’s sufficiently distinct. A however can potentially sue D, C, and B if none of them got the rights. [Substitute A making a painting and B making some needlepoint copy or whatever and the same principle applies.]
This is why J.K. Rowling got paid by the people making Harry Potter movies; she could sue if they didn't pay. Trademarks may also be involved, but even without that, if you want to make an MMORPG based on The Dresden Files or whatever, be prepared to fork over cash. Unless you follow the Disney approach and use public domain works.
'Covers' require the payment of ASCAP fees -- usually done by the facility, not the performer, but if the facility does not pay, the performer can be liable. You even need a license to put a jukebox in your bar, so that argument doesn't hold up.
The AI tools are often representing the output as something their customers can use without restriction. I'm pretty sure that wouldn't work in your analogy. If I'm an agency and a customer asks for jingles, can I recite large parts of lyrics of copyrighted songs for them to use...as if I made them up?
Google image search produces copyrighted and restricted use images. On clicking an image it includes a little caveat warning “Images may be subject to copyright. Learn More” - but no specific attribution or copyright claim. It’s possible if you go to the source where Google found it you’ll find the attribution there but also very likely you won’t.
If an AI tool just says ‘this might be subject to copyright’, is it all good?
Words to that effect appear, for example, in the GitHub copilot terms and conditions.
Yeah, I'm understanding the nuance much more now. There is a difference between "is it okay to use copyrighted content to produce OpenAI's product?" and "is a verbatim reproduction of a poem fair use?"
That’s a naive ‘what color are my bits’[1] mistake - classic software developer mindset.
Level 1 programmer naïveté is just ‘bits are bits, it doesn’t matter where they come from. Bitwise identical things are indistinguishable’.
Level 2 naïveté is when you accept that bits have color depending on how they came to be arranged thus, and that there are processes that get rid of the old color on some bits, and replace it with a new one. But then you figure - like a programmer - that if you compose that process with some other process you can get rid of the colors you don’t like.
Enlightenment is realizing that the law cares not one jot for the specific processes you apply to bits or their colors but criminalizes (or at least proscribes) particular actions and cares about things like intent.
How is this any different than something like Photoshop? You can recreate (and therefore copy) a piece of art and it's infringement, but not on the part of Photoshop. Yet, Adobe is still well within the right to say you can use what you create with Photoshop. Why can't AI tool makers have the same claim?
"You can also get “the whole work” by asking another human to recite the lyrics to a song or draw the Finder logo from memory."
You seem to be speaking as if this somehow would cleanse the copyright status of the work in question, but it wouldn't. If you memorize a book, or a friend does, and you or your friend recites it to someone who transcribes it, the result is still copyrighted by the original entity, and if you try to sell the result, you'll be on the hook for copyright violation. This would do nothing to the copyright status whatsoever, so whatever argument you're trying to imply doesn't hold.
The only difference between the two cases is which human violated copyright. If you ask a musician to play a cover of a famous song without the requisite royalties, they violated copyright. If you instruct a machine to do it, you violated copyright. Machine has no thoughts, head empty, does not know what copyright is, does not know abc's.
Which is also why machines can't create copyrighted works either. The standard example is that building a machine to generate random images doesn't make those images copyrighted, but if an artist chooses some of them because they look good, then he may copyright them.
A spirit crafted by people who couldn't even begin to imagine LLMs. The proper answer here is new laws clarifying how copyright applies to LLMs/generative models in general, not trying to reason like a 19th-century person about 21st-century tech.
The EU's approach is much more sensible - there is this new thing with vast ramifications, let's sit down and see what legal framework is needed for it.
> What’s really happening is that AI models have much better memory than humans and are more precise in their output. It would be stupid to try and “dumb down” AI models because they’re better at remembering some licensed content.
In spirit, it’s still fair use.