How is that relevant? If some LLM were able to regurgitate a 60k-word novel verbatim on demand, sure, the copyright situation would be different. But last I checked, they can't: not 60k words, not 6k, not even 600. Perhaps they can manage 60 words of some well-known passages from the Bible or other similarly ubiquitous copyright-free works.
"It's lossy" is in isolation much too vague to say if it's OK or not.
A compression algorithm that loses only 1 bit of real data is obviously not going to protect you from copyright infringement claims; something that reduces all inputs to a single bit obviously is fine.
So, for example, what the NYT is suing over is that the training (or so it is claimed) allows the model to regenerate entire articles, which is not OK.
But to claim that it is copyright infringement to "compress" a Harry Potter novel to 1200 bits is to say that this:
> Harry Potter discovers he is a wizard and attends Hogwarts, where he battles dark forces, including the evil Voldemort, to save the wizarding world.
… which, at 8 bits per character, is just under 1200 bits, is an unlawful thing to post. (For the purpose of the hypothetical, imagine that quotation as a zero-context tweet, rather than what it actually is here: fair use, because it appears in a discussion about copyright infringement of novels.)
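If you want to check the arithmetic, here's a minimal sketch, assuming plain ASCII at 8 bits per character (any reasonable text encoding gives the same order of magnitude):

```python
# Back-of-the-envelope bit count for the one-sentence summary above,
# assuming 8 bits per ASCII character.
summary = (
    "Harry Potter discovers he is a wizard and attends Hogwarts, "
    "where he battles dark forces, including the evil Voldemort, "
    "to save the wizarding world."
)
print(len(summary), "characters,", len(summary) * 8, "bits")
# -> 148 characters, 1184 bits (just under 1200)
```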
I think anyone who suggested suing over this to a lawyer would discover that lawyers can, in fact, laugh.
Now, there's also the question of whether it's legal to train a model on all of the Harry Potter fan wikis, which almost certainly have a huge overlap with the contents of the novels and thus strengthen these same probabilities. Some people accuse OpenAI et al. of "copyright laundering", and I think ingesting derivative works such as fan sites would be a better fit for that label than the specific things they're formally accused of in the lawsuits.