2. robots.txt is not legally enforceable. It's a gentleman's agreement, a widely...

Retric · on April 26, 2023

Robots.txt is permission: it allows you to do something that is otherwise illegal. You are not allowed to do this stuff by default.

Something people don’t understand is that “Right click file save as…” on a copyrighted image breaks copyright, as you don’t have permission to make a copy. You do have implied permission to make incidental copies to view a website, but that’s it.

srslack · on April 26, 2023

Are you saying that after all these years of omitting robots.txt from my public personal website of priceless blog posts and code repositories, I can take Google and Microsoft to court for scraping, linking to, and reproducing what I deem as substantial portions? They've violated my rights! They didn't have permission!

In all seriousness, that's total shit. It's publicly accessible. That's the permission. There have been different cases like weev where the judge has viewed it differently because of specific details like carrying out the action of enumeration, or guessing IDs, to reveal non-public information, but that had to do with the CFAA.

If you reproduce content in total in another work, resell, etc., that's another thing.

Retric · on April 26, 2023

> It's publicly accessible. That's the permission.

Fair use is permission. Every library book is accessible; that accessibility doesn’t change its copyright status.

srslack · on April 26, 2023

You brought up robots.txt as if it was illegal to scrape or crawl the artist's website since his site had one that was supposed to "turn off scraping." That's what I was referring to. And that's just flat out laughably wrong.

In the context you're now talking about by the letter of copyright law, downloading a photo, image, file of something or other, a "copyrighted work" where it's publicly accessible, and has a particular license specified that you may not reproduce it, may technically be "unlawful" letter by letter of the law, but I doubt any judge is going to actually see it that way versus intent of the law unless you literally reproduce or share a substantial portion of it, sell a complete copy, etc. It's almost certainly fair use to study, and use portions of the copyrighted work in your own copyrighted works.

Retric · on April 26, 2023

The legality of scraping without explicit permission depends on the context, as has been demonstrated by multiple court cases. Robots.txt short circuits that by giving permission to people who might not otherwise qualify.

As to making a copy with intent to X, that’s what fair use is. A student may photocopy the full text of a short article so they can accurately quote it in their term paper. They can’t simply photocopy an article because they are a student with access to the article and a photocopier, nor can they photocopy a full book because they want to use a short quote in their paper. Copying incidental to acceptable use becomes retroactively acceptable. This distinction may seem crazy to you, but the fact that intent matters means you can’t judge an action without context.

Fair use in commercial context is looked at with vastly more suspicion than fair use in academic context, which again demonstrates specific actions on their own aren’t always enough information to say if something is allowed.

srslack · on April 26, 2023

>The legality of scraping without explicit permission depends on the context, as has been demonstrated by multiple court cases. Robots.txt short circuits that by giving permission to people who might not otherwise qualify.

It's been determined that it's legal, if it's publicly accessible, and you don't receive a cease-and-desist letter telling you to stop. If it's public, that's permission. Simply "turning off scraping" by putting a robots.txt there doesn't make the content and linked images of a public web page any less public and restricted from being scraped, legally.

Just last year, the LinkedIn case: LinkedIn had a robots.txt, and the judge didn't give a fuck. Nor did he care what their terms of use said. Rather, it was hiQ's continued scraping of LinkedIn data even after LinkedIn's cease-and-desist letter to them that constituted access of data "without authorization."

>As to making a copy with intent to X, that’s what fair use is.

Yes, and?

>This distinction may seem crazy to you

It's not a crazy distinction, that's what I'd basically said previously, so perhaps we're talking past each other.

Retric · on April 26, 2023

> LinkedIn

There’s a lot of precedent here showing scraping isn’t guaranteed to be acceptable.

https://en.wikipedia.org/wiki/Facebook,_Inc._v._Power_Ventur....

$79,640.50 in compensatory damages + $39,796.73 discovery sanction

I could go on, but I am not sure what exactly you’re trying to argue here.

srslack · on April 26, 2023

Power Ventures' actions violated the CFAA because they bypassed security measures intended to make the content not-exactly-public. The judge dismissed claims of copyright infringement despite them hosting "cached" versions of the scraped profiles using Facebook's trade dress. The damages + discovery sanction had to do with them bypassing security restrictions with their scraping, creating profiles and using bots to scrape with those profiles to access further information than was public, and Power Ventures' ignoring their explicit cease-and-desist the first time, and their non-compliance with discovery in some context. Read the case.

>I could go on, but I am not sure what exactly you’re trying to argue here.

That public is public. If you leave the door open in the real world, someone CAN enter your home. If you host your image on a public webpage, they can scrape it. robots.txt is not a security measure, nor is it a contract that magically gives the right to scrape where it wasn't given previously, it is a gentleman's agreement that you can ignore if you want to be a dick about it, and know about the robots.txt. Ethically, that's wrong, but it is how it is.

Not to mention, as I came to understand while reading during this discussion, LAION wasn't even crawling: they were using a public commoncrawl dump to gather their images. commoncrawl had crawled the author's site previously. They just took that data and got image links out of it.

1. they weren't selling a dataset

2. the artist didn't "disable" scraping in any meaningful way, legally

3. linking to the image is not illegal, and they're justified to respond with an invoice in Germany to recover legal fees for this dumb copyright complaint

4. it may fall under fair use to download images and train neural nets on them, or it may not. it always depends on the context and the specific case.

pretty simple stuff.
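
To illustrate the point above that robots.txt is advisory and not a security measure, here is a minimal sketch of how a well-behaved crawler might consult it, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the example.com URLs, and the `is_allowed` helper are all made up for illustration. Note that the check only returns an answer; nothing on the server side stops a client that never runs it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    This is purely a client-side check: the crawler decides whether to honor it.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs, assumed only for this example.
    for url in (
        "https://example.com/images/cat.png",
        "https://example.com/blog/post.html",
    ):
        print(url, "->", is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A polite crawler would skip any URL for which this returns False. A crawler that ignores the file entirely gets the same HTTP responses either way, which is the "gentleman's agreement" part.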

Retric · on April 27, 2023

> If you leave the door open in the real world, someone CAN enter your home.

No trespassing signs have legal weight even without a fence.

Read up on Thomson Reuters v Ross Intelligence.

srslack · on April 30, 2023

No trespassing signs have legal weight with certain conditions, and it's up to the judge. I've been in a court case where simply taking pictures outside of the property and showing that the gap between each no trespassing sign was more than 100ft wide was enough for a judge to throw out the charges. I was dirt biking on the power company's property, you see. It hinges on the defendant not noticing such restrictions. In both cases, the robots.txt meant jack. Even their security measures usually meant jack. Instead, it was a legal cease and desist from a lawyer that constituted "no authorization."