
I wish every post/tool out there on scraping covered obeying robots.txt. It's a crap standard, but it's what we've got.

"Just ignore it" is a great way to identify yourself as a crappy netizen.



I'm glad all the sites you target want your scraper to access them. In many cases the whole point of using a scraper is to access information that isn't provided through an API and is otherwise encased in HTML. Most of those sites' robots.txt files read:

    User-Agent: *
    Disallow: /


Then we have no right to scrape that content.

Why is there an implied right to scrape?



