Our SiteTruth system does some web scraping, looking mostly for the name and address of the business behind a web site. We're open about this: we use the user-agent string "Sitetruth.com site rating system", list it on "botsvsbrowsers.com", and document what we do on our web site. We've had one complaint in five years, and that was because someone's security system thought our crawler's behavior resembled a known attack.
About once a month, we check back with each site to see whether its name or address has changed. We look at no more than 20 pages per site; if we haven't found the business address in the obvious places, a human wouldn't have either. So the traffic is very low. Most scraping operations hit sites a lot harder than that, enough to be annoying.
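The per-site page budget described above can be sketched roughly like this; a breadth-first crawl from the front page that stays on the same host and stops after 20 pages. This is an illustration, not SiteTruth's actual code, and `get_links` here is a stand-in for real fetching and link extraction.

```python
from collections import deque
from urllib.parse import urlsplit

PAGE_BUDGET = 20  # give up after this many pages per site

def crawl_site(start_url, get_links):
    """Visit up to PAGE_BUDGET same-host pages, breadth-first."""
    host = urlsplit(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    visited = []
    while queue and len(visited) < PAGE_BUDGET:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):  # fetch page + extract links (stubbed)
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Breadth-first order matters here: the address is usually on or near the front page ("Contact", "About"), so a shallow crawl finds it within the budget if it's findable at all.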
We've seen some scraper blocking. We launch up to three HTTP requests to the same site in parallel. A few sites used to refuse to respond if they received more than three HTTP requests in 10 seconds. That seems to have stopped, though; with some major browsers now doing look-ahead fetching, that's become normal browser behavior. More sites are using "robots.txt" to block all robots other than Google, but it's under 1% of the several million web sites we examine. We're not seeing problems from using our own user-agent string.
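One way to cap in-flight requests per site at three, which, as noted above, is roughly what look-ahead browsers do anyway, is a semaphore around the fetch. A minimal sketch with illustrative names; the actual HTTP request is stubbed out:

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

def fetch_site(urls, max_parallel=3):
    """Fetch a site's pages with at most max_parallel requests in flight."""
    slots = threading.Semaphore(max_parallel)  # per-site concurrency cap
    results = []

    def fetch_one(url):
        with slots:                 # blocks if 3 requests are already out
            results.append(url)     # placeholder for the real HTTP GET

    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = [pool.submit(fetch_one, u) for u in urls]
        wait(futures)
    return results
```

Keeping the cap in a semaphore rather than in the pool size means one worker pool can serve many sites while each site still sees at most three simultaneous connections.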
So I'd suggest: 1) obey "robots.txt", 2) use your own user-agent string that clearly identifies you, and 3) don't hit sites very often. As for what you can do with the data, talk to a lawyer and read Feist v. Rural Telephone.
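The first two suggestions can be combined in a few lines using Python's standard library. The user-agent string and URL handling here are examples, not SiteTruth's code:

```python
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "example.com site rating system"  # identify yourself clearly

def polite_fetch(url):
    """Fetch a page only if the site's robots.txt permits our user agent."""
    parts = urllib.parse.urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # blocked by robots.txt; skip this page
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()
```

In practice you'd fetch and cache robots.txt once per site rather than per page, but the check itself is this simple.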