Many years ago, I was asked to look at why all the content had vanished from a site (not built by me). After digging in a bit, I found that:
1) the original developer's idea of handling an unauthorized /admin request was just to set a redirect header and then continue processing the request as normal.
2) the /admin page had a grid of all the content on the site, with handy 'Delete' links that ran over GET without confirmation.
You can probably guess where this is going – some search bot hit the overview page, ignored the redirect header, saw the content, and dutifully crawled every single link on it…
I think the state of the web has improved slightly over the last decade, but this is a great example of why browser vendors are so conservative. You can still do things like this, but only on an opt-in basis.
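The failure mode above can be sketched in a few lines. This is a hypothetical handler, not the actual site's code (which I never saw the source of): the unauthorized branch sets a Location header but never returns, so the admin grid, complete with its state-mutating GET "Delete" links, still lands in the response body. A browser follows the 302 and the user never notices; a crawler that ignores the redirect status sees the links and follows them.

```python
# Minimal sketch of the bug (hypothetical handler, no real framework).
def handle_admin(user_is_authorized: bool) -> tuple[int, dict, str]:
    status, headers = 200, {}
    if not user_is_authorized:
        status = 302
        headers["Location"] = "/login"
        # BUG: no early return here -- execution falls through and the
        # protected page is rendered into the body anyway.
    # Admin grid with a destructive action behind a plain GET link:
    body = '<a href="/admin/delete?id=1">Delete</a>'
    return status, headers, body

status, headers, body = handle_admin(False)
```

The fix is two-fold: return immediately after setting the redirect (or raise/short-circuit, depending on the framework), and never mutate state on GET, since anything that follows links, including well-behaved crawlers, will trigger it.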
Was it blekko? We had a website owner email us about that issue when blekko's ScoutJet crawler was new... although I don't recall the bit about ignored redirect headers.
I'm pretty sure everyone with a crawler has hit this sort of problem before. The first startup I was at did with someone's wiki that had "delete" links everywhere with no auth.
Now that I've hit it once, I watch out for websites with this problem. I was surprised to notice that a Fortune 50 tech company's internal employee-personal-webpages-maker-thingie had that issue. And then a week later they asked me if I could crawl their internal web. Uh, no, who knows what other internal systems had that problem?