One of our customers was paying a third party to hit our website with garbage traffic a couple of times a week to make sure we were rejecting malformed requests. I was forever tripping over those requests in Splunk while looking for legitimate problems.
We also had a period where we generated bad URLs for a week or two, and the worst part is that I think they were on links marked nofollow. Three years later there was a bot still trying to load those pages.
And if you 429 Google’s bots, they will reduce your PageRank. That’s straight-up extortion from a company that also sells cloud services.
I don’t agree with you about Google being well behaved. They were following nofollow links, and they’re also terrible if you’re serving content on vanity URLs: any throttling they do on one domain name just hits two more.
I guess my position is that it was comparatively well behaved? There were bots that would blitz the website at full speed, for absolutely no reason. You just scraped this page 27 seconds ago; do you really need to check it for an update again? Also, it hasn't had a new post in the past 3 years; is it really going to start being lively again?
If I'm understanding you correctly, you had an indexable page that contained links with the nofollow attribute on the <a> tags.
It's possible some other mechanism got those URLs into the crawler, like a person visiting them. Nofollow on the link won't prevent the URL from being crawled or indexed. If you're returning a 404 for them, you ought to be able to use Webmaster Tools, or whatever it's called now (Search Console, I believe), to request removal.
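If the goal is to make the crawler give up on those URLs for good, here's a minimal sketch of one approach, assuming a Flask app and a made-up /search/ path standing in for the bad links: return 410 Gone instead of 404, and set an explicit X-Robots-Tag so the URL gets dropped from the index too.

    from flask import Flask, Response

    app = Flask(__name__)

    # Hypothetical pattern for the dead URLs; adjust to the real ones.
    @app.route("/search/<path:junk>")
    def gone(junk):
        # 410 signals the page is permanently gone, which crawlers
        # generally treat as a stronger removal hint than a plain 404.
        resp = Response("Gone", status=410)
        # Belt and suspenders: explicitly ask for de-indexing as well.
        resp.headers["X-Robots-Tag"] = "noindex"
        return resp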
The dumbest part is that we’d known about this for a long time, and one day someone discovered we’d implemented a feature toggle to remove those URLs and then never turned it on, despite an announcement that we had.
They were meant to be interactive URLs on search pages. Someone implemented them, I think, trying to make accessibility (a11y) work, but the bots were slamming us. We also weren’t doing canonical URLs right on the destination pages, so they got crawled again every scan cycle. So at least three dumb things were going on, but the sorts of mistakes that normal people could make.
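The canonical fix is roughly to collapse every variant of a destination page to one normalized URL before rendering; here's a minimal sketch, stdlib only, with hypothetical normalization rules (drop the query string, fold the www. host variant):

    from urllib.parse import urlsplit, urlunsplit

    def canonical_url(url: str) -> str:
        # Collapse host and query-string permutations to one form so a
        # crawl cycle doesn't see every variant as a brand-new page.
        parts = urlsplit(url)
        host = parts.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        path = parts.path.rstrip("/") or "/"
        # Assumes query and fragment don't change the content.
        return urlunsplit((parts.scheme, host, path, "", ""))

The value this returns is what would go in the destination page's <link rel="canonical"> tag, so repeat crawls converge on one URL.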
I thought the argument was that if you run on GCP you can masquerade as Googlebot and not get a 429, which is obviously false. Instead it looks like the argument is more of the tinfoil-hat variety.
BTW, you don't get dropped for issuing temporary 429s, only when it's consistent and/or the site is broken; that is well documented. And wtf else are they supposed to do if you won't let them crawl it and it goes stale?
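"Temporary" here basically means the 429 carries a backoff hint and eventually stops; a minimal sketch, assuming a Flask app and a naive per-IP counter standing in for a real rate limiter:

    import time
    from flask import Flask, Response, request

    app = Flask(__name__)
    hits = {}  # naive per-IP counters; a real limiter would use a sliding window

    @app.before_request
    def throttle():
        now = time.time()
        start, count = hits.get(request.remote_addr, (now, 0))
        if now - start > 60:
            start, count = now, 0
        hits[request.remote_addr] = (start, count + 1)
        if count + 1 > 100:  # arbitrary 100 requests/minute cap
            resp = Response("Slow down", status=429)
            # Retry-After marks this as temporary backoff, not a hard block.
            resp.headers["Retry-After"] = "60"
            return resp

    @app.route("/")
    def index():
        return "ok"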