A 503 would still require a GAE instance to be running, so it wouldn't necessarily deal with my problem.
I have seen "noindex, nofollow" kill a site stone dead in the past, so I am very wary indeed of using it. In my experience, once you've noindexed a page it is nigh-on impossible to get the engine to index it again.
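For reference, the directive being discussed is the standard robots meta tag, or its HTTP-header equivalent for non-HTML responses:

    <!-- in the page's <head> -->
    <meta name="robots" content="noindex, nofollow">

    # or, as a response header
    X-Robots-Tag: noindex, nofollow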
My content is autogenerated, though I hope it has enough value to be considered useful. It's time-series data of word frequencies in politics, so for example you might use it to see how one candidate is doing relative to another in an election campaign.
FWIW I think the main problem is that you're essentially creating an "infinite space," meaning there's an extremely high number of URLs that are findable through crawling your pages, and the more pages we crawl, the more new ones we find. There's no general & trivial solution to crawling and indexing sites like that, so ideally you'd want to find a strategy that allows indexing of great content from your site, without overly taxing your resources on things that are irrelevant. Making those distinctions isn't always easy... but I'd really recommend taking a bit of time to work out which kinds of URLs you want crawled & indexed, and how they could be made discoverable through crawling without crawlers getting stuck elsewhere. It might even be worth blocking those pages from crawling completely (via robots.txt) until you come up with a strategy for that.
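For example, a blunt stop-gap like this would keep crawlers out of the generated URL space entirely until there's a strategy for it; the /terms/ prefix is made up, since the site's real URL structure isn't shown here:

    User-agent: *
    # hypothetical prefix for the auto-generated time-series pages
    Disallow: /terms/
    # or, to block crawling of everything for now:
    # Disallow: /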
I can't check at the moment, but my guess is that all of these generate the same content (and that you could add even more versions of those keywords in the path too). These were found through crawling, so somewhere within your site you're linking to them, and they're returning valid content, so we keep crawling deeper. That's essentially a normal bug worth fixing regardless of how you handle the rest.
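One rough way to close that hole, sketched in Python with an invented /term/<keyword> URL scheme (none of these names come from the actual site): collapse every variant to a single canonical path with a 301, and 404 anything that doesn't reduce to one, so crawling stops turning up "new" URLs:

    import re

    # Hypothetical canonical form: exactly one lowercase keyword, e.g. /term/economy
    CANONICAL = re.compile(r'^/term/([a-z0-9-]+)$')
    KEYWORD = re.compile(r'^[a-z0-9-]+$')

    def resolve(path):
        """Map a requested path to (status, location).

        200 -> serve it as-is
        301 -> redirect to the one canonical URL for that keyword
        404 -> unknown variant; stop the crawler from treating it as a new page
        """
        if CANONICAL.match(path):
            return 200, None
        parts = [p for p in path.strip('/').split('/') if p]
        if len(parts) >= 2 and parts[0] == 'term':
            keyword = parts[1].lower()
            if KEYWORD.match(keyword):
                # /term/Economy/, /term/economy/economy/jobs, ... -> /term/economy
                return 301, '/term/%s' % keyword
        return 404, None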
> A 503 would still require a GAE instance to be running, so it wouldn't necessarily deal with my problem.
And persistence to track how many crawl requests have been served in the last N minutes. Even blindly serving a million 503's an hour could get really expensive.
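Something like this sketch is what I mean, assuming the old GAE Python runtime, with memcache as the cheap "persistence" and a completely made-up hourly budget; memcache counters can be evicted, so it's only approximate, and each 503 still costs a (small) request:

    import time
    from google.appengine.api import memcache
    from google.appengine.ext import webapp

    CRAWL_BUDGET_PER_HOUR = 5000  # hypothetical figure; tune to your actual quota

    def over_crawl_budget(user_agent):
        """Count crawler hits in the current hour and report when the budget is spent."""
        if 'Googlebot' not in (user_agent or ''):
            return False
        bucket = 'crawl-%d' % int(time.time() // 3600)  # one counter per hour
        memcache.add(bucket, 0, time=7200)              # no-op if it already exists
        count = memcache.incr(bucket)
        return count is not None and count > CRAWL_BUDGET_PER_HOUR

    class TrendPage(webapp.RequestHandler):             # hypothetical handler name
        def get(self):
            if over_crawl_budget(self.request.headers.get('User-Agent')):
                self.response.set_status(503)
                self.response.headers['Retry-After'] = '3600'
                return
            # ...the expensive page rendering would go here...
            self.response.out.write('ok')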
I'd be concerned about adverse effects on my indexing. But it wouldn't fix my problem, as it would still be a request that a GAE instance would have to handle.
(edit) Yes, the bot is still hitting the GAE site atm even though it's returning a quota error.
If you have such a big site, you should avoid resource-priced cloud hosting and go with your own VPS. It might take you more time to set up, but it will definitely be much cheaper ($50/month or less), and it can surely handle all your traffic until it grows really big....
Do you have any idea whether this service will attract any actual users? Based on what I've read, I can't even figure out what it does, so I am certainly not a potential user. But do you have any actual demand?
What I'm hearing is that you've built a massive application, you've run into a technical problem, and now you would rather wait on Google to fix it than take any suggestions on how to get it up for actual users to use. Seriously, don't do this - at your stage, it would be better to have 10 real users than a site that has been fully indexed by Google.
On your note about persuading Google to index your site after it has been excluded, do you have any actual experience with this happening? I've been doing this kind of stuff for years and years and have never had a problem. It can take five or six weeks at the outside, but that is still less of a problem than a product that can't be accessed...
I am a little concerned about doing that, though. Having seen sites killed stone dead by people blocking stuff by mistake in robots.txt, with the engines then never looking at them again, I'm very wary indeed of blocking stuff I intend to unblock later.
It sounds like you care more about Google traffic than you care about real users. If you want to do this without having more than a few entries in your robots.txt and your sitemap, then you could simply remove the other pages until you're ready to have them spidered; alternatively, put them behind a login and hand out invitations.
In a nutshell: if you put up millions of pages and tell Google about them, it will index you. If you don't want that, you'll have to make choices about the quantity and/or switch to a different kind of host.
Also, this kind of 'bot trap' tends to attract penalties, so if this is not some ploy to get traffic out of Google, you may want to reconsider how you've laid things out. The difference between a legitimate site with a lot of generated pages and a page-spammer is hard to determine, and Google tends to err on the side of caution.
Since your site is down I can't see how it's organized, but I would think hard about having two separate parts to it. One would be a site that changes slowly, perhaps a descriptive page, maybe with a 'best of' or examples that you cull. Then have the main page that is rapidly updated and keep the 'bots out of that. There's no reason to index the rapidly changing map, is there? Just index a slowly changing pointer to it.
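Concretely, the slow-changing half could be the only thing listed in the sitemap (these URLs are placeholders), with robots.txt keeping crawlers out of the fast-moving part as discussed above:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- hypothetical slow-changing pages worth indexing -->
      <url><loc>http://example.com/</loc><changefreq>weekly</changefreq></url>
      <url><loc>http://example.com/about</loc><changefreq>monthly</changefreq></url>
      <url><loc>http://example.com/highlights</loc><changefreq>weekly</changefreq></url>
    </urlset>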