It's weird that Google is charging for the GoogleBot bandwidth on its own servic...

skj · on March 12, 2015

I thought about that too (I'm a Googler working on cloud), but then a colleague mentioned that this would become a way to get free computation from Google.

So, while I agree with the sentiment that it sucks that this crawling eats the quota, the solution is not to simply bypass the quota.

DanBC · on March 12, 2015

> but then a colleague mentioned that this would become a way to get free computation from Google

I'm a bit confused. What computation does the GoogleBot cause to be performed that benefits the Google service user? (Not Googlebot related stuff like indexing).

EDIT: Thanks kyrra!

kyrra · on March 12, 2015

Have a bunch of pages with no real content (but have millions of pages). Everytime someone tries to load a page, do some intensive task (ex: mining bitcoins). If you just make it appealing to GoogleBot and no one else, you get free computational resources.

fragmede · on March 12, 2015

Sorry, I still don't get it.

How does not charging for outgoing network traffic make computation free? You'd still be paying for everything else, eg the instances themselves, datastore storage, read/write datastore calls, using the logs API, which means mining bitcoins wouldn't be free.

brianpgordon · on March 12, 2015

OP's concern isn't with network traffic, it's with GAE compute time. Googlebot keeps causing instances to run.

If requests initiated by Googlebot were free to run, you could make a giant website full of garbage and use each free request to spend 50ms mining bitcoin.

belorn · on March 12, 2015

If the mining are done at Google Cloud Storage, initiated by a google search bot, can't Google then identify and handle such abuse? I assume Google already scans for multiple types of abuse, such as sites that spread malware.

nemothekid · on March 12, 2015

Google shouldn't really have to do this. Replace GoogleBot with BingBot or GCE with AWS and you still have the same problem. A website operator should be working to make sure search crawlers don't consume too many resources given that the bots follow rules.

Otherwise you'd have a team at every cloud provider trying to figure out how to manage bots.

skj · on March 12, 2015

Now you're suggesting that Google basically devotes a team to detecting "crawler-free-quota abuse", when the real solution needs to handles crawlers from many different sources that aren't all Google.

xigency · on March 12, 2015

Is that really how it works?

cordite · on March 12, 2015

What about discounted? It is rather unfair for Google to be eating both the funds and the service here.

That, and adding proper support for tuning the crawl rate.

thekeywordgeek · on March 12, 2015

Absolutely, even I can't say I'm deserving of a free lunch.

thekeywordgeek · on March 12, 2015

Absolutely, It's only coincidence that Google are both the host and the spider. I really would appreciate the ability to throttle it though.

Someone1234 · on March 12, 2015

When your cloud provider lets a client run their own source code, how can you REALLY determine that incoming traffic is even from a crawler? Do you want them to spawn a specific instance of the app just used by Googlebot and then using a load balancer to redirect those requests to those specific instances?

The more you think about this, the more insanely complex it gets.

skj · on March 12, 2015

Google crawlers come from well-known IPs, especially well-known to Google. Appengine requests come through reverse proxies, and there is no fundamental difficulty in not counting requests from crawlers towards the quota. That said, see my other descendant comment.