This isn't exactly abnormal. SO is a big site with a lot of fresh content; I'm guessing Google indexes many thousands of sites at that rate. What's surprising to me is that it surprises them.
Indeed. There are only 86,400 seconds in a day, or about 31.5 million in a year. Even if you assume that Google refreshes each page only a couple of times per year (absurdly infrequent by their freshness standards), a site with tens of millions of pages in the index still has to accept, on average, multiple crawls per second from the bots.
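A quick sanity check of that arithmetic (a TypeScript sketch, purely illustrative; the 30-million-page count is an assumption, not a measured figure):

    // Back-of-the-envelope check of the crawl-rate arithmetic above.
    const secondsPerYear = 365 * 24 * 60 * 60;      // ~31.5 million
    const indexedPages = 30_000_000;                // assumption: "tens of millions" of pages
    const refreshesPerPagePerYear = 2;              // deliberately conservative

    const crawlsPerSecond = (indexedPages * refreshesPerPagePerYear) / secondsPerYear;
    console.log(crawlsPerSecond.toFixed(1));        // ~1.9 requests/second, before any new content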
Well, whether that much crawling is necessary depends on how often the content is actually updated. There must be only a few thousand active pages on Stack Overflow on any given day, so fetching a million-plus pages per day seems like overkill.
Google tends to reduce its crawl frequency for content that changes less frequently. Perhaps in SO's case, it struggles to identify recently changed content without crawling the entire site.
SO gets well over 10k completely new posts (Q&As) on a typical day.
A few thousand older posts are edited each day.
Throwing comments into the mix more than doubles the number of "new things", and of course every question, answer, comment, or edit is displayed on multiple pages (generally, on the owning user's page).
tl;dr - there's a lot more page churn than you might expect.
Here are our 10/s crawl stats too; thought I'd share for contrast.
Though it's about half as many pages crawled per day. Note the page load times :)
Oops, recent regression due to building an internal cloud.
So you're right, not out of the ordinary.
10 requests per second doesn't sound majorly high. That's 36,000 pages per hour, which, whilst big, doesn't sound too high, especially for a site as popular as SO (Alexa puts it at the 137th most popular site; granted, Alexa isn't the most accurate).
This is addressed in the post - apparently it's hitting pages that haven't been accessed in a while, starting background tasks - but it still seems odd to me. I'd have expected a huge amount of Stack Overflow's traffic to come from long tail searches, which should be basically the same thing. Excerpt for the lazy:
"and when Google hits thousands of pages in a few minutes, that can kick off a lot of background work, such as rebuilding related questions. Not expensive by itself, but when multiplied by a hundred at once.. can be quite painful."
The rules are that you can't send Google different page content than you send regular browsers, but there's no reason they have to run all the background processes on Googlebot requests -- can't they just send it the most recent cached version?
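A minimal sketch of that idea, assuming a plain Node-style handler; the cache, the renderer, and the job queue here are hypothetical stand-ins, not anything SO has described:

    import { createServer } from "node:http";

    // Hypothetical stand-ins for whatever the real app actually does.
    const pageCache = new Map<string, string>();
    const renderFresh = (url: string) => `<html><body>rendered ${url}</body></html>`;
    const enqueueRelatedQuestionRebuild = (url: string) =>
      console.log(`queueing related-question rebuild for ${url}`);

    const CRAWLER_UA = /googlebot|bingbot|slurp/i;

    createServer((req, res) => {
      const url = req.url ?? "/";
      const isCrawler = CRAWLER_UA.test(req.headers["user-agent"] ?? "");
      const cached = pageCache.get(url);

      // Crawlers get the same markup a recent visitor saw, with no side effects.
      if (isCrawler && cached !== undefined) {
        res.writeHead(200, { "Content-Type": "text/html" });
        res.end(cached);
        return;
      }

      const html = cached ?? renderFresh(url);
      pageCache.set(url, html);
      if (!isCrawler) {
        // Expensive background work only for human traffic.
        enqueueRelatedQuestionRebuild(url);
      }
      res.writeHead(200, { "Content-Type": "text/html" });
      res.end(html);
    }).listen(8080);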
Not a bad idea, but seems like it would be tricky to get right. You do kinda want Google to have the most recent version of a page, all other things being equal.
Why not put the cached content in the page by default, then do an update via AJAX only in the case where the cached version is old? That way it's not triggered for crawlers. It's probably secondary content anyway.
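Roughly what that could look like on the client, assuming the server stamps each cached fragment with a data-rendered-at attribute and exposes a hypothetical /fragment endpoint; since crawlers generally don't execute scripts, they never trigger the extra request:

    // Refresh a server-cached fragment only when it is older than five minutes.
    const MAX_AGE_MS = 5 * 60 * 1000;

    async function refreshIfStale(el: HTMLElement): Promise<void> {
      const renderedAt = Date.parse(el.dataset.renderedAt ?? "");
      if (Number.isNaN(renderedAt) || Date.now() - renderedAt < MAX_AGE_MS) {
        return; // cached copy is fresh enough: no extra request, no extra server work
      }
      // Hypothetical endpoint returning just the re-rendered fragment.
      const res = await fetch(`${location.pathname}/fragment`);
      if (res.ok) {
        el.innerHTML = await res.text();
      }
    }

    // Kick off the check for every stamped fragment on the page.
    document
      .querySelectorAll<HTMLElement>("[data-rendered-at]")
      .forEach((el) => { void refreshIfStale(el); });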
A few days ago I was searching around for some SVG radial background gradient something-or-other that I wasn't even sure existed, and the top hit was an SO question that had been asked 7 hours before. Answered my question, too. I was impressed.
What's funny is when someone asks a question and you think "Oh, I'll bet I could answer this with a little googling." And the top result turns out to be the question you're trying to answer.
What's frustrating is trying to Google a question and finding forum threads that tell the original questioner to use Google and refuse to answer the question.
I was trying to look up an error[1] regarding Padrino the other day and someone had asked about the same problem I was having. The exact question I typed in had been asked on SO 6 hours before I encountered it and then deleted from SO, but Google had cached it.
I was sad the top result led to a deleted question without an answer, but impressed that an exact match to my question, cached 2 minutes after it was asked, was the top result (hopefully it will drop out of the results soon, since it now leads to SO's 404 page).
My two takeaways: you can generally make anything scale if you cache like hell, and I personally don't see enough value in .NET to justify the licensing costs, either to roll it out initially or over the long term.
One point on the cost issue: to an individual, a Windows Server license looks like a lot of money, but to a business it really doesn't matter. Every server my company buys has many paying customers tied to it, so the cost barely registers for us.
Besides the monetary cost, there's the opportunity cost of dealing with licenses in the first place. Part is compliance (Does your company have current licenses that cover every bit of software on every virtual machine on every developer laptop? Can you prove it?) and part is procurement (Do you get the plan with free upgrades, or do you buy new? Will you need enough licenses over the next two years that you should get a site-license or is it cheaper to stick with single user licenses?).
I always thought that, at scale, SO and (Facebook or Twitter) were an apples-to-oranges comparison; not because of the amount of load, but because of the type of load.
For whatever reason, I have it in my head that the difficulties Facebook and Twitter (and even Digg) face in scaling come from the social aspects of their sites. Those are the things that require custom software (FlockDB, Cassandra) and a lot of machines.
Perhaps I need to use SO again, but back in the day this social aspect of SO didn't exist. That means their scaling challenges are far more traditional, like Slashdot's: a 99%-cacheable-reads kind of thing.
If I'm right, SO is really just a case study that, depending on what they are doing, some startups will be able to scale with .NET.
The interesting thing to know would be how much more efficient a push-based indexing approach would be compared with the current pull-based model. If frequently updated sites could push change notifications to Google, it would solve this problem. However, I'm not sure how Google could trust such sites not to overload its own servers.
Yeah, Google supports the sitemaps standard but that doesn't really cater for content as dynamic as Stack Overflow's. The last-updated format is a day rather than a timestamp, for example, making it useless for very-frequently updated content.
My question would be "does SO create content at that rate?" It seems to me that Google need not index your site faster than you're creating things for it to see. Is there a way to automatically tie how often Google crawls you to how often your users create content?
Google is constantly re-indexing old pages, so the rate of new content creation isn't that big a factor in the crawl rate (though I imagine it does cause Google to ramp up their crawling rates if they aren't already running at the maximum).
This is a problem we had at a large social network I used to work at. Launching a directory of users primarily for Google's consumption was difficult to scale given the huge size of our database.
The solution for us was Node.js.
There is an API via which you can describe your URLs (it's called a sitemap), and you can ping Google with your sitemap when its content changes. You can have multiple sitemaps and ping only the ones that changed. More at www.sitemaps.org.
But Google reserves the right to crawl non-sitemap URLs, for obvious reasons. It would be quite a bad decision for them to restrict their crawls to API-provided URLs only.
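If I remember right, the ping itself is just an HTTP GET against Google's sitemap ping endpoint (www.google.com/ping); a TypeScript sketch, with the sitemap URL made up:

    // Notify Google that a (hypothetical) delta sitemap has changed.
    const sitemapUrl = "https://example.com/sitemaps/recent-changes.xml";

    const res = await fetch(
      "https://www.google.com/ping?sitemap=" + encodeURIComponent(sitemapUrl)
    );
    console.log(res.ok ? "ping accepted" : `ping failed: HTTP ${res.status}`);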
We try to give search engines hints with the update frequency in our sitemaps.
We re-build our sitemaps nightly and make sure that new or recently-updated content is listed with an update frequency of "daily" or "weekly" and all other content pages are listed as being updated "monthly."
To be honest, I've never measured if it works, but it can't hurt.
You can have multiple sitemaps. You can ping just one sitemap containing only the links you want to notify Google about. You can use the optional <lastmod> tag to indicate a URL's last change date.
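For concreteness, a sketch of what such a "changes only" sitemap could look like when generated, combining the changefreq hints mentioned above with per-URL lastmod dates (URLs and dates are invented):

    interface SitemapEntry {
      loc: string;
      lastmod: string;   // YYYY-MM-DD, as in the examples on sitemaps.org
      changefreq: "daily" | "weekly" | "monthly";
    }

    // Only the pages that changed since the last rebuild.
    const recentChanges: SitemapEntry[] = [
      { loc: "https://example.com/questions/12345", lastmod: "2011-03-10", changefreq: "daily" },
      { loc: "https://example.com/users/678",       lastmod: "2011-03-09", changefreq: "weekly" },
    ];

    const sitemap =
      '<?xml version="1.0" encoding="UTF-8"?>\n' +
      '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
      recentChanges
        .map(
          (e) =>
            "  <url>\n" +
            `    <loc>${e.loc}</loc>\n` +
            `    <lastmod>${e.lastmod}</lastmod>\n` +
            `    <changefreq>${e.changefreq}</changefreq>\n` +
            "  </url>\n"
        )
        .join("") +
      "</urlset>\n";

    console.log(sitemap); // write this out, then ping Google with its URL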
There are services in Webmaster Tools for pushing sitemaps, no? That wouldn't be quite sufficient for this (Google still has to crawl the pages listed in the sitemap), but it's about as close as you're going to get.
That screenshot was to point out that changing from "automatic" to "custom" made no difference, i.e. the rate Google's automatic setting had settled on was already "full pelt".
I was surprised how much load web crawlers (Google, Bing, Yahoo, etc.) imposed on us at PatientsLikeMe, the majority of it being Google. The "intelligent" rate limiting results in a very high crawl rate for many sites.
We added additional caching and manually lowered the crawl rate to address this at PatientsLikeMe.
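For what it's worth, the manual crawl-rate knob differs per engine: as I understand it, Google's rate is only adjustable in Webmaster Tools (Googlebot ignores Crawl-delay), while Bing and Yahoo honor a robots.txt directive roughly like this (the 5-second value is just an example):

    User-agent: bingbot
    Crawl-delay: 5

    User-agent: Slurp
    Crawl-delay: 5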