If only Google had access to a full blown browser they could use in the crawl en...

rurounijones · on Oct 7, 2013

* At scale, without massive performance drops

Volpe · on Oct 7, 2013

I'm confused, search indexing isn't a realtime exercise... Why would performance be an issue? Running a headless browser vs running "whatever it is they run that can execute JS" doesn't seem like a huge leap...

rurounijones · on Oct 7, 2013

At Google's scale, any performance drop can have massive implications. If Google's crawl rate is 100 million[1] pages a day, then a 1% drop in crawl rate means 1 million fewer pages crawled per day (which has many implications, for example having to use more compute power to regain the crawl rate, which raises costs, etc.)

You are right that it is not a real-time exercise but they do have crawl targets.

You cannot be flippant about "Why don't they just do X" when scale is that big.

[1] Picked out of the air but probably the right order of magnitude (or even a little small)
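
The arithmetic in the comment above can be checked with a minimal sketch. The 100 million pages/day figure is the commenter's own guess, not a real Google number, and the function names are hypothetical. The second function additionally assumes that crawl throughput scales linearly with compute, which the thread does not establish.

```python
def pages_lost_per_day(crawl_rate_per_day: int, drop_percent: float) -> int:
    """Pages no longer crawled each day after a relative slowdown."""
    return round(crawl_rate_per_day * drop_percent / 100)


def extra_compute_percent(drop_percent: float) -> float:
    """Extra compute (in percent) needed to restore the original crawl rate,
    ASSUMING throughput scales linearly with compute."""
    remaining = 1 - drop_percent / 100
    return (1 / remaining - 1) * 100


# The commenter's guess, not a measured figure.
ASSUMED_CRAWL_RATE = 100_000_000

lost = pages_lost_per_day(ASSUMED_CRAWL_RATE, 1.0)
extra = extra_compute_percent(1.0)
print(f"{lost:,} fewer pages per day")
print(f"about {extra:.2f}% more compute to regain the rate")
```

Under these assumptions a 1% slowdown costs one million pages a day, and winning it back takes slightly more than 1% additional compute.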

est · on Oct 7, 2013

Have you ever experienced web apps that lag like crap? Yeah, think about that x 10000 million web pages.

Volpe · on Oct 7, 2013

... right but a bot doesn't get impatient. So I don't see your point.

cygx · on Oct 7, 2013

They should just shut down all their data centers and crawl the whole web from a single box located in someone's basement.

After all, the bot doesn't get impatient.

Volpe · on Oct 7, 2013

... Comments have really gone to shit here, haven't they?

Somehow we all end up antagonistic over bullshit like whether Google has a big enough computer.

But alas, you're right, Google could never crawl with an actual browser - what a ridiculous suggestion. I apologise for such a dumb-witted comment.

As an aside: for my part in contributing such bad-quality comments, I apologise.

cygx · on Oct 7, 2013

The point is that Google probably doesn't have a lot of cycles to spare - anything else wouldn't be good business sense.

Anything that significantly adds to the load will lose them money - whether or not the operation needs to be realtime is secondary to that.

I apologise for giving offense: I wrote the comment the same way I would have made it face-to-face, which is always a bit risky in a purely textual medium.

dsl · on Oct 7, 2013

I don't know if you are trying to be serious at this point or not. Google has millions (literally) of machines with dozens of cores each. Search is their business that makes all the money.

Google executes JavaScript and renders the full DOM for every page internally. They generate full length screenshots of every page and have pointers to where text appears on the page so they can do highlighting of phrases within the screenshot.

It isn't even a debatable question if Google reuses the Chrome engine to do this.

richardwhiuk · on Oct 7, 2013

The compute costs would be extraordinarily high.