If you literally mean Yahoo never used the exact specification of the PageRank a...

utopcell · on June 1, 2019

Yahoo's core search engine was based on Inktomi's, which was acquired in 2003. There was no PageRank in there, and there was no infrastructure to execute anything similar either. (I worked for a significant amount of time on link-level features during Marissa's rebirth of Yahoo Search.) PageRank is an algorithm that is trivial to understand and prototype, but hard to scale efficiently to 100's of billions of pages.

bborud · on June 2, 2019

That's largely true. But not quite, as Yahoo also acquired the FAST engine. However, since the team developing this were located in Trondheim, Norway, the FAST-based web search was eventually discontinued (and morphed into the VESPA project).

FAST's web search used Page Rank from late 1999.

utopcell · on June 2, 2019

Fascinating! This explains why I never saw anything like PageRank in YST: I presume it was present in what became Vespa (which, to be fair, probably didn't scale to YST's corpora sizes.) Pity we can't continue this conversation offline..

bborud · on June 9, 2019

Of course we can continue the conversation offline. I'm not hard to find :-).

Yeah, VESPA wouldn't have scaled back then, but the search engine we used was far more scalable than Inktomi since it was the same search engine we used for web search. We did hold the record for largest index for a while (to distract people from the fact that our ranking was lagging behind Google's :-)).

But the search engine itself wasn't really the point for VESPA. Also page rank wasn't really as relevant for the use-cases VESPA was for. In fact, ranking in small, special corpora is very different from ranking in web search. And in the case of small document corpora: surprisingly hard, so one depended on tools to specialize both search, ranking and result processing.

I wrote the first implementation of the VESPA QRS with a couple of other guys, which I think was the second component in VESPA (if you count the fsearch/fdispatch as the first). I think this was the first step towards making easily customizable search. The big initial barrier was to convince people Java would be fast enough for this. (I was prepared for a 30% loss of performance in exchange for ease of extension. What we got was a 200% performance boost over the C++ implementation before even starting to optimize. But it was a bit of work to make it play nice with GC in Java and I remember David Jeske at Google refusing to believe me when I outlined how we'd done it :-))

An interesting question is what would have happened if Yahoo had chosen FAST web search instead of Inktomi. According to Jeff Dean, our search engine was the only competitor he was worried about (mentioned over lunch in may 2005 after i accepted a position at Google). Possibly because he didn't understand why it performed well. We made some fundamentally different design bets than Google (they bet RAM would become cheap fast, we bet that it wouldn't. They were right).

Inktomi was a technological dead end. That was a stupid choice by Yahoo top management based solely on geography and reflecting the ineptitude of top management when it came to technology.

To be quite frank, I think Yahoo would have flubbed web search either way. The only reason VESPA managed to survive at all was because it was being developed in Trondheim Norway - far away from Sunnyvale where we could get away ... well, bullshitting leaders and pretending to obey them while doing our own thing. Not that we weren't in deep doo-doo initially (we were in over our heads), but we had some really great people that were able to orchestrate the mess that was VESPA into something that worked, and then something that worked well.

Without mentioning any names, Yahoo had a problem with technologically inept leaders as well as too many useless middle managers. At the time just before we were acquired by Yahoo, it was quite clear that separating out important bits of the search engine into infrastructure components was key. Google had understood this early and done a few very important things (GFS, Protobuffers, MapReduce, Borg etc).

The funny thing was: our first two versions of our search engine (in 1998 and 1999) essentially used MR for crawling and processing, but we did so with shell scripts and duct tape (it was a mess). Anything that could be turned into "sort and scan", as we thought of it, could be done fast. Including page rank and deduplication - and deduplication was a much, much, much harder problem than page rank.

And when I say shell scripts and duct tape: we used unix sort, pipes, shell scripts and various small programs to do "mapping" and "reduction" :-) (strictly speaking, we used our own version of UNIX sort to have the same sorting order on all platforms, but essentially unix sort). Management were only focused on short term sales to portals, so we just reimplemented the same primitives over and over and over in every piece of technology we made. Wasting a ton of time and effort.

I was working on a storage system at the time that was sort of a combination of GFS, MR and Borg (the design came from before the papers about GFS etc were published). The idea was to have a distributed storage on which you could execute code in a sandboxed environment on each node. Meaning that you send the code to where the data lives and process it locally in a parallel manner and stream output to other nodes in the system. After certain executives felt a need to get involved and dictate technology choices I figured the project was doomed and abandoned it. (It was, for a while, known as "the storage system that can't store stuff").

Today I think that my approach would have been too complex to be sufficiently easy to develop. There were certain things about GFS I really didn't like (too trusting of clients), but slicing the problem into distinct domains was the right thing to do. Also, Google had Chubby and we didn't.