a little backstory (that seems to be missing from the blog post). this is a technique that was invented in the early days of google for deduplicating the crawl set (turns out that building an llm and building a plain ol' web text index look awfully similar; i wonder why that is?!). you can read about it in depth in the free book "mining massive datasets" by leskovec, rajaraman, and ullman, which describes many of the really cool and impressive techniques that were used to build an index over the entire internet back when such a thing was genuinely hard.
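(for anyone curious what the trick actually is, here's a rough python sketch of the core idea, not anything google or altavista actually shipped: hash each document's shingle set with a bunch of random hash functions, keep the minimum per function, and the fraction of matching minimums between two signatures estimates the jaccard similarity of the underlying sets. the constants and hashing scheme below are purely illustrative.)

```python
import random

def minhash_signature(shingles, seeds):
    # for each seed (one per hash function), keep the minimum hash value
    # seen over all elements of the set
    return [min(hash((seed, s)) for s in shingles) for seed in seeds]

def estimated_jaccard(sig_a, sig_b):
    # fraction of hash functions on which both sets attain the same minimum;
    # this is an unbiased estimate of |A ∩ B| / |A ∪ B|
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

seeds = [random.randrange(2**32) for _ in range(128)]
doc1 = set("the quick brown fox jumps over the lazy dog".split())
doc2 = set("the quick brown fox leaps over the lazy dog".split())
print(estimated_jaccard(minhash_signature(doc1, seeds),
                        minhash_signature(doc2, seeds)))
```

the point is that two near-duplicate pages get near-identical short signatures, so you can compare fixed-size signatures (or bucket them with locality-sensitive hashing) instead of comparing every pair of full documents.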
you can find the relevant material for free by searching "chapter 3 pdf mmds ullman"
enjoy!
edit: oh no! i'm wrong! according to wikipedia it was invented at dec for altavista: https://en.wikipedia.org/wiki/MinHash. either way, there's a nice description in the ullman book, and they do describe how it was used at google as well.