I love the article on hyperloglog! It is really quite good to read even if you'r...

lucasschm · on May 27, 2017

You are correct, but HyperLogLog has many buckets counting the longest run of zeros in order to avoid the problem of outliers. I recently studied these probabilistic algorithms and did a notebook with code and plots to show their performance: https://github.com/lucasschmidtc/Probabilistic-Algorithms/bl...

snowcrshd · on May 27, 2017

Thanks for sharing that!

Just skimmed through it and seems pretty interesting. I'll read it more in depth later.

lucasschm · on May 27, 2017

No problem. If there are mistakes or a segment is not clear, let me know

meetapoorvgupta · on May 28, 2017

Thanks for the write up, Lucas. It was very intuitive and I learnt a lot.

I noticed that you used 5000 buckets to store the frequency of 7000 non-unique words in the section on 'Counting Bloom Filters'. How is that better than using 7000 buckets and a uniformly distributed hash function, which would maintain frequencies perfectly? We would be using fewer buckets by an order of magnitude in a real-world implementation to save memory.

lucasschm · on May 28, 2017

Yeah, I should have given more thought to that number. Updated the example for N=300. Thanks

anhldbk · on May 27, 2017

Thanks for sharing guy! Interesting repo.

twotwotwo · on May 28, 2017

For outliers by random chance, lucasschm's reply explains.

The usual trick for preventing folks _maliciously_ sending in outliers is to use a hash with a secret key such as SipHash so that folks on the outside can't trivially figure out what inputs will lead to hashes with a lot of leading zeroes.

Lukassus · on May 27, 2017

The HyperLogLog was a very nice article, but I wanted to ask, is this related to the estimation of Naci tanks during WW2 by Allies? https://www.wired.com/2010/10/how-the-allies-used-math-again...

andreareina · on May 28, 2017

The German Tank Problem guesses the size of a set, given a limited sample and successive serial numbers. If they had randomized the serial numbers it wouldn't have worked.

HyperLogLog is different because you have the entire population (not just a sample), and it's a multiset (the same element can appear more than once). Getting the size of a (non-multi) set is easy, you just keep a counter and increment it for each element; it only takes enough memory to maintain the counter. Counting the distinct members of a multiset takes a lot more memory because you have to remember whether you've already seen a particular element or not.

The tl;dr is that the German Tank Problem is about making an estimate of size when you have imperfect information, and HyperLogLog gives you an estimate when you have perfect information, but it's too expensive to make an exact calculation.