I love the article on hyperloglog! It is really quite good to read even if you're not interested in algorithms. I always liked number theory and I think that it's very interesting that you can guess how many uniques there are by counting how long your longest run of zeroes in a hash is.
I suppose this could be broken by injecting in a unique visitor id that would hash to something with an absurd amount of zeroes? That's assuming that the user has control over their user id and that I'm understanding the algorithm correctly.
You are correct, but HyperLogLog has many buckets counting the longest run of zeros in order to avoid the problem of outliers. I recently studied these probabilistic algorithms and did a notebook with code and plots to show their performance: https://github.com/lucasschmidtc/Probabilistic-Algorithms/bl...
Thanks for the write up, Lucas. It was very intuitive and I learnt a lot.
I noticed that you used 5000 buckets to store the frequency of 7000 non-unique words in the section on 'Counting Bloom Filters'. How is that better than using 7000 buckets and a uniformly distributed hash function, which would maintain frequencies perfectly? We would be using fewer buckets by an order of magnitude in a real-world implementation to save memory.
For outliers by random chance, lucasschm's reply explains.
The usual trick for preventing folks _maliciously_ sending in outliers is to use a hash with a secret key such as SipHash so that folks on the outside can't trivially figure out what inputs will lead to hashes with a lot of leading zeroes.
The German Tank Problem guesses the size of a set, given a limited sample and successive serial numbers. If they had randomized the serial numbers it wouldn't have worked.
HyperLogLog is different because you have the entire population (not just a sample), and it's a multiset (the same element can appear more than once). Getting the size of a (non-multi) set is easy, you just keep a counter and increment it for each element; it only takes enough memory to maintain the counter. Counting the distinct members of a multiset takes a lot more memory because you have to remember whether you've already seen a particular element or not.
The tl;dr is that the German Tank Problem is about making an estimate of size when you have imperfect information, and HyperLogLog gives you an estimate when you have perfect information, but it's too expensive to make an exact calculation.
I suppose this could be broken by injecting in a unique visitor id that would hash to something with an absurd amount of zeroes? That's assuming that the user has control over their user id and that I'm understanding the algorithm correctly.