(Fantastic post idea OP. One of the best I've ever seen :D) Related to bloom fil...

lorenzhs · on July 22, 2022

If you enjoyed XOR filters, you might also like ribbon filters, something that I had the pleasure of working on last year. They share the basic idea of using a system of linear equations, but instead of considering 3 random positions per key, the positions to probe are narrowly concentrated along a ribbon with a typical width of 64. This makes them far more cache-efficient to construct and query.

By purposefully overloading the data structure by a few per cent and bumping those items that cannot be inserted as a result of this overloading to the next layer (making this a recursive data structure), we can achieve almost arbitrarily small space overheads: <1% is no problem for the fast configurations, and <0.1% overhead with around 50% extra runtime cost. This compares to around 10% for XOR filters and ≥ 44% for Bloom filters.

In fact, I'm going to present them at a conference on Monday - the paper is already out: https://drops.dagstuhl.de/opus/volltexte/2022/16538/pdf/LIPI... and the implementation is at https://github.com/lorenzhs/BuRR/. I hope this isn't too much self-promotion for HN, but I'm super hyped about this :)

loxias · on July 23, 2022

This made my day, thank you so much. :) Your explanation makes sense and it sounds brilliant/clever yet obvious. (like many good ideas are)

I'm reading the paper and looking at your github now, and look forward to "github/academia" stalking you in the future. Skimming your list of repositories and seeing a lot of stuff I understand and could possibly use. ;-)

(I find it to be a useful strategy to, when I find a clever concept in a paper, or in code on github, then look at all the other things done by the same person, or the works of people they co-author with. "collaborative filtering" for ideas.)

zbobet2012 · on July 22, 2022

Y-fast tries are some of my favorites. I think they are heavily under utilized in modern terms. They sat by the the wayside for a long term because datasets where relatively small, at the time they where created ram didn't exist, and bitwise operations where inefficient along with many other constant factors.

Today; however, a lot of people have datasets on the order of 2^16 or 2^32 keys they need to maintain. And efficient bitwise operations (on upto 512 bits) are the norm. Y-fast tries are faster than b-trees or other datastructures. Also because they divide _space_ not the _data_ they are very amenable to multi-threaded and distributed algorithms. For example the hash-tables in a y-fast trie can actually be a rendezvous hash pointing to a database node. Once in that node you can hash on cores again to get to a local process for example.

loxias · on July 22, 2022

I want to hear more, esp about the distributed applications, do you have any useful links, or can I buy you an "e coffee" to pick your brain for a few minutes?

FridgeSeal · on July 22, 2022

> Also because they divide _space_ not the _data_ they are very amenable to multi-threaded and distributed algorithms.

Ha, I was thinking about this idea the other day and couldn’t figure out a good way to search for anything already written on the topic. I suspect there’s quite a lot of ground to be gained in the area.

tresil · on July 22, 2022

Big fan of HLL

Apache foundation has a fantastic DataSketches library that includes HLL and many other powerful data analytics algorithms: https://datasketches.apache.org/

Lee Rhodes has done an excellent introduction to this library - explaining some of the use cases, advantages, and things to be aware of when using these techniques: https://www.youtube.com/watch?v=nO9pauS-mGQ

twic · on July 22, 2022

On sketches, there is a genre of structure for estimating histogram-like statistics (median, 99th centile, etc) in fixed space, which i really like. Two examples:

t-digest https://github.com/tdunning/t-digest

DDSketch https://github.com/DataDog/sketches-java

arunk-s · on July 22, 2022

In the same spirit there is also Circllhist from Circonus. https://arxiv.org/abs/2001.06561

There was a good discussion on approaches to store histogram here: https://news.ycombinator.com/item?id=31379383

ranman · on July 24, 2022

t-digest is used by AWS X-ray to render their charts

champtar · on July 22, 2022

A fantastic thing about HyperLogLog is that it can be merged, so you can split your data between multiple server, precompute HLL for all IPs every minute, and then ask "how many unique IPs was there yesterday".

Discovered HLL because it's used in ClickHouse, which employ a ton of cool but obscure data structure.

Xorlev · on July 22, 2022

Works well in analytics cubes since they can be combined.

You can retain them across time too, such that you can ask questions like "how many unique users were there over the last N days?" without needing the source data. Great for privacy-aware analytics solutions.

tarun_anand · on July 22, 2022

Love DataSketches but I was wondering if there is a way to compute datasketches across time for e.g. I want to compute the users who did X and then Y in that order. Since intersection is commutative it doesnt give an answer for time ordering.

Nonetheless the best data structure I have read over last 10 years.

hendler · on July 22, 2022

I like https://en.wikipedia.org/wiki/Cuckoo_filter which allows for delete, unlike Bloom

bogomipz · on July 22, 2022

I forgot about Aerospike. They basically built a NAND optimized key, value store right? I remember reading about how they used the FTL and thinking they were pretty clever. I cant for the life of me find the article now. I think they were really big in the ad tech space? Is that still the case?

loxias · on July 22, 2022

"NAND optimized key value store" doesn't do it justice ;-) The fact that it's SSD optimized has nothing to do with key sharding across trees, the latter is what gets you the absurdly low latency and near infinite scale out. This link gives an overview: https://vldb.org/pvldb/vol9/p1389-srinivasan.pdf And it's open source...

bglazer · on July 22, 2022

I think what you’re describing in the last example is the MinHash algorithm

edit: actually I’m not sure, sounds related though

blueblob · on July 22, 2022

Sounds a bit like radix sort?

loxias · on July 22, 2022

AIUI Radix sort "kinda ish" related, yeah :) They both involve looking at your data at a 90 degree angle and slicing on bitplanes.