(Fantastic post idea OP. One of the best I've ever seen :D)
Related to Bloom filters, XOR filters are faster and more memory-efficient, but immutable.
HyperLogLog is an efficient way to estimate cardinality.
Coolest thing I've learned recently was the Y-fast trie. If your keys come from a bounded integer universe M (say, the set of all 128-bit numbers), you get membership, predecessor, or successor queries in log log time, not log like in a normal tree.
see: https://www.youtube.com/playlist?list=PLUl4u3cNGP61hsJNdULdu... (6.851 Advanced Data Structures, Erik Demaine)
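The log log comes from binary-searching over prefix *lengths*, with one hash table of prefixes per length. Here is a minimal sketch of just that part (the x-fast-trie half; a real y-fast trie hangs small balanced trees off it to bring space back down to O(n)). The 128-bit width and the function names are mine, purely for illustration:

```python
W = 128  # key width in bits; the universe is the 2**128 possible keys

def build_levels(keys: set[int]) -> list[set[int]]:
    """levels[l] holds the l-bit prefixes of every key currently in the set."""
    return [{k >> (W - l) for k in keys} for l in range(W + 1)]

def longest_present_prefix(x: int, levels: list[set[int]]) -> int:
    """Length of the longest prefix of x shared with some stored key.
    Binary search over prefix lengths: O(log W) = O(log log universe) probes,
    each a single O(1) hash-set lookup. Predecessor/successor queries follow
    from this plus extra pointers kept alongside the prefixes."""
    lo, hi = 0, W
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if (x >> (W - mid)) in levels[mid]:
            lo = mid          # some stored key shares the first mid bits with x
        else:
            hi = mid - 1
    return lo
```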
Would love to learn more "stupidly fast at the cost of conceptual complexity" things.
edit:
(adding)
I don't know a name for it, because it's not one thing but a combination, but one can:
Use the first N bits from a very quick hash of a key from an unknown distribution (say a file path, or a variable name, or an address, or a binary blob, ...) as a way to "shard" this key across M = 2^N other fast data structures (like a tree) for search/add/remove. By changing M, you can tune the relative size of the terms in the O(1) + O(log(n/M)) running time.
Trees getting too deep for fast search? Every increase of N moves the computation from the log search of the tree to the space tradeoff of having more trees.
Added benefit is you can scale to multiple threads easily. Instead of locking the whole tree, you lock a tiny sub-tree.
Very clever. (I first saw it in the Aerospike key-value store)
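Here's a minimal sketch of that sharding idea, assuming blake2b as the "very quick hash" and sorted lists standing in for the per-shard trees; the class and parameter names are made up for illustration, not Aerospike's:

```python
import hashlib
import threading
from bisect import bisect_left

class ShardedSet:
    def __init__(self, n_bits: int = 8):
        self.n_bits = n_bits
        self.shards = [[] for _ in range(1 << n_bits)]       # sorted lists stand in for trees
        self.locks = [threading.Lock() for _ in range(1 << n_bits)]

    def _shard(self, key: bytes) -> int:
        # First N bits of a fast hash pick the shard: the O(1) term.
        h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
        return h >> (64 - self.n_bits)

    def add(self, key: bytes) -> None:
        i = self._shard(key)
        with self.locks[i]:               # lock one small shard, not the whole structure
            s = self.shards[i]
            j = bisect_left(s, key)       # binary search within ~n / 2**N keys
            if j == len(s) or s[j] != key:
                s.insert(j, key)

    def __contains__(self, key: bytes) -> bool:
        i = self._shard(key)
        with self.locks[i]:
            s = self.shards[i]
            j = bisect_left(s, key)
            return j < len(s) and s[j] == key
```

Raising n_bits by one doubles the shard count and roughly halves each shard's search depth, which is the N-vs-space dial described above; and since each shard has its own lock, threads only contend when they happen to land on the same shard.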
If you enjoyed XOR filters, you might also like ribbon filters, something that I had the pleasure of working on last year. They share the basic idea of using a system of linear equations, but instead of considering 3 random positions per key, the positions to probe are narrowly concentrated along a ribbon with a typical width of 64. This makes them far more cache-efficient to construct and query.
By purposefully overloading the data structure by a few per cent and bumping those items that cannot be inserted as a result of this overloading to the next layer (making this a recursive data structure), we can achieve almost arbitrarily small space overheads: <1% is no problem for the fast configurations, and <0.1% is possible at around 50% extra runtime cost. This compares to around 10% for XOR filters and ≥ 44% for Bloom filters.
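For a feel of what "concentrated along a ribbon" means, here is a toy query-side sketch under my own assumptions (a 1-bit fingerprint, an FNV-style hash, width 64); a real ribbon filter stores r-bit fingerprints and is built by solving the banded linear system on the fly, with the overload-and-bump layers on top, none of which is shown here:

```python
W = 64  # ribbon width: each key only ever touches a 64-slot window of the table

def key_row(key: bytes, m: int) -> tuple[int, int]:
    """Map a key to (start, coeffs): the start of its probe window in an
    m-slot bit table (m >= W), plus a 64-bit coefficient pattern inside it."""
    h = 0xCBF29CE484222325
    for b in key:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF   # FNV-1a-style mixing
    start = h % (m - W + 1)        # probes are concentrated, not spread over all m slots
    coeffs = ((h * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF) | 1  # non-zero row
    return start, coeffs

def query(table_bits: list[int], key: bytes) -> int:
    """Parity (XOR) of the table bits selected by the key's row. Construction
    solves the banded system so this equals the key's fingerprint bit for
    every inserted key; non-keys match only with small probability."""
    start, coeffs = key_row(key, len(table_bits))
    acc = 0
    for i in range(W):
        if (coeffs >> i) & 1:
            acc ^= table_bits[start + i]
    return acc
```

Because every key's window spans at most 64 consecutive slots, both construction and queries touch a couple of cache lines instead of three random ones.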
This made my day, thank you so much. :) Your explanation makes sense and it sounds brilliant/clever yet obvious. (like many good ideas are)
I'm reading the paper and looking at your github now, and look forward to "github/academia" stalking you in the future. Skimming your list of repositories and seeing a lot of stuff I understand and could possibly use. ;-)
(I find it to be a useful strategy to, when I find a clever concept in a paper, or in code on github, then look at all the other things done by the same person, or the works of people they co-author with. "collaborative filtering" for ideas.)
Y-fast tries are some of my favorites. I think they are heavily underutilized in modern systems. They sat by the wayside for a long time because datasets were relatively small, large amounts of RAM didn't exist when they were created, and bitwise operations were inefficient, along with many other constant factors.
Today, however, a lot of people have datasets on the order of 2^16 or 2^32 keys they need to maintain. And efficient bitwise operations (on up to 512 bits) are the norm. Y-fast tries are faster than B-trees and other comparison-based structures for these workloads. Also, because they divide _space_, not the _data_, they are very amenable to multi-threaded and distributed algorithms. For example, the hash tables in a y-fast trie can actually be a rendezvous hash pointing to a database node. Once on that node you can hash across cores again to get to a local process, for example.
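The rendezvous-hash part is small enough to sketch; a hedged example with made-up node names, showing how a slice of the key space gets assigned to a node without any central directory:

```python
import hashlib

def owner(key: bytes, nodes: list[str]) -> str:
    """Rendezvous (highest-random-weight) hashing: every caller independently
    computes the same winning node, so there is nothing to keep in sync, and
    removing a node only reassigns the keys that node was winning."""
    def weight(node: str) -> int:
        h = hashlib.blake2b(node.encode() + b"\x00" + key, digest_size=8)
        return int.from_bytes(h.digest(), "big")
    return max(nodes, key=weight)

# Hypothetical example: the high bits of an integer key (the slice of *space*
# a y-fast-trie bucket covers) pick the database node that owns that bucket.
nodes = ["db-a", "db-b", "db-c"]
bucket_prefix = (0x00DEADBEEF >> 16).to_bytes(4, "big")
print(owner(bucket_prefix, nodes))
```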
I want to hear more, esp about the distributed applications, do you have any useful links, or can I buy you an "e coffee" to pick your brain for a few minutes?
> Also because they divide _space_ not the _data_ they are very amenable to multi-threaded and distributed algorithms.
Ha, I was thinking about this idea the other day and couldn’t figure out a good way to search for anything already written on the topic. I suspect there’s quite a lot of ground to be gained in the area.
The Apache Foundation has a fantastic DataSketches library that includes HLL and many other powerful data analytics algorithms: https://datasketches.apache.org/
Lee Rhodes has done an excellent introduction to this library - explaining some of the use cases, advantages, and things to be aware of when using these techniques: https://www.youtube.com/watch?v=nO9pauS-mGQ
On sketches, there is a genre of structure for estimating histogram-like statistics (median, 99th percentile, etc.) in fixed space, which I really like. Two examples:
A fantastic thing about HyperLogLog is that it can be merged, so you can split your data between multiple servers, precompute an HLL for all IPs every minute, and then ask "how many unique IPs were there yesterday".
Discovered HLL because it's used in ClickHouse, which employs a ton of cool but obscure data structures.
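To make the mergeability concrete, here is a from-scratch toy HLL (deliberately not ClickHouse's or DataSketches' actual API); the merge is just an element-wise max of registers, which is why per-server, per-minute sketches can be rolled up into "yesterday" after the fact:

```python
import hashlib

P = 14                       # 2**14 registers, roughly 0.8% relative error
M = 1 << P

def new_sketch() -> list[int]:
    return [0] * M

def add(registers: list[int], item: bytes) -> None:
    h = int.from_bytes(hashlib.blake2b(item, digest_size=8).digest(), "big")
    idx = h >> (64 - P)                        # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)           # remaining bits
    rank = (64 - P) - rest.bit_length() + 1    # position of the first 1-bit
    registers[idx] = max(registers[idx], rank)

def merge(a: list[int], b: list[int]) -> list[int]:
    # The whole trick: element-wise max is associative and commutative, so
    # per-minute, per-server sketches combine over any time window and still
    # estimate distinct items, not the sum of per-sketch counts.
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers: list[int]) -> float:
    # Standard raw HLL estimator; small/large-range corrections omitted here.
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in registers)
```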
Works well in analytics cubes since they can be combined.
You can retain them across time too, such that you can ask questions like "how many unique users were there over the last N days?" without needing the source data. Great for privacy-aware analytics solutions.
Love DataSketches, but I was wondering if there is a way to compute sketches across time, e.g. I want to compute the users who did X and then Y in that order. Since intersection is commutative, it doesn't give an answer for time ordering.
Nonetheless, the best data structure I have read about in the last 10 years.
I forgot about Aerospike. They basically built a NAND-optimized key-value store, right? I remember reading about how they used the FTL and thinking they were pretty clever. I can't for the life of me find the article now. I think they were really big in the ad tech space? Is that still the case?
"NAND optimized key value store" doesn't do it justice ;-) The fact that it's SSD optimized has nothing to do with key sharding across trees, the latter is what gets you the absurdly low latency and near infinite scale out. This link gives an overview: https://vldb.org/pvldb/vol9/p1389-srinivasan.pdf And it's open source...