Open addressing or 'open bucket' hash tables always seem to be a second consideration in textbooks or blogs. I find them generally faster, more memory efficient, and easier to implement. Instead we're always taught the closed bucket approach, often using linked-list buckets of all things.
In certain scenarios open addressing has worse performance, but with the right optimisations it's generally only contrived situations where it performs worse. In normal usage, open addressing has vastly superior locality, a smaller memory footprint, and less dereferencing overhead.
A lot of writing and textbooks describing open addressing seem to imply that the order of walking through the table when collisions occur is really important (and it is). But the approach with best performance in practice is to just do a dumb increment through the table in-order because of caching... the penalty for that simple approach can be made up for with a good, pseudo-random hash.
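To be concrete, the probe loop for that dumb increment is tiny. A minimal Rust sketch, assuming a power-of-two capacity, a precomputed 64-bit hash, and a table that never gets completely full (the slot layout is just for illustration):

    // Linear probing lookup: start at hash & mask, then scan forward with
    // wraparound until we find the key or hit an empty slot.
    fn lookup(slots: &[Option<(u64, u32)>], hash: u64, key: u64) -> Option<u32> {
        let mask = slots.len() - 1;                     // capacity is a power of two
        let mut i = (hash as usize) & mask;
        loop {
            match slots[i] {
                None => return None,                        // empty slot: key is absent
                Some((k, v)) if k == key => return Some(v), // found it
                _ => i = (i + 1) & mask,                    // collision: try the next slot
            }
        }
    }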
Inspired by Lua, I did the same for upb (https://github.com/protocolbuffers/upb). I recently benchmarked upb's table vs SwissTable for a string-keyed table and found upb was faster in both insert and lookup (in insert upb is beating SwissTable by 2x).
It's true that the links do create more memory overhead though.
Textbooks, and especially undergraduate textbooks, tend to favor conceptual simplicity over performance and ease of implementation. They are supposed to teach you the key ideas behind the techniques rather than the complicated details of how things work in real life.
The fundamental idea of a hash table is that you turn the key into a hash value and then store the key in the bucket corresponding to the hash value. The simplest way of dealing with hash collisions is assuming that each bucket is a concrete data structure capable of storing multiple keys. With this conceptual model, you already understand the essence of how hash tables work and when it's the appropriate data structure to use.
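In code, that conceptual model is only a few lines. A Rust sketch (the Vec-of-Vec buckets, String keys, and modulo indexing are all just for illustration):

    // Chained ("closed addressing") table: every bucket is itself a list,
    // so a collision simply appends another entry to the same bucket.
    struct ChainedMap {
        buckets: Vec<Vec<(String, u32)>>,
    }

    impl ChainedMap {
        fn insert(&mut self, hash: u64, key: String, val: u32) {
            let b = hash as usize % self.buckets.len();
            self.buckets[b].push((key, val));
        }

        fn get(&self, hash: u64, key: &str) -> Option<u32> {
            let b = hash as usize % self.buckets.len();
            // only the one bucket the hash points at needs to be scanned
            self.buckets[b].iter().find(|(k, _)| k.as_str() == key).map(|(_, v)| *v)
        }
    }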
Open addressing is a more advanced topic. Undergraduate data structure classes usually mention it, but many students don't have the prerequisites (such as probability and memory hierarchies) to understand it or to make informed choices about it. After all, basic data structures is an early class, typically taught after introductory programming but before the core CS classes that rely on it.
> But the approach with best performance in practice is to just do a dumb increment through the table in-order because of caching... the penalty for that simple approach can be made up for with a good, pseudo-random hash.
SwissTable, which is generally among the fastest hash tables around, doesn't do this. It probes in chunks of 16 entries, then starts skipping chunks (so first 0 chunks are skipped, then 1, then 2, etc.). The chunk-skipping behavior helps avoid O(n) behavior in pathological cases.
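If I understand the layout right, the probe order over those 16-entry chunks is just a triangular sequence. A rough sketch of the idea (not Abseil's actual code; the power-of-two group count is an assumption):

    // The i-th probe looks at group (h1 + 0 + 1 + ... + i) mod num_groups,
    // i.e. skip 0 groups, then 1, then 2, and so on. With a power-of-two
    // group count this sequence still visits every group exactly once.
    fn group_to_probe(h1: usize, i: usize, num_groups: usize) -> usize {
        (h1 + i * (i + 1) / 2) % num_groups
    }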
The O(n) behavior is not even really that 'pathological'; lots of normal cases get that behavior. If SwissTable operates as you say, then it sounds like it is prone to bad performance on lots of normal data with a dumb hash, e.g. used as a "Set" of numbers where the hash is the identity function (a valid hash for a closed bucket hash table) and many of the numbers are next to each other.
Therefore, much like the situation I described in my parent post, SwissTable would benefit from being used with a decent pseudo-random hash. Which is no trouble, there are a number of good, fast ones.
Yes, Swisstables explicitly require you to use a decent hash. If you insist on using a bad hash, you get terrible performance, so don't. Google internally (I didn't check Abseil) has diagnostics that flag that the problem with your Swisstable is that you used a bad hash, which saves engineers from wasting their time "investigating" this non-problem.
Rust's HashMap (which is a Swisstable reimplementation) uses SipHash-1-3 by default, but you can drop in anything you want, with the knowledge that if your hash is bad your HashMap will have lousy performance.
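For example, here's a drop-in (and deliberately simple) FNV-1a hasher via BuildHasherDefault; this is a sketch of the mechanism, not a recommendation of FNV in particular:

    use std::collections::HashMap;
    use std::hash::{BuildHasherDefault, Hasher};

    // A minimal FNV-1a hasher standing in for "anything you want".
    struct Fnv1a(u64);

    impl Default for Fnv1a {
        fn default() -> Self {
            Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
        }
    }

    impl Hasher for Fnv1a {
        fn finish(&self) -> u64 {
            self.0
        }
        fn write(&mut self, bytes: &[u8]) {
            for &b in bytes {
                self.0 ^= b as u64;
                self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
            }
        }
    }

    // Same HashMap API, just a different BuildHasher type parameter.
    type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

    fn main() {
        let mut m: FastMap<&str, u32> = FastMap::default();
        m.insert("answer", 42);
        assert_eq!(m.get("answer"), Some(&42));
    }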
Swiss tables are part of Abseil and Abseil provides its own hashing library. I haven't looked into the details of the implementation, but by default it's not using the identity function for ints. These are the first few hashes it generates for int inputs:
Academic computer science tends to model a 'computer' as a simplified pointer-machine with O(1) memory access everywhere.
In reality computers actually have a memory hierarchy and no memory access is ever O(1). In RAM it only looks like constant access, because the memory address search is hardware optimized.
Many programming languages are heavily inspired by this academic pointer chasing, and so their performance is pretty terrible, no matter how clever the type systems or compiler optimizations you add. It's all crap once you scale to a decent size of data and beyond one machine. The only things that perform well and transparently scale beyond one CPU are 1. array languages and 2. SQL databases. Both of these actually understand things such as linear memory access, data parallelism, and the fact that CPUs have caches and that 'memory' is a hierarchy of cache levels, RAM, disk, etc.
Academic computer science uses many different models for analyzing computational complexity. Most of them are not taught at undergraduate level, because students generally expect to graduate without spending 10+ years at college. Computer science is a wide and diverse field, and it's impossible to teach everything every CS graduate should know in any reasonable time.
The random access model is usually taught first, because it works reasonably well in most situations. A slight variation of it, the cell-probe model, tends to be more useful for proving lower bounds. In that model, all computation is free and only memory accesses count.
A more practical variant of the cell-probe model is often called the external memory model. The computer is assumed to have M words of fast memory and an unlimited amount of slow memory that is accessed in blocks of B words. Computation is still free, and parameters M and B are both known to the algorithm. This model is useful in situations where accesses to one level of memory hierarchy dominate the running time.
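For intuition, the classic bounds in this model (in LaTeX notation, quoted from memory, so treat them as a sketch) are:

    \mathrm{scan}(N) = \Theta(N/B), \qquad
    \mathrm{search}(N) = \Theta(\log_B N), \qquad
    \mathrm{sort}(N) = \Theta\!\big((N/B)\,\log_{M/B}(N/B)\big)

which is essentially why B-trees and multi-way external merge sort are shaped the way they are.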
The cache-oblivious model can be useful in situations where many levels of memory hierarchy matter. It can be understood as the external memory model with unknown values of M and B.
Another useful technique is parameterizing the complexity in terms of key operations. If operation X takes t_X time, we can then say that an algorithm takes O(|P| t_X log n) time with pattern P and problem size n.
Isn't that technically correct? The access time is bounded by a constant, hence the 1 -- unless there's a tape in your memory hierarchy. (That you always strive for a better constant is of course another matter.)
It can't be O(1), because just to search for a memory location you need at least a binary search. Since memory lookup is implemented in hardware, it is treated as constant access. Of course this is all obscured by the magnitudes of the memory hierarchy. For real systems such as databases, it is a better idea to treat random memory access as O(log n), which then justifies doing batch lookups and linear scans. All of this O(1) RAM access makes people confused, and they never start thinking beyond the simple undergrad abstraction of a single CPU with a single layer of RAM.
Open addressing = when there is a collision, that is when the memory slot is already taken, you just put your key in the next slot (linear probing), or at a squared increment (quadratic probing), or at an increment determined by another hash function (double hashing).
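The three probe orders differ only in how the i-th candidate slot is computed. A quick sketch (the function names and the capacity assumptions are mine):

    // h1 and h2 are two independent hashes of the key; cap is the table size.
    fn linear(h1: usize, i: usize, cap: usize) -> usize {
        (h1 + i) % cap
    }
    fn quadratic(h1: usize, i: usize, cap: usize) -> usize {
        (h1 + i * i) % cap
    }
    fn double_hash(h1: usize, h2: usize, i: usize, cap: usize) -> usize {
        // for a power-of-two cap, forcing the stride odd keeps it coprime
        // with the table size, so every slot stays reachable
        (h1 + i * (h2 | 1)) % cap
    }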
However, all those techniques seem to suffer from clustering, with the more elaborate ones trying to trade memory locality for less clustering.
I guess it just depends on how long and at what scale you are going to need the hash table?
The clustering can still happen with any walk. The dumb walk is the best, but it is the most prone to clustering if the hash is simplistic. But if the hash is pseudo-random, it's as likely to get clusters as any other walk, and has better locality.
A direct addressing scheme with fixed, cache-line-sized and -aligned buckets [typically 8 machine words/entries] incurs precisely 1 cache line access. I think the strawman here is the "linked-list" bucket. If we level the playing field -- a fixed-size pre-allocation of space, which open addressing certainly requires -- then we can dispense with a dynamically sized bucket [and the linked list].
So then the issue is space efficiency of direct addressing schemes. Here n-choice hashing (2 is optimal, actually) provides excellent loading at the cost of (precisely) 2 [n] cache line accesses. Your linear probing for an empty slot is likely to incur that on average, if not worse.
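A two-choice insert is literally: hash twice, peek at both candidate buckets, put the entry in the emptier one. Sketch (the Vec buckets and u64 keys are placeholders):

    // "Power of two choices": reads check exactly two buckets; writes go to
    // whichever of the two candidate buckets currently holds fewer entries.
    fn insert_two_choice(buckets: &mut [Vec<u64>], h1: usize, h2: usize, key: u64) {
        let n = buckets.len();
        let (a, b) = (h1 % n, h2 % n);
        let target = if buckets[a].len() <= buckets[b].len() { a } else { b };
        buckets[target].push(key);
    }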
Cuckoo hashing, imo, is a hybrid that wants the deterministic access times of direct addressing (2 locations on reads) at the cost of variable times on writes (which are no worse than pure open addressing).
Since Robin hood hashing is a thing, you can even have assurances that any given "cluster" will still have good performance.
Robin Hood hashing is to "steal from the rich" (i.e. if someone is 2 away from their ideal spot and you're 5 away from your ideal spot, you take their slot and make the 2-away entry "worse").
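In code the insert looks roughly like this (a sketch over pre-hashed 64-bit keys where the stored value doubles as its own hash, just to keep it short; power-of-two capacity and a never-full table are assumed):

    // Robin Hood insertion: track how far the incoming entry is from its
    // ideal slot and steal the slot of any resident that is closer to home.
    fn rh_insert(slots: &mut [Option<u64>], mut key: u64) {
        let cap = slots.len();
        let mut i = (key as usize) & (cap - 1);     // the incoming key's ideal slot
        let mut dist = 0;                           // how far it has been pushed so far
        loop {
            match slots[i] {
                None => {
                    slots[i] = Some(key);           // free slot: done
                    return;
                }
                Some(resident) => {
                    let ideal = (resident as usize) & (cap - 1);
                    let resident_dist = (i + cap - ideal) & (cap - 1);
                    if resident_dist < dist {
                        // the resident is "richer" (closer to its ideal slot):
                        // take its place and keep walking with the evicted key
                        slots[i] = Some(key);
                        key = resident;
                        dist = resident_dist;
                    }
                    i = (i + 1) & (cap - 1);
                    dist += 1;
                }
            }
        }
    }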
Totally agree with the assessment on the open address/linear probe ones - easy to implement, compact, lower cache pressure.
As for why it's the 2nd consideration - it's not fancy, too easy to implement, no novelty, etc.
The way I have dealt with the main downside of linear-probe, open-address tables - lower bits that are not well spread out - is smearing/rehashing the keys' hashes (via murmur3). As a side benefit it effectively randomizes the iteration order.
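The smearing step is just a few xorshift/multiply rounds. Something in the spirit of murmur3's 64-bit finalizer (the constants below are the fmix64 constants, quoted from memory; the function name is mine):

    // A murmur3-style finalizer (fmix64): mixes the high bits down into the
    // low bits, which are the bits the table's mask actually uses.
    fn smear(mut h: u64) -> u64 {
        h ^= h >> 33;
        h = h.wrapping_mul(0xff51afd7ed558ccd);
        h ^= h >> 33;
        h = h.wrapping_mul(0xc4ceb9fe1a85ec53);
        h ^= h >> 33;
        h
    }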
> Open addressing or 'open bucket' hash tables always seem to be a second consideration in textbooks or blogs.
I find that weird since my introductory textbook went straight for an open addressing, linear probing hash table -- probably because this approach is ancient. I'm not even sure a different approach was mentioned (but of course in that course, it was only one topic out of many to grasp).
It requires not only a hashing function but also a comparison function. Also, it's not beneficial at smaller sizes, as 'compare' might fully dereference the keys (esp. if they are composite), and its log2 N is not that much lower than just N.
So dynamically switching between a linked list and an rb tree (or a trie) is quite practical and done in the real world, e.g. in Java's HashMap.
I’m missing something… doesn’t open addressing fail catastrophically the moment you try to insert more items than the table has buckets (with very bad degradation in performance as you approach this saturation point)?
Is the idea that you only use open addressing in cases where you know you’ll place much fewer items into the table than the number of bins?
When the load factor of an open-addressed hashtable gets too high, you allocate a larger set of buckets and rehash the table’s existing contents into them. You’ll have to do this with a chained hashtable anyway, as performance will start to approach O(n) once the table is overloaded.
This does mean that insertion time goes from a true O(1) to an amortized O(1), but that’s usually not an issue. (You can do the rehashing gradually if you really have to, but it’s probably not worth the complexity.)
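In sketch form (open-addressed with linear probing; the stored 64-bit hash stands in for the key just to keep it short):

    // Grow-and-rehash: allocate a bigger slot array and reinsert every live
    // entry. Individual inserts occasionally pay for a full pass like this,
    // but doubling keeps the amortized cost per insert O(1).
    fn grow(old: Vec<Option<(u64, u64)>>) -> Vec<Option<(u64, u64)>> {
        let new_cap = (old.len() * 2).max(8);          // doubling keeps it a power of two
        let mut slots = vec![None; new_cap];
        for entry in old.into_iter().flatten() {
            let (hash, _val) = entry;
            let mut i = (hash as usize) & (new_cap - 1);
            while slots[i].is_some() {
                i = (i + 1) & (new_cap - 1);           // linear probe for a free slot
            }
            slots[i] = Some(entry);
        }
        slots
    }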