That depends on the implementation. If it's a simple binary tree, the worst case is O(n), not O(log n): if you insert data in sorted order, every element ends up in the same branch and the tree degenerates into a linked list. O(log n) is the average lookup time.
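To make the degenerate case concrete, here's a minimal sketch of a naive, unbalanced binary search tree in Rust (a toy illustration, not any particular library): with ascending keys, every insert goes into the right branch, so the depth, and therefore the lookup cost, grows linearly with n.

    struct Node {
        key: i64,
        left: Option<Box<Node>>,
        right: Option<Box<Node>>,
    }

    fn insert(node: &mut Option<Box<Node>>, key: i64) {
        match node {
            None => *node = Some(Box::new(Node { key, left: None, right: None })),
            Some(n) => {
                if key < n.key {
                    insert(&mut n.left, key)
                } else {
                    insert(&mut n.right, key)
                }
            }
        }
    }

    fn depth(node: &Option<Box<Node>>) -> usize {
        match node {
            None => 0,
            Some(n) => 1 + depth(&n.left).max(depth(&n.right)),
        }
    }

    fn main() {
        let mut root = None;
        for key in 0..1_000 {
            insert(&mut root, key); // sorted input: each key is larger, so it always goes right
        }
        // Prints 1000: the "tree" is a chain, so lookups are O(n), not O(log n).
        println!("depth after 1000 sorted inserts: {}", depth(&root));
    }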
If you have a specialized tree structure like a red-black tree which will automatically balance, this will improve, but sorted input is still a viable attack method.
That said, a lot depends on the application, and one should pick a hash function wisely. A map backed by a binary tree works best when you want to be able to iterate over keys in sort order, or to find the entry just before or just after a non-existent key in sort order; it's not really a good defense against hash attacks. On the flip side, the default hasher in Rust is hardened against hash attacks, but it's overkill for most applications and can cause severe performance degradation (although at least the library documentation is clear on this and points at alternative hash implementations for such cases). Unfortunately, the Rust resizing algorithm is flawed and won't resize until the hash capacity is completely full,¹ making hash collisions much more likely. I saw a doubling of performance when benchmarking some code I was writing just by increasing the initial capacity of the map.
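To make those two Rust knobs concrete, a small hedged sketch: swapping the hardened default hasher for a faster one and pre-sizing the map. The fnv crate is an assumed third-party dependency (fxhash or ahash would work the same way); only the standard-library calls shown are guaranteed behaviour.

    use std::collections::HashMap;

    use fnv::FnvBuildHasher; // assumed dependency: fnv = "1"

    fn main() {
        // Default hasher (RandomState, currently SipHash-1-3): hardened
        // against collision attacks, but slower for small keys.
        let mut hardened: HashMap<String, u32> = HashMap::with_capacity(10_000);
        hardened.insert("example".to_string(), 1);

        // Non-hardened, faster hasher for trusted keys, plus a generous
        // initial capacity so the table stays sparse and resizes rarely.
        let mut fast: HashMap<String, u32, FnvBuildHasher> =
            HashMap::with_capacity_and_hasher(10_000, FnvBuildHasher::default());
        fast.insert("example".to_string(), 1);

        assert_eq!(hardened.get("example"), fast.get("example"));
    }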
⸻⸻⸻
1. A common mistake I see Java developers make, when they know they'll be inserting n items into a HashSet or HashMap, is to initialize the structure with a capacity of n. This guarantees that Java will end up re-allocating the index, since the default load factor in Java is 0.75, meaning that once you have 0.75×capacity elements in your collection it will resize.
> inserting n items into a HashSet or HashMap is to initialize the structure with a capacity of n.
That's actually a bug in HashMap (either the implementation or the spec). If you ask a structure to reserve space for n elements, it should reserve enough space to actually store n elements (so n/0.75 in this case). If the method actually does mean to set the internal, implementation-detail capacity rather than the real capacity, then the specification is buggy, because that's largely useless unless you're relying on implementation details (as demonstrated).
"When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets."
and it explicitly tells you that HashMap defaults to a 0.75 load factor, so the documentation is telling you that HashMap(400) is not a suitable container for 400 items, only for up to 300 items.
Now, it's true that e.g. Rust's HashMap::with_capacity(400) actually is a container which promises it's suitable for 400 items,‡ and that's more ergonomic, but I don't think we can call Java's choice here a mistake; it's just harder to use correctly.
‡"The hash map will be able to hold at least capacity elements without reallocating".
Based on some benchmarking details¹ for my application, it appears that Rust is willing to completely fill its hashmap. It may be that my use case is somewhat unusual in that I'm expecting most requests to not return a value,² but I found that just specifying a larger initial capacity dramatically sped up the application (I got a bigger speed boost from that than I did from switching from SipHash to FNV).
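In case anyone wants to reproduce that kind of comparison, here's a deliberately crude sketch of a miss-heavy micro-benchmark (the fnv crate and the specific sizes are assumptions, and this is nowhere near a rigorous benchmark):

    use std::collections::HashMap;
    use std::hash::BuildHasher;
    use std::time::{Duration, Instant};

    // Insert 10k keys, then time 1M lookups of mostly-missing keys.
    fn time_lookups<S: BuildHasher>(mut map: HashMap<u64, u64, S>) -> Duration {
        for k in 0..10_000u64 {
            map.insert(k, k);
        }
        let start = Instant::now();
        let mut hits = 0u64;
        for k in 0..1_000_000u64 {
            if map.contains_key(&k.wrapping_mul(2_654_435_761)) {
                hits += 1;
            }
        }
        std::hint::black_box(hits); // keep the loop from being optimized away
        start.elapsed()
    }

    fn main() {
        let default_hasher = HashMap::with_capacity(10_000);
        let fnv_oversized =
            HashMap::with_capacity_and_hasher(40_000, fnv::FnvBuildHasher::default());
        println!("SipHash, capacity 10k: {:?}", time_lookups(default_hasher));
        println!("FNV, capacity 40k:     {:?}", time_lookups(fnv_oversized));
    }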
The HashMap is "completely filled" in the sense that you get to put N items in a HashMap with capacity N, because that's specifically how Rust's HashMap chooses to define capacity as I explained above, so it's a tautological explanation.
The "real" capacity and load factor of the underlying data structure is not presented to you. Swisstables (the current implementation) can't be filled entirely, they need an empty slot to function correctly. There is some trickery to ensure that very small Swisstables get to have an "empty" slot that doesn't really exist, and thus doesn't need RAM. Since you can't ever write to it the non-existence doesn't matter. A Rust HashMap::with_capacity(0) is guaranteed not to allocate any actual heap storage, just like String::new() don't need heap to store the empty string inside it.
It depends on how you're using the hash lookup though. In my case, most lookups would be expected to not return a value, so completely filling the small hashmap will cause retrievals to be slower since it has to keep going until it determines that there's no match.
Yes, of course. There are many use cases. Yet, in practical terms, most (90%) of lookups do return a value. For 'small' N, you can consider everything O(1).
If you do know the use case, e.g. that most lookups fail, you can pick a proper implementation and tune it.
Then again, very few cycles are spent on small hash table lookups in general, yet a lot (really, a lot) of memory is wasted on small hash tables.
If you care about performance: measure, measure, measure... and know what you measure.
>A common mistake I see Java developers make is, when they know they'll be inserting n items into a HashSet or HashMap is to initialize the structure with a capacity of n.
The common mistake is using the constructor that pre-allocates for N elements; that constructor does way more harm than good. Most HashMaps in Java hold only a few elements yet carry tons of empty array space, and many of them are pre-allocated to quite large sizes too. (At the very least, nowadays the no-argument constructor doesn't immediately allocate the Node[] array, so empty HashMaps are OK.)
One of the strengths of HashMap is that resize/rehash is extremely fast/efficient and it doesn't dereference the keys (i.e. no hashCode/equals called). It doesn't reallocate Nodes, etc.
I said 'sorted', but I meant balanced as well. And even if you can still 'attack' it, there is a limit to how slow it can get: worst case O(log n) is not bad. As for security, I can't speak to its efficacy, but I know it's appropriate for real-time applications.
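As an aside on the ordered-map option being discussed: Rust's std BTreeMap is a B-tree rather than a binary tree, but it illustrates the same trade-off, i.e. balanced regardless of insertion order (so worst-case lookups stay O(log n)), plus the ordered queries mentioned earlier. A small sketch:

    use std::collections::BTreeMap;

    fn main() {
        let mut map = BTreeMap::new();
        for k in (0..1_000).map(|k| k * 2) {
            map.insert(k, k); // even keys only, inserted in sorted order
        }

        // Iterate the smallest keys in sort order.
        for (k, v) in map.range(..6) {
            println!("{k} -> {v}");
        }

        // First entry at or after a key that doesn't exist (5 is missing).
        println!("{:?}", map.range(5..).next()); // Some((6, 6))
    }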