That depends on the implementation. If it's a simple binary tree, the worst case is O(n), not O(log n): if you insert data in sorted order, every element ends up in the same branch and the tree degenerates into a linked list. O(log n) is the average lookup time.
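To make the degenerate case concrete, here's a minimal sketch of a naive, unbalanced binary search tree in Rust (a toy illustration, not any particular library): with ascending keys, every insert goes into the right branch, so the depth, and therefore the lookup cost, grows linearly with n.

    struct Node {
        key: i64,
        left: Option<Box<Node>>,
        right: Option<Box<Node>>,
    }

    fn insert(node: &mut Option<Box<Node>>, key: i64) {
        match node {
            None => *node = Some(Box::new(Node { key, left: None, right: None })),
            Some(n) => {
                if key < n.key {
                    insert(&mut n.left, key)
                } else {
                    insert(&mut n.right, key)
                }
            }
        }
    }

    fn depth(node: &Option<Box<Node>>) -> usize {
        match node {
            None => 0,
            Some(n) => 1 + depth(&n.left).max(depth(&n.right)),
        }
    }

    fn main() {
        let mut root = None;
        for key in 0..1_000 {
            insert(&mut root, key); // sorted input: each key is larger, so it always goes right
        }
        // Prints 1000: the "tree" is a chain, so lookups are O(n), not O(log n).
        println!("depth after 1000 sorted inserts: {}", depth(&root));
    }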
If you have a specialized tree structure like a red-black tree which will automatically balance, this will improve, but sorted input is still a viable attack method.
That said, a lot depends on the application, and one should pick a hash function wisely. A map backed by a binary tree works best when you want to be able to iterate over keys in sort order, or to find the entry just before or just after a non-existent key in sort order; it's not really a good defense against hash attacks. On the flip side, the default hasher in Rust is hardened against hash attacks, but it's overkill for most applications and can cause severe performance degradation (although at least the library documentation is clear on this and points at alternative hash implementations for such cases). Unfortunately, the Rust resizing algorithm is flawed and won't resize until the hash capacity is completely full,¹ making hash collisions much more likely. I saw a doubling of performance when benchmarking some code I was writing just by increasing the initial capacity of the map.
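To make those two Rust knobs concrete, a small hedged sketch: swapping the hardened default hasher for a faster one and pre-sizing the map. The fnv crate is an assumed third-party dependency (fxhash or ahash would work the same way); only the standard-library calls shown are guaranteed behaviour.

    use std::collections::HashMap;

    use fnv::FnvBuildHasher; // assumed dependency: fnv = "1"

    fn main() {
        // Default hasher (RandomState, currently SipHash-1-3): hardened
        // against collision attacks, but slower for small keys.
        let mut hardened: HashMap<String, u32> = HashMap::with_capacity(10_000);
        hardened.insert("example".to_string(), 1);

        // Non-hardened, faster hasher for trusted keys, plus a generous
        // initial capacity so the table stays sparse and resizes rarely.
        let mut fast: HashMap<String, u32, FnvBuildHasher> =
            HashMap::with_capacity_and_hasher(10_000, FnvBuildHasher::default());
        fast.insert("example".to_string(), 1);

        assert_eq!(hardened.get("example"), fast.get("example"));
    }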
⸻⸻⸻
1. A common mistake I see Java developers make, when they know they'll be inserting n items into a HashSet or HashMap, is to initialize the structure with a capacity of n. This guarantees that Java will end up re-allocating the index, since the default load factor in Java is 0.75, meaning that once you have 0.75×capacity elements in your collection it will resize.
> inserting n items into a HashSet or HashMap is to initialize the structure with a capacity of n.
That's actually a bug in HashMap (either the implementation or the spec). If you ask a structure to reserve space for n elements, it should reserve enough space to actually store n elements (so n/0.75 in this case). If the method actually does mean to set the internal, implementation-detail capacity rather than the real capacity, then the specification is buggy, because that's largely useless unless you're relying on implementation details (as demonstrated).
"When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets."
and it explicitly tells you that HashMap defaults to a 0.75 load factor, so the documentation is telling you that HashMap(400) is not a suitable container for 400 items, only for up to 300 items.
Now, it's true that e.g. Rust's HashMap::with_capacity(400) actually is a container which promises it's suitable for 400 items,‡ and that's more ergonomic, but I don't think we can call Java's choice here a mistake; it's just harder to use correctly.
‡"The hash map will be able to hold at least capacity elements without reallocating".
Based on some benchmarking details¹ for my application, it appears that Rust is willing to completely fill its hashmap. It may be that my use case is somewhat unusual in that I'm expecting most requests to not return a value,² but I found that just specifying a larger initial capacity dramatically sped up the application (I got a bigger speed boost from that than I did from switching from SipHash to FNV).
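In case anyone wants to reproduce that kind of comparison, here's a deliberately crude sketch of a miss-heavy micro-benchmark (the fnv crate and the specific sizes are assumptions, and this is nowhere near a rigorous benchmark):

    use std::collections::HashMap;
    use std::hash::BuildHasher;
    use std::time::{Duration, Instant};

    // Insert 10k keys, then time 1M lookups of mostly-missing keys.
    fn time_lookups<S: BuildHasher>(mut map: HashMap<u64, u64, S>) -> Duration {
        for k in 0..10_000u64 {
            map.insert(k, k);
        }
        let start = Instant::now();
        let mut hits = 0u64;
        for k in 0..1_000_000u64 {
            if map.contains_key(&k.wrapping_mul(2_654_435_761)) {
                hits += 1;
            }
        }
        std::hint::black_box(hits); // keep the loop from being optimized away
        start.elapsed()
    }

    fn main() {
        let default_hasher = HashMap::with_capacity(10_000);
        let fnv_oversized =
            HashMap::with_capacity_and_hasher(40_000, fnv::FnvBuildHasher::default());
        println!("SipHash, capacity 10k: {:?}", time_lookups(default_hasher));
        println!("FNV, capacity 40k:     {:?}", time_lookups(fnv_oversized));
    }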
The HashMap is "completely filled" in the sense that you get to put N items in a HashMap with capacity N, because that's specifically how Rust's HashMap chooses to define capacity as I explained above, so it's a tautological explanation.
The "real" capacity and load factor of the underlying data structure is not presented to you. Swisstables (the current implementation) can't be filled entirely, they need an empty slot to function correctly. There is some trickery to ensure that very small Swisstables get to have an "empty" slot that doesn't really exist, and thus doesn't need RAM. Since you can't ever write to it the non-existence doesn't matter. A Rust HashMap::with_capacity(0) is guaranteed not to allocate any actual heap storage, just like String::new() don't need heap to store the empty string inside it.
It depends on how you're using the hash lookup though. In my case, most lookups would be expected to not return a value, so completely filling the small hashmap will cause retrievals to be slower since it has to keep going until it determines that there's no match.
Yes, of course. There are many use cases. Yet, in practical terms, most (90%) of lookups do return a value. For 'small' N, you can consider everything O(1).
If you do know the use case, e.g. that most lookups fail, you can pick a proper implementation and tune it.
Then again, very few cycles are spent on small hash table lookups in general, yet a lot (really, a lot) of memory is wasted on small hash tables.
If you care about performance: measure, measure, measure... and know what you measure.
>A common mistake I see Java developers make is, when they know they'll be inserting n items into a HashSet or HashMap is to initialize the structure with a capacity of n.
The common mistake is using the constructor that pre-allocates for N elements; that constructor does way more harm than good. Most HashMaps in Java hold only a few elements yet carry tons of empty array space, and many of them are pre-allocated to quite large sizes too. (At the very least, nowadays the no-argument constructor doesn't immediately allocate the Node[] array, so empty HashMaps are OK.)
One of the strengths of HashMap is that resize/rehash is extremely fast/efficient and it doesn't dereference the keys (i.e. no hashCode/equals called). It doesn't reallocate Nodes, etc.
I said 'sorted', but I meant balanced as well. And even if you can still 'attack' it, there is a limit to how slow it can get: worst case O(log n) is not bad. As for security, I can't speak to its efficacy, but I know it's appropriate for real-time applications.
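As an aside on the ordered-map option being discussed: Rust's std BTreeMap is a B-tree rather than a binary tree, but it illustrates the same trade-off, i.e. balanced regardless of insertion order (so worst-case lookups stay O(log n)), plus the ordered queries mentioned earlier. A small sketch:

    use std::collections::BTreeMap;

    fn main() {
        let mut map = BTreeMap::new();
        for k in (0..1_000).map(|k| k * 2) {
            map.insert(k, k); // even keys only, inserted in sorted order
        }

        // Iterate the smallest keys in sort order.
        for (k, v) in map.range(..6) {
            println!("{k} -> {v}");
        }

        // First entry at or after a key that doesn't exist (5 is missing).
        println!("{:?}", map.range(5..).next()); // Some((6, 6))
    }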