A High Throughput B+tree for SIMD Architectures (2020) [pdf] (lsu.edu)
133 points by mfiguiere on Sept 17, 2023 | 9 comments



I once needed that thing (with 32-bit integer keys, and either FP32 or FP64 values), but I didn’t want to spend too much time designing novel data structures.

I have copy-pasted B+ tree from TLX https://github.com/tlx/tlx (Boost license), and then I did some local changes to improve the performance for my specific keys and value types. I have used AVX2 intrinsics to improve search within nodes, and I also implemented a few more simple tricks. After these changes, the performance became satisfactory for my application.


I've gone through a similar exercise with other trees, but I learned that much of the improvement I attributed to AVX was actually due to exploiting the idea that non-blocking operations on data already in registers are essentially free. The nice thing about AVX was that it forced me to write the code that way, but the bulk of the improvement came from writing the code in an unintuitive, but machine-optimal way.


I found this, too. Spent a bunch of time optimizing with SIMD intrinsics only to find that the real win was the array-oriented style I'd arrived at along the way.

Also, a lot of assumptions from the past start to erode with the memory architecture and cache performance we have today. Problems where one would previously reach for binary search or a hash table end up being outperformed by a linear scan, as long as the set of data fits in roughly a cache line, etc.

It's possible that in this day and age we'd all just be better off writing actually performance critical stuff in an APL offshoot, with the language forcing one to steer away from branching and dispatch whenever possible and deal with data in declared, bulk, array/vector-type operations.


In my experience, linear search can outperform even at larger sizes. This looks like a pretty robust analysis comparing linear and binary search:

https://dirtyhandscoding.github.io/posts/performance-compari...


In my experience, using linear search makes it easy to accidentally introduce O(n^2) time complexity into a function, which can really come back to bite you.

The linked benchmark looks like the best case for linear search: searching a packed array of ints. Even then, the results show that after just 256 elements binary search starts to win out (comparing the optimised implementations of both). By 1024 elements, the difference between O(log N) and O(N) complexity clearly starts to dominate the results.

For more complex comparison functions, and pointer indirections (e.g. searching arrays of heap allocated objects), I’d assume the lower time complexity of binary search would dominate results even more.


The issue is the bar keeps moving on where it makes sense to switch from one approach to the other.

I recently implemented the Adaptive Radix Tree from https://db.in.tum.de/~leis/papers/ART.pdf in Rust. It defines explicit node types with different trade-offs between search and memory complexity in order to improve overall performance depending on prefix sparsity, etc.: nodes with 4, 16, 48, and 256 children. The keys in the 16- and 48-child nodes are kept sorted (the 256-child node is implicitly sorted because it's just a byte->child array) so binary search can be used (with SIMD optimizations).

But in reality, my benchmarking showed that a linear scan outperforms this fairly consistently, and the cost of keeping the keys sorted isn't justified (except that it makes ordered iteration easier, so that's the trade-off) -- though evidently it was justified at the time the paper was written (2013).

I'm sure there's plenty of examples like this, and I'm sure we're likely using datastructures all over the place that were built under the assumption that linear scan was expensive, and optimizations were added to hash or sort data and those optimizations may actually be hindering rather than helping.


That was smart. I once did a similar thing, but I built it from scratch. A good learning experience, but man, it was a lot of work. I changed from SIMD to branchless though: branchless binary search inside a node, plus branchless search through the tree (with a fixed max height). The only branch was to test whether the item was found at the end, and that's usually predictable, depending on the workload. I wanted to combine multiple searches over multiple trees using SIMD to batch the execution, but I never got around to that.


This paper, Harmonia, is three years old.

Code is here: https://github.com/JustKshitijD/Harmonia_for_B_plus_trees


It's been quite a while since I saw a file named a.out!



