It is! Although my test case is probably an unrealistically bad scenario: It's t...

ammar2 · 2025-04-05T14:43:04 1743864184

Edit: Analyzed the wrong thing earlier.

This depends on the Python version, but if it has the specializing interpreter changes, the `COMPARE_OP` comparing the integers there is probably hitting a specialized `_COMPARE_OP_INT` [1].

This specialization has a ternary that does `res = (sign_ish & oparg) ? PyStackRef_True : PyStackRef_False;`. This might be the branch that ends up getting predicted correctly?

Older versions of Python go through a bunch of dynamic dispatch first and then end up with a similar sort of int comparison in `long_richcompare`. [2]

[1] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...

[2] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...
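One way to check which path a given interpreter takes is to disassemble a comparison with the standard `dis` module. The sketch below is only illustrative: `is_small` is a hypothetical stand-in for the comparison being discussed, and the `adaptive=True` flag assumes CPython 3.11 or newer. Whether a specialized instruction name shows up after warm-up depends on the interpreter version and build.

```python
import dis
import sys


def is_small(x):
    # Hypothetical stand-in for the integer comparison being discussed.
    return x < 128


# Generic (unspecialized) opcode names.
generic_ops = [ins.opname for ins in dis.get_instructions(is_small)]
print(generic_ops)

if sys.version_info >= (3, 11):
    # Warm the function up so the specializing interpreter has a chance
    # to rewrite the bytecode in place.
    for i in range(10_000):
        is_small(i)
    # adaptive=True (3.11+) reports specialized instruction names, if any.
    adaptive_ops = [
        ins.opname for ins in dis.get_instructions(is_small, adaptive=True)
    ]
    print(adaptive_ops)
```

If the second list contains a different name in place of `COMPARE_OP`, the comparison was specialized on that build.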

dzaima · 2025-04-05T14:52:00 1743864720

This isn't actually timing the sorting, but just the (dumb) function f.

ammar2 · 2025-04-05T15:07:17 1743865637

Oh whoops, that's right. I totally missed that.

LPisGood · 2025-04-05T16:40:29 1743871229

This is a really good example. It is more like branch prediction than standard data/instruction caching.

I wonder if you could do Spectre-type vulnerabilities in Python. You would need some way to leak microarchitectural state, so without being particularly clever, maybe Python code could only be used as a gadget or something.

cma · 2025-04-05T15:10:26 1743865826

The Python speedup is probably from small integer caching: a sorted array will have runs of adjacent pointers to the same integer objects. The compiled-language one is probably branch prediction, right?

exyi · 2025-04-05T16:37:47 1743871067

I intentionally stayed in the small integer range to avoid benchmarking the cache. 256 distinct values should fit into L1 just fine in both cases.

I'm now thinking that the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. The idea is that it gets stuck on a memory access, which forces it to speculate much further, which in turn makes it backtrack over a longer path if a branch was mispredicted. I'm obviously no expert on this, so feel free to correct me.

The results for 1B range instead of 255 are 17.6 ms for unsorted / 68.2 ms for sorted! We are back to what the original article observed and it's a way stronger effect than what branch prediction can offer. So don't sort your arrays, keep them in the order the boxed values were allocated ;)
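A rough way to reproduce this kind of comparison is sketched below. The original `f` is not shown in the thread, so `count_below` is a hypothetical stand-in that does one integer comparison per element, and the list size, seed, and threshold are arbitrary assumptions. Timings from a single run like this are noisy and will not match the numbers quoted above.

```python
import random
import time


def count_below(values, threshold):
    # Hypothetical stand-in for the thread's f: one int comparison per element.
    count = 0
    for v in values:
        if v < threshold:
            count += 1
    return count


def time_once(values, threshold):
    start = time.perf_counter()
    result = count_below(values, threshold)
    return result, time.perf_counter() - start


def compare_orders(upper, n=200_000, seed=0):
    rng = random.Random(seed)
    data = [rng.randrange(upper) for _ in range(n)]
    threshold = upper // 2
    unsorted_count, unsorted_s = time_once(data, threshold)
    # sorted() reuses the same int objects; only the list order changes.
    sorted_count, sorted_s = time_once(sorted(data), threshold)
    return unsorted_count, sorted_count, unsorted_s, sorted_s


for upper in (256, 1_000_000_000):
    u_count, s_count, u_s, s_s = compare_orders(upper)
    print(f"range {upper}: unsorted {u_s * 1e3:.1f} ms, sorted {s_s * 1e3:.1f} ms")
```

For anything beyond a quick look, repeating each measurement several times and taking the minimum would give steadier numbers.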

cma · 2025-04-05T17:02:53 1743872573

How big is the pointed-to small integer object? With alignment etc., I'm seeing some stuff saying 256 of them would fill an 8KB L1. Plus, other interpreter data might overfill it. Sorted, that would be less of an issue.

The larger-range one being slower when sorted makes sense, yes, because the allocation order no longer matches the iteration order.

exyi · 2025-04-05T18:07:42 1743876462

I don't know how large those boxes are, but a typical CPU L1 data cache is 32 or 48KB, which should be plenty for this. The Python opcodes for this program are going to be tiny, and the interpreter itself uses the L1 instruction cache (which is another 32-48KB). I hope the sequential scan of the big array won't flush the L1 cache (there should be 12-way associativity with LRU, so I don't see how it could).

Anyway, there is no need to have 256 integers, just 2 is enough. When I try that, the results are similar: 17.5 ms (unsorted) / 12.5 ms (sorted)
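The box-size question can be checked with a quick back-of-the-envelope calculation. This is only a sketch: `sys.getsizeof` reports the object size as seen by the runtime (typically 28 bytes for a small int on 64-bit CPython) and ignores allocator rounding, and the 32KB L1 figure is an assumption taken from the comment above.

```python
import sys

box_bytes = sys.getsizeof(100)   # one small int object, as reported by the runtime
total_bytes = 256 * box_bytes    # the 256 distinct values used in the benchmark
l1_bytes = 32 * 1024             # assumed L1 data cache size (32KB)

print(f"one box: {box_bytes} bytes")
print(f"256 boxes: {total_bytes} bytes ({total_bytes / l1_bytes:.0%} of assumed L1)")
```

Even allowing for some alignment padding per object, 256 boxes stay well under the assumed cache size.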

bgwalter · 2025-04-05T16:46:11 1743871571

That seems very likely. The benchmark should probably use a range that is guaranteed to be outside of the cached smallint range.
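Whether a value falls inside the cached range can be probed with an identity check. The sketch below relies on CPython implementation details (a small-int cache that has historically covered -5 through 256, and `is` comparing object identity), so other implementations or future versions may behave differently. `same_object` is a hypothetical helper name.

```python
def same_object(text):
    # Parse the same text twice at runtime so the compiler cannot
    # fold both values into a single shared constant.
    a = int(text)
    b = int(text)
    return a is b


print(same_object("100"))         # inside the small-int cache on CPython
print(same_object("1000000000"))  # outside it: two separately allocated boxes
```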

exyi · 2025-04-05T18:24:23 1743877463

Then you are back to what the article discusses. Each integer is in a separate box, and those boxes are allocated in one order. Sorting the array by value shuffles it by address, so it will be much slower. I tested this as well; see the other comment.
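The address shuffling can be made visible with a small sketch. It assumes CPython, where `id()` returns the object's memory address; `ascending_address_fraction` is a hypothetical helper, and the value range is chosen only to stay outside the small-int cache. The exact fractions depend on the allocator and are not guaranteed.

```python
import random


def ascending_address_fraction(values):
    # In CPython, id() is the object's memory address; other
    # implementations may return something else.
    addresses = [id(v) for v in values]
    ascending = sum(1 for a, b in zip(addresses, addresses[1:]) if b > a)
    return ascending / (len(addresses) - 1)


rng = random.Random(0)
# All values lie outside the small-int cache, so each gets its own box.
boxed = [rng.randrange(10**9, 2 * 10**9) for _ in range(100_000)]

print("allocation order:", ascending_address_fraction(boxed))
print("sorted by value: ", ascending_address_fraction(sorted(boxed)))
```

A fraction near 1.0 means walking the list mostly moves forward through memory, while a fraction near 0.5 means the addresses are effectively in random order.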