A single line of code made a 24-core server slower than a laptop (2021) (pkolaczk.github.io)
526 points by xk3 on June 18, 2023 | 163 comments



This is really good. Thank you for this blog post.

Asynchronous code, coroutines, async/await, and parallelising problems are a deep interest of mine, and I blog about them every day.

I think the easiest way to parallelise is to shard your data per thread and treat your multithreaded (or multimachine) architecture as a tree - not a graph - where dataflow doesn't need to pass between tree branches. This is similar to Rust's "no interior mutability" principle and its data structure patterns.

My machine can lock and unlock a mutex 61,570,760 times a second, but it can count to 2 billion in that same second. So locks are expensive.
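
For anyone who wants to reproduce that kind of comparison, a rough sketch (iteration count and timings are obviously machine-dependent):

  use std::sync::Mutex;
  use std::time::Instant;

  fn main() {
      const N: u64 = 10_000_000;

      // Plain counting in a single thread.
      let t = Instant::now();
      let mut count = 0u64;
      for _ in 0..N {
          // black_box keeps the compiler from optimising the loop away
          count = std::hint::black_box(count + 1);
      }
      println!("counted to {} in {:?}", count, t.elapsed());

      // Lock and unlock an uncontended mutex on every iteration.
      let m = Mutex::new(0u64);
      let t = Instant::now();
      for _ in 0..N {
          *m.lock().unwrap() += 1;
      }
      println!("{} lock/unlock cycles in {:?}", N, t.elapsed());
  }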

I recently worked on parallelising the A* graph search algorithm that I'm using for code generation/program synthesis.

With 16 processes it takes 35 seconds to synthesise a program, but with 3 processes it takes only 21 seconds. I think my approach to parallelising A* needs a redesign.

We hit Amdahl's law when it comes to parallelising. I need to split up my problem into spaces that don't require synchronization/serialisation.

EDIT: I've mentioned this whitepaper before ("Scalability! But at what COST?"), but it would be useful reading for anybody working on multithreaded or distributed systems. In summary: single-threaded programs can easily be faster (in wall-clock time) than multithreaded or distributed systems, but they don't scale.

https://www.usenix.org/system/files/conference/hotos15/hotos...


This is, in a nutshell, the moral of the performance story. I took a graduate-level class in performance computing that was basically all lab-based. In the end what I learned, overwhelmingly, first hand, is that in the performance computing world, what wins is what exploits the hardware intelligently.

Theoretical advancements matter too, but usually only insofar as they can translate to hardware. There are some special cases where even a slight theoretical gain matters more than how it translates to hardware, but they're limited.

Anyway, to that end, everything becomes about dividing work in a way that parallelizes nicely, exploits cache well, reduces the need to share information between threads, etc., and this ultimately comes down to data structures. However, these aren't your normal, fundamental data structures. Instead each problem sort of has some kind of exotic, hypothetically ideal data structure that is fine-tuned to exploit the machine's resources to the max for just that problem. By the time you're done they rarely resemble anything intelligible, let alone what the whiteboard version of the algorithm was.

In that vein, there are a few general trends that appear over and over again. Trees vs graphs is definitely one of those trends, although that's more of a general theme and not a literal rule.


That sounds very interesting. Would you be able to reference any such exotic data structure and maybe describe the process of arriving at its definition?


I don't know any of their names, or if they even have names. My TA for that class had, at the time, one of the world's fastest SSSP (single-source shortest path) implementations, and our last class was basically just spending the whole lab time understanding his personal implementation. It was really in that moment that it clicked for me that the more you come to understand a problem in CS, the more your code morphs from something general into one big, bespoke data structure. Almost his entire program was just one giant data structure. So, I don't think this kind of thing has a name, nor can I say I really even understood it, but I do remember he used a lot of bags.

For reference, we spent the whole class doing SSSP on a graph representing every road in the United States. The naive A*/Dijkstra implementation took like 45 min, and naive Bellman-Ford never finished (probably days/weeks). By the end of the class I was so proud to have parallelized Bellman-Ford to the point where it took like 30s or something. And my TA's record-breaking implementation was <1s. All of this was on a normal university Linux desktop.

It's kind of like programming = logic + data. However, there's a continuum between the two, and on the far, extreme end of that spectrum, the structured access of the data can be the program itself.


These stories really expose how fast and efficient computers can actually be. I remember a post (don’t have the link) where someone sped up a Python script by 3000x or something crazy like that with C++. A script anyone would’ve been fine running, but it was just so much slower than what was possible.

Sometimes I wonder whether we should not focus more on these micro adjustments. We usually have this “it’s fast enough” attitude, but probably everything we use today (including web services) could be instant if focus was given to optimization (although yes, I understand the drawbacks of focusing solely on that).


We, as in the average everyday programmer (especially in your web services example), should definitely not focus on these micro-adjustments, I would say. Sure, it's fun. If you write library-type code, yes, please do focus on them! If you're a run-of-the-mill corporate dev, very rarely, I would think.

Other much simpler optimizations are usually possible in "corporate code". Lots of stupid things being done. To use your Python script example, probably you could've gotten a 2950x improvement by rewriting the bad parts better, but still in Python :)

Do you remember what the python script did/was for? How much time was spent building the original script? What was the guy that built it paid?

How long did it take to build the C++ version and how much was that guy paid?

If these were all open source/unpaid, what would it cost for these different types of people and could the company conceivably pay those salaries/keep the guy happy and busy enough to stay around?


I understand all that and apply that day to day. But my thought here is more to think of the possibilities if the performance of most things was brought to its absolute best. There’s a company that went from 30 servers to 2 just by rewriting from Ruby to Go (I can find the link later if interested). Even if they had kept Ruby and achieved a speed up, there was something just massively inefficient and they didn’t really care as they were in “we're shipping fast” mode and were making money. Which I understand perfectly but that’s a big waste of energy and resources.

Not that I think that should be priority #1 but it’s staggering how much more efficient things can be than the “naive” approach.


Perhaps you're thinking of Matt Parker's (of Stand-up Maths and Numberphile fame) video:

https://m.youtube.com/watch?v=c33AZBnRHks


I think that's it! Really shows my point: the first code was already OK and solved the problem in a reasonable amount of time; in a normal "let's ship faster" setup the ridiculous speedup achievable would never even have been considered. 4 million percent faster! A whole class of possibilities opens up at that point.


Thank you for the detailed answer. I am looking into a risk management system, and valuation also seems to be an area where data structures can significantly improve calculation time. I'm also looking into the role of data structures in enabling two separate optimisation paths, one being real-time latency oriented and the other being batch throughput oriented. If anyone has seen any work on data structures with a focus on this kind of optimisation it would be great. A bit related to the parent discussion, I am looking at graphs with tree views, if that makes sense.


For me it sounds a lot like data-driven programming or data-oriented programming.


The point is that these data structures are highly dependent on the problem you're trying to scale.

I don't think an example of one and the process taken to define it is that helpful, since the skill is about being able to do this for an arbitrary problem. It also probably requires a lot of trial and error.


> I need to split up my problem into spaces that don't require synchronization/serialisation.

When I first got a dual CPU (before "cores" were a thing) computer, I decided I'd try dipping my feet in some "proper" parallel coding.

I started with something simple, parallelizing a quicksort routine. This seemed quite trivial: instead of recursing, add the spans to be sorted to a list. I then spawned a thread per CPU, which fetched a span from the list, did a single quicksort pass on it and added up to two new spans to the list. Rinse repeat until list was empty.

Since each span was non-overlapping, the threads only had to synchronize while accessing the list of tasks.

When benchmarking it became clear that while there was a good performance win for large arrays, for short arrays the multithreaded code was much slower than the non-multithreaded version. On my hardware the threshold was around 50k items for integer elements and 20k or so for string elements, IIRC.

I added a threshold detection, where the thread would do a regular recursive quicksort on the span if the length was below the threshold, and this yielded significantly better results.
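
In today's Rust the design would look roughly like this (a sketch, not the original code; THRESHOLD stands in for the hardware-dependent cutoff mentioned above, and the shared list is the only point of synchronisation):

  use std::sync::atomic::{AtomicUsize, Ordering};
  use std::sync::Mutex;
  use std::thread;

  const THRESHOLD: usize = 50_000; // below this, fall back to sequential sort

  fn parallel_quicksort(data: &mut [i64], n_threads: usize) {
      let base = data.as_mut_ptr() as usize;
      let work = Mutex::new(vec![(0usize, data.len())]); // spans still to sort
      let pending = AtomicUsize::new(1);                 // spans queued or in flight

      thread::scope(|s| {
          for _ in 0..n_threads {
              s.spawn(|| loop {
                  if pending.load(Ordering::Acquire) == 0 {
                      return; // everything is sorted
                  }
                  let Some((lo, hi)) = work.lock().unwrap().pop() else {
                      continue; // list momentarily empty; simplistic busy wait
                  };
                  // SAFETY: spans handed out by the list never overlap.
                  let span = unsafe {
                      std::slice::from_raw_parts_mut((base as *mut i64).add(lo), hi - lo)
                  };
                  if span.len() <= THRESHOLD {
                      span.sort_unstable(); // sequential below the threshold
                  } else {
                      let p = partition(span); // one quicksort pass
                      let mut list = work.lock().unwrap();
                      list.push((lo, lo + p));
                      list.push((lo + p + 1, hi));
                      pending.fetch_add(2, Ordering::Release);
                  }
                  pending.fetch_sub(1, Ordering::Release);
              });
          }
      });
  }

  // Lomuto partition: returns the final index of the pivot within `s`.
  fn partition(s: &mut [i64]) -> usize {
      let pivot = s[s.len() - 1];
      let mut i = 0;
      for j in 0..s.len() - 1 {
          if s[j] <= pivot {
              s.swap(i, j);
              i += 1;
          }
      }
      s.swap(i, s.len() - 1);
      i
  }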

And with that the harsh reality of multithreading hit me: no free lunch. It was clear the threshold varies not just with element type (slow/complex comparators would reduce the threshold, and vice versa) but with the details of the hardware. So a hardcoded threshold was out of the picture, and it would have to be dynamically determined at runtime.

Was a great learning experience though.


For those who might wonder why: quicksort splits the array in half and then repeats on the two halves.

Thus the majority of spans are small and quickly processed, and hence the naive multithreaded version leads to lock contention as the threads fight to access the shared list of spans.


The basic algorithm of A* is very sequential, isn't it? It works by taking the best-scoring unexpanded node and expanding it. Most of the time, there's only one such node. When expansion has finished, you need to re-sort the queue/heap of unexpanded nodes. All those steps are sequential. So I guess the only gain is when the node expansion can be done in parallel; expanding the top-N nodes probably is counterproductive for many problems. How much you gain then depends on the time expansion takes. The advantage of parallelism then depends on how much time one step "down" takes.


Yes, I think I can take advantage of my particular problem to parallelise, which I'm still trying to work out.

My neighbour scanning is dynamic, and all neighbours reachable from a node are independent from that point forwards; the search will not visit the exact same node again. In essence my problem is kind of a tree.

I am trying to infer data flow between two states, including hidden states such as function calls. My dream is that I can provide a start state and an end state and the computer writes the program itself based on type information and data flow analysis of values.

Here's my input data - which is what memory is set to and what registers are set to.

  start_state = {
    "memory": [0, 0, 0, 0],
    "rax": 0,
    "rbx": 1,
    "rcx": 2,
    "rdx": 3,
    "rsp": -1,
    "rdi": -1,
    "rbp": -1
  }

  end_state = {
    "memory": [3, 1, 2, -1],
    "rax": 3,
    "rbx": 2,
    "rcx": 1,
    "rdx": 0,
    "rsp": 6,
    "rdi": -1,
    "rbp": -1
  }

  # these functions take in a value and return another value
  minus_1_to_four = Function("minus1", -1, 4)
  four_to_five = Function("fourtofive", 4, 5)
  five_to_six = Function("fivetosix", 5, 6)
This synthesises the following program in 16 seconds (I improved the heuristic function). Function input and output values can be in any register, but in my example they are all in the same register. With 3 processes it synthesises in 0.6 seconds.

  [start, mov %rax, (%rdx), mov %rbx, (%rbx), mov %rcx, (%rcx), mov %rdx, (%rsp), mov %rax, %rbp, mov %rdx, %rax, mov %rbp, %rdx, mov %rcx, %rbp, mov %rbx, %rcx, mov %rbp, %rbx, call minus1(rsp=-1) -> rsp=4, mov $-1, %rbp, call fourtofive(rsp=4) -> rsp=5, call fivetosix(rsp=5) -> rsp=6]


This is correct, and Bellman-Ford, while naive compared to A*, actually parallelizes much better. That said, an optimized Bellman-Ford doesn't even resemble the original, naive version of the algorithm.


I wouldn't say sequential. Maybe the implementations are sequential and state-based because games need to pause the process on a frame refresh.

A* is naturally recursive [1] and so can be parallelized as you go further.

[1] Unless I'm mistaking some other algorithm for A*: each new recursive call for each of the neighbours can be started on a different thread or node.


When I was noodling with the Traveling Salesman problem, I was sure that what I really needed was to spend X% of available resources on the Big Gamble (a low probability algorithm with fast results), Y% on a common heuristic and the rest on the honest work of plowing through the linear equations progressively culling the remaining scenarios that need to be tested. I had limited success with this though. I just haven’t done enough LP to make anything noteworthy, and it was a tool sharpening exercise, which made me a little more effective at more mundane batch processing tasks, not a great mind of NP-completeness.


FYI, there are lots of papers on parallel shortest path algorithms. If an approximate solution suffices, there’s also a lot of research available on that, often with some parameter that lets you trade more computation for a tighter approximation. It's not a problem that parallelises particularly well though, so not sure if you'll see good gains in practice.

If you can restrict the structure of your graphs (e.g. planar) then some very efficient methods exist.


Isn't a tree just a graph that is directed and acyclic (DAG)?

So a tree is just a subtype of a graph?


A tree is a subtype of graph. However, a DAG is not a tree if it has cycles once you forget direction, meaning paths can join up again after splitting. This distinction matters because when different paths "join up" again, there is often complicated data duplication/integration necessary in order to combine the results. On a distributed system, it may mean data passing over a network, which you want to avoid.


In graph theory, trees are undirected. In computing, trees usually have two features that differentiate them from their more minimal graph theoretical cousins: they are (a) directed and (b) rooted (a particular vertex is designated as the root, and every vertex can be walked to from that vertex).

But yeah. Some graphs are trees. And you can construct trees within graphs for efficiently navigating connected graphs, which is done in various important and famous algorithms.


A tree is a subtype of a graph, but it is not the same as a DAG. A diamond-shaped directed graph (edges A->B, A->C, B->D, C->D) is a DAG, but not a tree.


Can you share any of the parallel A* code? I’m currently looking at using A* for use in generative design and am pretty ignorant about parallelising code. Would love to learn more!


I've taken a very naive approach and it does NOT scale. It uses Python multiprocessing. I think the pickling and serialization of fScore updates - the communication and synchronization lock overhead - is slowing it down. It simply does not parallelise.

This is my code generator/program synthesiser, it synthesises assembly instructions from two given states of memory and registers to take it from one to the other, including hidden states such as functions.

https://replit.com/@Chronological/SlidingPuzzle3#main.py


Thanks, regardless of ability to parallelise it looks interesting.


I've changed the algorithm to not send fScore updates and treat each branch as independent.

This speeds it up from 16 seconds to 320 milliseconds. I am thinking how to make the algorithm scale, by creating better and alternative program candidates in each thread.


This is exactly the reason why I like Scylla as a database. One shard per CPU, each shard owning a different partition of the data. Great performance.


I'd like your opinion: channels or locks?

This is really for a program I'm writing in Nim, so perhaps it depends on how channels are implemented?


Atomics don’t scale. In this age that needs to be widespread elementary knowledge. They are particularly bad on armv8 without atomic extensions because that platform has no equivalent to “lock; xadd” and an atomic increment could theoretically become an infinite loop.


Contended atomics don't scale. It is possible to construct concurrent structures which contend rarely (but which must still use atomics to guard against the rare case when mediation is required). There was also an interesting paper from a few years ago about using HTM to detect contention in a scalable fashion. I will aver that such code may be difficult to reason about—shared-nothing has far more obvious correctness and performance properties. (Queueing/amortisation also works, and lands somewhere in the middle wrt ease of reasoning.)

RISC-V has an interesting compromise, which is to delineate a subset of ll/sc loops which is guaranteed to eventually make global progress. I do agree that it is better to include real wait-free primitives like CAS and FAA; but I wish that such guarantees of global progress would be provided to HTM.


Let's be clear here: while you are 100% right that armv8 atomics kinda suck by default, neither compare-and-swap nor load-linked/store-conditional scales, but some atomics can scale if implemented and used appropriately. As parent points out, an atomic increment can scale, we proved they could scale to the performance of a load in the 80s for goodness sake. The fact that arm, ppc, and some others tend to implement these in the absolute worst way possible for performance doesn't mean atomics can't scale.


You're not being clear - are you claiming that they do scale well on ARM/PPC, despite the "absolute worst" implementation or that they don't?


He means that locks have the same problems as atomics anyway on platforms with load-linked/store-conditional. So yes, that's a problem of the platform, but even on arm/ppc atomics scale better than locks.

Which is true, but eliminating sharing works even better if possible as proved by the article.


I'm not sure how atomics having the performance of loads means they will scale (and to be honest I doubt they could have the performance of loads on modern architectures, otherwise why would e.g. Intel not implement them to be faster - but let's pretend it is possible).

The fine article shows that a single lock xadd can destroy performance on some x86 systems and explains that it is due to cache-line bouncing. You would get the same effect with loads: if the loaded data is read-only or mostly read-only it will of course scale fine. It won't scale as soon as it starts bouncing too much.


Then why keep bouncing it? Leaving it managed by a known single user in a cache architecture like on x86 means that latency goes up, but the overall throughput of the operation goes up drastically. That’s why flat combining data structures are so popular there despite their absolute maximum throughput being bounded by sequential performance.

Also, FWIW, Intel largely does implement operations closer to that way on single-socket parts. If you want to see it for real, look at on-device atomics on a GPU. Ironically an average laptop chip handles atomics much faster than most servers as a result.


> an atomic increment can scale, we proved they could scale to the performance of a load in the 80s

Interesting--can you link a reference for this?


This is the citation I most often use from that time, though the primary source is probably another reference down the line: https://dl.acm.org/doi/10.1145/69624.357206

The short version is that if atomics are implemented as part of the memory network, common cache, or memory controller, then atomics of the form "fetch-and-X" can be implemented in roughly equivalent complexity to a load of the current value (plus an instruction for the op, give or take), with the cost only scaling past that as op queues or other implementation-specific limits fire. It's the infinite-consensus ops that just can't scale no matter what you do. The coherence and memory model matter a lot too of course, which is part of why x86 tends to be slow for atomics, while arm and ppc (with fetch-and-X extensions) or GPUs tend to do much better.


Ignorant question: what's the alternative? A normal mutex? I just sort of assumed atomics were abstractions around some type and a mutex.


Partition your data structure to be per-CPU and get rid of any sharing.
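
A minimal sketch of what that looks like, assuming the work is something trivially splittable like counting: each thread owns its own plain counter and the results are only combined once at the end, so no cache line is ever written by two cores.

  use std::thread;

  fn count_in_parallel(n_threads: usize, per_thread: u64) -> u64 {
      thread::scope(|s| {
          let handles: Vec<_> = (0..n_threads)
              .map(|_| {
                  s.spawn(move || {
                      let mut local = 0u64; // private to this thread
                      for _ in 0..per_thread {
                          local += 1;
                      }
                      local // hand the partial result back exactly once
                  })
              })
              .collect();
          handles.into_iter().map(|h| h.join().unwrap()).sum()
      })
  }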


> Just don't use any shared state, bro

Thank you, Sherlock.

But for the rest of us: when you need shared state, lockfree atomic spinlocks are roughly 1000 times more performant than mutexes. (Not a scientific estimate, numbers taken from real-world experience.)


No, they are not.

The slow part of locking is invalidation of cache lines, and this has to happen with spinlocks anyway. Modern mutex implementations also first try to acquire the lock optimistically, so in the uncontended case they are as fast as userspace spinlocks (modulo inlining).

And if you have a contended lock, then userspace spinlocks are a PITA. You need to take care of fairness, ideally deal with the scheduler (yield to a thread that is not spinning on the same spinlock), and so on.

You can do all of that properly, but even then, you're looking at maaaaybe 10-20% performance increase in real-world applications.

Pure spinlocks can win only in contrived cases, like only having exactly two threads contending for the lock, with short locked sections.


>userspace spinlocks are a PITA ... >ideally deal with the scheduler

Don't you anyway need to drop the scheduler a hint so that the thread holding the spinlock doesn't get scheduled off the CPU, making the contenders wait longer than they ideally should? (Or is this what you meant by your "fairness" reference?)

In my limited understanding, this was the no. 1 reason why userspace spinlocks were discouraged -- because pretty much no scheduler accepted a hint from userspace to not kick a thread off the CPU -- modulo jumping through hoops with priority, et cetera.

If I'm missing something (and I likely am), I would be glad to be educated.


> Don't you anyway need to drop the scheduler a hint so that the thread holding the spinlock doesn't get scheduled off the CPU, making the contenders wait longer than they ideally should?

How would you do it? You can change the thread's priority to realtime to prevent the scheduler from pre-empting it while holding the lock, but this requires a kernel roundtrip and several scheduler locks anyway.

You can have a worker thread pool, with individual threads hard-pinned to specific CPUs. Then you can dispatch your work into these threads. In practice this will guarantee that they are not pre-empted except for occasional kernel housekeeping needs.

But this will make it impossible to use the kernel-level mutexes because they can block your worker threads. So you'll have to reimplement waiting mutexes in userspace, along with a scheduler to intelligently switch to a work item that is not blocked on waiting for something else to complete.

Long story short, you're eventually going to reimplement the kernel in userspace. This can be done, and you can get some performance improvements out of it because you can avoid relatively slow kernel-userspace transitions. DPDK is a good example of this, but at that point you're not just using spinlocks, you're writing software for essentially a custom operating system with its own IO, locking, memory management, etc.


Yes indeed -- all of these reasons make userspace spinlocks undesirable in the general case, which is pretty much the point I was trying to make :)

(Edit:) Arguably, I could've been less obtuse in what I wrote.


You're correct -- the problem with userspace spinlocks is that the holding thread can be scheduled off. You can prevent this to some degree (probabilistically) with isolcpus and thread pinning, but that usually doesn't prevent hardware interrupts from running on those cores (which kernel spinlocks can avoid!). This isn't really solvable without running in kernel context (to have the elevated permissions necessary to mask interrupts).


Pure spin locks will always win for uncontended locks.


Without any contention at all, a spinlock and a mutex are identical: a single compare-and-swap.


Technically you can use a release-store on a spinlock unlock path, but you need a CAS for a mutex.
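
For illustration, a minimal spinlock sketch in Rust showing exactly that: a CAS loop on the acquire path, and a plain release-ordered store on the unlock path.

  use std::sync::atomic::{AtomicBool, Ordering};

  pub struct SpinLock {
      locked: AtomicBool,
  }

  impl SpinLock {
      pub const fn new() -> Self {
          SpinLock { locked: AtomicBool::new(false) }
      }

      pub fn lock(&self) {
          // Acquire: CAS until we manage to flip false -> true.
          while self
              .locked
              .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
              .is_err()
          {
              std::hint::spin_loop();
          }
      }

      pub fn unlock(&self) {
          // Release: no CAS needed, just publish "unlocked".
          self.locked.store(false, Ordering::Release);
      }
  }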


> spinlocks are roughly 1000 times more performant than mutexes

That is absolutely not universal. See e.g. this thread, though there are of course many places discussing it: https://news.ycombinator.com/item?id=21970050


This immediately popped into my head too when I saw the comment.

Spinlocks are terrible, and written by people who are trying to do quick hacks because they work in terrible environments and are taught to do bad things.

There is a reason that most GPU drivers are just lists of hacks to get games working correctly.


Yeah, spinlocks sound cool but actually they are terrible. Hybrid locks are generally quite good. The only way to get better is to pin cores and let nothing else run on them. Only then will spinlocks have a chance to increase performance by a tiny bit.


Locking is hard in general. Don’t do an unbounded spin in userspace is typically good advice though. The typical mutex construction these days will spin for a little while in an attempt to take advantage of mostly uncontended locks and then yield to the kernel.


mutexes are not that slow.

Uncontested (single thread):

  incrementing using atomics took 0.002011 s (0.2513 ns / increment)
  incrementing using mutex took   0.005515 s (0.6894 ns / increment)
Contested (8 threads trying to increment a single protected integer):

  incrementing using atomics took 0.1069 s (13.36 ns / increment)
  incrementing using mutex took   1.970 s (246.3 ns / increment)
So mutexes are roughly the same speed in the uncontested case, and about 20x slower in this heavily contested case. This is on Windows.
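
Roughly the same experiment in Rust, for anyone who wants to try it on their own hardware (thread and iteration counts are arbitrary):

  use std::sync::atomic::{AtomicU64, Ordering};
  use std::sync::Mutex;
  use std::thread;
  use std::time::Instant;

  const THREADS: usize = 8;
  const ITERS: u64 = 1_000_000;

  fn main() {
      // Contested atomic increments.
      let counter = AtomicU64::new(0);
      let t = Instant::now();
      thread::scope(|s| {
          for _ in 0..THREADS {
              s.spawn(|| {
                  for _ in 0..ITERS {
                      counter.fetch_add(1, Ordering::Relaxed);
                  }
              });
          }
      });
      println!("atomics: {:?}", t.elapsed());

      // Contested mutex-protected increments.
      let counter = Mutex::new(0u64);
      let t = Instant::now();
      thread::scope(|s| {
          for _ in 0..THREADS {
              s.spawn(|| {
                  for _ in 0..ITERS {
                      *counter.lock().unwrap() += 1;
                  }
              });
          }
      });
      println!("mutex:   {:?}", t.elapsed());
  }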


Is that AMD or Intel? Intel invalidates the other CPU's cache line on write. AMD sends an update to the other CPU's cache line. AMD is supposed to be significantly faster than Intel on a "real" multiprocessor workload.


Don’t know why you were downvoted but yes, AMD uses MOESI and Intel uses MESI.


AMD Ryzen 9.


I guess you have more than 8 physical CPUs?


16 cores


This is not true; you should never use spinlocks in userspace unless you know exactly what you're doing. Using spinlocks well requires control over scheduling, which you don't have in userspace; this is why it makes more sense for the kernel to use spinlocks. Without that control, with spinlocks, random threads can starve for no reason.


You are right, theoretically.

Unfortunately, the userspace part of pthreads is not tuned for performance and does a lot of ridiculous things if you care about parallelism.

(Mostly I was complaining about the poor quality of userspace system libraries.)


How is this done in Rust or C?


This is not a language problem. It’s an algorithm design problem. There’s no silver bullet. The basic principle is to divide the problem space into independent blocks of work. How to achieve that depends on the problem.


For a good example, there's a tricky parallelization problem in physical simulation, where you have to update edge/triangle/bending wing forces in a mesh structure without any race conditions. (This becomes especially thorny if you want to parallelize your algorithm to the GPU.) A surprising solution for this is graph coloring, where you "color" each element so that no two elements that interfere with each other get the same color. Then you can safely parallelize the updates of all elements inside each color group, since sharing a color guarantees absolutely no interference.
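
A sketch of the idea in Rust (greedy colouring; `interferes` and `update_element` are made-up names, and the per-colour loop uses the rayon crate for the parallel part):

  use rayon::prelude::*;
  use std::collections::HashSet;

  // interferes[i] lists the elements that must not be updated at the same
  // time as element i (e.g. because they touch shared vertices).
  fn colour(interferes: &[Vec<usize>]) -> Vec<usize> {
      let mut colours = vec![usize::MAX; interferes.len()];
      for i in 0..interferes.len() {
          let used: HashSet<usize> = interferes[i]
              .iter()
              .map(|&j| colours[j])
              .filter(|&c| c != usize::MAX)
              .collect();
          // Smallest colour not used by an already-coloured neighbour.
          colours[i] = (0usize..).find(|c| !used.contains(c)).unwrap();
      }
      colours
  }

  // Elements with the same colour never interfere, so each colour class
  // can be updated in parallel without races.
  fn update_all(colours: &[usize], update_element: impl Fn(usize) + Sync) {
      let n_colours = colours.iter().copied().max().map_or(0, |m| m + 1);
      for c in 0..n_colours {
          (0..colours.len())
              .into_par_iter()
              .filter(|&i| colours[i] == c)
              .for_each(|i| update_element(i));
      }
  }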


Algorithm design problems, when general enough, warrant being treated as language problems!

Every feature of programming languages started in this fashion.


Indeed, the holy grail (or perhaps the rapture) of programming languages is a compiler which generates the entire program with zero human-written code. I fear it may be on the horizon.


There are programming languages e.g. occam designed with the intention of exposing parallelism algebraically, but I wouldn't call them a silver bullet either.


They are great in theory, but suck in practice.

We don't have a good optimising compiler for that.


For something small like a counter you can use thread_local (since C11). For substantial parallelized computation, designing the division of work to avoid shared writes typically entails a scheduling function that parcels up the work, sets up memory allocation in advance to avoid conflict, and then hands an entirely unshared execution context to each thread's start function, most likely as a pointer to an app-specific struct (since that is what pthread_create allows for). A combination operation in the thread reaper loop then collates the results (extra brownie points accrue when the reduce function is written to vectorize).

The memory allocator plays a significant role, since allocation strategy needs to be per-thread-/per-CPU-cache-aware. Choosing and then tuning a different malloc (e.g. tcmalloc, jemalloc) to the one in your platform's default library is a non-trivial matter but may have enormous impact both on overall performance and memory demand.

In addition, when you design computation this way it is relatively easy to hadoopify it later, since it's basically map-reduce writ small.


MPI partitioning by rank? I’m curious what other solutions there may be.


In rust, you can usually use the rayon library which handles partitioning and scheduling for you.
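
For example, something like this spreads the work across a pool and combines the partial results without any shared mutable state (toy computation; rayon is the external crate mentioned above):

  use rayon::prelude::*;

  fn sum_of_squares(upper: u64) -> u64 {
      // rayon splits the range into chunks, runs them on worker threads,
      // and reduces the per-chunk sums back into one value.
      (0..upper).into_par_iter().map(|x| x * x).sum()
  }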


That's just multi-threading in general.

I feel like the post I was responding to was talking about handling/pinning a thread to a specific CPU core?


aka share-nothing architecture


It's the other way around. Mutex is an abstraction over atomics.


Other way around: mutexes are built on top of atomics, but in typical implementations they will yield to the kernel scheduler fairly. That system is called futexes on Linux. The kernel will also use atomics on its end.


Not ignorant at all; you have actually pointed to the core of the problem.

IMO it is not enough to know the logical constructs used for synchronization in parallel programs; you have to know the hardware too.

A little bit of everything from high-level parallel algorithms/data structures through memory consistency models, compiler optimizations to processor micro-architecture (cache coherency protocols, atomic instructions, NoC overhead etc.) is needed. Basically we need to be aware of the overheads for contention at every level in the system.


Don't share atomics among threads. For example, envoy proxy mostly doesn't share atomics among threads, and can scale nicely on arm64 without requiring the atomic extensions.


Honest question: why would atomics be necessary or useful if data isn’t shared between threads?


Because at some point data has to be exchanged across threads. For example, a task queue might have tasks that can be independently executed in a thread pool, but the queue index has to be atomically modified when some other thread emplaces a new task. Or if you want to transfer ownership of a heap-allocated object between threads, you need to atomically transfer the pointer, or modify the reference count of that pointer. Things like that.


You can and should use atomics, just not in any kind of hot loop. Using atomics is fine but expensive.


You can reduce sharing probabilistically, for example -- because contention is an N-squared problem, reducing sharing by some linear factor is enough for a large reduction in contention. You aren't eliminating contested atomics entirely, just making them low-contention rather than highly contended.
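
A sketch of one such scheme: a striped counter where each thread hashes to one of a handful of cache-line-padded shards, so two threads only contend when they happen to land on the same shard (names and the shard count are illustrative).

  use std::collections::hash_map::DefaultHasher;
  use std::hash::{Hash, Hasher};
  use std::sync::atomic::{AtomicU64, Ordering};

  const N_SHARDS: usize = 16;

  #[repr(align(64))] // keep each shard on its own cache line
  struct Shard(AtomicU64);

  struct StripedCounter {
      shards: [Shard; N_SHARDS],
  }

  impl StripedCounter {
      fn new() -> Self {
          StripedCounter {
              shards: std::array::from_fn(|_| Shard(AtomicU64::new(0))),
          }
      }

      fn add(&self, n: u64) {
          // Any cheap, stable per-thread value works for picking a shard.
          let mut h = DefaultHasher::new();
          std::thread::current().id().hash(&mut h);
          let shard = (h.finish() as usize) % N_SHARDS;
          self.shards[shard].0.fetch_add(n, Ordering::Relaxed);
      }

      fn total(&self) -> u64 {
          // Approximate snapshot if writers are still running.
          self.shards.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
      }
  }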


Reference counting (the use of std::sync::Arc covered in the article) is a parallelism-unfriendly type of garbage collection algorithm - so use a better one. But it might be that Rust doesn't make that easy.


It’s not just theoretical. In bad situations (many cores, heavy contention…) you can get cores to starve each other as they try to each poke the monitor and fail continuously. I know of at least one platform which moved to LSE immediately partly because it fixed stuff like this. LL/SC is nice from some perspectives but it fails if you scale it up and also can be difficult to reason about (cough, cough, Linux getting their cmpxchg implementation wrong for years…)


Can you say a bit more about what Linux got wrong in cmpxchg or provide a link?



Thanks.


Atomics scale very well if you are reading often and writing rarely.


Exactly. Atomics are a red herring. The single writer principle should be the fundamental guideline.
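
A minimal sketch of the single-writer idea: exactly one thread ever stores to the shared value, everyone else only loads it, so there is never any write/write contention on that cache line.

  use std::sync::atomic::{AtomicU64, Ordering};
  use std::thread;

  static LATEST: AtomicU64 = AtomicU64::new(0);

  fn main() {
      thread::scope(|s| {
          // The single writer publishes new values.
          s.spawn(|| {
              for v in 1..=1000u64 {
                  LATEST.store(v, Ordering::Release);
              }
          });

          // Readers only ever observe the latest published value.
          for _ in 0..4 {
              s.spawn(|| {
                  let mut seen = 0;
                  while seen < 1000 {
                      seen = LATEST.load(Ordering::Acquire);
                  }
              });
          }
      });
  }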


I was optimizing some code where, for reasons, many threads had to aggregate data in a shared data structure. Each thread would have a small buffer, and when full it would acquire a lock and aggregate.

While fooling around I had the idea to exploit that not all atomic operations are equal. So I added an additional "contention" flag. When a thread wanted to aggregate it would do an atomic read of the flag, if it was set it would bail[1] and continue to accumulate to the local buffer. Once done aggregating the flag would be reset after unlocking.

Effectively this was adding a single-iteration spinlock before the "heavy" lock, but even using CriticalSections for the lock on Windows (which does spin before acquiring a mutex lock) it resulted in clear improvements, especially when running on machines with more than 8 cores.

[1]: It would bail unless the local buffer grew too large, so slight memory vs perf tradeoff there.
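
The pattern was roughly this (a sketch in Rust with made-up names, not the actual code):

  use std::sync::atomic::{AtomicBool, Ordering};
  use std::sync::Mutex;

  struct Aggregator {
      busy: AtomicBool,        // cheap "someone is aggregating right now" hint
      shared: Mutex<Vec<u64>>, // the expensive shared structure
  }

  impl Aggregator {
      fn maybe_flush(&self, local: &mut Vec<u64>, flush_limit: usize) {
          // Cheap atomic read first: if someone else is already in there,
          // bail out and keep accumulating locally - unless the local
          // buffer has grown too large (the memory/perf tradeoff in [1]).
          if self.busy.load(Ordering::Acquire) && local.len() < flush_limit {
              return;
          }
          self.busy.store(true, Ordering::Release);
          {
              let mut shared = self.shared.lock().unwrap();
              shared.append(local); // aggregate and clear the local buffer
          } // heavy lock released here
          self.busy.store(false, Ordering::Release);
      }
  }

(The flag is only a hint; the mutex still provides the actual exclusion.)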


This is more or less equivalent to a try_lock() operation, yeah? And continuing to collect to the local buffer if try_lock fails, up to some limit.


Yeah, I suppose. One limitation we had was cross-compiling on Windows, Linux and OSX, so the available locking primitives differed, so it was easier to just do it via atomics.


Yep, uncontended atomics are quite fast. When they're contended is when things start to slow down like OP has seen


Sorry to nit, but this is important. Parent was saying that reads are cheap, which is true. Writes can be expensive even if uncontended, because they invalidate cache lines. I guess you could say they contend with unrelated data but that would stretch the definition a bit.

So what does this mean in practice? In my view, the way to think about it is that atomic writes have non-local side effects. But since atomics are necessary for synchronization, and involves both reads and writes, we should compartmentalize and minimize synchronization as much as possible, to avoid these gnarly issues creeping up and tanking real world performance.

Arc<T> (and its relatives in other languages) constitute textbook violations of this rule. In Rust they are everywhere in non-trivial code, including in the async runtimes themselves. Of course, they also violate (or evade, if you're generous) the ownership principles of idiomatic Rust (or "hello world-Rust", if you will). I think we need to take a hard look as an industry at ref counting as a silver bullet escape hatch to shared data.


> Writes can be expensive even if uncontended, because they invalidate cache lines.

This isn't expensive if cache lines are uncontended, though.

> I guess you could say they contend with unrelated data but that would stretch the definition a bit.

I think you might be talking about "false sharing." This is real contention on the cache line due to co-location of apparently unrelated variables.

> Arc<T> (and its relatives in other languages) constitute textbook violations of this rule.

Definitely!

> In Rust they are everywhere in non-trivial code

Ehh.. only the hot ones matter. Most are not actually contended much, and the article's solution (unshared clone) is a very reasonable approach to scale these without an API change.


> I think you might be talking about "false sharing." This is real contention on the cache line due to co-location of apparently unrelated variables.

You’re right. And cache lines are quite small, so this is probably less common. Yet, it’s another potential source of perf regressions in concurrent code, as if it wasn’t incredibly complex already.

> Ehh.. only the hot ones matter.

Well.. first atomics have even more non-local effects, such as barriers on instruction reordering. So Arcs that are cloned willy nilly can still be significant, with no contention.

But let’s ignore that and focus on the contended case: when you hear “uncontended X are basically free” it (subjectively, imo) downplays the issue, like contention is some special case that you can compartmentalize and only worry about when you consciously decide to write contended code. The blog post demonstrates exactly how this is so easy for contention to creep in, that you have to be superhuman levels of vigilant and paranoid to spot these issues upfront. Extremely easy to miss in eg code review.

I think both compile- and runtime tooling could help at least partly here. I’d also give rust some credit for having explicit clone instead of hiding it.


It's a good point, it's easy to cause contention and not realize it because of cache lines


`lock ; xadd` isn't really fundamentally different than `ll ; add ; sc; b again`

The latter is a bit clunky but the core more or less implements them in the same way. Acquire a line exclusive, load value, increment it, write it back. And you can hold the line exclusive such that the conditional store failure cause is mostly a formality, and can't actually become an infinite loop.

No general-purpose atomics are done by shipping the operation to the cache or to memory controllers; it just doesn't work[*]. So even if they look slightly different in the core, they all end up looking exactly the same at the caches and coherency protocols, and that is where atomics are slow. Well, any sharing of cache-line updates, really.

[*] EDIT: That is to say, it doesn't work for performance, for many reasons. Some CPUs do have "remote atomics" or something like that which does exactly this, but they are not intended to be broadly used.


Yes, AFAIK in practice many architectures special case some ll/sc sequences to guarantee forward progress and fairness.


> an atomic increment could theoretically become an infinite loop

Only if your software is badly implemented. If you follow the requirements specified by the architecture, forward progress is guaranteed. Of course there is no guarantee how long it will take, but the things that make it slow are essentially the same things that make atomics slow.


The x86 lock prefix does the same loop; this is the most performant and scalable way to do it.


What about variables locked with mutexes or semaphores?


locks and mutexes will perform worse than things like atomic increment.


Maybe. It depends on the algorithm. If you share a lot of data, then a mutex (lock, update it all, then unlock) is fastest. If you share little data, atomics can be faster. This is case by case on both the algorithm and the hardware, so nobody can say which is better.

Of course not sharing at all is best, but often you have no choice in that.


The significant cost comes from contending on memory. Changing the synchronization primitive doesn't help much. You have to change your algorithm not to contend.

This is significant because if you profile your code and find that a mutex is expensive, you should change your algorithm to avoid contention rather than blindly trying to change the code to use atomics instead of mutexes.


Ouch. I had no idea that contended Arc could be that expensive.

I found a contention bug inside of Wine a few weeks ago. Something that is supposed to be "lockless" really had three nested spinlocks. With many threads contending for a lock, performance would drop to about 1% of normal.[1]

[1] https://bugs.winehq.org/show_bug.cgi?id=54979


Lockless is not the same as lock-free, which is not the same as wait-free, which seems to be what you are describing.


>Lockless is not the same as lock-free

Can I get the HN comment length explanation of what this means?


Lock-free is a term of art with a formally defined meaning (at least one thread makes forward progress at any time). Lock-less doesn't really have a well defined meaning (first you would have to define what a lock is).


Sometimes I wish HN would have a term-of-art dictionary built in so users can check if there's a special meaning to reduce the chance of misinterpretation.


lockless is often the wrong term, goal, idea and solution. mutexes/futexes do very well, almost zero cost when not contended.


Previous discussion: https://news.ycombinator.com/item?id=29747921 (617 points | Dec 31, 2021 | 195 comments)


> Therefore the Arc had to stay. Instead of using a single Arc value we can use one Arc per thread.

I thought the title sounded familiar, and the culprit is more or less the same (false or in this case unnecessary sharing). But I didn’t think it was quite that long ago, so maybe it’s two articles about the same classic blunder.


Same problem occurs with c++ std::shared_ptr. I guess all reference counting has this inherent scaling issue due to contention ruining cache lines. Makes me wonder how/if you get linear parallelism in Swift.


Not sure what is currently in Swift, but this paper described a biased reference counting approach - essentially two counters: one non-atomic, used only by a specific thread (the supposed owner?), and another (atomic) used by all other threads - so the sum of the two gives the real reference count (somewhat). Paper here - https://dl.acm.org/doi/pdf/10.1145/3243176.3243195

(Before reading the paper I was expecting that the additional bytes were added for the split counter, plus a thread id - but it actually packs them, using the lower bits for reference counting.)

I wonder what abseil/folly/tbb do - need to check (we are heavy std::shared_ptr users, but I don't think 14 bits as described in the paper above would be enough for our use case)


Can't speak for abseil and tbb, but in folly there are a few solutions for the common problem of sharing state between a writer that updates it very infrequently and concurrent readers that read it very frequently (typical use case is configs).

The most performant solutions are RCU (https://github.com/facebook/folly/blob/main/folly/synchroniz...) and hazard pointers (https://github.com/facebook/folly/blob/main/folly/synchroniz...), but they're not quite as easy to use as a shared_ptr [1].

Then there is simil-shared_ptr implemented with thread-local counters (https://github.com/facebook/folly/blob/main/folly/experiment...).

If you absolutely need a std::shared_ptr (which can be the case if you're working with pre-existing interfaces) there is CoreCachedSharedPtr (https://github.com/facebook/folly/blob/main/folly/concurrenc...), which uses an aliasing trick to transparently maintain per-core reference counts, and scales linearly, but it works only when acquiring the shared_ptr, any subsequent copies of that would still cause contention if passed around in threads.

[1] Google has a proposal to make a smart pointer based on RCU/hazptr, but I'm not a fan of it because generally RCU/hazptr guards need to be released in the same thread that acquired them, and hiding them in a freely movable object looks like a recipe for disaster to me, especially if paired with coroutines https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p05...


I think RW locks are also worth a mention in this problem space, and folly has them. Actually several implementations.


Yes, though note that RW locks need to maintain the count of readers, and most implementations just use a single counter in the lock state word, which makes them as subject to contention as a reference count.

folly::SharedMutex, the main RW lock in folly, tries to shard them by core when it detects contention (in fact it is the OG core-sharded primitive in folly) and when that works it is virtually linearly scalable, but the detection is a heuristic (which also has to minimize memory and writer cost) so there are still access patterns that can be pathological.


> the main RW lock in folly, tries to shard them by core when it detects contention

That's very interesting. I dealt with a scenario where I had to scale a hash-map to support 1k-5k concurrent readers and a few sporadic writers. I ended up sharding the hash-map, with each shard using an RW lock to guard access to the underlying data. This essentially eliminated the contention within the RW lock itself, or at least I wasn't able to measure it.


RW locks have a pretty limited space where they're useful (because readers are still contending on a cache line to update the reader count). They're similar to a mutex + ref count. They pretty much only work well in situations where there aren't many readers and they hold the lock for a relatively long amount of time. (Ok, there are sharded reader count variants that reduce the cost of the reader count, but they're still only useful for relatively long reader lock sections.)

Instead, if a mutex showed up as hot in a profile, I'd look at things like RCU/hazard pointers for read-biased data structures in most situations, or trying to shard or otherwise split data between cores such that there isn't much contention on the boring, vanilla mutex.


> because readers are still contending on a cache line to update the reader count

They don't have to, see my sibling comment about folly::SharedMutex.


I preemptively attempted to respond to this:

> (Ok, there are sharded reader count variants that reduce the cost of the reader count, but they're still only useful for relatively long reader lock sections.)

RW locks are a code/design smell, even with a cheap reader count.


Was that a later edit? I didn't see it when I read the comment. Also what's the point of saying that something is not possible and then "actually it's possible, but I don't like it anyway" in parentheses?

What you're saying is wrong: if the reader section is long, you have no problem amortizing the cache invalidation. Sharded counts are useful when the reader section is small, and the cache miss becomes dominant.

Also I don't get this "RW locks are code smell" dogma, not having RW mutexes forces you to design the shared state in a way that readers can assume immutability of the portion of the state they acquire, which usually means heavily pointer-based data structures with terrible cache locality for readers. That is, in order to solve a non-problem, you sacrifice the thing that really matters, that is read performance.

I've heard this thing from Googlers, who didn't have a default RW mutex for a while, then figured out that they could add support for shared sections for free and suddenly RW mutexes are great.


Coalescing reference counting [0] avoids almost all synchronisation. n.b. The abstract says they do "not require any synchronized operation in its write barrier" but they rely on a micro-architectural hack; in practice I'd expect one atomic test-and-set per modified object per collection.

[0] https://sites.cs.ucsb.edu/~ckrintz/racelab/gc/papers/levanon...


There are no miracles here because it is not a language "feature". It is a property of algorithms. When you divide your large task into parts and schedule execution of those on multiple threads, make absolutely sure that there is no locking (atomics are locking) happening inside each individual task.


This is one of the most common arguments for non reference counted GCs


I feel like using hybrid-rc [1] (biased reference counting [2]) instead of Arc should be more popular. You rarely need to send data between threads, so when you do you pay the atomic cost, but otherwise you're doing normal, super-fast arithmetic.

[1] https://docs.rs/hybrid-rc/latest/hybrid_rc/

[2] https://dl.acm.org/doi/10.1145/3243176.3243195


Thanks for the link. Biased reference counting is a topic I've heard of in Python now too (the Nogil discussions).

However, HybridRc would still be as contended in the scenario in the blog post, wouldn't it (before the fix that solved it)? Just checking my understanding.


That’s a good question and I’m not sure but I think so? The unit and RuntimeContext would be sent once to each thread and Rune would take the RC variant everywhere and only upgrade to ARC when they actually need ‘Send (which you don’t use). It requires changes to Rune obviously but I think just converting Arc into HybridRc types.


Small note:

> In Rust, it is very easy to generate flamegraphs with `cargo flamegraph`.

... Also in pretty much every other language that can generate perf stacktraces, because this is just a wrapper around Brendan Gregg's FlameGraph visualizer: https://github.com/brendangregg/FlameGraph


`cargo-flamegraph` uses a port of FlameGraph written in Rust and handles the "capture" and "fold" steps so I'd say it's at least a bit easier


Oh, I see! Well then, almost-full credit to them.


In C++, std::shared_ptr (similar story) has a similar effect. One of our applications (a 3D editor) went way slower when an artist was given a server-class machine (NUMA) and we had to ensure that all threads would run on a single CPU socket (yes, they were still accessing the "other" memory, but it was better somehow).


I might be wrong on this explanation, but the reason why it was faster might have been the following:

During execution you had two kinds of memory locations, some in CPU caches and some in RAM. By running all the threads on one socket, everything accessed from the cache was just a fast cache access. Everything accessed from the memory was a slower memory load. Frequently loaded/stored locations will tend to go to the cache.

In the NUMA setup, you would have a larger cache (more than one socket) which would mean that more locations were likely to be in the cache. However, if a core on a socket tries to access a location which is on another socket's cache, it will use the interconnect between them to access it.

If you have an unfortunate memory layout, this can make it so that you end up having a large percentage of the accesses using the interconnect (slower than cache access) and values get swapped between the caches constantly, which forces subsequent accesses to also use the interconnect.

Another way to avoid this except using just one socket is for the designer of a program to consider NUMA nodes as separate processing units and design around that. Both should be processing separate data and they should only share small amounts of data for synchronization/communication. Then the caches will be much less affected.


That's a pretty reasonable explanation, and one day I should sit down and write some artifical test/bench to get more details.


NUMA can be such a pain, especially because it's really difficult to account for in code, and all sorts of stuff you'd never even think of can end up impacting your performance, e.g. the Linux page cache.

With the way that processors are going, with this focus on increased core counts etc., NUMA is increasingly important to understand and account for, as processors are getting more "NUMA-ish" (to borrow a co-worker's apt description). Especially Neoverse/Arm CPUs, etc.


Cache synchronization operations between CPU cores (MOESI) is cheaper (lower latency) between cores in the same socket than across sockets, often by a factor of 2x or more. But also limiting the program to one socket would reduce contention significantly. Both help.


Arc and shared_ptr should be used sparingly - especially in languages with more or less real ownership semantics, sharing ownership of state seems like a hack job.


Good article, but if I had to guess, the subsequent L3 cache access after an increment is likely far overshadowed by the overhead of coherence messages, i.e. the cores communicating with each other.


Reminder: refcounting is garbage collection. There are parallelism friendly GCs too. I wonder if the same interface in Rust could accommodate them.


>Although using atomic instructions is often referred to as “lockless programming”, this is slightly misleading – in fact, atomic operations require some locking to happen at the hardware level.

Lockless/lockfree refers to the fact that there are no deadlocks.


That's not entirely correct. Deadlock-freedom means that at least one thread will make progress in the absence of failures.

Lock-freedom is a much stronger statement. It guarantees progress of at least one thread in finite time, regardless of failures.

Source: Dan Alistarh, Keren Censor-Hillel, and Nir Shavit. "Are lock-free concurrent algorithms practically wait-free?" December 2013


Not directly related, but https://github.com/nosqlbench/nosqlbench is very flexible benchmark tool for Cassandra and other distributed systems


This is what I'd describe as the deepest of the deep magic.


Oh, the rabbit hole goes far deeper than this.


Would love to see someone recommend some beginner-friendly books to learn more about the theory behind this blogpost. Seems like a CS degree is the only straightforward way.


Patterson/Hennessy on Computer Organization and Hennessy/Patterson on Computer Architecture are considered foundational canon.

The implied expectation underlying "beginner-friendly" seems naively misguided; it's an advanced undergrad computer/software engineering topic in the most permissive sense, and the blog's prose appears to have been tailored with that minimum target audience in mind.


Hey, noob question guy here. Can anyone explain why the last graph shows a slight performance drop going from 48 to 96 threads?


It can be many things; my guess is likely memory bandwidth - there's only so much MB/s the RAM can handle. Also, above 48 cores those are hyper-threading, and for CPU-bound tasks hyper-threading is known to be slower.

Example: https://ieeexplore.ieee.org/document/7804711

Edit: looks like there are only 12 cores per CPU, so that's 24 physical cores and 48 HT cores. So the drop must be cache thrashing?


Higher parallelism is usually about timeliness for interactive workloads, not throughput. Unless it’s just the classic fallacy that parallelism = speed.

For throughput tasks it’s often the case that you go with less parallelism to reduce Amdahl’s law a few percent, and instead investing in keeping the pipeline saturated, so that the variance in concurrent tasks is lower. Work stealing being one of the more notable tricks.


96 processes on 24 cores, not 96 cores.

Apparently 2 processes per core is more efficient than 4.


Hmm so the embarrassingly parallel task wasn't embarrassingly parallel because of an implementation detail, basically?


The issue with Rust in particular is that the zero-cost fearless yadda yadda is real, but often not feasible in practice. Arc is this magical escape hatch which always appears like an uninvited guest in all non-trivial real-world programs. Heck, Arc is even part of the async specification itself.

It's very complicated and I don't blame anyone in particular. It's an OK solution, but it's definitely not in the zero-cost category. I'd rather accept that this is the state of things and work towards systemic solutions allowing borrowing in more situations, but that requires a compile-time verifiable hierarchical thread and task model.

Fun fact: Arc was partially the reason Rust disallowed borrowing across threads. Arc had already become popular and combined with thread borrowing someone demonstrated use-after-free. Borrowing across threads seemed less important at the time, so Arc was kept. This led to the famous “leaks are safe” rule, which was sold as a mere clarification of an inherent truth.

It’s possible that all that was inevitable and correct, but I was never convinced by the arguments made in those old threads or the subsequent writings about it. To me, it looked like details were glossed over in order to get to a swift resolution. I’m quite content waiting for someone else to figure out whether Rust could have gone down a different path. Worst case, I’ll come down peacefully from this tiny hill. Best case, there’s an alternate timeline where Rust could live out its full potential in concurrent environments.


The joys of parallelism. Communication/signal propagation is hard, yet extremely rewarding once you nail it. You've really gotta be willing to dig into the guts of what you're doing though.


Pleasingly parallel is a far better term than "embarrassingly parallel" - it was a poor description and I never understood why people liked using a pejorative for an elegant solution.


Some people like to be embarrassed? It’s that kind of community.


I like the writing style of this post, it doesn't gloss over the details and conveys a lot of information from first principles. Good work pkolaczk.


Let me guess: someone dared to right-click a PDF in Windows Explorer? My computer at work locks up every time I do that. It takes exactly 24 seconds for the server to respond. I think I could physically run back and forth to the server room in less time.

(Started last month after an update. They are still trying to figure it out. I now use Ctrl-C to copy files instead, as that doesn't need the little menu to load. I kick myself every time I forget and accidentally right-click.)


Just in case you haven't already done this:

Try using ShellExView to disable all 3rd party shell extensions. There's a good chance that something like Acrobat is running some code that doesn't play nice with your windows version (or hardware etc).


I've done nothing. I can do nothing. On this network I am a simple end user, unable to change even basic settings, let alone install extra software. I can only submit a work ticket and wait for our support staff to address my issue.


Also 24-core servers from cloud services usually run your app slower than a laptop anyway.


(2021)


Updated





