It seems like weak memory models get short shrift, but if you're going to program without locks it's semi-important to understand what information one gets when examining a read-write pair.
It's true that (as implied by the article) you can probably get by with just studying/programming with the C/C++ [1][2] "atomic" memory access types and letting the compiler enforce those semantics, though the reasoning behind a lot of these memory orderings is lost without understanding the motivation/arch. models behind them.
If you're interested in the C/C++ memory models there's active research into specifying them without bad behavior (thin-air reads [3]). Recent results include a semantics that makes value promises and requires a justifying execution [4] which is not totally dissimilar from those required by the official java memory model [5].
These papers take forever to read (take it from me, I am doing research in this area, in particular proofs of correctness for lock free programs). I recommend focusing on the C/C++ ones if only for the practical value.
As for a comparison: the proposed C/C++ semantics in [4] is more general and is operational. It is also formalized in Coq and comes with some good theorems (data-race-free sequential consistency being the most obvious).
The RISC memory model is axiomatic and follows the standard axiomatic approach adopted for many other memory models like Java and the current C/C++ standard. That's not a dig, they just don't care as much about rigor.
Axiomatic models consider every possible execution and then weed out the bad ones, whereas an operational semantics defines the set of all possible executions using a transition relation. If you're into math you might see these, loosely, as extensional and intensional respectively.
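As a concrete illustration of what any of these models has to decide (my own sketch, not taken from the papers above): in the classic store-buffering litmus test, the model must say whether the outcome r1 == 0 && r2 == 0 is allowed. With relaxed atomics it is; with seq_cst on every access it is not.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    }

    void t2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        // With relaxed ordering the model permits r1 == 0 && r2 == 0;
        // with memory_order_seq_cst on every access it does not.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }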
thanks! I think the RISC/WMM folks provide an operational model too, using their own "I2E" formalism. The end of the paper seems to have a proof of equivalence between their axiomatic and operational models.
Lockless concurrency was always an "interesting" subject to me, but I never thought it was practical for me to use them. Mostly because multithreaded programming generally has easier-to-use and higher-level constructs (ie: Reader-Writer Queues already written) that can be used.
However, I've recently grown to respect lockless programming as I've begun to experiment with GPU programming. OpenCL doesn't give you mutexes, and indeed, you really shouldn't "lock" an instruction pointer on a GPU, ever. (On AMD systems, a single instruction pointer runs 64 different "threads", so if one of these "threads" locks up, ALL 64 of them lock. This "Single Instruction, Multiple Data" design is the key to why GPUs have so many "cores".)
As such, the only practical way to write decently high-performance concurrent code on a GPU is through atomics (ie: compare-and-swap), which of course requires an understanding of memory barriers as well.
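For instance, the basic CAS retry loop looks like this (written with C++ std::atomic purely for illustration; OpenCL's atomic_cmpxchg follows the same load/compute/compare-exchange shape):

    #include <atomic>

    // Lockless "record the maximum value seen so far" via a CAS retry loop.
    void atomic_max(std::atomic<int>& target, int value) {
        int current = target.load(std::memory_order_relaxed);
        // Retry until either our value is published or someone else
        // has already published something at least as large.
        while (current < value &&
               !target.compare_exchange_weak(current, value,
                                             std::memory_order_relaxed)) {
            // compare_exchange_weak reloads 'current' on failure.
        }
    }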
I know it's what NVidia calls SIMT, but it's what AMD calls SIMD. ¯\_(ツ)_/¯
NVidia is somewhat justified in calling their "SIMT" concept something new, because Intel's original SIMD implementations (MMX and SSE) couldn't handle thread divergence very easily (AVX-512 adds a few features that make divergence easier to handle at the CPU level). But as far as I can tell, AMD GPUs can do thread divergence, "constant broadcasts", and all that jazz too, and AMD still calls it SIMD.
Keep in mind lockless algorithms are not necessarily more scalable than lock-based algorithms, usually have higher constant overheads, and are significantly easier to get wrong.
However, this post is, in part, about how to implement locks.
> However, this post is, in part, about how to implement locks.
I haven't read the article in-depth but yeah, it seems like it maybe would have been better titled "What every C++ systems programmer should know about std::atomic". Which is still cool but from the title I was expecting some fancy voodoo algorithms for threadsafe non-locking FIFO queues or something.
I think it's meant to be pretty fundamental. It's not supposed to include "voodoo algorithms" because it literally says "what every systems programmer should know...". The voodoo algorithm is a voodoo algorithm because not everyone needs to know it. In contrast, a mutex and all of that stuff is a consequence of how memory sequencing on a CPU works. If you understand those concepts, the rest is a hop, skip, and a jump away.
I guess my beef was mostly with the word "lockless", which I'd hoped meant "not using any wait-for-other-thread type semantics" but in this context seems to mean "implementing locking without using OS-provided locking primitives, although the standard library is fine."
I was hoping that there was some cool new trick I'd missed or fundamentally different way to do concurrency than locking. I get disappointed easily in cases like this. :/
Me too. If you can do lockless add-to-queue and remove-from-queue, that's valuable. It's hard to get that right, especially on non x86/amd64 platforms. Every network packet, and often every interrupt, goes through a queue operation like that. Not stalling the other CPUs on each interrupt is important.
I think it's worth thinking in terms of the probability of contention per cycle. If the odds of contention are very low, lock-free will be much better (imagine a shared data structure that is vast compared to the slice of it a typical access touches).
Also, I think it's worth noting that it's really not that much more complicated than a simple mutex if you use the default sequential consistency memory order on an atomic. When you start attempting to optimize around certain patterns using the weaker memory orders, yea, things get a lot tougher.
If the odds for contention are low then something like a futex is pretty snappy, too.
Lockless solutions fit naturally in certain cases, for example producer-consumer queues where only a few invariants need to be respected: the producer must not overwrite what the consumer hasn't consumed yet, and the consumer must not advance into areas the producer hasn't filled yet. With a lockless approach both the producer and the consumer can keep operating as long as they don't step on each other's toes, whereas a mutex or read-write lock would make access to the queue mutually exclusive between producer and consumer.
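A minimal single-producer/single-consumer ring buffer along those lines might look like this (a sketch, assuming a power-of-two capacity and exactly one producer thread and one consumer thread; not production code):

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>  // N must be a power of two
    class SpscQueue {
        T buf[N];
        std::atomic<std::size_t> head{0};  // advanced by the consumer
        std::atomic<std::size_t> tail{0};  // advanced by the producer
    public:
        bool push(const T& v) {            // producer only
            std::size_t t = tail.load(std::memory_order_relaxed);
            if (t - head.load(std::memory_order_acquire) == N)
                return false;              // full: don't overwrite unconsumed data
            buf[t % N] = v;
            tail.store(t + 1, std::memory_order_release);  // publish the slot
            return true;
        }
        bool pop(T& out) {                 // consumer only
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;              // empty: don't run ahead of the producer
            out = buf[h % N];
            head.store(h + 1, std::memory_order_release);  // free the slot
            return true;
        }
    };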
I, myself, like the approach of building my data out of near-immutable blocks and just replacing the references using atomic compare-and-swap, for example.
There are two-lock queues that allow consumers to mostly run independently from the producers while still using bog-standard locks.
Still you are right that lock-free algos really shine in implementing message queues, but mostly because they can be implemented with fewer or cheaper memory operations, so even in the uncontended case, the lock-free queue can shave some cycles.
> I, myself, like the approach of building my data out of near-immutable blocks and just replacing the references using atomic compare-and-swap, for example.
Yes, one huge advantage of lock-free algorithms is that you can have read-mostly data structures with no contention between readers (most rw-lock implementations do require readers to write to some shared state).
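A sketch of that publish-by-CAS pattern (my illustration; it deliberately leaks the old block, because safe reclamation - hazard pointers, RCU, epochs - is the genuinely hard part in C++ without a GC):

    #include <atomic>

    struct Config { int a; int b; };   // treated as immutable once published

    std::atomic<const Config*> current{new Config{1, 2}};

    // Readers: a single acquire load, no writes to shared state.
    int read_a() {
        return current.load(std::memory_order_acquire)->a;
    }

    // Writers: build a fresh block, then CAS the pointer over to it.
    void update(int new_a, int new_b) {
        const Config* old = current.load(std::memory_order_acquire);
        const Config* fresh = new Config{new_a, new_b};
        while (!current.compare_exchange_weak(old, fresh,
                                              std::memory_order_release,
                                              std::memory_order_acquire)) {
            // 'old' was refreshed on failure; retry against the latest block.
        }
        // NOTE: 'old' is leaked on purpose -- reclaiming it safely is the
        // hard part (hazard pointers, RCU, epoch-based reclamation, ...).
    }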
Yup using a layer of indirection definitely works well. Regarding the futex, I definitely agree that most futex implementations on most hardware are designed to spin and resolve quickly in low contention scenarios, but this won't be true on all consoles/hardware/OS configurations. For certain cases (like a game engine), I'd be less inclined to rely on the futex to do the right thing.
Futexes are a Linux-specific, kernel-level waiting primitive [1]. The point of the OP is that a mutex implemented on top of a futex has a purely userspace fast path that needs only a single CAS for each acquire and release in the uncontended case, so in that scenario a userspace spin lock is often not significantly better (though a spin-lock release only needs a store on some machines). Spin locks are significantly worse under contention, though.
[1] futexes enable Fast Userspace muTEXes, but futexes themselves are kernel-space and not particularly fast.
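To make the shape of that fast path concrete, here's a rough sketch (it spins instead of calling the real futex() syscall on the slow path, so it's really a spin lock with the same uncontended cost profile):

    #include <atomic>

    class TinyLock {
        std::atomic<int> state{0};   // 0 = free, 1 = held
    public:
        void lock() {
            int expected = 0;
            // Fast path: a single CAS when uncontended.
            while (!state.compare_exchange_weak(expected, 1,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
                // A real futex-based mutex would call futex(FUTEX_WAIT)
                // here instead of spinning.
                expected = 0;
            }
        }
        void unlock() {
            // Release is just a store with release ordering
            // (on x86/TSO this compiles to an ordinary mov).
            state.store(0, std::memory_order_release);
        }
    };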
I think it's usually the other way around: when the odds of contention are very low, locks are better. Uncontended lock acquisition is relatively cheap, and it is paid once. The downside of using a lock is that it blocks other threads contending for that lock, which matters little when contention is low - hence why locks are better in that regime.
Lock free algorithms are generally better when the odds of contention are high, as you can devise algorithms that allow all threads to make progress at the same time. In general, this allows more parallelism, but the cost is that it typically requires more overall atomic operations.
> Lock free algorithms are generally better when the odds of contention are high, as you can devise algorithms that allow all threads to make progress at the same time. In general, this allows more parallelism, but the cost is that it typically requires more overall atomic operations.
I can't say I agree here. It really depends on the cost of the swap operation required to make this work. Consider a lock-free map, for example: under high contention, I'd rather wait on a lock than copy the map while spinning on an atomic. What would push the needle one way or the other (with respect to locking vs lockless) is a blend of the probability of contention, the cost of the loop itself, and the cost of the swap.
I don't know what you mean by "copy the map while spinning on an atomic." Why would we copy the map? And by "spinning on an atomic", do you mean spin-waiting on some flag? That is, morally, a lock. A well-designed lock-free algorithm lets all threads make progress at the same time, and when there is contention, offers an alternative that hopefully is not just redoing the same work. The cost is more atomic operations, which means that under low contention you're paying more overhead than needed.
It depends: under high contention, waiting writer threads will only touch the single shared cache line containing the lock (continuously if spinning, once if there is a proper wait queue). In a lock-free algo the writers might be stepping on each other's toes all the time, with heavy cache-line ping-pong, which is not good for overall throughput.
Lock free algos are very good when you have many pure readers and few or occasional writers.
In my experience, once you get up to 100s of threads potentially touching global structures, locks impede scalability significantly, even if each of those threads is modifying the global data. The key is to design your algorithms around the kinds of structures that are amenable to lock-free algorithms. (The experience I'm referring to: http://www.scott-a-s.com/files/pldi2017_lf_elastic_schedulin...)
The Vega64 GPU that AMD just released has 4096 "threads" operating at one time (actually in flight), with up to 40,960 of them "at the ready" at the hardware level (kinda like Hyperthreading, except GPUs keep up to ~10 threads per "shader core" in memory for quick swap in-and-outs). Subject to memory requirements of course. A program that uses a ton of vGPR Registers on the AMD system may "only" accomplish 4096 threads at a time, and maybe only 5 per core (aka 20,480) of them are needed for maximum occupancy.
It's a weird architecture because each "thread" shares an instruction pointer with its neighbors (NVidia has 32 threads per warp, AMD has 64 work-items per wavefront), so it's not "really" the same kind of "thread" as a Linux pthread. But still, the scope of parallelism on a $500 GPU today is rather outstanding.
All of these threads could potentially hit the same global memory at the same time. I mean, only if you want bad performance of course, but it's entirely possible, since the global memory space is shared between all compute units in a GPU.
Yes. See the paper I linked to. I work on a parallel and distributed dataflow system called IBM Streams. When we have thousands of our operators running on a large system, it is realistic for there to be 100s of threads.
The lock could still be faster. Atomic ops are more expensive than regular opcodes even when there is zero contention. On x86, taking a lock requires only a single atomic opcode and it can be released with a normal write (because of TSO), and since it's uncontended it won't spin at all. Therefore any alternative lock-free algorithm that uses more than a single atomic op is actually more expensive.
As the text puts it:
"And as always, consider the tradeoffs you make between complexity and performance. Lockless programming is a perilous activity with a proud tradition of code that is ever so subtly wrong."
Lock-free using CAS-type operations looks good in theory, but having to do a cache flush for every atomic operation really bogs things down once you get into complex algorithms, so the simple lock ends up being cheaper.
An alternative lock-free approach is to segment memory between threads and then coordinate using messages passed over ring buffers. This can be done entirely without atomic operations, but it hits other scaling problems and is also quite difficult to implement, so it's probably not widely applicable.
> Also any lock implementations also use a cas or equivalent internally
But if you have a data structure with, say, 20 ints, you have two approaches.
1. Use a mutex (which may be CAS-based under the hood) at the beginning and end: 1 lock and 1 unlock. The 20 ints are then accessed with non-atomic operations.
2. Use atomic operations on those 20 ints. But in this case, you pay the "atomics cost" each time.
In many cases, #1 may be faster than #2.
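For concreteness, the two shapes look roughly like this (my sketch; which one wins depends on how the fields are actually used together, and #2 also gives up any consistent snapshot across the 20 fields):

    #include <atomic>
    #include <mutex>

    // Approach #1: one lock/unlock pair, plain accesses inside.
    struct LockedCounters {
        std::mutex m;
        int v[20] {};
        void bump_all() {
            std::lock_guard<std::mutex> g(m);   // one atomic acquire
            for (int& x : v) ++x;               // 20 ordinary increments
        }                                       // one release on unlock
    };

    // Approach #2: every field atomic; every access pays the atomic cost.
    struct AtomicCounters {
        std::atomic<int> v[20] {};
        void bump_all() {
            for (auto& x : v)
                x.fetch_add(1, std::memory_order_relaxed);  // 20 atomic RMWs
        }
    };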
-----
Sorry for the 2nd response, but this is a different idea from what I was talking about in my 1st response.
Using 20 CASes to update 20 ints is probably not the best idea. There are significantly more than two approaches (seqlocks, RCU, persistent data structures, etc.). Which one works best depends on the actual use case.
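A seqlock, for instance, fits the "read often, write rarely, 20 plain ints" shape nicely. Rough sketch below (simplified; a fully standards-clean version makes the data accesses relaxed atomics - see Hans Boehm's seqlock paper - this only shows the shape):

    #include <atomic>

    // Simplified seqlock: one writer, many readers, 20 plain ints.
    struct SeqLocked {
        std::atomic<unsigned> seq{0};
        int data[20] {};

        void write(const int (&src)[20]) {                // single writer assumed
            seq.fetch_add(1, std::memory_order_acq_rel);  // odd = write in progress
            for (int i = 0; i < 20; ++i) data[i] = src[i];
            seq.fetch_add(1, std::memory_order_release);  // even = stable again
        }

        void read(int (&dst)[20]) {
            unsigned s0, s1;
            do {
                s0 = seq.load(std::memory_order_acquire);
                for (int i = 0; i < 20; ++i) dst[i] = data[i];
                std::atomic_thread_fence(std::memory_order_acquire);
                s1 = seq.load(std::memory_order_relaxed);
            } while (s0 != s1 || (s0 & 1));               // retry if a write overlapped
        }
    };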
> CAS don't do cache flushes. Caches are fully coherent
Not in the sense of flushing from cache to memory, but yes in the sense of data from a local core cache needing to be sent to another core.
Caches may be coherent but coherency is definitely not free, especially on multi-processor systems, which you probably care about if you are writing this kind of code.
CAS has no effect on the cache (other than requiring the cache line to be in the exclusive state for the duration of the load+op+store). CAS and other atomic RMW ops are, on all common architectures, purely local operations that do not affect the coherency fabric (no more than a plain store would; memory barriers do not even do that).
The only thing that can be said to be "flushed" by a CAS (at least on x86, where it has full barrier semantics) is the store buffer (which is not a cache), and even there all it does is prevent more recent loads [1] from completing before every store older than or equal to the CAS has completed. The store buffer is not really flushed; it still drains at the same speed it would without a CAS.
[1] or any other operation for simplicity of implementation.
> CAS don't do cache flushes. Caches are fully coherent so they never enter into the window in any lock-free algorithms.
Perhaps "Cache Flush" is the wrong word, but as long as cores are forced to communicate, you're going to get a performance penalty.
I know when you do a simple atomic_increment on x86_64 platforms, things can be 2x slower at a minimum than a non-atomic increment.
"Cache Coherence" simply means that the CPU handles all of those edge cases for you. But it still needs to be handled, and that entails a performance penalty.
If you have a 4-core / 8-thread machine (like a recent Intel i7), it's faster to have 8 non-atomic counters that are added up at the end than to have 1 atomic counter with 8 threads doing atomic increments on it.
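A sketch of the two patterns side by side (both shown in one loop just for brevity; a real benchmark would time them separately and would pad the per-thread slots to avoid false sharing):

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const int kThreads = 8;
        const int kIters = 1000000;

        // Contended: every increment is an atomic RMW on one shared cache line.
        std::atomic<long> shared{0};

        // Uncontended: one plain counter per thread, summed at the end.
        // (A real version would pad each slot to a full cache line so that
        // neighbouring entries don't false-share.)
        std::vector<long> local(kThreads, 0);

        std::vector<std::thread> ts;
        for (int t = 0; t < kThreads; ++t) {
            ts.emplace_back([&, t] {
                for (int i = 0; i < kIters; ++i) {
                    shared.fetch_add(1, std::memory_order_relaxed);
                    ++local[t];   // plain increment, no sharing intended
                }
            });
        }
        for (auto& th : ts) th.join();

        long summed = 0;
        for (long v : local) summed += v;
        std::printf("atomic=%ld summed=%ld\n", shared.load(), summed);
        return 0;
    }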
----------------
Any core that modifies a memory location invalidates that cache line in all the other cores' caches. (ie: Core1 does atomic_inc(x), so Core2/Core3/Core4 now hold stale copies of "x", and the atomic_inc(x) effectively forces their cached copies to be discarded.)
Again, I'm not an expert, but that's my understanding of how modern systems work.
-------
I looked up the cache-latency numbers in the Intel Architecture Manual: if the data is in another core's L1 cache, the latency increases dramatically, especially on a "Dirty Hit" (ie: Core1 is trying to read "x" while Core3 holds a modified copy of "x").
In effect, it's more efficient for an Intel CPU to pull something from the L3 cache than to fetch a "Dirty" line from another core's L1 - and based on those numbers, even a "Clean" hit is more costly than going to L3!
You pay the sharing cost regardless of whether the operation used to access the cache line is atomic or guarantees visibility. At least on x86. See false sharing.
The extra penalty introduced by common atomic operations that make strong ordering guarantees is waiting for the store buffer to drain all prior stores to L1 before the result of the atomic operation itself can become visible.
I hesitate to make generalizations about memory visibility requirements and costs as it is very API and hardware specific. Weaker guarantees exist and would have different costs.
TL;DR: the total memory-system overhead is the same; the difference is that the atomic operation stalls the CPU performing it unless it can speculate past it.
Also blind stores can be resolved in the background and don't block the CPU from continuing as long as the cache line is never read from.
> Line 1 defines an atomic variable, line 5 atomically increments it, and line 10 reads it out. Because this is atomic, it keeps perfect count. However, it is slower: on a Intel Core Duo laptop, it is about six times slower than non-atomic increment when a single thread is incrementing, and more than ten times slower if two threads are incrementing.
The PDF changes over time. But this line can be found on page 42 at the moment.
--------
So according to the tests done by the author here, atomic operations are CERTAINLY slower, even in single-threaded cases.
I can't say I understand why they're slower, but something is definitely going on.
Of course they are slow: atomic RMWs [1], at least on x86, stall the pipeline while the implied memory barrier waits for the store buffer to drain. What arielweisberg and I have been trying to say is that barriers and RMWs are purely local and have nothing to do with caches [2]; as such they are a purely constant overhead on algorithms and they scale perfectly: in the contended case the scaling cost is completely determined by the number of cache lines written to and the number of cores involved, and it is completely independent of the number of atomic operations performed per operation.
[1] atomic stores and loads are extremely cheap on x86.
[2] you can see that from the cost: a RMW or membar is around 20-30 clock cycles, while cross-core communication costs on the order of 100s of cycles.
One reason for this is that for the longest time they weren't really available in a cross platform way. I'd learn about e.g. mutexes in a computer architecture class, and then I could immediately go use them with boost or pthread or whatever.
I'd also learn about atomic operations, atomic cas, and memory barriers, but short of some very specific game libraries with weird macros for this, I couldn't find them anywhere.
It seemed like we were expected to sprinkle volatile everywhere, remember rules about unaligned access, and write our own platform specific macros for memory barriers. If you're primarily testing your code on an x86 laptop, it's very difficult to be confident that you're doing it correctly.
> I'd also learn about atomic operations, atomic cas, and memory barriers, but short of some very specific game libraries with weird macros for this, I couldn't find them anywhere.
How come? These days atomic ops are included in C and C++ standard library. And there are atomic operations for every important compiler out there and if you need portability across compilers (without relying on new C/C++ standards), there are portability libraries (e.g. boost).
> It seemed like we were expected to sprinkle volatile everywhere, remember rules about unaligned access, and write our own platform specific macros for memory barriers.
Unless you're using Java, "volatile" is not useful for concurrent programming. When you see volatile keyword in C code, it's almost always wrong or there for a legacy reason.
Sounds like your computer architecture classes were not up to date or very well taught.
Although I did encounter memory barriers early on while playing with lock-free concurrency, I didn't really grok them until much later. Initially I had a very hard time, while thinking in terms of reordering, figuring out which barrier goes where.
But the C++ memory model (and the acquire/release consistency model in general), which does away with explicit barriers and instead attaches synchronization behaviour to store->load edges [1], clicked almost immediately, and in retrospect it made it easier for me to describe an algorithm in the more informal reordering-based model.
[1] barriers are available, but are discouraged and are still defined in terms of acquire/release and sequential consistency.
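For reference, the canonical store->load edge as I think of it (my sketch, not from the article): the consumer's acquire load synchronizes with the producer's release store, so the plain payload write is guaranteed to be visible with no explicit barrier in sight.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                   // plain write
        ready.store(true, std::memory_order_release);   // publish
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // pairs with the release store
            ;                                           // spin until published
        assert(payload == 42);   // guaranteed: the edge orders the payload write
    }

    int main() {
        std::thread a(producer), b(consumer);
        a.join(); b.join();
    }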
The threading example given in the article is limited to a single writer and multiple readers. Things get interesting when you want lock-free concurrency with multiple readers and writers. One of the well-known techniques is software transactional memory (https://en.m.wikipedia.org/wiki/Software_transactional_memor...). But it hasn't really caught on in industry and still lives mostly in academia.
There are lock-free algos that support multiple writers. The reason they are sometimes avoided is that, in addition to often being more complicated, they usually scale badly; that's inherently true of any algorithm violating the single-writer principle, not just lock-free algos, and it's true of STM as well.
I wish this were more useful. The real sticking point for me is that you can run out of transaction contexts - which means you have to have a full software implementation as backup.
The only case I can think of where I would actually use the thing is if I already had an STM runtime and I wanted to cut down some of the constant-time overheads.
The other problematic aspect is that the transaction scheduler policy is pretty important for contended performance and as part of the broader system design; in this case you have to live with whatever Intel gives you.
I implemented a lockless IPC/DIPC (Interprocess Communication and Distributed IPC) library in the 2000-2002 time frame.
The library was used on vxWorks, QNX and Linux (both kernel and user space) and was tested heavily on 1-2GHz Xeon SMP CPUs of that time.
For memory management, it was implemented in C++ and the message (memory) alloc/free calls were all lockless. One could allocate memory in the kernel, send it as a msg to user space, and use it there without a copy. The limitation was that the maximum number of messages was statically pre-allocated for the system. The library managed the pre-allocated pools of various sizes with lockless APIs; when a pool ran out, the API returned an error.
The only primitive the library depended on was Atomic_Inc/Dec - absolutely no mutexes, spin locks, etc. in the non-blocking API code path.
The library was designed to support an HA (High Availability) TCP/IP stack and routing protocols. The APIs were not that complex; the most complicated parts were the regression-testing code and the diagnostic code. The API was used by a team of 100+ developers, and I needed to help them debug complex IPC/DIPC issues - and prove to them that when an error was returned, it was not a library/API issue.
There was a trace mechanism that tracked all the messages that went through the system(s). The trace system was also completely non-blocking and lockless, and worked both in kernel and user space. All context switches in the Linux kernel were tracked - very similar to FTRACE in the kernel today. I hacked the kernel/BIOS to preserve the DRAM content of the trace system across warm boot; this way I could recover all the messages and the last context-switch info even for kernel hang and crash issues, for postmortem analysis. The trace system used rdtsc for time stamping and had 0.5 nanosecond resolution on a 2GHz Xeon CPU.
The API/library worked fine; the overall system complexity was the big issue. We could demo TCP/IP, BGP and various routing protocols doing HA switchover and in-service OS upgrade - but only as a demo. After switchover the state syncing took a long time, and the reliability of the HA TCP/IP stack and routing protocols was a big issue.
Based on the lessons learned from that project, I went on and coded a different application-level HA system. That system built everything on top of standard Linux APIs/utilities. It worked much better, and the HA port of the SW took only 1 dev (me) 3 weeks to code. It supported in-service SW (including Linux kernel) upgrade, HW/SW fault detection and automatic switchover, and it was deployed at Comcast. Fault detection took < 50 ms and switchover was almost instant. Full state sync was done with wget over the management Ethernet interface and always took < 1 second; all incremental state updates were done over UDP and took 1 pkt each.
The funny Biz/$ outcomes from these two projects:
The first one: the startup sold only one HA router, for $200k, and burned $100M+ of VC money, but was sold to Nokia for $400+ million. I was able to pay off my mortgage with the options from that company - not FU money, but an OK outcome.
The 2nd one: the startup sold $40M of products for $60K from that project to Comcast and various cable companies at 60% margin. That product (with 19+ FPGAs) was created/coded/developed by ~8 HW/FW/SW engineers (including the 2 founders). The VCs got "a seasoned CEO", who managed to raise a few more rounds, kick out the founders (MIT PhDs), hire 160+ people, and run the company into the ground.
1. http://en.cppreference.com/w/cpp/atomic/atomic
2. http://en.cppreference.com/w/c/atomic
3. http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
4. https://people.mpi-sws.org/~dreyer/papers/promising/paper.pd...
5. http://rsim.cs.uiuc.edu/Pubs/popl05.pdf