> Line 1 defines an atomic variable, line 5
atomically increments it, and line 10 reads it out. Because
this is atomic, it keeps perfect count. However, it is
slower: on a Intel Core Duo laptop, it is about six times
slower than non-atomic increment when a single thread
is incrementing, and more than ten times slower if two
threads are incrementing.
The PDF changes over time, but at the moment this passage is on page 42.
--------
So according to the tests done by the author here, atomic operations are CERTAINLY slower, even in single-threaded cases.
I can't say I understand why they're slower, but something is definitely going on.
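For anyone who wants to poke at this themselves, here is a rough sketch of the kind of counter the quoted passage describes, written in Rust rather than the book's C. To be clear, this is not the author's benchmark: the function names `count_plain` and `count_atomic` are mine, and the `black_box` call is only there as an assumption-laden way to stop the optimizer from collapsing the plain loop into a constant.

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Plain (non-atomic) increments on a single thread.
/// Returns the final count and the elapsed wall-clock time.
fn count_plain(iters: u64) -> (u64, Duration) {
    let start = Instant::now();
    let mut counter: u64 = 0;
    for _ in 0..iters {
        // black_box keeps the compiler from folding the loop away.
        counter = black_box(counter + 1);
    }
    (counter, start.elapsed())
}

/// Atomic increments on one shared counter from `threads` threads.
/// Returns the final count and the elapsed wall-clock time.
fn count_atomic(threads: usize, iters: u64) -> (u64, Duration) {
    let counter = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // Atomic read-modify-write: no increment can be lost.
                    counter.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    (counter.load(Ordering::SeqCst), start.elapsed())
}
```

Calling `count_atomic(2, 1_000_000)` returns a count of exactly 2,000,000, which is the "perfect count" property from the quote. How the elapsed time compares with `count_plain(1_000_000)` depends on the hardware and compiler flags, so I won't guess at ratios here.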
Of course they are slow: atomic RMWs [1], at least on x86, stall the pipeline until the store buffer has drained, because of the implied memory barrier. What arielweisberg and I have been trying to say is that barriers and RMWs are purely local and have nothing to do with caches [2]. As such, they are purely a constant overhead on algorithms, and they scale perfectly. In the contended case, the scaling cost is determined entirely by the number of cache lines written to and the number of cores involved; it is completely independent of how many atomic operations each higher-level operation performs.
[1] Atomic stores and loads, by contrast, are extremely cheap on x86.
[2] You can see that from the cost: an RMW or membar is around 20-30 clock cycles, while cross-core communication costs on the order of hundreds of cycles.
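A rough way to test the "it's the cache lines, not the atomics" claim is to hold the number of atomic increments fixed and change only where they land. The sketch below is my own illustration, not anything from the book: `PaddedCounter` and `run_counters` are made-up names, and the 64-byte alignment assumes the cache-line size of common x86 parts. With `shared = true` every thread hammers the same cache line; with `shared = false` each thread gets its own.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// One counter per cache line (64 bytes is an assumption, see above).
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

/// Every thread performs `iters` atomic increments.
/// shared = true:  all threads write slot 0 (one contended cache line).
/// shared = false: thread t writes slot t (one cache line per thread).
/// Returns the sum over all slots and the elapsed wall-clock time.
fn run_counters(threads: usize, iters: u64, shared: bool) -> (u64, Duration) {
    let slots: Arc<Vec<PaddedCounter>> = Arc::new(
        (0..threads)
            .map(|_| PaddedCounter(AtomicU64::new(0)))
            .collect(),
    );
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let slots = Arc::clone(&slots);
            thread::spawn(move || {
                let slot = &slots[if shared { 0 } else { t }].0;
                for _ in 0..iters {
                    // Same atomic RMW in both modes; only the target differs.
                    slot.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let elapsed = start.elapsed();
    let total: u64 = slots.iter().map(|s| s.0.load(Ordering::SeqCst)).sum();
    (total, elapsed)
}
```

Both modes execute the same number of atomic operations and return the same total, so any timing gap between `run_counters(n, iters, true)` and `run_counters(n, iters, false)` comes from cache-line traffic between cores rather than from the RMW instruction itself. I'd expect the shared mode to be slower on a multi-core machine, but that is an expectation, not a measurement.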
Ultimately, I'm basing my viewpoint on this PDF, the free online book "Is Parallel Programming Hard, And, If So, What Can You Do About It?"
https://www.kernel.org/pub/linux/kernel/people/paulmck/perfb...