nsync has wrapper macros for all the various atomics libraries, and that abstraction layer prevented it from using two things.
1. Weak CAS. nsync always uses strong CAS upstream to keep the portability abstraction simpler. Being able to use weak CAS where appropriate avoids generating code for an additional retry loop.
2. Updating the &expected parameter. Upstream nsync always does another manual relaxed load when a CAS fails. That isn't necessary with the C11 atomics API, because on failure it gives you a relaxed load of the expected value for free (see the sketch below).
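For illustration, here is a minimal sketch of the pattern using C++'s std::atomic, whose compare_exchange_weak mirrors the C11 behaviour described above (this is not nsync's actual code):

```cpp
#include <atomic>

// Sketch: a lock-free increment built on a weak CAS. The weak form may fail
// spuriously, but since we loop anyway the compiler doesn't need to emit an
// extra retry loop. On failure, `expected` is updated with the value the CAS
// observed, so no separate relaxed reload is needed before retrying.
void add_one(std::atomic<int>& counter) {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // `expected` already holds the freshly observed value; just retry.
    }
}
```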
Being able to exploit those two features resulted in a considerable improvement in nsync's mu_test.c benchmark for the contended mutex case, which I measured on RPI5.
C++ insists on providing a generic std::atomic type wrapper. So despite my type Goose being almost four kilobytes, std::atomic<Goose> works in C++.
Of course your CPU doesn't actually have four-kilobyte atomics, so this feature just wraps the type with a mutex. In that sense you're correct: the atomics "use pthreads" to get a mutex to wrap the type.
C++ also provides specializations, and indeed it guarantees specializations for the built-in integer types with certain features. On real computers you can buy today, std::atomic<int> is a specialization that will not in fact be a mutex wrapper around an ordinary int; it will use the CPU's atomics to provide an actual atomic integer.
In principle C++ only requires a single type to be an actual bona fide atomic rather than just a mutex wrapper, std::atomic_flag -- all the integers and so on might be guarded by a mutex. In practice, on real hardware, many obvious atomic types are "lock free", just not std::atomic<Goose> and other ludicrous things.
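For what it's worth, since C++17 you can query this at compile time via is_always_lock_free. A minimal sketch, with Goose standing in as a hypothetical four-kilobyte struct:

```cpp
#include <atomic>
#include <cstdio>

// Hypothetical ~4 KB type standing in for the "Goose" example above.
struct Goose { char feathers[4096]; };

int main() {
    // On mainstream hardware the integer specializations are lock free...
    std::printf("atomic<int>   lock free: %d\n",
                (int)std::atomic<int>::is_always_lock_free);
    // ...while the generic template for a huge type falls back to locking.
    std::printf("atomic<Goose> lock free: %d\n",
                (int)std::atomic<Goose>::is_always_lock_free);
}
```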
I am also curious about this, and about the ambiguity of "AARCH64". There are 64-bit ARM ISA versions without single-instruction atomics, and on those what looks like one atomic op is actually a retry loop (or an out-of-line library helper) with potentially unbounded runtime. The original AWS Graviton CPU behaved this way. The version of the ISA that you target can have a significant performance impact.
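To make the ISA dependence concrete, here's a sketch of how the same trivial operation gets lowered differently depending on the target level. The flags below are GCC/Clang AArch64 options; verify against your own toolchain's actual output:

```cpp
#include <atomic>

// The same source compiles very differently depending on the AArch64 level:
//   -march=armv8-a     : an LDXR/STXR exclusive load/store retry loop,
//                        which can spin under contention.
//   -march=armv8.1-a   : a single LSE instruction such as LDADD.
//   -moutline-atomics  : a runtime-dispatched helper that uses LSE only
//                        when the CPU supports it.
long bump(std::atomic<long>& x) {
    return x.fetch_add(1, std::memory_order_relaxed);
}
```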
It depends on which atomics. In principle most of them should map to an underlying CPU primitive, and only fall back to a mutex when the platform doesn't support them.
> At least in Linux, C++11 atomics use pthreads (not the other way around).
I have no idea what you can possibly mean here.
Edit: Oh, you must have meant the stupid default for large atomic objects that just hashes them to an opaque mutex somewhere. An invisible performance cliff like this is not a useful feature; it's a useless footgun. I can't imagine anyone serious about performance using this thing (that's why I always static_assert() on is_always_lock_free for my atomic types).
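As a sketch of that guard (OrderSlot is just a made-up 8-byte payload, not anyone's real type):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative payload; small and trivially copyable on purpose.
struct OrderSlot {
    std::uint32_t id;
    std::uint32_t qty;
};

// Fail the build if this atomic would silently degrade to a hidden mutex
// on the current target, instead of discovering it in a profile later.
static_assert(std::atomic<OrderSlot>::is_always_lock_free,
              "atomic<OrderSlot> would be a mutex wrapper on this target");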
Briefly mentioned elsewhere in the comments, but C++11 had a similar issue around the transition from a copy-on-write (COW) to a small-string-optimization (SSO) implementation for std::string. If any type is more ubiquitous than std::string, I don't know what it could be, but the transition was reasonably painless, at least in my shop.
If you're using dynamic linking, the following two tools will come in very handy:
- pldd (https://man7.org/linux/man-pages/man1/pldd.1.html) shows the dynamic libs actually linked into a running process. (Contrast this with ldd, which shows which dynamic libs would be loaded, based on the current shell environment.)
We've open-sourced the tools we use to run valgrind (and ASAN) on large mixed C++/Java code bases. The JVM in particular triggers a slew of errors which can make filtering valgrind output impractical, but the scripts we developed can handle that. FWIW, we use these tools every day on the code that goes into NYFIX Marketplace (https://www.broadridge.com/financial-services/capital-market...).
I thought it was a brilliant design, but it was dog-slow on the hardware of the time. I keep hoping someone will revive the design for current silicon; it would be a good impedance match for modern languages and OSes.
We actually measured latency and throughput to find the efficient frontier: the number of logical transactions per physical DBMS query that optimizes both.
What you find is that the relation between latency and throughput looks more like a U-shaped curve.
If you process only 1 debit/credit at a time, you get worse throughput but also worse latency: things like networking or fsync have a fixed cost component, so your system can't process incoming work fast enough, queues start to build up, and that queueing hurts latency.
Whereas, as you process more debit/credits per batch, you get better throughput but also better latency, because for the same fixed costs your system is able to do more work, and so keeps queueing times short.
At some point, which for TigerBeetle tends to be around 8k debit/credits per batch, you get the best of both, and thereafter latency starts increasing.
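To illustrate the shape of that curve, here's a toy model with made-up constants (not TigerBeetle's measured numbers): each physical query pays a fixed cost plus a small per-transfer cost, and the batch size determines both the service latency and whether the system can keep up with the offered load at all.

```cpp
#include <cstdio>

// Toy batching model: fixed cost per physical query (network + fsync) plus
// a small marginal cost per debit/credit. If throughput falls below the
// arrival rate, queues grow without bound and observed latency explodes;
// past the sweet spot, larger batches only add service latency.
int main() {
    const double fixed_us       = 500.0;  // fixed cost per batch (illustrative)
    const double per_item_us    = 0.05;   // marginal cost per transfer
    const double arrival_per_us = 1.0;    // offered load: 1M transfers/second

    std::printf("%8s %14s %12s\n", "batch", "throughput/us", "service_us");
    for (int batch = 1; batch <= 16384; batch *= 2) {
        double service_us = fixed_us + per_item_us * batch;
        double throughput = batch / service_us;  // transfers per microsecond
        const char* note =
            throughput < arrival_per_us ? "  <- can't keep up, queue grows" : "";
        std::printf("%8d %14.3f %12.1f%s\n", batch, throughput, service_us, note);
    }
}
```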
You can think of this like the Eiffel Tower. If you only let 1 person in the elevator at a time, you're not prioritizing latency, because queues are going to build up. What you want to do rather is find the sweet spot of the lift capacity, and then let that many people in at a time (or let 1 person in immediately and send them up if there's no queue, then let the queue build and batch when the lift comes back!).