I worked on a CPU with very relaxed ordering once. It was a MIPS variant, designed by the people who had previously worked on DEC Alpha. Unfortunately, there hadn't been an architecture since Alpha that had such a relaxed model, so there was code in various parts of the Linux kernel that had gotten sloppy about memory barriers. I remember at least one in NBD, at least one in NFS, multiple in Lustre. Each one was a major PITA to debug, because (a) there was no evidence in memory of what went wrong and (b) the code would literally only fail once in a billion times. So there you are, looking at register values in a crash dump and trying to guess exactly which memory locations had to be read in which incorrect order to get to the state you're seeing. Ugh.
I have mixed feelings about all of this. On one hand, requiring every CPU to implement strong memory-order guarantees is one of the things holding back frequencies and concurrency levels. On the other hand, weaker models make bugs so likely and those bugs are so difficult that people add excessive memory barriers and performance ends up being even worse. I don't know what the answer is, but it's a complex set of tradeoffs that I wish more people knew about.
I do not think there is any strong evidence that relaxed-memory CPUs clock higher. In practice, high-performance CPUs with relaxed memory models are going to need the same additional tracking that makes TSO fast just to speed up their memory barriers anyway, so the saving is probably minimal.
You say there's no strong evidence, but then you speculatively say "are going to need"? And you refer to total store ordering, which is not required even for a barrier? Next time I'm sitting with the people who designed a chip which was the clock-speed champion for quite some time, maybe I'll ask them about the reasons.
Yes, that's precisely the tradeoff the parent is talking about — relaxed ordering enables faster clocks, but you end up paying for that clock speed when you need more, more expensive, barriers.
Yes, and I'm saying that in practice, as an architecture matures, especially in the multicore era, expensive barriers are no longer acceptable, and even relaxed-model architectures have to introduce the same optimisations that stronger architectures had to introduce earlier on, negating a lot of the advantages of having a relaxed memory model in the first place.
Have you considered alternatives besides always-on total ordering and weak ordering with full barriers? Various forms of partial barriers already exist. In theory, they can preserve the performance wins from weak ordering. In practice, programmers seem to have even more trouble reasoning about them than they do with full barriers. That's why I expressed ambivalence above. I don't have the certainty that comes with mere speculation.
Well, x86 itself requires barriers for store-to-load ordering; the cost of such a barrier had been going down for a few generations, but improvements have stalled recently.
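That store-to-load case is the classic store-buffering litmus test. A minimal C11 sketch, purely illustrative:

    #include <stdatomic.h>

    atomic_int x, y;            /* both start at 0 */

    int cpu0(void) {
        atomic_store_explicit(&x, 1, memory_order_seq_cst);
        return atomic_load_explicit(&y, memory_order_seq_cst);   /* r0 */
    }

    int cpu1(void) {
        atomic_store_explicit(&y, 1, memory_order_seq_cst);
        return atomic_load_explicit(&x, memory_order_seq_cst);   /* r1 */
    }

    /* With seq_cst the outcome r0 == r1 == 0 is forbidden; on x86 each
       seq_cst store costs an XCHG or MOV+MFENCE. Weaken either side to
       release/acquire and "both read 0" becomes legal again, even on x86,
       because each store can sit in the store buffer while the later load
       runs ahead of it. */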
Recent ARMs apparently have cheap explicit load-acquire and store-release instructions, which are great and map very well to the C11 memory model. If this means they need fewer transistors or save some power, that's even better.
The issue is that on some RISCs even implementing acquire/release semantics requires expensive barriers (and seq_cst requires even more expensive ones), the only alternative being address- or control-dependent loads, which are hard to reason about in higher-level languages.
And then there was Alpha, where even a load-consume required a barrier.
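For a rough feel of the differences, here is how compilers typically lower the C11 orderings on a few ISAs (approximate, not an authoritative table):

    #include <stdatomic.h>

    atomic_int a;

    int load_acq(void) { return atomic_load_explicit(&a, memory_order_acquire); }
        /* AArch64: LDAR     x86-64: plain MOV     older weak RISCs: load + barrier */

    void store_rel(int v) { atomic_store_explicit(&a, v, memory_order_release); }
        /* AArch64: STLR     x86-64: plain MOV     older weak RISCs: barrier + store */

    void store_sc(int v) { atomic_store_explicit(&a, v, memory_order_seq_cst); }
        /* AArch64: STLR     x86-64: XCHG or MOV+MFENCE */

    /* memory_order_consume (a data-dependent load) is essentially free nearly
       everywhere; Alpha was the one ISA where even that needed an explicit MB. */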
There are several reasons that wasn't a viable option in this case. First and foremost, the problems were usually in explicit code, not the result of optimization. Second, even for problems that are caused by optimization, such an approach tends not to work for something so timing- and workload-dependent. These problems don't cause incorrect behavior every time. The code can pass tests over and over again, and still fail often enough in production to annoy the customer - especially when they're in the kernel so the result is the whole machine crashing. In fact, reducing the optimization level might well reveal new race conditions. Thirdly, performance was one of this system's selling points. Its customers would hardly have tolerated making it slower overall.
Think of it like you would any more common garden-variety race condition. The solution is not to add sleep() calls, even if that appears to work. The solution is to put in the necessary synchronization. Failing to use the right barriers on a data structure that pings between processors is exactly the same as failing to take the right locks when accessing that data structure. It's just easier to get away with it when all the code has ever run on is x86 processors with strong ordering.
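To make the analogy concrete, here is a hypothetical sketch of that kind of latent bug (made-up code, not the actual driver):

    int data;
    int ready;                  /* deliberately NOT atomic: the buggy version */

    void writer(void) {
        data = 42;
        ready = 1;              /* nothing orders this after the data store */
    }

    int reader(void) {
        while (!ready)          /* also a data race, so the compiler is free */
            ;                   /* to hoist this load out of the loop        */
        return data;            /* on a weakly ordered CPU this can be stale */
    }

    /* The fix is the same discipline as taking the right lock: make 'ready'
       atomic and pair a release store with an acquire load, or put both
       accesses under a mutex. On strongly ordered x86 the broken version
       usually passes every test, which is exactly why it survives into
       production. */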
P.S. This same processor did at one point have a bug that prevented locks from working sometimes, but that's a whole different horror story.
Memory ordering isn't the kind of thing that can be neatly separated from your instruction scheduler, register file, etc. Some of the communication and checking that's necessary to maintain strong ordering has to be handled within each unit each cycle. I'm probably doing a poor job explaining it, though. Maybe someone even closer to these issues can do better.
> On one hand, requiring every CPU to implement strong memory-order guarantees is one of the things holding back frequencies and concurrency levels.
Are you sure about that? Modern x86 CPUs are fast, can have a lot of cores, and actually even have quite fast atomics compared to their current competition (Arm), IIRC, which is very useful for exploiting your multiple cores with fancy parallel algorithms.
I don't think this kind of thing is holding back frequencies today. Most of the time, thermals are. And that's probably the same for pretty much all the things that fall into transistor-budget constraints. The reduced approach has "failed", because good HW is "always" faster, and now the transistor budget is so high that clever HW will be what matters for a while. It even began a "few" years ago, I would say.
Back to the more precise subject: you have that huge OOO engine anyway (if you want performance); would you gain that much speed (even just theoretically) by not loading, retiring, and queuing in order? You have to be cache coherent anyway, and if you want real gains you have to avoid cache-line bouncing. On the store side I don't believe it. On the load side, hmm, maybe some small gains? Maybe that is even one of the things that lets the A12 be competitive? It doesn't seem to be a hard stop, though, because pretty much all uarch structures continue to grow, with the usual associated IPC gains.

So yeah, first, I don't believe at all that this is holding back "frequencies and concurrency levels"; maybe IPC, and maybe even efficiency, and because of those two maybe even frequency, but only indirectly (boosting for longer while the thermals are OK). For sure we have not hit any wall even with quite strong ordering.

The finer-grained programming that would be required is a joke, too; it is well known that programming with atomics should be done with sequential consistency unless you are an extraordinarily good expert and are ready to spend a huge amount of time on verification. So it is kind of back to square one: in pretty much all cases your CPU had better be fast enough for sequential consistency. At that point it will likely be possible to make it fast with a quite strong default ordering, so why not keep it convenient and reduce the likelihood of bugs?
I think for general-purpose CPUs, strong ordering is here to stay (and even the weaker ISAs still on the market will eventually become stronger), kind of the same way the page-table-walk, prefetching, speculation, etc. stories went (they all had their moment of attempting simpler HW plus more complex SW; it never worked in the end, because the HW can act as a kind of JIT optimizer and self-adapt to workloads, thanks to e.g. predictors, way more efficiently than SW could). Of course you will "always" have the possibility of store-to-load reordering, because that is a logical perf issue from first principles, not just a quality-of-implementation problem (well, you could avoid it by being extremely slow with the current uarch approach, but that is so impractical that we might as well consider it not "fixable" -- then maybe a radically different arch could get it fast, but I really don't believe it).
While the x86 bus architecture is very forgiving, its guarantees only extend up to the level of assembly language. Compilers are happy to re-order operations that the machine has so carefully sequenced.
At the source-code level, therefore, you need to use "atomic" data types and operations (and carefully) just to retain the deterministic semantics the underlying machine implements.
That's the bad news. The good news is twofold: using atomics on x86 doesn't cost any extra (i.e., you pay for fully ordered operations whether you rely on them or not), and doing it right makes your code portable to machines with "relaxed" ordering, where you only pay for ordering if you ask for it.
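A minimal sketch of that point (illustrative, not from any particular codebase):

    #include <stdatomic.h>

    int data;
    _Atomic int ready;

    void writer(void) {
        data = 42;
        atomic_store_explicit(&ready, 1, memory_order_release);      /* plain MOV on x86 */
    }

    int reader(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))  /* plain MOV on x86 */
            ;
        return data;    /* guaranteed to see 42, on x86 and on relaxed ISAs alike */
    }

The release/acquire requests only turn into extra instructions on machines that actually need them; on x86 they mainly stop the compiler from reordering or caching the flag.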
Intel and AMD manage to make a strongly ordered bus system almost as fast as a relaxed one by throwing enormous numbers of transistors at the problem. It costs more power, and is the very devil to get right, but it makes wrong programs more likely to get the right answer anyway.
That's a little oversold. In general only the most subtle code is going to rely on instruction ordering for correctness. Pedestrian users just use OS-supplied locking primitives, which will include appropriate fencing at both the ISA and optimizer levels.
I mean, yeah, compilers can reorder your instructions at build time just like CPUs can at runtime. But neither will reorder across a call to pthread_mutex_lock(). Use that. Doing otherwise is just a footcannon, unless you happen to be actually implementing a locking primitive.
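For example, a minimal sketch of the pedestrian version (assuming POSIX threads; the names are made up):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int data;
    static int ready;

    void writer(void) {
        pthread_mutex_lock(&lock);     /* neither compiler nor CPU will move */
        data = 42;                     /* these accesses out of the critical */
        ready = 1;                     /* section                            */
        pthread_mutex_unlock(&lock);
    }

    int try_read(int *out) {
        int ok;
        pthread_mutex_lock(&lock);
        ok = ready;
        if (ok)
            *out = data;
        pthread_mutex_unlock(&lock);
        return ok;
    }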
Thanks for the clarification. It only adds two fences, one into the load-ordering buffer and one into the store-ordering buffer (not I/O fences).
But it has as dramatic a performance impact as a "clear cache", which is why I usually describe it that way, even though it just sets something like two dirty flags.