Memory Reordering Caught in the Act (preshing.com)
137 points by jenhsun on May 15, 2012 | 32 comments



Probably the most important broad point of the article:

"And there you have it! During this run, a memory reordering was detected approximately once every 6600 iterations. When I tested in Ubuntu on a Core 2 Duo E6300, the occurrences were even more rare. One begins to appreciate how subtle timing bugs can creep undetected into lock-free code."

If you were shipping software to millions of people, you could reasonably never see a specific race condition error that ends up occurring on many tens of thousands of machines.

And then when you start to consider interactions between infrequent errors...
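For reference, the pattern the article is measuring boils down to the following. This is a minimal sketch using C++11 threads and relaxed atomics, not the article's pthread/semaphore harness; spawning threads every iteration makes the reordering much rarer to catch, but the shared-variable pattern is the same:

    // Store-buffering litmus test: a sketch of the experiment described in the
    // article, not its exact code. Thread A does X=1 then reads Y; thread B does
    // Y=1 then reads X. If both reads return 0, the stores were reordered past
    // the loads from the other core's point of view.
    #include <atomic>
    #include <cstdio>
    #include <thread>

    int main() {
        std::atomic<int> X(0), Y(0);
        std::atomic<bool> go(false);
        int r1 = 0, r2 = 0;
        long detected = 0;

        for (long iter = 1; iter <= 1000000; ++iter) {
            X.store(0); Y.store(0); go.store(false);

            std::thread a([&] {
                while (!go.load()) { }                  // spin until released
                X.store(1, std::memory_order_relaxed);
                r1 = Y.load(std::memory_order_relaxed);
            });
            std::thread b([&] {
                while (!go.load()) { }
                Y.store(1, std::memory_order_relaxed);
                r2 = X.load(std::memory_order_relaxed);
            });

            go.store(true);                             // release both threads at once
            a.join(); b.join();

            if (r1 == 0 && r2 == 0)                     // both loads beat both stores
                std::printf("reorder at iteration %ld (%ld so far)\n",
                            iter, ++detected);
        }
        return 0;
    }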


But now every programmer is supposed to take advantage of the multi-core processors needed to run modern operating systems, aren't they? This is going to lead to a number of prominent disasters.


I don't know about "disasters". For better or worse, lots of software ships with severe bugs that are evidently rare enough never to get fixed, yet they do still happen.

For example, every few months, OSX kernel panics on my MacBook Air, with no obvious trigger. The backtrace always implicates the AHCI driver for the SATA SSD. Still, Apple evidently don't receive enough crash reports for this particular bug to actually fix it, as it's happened since I first bought the Air in November 2010.

I have no idea whether it's a threading bug like the one in the article (I'm not about to run my system in single-core mode for months just to find out) or maybe a race condition with DMA or just a simple logic error that only rarely applies. It might even be specific to the exact SSD model and revision I ended up with in my device. But that's pretty much irrelevant - there certainly hasn't been a widespread outcry about it.

Personally, as a developer, I would definitely want to fix that kind of bug in anything I'd built. But tracking it down might take weeks of costly developer time. So Apple's bug triage probably marked this one "WONTFIX" after deciding it only happened on hardware they no longer sell, so fixing it wasn't going to have a positive ROI.

(FWIW, if it sounds like I have a personal vendetta against the AHCI driver, that's because I do. ;-) It behaves erratically in ways unrelated to this crash, but that weirdness doesn't cause kernel panics, just extra work for developers of other drivers.)


Going on a little (actually really long) tangent here about Apple and the general inscrutableness of their bug fixing ...

I have a late 2008 MBP from the first batch of unibodies and after having it for a few months I started getting this thing where the screen would blink. A lot of other people did as well, because by mid-2009 the relevant thread on their forum ran to over 75 pages, so they had to be completely aware of it. People even put up YouTube videos of it in action, which must have been a little tricky to catch because it was completely random.

In that thread on the forum some people told how they had managed to get their machines replaced. Sometimes the replaced machines had the defect as well. I never had any luck with my local support as they basically said they couldn't reproduce it so they couldn't replace it.

However, at one point a rookie tech left a support manual in pdf form on my machine after I had taken it in for servicing. There was a table in it which listed about a dozen different types of problems and the recommended approach for each. Which type of problem was the only one to have a blank cell where the solution should have been? Flickering or blinking displays.

So every time I had some other issue with my MBP, and I had several, I tried to get them to do something about the blinking problem. I finally got a second-tier support person (on the phone) to accept the issue. He said it had happened to his as well and replacing the Airport card (which happened to be the thing that had failed that particular time) had fixed it. This was about two months before the end of my three-year Apple Care period and he said that if that didn't fix it, he would try to get it replaced.

Well, it didn't, to my utter lack of surprise, but I couldn't face the hassle of giving up my machine yet again and, anyway, I was used to it, so the warranty period expired.

About a month later Apple released a firmware update that fixed the blinking.

I know there are companies that might support their products after the warranty period, and I know there are companies that stonewall, but I thought this combination was pure Apple.

Completely ignore the problem for three years and then release an update after everyone is out of warranty anyway.


Wow, that is indeed classic Apple. Does the firmware update really fix it, though? My father-in-law has had all sorts of trouble with standby and his 2011 Mac Mini + Thunderbolt Display. Supposedly "fixed" by various firmware updates, but not much has changed.


It seems to have worked.


Interesting, I'd given up all hope of the flickering being fixed.


To clarify, this is the problem where the screen blinks off really fast at completely random times. There's another problem with flickering that had something to do with scrolling.


Not really. Memory ordering and cache coherency issues are endemic to race conditions, but race conditions are bad bugs even in the absence of fun tricks by the caches.

You "fix" this the same way you fix any race: you get your synchronization right. The instructions used to implement the mutex (e.g. lock cmpxchg) are "serializing", which means that all memory operations issued before them in the instruction stream will be completed and committed before the instruction issues.


But not every programmer is going to implement their own lock-free algorithms (http://en.wikipedia.org/wiki/Non-blocking_algorithm). Experts write such algorithms, and other programmers will build their applications on top of those libraries. And even more programmers will use abstractions that aren't even libraries, but language-level abstractions.
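For instance (a sketch, and admittedly a standard-library feature rather than a true language-level construct, but the same idea): something like std::async lets you split work across cores without ever touching a lock or a fence yourself, because the future's get() provides the required synchronization. The parallel-sum task here is made up for illustration:

    // A higher-level abstraction hides the threading details: std::async runs
    // the lambda on another thread, and future::get() synchronizes with its
    // completion, so no explicit locks or fences appear in user code.
    #include <future>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> data(1000, 1);
        auto first_half = std::async(std::launch::async, [&] {
            return std::accumulate(data.begin(), data.begin() + 500, 0);
        });
        int second_half = std::accumulate(data.begin() + 500, data.end(), 0);
        int total = first_half.get() + second_half;  // get() provides the happens-before edge
        return total == 1000 ? 0 : 1;
    }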


I would say it more strongly. Threading is like crypto: unless your full-time job is writing threading libraries, don't even try to write your own.


I can't say I agree with this. With threading there are certainly extra things that always have to be in the back of your mind; however, saying you shouldn't touch threads unless you specialize in writing threading libraries is a bit extreme.


Most programs don't use anywhere near enough CPU to make parallel algorithms pay off. So you may as well keep it simple, especially if you aren't hitting the limits of your current algorithms. Simplicity means less code, which means fewer bugs.


You don't need parallel algorithms to see the effect of memory ordering bugs. Any program that runs concurrent threads (for different operations, say a thread for computation and a thread for IO) can be affected if the threads attempt to share information. Any sharing among threads must be done very carefully.
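For instance (a sketch; the names and the disk-writing scenario are made up, not from the article), the classic broken handoff between a compute thread and an IO thread looks like this:

    // Intentionally broken handoff between a compute thread and an IO thread.
    // Both the flag and the payload are plain variables, so this is a data
    // race: the compiler may hoist the flag read out of the loop, and a weakly
    // ordered CPU may make 'ready' visible before 'payload'.
    #include <cstdio>

    int  payload = 0;
    bool ready   = false;          // plain bool: no ordering, no visibility guarantee

    void compute_thread() {
        payload = 42;              // (1) produce the result
        ready   = true;            // (2) can become visible before (1)
    }

    void io_thread() {
        while (!ready) { }              // may spin forever or exit with stale data
        std::printf("%d\n", payload);   // stand-in for the IO work; may print 0
    }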


On a related note, I compiled and ran this sample on Playstation 3, and no memory reordering was detected. This suggests that the two hardware threads inside the PPU effectively act as a single processor, with very fine-grained hardware scheduling.

This is because all of the cores on the Cell processor are in-order (http://en.wikipedia.org/wiki/Out-of-order_execution#In-order...). That fact is well documented: http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf

And for the record, the "correct" thing to do is to always use the appropriate memory and instruction fences. Reasoning about where these should go can be subtle.
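For the store-load example from the article, "appropriate fence" means a full StoreLoad barrier between the store and the load in each thread. A sketch in C++11 terms (on x86 this fence typically compiles to mfence):

    // The litmus test again, but with a full fence between the store and the
    // load in each thread. std::atomic_thread_fence(memory_order_seq_cst) acts
    // as a StoreLoad barrier; with it present in both threads, the
    // r1 == 0 && r2 == 0 outcome is no longer allowed.
    #include <atomic>

    std::atomic<int> X{0}, Y{0};

    void thread1(int& r1) {
        X.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
        r1 = Y.load(std::memory_order_relaxed);
    }

    void thread2(int& r2) {
        Y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = X.load(std::memory_order_relaxed);
    }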


Actually, it's not because the PPU executes in-order, which has only to do with the way the CPU orders instructions internally and says nothing about memory ordering. For example, the Xbox 360 has in-order processors too, yet you can observe memory reordering all over the place. Both consoles use the PowerPC architecture, which is well-known to provide weak memory ordering.


Correct. Memory operation re-ordering can occur due to the way the cache-coherence mechanism is implemented. If messages on the bus between cores are not ordered, memory operations can effectively be reordered even if each core never performs any reordering (e.g. due to cache misses).

This is a great explanation: http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07...


The Xbox 360 and PS3 processors are very similar; one main difference is that the PS3 (ignoring SPUs) is single-core with two hardware threads while the 360 is triple-core with two hardware threads per core. I would imagine that if you locked the affinity of the two threads to the two hardware threads of one of the 360's cores you would see the same behavior. IIRC, the two hardware threads on those CPUs are essentially just duplicate sets of registers that each have their own instruction stream, but execute the instructions in a single shared pipeline (the benefit being that you have more instructions available to fill the rather long CPU pipeline and are less likely to waste cycles while stalled on memory fetches). As far as the actually executing instructions are concerned, there would be a single stream of instructions and thus no opportunity for the memory ordering effects that you might see with two distinct cores/processors.


You can most definitely witness memory reordering on the PS3 when you have code on the PPU and SPUs operating on the same memory.



So what's the surprise here? This is exactly why you use locks in the code: because you don't know what the compiler is going to do with your code, and you don't want to know all the details of that. The solution is not to know all those asm details, but to use tools like locks correctly.

edit: after writing lock-free code years ago, and having very subtle bugs like the ones described here, I don't think it is a good idea anymore, unless you are in a very specific scenario.


> ...unless you are in a very specific scenario.

Locks can be very slow (relatively). At my last job using lock-free code was necessary for performance in many cases.
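For a flavor of what that kind of code looks like (a sketch, not code from that job): even the push half of a classic lock-free stack needs a compare-and-swap retry loop, and the pop half is omitted here because a naive version hits the ABA problem, which is exactly the sort of subtlety the parent comments warn about.

    // Treiber-stack push using compare_exchange_weak. On failure the current
    // head is written back into n->next, so the loop simply retries with the
    // fresh value. pop() is deliberately omitted: a naive pop suffers from ABA.
    #include <atomic>

    struct Node {
        int   value;
        Node* next;
    };

    std::atomic<Node*> head{nullptr};

    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // n->next now holds the current head; try again.
        }
    }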


Apparently this happens a lot more for me on an AMD Bulldozer.

I'm getting it happening every 1600 iterations with their makefile and 2000 with -march=native. Very interesting.


This whole article was problematic to me because it starts out with:

Two processors, running in parallel, execute the following machine code

Processors don't "run in parallel." Among other things, the OS could be scheduling the threads on each CPU in any way it wants. I just can't think about the false hypothetical that the processors "run in parallel."

I'm in the process of going through the article more carefully to see if this erroneous way of thinking propagated. It seems like the author knows his stuff, though, so I'm guessing not.


I didn't find any problems.

However, one of the reasons I was uncomfortable with the setup (besides thread scheduling) is that I personally don't make any assumptions about cache coherency. To me, it could take thread 1 an arbitrary amount of time to see an update made to memory by thread 2, unless some primitive (like a semaphore... with "acquire and release semantics") is used.

Right? I mean, couldn't that account for what is happening just as much as instruction reordering? In theory, I'm fairly certain this is true. In practice, it depends on the specifics of cache coherency in x86-64; if anyone can comment on that, I'd appreciate it.
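To make the acquire/release part concrete (a sketch; the variable names are mine): a release store to a flag after writing the data, matched by an acquire load before reading it, guarantees that once the consumer sees the flag set it also sees the data.

    // Acquire/release handoff: if the consumer observes ready == true via the
    // acquire load, it is guaranteed to observe payload == 42, because the
    // acquire load synchronizes-with the release store.
    #include <atomic>
    #include <cstdio>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        ready.store(true, std::memory_order_release);        // publish
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { }   // pairs with the release store
        std::printf("%d\n", payload);                        // prints 42
    }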


This is a very nice post. I didn't know -S existed before.


On the MSVC side the option is -FA[scu]

/FA[scu] configure assembly listing

(this is from CL 15.00.30729.207 from the WDK 7.1)


I hope this isn't too dumb, but I always imagined each processor having its own set of registers.

Am I incorrect? I can't see how multiple processors can share registers without chaos?


The registers aren't shared here, but the main memory is. The example has each thread write '1' into a [shared, main] memory location and then read the other thread's memory location into a register. If both CPUs' registers are 0, that means that both reads occurred before both writes.


Of course they have different registers, but they sure don't have different memory.


I do embedded Linux development and I see bugs related to this stuff frequently. What can make things more tricky is that different architectures have different ordering guarantees. Does anyone know of any CPU architectures that intentionally sacrifice performance for a simpler, less bug-prone memory model? I'm thinking of markets like aviation, where bugs like this seem particularly scary.


There's no need for a separate architecture to avoid concurrency issues; a simpler solution is to simply use a uniprocessor system. (And be careful with interrupts.)



