The rowhammer "attack" is successful only because the hardware is just plain broken, and I consider it in the same category as things like a CPU which will calculate 1+1=3 if the computation of 1+1 is done enough times --- nothing software should even try to fix, because the problem is at a lower level. The solution is to demand that the hardware manufacturers make memory which actually works like memory should; and it should be possible, since apparently previous generations of RAM don't have this problem at all. In the early 90s Intel recalled and replaced, free of charge, CPUs which didn't divide correctly. Perhaps the memory manufacturers today should do the same for rowhammer-affected modules and chips.
> The rowhammer "attack" is successful only because the hardware is just plain broken
I too am of this opinion and am surprised this view isn't widely shared. With DDR4, we should be asking for a refund and/or starting a class-action suit, yet we're putting up with software 'mitigations' instead.
This isn't like the 2008 Phenom TLB bug [1] where the CPU was locking up so AMD released a workaround that kept it from freezing at the expense of a 14% performance penalty. This is like the floating point division bug [2] where the device no longer meets basic operational and accuracy guarantees. RAM cells bleeding into each other ought to be considered a fatal flaw, not some intellectual curiosity.
> I too am of this opinion and am surprised this view isn't widely shared. With DDR4, we should be asking for a refund and/or starting a class-action suit, yet we're putting up with software 'mitigations' instead.
I extensively test all the hardware I buy (CPU: LINPACK, RAM: MemTest86+), and if it fails any of those tests, it gets returned as "not fit for purpose". I've done this successfully a few times. A lot of other enthusiasts/power users do the same, especially if they're overclocking, and searches on other forums show plenty of users testing and finding errors (mostly other than rowhammer) in newly-bought RAM even when not overclocking. But as noted in the threads I linked to, manufacturers may be trying to cover this up and downplay its severity. Even in the original paper on rowhammer, the authors didn't disclose which manufacturers and which modules were affected, although I think this should really be treated like the FDIV bug: name and shame. I blame political correctness...
The Intel LINPACK distribution contains, besides the library, a sample benchmarking application using it, which happens to be a very intense and "real" workload (solving systems of equations, i.e. scientific computation). There are plenty of posts on various PC enthusiast forums about how to run it correctly. (And plenty arguing that it's irrelevant, mostly because their insane overclock seems fine otherwise but instantly fails this test. There's a good reason most people doing "real" scientific computing don't overclock; a lot of CPUs only barely pass this absolutely realistic test at stock speeds and voltages.)
FDIV was really not technically a serious erratum in the grand scheme of errata. The Phenom TLB bug was worse. Intel basically denied/sat on the issue for half a year, stopped just short of slandering Dr. Nicely, etc., and made it into a complete PR disaster. If they had come out the week after it was reported and just said, here's a workaround, here's an opt-in replacement program (which they finally did, but by then it was too late), you would probably never have heard about the FDIV bug -- just like the countless other errata we have software workarounds for.
In retrospect I regret bringing up the Phenom because my argument could've stood without it, and I could realistically argue either way.
But my original intention was to point out that the failure mode of the Phenom was such that it wasn't exploitable for anything other than potential denial-of-service; it was just inconvenient, and it only affected a subsystem of the CPU which could be disabled by a firmware workaround, with the CPU working fine without it.
Though you don't expect your CPU to halt and lock up, I believe it's far more insidious when you feed a device inputs and get the wrong output without any obvious indication that something went wrong, like in the case of rowhammer-vulnerable memory and FDIV.
I think that is the reason for the misunderstanding: FDIV was not really insidious in the way you describe. It was 100% predictable; certain bit patterns always gave the wrong answer in the quotient on the affected hardware, and it had a very straightforward software fix (with a performance cost, sure). You could demonstrate it immediately, but it really wasn't severe.
(Q9 and Q10 http://www.trnicely.net/pentbug/pentbug.html)
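To make that predictability concrete, the defect was reproducible with a single division. Here's a quick sketch in Python, using exact decimal arithmetic since today's hardware divides correctly; the flawed-Pentium result shown is the one Dr. Nicely documented for these operands:

```python
from decimal import Decimal, getcontext

# Dr. Nicely's classic test operands for the FDIV bug.
getcontext().prec = 25
correct = Decimal(4195835) / Decimal(3145727)
print(correct)               # 1.333820449136241...

# A flawed Pentium returned 1.333739068902037589 for this division:
# wrong from the 5th significant digit on, every single time.
flawed = Decimal("1.333739068902037589")
print(correct - flawed)      # error around 8e-5 -- not 1-ulp noise
```

Same inputs, same wrong output, every run: deterministic, demonstrable, and patchable in software.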
Rowhammer is a much more complex erratum that I don't feel qualified to comment on, especially regarding the safety of the published mitigations, but it is in a class of bugs where the outcomes are not generally predictable due to the number of variables involved.
My reason for replying initially, though, is that I don't think the line for which types of hardware defects are open to software workarounds is so cut and dried, and I don't think many people outside of kernel/OS dev realize how many errata are on the chips they use every day, with workarounds they never notice.
I don't agree; there is software that is designed to run on faulty hardware. This is often in high-radiation environments (see: outer space). I agree this is not an area where much hardening has been done in conventional security models, but in other environments it is common to use CRC error detection, parity information, or other means to ensure that even if data is partially corrupted, the original can be restored.
I see no reason to prevent someone from implementing this sort of error correction for GPG and other important cryptography.
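For plain fault detection, that kind of check is cheap; a CRC catches any single-bit flip by construction. A minimal sketch in Python with zlib's CRC-32 (the PGP-flavored payload is just a placeholder):

```python
import zlib

data = bytearray(b"-----BEGIN PGP MESSAGE-----")
checksum = zlib.crc32(data)    # stored alongside the data

# Simulate a rowhammer-style single-bit flip in the stored copy.
data[5] ^= 0x01

# CRC-32 detects every single-bit error, so verification fails.
print(zlib.crc32(data) == checksum)   # False
```

Whether that helps against a deliberate attacker is a separate question, as the replies below this point out.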
Hostile environments attack your software without intelligence. (When working with them, it may seem otherwise, but that's just cynicism.) Hostile people attack intelligently. Whatever mitigation you may imagine is possible by checking CRCs or something after the fact, you must account for the possibility that the software, the OS, or the CRC has also been attacked by a hostile intelligent adversary. The fact that we can make reliable software in the face of unintelligent attacks is not evidence that we can make secure software in the face of intelligent ones.
Rowhammer is too powerful a technique to expect secure software to run on machines affected by it. This is an attack based on using rowhammer to change bits in other VM's memory. The only sane response to that, from the perspective of writing secure software, is despair. You can't deal with attackers in possession of that primitive.
Rowhammer is largely random. You don't get to target specific bits of physical ram. You find scarce weak bits and work to get the data located there. In this case that means you can only pick a couple bits per 4KB to attack. That won't let you fake out a CRC.
That's where I'm getting a little hazy. The paper says the attacker can "induce bit flips over arbitrary physical memory in a fully controlled way." Sounds a little more advanced than "largely random" to me, and based on the article it sounds like FFS is a step up from "vanilla" Rowhammer...am I missing something?
ECC RAM makes it harder, but a triple bit flip can still get through undetected. It also depends on whether the system actually acts properly when it sees a huge number of ECC errors happening.
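To make the triple-flip aliasing concrete, here's a toy Hamming(7,4) single-error-correcting code. (Real ECC DIMMs use the wider SECDED (72,64) variant, which additionally detects double flips, but the failure mode beyond the correction radius is the same: enough flips land you near a different valid codeword.)

```python
# Toy Hamming(7,4) SEC code (illustrative stand-in for DIMM ECC):
# corrects any single-bit error, but a triple flip can land close
# enough to ANOTHER codeword to be "corrected" into wrong data.
def encode(d):                     # 4-bit int -> 7-bit codeword
    bits4 = [(d >> i) & 1 for i in range(4)]           # d1..d4
    p1 = bits4[0] ^ bits4[1] ^ bits4[3]                # covers positions 1,3,5,7
    p2 = bits4[0] ^ bits4[2] ^ bits4[3]                # covers positions 2,3,6,7
    p3 = bits4[1] ^ bits4[2] ^ bits4[3]                # covers positions 4,5,6,7
    layout = [p1, p2, bits4[0], p3, bits4[1], bits4[2], bits4[3]]
    return sum(b << i for i, b in enumerate(layout))

def decode(c):                     # 7-bit word -> (4-bit data, syndrome)
    bits = [(c >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):        # XOR of (1-based) positions of set bits
        if bits[pos - 1]:
            syndrome ^= pos
    if syndrome:                   # flip the bit the syndrome points at
        bits[syndrome - 1] ^= 1
    data = bits[2] | bits[4] << 1 | bits[5] << 2 | bits[6] << 3
    return data, syndrome

word = 0b1011
cw = encode(word)
assert decode(cw ^ 0b0000100)[0] == word   # 1 flip: corrected, as designed
bad, _ = decode(cw ^ 0b0010011)            # 3 flips in one codeword...
print(bad == word)                         # False: silently wrong data
```

The triple flip produces a nonzero syndrome, the decoder "corrects" the wrong bit, and the result passes as valid. A flood of ECC correction events is therefore itself a signal the OS should react to.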
They can pick a bit or two per page to attack, but then they're stuck with those bits.
In theory they could attack a new bit every few minutes, but that requires a system that allows the victim page to be remapped multiple times. KSM does not; any other memory-merging system could work the same way to mitigate things.
Even if they could keep remapping, it's a very slow attack that way. Reloading the checksum every ten minutes would keep you safe.
This does not make sense. If an attacker can alter your data, he can alter your CRC codes as well, or just replace the pointer to the checkCRC function with one that always returns true.
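Right: an unkeyed checksum only helps against accidents, never adversaries, because anyone can recompute it. A sketch of the difference (Python stdlib; the message and key are just placeholders):

```python
import hashlib
import hmac
import zlib

data = bytearray(b"transfer $10 to alice")
stored_crc = zlib.crc32(data)

# An attacker who can write to memory simply alters both fields:
data[16:21] = b"mallo"              # tamper with the payload...
stored_crc = zlib.crc32(data)       # ...and recompute the unkeyed CRC.
print(zlib.crc32(data) == stored_crc)   # True: the check passes anyway

# A keyed MAC is different: without the secret, the attacker cannot
# produce a valid tag for the altered message.
key = b"key-kept-outside-attacker-reach"
tag = hmac.new(key, b"transfer $10 to alice", hashlib.sha256).digest()
forged_ok = hmac.compare_digest(
    tag, hmac.new(key, bytes(data), hashlib.sha256).digest())
print(forged_ok)                        # False: tampering detected
```

Of course, an attacker with arbitrary memory writes can also flip the branch that performs the tag comparison, which is exactly the "return true" point above; keyed integrity only raises the bar, it doesn't restore trust in the machine.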
Without being an expert in this area, my gut feeling is that the fix to this problem is likely going to be funded by the end user. Given that competition continues to drive prices down, would 'secure RAM' be viable? Would you pay more for it?
> Given that competition continues to drive prices down, would 'secure ram' be viable? would you pay more for it?
It's funny you mention this, since the problem only affects newer DDR3 and DDR4 modules and older RAM (EDO modules are apparently still in limited production and being sold) does tend to be significantly more expensive. Unfortunately the rest of the hardware needs to be compatible.
This also means all the older hardware that gets scrapped in massive quantities daily is likely to contain RAM immune to this problem, which is somewhat ironic... maybe it's just a (sad) continuation of the "newer is more volatile" trend that can be traced back to thousand-year-old stone tablets which remain readable today.
Not stone, but clay tablets are probably more volatile than your USB drive. There's a huge sampling bias here.
About the main point: why isn't ECC fixing this for everybody? I'd gladly take cheaper, more volatile RAM and spend some of the savings on redundancy, so it ends up working better than the more expensive, less volatile kind.
I have the same opinion as you: I would, but most people wouldn't. The root cause of the problem is that the trend with RAM is "the bigger the better" (in terms of GBs), so we have tons of capacitors on a small surface. I'm no expert either, but I think there's no simple hardware fix for this other than going back to RAM that holds less memory, and most people won't accept that. Maybe we're hitting the limits of the current technology and should switch to another one. On a side note, two years ago one of my professors mentioned ongoing research at my university into RAM that forms crystals instead of storing electrons, but I don't know any other details about it.
There are CPUs that do memory integrity checking to contain attacks. They're designed mainly for stopping software and peripheral attacks, but they treat RAM as untrusted. They could probably be modified to deal with the new attacks.
Encrypted RAM is offered by the newest Intel server-grade CPUs (SGX, Skylake) and the next AMD server-grade CPUs (SME, Zen).
One of the main use-cases for these technologies is trusted computing in a cloud environment - the customer can assert that the hardware is securing the program state from the eyes of the computer owner!
However, the cloud is actually made from cheap commodity boxes without server-grade anything! ;)
Encrypting RAM pages would prevent the hypervisor from deduping pages between virtual machines, and this would be very negative for cloud providers who want to up the occupancy on each box as much as possible...
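The dedup conflict is easy to see: KSM-style merging keys on page content, while per-VM encryption makes identical guest pages look distinct to the host. A sketch (Python; the SHA-256 counter-mode "cipher" here is just an illustrative stand-in for the memory controller's AES engine):

```python
import hashlib

PAGE = 4096

def keystream(key: bytes, n: int) -> bytes:
    # Toy keystream (stand-in for hardware AES), keyed per VM.
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def encrypt(key: bytes, page: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(page, keystream(key, len(page))))

# Two VMs map the same shared-library page: identical plaintext.
page = b"\x7fELF" + b"\x00" * (PAGE - 4)
vm1_plain, vm2_plain = page, page

# KSM-style dedup merges pages whose content hashes match.
def physical_pages(pages):
    return len({hashlib.sha256(p).digest() for p in pages})

print(physical_pages([vm1_plain, vm2_plain]))          # 1: pages merge

# With per-VM memory encryption, the hypervisor sees distinct bytes,
# so nothing can be merged.
vm1_ct = encrypt(b"vm1-key", page)
vm2_ct = encrypt(b"vm2-key", page)
print(physical_pages([vm1_ct, vm2_ct]))                # 2: no dedup possible
```

Notably, killing cross-VM page sharing also kills the memory-massaging primitive this very attack relies on, so the provider trades density for isolation either way.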
In a few years, or perhaps longer, rowhammer-immune DDR4 and other fixed memory may become mainstream in clouds. But until then, it seems we'll have a cloud fitted out with increasingly aging cheap machines with no rowhammer immunity.
> However, the cloud is actually made from cheap commodity boxes without server-grade anything! ;)
You know, I refuse to buy anything that does not support ECC for my home desktops (and don't even pay much for it). Only my laptop got a pass from this because there was literally no option available with it.
Good to know cloud providers are not as careful... But honestly, shouldn't be a surprise.
Same here. It helps to sell it if you don't say ECC = RAM + extra cash. That's the normal method. I instead say you have two options:
1. RAM that works at this price.
2. RAM that allows more crashes or corruption of your files for slightly-lower price.
The Right Thing suddenly looks more obvious, except to cheapskates. Now I just need one with ChipKill built in - that's the next level of ECC. I haven't heard whether Intel or AMD have something similar.
Encrypted RAM as AMD is implementing it (SME) protects nicely from "cold-boot attacks" but is otherwise largely a feel-good feature. It also probably doesn't help a whole lot against rowhammer-style attacks because it's merely encrypted, not authenticated. The result is that a bit flip will effectively randomize 64 bytes or whatever the block size is but will not be otherwise detected by the hardware. I bet that clever attackers will find a nice way to take over by randomizing 64 bytes.
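A sketch of why encryption-without-authentication behaves that way. The toy 64-bit Feistel cipher below is purely an illustrative stand-in for the hardware's AES; the point is only that flipping one ciphertext bit scrambles the whole decrypted block, and nothing complains:

```python
import hashlib
import hmac

def F(key, half, r):                       # round function: SHA-256 as a PRF
    material = hashlib.sha256(key + bytes([r]) + half.to_bytes(4, "big"))
    return int.from_bytes(material.digest()[:4], "big")

def encrypt_block(key, block, rounds=8):   # toy 64-bit Feistel network
    L, R = block >> 32, block & 0xFFFFFFFF
    for r in range(rounds):
        L, R = R, L ^ F(key, R, r)
    return L << 32 | R

def decrypt_block(key, block, rounds=8):   # inverse rounds, in reverse order
    L, R = block >> 32, block & 0xFFFFFFFF
    for r in reversed(range(rounds)):
        L, R = R ^ F(key, L, r), L
    return L << 32 | R

key = b"per-vm-memory-key"
pt = int.from_bytes(b"AAAAAAAA", "big")
ct = encrypt_block(key, pt)
assert decrypt_block(key, ct) == pt        # round-trip works

# Rowhammer flips ONE ciphertext bit; decryption still "succeeds",
# but the recovered block is scrambled and no error is raised.
garbled = decrypt_block(key, ct ^ 1)
print(garbled == pt)                       # False, silently
print(bin(garbled ^ pt).count("1"))        # dozens of bits differ, not one

# With authentication, the flip is caught by a simple tag compare:
tag = hmac.new(key, ct.to_bytes(8, "big"), hashlib.sha256).digest()
bad = hmac.new(key, (ct ^ 1).to_bytes(8, "big"), hashlib.sha256).digest()
print(hmac.compare_digest(tag, bad))       # False: flip detected
```

So unauthenticated memory encryption converts a targeted 1-bit flip into a 64-byte-block randomizer, which makes precise attacks harder but gives the defender no signal that anything happened.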
Intel's encrypted RAM is authenticated quite nicely, but it's not (yet?) designed for general purpose use -- it's for SGX only right now. Using it for everything would (if I understand correctly) add considerable space overhead and possibly considerable latency.
Don't think I've seen any non-server-grade processors in even the cheapest bargain-basement VPS hosts. (Low-end dedicated is different.) Cramming as many VMs into a big server as possible seems to be too important to their cost structure for that.
We perhaps only disagree on what is "server-grade" vs what is sold for servers.
Google, for example, is famous for making big data centres out of cheap commodity boxes, and I doubt Amazon is any different. I certainly know the Rackspace blades I've played with didn't make my grade either! :)
I can't make any claims to the contrary about other providers, but I know at the very least that at one point in the not-too-distant past, the primary systems used for Rackspace Cloud hypervisors were Dell R720 rackmount servers. Maybe not the most amazing hardware, but considering how common they are, you can hardly deny they're "server-grade". The newer OpenCompute stuff is also clearly well-made hardware.
Everything I've read implies that cheap commodity servers like Open Compute are just as reliable as name brand Intel servers (not surprising considering that they're made from the same parts), and ~95% of the market appears to be satisfied with that level of reliability.
I figured it would end up in security-oriented, bare-metal hosting first, or racks people rent out for their own boxes. Didn't know something like that was on new Intel/AMD CPUs. Thanks for the tip.
I know what the root problem is. I also know it comes from an oligopoly of companies that only care about money, probably have patents on key features, and operate in a price-sensitive market. Fixing the root cause might be tricky unless you could be assured, via contracts, of volume deals from cloud and other big buyers.
Meanwhile, small teams in academia are building CPUs that knock out those and other issues. Worth bringing up given that the fix you want isn't doable for most HW designers. RAM vendors might eventually use it as a differentiator, but that's not guaranteed.
You can't entirely blame the providers for only caring about money; the consumers that choose the budget hosting options for critical applications must surely share some of it.
Server grade hardware is certainly available to cloud/VPS providers, but it turns out people are unwilling to pay $2 for a VM if there's one going elsewhere for $1.50.
"the consumers that choose the budget hosting options for critical applications must surely share some of it."
The customers expect the RAM they bought to work correctly. They might even have read papers on ASIC verification where the hardware companies brag about all the techniques they use to prevent recalls like the one Intel had. The issue is that the companies stopped doing, or reduced, verification on specific components to cut costs. What they bring in on the chips is way more than it takes to do that. So the reason must be greed, driving the profits up a little bit.
This one is the companies' fault. I'd have assigned blame differently if we were talking about the security of regular consumer products or even operating systems. Verification of repeating pieces of hardware circuits is an industry-standard practice, though. Except for RAM providers, apparently.
This blame placed on HW stems from a lack of understanding of RAM physics/electronics. As dimensions scale down, these things happen.
The market has chosen to adopt the cost benefits of smaller transistors and higher capacity for the same $. It's a mix of physics and market forces, not malfunctioning hardware.
Memory errors are particularly disturbing because they are often highly dependent on data and access patterns, and can be extremely difficult to pinpoint without special testing tools. I've personally experienced a situation where a system which otherwise appears to work perfectly well would always corrupt one specific bit of a file when extracting one particular archive.
As a testing tool, MemTest86+ has always worked well for me, and the newer versions can detect rowhammer, although there is this interesting discussion about whether it is actually a problem (to which I say a resounding YES!!!) or if there's some sort of cover-up by the memory industry:
http://www.passmark.com/forum/memtest86/5903-rowhammer-probl...
http://www.passmark.com/forum/memtest86/5475-memtest86-v6-2-...
Run it on your hardware and if it fails, I think you should definitely complain and get it fixed.