ECC has been withheld as an artificial market segmentation mechanism for far too long, and it needs to come to an end. RAM, just like SSDs and HDDs, ought to have some amount of self-protection against basic errors; every place data is stored, even for short periods, needs this.
What's really annoying is that the current generation of Intel CPUs support ECC, they just don't implement it in consumer chipsets. You can get working ECC with a W680 motherboard, but those are very expensive and availability is slim.
AMD Ryzen CPUs have ECC support, but it varies by motherboard. My home NAS runs a Ryzen on an ASRock Rack motherboard with ECC RAM.
I never did figure out a way to verify that the ECC is working/if it is able to report errors (to the kernel?). It was also a bit hard finding the right ram, but there’s a brand called Nemix that I found. I was a bit sketched out by the brand, but the chips themselves were Micron.
AMD used to not reserve ECC for non-consumer lines. It wasn't very popular with consumers, so I can't really blame them for not supporting it well. A good balance is to make it work but leave it officially unsupported on consumer lines; then it's up to motherboard vendors and buyers to take advantage if it's worthwhile to them. That's enough to avoid cutting into non-consumer lines or adding support costs to consumer pricing.
One thing that could change this quickly is if Apple used ECC. Others would follow so as not to seem inferior. I don't hear Apple users complain much at all about the lack of ECC options.
I wasn't aware. Did some searching around and it mostly concerns the Mac Pro/Studio products in discussions. I was thinking more along the lines of regular MacBook Pros and such. Using ECC in products priced much higher doesn't change things elsewhere.
Not all of them; AFAIK the AM4 processors with integrated graphics don't, unless you have a Pro-branded version.
> I never did figure out a way to verify that the ECC is working/if it is able to report errors (to the kernel?)
The easiest way is to induce an error. Some of the memtest tools can try to do this, but you can also tweak memory voltage and timings to make things less reliable, and you should be able to get errors, and error reporting, as a result.
I think the question is where are the errors reported? Would you see logs in dmesg for example? Or is there some smartctl-like tool that exposes the raw correction counters?
I was looking into getting a motherboard for a consumer CPU with ECC support several months ago. The fact that AMD and the motherboard manufacturers don't explicitly say they support ECC is the reason I went with an Intel Supermicro motherboard that did explicitly say it supports ECC. I agree with the article, though, that it's crazy I have to resort to a workstation/server motherboard to be certain I'm getting ECC.
To see if ECC is working, try overclocking the RAM if possible and looking for EDAC errors (they'll show up in dmesg) or by running edac-util. Eventually you should see something reported.
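On Linux, the answer to "where are the errors reported" is the kernel's EDAC subsystem: corrected and uncorrected error counters live under sysfs, and edac-util mostly just reads them for you. A minimal sketch that reads the standard per-controller counters directly (the sysfs layout is the stock EDAC one; the base path is parameterized here so the function can be tested without real hardware):

```python
from pathlib import Path

def edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC sysfs tree.

    ce_count / ue_count are the standard EDAC per-controller files; the same
    events also appear in dmesg as they occur.
    """
    root = Path(base)
    counts = {}
    if not root.is_dir():          # no EDAC driver loaded (or no ECC at all)
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        ce = int((mc / "ce_count").read_text())
        ue = int((mc / "ue_count").read_text())
        counts[mc.name] = (ce, ue)
    return counts

if __name__ == "__main__":
    for mc, (ce, ue) in edac_counts().items():
        print(f"{mc}: {ce} corrected, {ue} uncorrected")
```

An empty result usually means no EDAC driver is loaded for your memory controller, which is itself a useful signal that ECC reporting isn't active.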
I suspect that other axes of CPU segmentation, e.g. RAM channels and PCIe lanes, are artificial in a similar way: the cost of producing a die is more or less the same, but the pro CPUs have more of them enabled.
ECC RAM would actually be a boon to everyone, including gamers.
ECC means not only that you know precisely when you've gone too far with overclocking, but it potentially allows overclocking a bit further, since some amount of trouble can now be tolerated.
It also means you're not going to break your OS by playing with this stuff. Memory corruption carries a huge risk of disk corruption, which can mean things like corrupt data, random crashes or an unbootable system that persists even after reverting everything to defaults.
Indeed, because I had ECC RAM I felt comfortable overclocking it, and managed to go from 2400 MT/s to 3200 MT/s on my home server. A massive difference in performance, while paying only slightly more for ECC.
I doubt that would actually be useful with overclocking. I don't know the architecture of the modern PC well enough to say with 100% confidence, but on embedded architectures the RAM has its parity bits checked when data is placed on the bus. If the error happens on retrieval (or was already present), then the ECC saves you, but if it happens anywhere else... not really. I don't know if ALUs, for example, automatically include the parity bits in their computation.
The GTX 1080 has GDDR5X, which has ECC. I was able to dial in my memory OC on that card by increasing the clock and then running memory throughput benchmarks until throughput was maxed out. At the point where errors start, the throughput would decrease, but nothing would fail.
Assuming stationary processes this is pretty nice. Maybe bump back 5% for margin.
I believe that's inside the chip, much like DDR5. However that doesn't help if there's an error introduced by the packages, pin, board traces, CPU pins, or CPU.
Some people prefer to trade off stability for a slight performance improvement. With modern hardware I don't think it's worth it, to be honest. I want my computer to work day in and day out, even if it means a 2% lower score in some benchmark.
> Some people prefer to trade off stability for a slight performance improvement.
In my PC, I've got an Intel Core i5-3570K (3.4 GHz stock) overclocked to 4.5 GHz or something silly like that. I used the motherboard manufacturer's "one touch overclocking" feature to determine that speed years ago and haven't touched it since.
The performance improvement is more than slight. Stability is rock solid across thousands of hours of gaming. I'm not running advanced cooling. $60 Corsair closed loop cooler and three Noctua case fans. System runs cool and quiet, fans throttle up and down nicely. Near silent when idle and still rather quiet under load.
That is a pretty decent gain in performance for zero drawbacks and essentially zero effort.
That's around the last generation when it was worth overclocking things, because the headroom on a typical consumer-grade -K CPU was at least 25%.
Personally, I run a 4670K at 4 GHz, and it's been rock stable as my main machine for the last 8.5 years. It's finally beginning to show its age in '22-'23, but the longevity and use I got out of a $200 processor is incredible.
But when CPUs auto-turbo to 5+ GHz, I agree, overclocking sort of loses its luster.
Yeah personally I’d rather build a PC with efficient power supply and quiet, high performance cooling. No gaudy LEDs either, just a plain case. I want it to be very stable and reliable and unobtrusive.
There are also a lot of cases where you can overclock without sacrificing stability. The standard clock speed for any line of processors is simply the minimum it is tested for, but you sometimes get lucky and get a better chip with more viable transistors. So you can boost the clock on those and reap the benefits without any drawbacks.
There are sites and services that do "binning" where they test the specific chips and you can buy ones that have been vetted to clock higher.
XMP profiles are another example where you can boost RAM speeds without much risk of instability [1]. I assume the manufacturer is just doing binning on their end and selling the higher performing RAM at higher prices.
And the performance differences can be significant: 20% or more in certain workloads.
And undervolting lets a CPU/GPU boost to higher frequency for longer before hitting power/temperature limits or just use less power to hit the same performance. There's a decent chance that any given chip can be undervolted enough to make a notable difference before becoming unstable.
I think it's mostly pointless in this day and age.
I'm just saying that it has a potential appeal for gamers too, so it's not just a datacenter type of technology that some nerds want to play with.
At the very least it'd make overclocking safer and easier, so any manufacturer making gamer type boards with a lot of overclocking settings in the BIOS should like the idea of it.
The sweet spot for overclocking ECC ram is still before it starts malfunctioning. If it's clocked higher but is correcting for errors it will still be slower.
Read the paper(s) below. Absolutely frightening; we're all basically swimming in quicksand. ECC should be mandatory! Not using it is like huffing leaded gas.
Linus recently also had trouble with his development machine caused by the lack of ECC: 'I absolutely detest the crazy industry politics and bad vendors that have made ECC memory so "special".' https://lkml.iu.edu/hypermail/linux/kernel/2210.1/00691.html
Personally I've used about an even split of ECC vs. non-ECC RAM over the last ~30 years and can't say I've ever encountered any problems with non-ECC RAM; if something happened, it was of no significance for me to notice. Professionally, yeah, I've run across a server here and there reporting bit errors, but it was a four-leaf-clover kind of thing: not unheard of, but not common. The more important thing in my experience is using high-quality hardware with stable, well-tested drivers. That is the first practical precaution to consider. If you've got a box full of the cheapest parts you can find on Amazon, you've got bigger problems to consider before worrying about ECC.
This kind of comment is the most insightful kind concerning ECC RAM.
If you're relying on your computer to be 100% error-free (eg: doing professional work), paying some extra for ECC RAM support isn't even a drop in the bucket. You either do or don't, money isn't an object.
If you're Joe Average browsing Hacker News or playing some games at home, who cares if you win the bit flip lottery?
Would it be nice for ECC RAM to be more mainstream? Sure. But the fact that it is not isn't breaking anyone's back either.
I wonder if the lack of ECC is a form of planned obsolescence. It makes the computer cost less, and it ensures the computer starts failing after a while as the OS and files get corrupted, making the owner want a new computer. More profit from both. Kind of like using rubber rather than silicone for seals in a car, even though the cost difference is tiny, so owners eventually have to buy a new car.
Silent data corruption is silent. ECC should be mandatory just for the error reporting. The system should inform the user that a DIMM has gone bad and needs replacement.
> It was the worst-case scenario for RAM failure: bit flip errors that get written back to the disk. I discovered that several video files that I had been editing had corrupted bits, and were no longer usable.
What a horrible failure mode! Kudos to the OP for even thinking to investigate this.
The mentioned memtest86 tool sounds quite useful - so useful, in fact, that it seems it should be part of the operating system. If my OS is perfectly willing to write corrupted memory back to disk, and it's designed to run on thousands of different OEM hardware configurations that may or may not have ECC RAM, then I would expect it to proactively monitor for faulty RAM and bitflips.
Is there a good reason why this isn't the default behavior of operating systems (maybe it is - I use Mac and don't know much about Windows)? It seems like a trivial diagnostic tool that could prevent a lot of headache. But perhaps the problem is there's no reliable test without some number of false positives, so popping a warning on the screen that your RAM appears unhealthy seems like a good way to confuse the average user. But then again, so does silently corrupting a video file.
Note that memtest86 is a test application, not a means of ensuring integrity: it writes known patterns into memory in order to test it. Windows also has a built-in version which has the same limitations: you reboot into a special mode where it runs the test, it doesn't work online or validate actual application data.
The OS layer is probably the hardest place to do online application data integrity checking: it's too low level to know what data is important and which parts will change and why in a way which allows checksumming to work effectively and efficiently, but too high above the hardware to be able to check the integrity of memory without a massive performance penalty (especially when it comes to how memory is moving in and out of caches). Most solutions work at a higher level in the application itself or lower level with ECC RAM as mentioned in the article.
> The mentioned memtest86 tool sounds quite useful - so useful, in fact, that it seems it should be part of the operating system
Windows has the Memory Diagnostics Tool, and your Linux distribution probably has a memory tester option in the boot manager. Of course you have to run them manually, which is a bit bothersome. There have been attempts at kernel patches for Linux to test memory in the background, going back to at least 2005 [1], but there were probably some before that. The simpler versions just test whatever memory is free; the more sophisticated approaches try to move stuff around in physical memory so each region is periodically freed and can be tested. But in the end it's spending CPU cycles on a problem most users never experience.
Windows has a memory test tool built in called the Windows Memory Diagnostic. It runs on the next reboot to fully test system memory.
Apple Diagnostics check memory as well, though maybe not as thoroughly as the Microsoft or memtest86 do (multiple passes).
I've had memory errors that only appeared on two of eight passes through the whole memory test overnight. They certainly caused occasional issues with games though.
> Is there a good reason why this isn't the default behavior of operating systems
Because any equivalent of memtest86 will flag a LOT of hardware for the buggy piles of shit that they really are.
If Windows started flagging people's crappy cheap hardware, Microsoft would get a lot of grief and have to spend a chunk of money on customer support and PR.
I dunno, I've had overclocked memory where memtest86 wouldn't report errors in multiple passes, yet TestMem5 (weird win32 tool from some .ru site) tells you it's bad in about ten seconds. memtest86 doesn't seem to be that stressful for memory.
memtest86 needs to run alone so it can test the entire physical RAM, and the actual, meaningful tests that actually test RAM and not L1 take forever to run. What would you have the OS do?
It still doesn't have full access; it doesn't memtest its own code segment, for instance. Memtest can also be run as a usermode program (although it needs permission to lock the pages in memory).
Operating systems zero memory before handing out pages, I wonder what the perf impact would be to run at least a very basic test on a page before allocating it. Obviously allocation would be slower, but maybe it's worth it in some scenarios.
I was thinking of 386 era computers and strictly speaking it was just parity RAM, not ECC. Which often led to annoyances when a single parity error would cause your whole computer to halt.
Wikipedia says "By the mid-1990s, most DRAM had dropped parity checking as manufacturers felt confident that it was no longer necessary.". https://en.wikipedia.org/wiki/RAM_parity
I'd love to read a technical deep dive on RAM reliability over time. You'd think with increasing memory cell density and overall larger RAM the number of absolute errors on a desktop computer would be going up over time.
I can remember that 486 motherboards in Packard Bell systems (quite the entry-level brand...) frequently used 36-bit ECC FP SIMMs.
Printers and plotters from this era used ECC modules most of the time.
But by the end of the century, they were replaced by unbuffered, unregistered 16/32/64-bit modules.
Every mid-range server still uses ECC. Entry-level HPE servers use ECC UREG (unregistered, 9-chip) modules, while mid-range and up use ECC REG modules (9 chips plus an interface controller onboard).
Ironically, UREG modules are more expensive than ECC REG.
Also, most workstations used ECC modules, though less frequently in the last 4-5 years.
I know the Pentium Pro/2/3 chipsets and motherboards all(?) supported it. Unsure of the Pentium 1 as the 430TX on my Tyan Tomcat IV doesn't, and that is a dual processor board. 486 and earlier likely depended on the chipset as there were many.
At work I have two working slot 1 PIII 800's each with 1GB ECC (4x 256MB DIMMS) on a regular Asus board (doing nothing but waiting to go home with me one day). The board reports the RAM is in fact ECC and that it is enabled.
Apple does not make any Macs with ECC RAM any more.
This was the reason I always used Mac Pro machines, and at the end, iMac Pro: those machines had ECC RAM, and incidentally always seemed more stable over months and years than the MacBook Pros I had, which didn’t.
(Of course, there are a lot of reasons why a laptop might crash or malfunction an order of magnitude more than a desktop machine. The benefits of ECC RAM are really hard to catch definitively in the wild. We know they exist though, so I’m in the “it’s insane not to use ECC RAM for any use case that persisting data and then reading it back later” camp.)
Oh, my bad; I thought they didn't sell that any more. Indeed, that machine has ECC, (as do all Mac Pros, IIRC... so it will be interesting to see what happens when the Apple Silicon version of the Mac Pro is released).
Commodity PCs back in the 80s and 90s didn't have error correction, but they did have parity (IIRC). Correction requires several extra bits per word, compared to parity's single extra bit per byte. I recall around 1990 you could get your 30-pin SIMMs as 9-bit (parity) or 8-bit (no parity), and virtually all of the PCs at the time wanted the 9-bit modules. Parity can't correct errors, but at least it can raise an exception when you read something that's had a bit flip.
On current memory, parity and correction get applied to the entire bus width. So with parity taking one bit and correction taking about log2(width) bits, it only requires 7-8 bits to apply both of them to an entire memory channel. (DDR4 has 64/72-bit channels, and DDR5 has 32/40-bit channels.)
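That overhead arithmetic can be checked directly: single-error correction (SEC) needs r check bits satisfying 2^r >= data_bits + r + 1, and one more bit upgrades it to SEC-DED (detect double errors), which is the scheme real ECC DIMMs use. A quick sketch:

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a single-error-correct, double-error-detect Hamming code."""
    r = 0
    while 2 ** r < data_bits + r + 1:   # SEC condition: syndrome must address
        r += 1                           # every data and check bit, plus "no error"
    return r + 1                         # +1 overall parity bit for double detect

for width in (8, 32, 64):
    print(width, secded_check_bits(width))
# 8 data bits need 5 check bits, 32 need 7, 64 need 8 -- which is why a
# 64-bit channel plus 8 ECC bits (a 72-bit DIMM) covers SEC-DED exactly.
```

This is also why protecting a whole 64-bit channel (8 extra bits, 12.5% overhead) is far cheaper than protecting each byte individually (5 extra bits per byte, 62.5% overhead).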
Oh god this gives me flashbacks. I never knew how bad corruption was on home computers until I started using git annex on a multi-terabyte file collection. No matter which computer or disk I used, it would inevitably happen. Then I started to wonder how much corruption had crept in before I was using git annex and never noticed it.
I've basically given up on maintaining large digital media collections for long term purposes. It puts my OCD in overdrive.
How about storing the data on a filesystem that performs data checksumming, also configured in a RAID-1-like mirroring profile to enable any corruption to also be corrected? Filesystems like ZFS or BTRFS could be a good choice here.
> One of the prime prerequisites of e.g. zfs actually is using ECC RAM.
ECC RAM is not a prerequisite for using ZFS.
Matt Ahrens, co-creator of ZFS and still one of the main developers, said this [1]:
"There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. Actually, ZFS can mitigate this risk to some degree if you enable the unsupported ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). This will checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error.
I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS."
True, I phrased it too strongly. Let's put it this way: ZFS (or btrfs, ...) can make for a bullet-proof system through checksumming and self-healing, but its Achilles' heel is non-ECC RAM. With ECC, at least in theory and barring universal disaster, data can remain intact indefinitely; without it, ZFS remains at the mercy of whatever the RAM gets wrong. That's what I remember learning about it a while ago.
However, even if you're not using ECC RAM, you're much better off using ZFS or btrfs, because due to their frequent checksumming and checksum validations, these filesystems will usually detect memory corruption much sooner than if you didn't use them.
This could be immensely helpful in scenarios such as the great^4-parent poster's.
Note that bad hardware is not limited to non-ECC RAM. ZFS and btrfs help just as much in detecting other kinds of bad hardware, such as bad SATA cables, bad disks, bad disk/SATA controllers, bad CPUs, bad power supply, etc.
But of course, once these checksum errors are flagged by ZFS/btrfs, it's a signal to test and fix/replace your hardware, not keep using a machine with bad hardware.
And yes, while ZFS and btrfs cannot fix errors that happen before the checksumming takes place (e.g. due to bad RAM, bad CPUs, etc), they can still detect these kinds of errors (in some cases, at least), especially when they happen after checksumming already took place. And they definitely can detect and fix errors in the rest of the data-to-storage-and-back path (e.g. bad disk cables, bad disks or disk controllers, etc).
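For anyone not on ZFS/btrfs, the same detect-by-checksum idea can be approximated at the application level. A minimal sketch (the manifest format and helper names are made up for illustration; like a ZFS scrub without redundancy, it detects corruption but cannot repair it):

```python
import hashlib
from pathlib import Path

def snapshot(root: Path) -> dict:
    """Record a SHA-256 digest per file, to compare against later reads."""
    return {str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify(root: Path, manifest: dict) -> list:
    """Return the files whose current contents no longer match the manifest."""
    current = snapshot(root)
    return [name for name, digest in manifest.items()
            if current.get(name) != digest]
```

Run snapshot once after ingesting data, then verify periodically; a non-empty result means something (disk, cable, controller, or RAM on the write path) corrupted a file since it was recorded.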
My big complaint is actually digging through the mess of MB/RAM/BIOS documentation to actually verify that I have the right combination of processor/motherboard/memory for ECC AND TEST IT'S WORKING.
AMD seems to support ECC on some consumer-level chipsets, but it's a nightmare to sort it all out and verify that all the bits are actually functioning properly.
Were you able to find a consumer motherboard that supports fault injection (or whatever it's called)? I was able to verify that the motherboard reported ECC was working and performed a 24-hour burn-in, but I have been unable to perform a functional test showing that ECC had corrected an error.
I'm with you, but it's worth noting that errant bit-flips are also the most convincing argument for vertically integrated file-systems like ZFS and BTRFS.
I was under the impression running such file systems without ECC was kind of reckless and not a solution either? I recall ZFS being “useless” (not technically entirely bulletproof) without ECC.
ECC can only report more than one bit flip per word; it can't correct them. If your memory is going bad, the chance of multi-bit failure is going to be much higher than a stray bit flip from other causes (mind you, I have 32 GB of ECC running right now and have seen zero bit flips in 1.5 years).
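That correct-one/detect-two behavior is exactly what standard SEC-DED Hamming codes give you. A toy Hamming(8,4) sketch demonstrates it (real DIMMs use SEC-DED over 64-bit words, but the mechanics are the same):

```python
def encode(nibble: int) -> list:
    """Hamming(7,4) plus an overall parity bit: SEC-DED over 4 data bits."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8                   # c[1..7] = Hamming positions, c[0] = overall parity
    c[3], c[5], c[6], c[7] = d    # data goes in the non-power-of-two positions
    c[1] = c[3] ^ c[5] ^ c[7]     # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]     # ... bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]     # ... bit 2 set
    c[0] = sum(c[1:]) % 2         # overall parity enables double-error detection
    return c

def decode(c: list):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    c = c[:]
    syndrome = 0
    for i in range(1, 8):
        if c[i]:
            syndrome ^= i         # xor of set-bit positions; 0 for a valid word
    overall = sum(c) % 2          # 0 if overall parity still holds
    if syndrome and overall:      # one flip: syndrome is its position
        c[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                # syndrome set but parity holds: two flips
        return None, "uncorrectable"
    else:                         # syndrome clear: at worst c[0] itself flipped
        status = "corrected" if overall else "ok"
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return data, status
```

Flip one bit of a codeword and decode recovers the data; flip two and it can only tell you the word is bad, which is exactly the failure mode described above for memory that's actively dying.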
I'm not arguing against ECC. I have it and run it myself. But this blog post is an argument for ZFS and regular backups. Failures can occur at many other places than RAM. I had a drive that had corruption and I thought went bad. Turns out, bad SATA cable. A week later I upgraded from cable internet to 1Gbit fiber. My speeds were not at all what I expected. Turns out... bad ethernet cable. Lightning can strike twice I guess.
> By the way, you’d better believe that your disk(s) have all kinds of error correction schemes built into them, which work automatically and transparently.
> If your memory is going bad, the chance of multiple bit failure is going to be much higher than the stray bit flip from other causes
It's completely possible for a single DRAM cell to go bad permanently.
> But this blog post is an argument for ZFS
ZFS won't save you if what you tell it to put into a file is wrong. And I think that corrupting ZFS' own in-RAM data structures voids your ZFS warranty.
> and regular backups.
Of course you should have regular backups. But do you really feel like spending hours restoring from a backup (and then maybe weeks finding things you didn't restore)? Do you feel like trying to guess how far back you have to go to get to a good backup of a corrupted file? Do you want to lose whatever new work got corrupted between the time your memory failed and the time you finally noticed it?
OP's machine was corrupting data for weeks. You can't just roll all your work back to several weeks ago. Most of us can't, anyway.
You need backups and ECC.
> I had a drive that had corruption and I thought went bad. Turns out, bad SATA cable.
If a bad cable did that, something wasn't doing SATA CRC checking the way it was supposed to (or wasn't reacting to detected CRC errors the way it should have). Similar in spirit to ECC checking for memory.
> My speeds were not at all what I expected. Turns out... bad ethernet cable.
It probably got slow because of retransmissions. If there hadn't been CRCs on both the Ethernet layer and higher layers, it could possibly have just silently corrupted your data.
You need to cover as much of a computer system as you can with error detection and correction, or you always lose. Even one big failure in a lifetime can pay for a lot of ECC RAM.
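The detect-then-retransmit behavior mentioned above is easy to demonstrate with the same CRC-32 that Ethernet uses as its frame check sequence (the payload here is made up):

```python
import zlib

frame = b"some payload bytes"
fcs = zlib.crc32(frame)              # Ethernet appends CRC-32 as the FCS

# Receiver side: recompute and compare. On mismatch the frame is dropped
# and a higher layer (e.g. TCP) retransmits -- slow link, but no silent lies.
corrupted = b"some paylaod bytes"    # two adjacent bytes swapped in transit
print(zlib.crc32(frame) == fcs)      # True  -- clean frame accepted
print(zlib.crc32(corrupted) == fcs)  # False -- corruption detected, frame dropped
```

CRC-32 is guaranteed to catch any error burst of 32 bits or fewer, which covers the kind of localized damage a marginal cable produces; that's why a bad Ethernet cable usually shows up as lost throughput rather than corrupted files.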
> If your memory is going bad, the chance of multiple bit failure is going to be much higher
When a chip starts going bad, the thousands of error reports will more than make up for the errors that slip through.
Just make sure reporting is working.
> Well...
Well what?
Those drives were lying about finishing writes, which is barely related to losing data that was written, and especially data that was written more than a few seconds ago. And if drives didn't have ECC, data loss would be a million times more common.
It's very strange that DDR5 mandates internal ECC within each physical package, but not on the longer and possibly more EMI sensitive connections between the memory chips and controller.
I would have thought that the additional cost would be minimal (additional wiring on the logic board in some cases), but maybe this is just more artificial market segmentation?
They have internal ECC because it allows them to have higher yields; what would be considered a faulty chip in DDR4 can now be sold for DDR5. So it is effectively a cost-reducing measure for them. Exposing it to the user would not only cost extra pennies, but potentially have users go "hey, this stick is shit, look at how many correctable errors it is producing, please replace it."
Thinking about it, when a DIMM doesn't have extra RAM chips for ECC, you could reuse the ECC pins to run a parity bit to each chip. It would cost nothing. And with DDR5 having 2x8 ECC pins, it would work with any chip layout: x16, x8, or x4.
Especially given the tighter electrical tolerances with running 3+ GHz signals across a PCB. I have dialed in an OC from 2x 16 GB Hynix A-die (sold as 6600 MT/s). Using buildzoid's settings I got 7000 MT/s stable, but 7200 MT/s runs into intermittent errors on y-cruncher VST. I don't think there's much headroom in voltages without degrading life. I haven't tried loosening timings, so that's a thread worth tugging on.
I have had a scenario where the machine failed to boot at 7200 until I reseated the memory, so there's definitely a limit on the physical media being hit.
DDR5 will have some minimal ECC on the stick but critically does not mandate full runs to the CPU. Or in Wikipedia's words:
> Unlike DDR4, all DDR5 chips have on-die ECC, where errors are detected and corrected before sending data to the CPU. This, however, is not the same as true ECC memory with an extra data correction chip on the memory module. DDR5's on-die error correction is to improve reliability and to allow denser RAM chips which lowers the per-chip defect rate. There still exist non-ECC and ECC DDR5 DIMM variants; the ECC variants have extra data lines to the CPU to send error-detection data, letting the CPU detect and correct errors that occurred in transit.
So in some ways it is better than previous generations, but it gives vendors another excuse not to implement full-coverage ECC. That's my guess of why GP said it complicates things.
I wish Intel and AMD would support in-band ECC on their consumer platforms. That way you can use the same DIMMs and you make the choice - either sacrifice a few percent of memory for memory correction, or don't. Intel _do_ support this on _some_ workstation products, but it's very uncommon and naturally those platforms are expensive.
I've never had ECC memory to my knowledge. What have I been missing out on, in practice? Can we confidently say that x% of my past BSOD were the direct consequence?
> I've never had ECC memory to my knowledge. What have I been missing out on, in practice? Can we confidently say that x% of my past BSOD were the direct consequence?
I very highly doubt it. I remember BSODs, and Guru Meditations (on the Amiga) before that. Once I switched to Linux, suddenly no more BSODs and very few kernel panics. I've at times had my desktop reach six months of uptime.
I think bit-flips are a great excuse for unreliable software.
BTW I'm not saying ECC is unnecessary: I'm saying it's unlikely the lack of ECC was the reason for most of the BSOD you saw throughout the ages.
The bitflip would also have to cause an error in logic that would result in a system lockup. That's unlikely, given that most information flowing through ram is probably data, not logic.
You may have been lucky enough to never have experienced a bad module. But when it does happen, it can have major consequences (losing data, having to reinstall an OS), all while being totally preventable.
Not enough people realise that the driving reason for ECC in home systems is this, not random once-in-a-blue-moon bit flips
I’ll always buy ECC after multiple data loss events from ram suddenly going bad, worth every penny to have ECC ring alarm bells rather than put 2+2 together after random BSODs and silent corruption
It's worth keeping in mind that the chipset has zero involvement in ECC. The CPU is directly attached to the memory slots. They're using the chipset as an expensive dongle.
Well, the chipset does enable error detection and correction features, because it is the responsibility of the chipset to raise certain interrupts or assert this or that signal in certain error cases. You may view this as artificial segmentation but without the more advanced management engine in the Q680 and W680 chipsets, the Z690 and all lower chipsets that contain the simpler "client" i.e. consumer management engine can't enable ECC.
You're saying the memory controller sends an error signal to the chipset, and the chipset sends it back to the CPU package?
Even so that's an extremely trivial task. It doesn't need a "more advanced" anything. It needs them to not deliberately disable the code or remove the tiny tiny amount of circuit.
They do it with rear I/O too. Motherboards with anything but workstation or flagship consumer chipsets typically have an anemic port selection, which is silly because for many half the reason to choose building a desktop over buying a laptop is to be able to plug in a lot of stuff without a bunch of hubs/docks/etc.
Then the competition is working as expected: Ryzen had ECC unofficially for some time, and now Intel has it. There are plenty of other ways to segment users, e.g. memory channels, PCIe lanes, etc.
Exactly. $450 for a motherboard just to get ECC support is ridiculous. I don't know how it is with AM5, but on AM4, my understanding is that you could use ECC memory with many normally-priced motherboards. (Even if it wasn't "officially" supported.)
Mentioning W680 feels pointless. You've always been able to buy high-end workstation-class motherboards and stick ECC in them. The entire point of the article is that all computers should be using ECC RAM, not just the expensive, workstation class computers.
> You don't /need/ it, though, so this is neither here nor there.
The whole point of this article and discussion says otherwise. Don't come to me to complain about that. Your comment should be a top-level comment complaining at the author.
I fully agree with the author that ECC should not be reserved for expensive computers, of course, but I'm just here to point out that W680 is not a response to the author's concerns at all, period. W680 is a continuation of Intel's status quo.
One of the main reasons I buy Xeon desktops is the ECC. With 128 GB of memory, and 1 bitflip/GB/year average error rate, it seems too risky to not use ECC for production work.
Real world numbers are closer to 1 bitflip/GB/hour than year because bit flips are highly correlated.
“A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance '09 conference.[6] The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (2.5 × 10−11 error/bit·h) and 70,000 (7.0 × 10−11 error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.” https://en.wikipedia.org/wiki/ECC_memory
A random stick of non-ECC memory might be far above average or have several errors per minute, but you just don't know.
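The figures quoted above can be sanity-checked with a little arithmetic. The inputs below are the study's upper-bound rate and the per-year assumption from earlier in the thread, not new measurements:

```python
BITS_PER_GB = 8 * 2**30  # one GiB expressed in bits

# Upper bound from the Google SIGMETRICS '09 study: 7.0e-11 errors per bit-hour.
rate_per_bit_hour = 7.0e-11
errors_per_gb_hour = rate_per_bit_hour * BITS_PER_GB
print(f"{errors_per_gb_hour:.2f} errors/GB/hour, i.e. one error per GB "
      f"roughly every {1 / errors_per_gb_hour:.1f} hours")

# The far more optimistic assumption upthread: 1 flip/GB/year across 128 GB.
flips_per_year = 1 * 128
print(f"128 GB at 1 flip/GB/year -> {flips_per_year} flips/year, "
      f"about one every {365 / flips_per_year:.1f} days")
```

The first figure lands near the "1 bit error per gigabyte per 1.8 hours" quoted from the study; even the optimistic per-year rate still implies a flip every few days on a 128 GB machine, which is the gap between the two positions in this thread.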
That’s a comparison of on-die ECC (ODECC) vs. non-ODECC memory chips, not DDR5 vs. DDR4.
ODECC was added because manufacturers wanted to use DDR5 chips that would have had unacceptably high error rates without it. In other words, that improvement happens before the binning process, so they are selling chips with a vastly higher innate error rate. The average DDR5 stick could actually be worse than the average DDR4 stick; it’s hard to say without large-scale testing from multiple manufacturers.
I don't know. I have been building my own computers for about 30 years, starting with an Intel 486, and using them heavily. A couple of times I bought ECC memory, but most of the time I didn't. For the last 4 years I have had 64 GB of non-ECC RAM. Maybe I had some problems with non-ECC memory during all those years, like sudden BSODs, freezes, or some corrupted data, but never anything significant. So I honestly don't see a reason for drama here.
I believe that it is a right. Every machine I've built in the last ~12 years has had ECC. Nobody is trying to take that ability away from me. I treat my machine as a "workstation". I run several VMs. I keep my machines for a few years. I clean the filters periodically.
Do game machines need ECC? I'd say no. They are optimized for cost-performance. The worst that can happen with a memory error is a lost game.
You have the good fortune of being able to afford lots of expensive hardware. Mandating ECC is basically like mandating seat-belts in all cars: it makes safety cheaper for everyone through the economies of scale. Gamers and libertarians could theoretically disable it if they so chose and the total cost to them would only be a few pennies for the wasted motherboard circuitry.
But he said "right" not "mandate". My point was only that I currently do have an implicit right.
ECC memory isn't that much more expensive. I was fortunate to build my current machine a few months before memory prices doubled. Previous generation Xeon CPUs aren't very expensive. Same with motherboards. And another option is to just buy a used server. These are super cheap now as so many companies are moving to cloud computing.
> Memory manufacturers assure us that desktop RAM is so reliable that it doesn’t need ECC, that the probability of bit flip events is so low that it’s not worth the extra “cost” of ECC
ECC requires extra circuitry, which means more die cost, more testing cost, and more failed chips (due to the added complexity).
Most people have very little need for ECC. The author didn't even know that they wanted ECC until they were unlucky enough to get a stick of RAM that failed, and failed in such a way that the OS still booted but a file was silently corrupted (not that common, because when a chip fails it usually doesn't fail silently like that).
Well, ECC memory is more expensive than non-ECC memory by about 10-20% (according to a quick Google search). Part of that may be markup due to people who want ECC being willing to pay for a highly reputable brand sold by a highly reputable seller (they won't be buying ECC in bulk from the Amazon marketplace).
Virtually every piece of consumer hardware and software has security vulnerabilities, all the way down to CPUs and RAM. Perhaps computing, in general, is too hazardous for human rights work?