ECC has been withheld as an artificial market segmentation mechanism for far too long, and it needs to come to an end. RAM, just like SSDs and HDDs, ought to have some amount of self-protection against basic errors; every place data is stored, even for short periods, needs this.
What's really annoying is that the current generation of Intel CPUs support ECC, they just don't implement it in consumer chipsets. You can get working ECC with a W680 motherboard, but those are very expensive and availability is slim.
AMD Ryzen CPUs have ECC support, but it varies by motherboard. My home NAS runs a Ryzen on an ASRock Rack motherboard with ECC RAM.
I never did figure out a way to verify that the ECC is working/if it is able to report errors (to the kernel?). It was also a bit hard finding the right ram, but there’s a brand called Nemix that I found. I was a bit sketched out by the brand, but the chips themselves were Micron.
AMD used to not reserve ECC for non-consumer lines. It wasn't very popular with consumers, so I can't really blame them for not supporting it well. A good balance is to make it work but leave it officially unsupported on consumer lines; then it's up to motherboard vendors and buyers to take advantage if it's worthwhile to them. That's enough to avoid cutting into non-consumer lines or adding support costs to consumer pricing.
One thing that could change this quickly is if Apple used ECC. Others would follow so as not to seem inferior. I don't hear Apple users complain much at all about the lack of ECC options.
I wasn't aware. Did some searching around and it mostly concerns the Mac Pro/Studio products in discussions. I was thinking more along the lines of regular MacBook Pros and such. Using ECC in products priced much higher doesn't change things elsewhere.
Not all of them; AFAIK the AM4 processors with integrated graphics don't, unless you have a Pro-branded version.
> I never did figure out a way to verify that the ECC is working/if it is able to report errors (to the kernel?)
The easiest way is to induce an error. Some of the memtest tools can try to do this, but you can also tweak memory voltage and timings to make things less reliable, and you should be able to get errors, and error reporting, as a result.
I think the question is where are the errors reported? Would you see logs in dmesg for example? Or is there some smartctl-like tool that exposes the raw correction counters?
I was looking into getting a motherboard for a consumer CPU with ECC support several months ago. The fact that AMD and the motherboard manufacturers don't explicitly say they support ECC is the reason I went with an Intel Supermicro motherboard that did explicitly say it supports ECC. I agree with the article, though, that it's crazy I have to resort to a workstation/server motherboard to be certain I'm getting ECC.
To see if ECC is working, try overclocking the RAM if possible and looking for EDAC errors (they'll show up in dmesg) or by running edac-util. Eventually you should see something reported.
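On Linux, the answer to "where are the errors reported" is the kernel's EDAC subsystem: corrected and uncorrected error counters live under sysfs, and edac-util mostly just reads them for you. A minimal sketch that reads the standard per-controller counters directly (the sysfs layout is the stock EDAC one; the base path is parameterized here so the function can be tested without real hardware):

```python
from pathlib import Path

def edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC sysfs tree.

    ce_count / ue_count are the standard EDAC per-controller files; the same
    events also appear in dmesg as they occur.
    """
    root = Path(base)
    counts = {}
    if not root.is_dir():          # no EDAC driver loaded (or no ECC at all)
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        ce = int((mc / "ce_count").read_text())
        ue = int((mc / "ue_count").read_text())
        counts[mc.name] = (ce, ue)
    return counts

if __name__ == "__main__":
    for mc, (ce, ue) in edac_counts().items():
        print(f"{mc}: {ce} corrected, {ue} uncorrected")
```

An empty result usually means no EDAC driver is loaded for your memory controller, which is itself a useful signal that ECC reporting isn't active.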
I suspect that other axes of CPU segmentation, e.g. RAM channels and PCIe lanes, are artificial in a similar way: the cost of producing a die is more or less the same, but the pro CPUs have more of them enabled.
ECC RAM would actually be a boon to everyone, including gamers.
ECC means not only that you know precisely when you've gone too far with overclocking, but it potentially allows overclocking a bit further, since some amount of trouble can now be tolerated.
It also means you're not going to break your OS by playing with this stuff. Memory corruption carries a huge risk of disk corruption, which can mean things like corrupt data, random crashes or an unbootable system that persists even after reverting everything to defaults.
Indeed, because I had ECC RAM I felt comfortable overclocking it, and managed to go from 2400 MT/s to 3200 MT/s on my home server. A massive difference in performance, while paying only slightly more for ECC.
I doubt that would actually be useful with overclocking. I don't know the architecture of the modern PC well enough to say with 100% confidence, but on embedded architectures the RAM has its parity bits checked when data is placed on the bus. If the error happens on retrieval (or was already present), then the ECC saves you, but if it happens anywhere else... not really. I don't know if ALUs, for example, automatically include the parity bits in their computation.
The GTX 1080 has GDDR5X, which has ECC. I was able to dial in my memory OC on that card by increasing the clock and then running memory throughput benchmarks until throughput was maxed out. At the point where errors start, the throughput would decrease, but nothing would fail.
Assuming stationary processes this is pretty nice. Maybe bump back 5% for margin.
I believe that's inside the chip, much like DDR5. However that doesn't help if there's an error introduced by the packages, pin, board traces, CPU pins, or CPU.
Some people prefer to trade off stability for a slight performance improvement. With modern hardware I don't think it's worth it, to be honest. I want my computer to work day in and day out, even if it means a 2% lower score in some benchmark.
> Some people prefer to trade off stability for a slight performance improvement.
In my PC, I've got an Intel Core i5-3570K (3.4 GHz stock) overclocked to 4.5 GHz or something silly like that. I used the motherboard manufacturer's "one touch overclocking" feature to determine that speed years ago and haven't touched it since.
The performance improvement is more than slight. Stability is rock solid across thousands of hours of gaming. I'm not running advanced cooling. $60 Corsair closed loop cooler and three Noctua case fans. System runs cool and quiet, fans throttle up and down nicely. Near silent when idle and still rather quiet under load.
That is a pretty decent gain in performance for zero drawbacks and essentially zero effort.
That's around the last generation when it was worth overclocking things, because the headroom on a typical consumer-grade -K CPU was at least 25%.
Personally, I run a 4670K at 4 GHz, and it's been rock stable as my main machine for the last 8.5 years. It's finally beginning to show its age in '22-'23, but the longevity and use I got out of a $200 processor is incredible.
But when CPUs auto-turbo to 5+ GHz, I agree, overclocking sort of loses its luster.
Yeah personally I’d rather build a PC with efficient power supply and quiet, high performance cooling. No gaudy LEDs either, just a plain case. I want it to be very stable and reliable and unobtrusive.
There are also a lot of cases where you can overclock without sacrificing stability. The standard clock speed for any line of processors is simply the minimum it is tested for, but you sometimes get lucky and get a better chip with more viable transistors. So you can boost the clock on those and reap the benefits without any drawbacks.
There are sites and services that do "binning" where they test the specific chips and you can buy ones that have been vetted to clock higher.
XMP profiles are another example where you can boost RAM speeds without much risk of instability [1]. I assume the manufacturer is just doing binning on their end and selling the higher performing RAM at higher prices.
And the performance differences can be significant: 20% or more in certain workloads.
And undervolting lets a CPU/GPU boost to higher frequency for longer before hitting power/temperature limits or just use less power to hit the same performance. There's a decent chance that any given chip can be undervolted enough to make a notable difference before becoming unstable.
I think it's mostly pointless in this day and age.
I'm just saying that it has a potential appeal for gamers too, so it's not just a datacenter type of technology that some nerds want to play with.
At the very least it'd make overclocking safer and easier, so any manufacturer making gamer type boards with a lot of overclocking settings in the BIOS should like the idea of it.
The sweet spot for overclocking ECC ram is still before it starts malfunctioning. If it's clocked higher but is correcting for errors it will still be slower.
Read the paper(s) below. Absolutely frightening; we're all basically swimming in quicksand. ECC should be mandatory! Not using it is like huffing leaded gas.
Linus recently also had trouble with his development machine caused by the lack of ECC: 'I absolutely detest the crazy industry politics and bad vendors that have made ECC memory so "special".' https://lkml.iu.edu/hypermail/linux/kernel/2210.1/00691.html
Personally I've used about an even split of ECC vs. non-ECC RAM over the last ~30 years and can't say I've ever encountered any problems with non-ECC RAM; if something happened, it was of no significance for me to notice. Professionally, yeah, I've run across a server here and there reporting bit errors, but it was a four-leaf-clover kind of thing: not unheard of, but not common. The more important thing in my experience is using high-quality hardware with stable, well-tested drivers. That is the first practical precaution to consider. If you've got a box full of the cheapest parts you can find on Amazon, you've got bigger problems to consider before worrying about ECC.
This kind of comment is the most insightful kind concerning ECC RAM.
If you're relying on your computer to be 100% error-free (eg: doing professional work), paying some extra for ECC RAM support isn't even a drop in the bucket. You either do or don't, money isn't an object.
If you're Joe Average browsing Hacker News or playing some games at home, who cares if you win the bit flip lottery?
Would it be nice for ECC RAM to be more mainstream? Sure. But the fact that it is not isn't breaking anyone's back either.
I wonder if the lack of ECC is a form of planned obsolescence. It makes the computer cost less, and it ensures the computer starts failing after a while as the OS and files get corrupted, making the owner want a new computer. More profit from both. Kind of like using rubber rather than silicone for seals in a car, even though the cost difference is tiny, so owners eventually have to buy a new car.
Silent data corruption is silent. ECC should be mandatory just for the error reporting. The system should inform the user that a DIMM has gone bad and needs replacement.
> It was the worst-case scenario for RAM failure: bit flip errors that get written back to the disk. I discovered that several video files that I had been editing had corrupted bits, and were no longer usable.
What a horrible failure mode! Kudos to the OP for even thinking to investigate this.
The mentioned memtest86 tool sounds quite useful - so useful, in fact, that it seems it should be part of the operating system. If my OS is perfectly willing to write corrupted memory back to disk, and it's designed to run on thousands of different OEM hardware configurations that may or may not have ECC RAM, then I would expect it to proactively monitor for faulty RAM and bitflips.
Is there a good reason why this isn't the default behavior of operating systems (maybe it is - I use Mac and don't know much about Windows)? It seems like a trivial diagnostic tool that could prevent a lot of headache. But perhaps the problem is there's no reliable test without some number of false positives, so popping a warning on the screen that your RAM appears unhealthy seems like a good way to confuse the average user. But then again, so does silently corrupting a video file.
Note that memtest86 is a test application, not a means of ensuring integrity: it writes known patterns into memory in order to test it. Windows also has a built-in version which has the same limitations: you reboot into a special mode where it runs the test, it doesn't work online or validate actual application data.
The OS layer is probably the hardest place to do online application data integrity checking: it's too low level to know what data is important and which parts will change and why in a way which allows checksumming to work effectively and efficiently, but too high above the hardware to be able to check the integrity of memory without a massive performance penalty (especially when it comes to how memory is moving in and out of caches). Most solutions work at a higher level in the application itself or lower level with ECC RAM as mentioned in the article.
> The mentioned memtest86 tool sounds quite useful - so useful, in fact, that it seems it should be part of the operating system
Windows has the Memory Diagnostics Tool, and your Linux distribution probably has a memory tester option in the boot manager. Of course you have to run them manually, which is a bit bothersome. There have been attempts at kernel patches for Linux to test memory in the background, going back to at least 2005 [1], but there were probably some before that. The simpler versions just test whatever memory is free; the more sophisticated approaches try to move stuff around in physical memory so each region is periodically freed and can be tested. But in the end it's spending CPU cycles on a problem most users never experience.
Windows has a memory test tool built in called the Windows Memory Diagnostic. It runs on the next reboot to fully test system memory.
Apple Diagnostics check memory as well, though maybe not as thoroughly as the Microsoft or memtest86 do (multiple passes).
I've had memory errors that only appeared on two of eight passes through the whole memory test overnight. They certainly caused occasional issues with games though.
> Is there a good reason why this isn't the default behavior of operating systems
Because any equivalent of memtest86 will flag a LOT of hardware for the buggy piles of shit that they really are.
If Windows started flagging people's crappy cheap hardware, Microsoft would get a lot of grief and have to spend a chunk of money on customer support and PR.
I dunno, I've had overclocked memory where memtest86 wouldn't report errors in multiple passes, yet TestMem5 (weird win32 tool from some .ru site) tells you it's bad in about ten seconds. memtest86 doesn't seem to be that stressful for memory.
memtest86 needs to run alone so it can test the entire physical RAM, and the actual, meaningful tests that actually test RAM and not L1 take forever to run. What would you have the OS do?
It still doesn't have full access; it doesn't memtest its own code segment, for instance. Memtest can also be run as a usermode program (although it needs permission to lock the pages in memory).
Operating systems zero memory before handing out pages, I wonder what the perf impact would be to run at least a very basic test on a page before allocating it. Obviously allocation would be slower, but maybe it's worth it in some scenarios.
I was thinking of 386 era computers and strictly speaking it was just parity RAM, not ECC. Which often led to annoyances when a single parity error would cause your whole computer to halt.
Wikipedia says "By the mid-1990s, most DRAM had dropped parity checking as manufacturers felt confident that it was no longer necessary.". https://en.wikipedia.org/wiki/RAM_parity
I'd love to read a technical deep dive on RAM reliability over time. You'd think with increasing memory cell density and overall larger RAM the number of absolute errors on a desktop computer would be going up over time.
I can remember that 486 motherboards in Packard Bell systems (quite the entry-level brand...) frequently used 36-bit ECC FP SIMMs.
Printers and plotters from this era used ECC modules most of the time.
But by the end of the century, they were replaced by unbuffered, unregistered 16/32/64-bit modules.
Every mid-range server still uses ECC. Entry-level HPE servers use ECC UREG (unregistered, 9-chip) modules, while mid-range and up use ECC REG modules (9 chips plus an interface controller onboard).
Ironically, UREG modules are more expensive than ECC REG.
Also, most workstations used ECC modules, though less frequently in the last 4-5 years.
I know the Pentium Pro/2/3 chipsets and motherboards all(?) supported it. Unsure of the Pentium 1 as the 430TX on my Tyan Tomcat IV doesn't, and that is a dual processor board. 486 and earlier likely depended on the chipset as there were many.
At work I have two working slot 1 PIII 800's each with 1GB ECC (4x 256MB DIMMS) on a regular Asus board (doing nothing but waiting to go home with me one day). The board reports the RAM is in fact ECC and that it is enabled.
Apple does not make any Macs with ECC RAM any more.
This was the reason I always used Mac Pro machines, and at the end, iMac Pro: those machines had ECC RAM, and incidentally always seemed more stable over months and years than the MacBook Pros I had, which didn’t.
(Of course, there are a lot of reasons why a laptop might crash or malfunction an order of magnitude more than a desktop machine. The benefits of ECC RAM are really hard to catch definitively in the wild. We know they exist though, so I’m in the “it’s insane not to use ECC RAM for any use case that persisting data and then reading it back later” camp.)
Oh, my bad; I thought they didn't sell that any more. Indeed, that machine has ECC, (as do all Mac Pros, IIRC... so it will be interesting to see what happens when the Apple Silicon version of the Mac Pro is released).
Commodity PCs back in the 80s and 90s didn't have error correction, but they did have parity (IIRC). Correction requires several extra bits per word, compared to parity's single extra bit per byte. I recall around 1990 you could get your 30-pin SIMMs as 9-bit (parity) or 8-bit (no parity), and virtually all of the PCs at the time wanted the 9-bit modules. Parity can't correct errors, but at least it can raise an exception when you read something that's had a bit flip.
On current memory, parity and correction get applied to the entire bus width. So with parity taking one bit and correction taking about log2(width) bits, it only requires 7-8 bits to apply both of them to an entire memory channel. (DDR4 has 64/72-bit channels, and DDR5 has 32/40-bit channels.)
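That overhead arithmetic can be checked directly: single-error correction (SEC) needs r check bits satisfying 2^r >= data_bits + r + 1, and one more bit upgrades it to SEC-DED (detect double errors), which is the scheme real ECC DIMMs use. A quick sketch:

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a single-error-correct, double-error-detect Hamming code."""
    r = 0
    while 2 ** r < data_bits + r + 1:   # SEC condition: syndrome must address
        r += 1                           # every data and check bit, plus "no error"
    return r + 1                         # +1 overall parity bit for double detect

for width in (8, 32, 64):
    print(width, secded_check_bits(width))
# 8 data bits need 5 check bits, 32 need 7, 64 need 8 -- which is why a
# 64-bit channel plus 8 ECC bits (a 72-bit DIMM) covers SEC-DED exactly.
```

This is also why protecting a whole 64-bit channel (8 extra bits, 12.5% overhead) is far cheaper than protecting each byte individually (5 extra bits per byte, 62.5% overhead).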
Oh god this gives me flashbacks. I never knew how bad corruption was on home computers until I started using git annex on a multi-terabyte file collection. No matter which computer or disk I used, it would inevitably happen. Then I started to wonder how much corruption had crept in before I was using git annex and never noticed it.
I've basically given up on maintaining large digital media collections for long term purposes. It puts my OCD in overdrive.
How about storing the data on a filesystem that performs data checksumming, also configured in a RAID-1-like mirroring profile to enable any corruption to also be corrected? Filesystems like ZFS or BTRFS could be a good choice here.
> One of the prime prerequisites of e.g. zfs actually is using ECC RAM.
ECC RAM is not a prerequisite for using ZFS.
Matt Ahrens, co-creator of ZFS and still one of the main developers, said this [1]:
"There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. Actually, ZFS can mitigate this risk to some degree if you enable the unsupported ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). This will checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error.
I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS."
True, I phrased it too strongly. Let's put it this way: ZFS (or btrfs, ...) can make for a bullet-proof system through checksumming and self-healing, but its Achilles' heel is non-ECC RAM. With ECC, at least in theory and barring universal disaster, data can remain intact indefinitely; without it, ZFS remains at the mercy of whatever the RAM gets wrong. That's what I remember learning about it a while ago.
However, even if you're not using ECC RAM, you're much better off using ZFS or btrfs, because due to their frequent checksumming and checksum validations, these filesystems will usually detect memory corruption much sooner than if you didn't use them.
This could be immensely helpful in scenarios such as the great^4-parent poster's.
Note that bad hardware is not limited to non-ECC RAM. ZFS and btrfs help just as much in detecting other kinds of bad hardware, such as bad SATA cables, bad disks, bad disk/SATA controllers, bad CPUs, bad power supply, etc.
But of course, once these checksum errors are flagged by ZFS/btrfs, it's a signal to test and fix/replace your hardware, not keep using a machine with bad hardware.
And yes, while ZFS and btrfs cannot fix errors that happen before the checksumming takes place (e.g. due to bad RAM, bad CPUs, etc), they can still detect these kinds of errors (in some cases, at least), especially when they happen after checksumming already took place. And they definitely can detect and fix errors in the rest of the data-to-storage-and-back path (e.g. bad disk cables, bad disks or disk controllers, etc).
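For anyone not on ZFS/btrfs, the same detect-by-checksum idea can be approximated at the application level. A minimal sketch (the manifest format and helper names are made up for illustration; like a ZFS scrub without redundancy, it detects corruption but cannot repair it):

```python
import hashlib
from pathlib import Path

def snapshot(root: Path) -> dict:
    """Record a SHA-256 digest per file, to compare against later reads."""
    return {str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify(root: Path, manifest: dict) -> list:
    """Return the files whose current contents no longer match the manifest."""
    current = snapshot(root)
    return [name for name, digest in manifest.items()
            if current.get(name) != digest]
```

Run snapshot once after ingesting data, then verify periodically; a non-empty result means something (disk, cable, controller, or RAM on the write path) corrupted a file since it was recorded.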
My big complaint is actually digging through the mess of MB/RAM/BIOS documentation to actually verify that I have the right combination of processor/motherboard/memory for ECC AND TEST IT'S WORKING.
AMD seems to support ECC on some consumer-level chipsets, but it's a nightmare to sort it all out and verify that all the bits are actually functioning properly.
Were you able to find a consumer motherboard that supports fault injection (or whatever it's called)? I was able to verify that the motherboard reported ECC was working and performed a 24-hour burn-in, but I have been unable to perform a functional test showing that ECC had corrected an error.
I'm with you, but it's worth noting that errant bit-flips are also the most convincing argument for vertically integrated file-systems like ZFS and BTRFS.
I was under the impression running such file systems without ECC was kind of reckless and not a solution either? I recall ZFS being “useless” (not technically entirely bulletproof) without ECC.
ECC can only report more than one bit flip per word; it can't correct them. If your memory is going bad, the chance of multi-bit failure is going to be much higher than a stray bit flip from other causes (mind you, I have 32 GB of ECC running right now and have seen zero bit flips in 1.5 years).
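That correct-one/detect-two behavior is exactly what standard SEC-DED Hamming codes give you. A toy Hamming(8,4) sketch demonstrates it (real DIMMs use SEC-DED over 64-bit words, but the mechanics are the same):

```python
def encode(nibble: int) -> list:
    """Hamming(7,4) plus an overall parity bit: SEC-DED over 4 data bits."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8                   # c[1..7] = Hamming positions, c[0] = overall parity
    c[3], c[5], c[6], c[7] = d    # data goes in the non-power-of-two positions
    c[1] = c[3] ^ c[5] ^ c[7]     # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]     # ... bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]     # ... bit 2 set
    c[0] = sum(c[1:]) % 2         # overall parity enables double-error detection
    return c

def decode(c: list):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    c = c[:]
    syndrome = 0
    for i in range(1, 8):
        if c[i]:
            syndrome ^= i         # xor of set-bit positions; 0 for a valid word
    overall = sum(c) % 2          # 0 if overall parity still holds
    if syndrome and overall:      # one flip: syndrome is its position
        c[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                # syndrome set but parity holds: two flips
        return None, "uncorrectable"
    else:                         # syndrome clear: at worst c[0] itself flipped
        status = "corrected" if overall else "ok"
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return data, status
```

Flip one bit of a codeword and decode recovers the data; flip two and it can only tell you the word is bad, which is exactly the failure mode described above for memory that's actively dying.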
I'm not arguing against ECC. I have it and run it myself. But this blog post is an argument for ZFS and regular backups. Failures can occur at many other places than RAM. I had a drive that had corruption and I thought went bad. Turns out, bad SATA cable. A week later I upgraded from cable internet to 1Gbit fiber. My speeds were not at all what I expected. Turns out... bad ethernet cable. Lightning can strike twice I guess.
> By the way, you’d better believe that your disk(s) have all kinds of error correction schemes built into them, which work automatically and transparently.
> If your memory is going bad, the chance of multiple bit failure is going to be much higher than the stray bit flip from other causes
It's completely possible for a single DRAM cell to go bad permanently.
> But this blog post is an argument for ZFS
ZFS won't save you if what you tell it to put into a file is wrong. And I think that corrupting ZFS' own in-RAM data structures voids your ZFS warranty.
> and regular backups.
Of course you should have regular backups. But do you really feel like spending hours restoring from a backup (and then maybe weeks finding things you didn't restore)? Do you feel like trying to guess how far back you have to go to get to a good backup of a corrupted file? Do you want to lose whatever new work got corrupted between the time your memory failed and the time you finally noticed it?
OP's machine was corrupting data for weeks. You can't just roll all your work back to several weeks ago. Most of us can't, anyway.
You need backups and ECC.
> I had a drive that had corruption and I thought went bad. Turns out, bad SATA cable.
If a bad cable did that, something wasn't doing SATA CRC checking the way it was supposed to (or wasn't reacting to detected CRC errors the way it should have). Similar in spirit to ECC checking for memory.
> My speeds were not at all what I expected. Turns out... bad ethernet cable.
It probably got slow because of retransmissions. If there hadn't been CRCs on both the Ethernet layer and higher layers, it could possibly have just silently corrupted your data.
You need to cover as much of a computer system as you can with error detection and correction, or you always lose. Even one big failure in a lifetime can pay for a lot of ECC RAM.
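The detect-then-retransmit behavior mentioned above is easy to demonstrate with the same CRC-32 that Ethernet uses as its frame check sequence (the payload here is made up):

```python
import zlib

frame = b"some payload bytes"
fcs = zlib.crc32(frame)              # Ethernet appends CRC-32 as the FCS

# Receiver side: recompute and compare. On mismatch the frame is dropped
# and a higher layer (e.g. TCP) retransmits -- slow link, but no silent lies.
corrupted = b"some paylaod bytes"    # two adjacent bytes swapped in transit
print(zlib.crc32(frame) == fcs)      # True  -- clean frame accepted
print(zlib.crc32(corrupted) == fcs)  # False -- corruption detected, frame dropped
```

CRC-32 is guaranteed to catch any error burst of 32 bits or fewer, which covers the kind of localized damage a marginal cable produces; that's why a bad Ethernet cable usually shows up as lost throughput rather than corrupted files.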
> If your memory is going bad, the chance of multiple bit failure is going to be much higher
When a chip starts going bad, the thousands of error reports will more than make up for the errors that slip through.
Just make sure reporting is working.
> Well...
Well what?
Those drives were lying about finishing writes, which is barely related to losing data that was written, and especially data that was written more than a few seconds ago. And if drives didn't have ECC, data loss would be a million times more common.
It's very strange that DDR5 mandates internal ECC within each physical package, but not on the longer and possibly more EMI sensitive connections between the memory chips and controller.
I would have thought that the additional cost would be minimal (additional wiring on the logic board in some cases), but maybe this is just more artificial market segmentation?
They have internal ECC because it allows them to have higher yields; what would be considered a faulty chip in DDR4 can now be sold for DDR5. So it is effectively a cost-reducing measure for them. Exposing it to the user would not only cost extra pennies, but potentially have users go "hey, this stick is shit, look at how many correctable errors it is producing, please replace it."
Thinking about it, when a DIMM doesn't have extra RAM chips for ECC, you could reuse the ECC pins to run a parity bit to each chip. It would cost nothing. And with DDR5 having 2x8 ECC pins, it would work with any chip layout: x16, x8, or x4.
Especially given the tighter electrical tolerances with running 3+ GHz signals across a PCB. I have dialed in an OC from 2x 16 GB Hynix A-die (sold as 6600 MT/s). Using buildzoid's settings I got 7000 MT/s stable, but 7200 MT/s runs into intermittent errors on y-cruncher VST. I don't think there's much headroom in voltages without degrading life. I haven't tried loosening timings, so that's a thread worth tugging on.
I have had a scenario where the machine failed to boot at 7200 until I reseated the memory, so there's definitely a limit on the physical media being hit.
DDR5 will have some minimal ECC on the stick but critically does not mandate full runs to the CPU. Or in Wikipedia's words:
> Unlike DDR4, all DDR5 chips have on-die ECC, where errors are detected and corrected before sending data to the CPU. This, however, is not the same as true ECC memory with an extra data correction chip on the memory module. DDR5's on-die error correction is to improve reliability and to allow denser RAM chips which lowers the per-chip defect rate. There still exist non-ECC and ECC DDR5 DIMM variants; the ECC variants have extra data lines to the CPU to send error-detection data, letting the CPU detect and correct errors that occurred in transit.
So in some ways it is better than previous generations, but it gives vendors another excuse not to implement full-coverage ECC. That's my guess of why GP said it complicates things.
I wish Intel and AMD would support in-band ECC on their consumer platforms. That way you can use the same DIMMs and you make the choice - either sacrifice a few percent of memory for memory correction, or don't. Intel _do_ support this on _some_ workstation products, but it's very uncommon and naturally those platforms are expensive.
I've never had ECC memory to my knowledge. What have I been missing out on, in practice? Can we confidently say that x% of my past BSOD were the direct consequence?
> I've never had ECC memory to my knowledge. What have I been missing out on, in practice? Can we confidently say that x% of my past BSOD were the direct consequence?
I very highly doubt it. I remember BSODs, and Guru Meditations (on the Amiga) before that. Once I switched to Linux, suddenly no more BSODs and very few kernel panics. I've at times had my desktop reach six months of uptime.
I think bit-flips are a great excuse for unreliable software.
BTW I'm not saying ECC is unnecessary: I'm saying it's unlikely the lack of ECC was the reason for most of the BSOD you saw throughout the ages.
The bitflip would also have to cause an error in logic that would result in a system lockup. That's unlikely, given that most information flowing through ram is probably data, not logic.
You may have been lucky enough to never have experienced a bad module. But when it does happen, it can have major consequences (losing data, having to reinstall an OS), all while being totally preventable.
Not enough people realise that the driving reason for ECC in home systems is this, not random once-in-a-blue-moon bit flips
I’ll always buy ECC after multiple data loss events from ram suddenly going bad, worth every penny to have ECC ring alarm bells rather than put 2+2 together after random BSODs and silent corruption
It's worth keeping in mind that the chipset has zero involvement in ECC. The CPU is directly attached to the memory slots. They're using the chipset as an expensive dongle.
Well, the chipset does enable error detection and correction features, because it is the responsibility of the chipset to raise certain interrupts or assert this or that signal in certain error cases. You may view this as artificial segmentation but without the more advanced management engine in the Q680 and W680 chipsets, the Z690 and all lower chipsets that contain the simpler "client" i.e. consumer management engine can't enable ECC.
You're saying the memory controller sends an error signal to the chipset, and the chipset sends it back to the CPU package?
Even so that's an extremely trivial task. It doesn't need a "more advanced" anything. It needs them to not deliberately disable the code or remove the tiny tiny amount of circuit.
They do it with rear I/O too. Motherboards with anything but workstation or flagship consumer chipsets typically have an anemic port selection, which is silly because for many half the reason to choose building a desktop over buying a laptop is to be able to plug in a lot of stuff without a bunch of hubs/docks/etc.
Then the competition is working as expected: Ryzen had ECC unofficially for some time, and now Intel has it. There are plenty of other ways to segment users, e.g. memory channels, PCIe lanes, etc.
Exactly. $450 for a motherboard just to get ECC support is ridiculous. I don't know how it is with AM5, but on AM4, my understanding is that you could use ECC memory with many normally-priced motherboards. (Even if it wasn't "officially" supported.)
Mentioning W680 feels pointless. You've always been able to buy high-end workstation-class motherboards and stick ECC in them. The entire point of the article is that all computers should be using ECC RAM, not just the expensive, workstation class computers.
> You don't /need/ it, though, so this is neither here nor there.
The whole point of this article and discussion says otherwise. Don't come to me to complain about that. Your comment should be a top-level comment complaining at the author.
I fully agree with the author that ECC should not be reserved for expensive computers, of course, but I'm just here to point out that W680 is not a response to the author's concerns at all, period. W680 is a continuation of Intel's status quo.
One of the main reasons I buy Xeon desktops is the ECC. With 128 GB of memory, and 1 bitflip/GB/year average error rate, it seems too risky to not use ECC for production work.
Real world numbers are closer to 1 bitflip/GB/hour than year because bit flips are highly correlated.
“A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance '09 conference.[6] The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (2.5 × 10−11 error/bit·h) and 70,000 (7.0 × 10−11 error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.” https://en.wikipedia.org/wiki/ECC_memory
A random stick of non-ECC memory might be far above average or have several errors per minute, but you just don't know.
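The figures quoted above can be sanity-checked with a little arithmetic. The inputs below are the study's upper-bound rate and the per-year assumption from earlier in the thread, not new measurements:

```python
BITS_PER_GB = 8 * 2**30  # one GiB expressed in bits

# Upper bound from the Google SIGMETRICS '09 study: 7.0e-11 errors per bit-hour.
rate_per_bit_hour = 7.0e-11
errors_per_gb_hour = rate_per_bit_hour * BITS_PER_GB
print(f"{errors_per_gb_hour:.2f} errors/GB/hour, i.e. one error per GB "
      f"roughly every {1 / errors_per_gb_hour:.1f} hours")

# The far more optimistic assumption upthread: 1 flip/GB/year across 128 GB.
flips_per_year = 1 * 128
print(f"128 GB at 1 flip/GB/year -> {flips_per_year} flips/year, "
      f"about one every {365 / flips_per_year:.1f} days")
```

The first figure lands near the "1 bit error per gigabyte per 1.8 hours" quoted from the study; even the optimistic per-year rate still implies a flip every few days on a 128 GB machine, which is the gap between the two positions in this thread.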
That’s a comparison of on-die ECC (ODECC) vs. non-ODECC memory chips, not DDR5 vs. DDR4.
ODECC was added because manufacturers wanted to use DDR5 chips that would have had unacceptably high error rates without it. In other words, that improvement happens before the binning process, so they are selling chips with a vastly higher innate error rate. The average DDR5 stick could actually be worse than the average DDR4 stick; it’s hard to say without large-scale testing from multiple manufacturers.
I don't know. I have been building my own computers for about 30 years, starting with an Intel 486, and using them heavily. A couple of times I bought ECC memory, but most of the time I didn't. For the last 4 years I have had 64 GB of non-ECC RAM. Maybe I had some problems with non-ECC memory during all those years, like sudden BSODs, freezes, or some corrupted data, but never anything significant. So I honestly don't see a reason for drama here.
I believe that it is a right. Every machine I've built in the last ~12 years has had ECC. Nobody is trying to take that ability away from me. I treat my machine as a "workstation". I run several VMs. I keep my machines for a few years. I clean the filters periodically.
Do game machines need ECC? I'd say no. They are optimized for cost-performance. The worst that can happen with a memory error is a lost game.
You have the good fortune of being able to afford lots of expensive hardware. Mandating ECC is basically like mandating seat-belts in all cars: it makes safety cheaper for everyone through the economies of scale. Gamers and libertarians could theoretically disable it if they so chose and the total cost to them would only be a few pennies for the wasted motherboard circuitry.
But he said "right" not "mandate". My point was only that I currently do have an implicit right.
ECC memory isn't that much more expensive. I was fortunate to build my current machine a few months before memory prices doubled. Previous generation Xeon CPUs aren't very expensive. Same with motherboards. And another option is to just buy a used server. These are super cheap now as so many companies are moving to cloud computing.
> Memory manufacturers assure us that desktop RAM is so reliable that it doesn’t need ECC, that the probability of bit flip events is so low that it’s not worth the extra “cost” of ECC
ECC requires extra circuitry, which means more die cost, more testing cost, and more failed chips (due to the added complexity).
Most people have very little need for ECC. The author didn't even know that they wanted ECC until they were unlucky enough to get a stick of RAM that failed, and failed in such a way that the OS still booted but a file was silently corrupted (not that common, because when a chip fails it usually doesn't fail silently like that).
Well, ECC memory is more expensive than non-ECC memory by about 10-20% (according to a quick Google search). Part of that may be markup due to people who want ECC being willing to pay for a highly reputable brand sold by a highly reputable seller (they won't be buying ECC in bulk from the Amazon marketplace).
Virtually every piece of consumer hardware and software has security vulnerabilities, all the way down to CPUs and RAM. Perhaps computing, in general, is too hazardous for human rights work?