When I recently priced out a new PC, I opted for one that included error correction code (ECC) memory -- a type of system memory with built-in advanced error correction -- as a standard feature. I was concerned that without ECC I'd be more prone to memory errors, and I was willing to pay a little extra for the insurance policy.
There's no question ECC memory comes at a premium. Not only is the memory itself more expensive, but the technology required to support it must be built into the motherboard of the computer that uses it (which, in turn, raises the price of the system again).
To understand what ECC is really for, you have to delve into the history of commodity PC memory.
History of error correction in memory
When the desktop PC first appeared, memory was not only an expensive commodity, but it was also far more prone to errors than it is today. The traditional stopgap for memory errors was a technique called parity, where each byte of memory had an additional ninth bit associated with it as an error-checking measure. If the parity check failed, the system would be stopped to prevent data corruption.
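The parity scheme described above is simple enough to sketch in a few lines. This is an illustrative Python model, not how the hardware actually implements it -- real parity logic lives in the memory controller -- but the arithmetic is the same:

```python
def parity_bit(byte):
    """Even parity: the ninth bit is chosen so the total number of
    1-bits across all nine bits is even."""
    return bin(byte).count("1") % 2

def parity_ok(byte, stored_parity):
    """Recompute parity on read; a mismatch means a detected error,
    at which point the system of that era simply halted."""
    return parity_bit(byte) == stored_parity

b = 0b10110010                 # four 1-bits, so the parity bit is 0
p = parity_bit(b)

flipped_one = b ^ 0b00000100   # a single-bit error...
assert not parity_ok(flipped_one, p)   # ...is detected

flipped_two = b ^ 0b00000110   # but two flipped bits cancel out...
assert parity_ok(flipped_two, p)       # ...and slip through undetected
```

The last assertion demonstrates parity's blind spot that the article turns to next: two bit-flips in the same byte leave the parity count unchanged.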
However, there were four problems with this approach:
The only way to deal with a memory error was to shut down the whole system. This often meant lost work, even if you saved often (simply because the process of saving was that much slower!).
Parity couldn't detect an error in two bits of a given byte, because two flipped bits leave the parity count unchanged. While the chances of this happening are statistically very low, it was still seen as a defect in the design -- especially as the amount of information processed by computers exploded each year, raising the odds of memory errors in general.
Parity memory was slightly more expensive than regular memory, and as the PC market became more competitive, parity was one of the first features to be dropped entirely from lower-end PCs.
The quality of memory components increased dramatically over time, drastically reducing the need for aggressive error-checking.
Because of these issues, parity memory gradually faded from common use. Today it's scarcely even offered.
Error correction code resembles auto insurance
Another technology eventually arose as a replacement for parity: error correction code. ECC has several advantages over parity. For one, it can detect and repair single-bit errors, and do so without having to stop the whole system. Multiple-bit errors can still be detected but not repaired, though the odds of one occurring during the lifetime of a PC are astronomically low unless the memory itself is defective. ECC is like auto insurance: It covers you for the majority of things that can go wrong, but it can't prevent a multi-car pileup.
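How can a code repair an error rather than just flag one? The classic textbook illustration is a Hamming(7,4) code, which protects 4 data bits with 3 parity bits. (Real ECC DIMMs use a wider SECDED code over 64 data bits, but the principle is the same.) A minimal sketch:

```python
def hamming_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword.
    Positions 1..7; positions 1, 2 and 4 hold parity bits."""
    d = [(nibble >> i) & 1 for i in range(4)]    # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]                      # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                      # covers positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]                      # covers positions 4,5,6,7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]

def hamming_decode(bits):
    """Return (corrected 4-bit value, position of the repaired bit,
    or 0 if no error was found)."""
    bits = bits[:]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # the syndrome *is* the bad position
    if syndrome:
        bits[syndrome - 1] ^= 1       # flip it back -- no halt required
    data = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(data)), syndrome

word = hamming_encode(0b1011)
word[4] ^= 1                          # simulate a single-bit soft error
value, fixed_at = hamming_decode(word)
assert value == 0b1011                # data recovered transparently
```

Each parity bit watches an overlapping subset of positions, so the pattern of failed checks (the syndrome) uniquely names the flipped bit -- which is exactly why ECC can keep running where parity had to halt.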
So who really needs error correction code, and why?
I posed this question to Mike Sanor, compatibility and performance manager at Crucial Technology, a division of DRAM manufacturer Micron Technology that sells memory directly to end users.
Sanor said that ECC is most useful for "servers and precision workstations, but not commodity desktops." The reason is simple: The error rate in today's consumer-level memory is so low that for most everyday applications, adding ECC is pure overkill. For standard DDR2 memory, the error rate is something like 100 soft errors over 1 billion device hours. If there are 16 memory devices or chips on a given module, that translates to one soft error every 30 years. Even if you only have two such DIMMs in a system, that still works out to less than one error over the entire service life of the system.
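A back-of-the-envelope check of those figures, assuming the module is powered around the clock, lands in the same multi-decade ballpark (the arithmetic below gives roughly 70 years per 16-chip module; the article's 30-year figure presumably bakes in more conservative assumptions):

```python
# Inputs taken from the figures quoted above.
errors_per_device_hour = 100 / 1e9    # 100 soft errors per 1e9 device-hours
devices_per_module = 16               # 16 DRAM chips on one DIMM
hours_per_year = 24 * 365

module_errors_per_hour = errors_per_device_hour * devices_per_module
years_between_errors = 1 / (module_errors_per_hour * hours_per_year)
# ~71 years between soft errors for one module; two DIMMs halve that
# to ~35 years -- either way, decades, longer than any desktop's life.
```

The exact number matters less than the order of magnitude: at these rates, a commodity desktop is statistically unlikely to see even one soft error before it's retired.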
So where do memory errors come from, if not from manufacturing defects in memory itself? According to Sanor, "One of the biggest problems we've seen is when the motherboard doesn't reliably supply the needed voltage to drive the memory module. This could be due to a low-end power supply, but it's also sometimes due to the power regulator on the motherboard itself. If the data going into the memory module is bad to begin with, ECC will not help."
Windows improvements obviate memory correction
Another reason why memory correction has become less urgent for commodity systems is better operating system design. Windows 95, Windows 98 and Windows ME were not very discriminating about the memory used in the system. But the NT family of Windows products (NT, 2000, XP, Server 2003, Vista and upwards) attempts to determine the reliability of the memory into which the kernel and its essential services are loaded.
This is partly why, when some people tried to upgrade from Windows 98 to XP on the same hardware, XP bluescreened when 98 hadn't. XP was that much more scrupulous about memory quality and wasn't going to allow itself to run if there was a risk of data corruption. Vista adds another element to this mix -- Address Space Layout Randomization -- which means that parts of memory an older version of Windows might never have touched are now put to use.
Since memory is as good as it is these days, who is ECC really for? "If your job is data, then ECC is for you," says Sanor. The more rigorous the need for data precision -- as in CAD, engineering, math and finance -- the more useful ECC will be (and the easier it will be to justify the cost).
The bottom line: Most servers should have error correction code memory, as should high-end workstations. But it's hard to cost-justify adding ECC memory to a commodity desktop PC. What matters more in a desktop PC is that the motherboard, power supply and other elements are well-engineered.
About the author:
Serdar Yegulalp is editor of Windows Insight (formerly the Windows Power Users Newsletter), a blog site devoted to hints, tips, tricks and news for users and administrators of Windows NT, Windows 2000, Windows XP, Windows Server 2003 and Vista. He has more than 12 years of Windows experience under his belt and contributes regularly to SearchWinComputing.com and SearchSQLServer.com.