Heat and stress testing of computer components in the lab does not necessarily predict how those components will behave in the field, according to a study done by Google.
When you are Google, and you have millions of server nodes in production using a mix of different technology, you can actually study component failures with a statistically significant sample. That is what Google has done, tracking memory failures in a subset of its servers over the past two and a half years.
Google techies Eduardo Pinheiro and Wolf-Dietrich Weber and their collaborator, Bianca Schroeder of the University of Toronto, have produced a research paper on the subject, entitled DRAM Errors in the Wild: A Large-Scale Field Study. In it, they point out that the number of soft errors - where error correction algorithms can keep a server running after fixing the memory errors - is lower than you might expect in the field based on lab tests. This is good. But the number of hard errors - such as when bits get stuck and a machine crashes and you need to replace a memory module - is a lot higher than current lab tests from memory and server makers might suggest.
Google ran its memory crash tests on six different server platforms in its data centres from January 2006 through June 2008. Three of the six platforms had hardware memory scrubbing technologies that allowed for single-bit soft errors to be washed out of memory systems, at a rate of about 1 GB every 45 minutes, according to Google. The other three platforms didn't have such memory scrubbing electronics, which means soft single-bit errors can accumulate and turn into multi-bit errors.
Google would not say how many machines were in the sample, but rather said that in the 30-month study, the sample had an aggregate of "many millions" of DIMM-days. The servers in the sample used a mix of 1 GB, 2 GB, and 4 GB DIMMs, and DDR1, DDR2, and FB-DIMM memory types. Google does not discuss what processor architecture it uses, but there is little doubt that most - if not all - of Google's machines are x64 (with maybe some still being x86) architecture.
Google had a monitor program that logged correctable errors, uncorrectable errors, CPU utilization, temperature, and memory allocation to see what the relationships were.
One of the interesting bits is that Google discovered that some servers are just plain crankier than others, which is something that system administrators can attest to, even for identical machines. "Some machines develop a very large number of correctable errors compared to others," the authors of the study write. "We find that for all platforms, 20 per cent of the machines with errors make up more than 90 per cent of all observed errors for that platform."
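The 20 per cent figure is the sort of thing any admin with per-machine error logs can check for themselves. Here is a minimal sketch in Python of how that share might be computed. The function name and the error counts are made up for illustration, and none of it comes from Google's data or tooling.

```python
def share_from_worst_machines(error_counts, fraction=0.2):
    """Return the share of all errors logged by the worst `fraction`
    of machines, counting only machines that had at least one error.

    `error_counts` is a hypothetical list of per-machine correctable
    error totals; it is not taken from the Google study.
    """
    # Only machines with errors are considered, as in the quoted finding.
    with_errors = sorted((c for c in error_counts if c > 0), reverse=True)
    if not with_errors:
        return 0.0
    # Size of the "worst" group, at least one machine.
    top_n = max(1, round(fraction * len(with_errors)))
    return sum(with_errors[:top_n]) / sum(with_errors)


# Made-up counts: two error-free machines and ten machines with errors.
counts = [0, 0, 1000, 500, 5, 3, 2, 1, 1, 1, 1, 1]
print(f"{share_from_worst_machines(counts):.1%} of errors come from the worst 20%")
```

With these invented numbers, the two crankiest of the ten machines with errors account for about 99 per cent of everything logged, which is the same kind of lopsided distribution the authors describe.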
Across all server platforms tested by Google and all of their DIMMs, 8.2 per cent of the memory modules have correctable errors and an average DIMM has almost 4,000 correctable errors per year, if it is on the blink. Some of the server types among the six that Google monitored had much higher error rates than others, but the reasons why were not obvious.
"There is not one memory technology that is clearly superior to the others when it comes to error behaviour," the authors write. So that isn't it. Whatever the problem is, it is not attributable to different memory manufacturers - Google couldn't find any correlation between who made the memory and error rates. Pinheiro, Weber, and Schroeder speculate that higher memory error rates are caused by DIMM layout and differences in the error correction algorithms used by different memory makers.
Interestingly, the platforms that did not have chipkill error correction - which can recover from multiple bit errors in memory subsystems - had lower correctable error rates, but their servers could not survive multi-bit errors. Clearly, there is some kind of tradeoff here. But Google's research also suggests that more powerful error correction (chipkill versus normal ECC scrubbing) can reduce unrecoverable error rates by a factor of 4 to 10.
The point is, memory error rates on servers are much higher than the lab tests done to date might suggest. Depending on the server platform, Google said it saw per-DIMM correctable error rates that convert to something on the order of 25,000 to 75,000 FIT per Mbit, where FIT means failures in time, counted per billion hours of operation. By comparison, prior lab tests (which use the stress of higher utilization or temperature to simulate a longer time in service) showed failure rates of between 200 and 5,000 FIT per Mbit. This is a huge difference, and you can see now why Google invented its Google File System and massive clustering done on the cheap.
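To put those FIT figures in more familiar terms, here is a rough back-of-the-envelope conversion in Python. It is a sketch under stated assumptions, not a calculation from the paper: the function name, the 1 GB DIMM size, the 8,760-hour year, and the choice of 1 GB = 8,192 Mbit are all picked for illustration.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours, ignoring leap years
MBIT_PER_GB = 8 * 1024      # assumes binary units: 1 GB = 8,192 Mbit


def errors_per_dimm_year(fit_per_mbit, dimm_gb):
    """Convert a rate in FIT per Mbit into expected errors per DIMM per year.

    One FIT is one failure per billion (1e9) hours of operation, so the
    per-Mbit rate is scaled by the DIMM's size in Mbit and by the hours
    in a year.
    """
    mbits = dimm_gb * MBIT_PER_GB
    return fit_per_mbit * mbits * HOURS_PER_YEAR / 1e9


# The lab range and the field range quoted above, on an assumed 1 GB DIMM.
for fit in (200, 5_000, 25_000, 75_000):
    print(f"{fit:>6,} FIT/Mbit -> {errors_per_dimm_year(fit, 1):,.0f} errors per year")
```

On those assumptions, a 1 GB DIMM running at 25,000 FIT per Mbit would log roughly 1,800 correctable errors a year, and at 75,000 FIT per Mbit roughly 5,400, against a few hundred at most for the lab figures.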
The other interesting finding in the research, and one that system admins will nod their heads at almost immediately, is that the number of correctable errors increases as memory modules age, with error rates spiking up after between 10 and 18 months in the field. The incidence of uncorrectable errors goes down over time, however, as crappy components are replaced and hardy ones are left in the systems.
Google's research also suggests that faster and denser memory technologies have not appreciably increased memory error rates, contrary to what many server vendors and customers had feared - hence the invention of chipkill, to compensate. And while higher temperatures can cause higher memory error rates, the effect is not as large as many would think. Instead, error rates are strongly correlated with the utilization rates on the DIMMs. Temperature is not the biggest cause of stress - swapping data in and out is. ®