Knowing just what breaks a storage box is of obvious interest to data center admins. It's quite reasonable to conclude the blame should be heaped on the 80-some platters spinning all day at 7,200 RPM.
But a recent study presented at the USENIX Conference on File and Storage Technologies argues that disk failure isn't nearly the whole story. Other components in a storage subsystem are often the point of failure, although their failings are still treated as disk faults. This results in unnecessary disk replacements — and inevitably an incomplete perspective on storage system resiliency.
The study, titled "Are Disks the Dominant Contributor for Storage Failures?", was authored by Weihang Jiang, Chongfeng Hu, and Yuanyuan Zhou of the University of Illinois department of computer science, and Arkady Kanevsky of Network Appliance.
Over a period of 44 months, the group analyzed storage logs from about 39,000 commercially deployed storage systems. They estimate the systems together comprised about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. The researchers examined four classes of system: near-line (backup), low-end, mid-range, and high-end.
While the findings do show disk failures contribute to 20-55 per cent of storage subsystem failures, other components such as physical interconnects (broken wires, shelf enclosure power outages, HBA failures, etc) and protocol stacks (software bugs and compatibility issues) also account for a significant percentage of problems.
The group states that recent studies on storage failures have fallen short by focusing too heavily on disk malfunctions. For example, in June, Google released a paper that disputed disk manufacturers' reliability claims from a user perspective. A good start, write the researchers.
"But as this study indicates, there are other storage subsystem failures besides disk failures that are treated as disk faults and lead to unnecessary disk replacements," their paper claims.
The research indicates that between 27 and 68 per cent of storage subsystem failures come from physical interconnects, and between 5 and 10 per cent result from protocol stack errors. Because of these component failures, even systems built on slower, more reliable disks, such as near-line backup, fail more often than their disks alone would suggest.
"These results indicate that, to build highly reliable and available storage systems, only using resiliency mechanisms targeting disk failures (e.g. RAID) is not enough," the study states. "We also need to build resiliency mechanisms such as redundant physical interconnects and self-checking protocol stacks to tolerate failures in these storage components."
As an example, in low-end storage systems (defined as having embedded storage heads with shelf enclosures) the annualized failure rate (AFR) is about 4.6 per cent. The AFR for the disks only is 0.9 per cent, or only 20 per cent of overall AFR.
Near-line storage disks (mostly SATA) show a 1.9 per cent AFR, but again the whole storage subsystem failure is higher, at 3.4 per cent.
So, on their own, the disks in low-end systems fail less often than near-line SATA disks (0.9 versus 1.9 per cent), but near-line systems as a whole fail less often than low-end systems (3.4 versus 4.6 per cent).
The researchers argue this indicates that "disk failure rate is not indicative of the storage subsystem failure rate," meaning other factors, such as shelf enclosure model and network configuration, strongly affect reliability.
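The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration using the figures quoted above; the function names are made up for the example, and treating the non-disk portion as a simple subtraction is an assumption, not the paper's method.

```python
# AFR figures quoted in the article, in per cent per year.
SYSTEMS = {
    "low-end": {"subsystem_afr": 4.6, "disk_afr": 0.9},
    "near-line": {"subsystem_afr": 3.4, "disk_afr": 1.9},
}


def disk_share(subsystem_afr, disk_afr):
    """Fraction of the subsystem AFR that is attributed to disk failures."""
    return disk_afr / subsystem_afr


def non_disk_afr(subsystem_afr, disk_afr):
    """AFR left for non-disk causes (assumption: contributions simply add up)."""
    return subsystem_afr - disk_afr


for name, afr in SYSTEMS.items():
    share = disk_share(**afr)
    rest = non_disk_afr(**afr)
    print(f"{name}: disks are {share:.0%} of subsystem AFR, "
          f"non-disk AFR is {rest:.1f} per cent")
```

On these numbers the low-end systems have the more reliable disks but the larger non-disk remainder, which is the researchers' point.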
The research team concludes that storage subsystem components cannot be ignored when designing a reliable storage box. They offer some suggestions to improve reliability.
Redundancy mechanisms such as multipathing reduced AFR for storage systems by 30-40 per cent when the number of paths was increased from one to two.
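To get a feel for what a 30-40 per cent reduction means, here is a rough Python sketch. The article does not say which baseline the reduction applies to, so reusing the 4.6 per cent low-end figure as a single-path baseline is purely an assumption for illustration, and the function name is hypothetical.

```python
def afr_after_multipathing(single_path_afr, reduction):
    """Return the AFR after a relative reduction (0.30 means 30 per cent lower)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return single_path_afr * (1.0 - reduction)


# Assumption: the low-end subsystem AFR stands in for a single-path baseline.
BASELINE_AFR = 4.6

for reduction in (0.30, 0.40):
    dual_path_afr = afr_after_multipathing(BASELINE_AFR, reduction)
    print(f"{reduction:.0%} reduction: {BASELINE_AFR} -> {dual_path_afr:.2f} per cent")
```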
The researchers also recommend spanning a RAID group across multiple shelves — and using fewer disks per shelf, with more shelves in the system. This helps reduce the chances of a shelf failure taking out an entire RAID group.
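The reasoning can be made concrete with a small Python sketch that counts how many members of a RAID group sit in the same shelf. The eight-disk group, the shelf names and the two-disk tolerance (as with a dual-parity scheme) are hypothetical values chosen for the example, not figures from the study.

```python
from collections import Counter


def worst_case_loss(shelf_of_disk):
    """Largest number of the group's disks that any single shelf holds."""
    return max(Counter(shelf_of_disk).values())


def survives_shelf_failure(shelf_of_disk, tolerated_failures):
    """True if losing any one shelf stays within the group's disk-loss tolerance."""
    return worst_case_loss(shelf_of_disk) <= tolerated_failures


# Hypothetical layouts: each entry names the shelf holding one disk of the group.
single_shelf = ["shelf-A"] * 8
spanned = ["shelf-A", "shelf-A", "shelf-B", "shelf-B",
           "shelf-C", "shelf-C", "shelf-D", "shelf-D"]

for label, layout in (("single shelf", single_shelf), ("spanned", spanned)):
    print(label, "survives one shelf failure:",
          survives_shelf_failure(layout, tolerated_failures=2))
```

With all eight disks in one shelf, a single shelf failure removes the whole group; spread two per shelf, the same failure stays within what the assumed parity scheme can absorb.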
The full paper is available on the USENIX website. ®