Flash banishes the spectre of the unrecoverable data error
Escape the fatal flaw in RAID
Those who follow storage developments know that there are concerns about the viability of RAID systems.
Rebuild times are so long that the chances of an unrecoverable read error (URE) occurring are dangerously high. What is true for traditional disk, however, is not necessarily true for flash.
Now that traditional magnetic disks have surpassed 4TB, the standard four-disk RAID 5 is dead. Without delving deep into what goes on under the hood, flash behaves differently in an array from traditional magnetic disk.
Let’s look at the maths of rebuild times and how they are different when using flash. To understand all this, first we need to understand UREs and why they matter.
URE not my friend
UREs can be lots of things. They could be a sector gone bad. They could be an undetected error during the writing process that left some of the bits in a sector a little more ambiguous than they should have been. Maybe a transient electromagnetic field depolarised a bit, or a cosmic ray slammed into your drive and your nice cluster of 1s and 0s now contains a 0.5.
Regardless of the mechanism, UREs happen, and they happen with some regularity. All drives – be they magnetic or flash – ship with extra capacity. When blocks of flash or sectors of a disk are permanently unreadable (one type of URE) then the sector is marked bad and a "spare" sector is mapped in.
That is great for drive life, but it doesn't help you read the data that was lost during that URE. Older drives were ~512 bytes per sector (512 bytes usable data space + error correction space + sync + address + block gap makes up one physical sector, if you want to get technical about it). Modern drives are ~4KiB (4,096 bytes) per sector.
One unreadable bit means the loss of a whole sector. You can fit whole files into 4KiB. There are several DLLs in my System32 directory that are less than 4KiB – some with the word "security" in them.
This matters for RAID because a rebuild tends to fail if a URE is encountered during the rebuild and no parity sets remain.
For example, say you have a RAID 5 setup with a dead disk. RAID 5 can only suffer one drive loss. If you encounter a URE during a RAID 5 rebuild there is no other copy of that information, nor any parity data from which to rebuild the data. It is simply gone. Your array has failed.
With a RAID 6 rig you can lose one drive and encounter a URE because RAID 6 was designed to suffer two drive failures. You cannot lose two drives and then suffer a URE.
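The tolerance rules above can be sketched in a few lines of Python. This is a deliberately crude model, not how any real controller works: the `PARITY_SETS` table and the `rebuild_survives` helper are names made up for illustration, and a URE is simply counted as one more lost piece of the affected stripe.

```python
# Hypothetical model: number of losses each RAID level can absorb per stripe.
# Only the two levels discussed in the text are included.
PARITY_SETS = {"RAID 5": 1, "RAID 6": 2}


def rebuild_survives(level: str, failed_drives: int, ure_during_rebuild: bool) -> bool:
    """Return True if the affected stripe can still be reconstructed.

    Assumption: a URE consumes one unit of redundancy, just as a dead
    drive does, so losses are simply added together.
    """
    losses = failed_drives + (1 if ure_during_rebuild else 0)
    return losses <= PARITY_SETS[level]


print(rebuild_survives("RAID 5", 1, ure_during_rebuild=True))   # False
print(rebuild_survives("RAID 6", 1, ure_during_rebuild=True))   # True
print(rebuild_survives("RAID 6", 2, ure_during_rebuild=True))   # False
```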
Calculate the risks
The important question then is how likely are UREs to occur? This is the subject of some debate. Manufacturers post specs for their drives. For the sake of this article we will ignore the arguments about accuracy and brush aside questions about whether the posted bit error rates (BERs) represent the worst case, the average or the median. We will simply accept the posted BERs as fact.
It is time now to look at the maths. The definitive document on the subject is Adam Leventhal's Triple-Parity RAID and Beyond. There is plenty of room for debating Leventhal on this subject – and many do – but if you want to talk about UREs, BERs and the viability of RAID, the discussion starts there.
A bunch of maths is involved in determining RAID viability rates. Rather than try to walk you through it all, I will give you a link to an excellent forum post by user EarthwormJim. He does the math so you don't have to, but there is an even simpler way to approximate the worst case of how much trouble you might be in: use the raw BER numbers.
There are some rules of thumb when looking at what kind of drive will give you what error rate. Simple googling will find exceptions to each of these broad categories but sadly, not many.
Note that we are using TB and PB, not TiB and PiB. 1TiB is what Windows would report as a TB and is 1,099,511,627,776 bytes. 1TB is what drive manufacturers call a TB and is 1,000,000,000,000 bytes.
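For anyone who wants to check the two units against each other, here is a small Python sketch. The constant and function names are my own, chosen for illustration.

```python
TB = 10**12    # bytes in a drive-maker's terabyte
TIB = 2**40    # bytes in a tebibyte (what Windows labels "TB")


def tb_to_tib(terabytes: float) -> float:
    """Convert decimal terabytes to binary tebibytes."""
    return terabytes * TB / TIB


print(TIB)                      # 1099511627776
print(round(tb_to_tib(8), 2))   # 7.28
```

So an "8TB" drive shows up as roughly 7.28 of the units Windows reports.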
- Consumer magnetic disk error rate is one error per 10^14 bits read, or an error every 12.5TB.
- Enterprise magnetic disk error rate is one error per 10^15 bits read, or an error every 125TB.
- Consumer SSD error rate is one error per 10^16 bits read, or an error every 1.25PB.
- Enterprise SSD error rate is one error per 10^17 bits read, or an error every 12.5PB.
- Hardened SSD error rate is one error per 10^18 bits read, or an error every 125PB.
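The conversion behind that list is just a division by eight bits per byte and then by the size of a decimal terabyte. A minimal Python sketch, using the rule-of-thumb exponents from the list (the dictionary and function names are illustrative, not from any vendor data sheet):

```python
BITS_PER_BYTE = 8
TB = 10**12  # decimal terabyte, in bytes

# Rule-of-thumb figures from the list: one URE per 10^n bits read.
BER_EXPONENTS = {
    "consumer magnetic": 14,
    "enterprise magnetic": 15,
    "consumer SSD": 16,
    "enterprise SSD": 17,
    "hardened SSD": 18,
}


def tb_per_error(exponent: int) -> float:
    """Terabytes read, on average, per unrecoverable read error."""
    return 10**exponent / BITS_PER_BYTE / TB


for name, exponent in BER_EXPONENTS.items():
    print(f"{name}: one URE per {tb_per_error(exponent):,.1f} TB")
```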
Putting this into rather brutal context, consider the data sheet for the 8TB Archive Drive from Seagate. This has an error rate of 10^14 bits. That is one URE every 12.5TB. That means Seagate will not guarantee that you can fully read the entire drive twice before encountering a URE.
Let's say that I have a RAID 5 of four 4TB drives and one dies. There is 12TB worth of data to be read from the remaining three drives before the array can be rebuilt. Taking all of the URE math from the above links and dramatically simplifying it, my chances of reading all 12TB before hitting a URE are not very good.
With 6TB drives I am beyond the math. In theory, I shouldn't be able to rebuild a failed RAID 5 array using 6TB drives that have a 10^14 BER. I will encounter a URE before the array is rebuilt and then I’d better hope the backups work.
So RAID 5 for consumer hard drives is dead.
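To put rough numbers on "not very good", here is a Python sketch of one common simplification. It assumes every bit fails independently at exactly the posted rate, which is my assumption for illustration and not a claim about how real drives behave. The function name `p_clean_read` is made up.

```python
import math


def p_clean_read(data_tb: float, ber_exponent: int) -> float:
    """Probability of reading data_tb terabytes without a single URE.

    Assumes independent bit errors at a rate of one per 10^ber_exponent
    bits. Real UREs are not independent, so treat this as a rough guide.
    """
    bits = data_tb * 1e12 * 8
    p_bit_error = 10.0 ** -ber_exponent
    # log1p keeps precision when the per-bit probability is tiny
    return math.exp(bits * math.log1p(-p_bit_error))


print(f"{p_clean_read(12, 14):.2f}")  # 12TB read at 10^14: 0.38
print(f"{p_clean_read(18, 14):.2f}")  # 18TB read at 10^14: 0.24
```

Under this model the rebuild is not strictly impossible even with three surviving 6TB drives, but you would lose the bet about three times out of four. Treating 12.5TB as a hard limit, as the raw BER approach does, is the more pessimistic reading.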
None of this factors in real-world issues. For example, UREs tend to cluster together, which can be really unfortunate. Disk failures also cluster together, so that even RAID 6 is starting to look questionable for consumer drives.
Enterprise magnetic disks move us from a 10^14 BER to a 10^15 BER, but this is not really buying us much. Sure, we can read 10 times as much data from an enterprise drive as we can from a consumer drive, but data requirements are exploding and drive sizes are climbing.
The short answer to our problems is to look to SSDs. Consumer SSDs offer error rates 100 times lower than those of consumer magnetic drives, and enterprise SSD error rates are 1,000 times lower. Suddenly those folk making all-flash arrays look a lot less crazy.
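To see what those ratios mean for a rebuild, this Python sketch counts the UREs you would expect, on average, while reading 12TB from each class of drive. It uses the rule-of-thumb BERs quoted earlier. The names are illustrative only.

```python
READ_TB = 12  # data read during the example RAID 5 rebuild

# Rule-of-thumb figures: one URE per 10^n bits read.
BER_EXPONENTS = {
    "consumer magnetic": 14,
    "enterprise magnetic": 15,
    "consumer SSD": 16,
    "enterprise SSD": 17,
}


def expected_ures(read_tb: float, exponent: int) -> float:
    """Average number of UREs expected when reading read_tb terabytes."""
    bits_read = read_tb * 10**12 * 8
    return bits_read / 10**exponent


for name, exponent in BER_EXPONENTS.items():
    print(f"{name}: {expected_ures(READ_TB, exponent):.5f} expected UREs")
```

The consumer magnetic figure comes out just under one expected URE per rebuild. The enterprise SSD figure is a thousand times smaller.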
Of course, we can't put everything on flash. Even if we have the money, the whole world’s fab capacity comes nowhere close to meeting data requirements. Flash for anything more than your really important data is still something of a pipe dream.
Hot and cold
Despite the doom and gloom, however, all is not lost. Hybrid arrays exist that combine flash and magnetics. The magnetic drives in a hybrid array are typically set up in RAID 6 or RAID 10, which also offers much better resiliency against UREs than RAID 5.
The hot (most frequently accessed) data is typically kept on the flash drives, while the cold (rarely accessed) stuff is demoted to magnetics.
Another alternative is to segregate your data manually. Put the critical stuff on your flash arrays and keep either big object stores or erasure-coded clusters of magnetic disks around to do bulk storage of your cold data.
RAID 5 and RAID 6 parity calculations are a type of erasure coding, but erasure coding in a storage context usually refers to a setup that allows you to bring together multiple servers full of disks and specify the number of disk (and entire server) failures you wish to be able to sustain.
This will affect the amount of usable space you have and the amount of CPU power required to make all the calculations.
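The space trade-off can be sketched in Python. I am assuming a generic "k data shards plus m parity shards" scheme in which any m shards can be lost, as with Reed-Solomon style codes. The function name and the example layouts are hypothetical and do not describe any particular vendor's product.

```python
def erasure_profile(data_shards: int, parity_shards: int, shard_tb: float) -> dict:
    """Summarise a k+m erasure-coded layout.

    Assumption: the code tolerates the loss of any `parity_shards` shards,
    and every shard lives on its own disk (or server) of size shard_tb.
    """
    total_shards = data_shards + parity_shards
    return {
        "tolerated_failures": parity_shards,
        "raw_tb": total_shards * shard_tb,
        "usable_tb": data_shards * shard_tb,
        "efficiency": data_shards / total_shards,
    }


# A four-disk RAID 5 is the 3+1 case; a wider 10+4 layout survives more losses.
for k, m in [(3, 1), (4, 2), (10, 4)]:
    profile = erasure_profile(k, m, shard_tb=4.0)
    print(f"{k}+{m}: survives {profile['tolerated_failures']} failures, "
          f"{profile['usable_tb']:.0f}TB usable of {profile['raw_tb']:.0f}TB "
          f"({profile['efficiency']:.0%})")
```

The CPU cost mentioned above is not modelled here. It generally grows with the number of parity shards that have to be computed.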
Erasure coding is used by numerous hyperscale providers to do large object storage setups, and by smaller clustered-array vendors to offer solutions for backup and (more recently) virtualisation workloads. Currently, I am aware only of Yottabyte as a hyperconverged vendor using erasure coding.
There are plenty of ways to ensure that we can reliably store data, even as we move beyond 8TB drives. The best way, however, may be to put stuff you really care about on flash arrays. Especially if you have an attachment to the continued use of RAID 5. ®