How does RAID 5 reduce data loss?
RAID 5 takes your data and adds some parity data that makes it possible to reconstruct the original data if there is a drive failure (RAID 6 is similar, except it can reconstruct after two failures). So why would it stop working?
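To make the parity idea concrete, here is a minimal sketch of XOR parity, the operation RAID 5 uses. The three toy 4-byte blocks and the helper name `xor_blocks` are illustrative assumptions; a real array works on much larger stripes and rotates parity across the drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy data blocks, one per data drive (hypothetical contents).
data = [b"RAID", b"five", b"demo"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing the second drive, then rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
```

Note that the rebuild has to read every surviving block in full, which is why a single read error during reconstruction matters so much.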
The URE problem
RAID 5 works fine when there are no further failures or errors during data reconstruction. Back in 2007 though, almost all SATA drives, and many SCSI drives, were spec’d at one Unrecoverable Read Error (URE) per 10^14 bits read. That’s one URE every 12.5TB.
One-terabyte drives were coming into production then. If you had an 8-drive RAID 5 stripe, and one drive failed, the RAID controller would have to read 7TB of data from the surviving drives to reconstruct the failed drive.
That meant a better than 40 percent chance that a URE during the reconstruction would scuttle the entire process. When that happens, it would have been faster to rebuild the data from a backup.
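The arithmetic behind this estimate can be sketched as follows. The function name `ure_probability` is hypothetical, and the model assumes every bit fails independently at exactly the spec'd rate. That is a simplification, since real drives tend to fail in bursts and often beat their spec, so treat any specific percentage as a ballpark figure. Under this model a 7TB read comes out at about 43 percent.

```python
import math

TB = 10**12  # drive vendors quote decimal terabytes

def ure_probability(bytes_read, bits_per_ure=1e14):
    """Chance of at least one URE while reading bytes_read bytes.

    Assumes each bit fails independently at a rate of 1 / bits_per_ure.
    """
    bits_read = bytes_read * 8
    # 1 - (1 - p)^n, computed in a numerically stable way for tiny p.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_ure))

# 10^14 bits expressed in terabytes: the "one URE every 12.5TB" figure.
print(f"Data read per URE at spec: {1e14 / 8 / TB:.1f} TB")

# Rebuilding one failed drive in an 8-drive array of 1TB drives reads 7TB.
print(f"7 TB rebuild: {ure_probability(7 * TB):.0%} chance of a URE")
```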
Of course, drives have only gotten bigger. Four terabyte drives are common and we now have 10TB drives.
Why does RAID 5 still work?
Simple: drive vendors upped the spec – for some drives – to one URE in 10^15 bits, or about 125TB. Of course, now that drive capacities have also increased by 10x, the problem of failure due to a URE during reconstruction is coming back.
However, many other large capacity drives aren’t at the higher spec. If you use a low-spec drive in a RAID, there’s a good chance the rebuild won’t work.
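To see how the two specs compare as drives grow, here is a rough sketch that tabulates the rebuild risk for an 8-drive RAID 5 array. The function name `rebuild_ure_risk` and the chosen drive sizes are assumptions for illustration, and the model again treats bit errors as independent at the spec'd rate, which real drives only approximate.

```python
import math

TB = 10**12  # decimal terabytes

def rebuild_ure_risk(drives, drive_tb, bits_per_ure):
    """Chance of a URE while reading every surviving drive in full.

    Uses the approximation 1 - exp(-bits_read / bits_per_ure).
    """
    bits_read = (drives - 1) * drive_tb * TB * 8
    return 1 - math.exp(-bits_read / bits_per_ure)

# Compare the older 10^14 spec with the newer 10^15 spec.
for drive_tb in (1, 4, 10):
    low_spec = rebuild_ure_risk(8, drive_tb, 1e14)
    high_spec = rebuild_ure_risk(8, drive_tb, 1e15)
    print(f"8 x {drive_tb:>2} TB: {low_spec:.0%} at 10^14, {high_spec:.0%} at 10^15")
```

In this model, 10TB drives at the 10^15 spec read ten times the data at one-tenth the error rate, so their rebuild risk matches that of 1TB drives at the 10^14 spec.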
It pays to look at spec sheets if you have critical applications or data. Or you can do what I do.
The Storage Bits take
I have a couple of 4-drive RAID 5 arrays. I don’t worry about the URE problem because I have all the critical data backed up to the cloud.
In case of a drive failure your first action should be to copy all data from the array before replacing the failed drive. If you encounter a URE during copying, at least you’ve saved all the other data. Not all low-cost RAID controllers report read errors, so you might copy a corrupted file, but that would have happened anyway.
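One way to do that copy is sketched below. The function name `salvage_copy` and the mount paths in the usage note are hypothetical. It assumes a read error surfaces as an `OSError`, which is typical when the operating system reports an I/O failure. As noted above, some controllers never report the error, and this script cannot detect that case.

```python
import os
import shutil

def salvage_copy(src_root, dst_root):
    """Copy every file under src_root to dst_root, skipping unreadable files.

    Returns the list of source paths that could not be copied.
    """
    failed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Recreate the same directory layout under the destination.
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                shutil.copy2(src, os.path.join(target_dir, name))
            except OSError as err:
                # Record the failure and keep going so one bad file
                # does not stop the rest of the copy.
                failed.append(src)
                print(f"skipped {src}: {err}")
    return failed
```

A call such as `salvage_copy("/mnt/array", "/mnt/rescue")` would return the files that need to come from backup instead.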
This reiterates the core premise of RAID: it provides data access after drive failures and is NOT a substitute for backup. Fortunately, hard drives are getting more reliable, so your chance of needing this advice is declining.
But as drive capacities continue to rise, vendors need to raise their URE spec. When will they do it?
Courteous comments welcome, of course.