Tuesday, 21 December 2010

RAID Tips Bonus - Unrecoverable Errors in RAID 5

There is a known issue with RAID 5: if one drive in the array fails completely, data may be lost during the rebuild if one of the remaining drives encounters an unrecoverable read error. These errors are relatively rare, but the size of the arrays involved has increased to the point where one cannot even read the entire array without encountering a read error.

There are some scary calculations available on the Internet (see an example here), concluding that there is as much as a 50% probability that the rebuild of a 12TB (6x2TB) RAID 5 fails after one disk is lost. These calculations are based on somewhat naive assumptions, making the problem look worse than it actually is.
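For reference, the naive calculation can be sketched as follows. The error rate of one unrecoverable read error per 10^14 bits is an assumption (a figure commonly quoted in consumer drive specifications, not taken from this post), and the function name is made up for illustration. The sketch treats every bit read as an independent trial, which is exactly the simplification questioned below.

```python
import math

def naive_rebuild_failure_probability(bytes_to_read, ure_per_bit=1e-14):
    """Probability of at least one read error, treating each bit as an
    independent trial. ure_per_bit is an ASSUMED spec-sheet value."""
    bits = bytes_to_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 to avoid rounding loss
    # when p is tiny and bits is huge.
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# 6x2TB RAID 5 with one disk failed: the rebuild has to read the
# five surviving 2TB drives, i.e. 10TB (decimal terabytes assumed).
p = naive_rebuild_failure_probability(5 * 2e12)
print(f"naive rebuild failure probability: {p:.2f}")
```

With these inputs the result comes out at a little over one half, which is where the scary numbers come from.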

First of all, the statement that "There is a 50% probability of not being able to rebuild a 12TB RAID 5" is the same as "If you have a 10TB RAID 0 array, there is a 50% probability of not getting back what you write to it, if you write the data and then read it back immediately." This assumes the same amount of user data on both arrays and 2TB hard drives. Still, nobody declares RAID 0 dead.

This can be reformulated even further. Assuming 100MB/sec sustained read speed, we can say "There is a 50% chance that a hard drive cannot sustain a continuous sequential read operation for 30 hours non-stop", which just does not look right. 30 hours is the approximate time to read 10TB of data at 100MB/sec.
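The 30-hour figure is simple arithmetic. A minimal sketch, assuming decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes); the function name is illustrative only:

```python
def sequential_read_hours(bytes_to_read, bytes_per_second=100e6):
    """Hours needed to read the given amount at a constant rate."""
    return bytes_to_read / bytes_per_second / 3600

# 10TB at 100MB/sec, as in the text above.
hours = sequential_read_hours(10e12)
print(f"{hours:.1f} hours")
```

This gives just under 28 hours, so "about 30 hours" is a fair round number.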

The silent assumptions behind these calculations are that
  1. read errors are distributed uniformly over hard drives and over time, and
  2. a single read error during the rebuild kills the entire array.

Neither of these is true, which makes the result useless.
