Friday, 20 March 2015

Fault tolerance in storage systems

Saying that RAID5 is fault-tolerant is not a meaningful statement on its own. To be meaningful, a fault-tolerance claim needs three things:

  1. The specific implementation must be named.
  2. The set of anticipated failures must be listed.
  3. For each of the anticipated failures, the extent of degradation must be specified.
For example, a generic implementation of RAID5 does not lose data when exactly one drive fails. That statement says nothing about performance or, more generally, about data availability. Similarly, a generic NAS does not lose data if its network connection fails, but the data is unavailable until the connection is restored one way or another.
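One way to keep such claims honest is to write them down as data. The following is a minimal sketch in Python, not a standard schema: the class name `FaultToleranceClaim`, its fields, and the wording of the degradation strings are all assumptions made for illustration. It records the implementation, the anticipated failures, and the degradation for each, and declines to say anything about failures outside that set.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FaultToleranceClaim:
    """Hypothetical record of a fault-tolerance claim (illustrative only)."""

    # 1. The specific implementation the claim is about.
    implementation: str
    # 2 and 3. Each anticipated failure mapped to the extent of degradation.
    degradation: Dict[str, str] = field(default_factory=dict)

    def describe(self, failure: str) -> str:
        # A failure outside the anticipated set is not covered by the claim.
        if failure not in self.degradation:
            return f"{self.implementation}: no claim made for '{failure}'"
        return f"{self.implementation}: on '{failure}': {self.degradation[failure]}"


raid5 = FaultToleranceClaim(
    implementation="generic RAID5",
    degradation={
        "exactly one drive fails": "no data loss; performance and availability unspecified",
    },
)

nas = FaultToleranceClaim(
    implementation="generic NAS",
    degradation={
        "network connection fails": "no data loss; data unavailable until the connection is fixed",
    },
)

print(raid5.describe("exactly one drive fails"))
print(raid5.describe("two drives fail"))
print(nas.describe("network connection fails"))
```

Asking the RAID5 record about a second drive failure returns "no claim made", which is the point: whatever is not in the anticipated set is simply not promised.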

So, when talking about fault tolerance, don't forget to include at least the set of anticipated failures.