Friday, 31 December 2010

RAID5 write hole

A RAID 5 "write hole" occurs when a power failure interrupts a write, so that the parity block update is left incomplete and the parity no longer matches the data in the stripe.

A more detailed analysis of the "write hole" and its consequences is available at the RAID Recovery Guide.
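The effect can be illustrated with a minimal sketch. It assumes a hypothetical three-disk RAID 5 stripe with XOR parity; the block contents and the `xor_blocks` helper are made up for the illustration and are not taken from any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of several equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of a hypothetical 3-disk RAID 5: two data blocks plus parity.
d0 = b"AAAA"
d1 = b"BBBB"
parity = xor_blocks([d0, d1])

# While the stripe is consistent, any lost block can be rebuilt from the others.
assert xor_blocks([d1, parity]) == d0

# Update d0, but "lose power" before the parity block is rewritten.
d0 = b"CCCC"            # the new data reached the disk
stale_parity = parity   # the parity update never happened

# Later the disk holding d1 fails, and d1 is rebuilt from d0 and the parity.
rebuilt_d1 = xor_blocks([d0, stale_parity])
print(rebuilt_d1 == d1)  # False: the rebuild silently returns wrong data
```

Note that nothing looks wrong until a disk fails: the damage only shows up when the stale parity is actually used for a rebuild.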

Monday, 27 December 2010

Large disks and FAT32.

Starting with Windows XP, Microsoft Windows does not let you format a partition larger than 32 GB as FAT32. Note that if such a partition already exists, you can use it normally.

However, there are situations when you need to format a large disk as FAT32, for example to use the disk in a standalone player. To do this, you can use a program called FAT32format (http://www.ridgecrop.demon.co.uk/index.htm?fat32format.htm).

Theoretically, the maximum size of a FAT32 partition is 2^28 * 32 KB = 8 TB (http://support.microsoft.com/kb/314463), but in practice, for the sake of compatibility, it is better to limit yourself to 2 TB.
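The arithmetic behind these two figures can be checked with a short sketch. It ignores the handful of reserved cluster values, and it assumes that the 2 TB practical limit comes from a 32-bit sector count combined with 512-byte sectors; that explanation is an assumption added here, not something stated above.

```python
TIB = 2 ** 40  # bytes in one binary terabyte

# Theoretical limit: 2^28 addressable clusters of 32 KB each.
cluster_count = 2 ** 28
cluster_size = 32 * 1024
theoretical_limit = cluster_count * cluster_size
print(theoretical_limit / TIB)  # 8.0

# Assumed source of the practical limit: 32-bit sector count, 512-byte sectors.
sector_count = 2 ** 32
sector_size = 512
practical_limit = sector_count * sector_size
print(practical_limit / TIB)  # 2.0
```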

Thursday, 23 December 2010

TLER in end-user systems

In your home PC, you're better off without TLER.

TLER (Time Limited Error Recovery) is a feature of WD hard drives. It limits the time the hard drive spends attempting to recover from a read error. Most other manufacturers have similar options under different names.

TLER is designed to improve performance, not reliability. TLER kicks in when the hard drive experiences an uncorrectable read error (most likely due to a bad sector). A drive with a bad sector should generally be replaced as soon as possible, because it is likely to produce another bad sector soon. Without TLER, the RAID controller would drop the drive offline. With TLER, the missing data is recovered from RAID redundancy and the drive remains online.

With typical consumer-grade maintenance, this effectively hides the problem until the drive fails completely, thereby increasing the vulnerability window of a fault-tolerant array. In an enterprise-grade system, where continuous availability is important, this is good. A consumer-grade system, where a shutdown is not a problem, is better shut down immediately once a read error occurs.

Tuesday, 21 December 2010

RAID Tips Bonus - Unrecoverable Errors in RAID 5

There is a known issue with RAID 5: if one drive in the array fails completely, data may be lost during the rebuild if one of the remaining drives encounters an unrecoverable read error. These errors are relatively rare, but the size of the arrays involved has increased to the point where one supposedly cannot even read the entire array without encountering a read error.

There are some scary calculations available on the Internet (see an example here), concluding that there is as much as a 50% probability of the rebuild failing on a 12TB (6x2TB) RAID 5 if one disk fails. These calculations are based on somewhat naive assumptions, making the problem look worse than it actually is.

First of all, the statement that "There is a 50% probability of not being able to rebuild a 12TB RAID 5" is the same as "If you have a 10TB RAID 0 array, there is a 50% probability of not getting back what you write to it, if you write the data and then read it back immediately." That's assuming the same amount of user data on both arrays and 2TB hard drives. Still, nobody declares RAID 0 dead.

This can be reformulated even further. Assuming 100MB/sec sustained read speed, we can say "There is a 50% chance that a hard drive cannot sustain a continuous sequential read operation for 30 hours non-stop", which just does not look right. 30 hours is the approximate time to read 10TB of data at 100MB/sec.

The silent assumptions behind these calculations are that
  1. read errors are distributed uniformly over hard drives and over time, and
  2. the single read error on rebuild kills the entire array.

Neither of these is true, which makes the result useless.
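For reference, the naive calculation can be reproduced with a short sketch. It assumes an unrecoverable read error rate of one per 10^14 bits read, a figure commonly quoted in consumer drive datasheets; that rate is an assumption made here and does not come from the calculations referenced above. The sketch treats every bit read as an independent trial, which is exactly the uniform-distribution assumption questioned above.

```python
import math

URE_RATE_PER_BIT = 1e-14   # assumed: one unrecoverable error per 10^14 bits read
DRIVE_BYTES = 2e12         # 2TB drives
SURVIVING_DRIVES = 5       # 6-disk RAID 5 with one disk failed
READ_SPEED = 100e6         # 100MB/sec sustained read speed

bits_to_read = SURVIVING_DRIVES * DRIVE_BYTES * 8

# P(at least one error) = 1 - (1 - rate)^bits, computed via log1p/expm1
# so that the tiny per-bit rate is not lost to floating-point rounding.
p_fail = -math.expm1(bits_to_read * math.log1p(-URE_RATE_PER_BIT))

# Time needed to read the same amount of data sequentially from one drive.
hours = SURVIVING_DRIVES * DRIVE_BYTES / READ_SPEED / 3600

print(f"naive rebuild failure probability: {p_fail:.0%}")  # 55%
print(f"time to read 10TB at 100MB/sec: {hours:.1f} h")    # 27.8 h
```

Under these assumptions the naive model gives roughly the 50% figure, so the arithmetic itself is not the problem; the assumptions are.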

Friday, 17 December 2010

RAID Tips 10 of 10 - Recover a failed RAID.

If a disk fails in a fault-tolerant array such as RAID 5, you just replace the disk and carry on.
However, if there is a controller failure or an operator error, you might end up with a set of disks lacking the array configuration.

You can then send the disk set to a data recovery lab, or try to get the data off yourself using ReclaiMe Free RAID Recovery. The most difficult part of RAID recovery is destriping, that is, converting the striped array into a contiguous set of sectors, as found on a regular hard drive. ReclaiMe Free RAID Recovery does exactly that, giving you a choice of several output options, and at no cost.
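To give an idea of what destriping means, here is a minimal sketch for the simplest case: a RAID 0 with a known disk order and a known block size. The `destripe_raid0` function is hypothetical and is not how ReclaiMe Free RAID Recovery works internally; in a real recovery the hard part is determining the disk order, block size, start offset, and (for RAID 5) the parity layout, none of which this sketch attempts.

```python
def destripe_raid0(disks, block_size):
    """Interleave equal-sized member disk images into one contiguous image.

    disks      -- member disk contents, in the correct array order
    block_size -- stripe block size in bytes
    """
    blocks_per_disk = len(disks[0]) // block_size
    image = bytearray()
    for row in range(blocks_per_disk):
        start = row * block_size
        # Each row contributes one block from every member disk, in order.
        for disk in disks:
            image += disk[start:start + block_size]
    return bytes(image)


# Two tiny "disks" with a 4-byte block size.
disk0 = b"AAAACCCC"
disk1 = b"BBBBDDDD"
print(destripe_raid0([disk0, disk1], 4))  # b'AAAABBBBCCCCDDDD'
```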

Tuesday, 14 December 2010

RAID Tips 9 of 10 - Monitor the RAID performance

The ability of the RAID to handle failures of its hard drives relies on the redundancy of the storage. Once redundancy is lost because of the first drive failure, human intervention is needed to correct the problem and restore the required level of redundancy. Redundancy must be restored quickly; otherwise there is no point in having the RAID at all. However, you do not know when to act if you do not monitor the array and disks often enough.
  • Regularly check the SMART status on the drives using the appropriate software. With a software RAID, use SpeedFan or HDDLife. With a hardware RAID, use the vendor-supplied monitoring software.

  • Any unexplained drop in the throughput may indicate a problem with one of the hard drives.

Saturday, 11 December 2010

RAID Tips 8 of 10 - Backup Often

Even if your RAID is supposed to be fault tolerant, back up often.

Although the RAID is redundant with respect to hard drive failures, there are still issues that may bring down the entire array, requiring either a restore from a backup or a RAID recovery.

  • The RAID controller itself is a single point of failure.
  • The failures of the hard drives may be correlated if they have a single root cause (like a power supply failure involving overvoltage).
  • Last but not least, RAID does not protect you against human error.

Wednesday, 8 December 2010

RAID Tips 7 of 10 - Test your RAID

The fault tolerance needs to be tested so that you

  • know exactly how the RAID behaves when a hard drive fails;
  • ensure that the RAID is actually capable of surviving a disk failure.

When deploying a fault-tolerant array (RAID 1, RAID 5, or RAID 6), test the system with a simulated disk failure.

  • If you have hot swappable drive bays, just pick a random one and pull the hard drive on a live system.
  • If there is no hot swap available, then disconnect one of the disks with the system powered off.


Obviously, you should do the testing before the array is loaded with production data; otherwise you may end up with an unplanned RAID recovery incident.

Sunday, 5 December 2010

RAID Tips 6 of 10 - Software RAID

Do not underestimate the software RAID.

  • Software RAID typically provides the same, if not better, reliability as an entry-level server hardware RAID controller.
  • Software RAID is easier to move around from server to server. It does not require an identical replacement controller in case of a controller failure, which is sometimes a hindrance with server hardware.
  • In RAID 0, RAID 1, and RAID 10, the hardware RAID controller does not provide any computational power benefit because there is nothing to offload.
  • Most modern SOHO and small business NAS devices, like Synology, QNAP, or NetGear, use the software RAID (Linux/mdadm).

Thursday, 2 December 2010

RAID Tips 5 of 10 - Hot Spares

Hot spares are a good addition to a fault-tolerant array.

If a drive has failed in a fault-tolerant (RAID 1 or RAID 5) array, there is a vulnerability window. If another drive fails during this vulnerability window, the data is lost. Hot spare drives allow the controller to rebuild the array without administrator intervention, thereby reducing the vulnerability window.

The need for a hot spare increases as the number of disks in the array increases.

Hot spares are most effective when a single hot spare drive is shared between several arrays. Consider, for example, an 8-bay NAS. If there is only one RAID 5 array in the NAS, then RAID 6 may be a better option than a hot spare. The hot spare drive just sits there idle, whereas in a RAID 6 array the same drive would be used to improve read speed. However, if you need two RAID 5 arrays, the hot spare drive is shared between these two arrays, reducing the disk space overhead.
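The disk space overhead can be compared with a small sketch. The four layouts below are illustrative assumptions for an 8-bay NAS with identical drives, not recommendations; `usable_disks` is a hypothetical helper that simply subtracts the parity overhead of each array.

```python
# Number of disks' worth of capacity consumed by parity in each array.
PARITY_DISKS = {"RAID5": 1, "RAID6": 2}


def usable_disks(arrays):
    """Return the usable capacity, in disks, of a list of (level, members)."""
    return sum(members - PARITY_DISKS[level] for level, members in arrays)


# Every layout fills all 8 bays; a hot spare occupies a bay but holds no data.
layouts = {
    "one RAID 5 (7 disks) + hot spare": [("RAID5", 7)],
    "one RAID 6 (8 disks)": [("RAID6", 8)],
    "two RAID 5 (4 + 3 disks) + shared hot spare": [("RAID5", 4), ("RAID5", 3)],
    "two RAID 6 (4 + 4 disks)": [("RAID6", 4), ("RAID6", 4)],
}

for name, arrays in layouts.items():
    print(f"{name}: {usable_disks(arrays)} disks usable")
```

With a single array, both options leave six disks of usable space, so the RAID 6 layout gets the extra drive working at no capacity cost. With two arrays, the shared hot spare leaves five usable disks against four for two RAID 6 arrays.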