Monday, 29 November 2010
The problem with RAID 5 is that once a member disk has failed, the entire array must be read in order to complete the rebuild. Although the probability of encountering a read error in any single read operation is very low, the chance of hitting one somewhere during the rebuild grows with the array size. It has been widely speculated that this probability becomes practically significant as the array size approaches 10TB. Although the speculation relies on certain assumptions that are not likely to hold (we'll have a writeup on that later), consider being better safe than sorry.
RAID 6, being capable of correcting two simultaneous read errors, does not have this problem.
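The arithmetic behind the speculation can be sketched in a few lines. The sketch assumes the commonly quoted consumer-drive specification of one unrecoverable read error (URE) per 10^14 bits read, and that errors are independent; both are exactly the kind of assumptions that may not hold in practice:

```python
# Probability of hitting at least one unrecoverable read error (URE)
# while reading an entire array during a RAID 5 rebuild.
# Assumes one URE per 1e14 bits read (a common consumer-drive spec)
# and independent errors -- the questionable assumptions noted above.

def rebuild_error_probability(array_tb, ure_rate=1e-14):
    bits_read = array_tb * 1e12 * 8   # decimal terabytes to bits
    return 1 - (1 - ure_rate) ** bits_read

for size in (1, 5, 10):
    print(f"{size} TB: {rebuild_error_probability(size):.0%}")
```

Under these assumptions a 10TB rebuild fails more often than not, which is where the "practically significant" figure comes from.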
Friday, 26 November 2010
Tuesday, 23 November 2010
The relationship between Speed, Price, and Fault Tolerance mostly determines the RAID level to use. Of these three parameters, pick any two.
- Fast and Fault Tolerant - RAID 1+0
- Fast and Cheap - RAID 0
- Cheap and Fault Tolerant - RAID 5 or RAID 6
Saturday, 20 November 2010
Things to consider when planning a RAID:
- Array capacity. Whatever your capacity requirements are, they are underestimated, most likely by a factor of two.
- Budget limits.
- Expected activity profile, especially read-to-write ratio. If there are mostly reads and few writes, then RAID 5 or RAID 6 would be OK. If significant random write activity is expected, consider RAID 10 instead.
- Expected lifetime. Whatever the projected lifetime of the storage system is, it is underestimated.
For a quick estimation of capacities for various RAID levels, check the online RAID calculator.
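The arithmetic behind such a calculator is simple. The sketch below assumes n identical disks and ignores metadata overhead; the function name is ours, not part of any particular tool:

```python
# Usable capacity for common RAID levels, given n identical disks of
# disk_size each. Ignores metadata overhead.

def usable_capacity(level, n, disk_size):
    if level == "RAID0":
        return n * disk_size          # striping, no redundancy
    if level == "RAID1":
        return disk_size              # all disks mirror one another
    if level == "RAID5":
        return (n - 1) * disk_size    # one disk's worth of parity
    if level == "RAID6":
        return (n - 2) * disk_size    # two disks' worth of parity
    if level == "RAID10":
        return n * disk_size // 2     # striped mirrors
    raise ValueError(f"unknown RAID level: {level}")

# Example: four 1000 GB disks
for level in ("RAID0", "RAID1", "RAID5", "RAID6", "RAID10"):
    print(level, usable_capacity(level, 4, 1000), "GB")
```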
Wednesday, 17 November 2010
However, 3- and 4-year-old drives stand out. For such drives, the correlation between average temperatures and failure rates turned out to be more pronounced, probably due to the HDD technology of the time.
Thus, the studies show that disk temperature directly affects the failure rate only for old drives and in high temperature ranges (above 50°C). At moderate temperatures, other factors affect failure rates much more strongly than temperature does.
Sunday, 14 November 2010
The only information needed to recover a RAID is the set of member disks themselves. If a recovery lab asks for something like the controller model, they are not a professional outfit.
This claim has some merit. If you can get your hands on the actual drives, you do not really need anything else to do the recovery. This holds for a recovery lab, which works with the actual disks (or images thereof). When we debug our RAID recovery freeware, however, we are at a significant disadvantage: the actual disk images are almost always cost-prohibitively large to transfer, so we have to figure the problem out without them.
Lacking the images, we still have our test data sets, crash dumps, and so on, but the customer's description of the problem becomes much more important.
Consider the following problem report, just for entertainment purposes:
We were running XP when the software RAID5 volume holding the data failed. The array is 4x 1TB WD whatever model hard drives. The hard drives were verified separately with WD Lifeguard and the tests returned no errors. However, Windows refuses to mount the array and ReclaiMe Free RAID Recovery fails to produce proper output.
Now what is the problem with the recovery? (select whitespace below for an answer).
There is a discrepancy between two statements: (1) running XP and (2) using RAID5. They must have been using RAID0, because XP does not support software RAID5.
This perfectly illustrates the importance of providing all the details.
Thursday, 11 November 2010
For a RAID 0, this is obviously not the case. The only overhead involved is dispatching the sectors being read or written to the appropriate disk, which requires a fairly simple calculation once per sector (512 bytes of data) transferred.
RAID 5 and RAID 6 are more complicated: parity must be computed for every write. However, the processing power requirements are modest and the resources are abundant. Given a 100 MB/sec write speed, we need, say, 1,000 MIPS (Million Instructions Per Second) to calculate the parity, plus an additional memory bandwidth of, say, 200 MB/sec (100 MB/sec in and 100 MB/sec out). Properly designed caching alleviates the load even further. Meanwhile, even a pretty modest CPU (made circa 2005) provides about 15,000 MIPS and about 5,000 MB/sec of memory bandwidth. So the requirements of a RAID sustaining writes at 100 MB/sec do not seem high compared to the resources available.
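The parity calculation itself is just a bytewise XOR across the blocks of a stripe. A minimal sketch, with made-up four-byte blocks standing in for full-size stripe units:

```python
# RAID 5 parity in miniature: the parity block is the bytewise XOR of
# the data blocks, and any single missing block can be rebuilt by
# XOR-ing the surviving blocks (including parity) back together.

def xor_blocks(*blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]      # three data blocks in a stripe
parity = xor_blocks(*data)

# Simulate losing the second disk and rebuilding its block:
rebuilt = xor_blocks(data[0], data[2], parity)
assert rebuilt == data[1]
```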
Error messages displayed during the boot sequence are no longer useful, because uptime is now measured in weeks even for home PCs.
If the controller doesn't report errors, or if for some reason you don't take prompt action to restore redundancy once a disk failure has occurred, then there is no point in running a RAID at all. Hot spares alleviate the problem for disk failures, but not for a silent controller malfunction.
On one of the forums, someone told a story about a RAID 1 failure: one of the member disks failed, and it then turned out that the second member disk contained two-month-old data. Two months before the failure, the controller had lost the array for some reason but didn't report an error, so nobody restored the redundancy. As a result, when the only remaining disk failed, there was no redundancy left and all the data was lost.
Remember also that RAID recovery may be difficult if the array is seriously out of sync.
Monday, 8 November 2010
There are two types of RAID implementation:
1. Configuration-on-Disks (COD), in which the array configuration, including exactly which array the disk belongs to, is stored on the disk itself.
In this case you can move the disks between ports and even between controllers of the same model. This scheme is used by modern software RAIDs (Windows LDM and Linux mdadm) and by most hardware controllers as well. Sometimes you can even transfer an array between different controller models, for example between Intel ICH9R and Intel ICH10R.
2. RAID implementations in which the information about member disks is stored in the controller memory. Here, the controller actually tracks not the disks but the ports, so you cannot swap disks between ports. If you do, you lose the array and then need a RAID recovery.
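The difference can be illustrated with a toy sketch: under COD, each disk carries a record naming its array and its position in it, so the array reassembles correctly no matter how the disks were reshuffled across ports. The record layout and names below are purely illustrative, not any real controller's metadata format:

```python
# A toy model of why Configuration-on-Disks survives reshuffling: the
# assembly order comes from metadata written on each disk, not from
# which port the disk happens to be plugged into.

import random

disks = [
    {"array_id": "vol0", "member_index": 0, "data": "stripe-0"},
    {"array_id": "vol0", "member_index": 1, "data": "stripe-1"},
    {"array_id": "vol0", "member_index": 2, "data": "stripe-2"},
]

random.shuffle(disks)   # plug the disks into different ports

# COD-style reassembly: order by the metadata on each disk
assembled = sorted(
    (d for d in disks if d["array_id"] == "vol0"),
    key=lambda d: d["member_index"],
)
print([d["data"] for d in assembled])   # stripes back in order
```

A port-based controller has no such records to sort by, which is why shuffling its disks loses the array.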
Wednesday, 3 November 2010
To enable BigLBA on an XP computer, you need to change a parameter in the registry. This works well when XP is already installed, but if you need to make a fresh installation on a large disk, you can't change the parameter because the registry does not yet exist.
To work around this issue, you can do one of the following:
- Include the latest Service Pack in the installation CD (a process called "slipstreaming"); the full disk capacity will then be available during install.
- Install XP on a partition of limited size, say 100 GB, then install the latest Service Pack, enable BigLBA, and use a tool like Partition Magic to extend the partition over the remaining space. Normally, we'd recommend a backup before resizing a partition, but since this is a fresh install anyway, there is nothing useful to back up.
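For reference, the registry parameter in question is the EnableBigLba value under the atapi service key, per Microsoft's documentation of 48-bit LBA support. A .reg fragment like the following enables it on an installed system with SP1 or later:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\atapi\Parameters]
"EnableBigLba"=dword:00000001
```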