Friday, 31 December 2010

RAID5 write hole

The RAID 5 "write hole" occurs when a power failure interrupts a write, leaving a parity block that no longer matches its data blocks.
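
To see how this happens, here is a minimal sketch (hypothetical toy code, not any controller's actual logic) of a RAID 5 stripe where the power fails between the data-block write and the parity-block write:

```python
def parity(blocks):
    """XOR parity over equal-length byte blocks."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# A consistent stripe: two data blocks plus their parity.
d0, d1 = b"\x01" * 4, b"\x02" * 4
p = parity([d0, d1])

# Update d0, but "lose power" before the parity block is rewritten.
d0 = b"\xff" * 4   # the new data reaches the disk...
# ...the matching parity update never happens; p stays stale.

# After reboot the stripe is silently inconsistent: reconstructing d1
# from d0 and the stale parity returns wrong data.
reconstructed_d1 = parity([d0, p])
print(reconstructed_d1 == d1)   # False - this is the write hole
```

Without something like a battery-backed write cache or journaling, the controller has no way of knowing which stripes were left in this state.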

A more detailed analysis of the "write hole" and its consequences is available in the RAID Recovery Guide.

Monday, 27 December 2010

Large disks and FAT32.

Starting with Windows XP, Microsoft Windows does not allow you to format a partition larger than 32 GB as FAT32. Note that if such a partition was created earlier, you can still use it normally.

However, there are situations when you need to format a large disk as FAT32, for example to use the disk in a standalone player. To do this, you can use a program called FAT32format (http://www.ridgecrop.demon.co.uk/index.htm?fat32format.htm).

Theoretically, the maximum size of a FAT32 partition is 2^28 clusters × 32 KB = 8 TB (http://support.microsoft.com/kb/314463), but in practice (for the sake of compatibility) it is better to limit yourself to 2 TB.
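
The arithmetic behind the 8 TB figure is easy to verify:

```python
# Theoretical FAT32 limit: 2**28 addressable clusters (28 usable bits
# of the 32-bit cluster number) times the 32 KB maximum cluster size.
clusters = 2 ** 28
cluster_size = 32 * 1024            # 32 KB in bytes
max_bytes = clusters * cluster_size
print(max_bytes // 2 ** 40)         # 8 (TB, binary units)
```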

Thursday, 23 December 2010

TLER in end-user systems

In your home PC, you're better off without TLER.

TLER (Time Limited Error Recovery) is a feature of WD hard drives. It limits the time the hard drive spends attempting to recover from a read error. Most other manufacturers have similar options under different names.

TLER is designed to improve performance, not reliability. TLER kicks in when the hard drive encounters an uncorrectable read error (most likely due to a bad sector). A drive with a bad sector should generally be replaced as soon as possible, because it is likely to develop another bad sector soon. Without TLER, the RAID controller would drop the drive offline; with TLER, the missing data is recovered from RAID redundancy and the drive remains online.

With typical consumer-grade maintenance, this effectively hides the problem until the drive fails completely, thereby increasing the vulnerability window of a fault-tolerant array. In enterprise-grade systems, where continuous availability is important, this is good. A consumer-grade system, where a shutdown is not a problem, is better shut down immediately once a read error occurs.

Tuesday, 21 December 2010

RAID Tips Bonus - Unrecoverable Errors in RAID 5

There is a known issue with RAID 5: if one drive in the array fails completely, there may be data loss during the rebuild if one of the remaining drives encounters an unrecoverable read error. These errors are relatively rare, but the size of the arrays involved has increased to the point where one cannot even read an entire array without encountering a read error.

There are some scary calculations available on the Internet (see an example here), concluding that there is as much as a 50% probability of failing the rebuild of a 12TB (6x2TB) RAID 5 if one disk fails. These calculations are based on somewhat naive assumptions, making the problem look worse than it actually is.

First of all, the statement "there is a 50% probability of being unable to rebuild a 12TB RAID 5" is the same as "if you have a 10TB RAID 0 array, there is a 50% probability of not getting back what you write to it, if you write the data and then read it back immediately" - assuming the same amount of user data on both arrays and 2TB hard drives. Still, nobody declares RAID 0 dead.

This can be reformulated even further. Assuming 100MB/sec sustained read speed, we can say "There is a 50% chance that a hard drive cannot sustain a continuous sequential read operation for 30 hours non-stop", which just does not look right. 30 hours is the approximate time to read 10TB of data at 100MB/sec.

The silent assumptions behind these calculations are that
  1. read errors are distributed uniformly over hard drives and over time, and
  2. the single read error on rebuild kills the entire array.

Neither of these is true, which makes the result useless.
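
For reference, here is the naive model those scary calculations use, sketched in Python. The 10^-14 errors-per-bit rate is a typical spec-sheet figure for consumer drives, and the uniform-independence of errors is exactly the assumption being criticized above:

```python
# Naive model behind the "50% rebuild failure" claims: one unrecoverable
# read error (URE) per 1e14 bits read, independently and uniformly
# distributed. Reading N bits then fails with probability 1-(1-p)**N.
ure_rate = 1e-14            # errors per bit (spec-sheet figure, assumed)
bits_to_read = 10e12 * 8    # 10 TB of user data read during the rebuild
p_fail = 1 - (1 - ure_rate) ** bits_to_read
print(round(p_fail, 2))     # about 0.55 under these assumptions
```

Change either assumption (errors clustered on one sick drive, or a rebuild that survives a single bad sector) and the headline number collapses.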

Friday, 17 December 2010

RAID Tips 10 of 10 - Recover a failed RAID.

If a disk fails in a fault-tolerant array such as RAID5, you just replace the disk and carry on.
However, if there is a controller failure or an operator error, you might end up with a set of disks lacking the array configuration.

You can then send the disk set to a data recovery lab, or try to get the data off yourself using ReclaiMe Free RAID Recovery. The most difficult part of RAID recovery is destriping, the process of converting the striped array into a contiguous set of sectors, as on a regular hard drive. ReclaiMe Free RAID Recovery does exactly that, giving you a choice of several output options, at no cost.
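
To give an idea of what destriping involves, here is a toy sketch with assumed parameters (RAID 0, two member disks, 4-sector stripe blocks). A real tool must first recover the block size, disk order, and start offset; here they are simply given:

```python
BLOCK = 4   # sectors per stripe block (assumed)
DISKS = 2   # number of member disks (assumed)

def locate(lba):
    """Map an array LBA to (member disk index, sector on that disk)."""
    block, offset = divmod(lba, BLOCK)
    disk = block % DISKS
    sector = (block // DISKS) * BLOCK + offset
    return disk, sector

# Array sectors 0-3 sit on disk 0, sectors 4-7 on disk 1, 8-11 back
# on disk 0, and so on; destriping reads them back in array order.
print(locate(0))   # (0, 0)
print(locate(4))   # (1, 0)
print(locate(9))   # (0, 5)
```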

Tuesday, 14 December 2010

RAID Tips 9 of 10 - Monitor the RAID performance

The ability of a RAID to handle failures of its hard drives relies on the redundancy of the storage. Once redundancy is lost because of the first drive failure, human intervention is needed to correct the problem and restore the required level of redundancy. Redundancy must be restored quickly; otherwise there is no point in having the RAID at all. However, you do not know when to act if you do not monitor the array and disks often enough.
  • Regularly check the SMART status on the drives using the appropriate software. With a software RAID, use SpeedFan or HDDLife. With a hardware RAID, use the vendor-supplied monitoring software.

  • Any unexplained drop in the throughput may indicate a problem with one of the hard drives.

Saturday, 11 December 2010

RAID Tips 8 of 10 - Backup Often

Even if your RAID is supposed to be fault tolerant, backup often.

Although the RAID is redundant with respect to hard drive failures, there are still issues that may bring down the entire array, requiring either a restore from a backup or a RAID recovery.

  • The RAID controller itself is a single point of failure.
  • The failures of the hard drives may be correlated if they have a single root cause (like a power supply failure involving overvoltage).
  • Last but not least, RAID does not protect you against human error.

Wednesday, 8 December 2010

RAID Tips 7 of 10 - Test your RAID

The fault tolerance needs to be tested so that you

  • know exactly how the RAID behaves when a hard drive fails;
  • ensure that the RAID is actually capable of surviving a disk failure.

When deploying a fault-tolerant array (RAID 1, RAID 5, or RAID 6), test the system with a simulated disk failure.

  • If you have hot swappable drive bays, just pick a random one and pull the hard drive on a live system.
  • If there is no hot swap available, then disconnect one of the disks with the system powered off.


Obviously, you had better do the testing before the array is loaded with production data, or you may end up with an unplanned RAID recovery incident.

Sunday, 5 December 2010

RAID Tips 6 of 10 - Software RAID

Do not underestimate the software RAID.

  • Software RAID typically provides the same, if not better, reliability as an entry-level server hardware RAID controller.
  • Software RAID is easier to move from server to server. It does not require an identical replacement controller in case of a controller failure, which is sometimes a hindrance with server hardware.
  • In RAID 0, RAID 1, and RAID 10, the hardware RAID controller does not provide any computational power benefit because there is nothing to offload.
  • Most modern SOHO and small business NAS devices, like Synology, QNAP, or NetGear, use the software RAID (Linux/mdadm).

Thursday, 2 December 2010

RAID Tips 5 of 10 - Hot Spares

Hot spares are a good addition to a fault-tolerant array.

If a drive has failed in a fault-tolerant (RAID 1 or RAID 5) array, there is a vulnerability window. If another drive fails during this vulnerability window, the data is lost. Hot spare drives allow the controller to rebuild the array without administrator intervention, thereby reducing the vulnerability window.

The need for a hot spare increases as the number of disks in the array increases.

Hot spares are most effective when a single hot spare drive is shared between several arrays. Consider, for example, an 8-bay NAS. If there is only one RAID 5 array in the NAS, then RAID 6 may be a better option than a hot spare: the hot spare drive just sits there idly, while in a RAID 6 array the same drive would be utilized to improve read speed. However, if you need two RAID 5 arrays, the hot spare drive can be shared between them, reducing the disk space overhead.
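
A bit of disk-count bookkeeping illustrates the point. The 4+3 and 4+4 splits below are assumed for illustration, with all drives the same size:

```python
# One array in the 8-bay NAS: RAID 5 over 7 disks plus 1 hot spare,
# versus RAID 6 over all 8 disks.
raid5_plus_spare_data = 7 - 1   # 6 data disks; the spare sits idle
raid6_data = 8 - 2              # 6 data disks; parity disks also serve reads
print(raid5_plus_spare_data, raid6_data)   # same capacity, so RAID 6 wins

# Two arrays: two RAID 5 sets (4 + 3 disks) sharing one hot spare,
# versus two RAID 6 sets (4 + 4 disks) that need no spare.
two_raid5_shared_spare_data = (4 - 1) + (3 - 1)   # 5 data disks
two_raid6_data = (4 - 2) + (4 - 2)                # 4 data disks
print(two_raid5_shared_spare_data, two_raid6_data)  # sharing the spare wins
```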

Monday, 29 November 2010

RAID Tips 4 of 10 - RAID 5 uncorrectable error probability

If you plan on building RAID 5 with a total capacity of more than 10 TB, consider RAID 6 instead.

The problem with RAID 5 is that once a member disk has failed, the entire array must be read in order to complete the rebuild. Although the probability of encountering a read error in any particular read operation is very low, the chance of hitting one increases as the array size increases. It has been widely speculated that the probability of encountering a read error during rebuild becomes practically significant as the array size approaches 10TB. Although the speculation relies on certain assumptions which are not likely to be true (we'll have a writeup on that later), it is better to be safe than sorry.

RAID 6, being capable of correcting two simultaneous read errors, does not have this problem.

Friday, 26 November 2010

RAID Tips 3 of 10 - RAID 0

If you are planning to build a RAID 0, consider using an SSD instead. Depending on what your requirements are, you may find a better bang for your buck with just one SSD. Also, higher RPM rotational drives (e.g. WD VelociRaptor series) or hybrid drives (like Seagate Momentus XT) may be interesting.

Tuesday, 23 November 2010

RAID Tips 2 of 10 - The RAID Triangle

The relationship between Speed, Price, and Fault Tolerance mostly determines the RAID level to use. Of these three parameters, pick any two.



  • Fast and Fault Tolerant - RAID 1+0

  • Fast and Cheap - RAID 0

  • Cheap and Fault Tolerant - RAID 5 or RAID 6.

Saturday, 20 November 2010

RAID Tips 1 of 10 - Requirements

When you are about to build a RAID, make sure you understand your storage requirements. The following points are significant:
  • Array capacity. Whatever your capacity requirements are, they are underestimated, most likely by a factor of two.
  • Budget limits.
  • Expected activity profile, especially read-to-write ratio. If there are mostly reads and few writes, then RAID 5 or RAID 6 would be OK. If significant random write activity is expected, consider RAID 10 instead.
  • Expected lifetime. Whatever the projected lifetime of the storage system is, it is underestimated.

For a quick estimation of capacities for various RAID levels, check the online RAID calculator.

Wednesday, 17 November 2010

The effect of S.M.A.R.T.-reported temperatures on failure rate

It was previously thought that there was a clear correlation between disk temperatures and failure rates; however, studies undertaken by Google Inc. on a large disk population have revealed that the correlation is not as strong as previously assumed. In the studies, S.M.A.R.T. data collected every few minutes during a 9-month window of observation were analyzed; only average temperatures were taken into account. It was found that failures do not increase as the temperature increases. Moreover, higher failure rates were observed in the lower temperature ranges. A positive correlation was detected only for disks with temperatures greater than 50°C.

However, 3- and 4-year-old drives stand out. For such drives, the correlation between average temperatures and failure rates turned out to be more pronounced, probably due to the HDD technology current at the time.

Thus, the studies show that disk temperature affects the failure rate directly only for old drives and high temperature ranges (above 50°C). For moderate temperatures, other factors affect failure rates much more strongly than temperature does.

Sunday, 14 November 2010

Why do we need as much information as possible?

Once there was a discussion on one of the repair forums, and one poster said something along these lines:

The only information needed to recover a RAID are the RAID disks themselves. If the recovery lab asks something like controller model, they are not a professional outfit.

This position has some merit. If you can get your hands on the actual drives, you do not really need anything else to do the recovery. This is true for a recovery lab, which works with the actual disks (or images thereof). When we are debugging our RAID recovery freeware, however, we are at a significant disadvantage: the actual disk images are prohibitively large to transfer, so we have to figure the problem out without them.

Lacking the images, we still have our test data sets, crash dumps, and the like, but the customer's description of the problem becomes much more important.

Consider the following problem report, just for entertainment purposes:

We were running XP, and the software RAID5 volume holding the data failed. The array is 4x 1TB WD whatever-model hard drives. The hard drives were verified separately with WD Lifeguard and the tests returned no errors. However, Windows refuses to mount the array and ReclaiMe Free RAID Recovery fails to produce proper output.

Now what is the problem with the recovery? (select whitespace below for an answer).

There is a discrepancy between two statements 1. running XP and 2. using RAID5. They must have been using RAID0, because XP does not support software RAID5.

This perfectly illustrates the importance of all the details.

Thursday, 11 November 2010

On RAID computational requirements

There is a widespread opinion that software RAID imposes significant processing power penalty on the host system, thereby decreasing overall performance.

For RAID 0, this is obviously not the case. The only overhead involved is dispatching the sectors being read or written to the appropriate disk, which requires a fairly simple calculation once for every sector (512 bytes of data) transferred.

RAID 5 and RAID 6 are more complicated: parity data must be computed for each write. However, the processing power requirements are modest and the resources are abundant. Given a 100 MB/sec write speed, we need, say, 1,000 MIPS (Million Instructions Per Second) to calculate the parity, plus additional memory bandwidth of, say, 200 MB/sec (100 MB/sec in and out). Properly designed caching would alleviate the load even further. Meanwhile, a pretty modest CPU (made circa 2005) provides about 15,000 MIPS and about 5,000 MB/sec of memory bandwidth. So the requirements of a RAID performing a sustained write at 100 MB/sec do not seem very high compared to the resources available.
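
For illustration, the RAID 5 parity calculation is just XOR over the data blocks - on the order of one instruction per byte written, which is where estimates of this magnitude come from. A toy sketch:

```python
import functools

def raid5_parity(*data_blocks):
    """RAID 5 parity block: the bytewise XOR of the data blocks."""
    ints = [int.from_bytes(b, "little") for b in data_blocks]
    x = functools.reduce(lambda a, b: a ^ b, ints)
    return x.to_bytes(len(data_blocks[0]), "little")

d0 = bytes([0x0F] * 512)     # one 512-byte sector per member disk
d1 = bytes([0xF0] * 512)
p = raid5_parity(d0, d1)
print(p[:1])                 # b'\xff'  (0x0F ^ 0xF0)

# Any single lost block is recoverable by XOR-ing the survivors:
print(raid5_parity(d1, p) == d0)   # True
```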

On RAID diagnostic messages

Always check what your RAID controller says when a disk failure or other array failure occurs. You should verify by testing that such messages are difficult to miss.

Error messages displayed during the boot sequence are no longer useful, as uptime is now measured in weeks even for home PCs.

If the controller doesn't report error messages, or if for some reason you don't take prompt action to restore array redundancy once a disk failure has occurred, then there is no point in running a RAID at all. Using hot spares alleviates the problem for disk failures, but not for a silent controller malfunction.

On one of the forums, someone told a story about a RAID 1 failure: one of the member disks failed, and it was later discovered that the second member disk contained two-month-old data. Two months before the disk failure, the controller had lost the array for some reason but did not report the error, so nobody bothered to restore the redundancy. As a result, when the only remaining disk failed, there was no redundancy and all the data was lost.

Remember that the RAID recovery may be difficult if the array is seriously out of sync.

Monday, 8 November 2010

Moving drives across ports in a RAID

Can I swap the disks in a RAID (connect the disks to different ports) without losing the RAID data?

There are two types of RAID implementation:

1. Configuration-on-disk (COD), in which the information about the RAID, including which array the disk belongs to, is stored on the disk itself.
In this case you can move the disks between ports and even between controllers of the same model. Such a scheme is implemented in modern software RAIDs (Windows LDM and Linux mdadm) and in most hardware controllers as well. Sometimes you can even transfer the array between different controller models, for example Intel ICH9R and Intel ICH10R.

2. RAID implementations in which the information about member disks is stored in the controller memory. Here the controller tracks ports rather than disks, so you cannot swap the disks; if you do, you lose the array and then need a RAID recovery.

Wednesday, 3 November 2010

Installing XP on a hard drive larger than 128/137GB limit

If Windows XP doesn't see more than 137 (or 128) GB of disk space on the large disk, then you need to turn on BigLBA.

To enable BigLBA on an XP computer, you need to change a parameter in the registry. This approach works well when XP is already installed, but if you need to make a fresh installation on a large disk, you cannot change the parameter because the registry does not yet exist.

To work around this issue, you can do one of the following:

  1. Include the latest Service Pack in the installation CD (this process is called "slipstreaming"), and then the full disk capacity will be available during install.
  2. Install XP on a partition of, say, 100 GB, then install the latest Service Pack, enable BigLBA, and use a tool like Partition Magic to extend the partition onto the remaining space. Normally, we'd recommend that you back up before resizing a partition, but since this is a new install anyway, there is nothing useful to back up.

Tuesday, 2 November 2010

Hard disk sounds

what does spindle motor damage sound like?


Silence.

Saturday, 30 October 2010

mdadm

Software RAID in Linux (mdadm) offers many ways to reshape an array, including
  • changing the type of the array,
  • changing the array size (by adding new disks),
  • and even changing the block size in RAID 5 or RAID 6.


http://neil.brown.name/blog/20090817000931#4.

Tuesday, 26 October 2010

RAID type migrations

Can I change the RAID type after data has been written to the array?

Tom's Hardware answers "No"

In fact, it depends on the particular RAID implementation.

  • In some implementations you can change nothing.

  • In others, you can extend an array by adding new disks.

  • There are also the implementations which allow you to even change the array
    type (although not in all combinations).

You need to consult your particular hardware and/or software manual for the list of available options before committing to a certain expansion plan. In any case, make sure to back up the array contents before expanding or migrating to a different array type. Should the migration process fail midway, the RAID recovery is not going to be cheap.

Friday, 22 October 2010

On critical error messages

If a situation occurs that requires immediate user response, then the error message should exactly describe the situation and possible courses of action.

The warning message should be displayed explicitly, because one cannot expect the user to reliably draw conclusions by comparing several independent parameters.

Tuesday, 5 October 2010

Imaging RAIDs

If you are recovering a RAID and need to create image files, image the member disks rather than the whole array. Generally, it is not possible to restore the original array state from an image of the entire array.

For RAID0, the original state can be restored by writing the image back to the array, provided that the controller settings (and thus the array parameters) have not been changed. If the controller settings have changed between reading and writing the image file, the image data will not land in its original place on the member disks.

For RAID5, an image file of the entire array is not enough even if the controller settings have not been changed, because the parity data is not written to the image file. When the array works properly, the parity data is not used, but for a RAID recovery you need it. A RAID5 of 3x1TB disks holds 3TB of raw data (including parity), while the array image file will be only 2TB in size and obviously does not contain all the data.

An image of the whole RAID taken through an incorrectly configured controller is a jumble of block fragments, and it can only be used if you know what the controller settings were when the image was created. As of 2010, we know of no software capable of recovering such array image files.

Saturday, 2 October 2010

CHKDSK is not the tool to check the disk.

In fact, CHKDSK only checks the filesystem, regardless of where that filesystem is located and what state the underlying physical storage is in. The filesystem may be located on a RAID 5 in which one of the member disks has already failed; from CHKDSK's point of view, the filesystem is still OK.

One should understand that the data storage system consists of several levels and every level should be checked using the appropriate tools and techniques. For example,

  1. A database should be checked using whatever tool is provided with the SQL server.
  2. The volume that stores the database should be checked by CHKDSK.
  3. The RAID that contains the volume should be checked using diagnostic tools provided with the RAID controller.
  4. The member disks, should the need arise, should be checked by their corresponding vendor diagnostic tools.

Sounds a bit like "The House that Jack Built".

Wednesday, 29 September 2010

Nikon internal memory

If you need to access a Nikon camera's internal memory (e.g. on a Nikon Coolpix S200) for photo recovery, just eject the memory card from the camera and connect the camera to the PC using the USB cable. Try not to lose the cable supplied with the camera - Nikon USB connectors are non-standard and spare cables are hard to come by.

Friday, 24 September 2010

Housekeeping

Broken parts should be marked clearly and unambiguously so that no one uses them once again by mistake.


Actually, broken parts should be thrown away. If for some reason, e.g. for subsequent data recovery, you have to keep them, you need to mark them clearly.


If a cable is determined to be bad, it makes sense to cut it in two immediately.


On larger parts, you can draw a cross with a marker (better on all sides). This is how we keep old power supply units (in case the connectors are needed) and hard drives queued for secure erase.

Another option is to tape over the connector - for example, if one port of a router has burned out but the rest are working and you don't want to throw out the entire router.


Tuesday, 21 September 2010

QNAP revisited

For reference, we were using QNAP TS-639 Pro with six WD20EADS disks.

After about half a year of use, the web interface started running slower, to the point where we had to wait several minutes to obtain the list of disks. Gradually, we noticed that the array performance had decreased significantly.

It seemed obvious to assume that one of the member disks had been dropped from the array and the RAID5 was working in degraded mode. However, all the member disks were marked as GOOD in the web interface.

It was suspicious that the LEDs on the disk bays, indicating the state and activity of the disks, were blinking unevenly. Logically, one would expect an almost symmetric load on the disks in a RAID5, but in fact one disk's LED was blinking much more frequently.

Once we ran a bad blocks check on this disk through the web interface, the disk dropped from the array in less than half an hour and its bay LED turned red. At this point, the web interface began to work properly again. The disk taken from the array would not stay online for longer than a couple of minutes when connected to a Windows PC.

RAID 5 is designed to be redundant in order to improve reliability, but a single disk failure destroys the redundancy. If a second disk fails before the first failed disk is replaced, data loss occurs. In our case, the disk failed but all the obvious diagnostic tools reported that the RAID was OK, so two weeks passed before we realized that the disk had to be replaced.

Hot swapping did not work properly either. When we put in the new disk, we expected the array to detect it and start the rebuild. Instead, the web interface hung once again, showing strange indications like "Disks 2, 3, 4, 5, 6 are not present", though read-write operations worked fine all the time. A reboot was required to detect the new disk and start the rebuild.

If you happen to have one of these QNAP units, note that the LEDs on the device indicate the state of the array correctly. If there is a discrepancy between the LEDs and what the web interface tells you, go with the LEDs.

Tuesday, 14 September 2010

Finally a free RAID recovery software

We have released ReclaiMe Free RAID Recovery, available for download at www.FreeRaidRecovery.com. As the name implies, this is absolutely free RAID recovery software. The tool reconstructs the most widely used RAID layouts - RAID 0, RAID 5, and RAID 01/10. ReclaiMe Free RAID Recovery is capable of recovering the following array parameters:
  • start offset,
  • block size,
  • member disks and data order,
  • parity position and rotation.

Once you have recovered the RAID parameters, you can:

  • Run ReclaiMe data recovery software to recover data from the array;
  • Create the array image file;
  • Write the array to disk;
  • Save layout to the XML file;
  • Get the instructions and recover data using other data recovery software.

In addition to being absolutely free, the tool is really simple to use - there is no Settings button at all. All you need to start a RAID recovery is to select the available member disks (our tool can reconstruct a RAID 5 with one disk missing), decide on the array type, and click Start.

Sunday, 12 September 2010

QNAP sucks

We have a QNAP 639 Mega Hyper Super Turbo Station NAS unit. Long story short, the web interface now refuses to work. Earlier it was more of an intermittent PITA, but now it has completely ceased to work. To be precise, the AJAX part of the interface never responds, leaving us without array status, SMART data, or anything else. It just sits there displaying a "Loading" message along with a stupid rotating AJAX thing, without end.

Not exactly the performance one would expect from a $1500 unit.
Maybe we should use it as a practice target for RAID 6 recovery, since we have no RAID 6 recovery yet.

Friday, 3 September 2010

Difference between RAID 0+1 and RAID 1+0




The diagrams of RAID 0+1 and RAID 1+0 are shown above.
As you can see they seem to have different data organization.


Here, four drives are shown. Can you determine what type of the RAID this is?
Actually, RAID 0+1 and RAID 1+0 are the same when it comes to data recovery.

Friday, 20 August 2010

RAID5 diagram fail


(Taken from WD site)

What's wrong with the picture?

Select the whitespace below for an answer
In the above RAID5 picture, there are two parities in the middle row. In a proper RAID5 there must be exactly one parity per row.
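
For comparison, here is a toy generator of a correct RAID 5 layout (assuming left-asymmetric parity rotation, one of several common variants):

```python
def raid5_rows(disks, rows):
    """Lay out data blocks D0, D1, ... with one parity block per row,
    rotating the parity position one disk per row (left-asymmetric)."""
    data = 0
    table = []
    for r in range(rows):
        parity_disk = (disks - 1) - (r % disks)
        row = []
        for d in range(disks):
            if d == parity_disk:
                row.append("P")
            else:
                row.append(f"D{data}")
                data += 1
        table.append(row)
    return table

for row in raid5_rows(4, 4):
    print(row)
# Every row contains exactly one "P" - unlike the diagram above.
```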

Monday, 16 August 2010

Upcoming: free RAID recovery software

We are now developing RAID recovery software. As you know, RAID recovery boils down to two independent steps: recovering the RAID parameters and then recovering the data off the reconstructed array. Once the array parameters are known, the data recovery (the second step) can be done using any data recovery software. The RAID parameter reconstruction is not such an easy task, and frankly speaking, there are very few RAID reconstruction tools.

So, we decided to develop our own RAID parameter recovery software, which we think will be the best there has ever been, at the best possible price.

Our RAID parameter recovery software will work with RAID 0 and RAID 5. The output will be either a parameter set or an image file of the entire RAID array. An option to write the array data directly to a specified device is also being considered.

We are planning to release the software by the end of September (of year 2010, you know). All the other specifics, like the download location, are still to be determined. Keep an eye on our blog posts.

Saturday, 7 August 2010

Partitioning in modern systems

In the early days of computing, partitioning a disk into several volumes was sometimes a good idea. Now, “one disk – one volume” is the most practical way to go.

The original factors and reasoning behind having multiple partitions have much less significance now. The FAT16 limit on maximum volume size is no longer a factor, and modern filesystems such as NTFS use much smaller clusters for a given volume size, so the loss of space to slack is much less of a concern.

There are a few exceptions:

  • When an extremely large RAID is involved, a single volume may be impractical because of backup and filesystem consistency checking considerations. Last but not least, read-only data recovery requires free space of the same size as the damaged volume; this may be difficult to provide for a gigantic monolithic volume.
  • If multiple operating systems are required with multiple different filesystems, then having several partitions is perfectly justified.

The most common side effects of having multiple partitions are:

  • Free space gets “fragmented” – although the combined free space may be large, no individual partition has enough free space to hold whatever monolithic block of disk space is required.
  • If the data is logically grouped (e.g. “OS” and “data” partitions), disk space requirements are hard to predict. Eventually the OS outgrows its designated partition, while the free space on the “data” partition cannot be used.

Tuesday, 3 August 2010

Waterproof or not?

Roaming the web, we stumbled upon a blog post about a new waterproof SD card technology. Naturally, we wondered whether such memory cards are really needed, or whether ordinary cards are good enough. So we gathered the memory cards we had at hand, and decided not to ignore a USB pen drive either. We formatted them all and put files on them.
Then we filled a microwave container with water, put all the devices into it, and left them underwater for half an hour.

So we got (left to right)

  • Nikon EC8-CF 8MB CompactFlash
  • Transcend JetFlash V30 2GB (USB pen drive)
  • Sandisk M2 2GB
  • Kingston MMC mobile 1GB

under water



After that we got the devices out, wiped them dry and connected to the computer one by one.

And here is the result of our experiment: all of the memory cards work well, with their content intact. As for the USB pen drive, we were concerned for a moment that we had killed it as we watched it sink slowly, filling up with water. However, in the end, the pen drive works perfectly as well.

Monday, 2 August 2010

Data recovery after reinstalling Windows yields better results than might be expected. During the reinstall, the drive is formatted and a new copy of Windows is installed.

If the format was "Complete" and it was Windows Vista/7 being installed, then nothing can be recovered: during a complete format, Windows Vista and Windows 7 overwrite the drive content with zeros.

If you use the "Quick" format, then Windows will be installed on the clean drive the same way it was installed the first time. Since a computer always produces the same result for the same task, the new copy will be written over the previous copy (except for a possibly different selection of components to install). Thus the previous Windows files will be overwritten, but not the user files, which were written later. Certainly, some data will be lost (e.g. data saved in the registry, like settings and passwords), but documents have a good chance of surviving a reinstallation.

This approach doesn't work if you are installing a newer Windows version, because newer versions are larger.

Wednesday, 28 July 2010

Just kidding


Came across the term "USB drive housing" today. So here goes a little attempt in drawing :)

Tuesday, 27 July 2010

Uncommanded HPA activation

Yesterday, the request came through like

we got a Samsung 40GB hard drive which started to show 4GB capacity. The drive was partitioned 4GB+35GB, now the second partition is just gone, and the first is displayed as raw file system.

So naturally, I pulled up the hard drive capacity troubleshooting manual and started going through it.

  1. Old mainboard? No way, it was working just fine the day before with the very same mainboard.
  2. BigLBA? Not relevant because the limit is 128/137GB whilst the drive is 40GB.
  3. Jumper settings? Checked, removed/reset, no fix.
  4. Host Protected Area (HPA)? That was the only option remaining.

Atola's HDD Capacity Restore does not work on x64 Vista (and presumably Windows 7 as well) because of driver issues. It took a while to move to a 32-bit XP installation, but it was well worth the hassle. It turned out that resetting the HPA returned the drive to normal condition immediately.

The question remains, however, what caused the HPA to activate in the first place. This is not an unusual occurrence - we've seen it several times over the last year - and resetting the capacity always seemed to resolve the problem, but the root cause was never identified.
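An HPA works by telling the drive to report a "current max" address lower than its "native max". The arithmetic for spotting a clipped drive can be sketched in a few lines; the sector counts below are made-up illustrative values, not the actual LBAs of the Samsung drive in the story.

```python
# Hypothetical sketch: deciding whether an HPA is clipping capacity,
# given the two sector counts an ATA drive can report.
SECTOR = 512  # bytes per sector, assumed

def hpa_size_bytes(native_max_lba: int, current_max_lba: int) -> int:
    """Return the number of bytes hidden by the HPA (0 if none)."""
    if current_max_lba >= native_max_lba:
        return 0
    return (native_max_lba - current_max_lba) * SECTOR

# A 40 GB drive clipped to ~4 GB, as in the story above (made-up LBAs):
native = 78_125_000   # roughly 40 GB worth of sectors
current = 7_812_500   # roughly 4 GB visible
hidden = hpa_size_bytes(native, current)
print(f"HPA hides {hidden / 10**9:.1f} GB")
```

Resetting the HPA simply restores "current max" to "native max", which is why the fix takes effect immediately.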

Monday, 19 July 2010

Bad sectors, part III - Zero-fill

Drive zero-filling (sometimes erroneously called low-level formatting) can in some cases fix bad sectors.

First, all sectors with incorrect checksums are overwritten with correct checksums, so these sectors can be used again. Those sectors which can't be fixed by a simple overwrite will be reallocated.

It is important to understand that zero-filling doesn't eliminate the reason why the bad sectors appeared in the first place. For example, if there is a problem with the power to the drive (e.g. a loose contact), the drive will power down periodically, and soft bad sectors will appear as a result.

There are software vendors, HDD Regenerator and SpinRite among them, who claim that their software can repair the drive surface. In fact, there is no general technique to view or change the list of defective or reallocated sectors, or to perform a low-level format on a modern hard drive. These techniques differ from model to model and usually require hardware-assisted solutions such as PC3000. The best DIY choice is a diagnostic utility from the drive's vendor. Some of them can zero-fill the drive, but one should understand that such zero-filling destroys all data irreversibly. No data recovery is possible after zero-filling.
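The zero-fill itself is nothing more than sequential chunked writes of zeros from start to end. A minimal sketch, using an ordinary file as a stand-in target; on a real system the target would be a raw device such as /dev/sdX, and the operation destroys all data on it:

```python
# Minimal zero-fill sketch. Chunked writes keep memory use constant
# regardless of target size. "demo.bin" is a stand-in, not a real drive.
import os

CHUNK = 1024 * 1024  # 1 MB per write

def zero_fill(path: str, total_bytes: int) -> None:
    zeros = bytes(CHUNK)  # a buffer of zero bytes
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(zeros[:n])
            remaining -= n

zero_fill("demo.bin", 3 * CHUNK)
print(os.path.getsize("demo.bin"))  # → 3145728
```

Writing through every sector is what gives the drive's firmware the chance to fix checksums and reallocate the sectors it cannot fix.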

Monday, 12 July 2010

Bad sectors, part II - reallocation

Since it is known in advance that it is impossible to create a perfect magnetic surface, a number of spare sectors are reserved on the drive.

When a surface defect appears, the sector with the defect is replaced with a good one from the pool of reserved sectors. Obviously, there is no surface repair involved. Instead, a special record is made in the address table, like "if a write/read request arrives for sector 123, use sector 456 instead". This results in a certain loss of performance, because it is now required to move the head to the reserved sector zone and back again instead of just reading a contiguous chunk of data. On top of that, the data which was stored in the bad sector is lost. Nevertheless, theoretically you can continue to use the drive as if there were no bad sectors at all.
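The address table described above behaves like a simple lookup: a toy model (not actual drive firmware, just an illustration of the redirection) could look like this:

```python
# Toy model of the reallocation table: requests addressed to a remapped
# sector are redirected to a spare sector; healthy sectors pass through.
remap = {123: 456}  # "if a request arrives for sector 123, use 456 instead"

def physical_sector(logical: int) -> int:
    return remap.get(logical, logical)

print(physical_sector(123))  # → 456 (redirected to the spare area)
print(physical_sector(124))  # → 124 (healthy sector, direct access)
```

The extra lookup is cheap; the performance cost comes from the head travel to the spare zone, which the model obviously cannot show.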

This process is called “reallocation”. The S.M.A.R.T. attribute named “Reallocated Sectors Count” shows the number of reallocated (replaced) sectors.

If the drive idles long enough, it can start a self-test, reading random sectors to make sure that they are not corrupted. The sectors with defects are queued and then subjected to reallocation if needed. Another S.M.A.R.T. attribute – “Current Pending Sector Count” – monitors the status of this queue.

The first surface check is done during production of the drive, and the new drive (just from the factory) may already have several reallocated sectors. However, these "factory-certified" defects are not shown in the S.M.A.R.T. counters.

Sunday, 11 July 2010

Bad sectors, part I - soft bad sectors

There are two kinds of bad sectors - those that can be recovered by overwriting and those that can't.

It is almost impossible (and would be prohibitively expensive for practical use anyway) to create a perfect magnetic surface without a single defect. Instead of trying to create a perfect surface, additional redundant data is written along with the user data. It is then possible, based on the redundant data, to recover short sections of data which were read incorrectly. This is called Error Correction Code (ECC). Nevertheless, the capability of ECC to correct errors is limited. It is not possible to recover either too many bad bits or too long a continuous bad section.

If a power failure occurs while you are writing data to the drive, the write may be interrupted approximately halfway. Thus, the first half of the sector contains the new data while the other half still has the old data. The error correction code is not capable of fixing such an error, and when an attempt is made to read the sector, it will be declared bad.

In fact, such a sector is not mechanically bad; it just contains data with a wrong checksum. To fix it, it is enough to write new data to the sector, and the sector will then function properly.

The truly mechanical damage - the destruction or wear of magnetic surface - cannot be corrected in such a way.

Monday, 5 July 2010

The missing ingredient

I've been reviewing the vendor-supplied hard drive diagnostic tools and found one thing that is missing - in all of them. Before each test (the S.M.A.R.T. self-scan, whatever) there should be a clear indication of whether the test is destructive or not.

Naturally, it is reasonable to assume that there would be a warning before the test if the test is destructive (like zero-filling the drive). However, gut instinct does not allow most of us to rely on that assumption. Murphy's law, reinforced by past experience, suggests that the developers may have forgotten to include the warning, and that testing failed to spot it.

So, there is always some concern when starting a test. It would feel much better if there was a message clearly stating that the test is not destructive.

Wednesday, 30 June 2010

Fake flash drive



This is the most (or one of the most) well-known fake flash drive images. Looking at it once again, I suddenly spotted that it is actually not a pen drive at all. It is, or rather it is supposed to be, a D-Link DWL-G122 USB wireless network device.

Following is the more realistic, and more widespread, variant of the fake pen drive.
This one is working, but mislabelled. Typically, they take a 2GB pen drive, stick a 32GB label onto it, and then resell it at a higher price. The following image is an older, lower-capacity pen drive forced into the smaller case of a newer model:

Note that the PCB is cut off a little at the top to make it fit, and then glued to the case with a blob of silicone sealant.

The flash drive relabelled from 2GB to 32GB still has only 2GB of actual capacity. So if more data is written to the flash drive, the excess data has to go somewhere. There are two possibilities: either the excess data goes nowhere - it is just discarded and never stored at all - or the data is written again from the beginning of the device, overwriting what was there previously. Either way, it is not possible for data recovery software to retrieve the data in excess of the 2GB capacity.
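The second possibility - wrapping around and overwriting earlier data - can be shown with a scaled-down toy model. The block counts here are arbitrary small numbers standing in for the 2GB of real cells behind the fake 32GB label; actual controller behavior varies by chip.

```python
# Toy model of a relabelled flash drive: a few real cells behind a
# much larger advertised address space, with writes wrapping around.
REAL_BLOCKS = 4          # stands in for the real 2 GB
FAKE_BLOCKS = 64         # what the label claims

cells = [None] * REAL_BLOCKS

def write_block(addr: int, data: str) -> None:
    cells[addr % REAL_BLOCKS] = data   # wraps past the real capacity

for i in range(6):                     # write 6 blocks of "32 GB" space
    write_block(i, f"block{i}")

print(cells)  # → ['block4', 'block5', 'block2', 'block3']
```

Blocks 0 and 1 were silently overwritten by blocks 4 and 5 - exactly the data loss that no recovery software can undo.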

Monday, 28 June 2010

BigLBA

If the drive has become RAW file system, you should first check the drive size to see if it is exactly 128 or 137 GB. The raw filesystem issue may be caused by clipped hard drive capacity.

If you see 137 or 128 GB, check whether BigLBA works before you rush to recover data. This is especially true for drives (including external USB drives) which were brought to an old computer.

BigLBA is a registry parameter which determines whether 48-bit block addressing is used. If it is disabled, the maximum accessible disk size equals 128 or 137 GB (depending on whether binary or decimal gigabytes are used for the drive size).
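The 128-vs-137 figure is the same byte count expressed in two unit systems. Without 48-bit addressing, the older 28-bit LBA scheme can address 2^28 sectors of 512 bytes:

```python
# Where 128 vs 137 GB comes from: 28-bit LBA addresses 2**28 sectors
# of 512 bytes each.
limit_bytes = 2**28 * 512
print(limit_bytes)                 # → 137438953472
print(round(limit_bytes / 10**9))  # → 137 (decimal gigabytes)
print(limit_bytes // 2**30)        # → 128 (binary gigabytes, GiB)
```

So a clipped drive shows up as 137 GB in tools using decimal units and 128 GB in tools using binary units.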

Sometimes it turns out that BigLBA is off despite the fact that it theoretically should be enabled in any modern installation (starting with Windows XP SP2).

Refer to the instructions for troubleshooting disk capacity issues for more details on what you need to check and how to enable the BigLBA parameter. If BigLBA was in fact the issue, fixing it typically restores the drive with the RAW filesystem to proper functioning.

Thursday, 24 June 2010

Secure erase a.k.a. data wiping

If you need to delete a file irreversibly, it is not enough to just delete it and then empty the Recycle Bin. Data recovery software is quite capable of restoring data that was deleted in such a way.

In earlier days, when the FAT filesystem was widely used, it was sufficient to overwrite the file with some garbage data. To overwrite the file data completely, the garbage data size should be no less than the original file size. This worked because FAT is a rather simple filesystem.

As filesystem complexity increases, the number of filesystem features which must be taken into account increases as well. Nowadays, it is no longer enough to just write other content to the file to delete it irreversibly.

For example, if a file is stored on the NTFS filesystem in compressed form, then depending on the compressibility of the data in the original file (to be securely erased) and of the new content, most likely a new set of clusters would be allocated for the new file data. Therefore, the original file data would not be overwritten at all.

It is particularly useless to write zeros - if NTFS compression is turned on, zeros are not written at all (producing a so-called sparse file), and therefore the original data would not be overwritten.

The next obvious step is to delete the file and write some incompressible garbage over all the free space. Sounds good, but unfortunately it does not work, because the original file may be “resident”, in which case its content would not be overwritten. Thus, you should not just overwrite the free space, but also overwrite all the free MFT records.
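For illustration only, here is the naive in-place overwrite that the reasoning above warns about. On a plain uncompressed file this does replace the content, which the demo shows; on NTFS, compression or a resident file can silently defeat it, which is exactly why a vetted tool is needed.

```python
# Naive in-place overwrite: open the file for update and write random
# (incompressible) bytes over its full length. NOT a secure erase on
# NTFS - compression and resident files can leave the original intact.
import os

def naive_overwrite(path: str) -> None:
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.write(os.urandom(size))  # incompressible garbage
        f.flush()
        os.fsync(f.fileno())       # push the write to the device

with open("secret.txt", "wb") as f:
    f.write(b"password=hunter2")
naive_overwrite("secret.txt")
print(open("secret.txt", "rb").read() == b"password=hunter2")  # → False
```

SDelete handles the NTFS-specific cases (compressed, sparse, and resident files, plus free MFT records) that this sketch ignores.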

In short, secure erase is complicated and difficult to do properly. If you ever need it, use SDelete. SDelete is free, created by Mark Russinovich, and has been tested to work many times. Additionally, there is a good explanation of how it works and what was taken into consideration.

Tuesday, 22 June 2010

Should I make a disk image file?

A disk image file - an exact copy of all the disk content - is the first thing that a data recovery lab makes. When recovering data at home, however, it is often not reasonable to create a disk image file.

Having the disk image file stored aside makes recovery safer. If a fix goes wrong, the image file provides a backup to try again. If the disk is physically damaged, a disk image file allows you to perform the recovery independent of the mechanical condition. Almost any data recovery software can create a disk image file, and in most cases can load a disk image file that was created by another tool.

The significant disadvantage is that creating a disk image file takes a long time and requires a lot of free disk space.

When using read-only data recovery software, and provided the drive is physically OK, the risk of further data damage is negligible. Read-only recovery itself requires free space at least equal to the size of the data being recovered; if a disk image file is used, you need free space for the image as well - and may even need to buy a new large hard drive.

Thus, it might be reasonable to attempt a recovery without creating a disk image first.

Thursday, 17 June 2010

Data recovery and different USB protocols

When recovering data from a USB external hard drive, you should keep an eye on the data read speed. If the speed is less than 2 MB (megabytes) per second, it is better to abort the recovery and figure out in what mode the devices are working. The speed is of less concern with smaller devices; that is, if you need to recover a pen drive, you just sit, watch, and wait it out.

There are two different versions of the USB protocol, USB 1.1 and USB 2.0. From the user's point of view, these protocols differ only in data transfer speed. USB 1.1 transfers a maximum of ~1.5 MB/sec, while USB 2.0 can achieve ~50 MB/sec. If several USB devices involved in a data transfer use different USB protocols, the lowest data transfer speed is used.
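A quick back-of-the-envelope estimate shows why the difference matters for recovery. The 500 GB drive size and the 30 MB/sec sustained USB 2.0 rate below are illustrative assumptions, not figures from the post:

```python
# Rough duration estimate for reading a whole drive over each USB mode.
def hours(size_gb: float, mb_per_sec: float) -> float:
    return size_gb * 1000 / mb_per_sec / 3600

print(f"{hours(500, 1.5):.0f} h over USB 1.1")  # → 93 h over USB 1.1
print(f"{hours(500, 30):.1f} h over USB 2.0")   # → 4.6 h over USB 2.0
```

Roughly four days versus an afternoon, before retries on slow sectors make it worse.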

Although USB 2.0 was developed in 2000, USB 1.1-only hard drive enclosures and card readers are still produced. If you are going to buy a USB enclosure for external hard drive recovery, check that it supports the USB 2.0 protocol. Data recovery from a drive connected via USB 1.1 is too slow to be practical, because it would take a couple of days, or sometimes even weeks, to complete.

If you use a USB hub, you should check it as well. Generally, whenever possible, try not to use any intermediate devices. Some hubs can switch to USB 1.1 mode when many devices are connected through them.

On some motherboards, USB ports of different versions are mixed. If possible, check the motherboard manual to find out which ports are USB 2.0 and connect to those. If you do not have the manual, simply try a few different ports. Most often there is a difference between the USB ports on the front of the case and those on the rear.

Tuesday, 15 June 2010

Folder tree structure vs. file data

Is it possible for data recovery software to produce a correct file and folder structure but bad file content, or vice versa? Why does this happen?

The answer depends on the filesystem type being recovered.

On FAT, the location of the parent folder is determined by the same formulae which are used for finding data. If the parameters in these formulae are invalid, neither the data nor the folder structure can be restored. Hence, typically, if you have a folder tree recovered properly or close to that, the files should be good as well.

On NTFS, there are two independent sets of parameters, one set controlling the data location and the other set covering the parent-child relationships in a folder tree. So on NTFS, it is theoretically possible (and sometimes happens) to have one good set of the parameters but the other one wrong. So, if you unformat an NTFS drive, a good folder tree full of damaged files is perfectly possible.

On HFS and HFS+, the parent-child relationship is described by designated records in the catalog file. So it is possible to recover a folder tree even if both child and parent folder records are damaged. HFS utilizes three different datasets to store information about the file data, the file names, and the content of large files. Any of these three may be damaged separately, so all sorts of combinations are possible.

Rinse, repeat

On Tom's hardware, there is a question My 120 GB portable drive has some corrupted files that I can't delete ... How do I wipe the drive or delete the corrupted data?

The obvious answer is: just copy all the good data you need (if any), and format the drive. This definitely resolves any software corruption which may be present.

The less obvious option would be to run CHKDSK /F, then reset permissions on the folders, then delete the folders and files. Formatting is just simpler, faster, and generally more definitive.

However, another poster chimes in, saying to use XP setup CD and delete [the partition] completely,then create a drive again...do this 2-3 times and ur hdd is all clean. This is
  1. not precisely true, because deleting the partition and then re-creating it does not wipe out the data (so the drive does not become all clean, it is still subject to unformat), and
  2. not needed, because only the first time matters - on the subsequent delete-create cycles the system does not delete more data than already deleted during the first cycle.

Sunday, 13 June 2010

Thou shalt not overclock

Computer components have specifications they are designed to meet. The specifications are there for a reason - most particularly, stability. If a CPU is rated for, say, a 2.0 GHz frequency, this means it will run flawlessly at 2.0 GHz. If you find a way to force it to 3.0 GHz, all bets are off.

The art of running components faster than they are rated for is called "overclocking". Some pretty amazing results have been achieved, especially when nonstandard technology is thrown in, along the lines of liquid nitrogen cooling. Unfortunately, there is one thing all these achievements lack - stability.

An overclocked system tends to bite its owner one day. Even if it runs fine for a while, an overclocked system tends to degrade faster, and may soon degrade to the point where it fails to perform.

Take this long story for example. It involves a long list of suspected components: PSU, RAM, a dying CPU, you name it. Lo and behold, a simple revert to the rated speeds fixes the problem. The owner is lucky that the filesystem did not crash during the troubleshooting. If you boot up with the CPU or memory not functioning properly, filesystem crashes (either partial or leading to the raw file system state) are more than likely. In this particular case, it looks like CHKDSK took proper care of the filesystem. However, it does not look like the end of the story just yet - the owner is going to give it a few days and attempt a small overclock again. Yep, just a small one.

Wednesday, 9 June 2010

Partitioning for speed

I want to partition the RAID 0 array in order to create a dedicated space for Virtual Memory and Scratch disks for Adobe Photoshop and Premiere. The array is 4x 300GB WD VelociRaptor RAID0.

This is not going to work as intended. To get better overall performance, he'd be better off splitting the array into two arrays of two drives each, or maybe even down to standalone drives. Scratch and swap files give better performance when placed on separate hard drives, in such a way that no "spindle" is servicing more than one data stream. In the layout with one 4-disk array, four "spindles" would be serving three or four data streams (source data, swap, scratch, and output data), which is far from ideal because the number of seeks would be too high. Adding partitions to the mix would only ensure there is a certain minimum distance for the disk heads to travel across partition boundaries. This would actually decrease performance.

In RAID planning, speed estimates such as those provided by the RAID calculator may be handy, but they only apply to the simplest case of a single data stream. Also, keep in mind that a RAID setup does not improve access time (command-to-start-read).

Tuesday, 8 June 2010

When calling in, or posting on a forum to get help with a RAID recovery, one should have the following info readily available.

  1. What is the array type, RAID0, RAID0+1, RAID5, whatever.
  2. How many drives were in the array originally. It might seem surprising, but every once in a while there is difficulty establishing the number of drives with an appropriate degree of certainty.
  3. How many drives are available now.
  4. Are there any known drives with a mechanical hard drive damage? If yes, how many drives are affected?
  5. What device does the array come from? Is it a NAS (and what model), a brand server (what brand, model, and configuration options), or maybe a homebuilt machine (controller model or RAID software)?

Although these questions may appear very simple, it still takes time to gather the information. When you have a RAID incident, collect this info as soon as practical.

Monday, 7 June 2010

Redundancy in various filesystems

This is a quick summary of the redundant elements deliberately maintained in filesystems.

FAT16 and FAT32 filesystems typically have two copies of the file allocation table (FAT). This is possible because the table is relatively small and the resulting overhead is not significant. Despite this, some devices (e.g. the Sony Ericsson W580i mobile phone) do not update the second copy of the table.

As for the NTFS filesystem, a full copy of the Master File Table (MFT) doesn't exist because it would be too large and too expensive to update. However, NTFS stores a copy of the beginning of the MFT. This copy has a variable size depending on the cluster size. Only the records describing the system files are copied. There is no copy of the user file records.

The exFAT filesystem, which one might come across during a pen drive recovery or an SD card recovery, stores only a single copy of the file allocation table, most likely for performance reasons.

HFS and HFS+ do not keep a copy of the Catalog File, although it would be theoretically possible because the copy wouldn't be too large. However, the designers opted not to do it.

RAID is not a substitute for a proper backup.

RAID reliability is provided by redundancy. In theory, the probability of a simultaneous two-disk failure is the square of the probability of a single disk failure, but that formula only works if the failures are independent.
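The independence formula is worth a quick sanity check. The per-drive failure probability below is an assumed illustrative figure, not a measured one:

```python
# Under independence, a two-disk mirror (RAID1) loses data only when
# both drives fail; a two-disk stripe (RAID0) loses data when either
# drive fails.
p = 0.05                          # assumed per-drive failure probability
raid1_loss = p ** 2               # both drives must fail
raid0_loss = 1 - (1 - p) ** 2     # at least one drive fails

print(f"{raid1_loss:.4f}")        # → 0.0025
print(f"{raid0_loss:.4f}")        # → 0.0975
```

Correlated failures (shared power, heat, controller, cables) push the real mirror figure well above the idealized p².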

In practice, drive failures are not independent, because many factors are common to all drives in the array.

These factors include:

  • Temperature. If a drive is damaged as a result of overheating, most likely the rest of the drives are overheating as well.
  • Power. If a power supply burns out (or lightning strikes the power line) all the drives would fail immediately.
  • Logical connection. If you have RAID 1 and you have accidentally deleted some files, both copies would be deleted simultaneously.
  • Controller. If a RAID controller burns out, the disk array goes offline completely. In a lucky case, it is possible to attach the drives to a similar controller which will recognize the array, but it is not always that smooth. Sometimes, RAID recovery software might be needed.
  • Cables. If several drives are connected to the same cable (as was once common with IDE and SCSI) and the cable snaps, all the drives on that cable are gone at once.

There are multiple reasons why a redundant array may fail all at once, so a proper backup is still required for secure data storage.

Friday, 4 June 2010

If you start formatting a hard drive and then find out that it is the wrong drive, press "Cancel" immediately, and then go for the reset button (if your computer has one). If you do not have a reset button, keep in mind that the power button has a five-second delay before the shutdown occurs. Pulling the plug from the wall socket may be a better option.

There is a significant technical difference between Quick and Complete format, but we'd rather discuss the timings for now.

If you are doing a quick format, pressing "Cancel" and reset will most likely be of no use because you do not have time for it, but you should try anyway.

In case of a complete format (with Windows Vista or Windows 7, which actually overwrite the data during the format):

  • on the FAT filesystem the file allocation table is lost very quickly and then folders are progressively lost. Loss of the allocation table makes subsequent unformat attempts difficult and causes the loss of all the fragmented files. Further loss of folder records makes the recovery next to impossible even though the file content may still be there.

  • on the NTFS filesystem, the MFT (Master File Table) is typically located starting at a 3GB offset and takes up about 100MB. The typical disk write speed is about 30-60 MB/sec, so the first 3GB are filled in about one minute, after which the MFT is lost, making the recovery next to impossible. Modern SSDs with write speeds of about 300MB/sec cut the available time to about 10 seconds.


All in all, the conclusion is that you had better double-check which drive you are about to format.
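The one-minute and ten-second figures come from simple arithmetic over the typical 3GB MFT offset, sketched here with the write speeds assumed in the text:

```python
# Seconds until a complete format reaches and destroys the MFT,
# assuming the MFT starts at the typical 3 GB offset.
def seconds_to_mft(write_mb_per_sec: float, offset_gb: float = 3) -> float:
    return offset_gb * 1000 / write_mb_per_sec

print(seconds_to_mft(50))    # → 60.0 (hard drive, ~50 MB/sec)
print(seconds_to_mft(300))   # → 10.0 (fast SSD, ~300 MB/sec)
```

In other words, the window for hitting "Cancel" or reset is a minute at best.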

Monday, 31 May 2010

Types of data recovery

There are two distinct types of data recovery, namely “in-place” and “read-only” recovery.

The in-place recovery is the attempt to fix the errors and bring the filesystem to the consistent state. This is done by changing the damaged filesystem itself.
The read-only recovery, as the name implies, does not change the damaged filesystem. Instead, the data is extracted and copied to the separate dedicated storage.

The prevalence of each type of data recovery has been changing over time.

In the days of DOS, Windows 3.11, and then Windows 95, in-place repair prevailed. Actually, it was the only option available before Ontrack released their “Tiramisu Data Recovery” circa 1999. So, you had Norton Disk Doctor (which was quite good at fixing errors), Norton Unerase, and Norton Unformat, and that was about it. Norton Utilities worked with the FAT filesystem under DOS or Windows 95. The prevalence of in-place repair is understandable if you consider the simplicity of the filesystem and the cost of storage in those days. The most widespread filesystem was FAT, which is rather simple and well documented. On the other hand, the cost of a spare hard drive was prohibitively high.

The release of Windows 2000 changed things significantly. The NTFS filesystem was quickly established as the standard, but it badly lacked documentation (in fact, it is still not fully understood by developers outside Microsoft). As far as in-place repair is concerned, you were left with Microsoft's CHKDSK. Since NTFS is not documented, it is not possible to fix it in place, because you do not know what the consistent state of the filesystem should be. Even a minor deviation from the standard which is unknown to the developers causes the NTFS driver to reject the volume or produce otherwise bizarre behavior. However, the storage cost dropped, making it easier to find high-capacity spare storage. So the simplest route - just extracting what is really needed, the file content, and not worrying about filesystem consistency - became the most cost-effective approach. Nowadays, read-only data recovery software dominates the do-it-yourself data recovery market, and in-place repair is left to servicemen in the recovery labs. In-place repair is only used in some specific cases where high-capacity disk arrays are involved.

Saturday, 22 May 2010

Data recovery vs. TRIM

TRIM wins.

The TRIM command available on modern SSDs reduces the chances of successfully undeleting a file. TRIM violates the most significant principle of data recovery: that “the data is not overwritten until the disk space is actually required to store another piece of data”.

Writing to an SSD is slower because before writing to a block, the block must first be erased, and the erase operation is relatively slow. This is what causes the well-known performance degradation of an SSD as the device is filled to capacity and no more blank blocks remain.

To compensate for this performance degradation, the TRIM command was introduced to erase the specified blocks in advance. TRIM is supported by most modern high-capacity SSDs and is commanded by the OS (supported starting with Windows 7). When Windows 7 is idle, it commands TRIM to erase those blocks which are no longer in use.

Thus, it is no longer enough "not to write anything to the disk". Even if Windows just sits idle long enough, it wipes out the content of deleted files in the background. When you then try to undelete a file, the content is all zeros.
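A toy model of the effect, ignoring all SSD internals (wear leveling, flash translation layer) and keeping only the part that matters for recovery: trimmed blocks read back as zeros.

```python
# Toy model of TRIM: the OS reports freed blocks, and the SSD erases
# them in the background, so a later undelete finds only zeros.
disk = {0: b"file-A", 1: b"file-B"}

def trim(blocks) -> None:
    for b in blocks:
        disk[b] = b"\x00" * 6      # erased ahead of any future write

trim([0])                          # file-A was deleted; the OS sends TRIM
print(disk[0])                     # → b'\x00\x00\x00\x00\x00\x00'
print(disk[1])                     # → b'file-B'
```

On a hard drive, block 0 would still hold "file-A" until something new was written there - exactly the assumption TRIM breaks.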

If you delete a file, it is likely that the TRIM command will be issued soon and will destroy the data completely. In case of catastrophic damage - when the entire disk is unreadable, becomes RAW file system, or Windows fails to start - there are no side effects from TRIM, because the operating system either is not running or does not command TRIM for a raw filesystem drive.

Thursday, 20 May 2010

What's the best cluster size

On Tom’s hardware forum, mdk4ever asks: what is the best cluster size? He's got a new 2 TB Western Digital hard drive and wants to set the optimal cluster size while formatting the drive.

The rule of thumb is “Always use the default cluster size”. The performance gain from changing the cluster size, if any, is not noticeable. One can speculate that the GUI option to change the cluster size is in itself a legacy from the days of floppy drives.
If you change the cluster size, you may encounter unexpected side effects (did you know that NTFS compression is off for cluster sizes greater than 4KB?).
Another consideration is that when you need to recover data, it is much more convenient to deal with the default cluster size. Data recovery software can either calculate the default cluster size (for HFS) or look it up in the table of standard sizes (for FAT and NTFS). A non-default cluster size has to be determined by complex techniques.
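The "table of standard sizes" idea can be sketched for NTFS. The figures below follow Microsoft's published format defaults for NTFS volumes; treat the table as illustrative and partial rather than authoritative:

```python
# Sketch of a default-cluster-size lookup for NTFS, of the kind
# recovery software can consult when the boot sector is damaged.
def default_ntfs_cluster(volume_gb: float) -> int:
    if volume_gb <= 16 * 1024:      # up to 16 TB
        return 4096
    if volume_gb <= 32 * 1024:      # up to 32 TB
        return 8192
    return 16384                    # larger volumes (table truncated)

print(default_ntfs_cluster(2048))   # → 4096 (the 2 TB drive in question)
```

A drive formatted with these defaults lets recovery software skip the guessing step entirely.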

So, we recommend that you always stick to the default cluster size to be able to recover data easily should the need arise.

Monday, 17 May 2010

RAID system that just works

On Tom's hardware, ratbat asks for a RAID that just works (no matter what).

His requirements are fairly simple:
  1. Minimum maintenance,
  2. Minimum downtime in case the drive fails,
  3. Simplest possible recovery,
  4. The setup has to boot from the RAID.

The only match for these criteria is RAID1 (mirror).

As you know, the RAID levels are (exotics excluded) RAID0, 1, 5, and 0+1. Each RAID level has its own strengths and weaknesses, but these are too complex to fit into this post. For more on this, check the RAID levels reference.

RAID0 is eliminated from the contest because it is not fault tolerant.

RAID5 fails the "Simplest possible recovery" requirement. To have a bootable RAID5, a hardware controller is required. In case the controller dies, the recovery can get complicated, requiring RAID recovery software, which may be fairly tricky to operate properly.

So we are left with RAID1 (mirror) and RAID0+1.

RAID1 wins the contest because of its simplicity. If one of the drives dies, the mirror just continues to operate. If the controller dies, you can take either of the two identical drives, plug it into any compatible PC, and have the data immediately available. You can even plug the drive into the mainboard (bypassing the dead RAID controller) and boot from it. The only thing you lose by doing so is redundancy.

Now we have to choose between hardware and software implementations of the RAID1.
On Windows, software RAID1 is only available in the Server versions, so there's really not much choice: either get a hardware controller, or use a controller integrated into the motherboard (if available).

Make sure to test your RAID once you have created it and placed the OS on it.
  1. Unplug each drive in turn (make sure to power off before doing it) and check how the controller responds.
  2. Unplug each drive in turn and attach it to a non-RAID mainboard port. Check if the system boots from that drive.
  3. After each of these steps, a resynchronization of the array is needed, which will take some time.

Once you are done with that, install the software as usual.