Wednesday, 31 August 2011

Customers' requests revisited

The customer walks in and says something along the lines of: you should have more input options for your RAID Recovery app.

Unfortunately, it just does not work that way. As you add more options and combinations thereof, people start getting lost among them fast. Interestingly, I recall once considering fully automatic software that would detect all RAIDs attached to the system with a single click of a button. Something more along the lines of All your RAID are belong to us, which would eliminate the need both to specify the array type and to select a disk set. Just do all the probing and produce all possible RAIDs. Unfortunately, this did not work out for technical reasons.

Actually, even providing the correct RAID type may prove difficult if the array was created five years ago, the person who set up the system retired four years ago, and no one even noticed the RAID until it failed. So now you have four disks, some of which may or may not work; so, RAID 0, RAID 10, RAID 5, or RAID 6? OK, RAID 6 can typically be ruled out based on the controller model alone, but the rest may not be that easy.

Sunday, 14 August 2011

Exotics

A customer walks in and says: We need to recover a FATX volume, can you do that? - Sorry, no. Various exotic filesystems, btrfs, logfs, and even ReiserFs, have always been considered a job for a data recovery service, not for automated software. Software is cheap, but only resolves common cases by applying typical solutions. A data recovery service is expensive, and puts its high fees toward the difficult cases, e.g. writing custom software to deal with just one specific case.

Lately, there is an influx of requests for something nonstandard. The latest hit was
- We have ReiserFs on the RAID5.
- Okay, no problem.
Turns out there was a problem. The RAID5 was using 512 bytes per block. JMB 393 controller. Oops.

So far we have delayed parity (from HP SmartArray), Promise RAID 6 with its non-standard Reed-Solomon, Promise 1E interleaved layout, and exFAT filesystem recovery. The capability to recover RAID with a block size of 512 bytes is in the pipeline, currently undergoing testing.

So what's next? The spec for JMB 393 lists RAID 3 as a possible option; has anyone ever actually used that?

Thursday, 11 August 2011

"Best guess" parameters in RAID recovery

Every once in a while, we get a feature request for our RAID recovery software (http://www.freeraidrecovery.com/) to implement the ability to interrupt the analysis midway and get a list of possible solutions, sorted by confidence level.

We have strong reservations about this would-be feature.

Although it looks like a good idea, a very nice thing to have, it has some undesired consequences we cannot allow. The confidence thresholds are there for a reason, and we put effort into keeping them balanced between faster analysis (lower thresholds) and reliability (higher thresholds). Once we make incomplete solutions accessible, people will start using these solutions on real data. Sooner or later, someone is going to destroy their RAID 5 by reassembling it on the controller using the wrong parameter set. With RAID 0, this would do no harm (just reassemble again in the correct order), but with RAID 5, incorrect assembly (automatically followed by a rebuild) destroys the array beyond any practical repair. This is similar to consuming unbaked or half-baked food. Sooner or later someone will get poisoned. And determining whether a RAID configuration is correct is even more difficult than telling baked food from raw meat.

Monday, 1 August 2011

RAID increases failure rate

Surprising, isn't it? Actually, RAID does indeed increase the failure rate. Take MTBF: it decreases as you add more disks. Even with RAID 5, the mean time between disk failures decreases.

In fault-tolerant storage, the time between disk failures (MTBF) does not matter. What matters is the time between data loss events. This is called either mean time to data loss (MTTDL) or mean time between data losses (MTBDL).
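
To put rough numbers on the difference, here is a minimal sketch. It assumes independent disks with a constant failure rate, and it uses the common textbook approximation for RAID 5 (data is lost when a second disk fails while the first one is still being rebuilt). The function names, the disk MTBF, and the rebuild time are made up for illustration and are not measurements.

```python
def array_mtbf_hours(disk_mtbf_hours: float, disks: int) -> float:
    """Mean time to the first disk failure anywhere in the array.

    With independent disks and constant failure rates, the rates add up,
    so the array sees a disk failure `disks` times more often.
    """
    return disk_mtbf_hours / disks


def raid5_mttdl_hours(disk_mtbf_hours: float, disks: int,
                      rebuild_hours: float) -> float:
    """Approximate mean time to data loss for RAID 5.

    Textbook approximation: data is lost when one of the remaining
    (disks - 1) members fails before the rebuild completes.
    """
    return disk_mtbf_hours ** 2 / (disks * (disks - 1) * rebuild_hours)


if __name__ == "__main__":
    DISK_MTBF = 1_000_000.0  # hypothetical per-disk MTBF, hours
    REBUILD = 24.0           # hypothetical rebuild time, hours
    for n in (3, 4, 8):
        print(n,
              array_mtbf_hours(DISK_MTBF, n),
              raid5_mttdl_hours(DISK_MTBF, n, REBUILD))
```

Under these assumptions, adding disks makes disk failures more frequent, yet the time to data loss stays far longer than the time to the first disk failure, as long as rebuilds are short compared to the disk MTBF.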

You know you can set up a three-way RAID1 (three mirrored copies instead of two), i.e. the mirror can have more than two disks. So, let's imagine a RAID1 of an infinite number of disks. This unit will have an MTBF of zero, because at any given moment one of the infinite number of disks is failing. It will also be continuously rebuilding while still delivering infinite linear read speed. Still, this imaginary device will have zero probability of losing data because of disk failure, because the infinite number of disks cannot all fail at the same time.
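
The same thought experiment can be sketched with finite numbers. This assumes each disk fails independently with some probability within a given window (say, the time it takes to replace and resync a failed copy). The function name and the 1% figure are hypothetical.

```python
def all_copies_fail_probability(p_single: float, copies: int) -> float:
    """Probability that every copy in an n-way mirror fails in one window.

    Assumes independent failures, each with probability p_single.
    """
    return p_single ** copies


if __name__ == "__main__":
    P_SINGLE = 0.01  # hypothetical chance that one disk fails in the window
    for copies in (1, 2, 3, 5, 10):
        print(copies, all_copies_fail_probability(P_SINGLE, copies))
```

The probability of losing all copies at once shrinks geometrically as copies are added, even though the chance that at least one disk is failing at any moment keeps growing.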