Thursday, 29 December 2011

Stories


A computer with an Asus motherboard based on the ICH8R chipset had worked flawlessly for about 1.5 years. Its two 250 GB hard disks were in RAID1, divided into two partitions.
Two weeks ago I heard about a problem (a request to press a key during boot) which didn't seem to cause any further problems in operation.

When the system starts up, the message "Primary HDD not found. Press F1 to resume" is displayed. Pressing F1 led to the usual system start, with the message "New device was found ...". When I started to sort out what had happened, I found that the RAID controller had switched over to IDE mode. Disk Management displayed copies of the logical disks (originally there were C: and D:, but now E: and F: were added with the same sizes and labels). The newest file dates on the copies coincided with the date when the startup problems arose.

When I switched the controller back to RAID, the RAID1 appeared immediately, with the name I had specified in the RAID settings. The state of the RAID1 was shown as "Normal", which meant the controller didn't realize the disks were no longer synchronized.


That's why it is important to monitor an array's state and to periodically synchronize the disks in a mirror or recalculate parity in a RAID5. There are situations when a controller loses an array and the user doesn't notice, or just ignores it, as in "the RAID seems to work and the rest doesn't matter". However, the RAID loses its fault tolerance, and instead of RAID1 we get a single backup copy which is never updated again.

Actually, this is a good example for one of our RAID Tips here.

Monday, 26 December 2011

File system reliability doesn't depend on the load

The idea, put forward by some defragmenter vendors, that filesystem fragmentation degrades reliability is not true.

A filesystem handles all the fragments in the same way.

The first significant difference arises between contiguous files, which have only one fragment, and files consisting of two fragments. The next significant difference appears when the list of fragments becomes so large that the list itself has to be divided into several parts; on NTFS the mark is usually at about 100 fragments. Once it is verified that a filesystem properly handles one fragment, a list of fragments, and a list of lists, you can be sure that the filesystem handles any number of fragments.
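The three cases above can be sketched in a few lines. This is a minimal model, not actual NTFS internals: the names and the 100-fragment threshold are illustrative assumptions only.

```python
# A toy model of a fragment (extent) list that, past a threshold,
# is split into several records - a "list of lists". The threshold
# and names are assumptions for illustration, not NTFS internals.

FRAGMENTS_PER_RECORD = 100  # assumed point where one record no longer fits

def store_extents(extents):
    """Lay out a fragment list: one record if it fits, otherwise a
    list of records (one extra level of indirection)."""
    if len(extents) <= FRAGMENTS_PER_RECORD:
        return [extents]
    return [extents[i:i + FRAGMENTS_PER_RECORD]
            for i in range(0, len(extents), FRAGMENTS_PER_RECORD)]

def read_extents(layout):
    """The reading code walks every record the same way, so the
    fragment count changes the amount of work, not the logic."""
    result = []
    for record in layout:
        result.extend(record)
    return result

# The same code path handles 1 fragment, a short list, and a list of lists.
for n in (1, 2, 99, 100, 101, 10_000):
    extents = list(range(n))
    assert read_extents(store_extents(extents)) == extents
```

Once the round trip works for one record, for several records, and for the boundary around the threshold, nothing new happens at higher fragment counts.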

In the same way, a heavy load on the data storage hardware doesn't by itself decrease reliability. A high disk queue length doesn't lead to a disk failure. Certainly it indicates an overload, but if you wait long enough, all the requests will be processed. Strictly speaking, if the load is very heavy, some of the requests will eventually be cancelled due to timeouts; this doesn't affect the ability of the storage to keep working properly once the load decreases.

However, there are some factors which come along with an overload that can lead to failures and data loss. In hardware these are usually temperature and vibration: a disk system under load warms up and vibrates, and if the cooling is not good enough, this can lead to overheating and excessive wear.

In software, an overload makes race conditions and other bugs more likely to come up. In addition, there can be errors in programs which work alongside the filesystem driver although they are not part of it.

For example, when multi-core processors had not yet been invented and dual-CPU machines were expensive and rare, few antivirus programs were tested on multiprocessor systems. Once you got an antivirus filter driver onto a dual-CPU machine, blue screens were quite an ordinary thing.

However, filesystem errors are eventually found and fixed, and the drivers of any mature filesystem are reliable, having already been tested under almost any load you can imagine.

Saturday, 24 December 2011

On average, machine wins

If, in RAID recovery, the software says the block size is X while the customer says it is Y, X and Y being different, the software is correct at least nine times out of ten. Sometimes that annoys the customer, but there is little we can do about it.

Monday, 12 December 2011

Recovering confidential data

When dealing with data recovery, people sometimes worry about the confidentiality of the recovered data. Look at the example below:

...a Western Digital HD that makes clicking noises. .... The HD has many customer credit card numbers and legal documents on it, so confidentiality is very important to us.

If the automatic data recovery software works well enough, there is no problem: one recovers the data oneself, and the data never leaves one's own computer.

In all other cases, for example when a mechanical repair of the disk is required, a technician has full access to the disk and to whatever data he is able to recover.
Any respectable data recovery company will usually not refuse to sign either a non-disclosure agreement (NDA) or an agreement that the data will not be viewed at all during the disk repair.

A prohibition on reading the data makes recovery harder for two reasons:

  1. Quality control becomes more difficult, and often outright impossible. Some data recovery programs provide automatic integrity checks for certain recovered file types (e.g. ZAR does). In addition, many data recovery companies use custom-made programs for this purpose. However, not all file types can be checked without reading them.

  2. If the recovery fails on the first try, setting up the program for a second attempt becomes more complex.


On top of that, data recovery companies might worry that the recovered data itself is potentially illegal.

Thursday, 8 December 2011

Ah sh!t!... erm... press on.

If you are reconfiguring a RAID, or any other storage system, and something unexpected happens, or something happens that you do not fully understand, stop.

Pressing on in this situation would likely make things worse. Pressing on for long enough will eventually make things irreversibly bad.

This thread on Tom's presents a good example. When reading it, keep in mind two things:

  • Even if you do not initialize the RAID, each time a RAID 10 is reassembled and resynced with a different order of disks, there is a 50:50 chance of total data loss

  • Repairing a filesystem or troubleshooting the boot process without first fixing the underlying RAID is certainly useless and often damages the data.
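The 50:50 figure for a mirror can be illustrated with a toy model, under one assumption: after a blind reassembly, the controller resyncs a mirror by copying the disk in the first slot over the disk in the second. Real firmware varies, but if the slot order is effectively random, half the resyncs copy the stale disk over the good one.

```python
import random

# Toy model: a two-disk mirror where resync copies slot 0 over slot 1
# (an assumption about the controller, not any specific firmware).
# With random slot order, about half the resyncs destroy the good copy.

def resync(pair):
    source = pair[0]                # assumed: the first slot is the source
    return [source, source]

random.seed(0)
trials = 10_000
losses = 0
for _ in range(trials):
    pair = ["good", "stale"]
    random.shuffle(pair)            # disk order after a blind reassembly
    if resync(pair) == ["stale", "stale"]:
        losses += 1                 # the good copy was overwritten

print(losses / trials)              # close to 0.5
```

On a RAID 10 the stripe side adds more ways to get the order wrong, so repeated blind reassembly only compounds the risk.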

Tuesday, 6 December 2011

XFS coming soon

Probably as soon as tomorrow. The only thing still missing is a comparison test against typical failure modes: file deletion, format, and bad blocks.