Tuesday, 31 December 2013

Finding a disk

Q: There are 6 SATA disks in the server. Based on SMART data, one of them (in the fourth bay) is going to fail. Does anyone know how the bays are numbered? Identification by “blinking the bay's LED” does not work.

A: Write down the serial number of the offending disk in the controller configuration tool, turn off the server, pull all the disks out in turn, and find the disk based on the serial number. Then assemble everything back and turn on the server. If everything is OK, turn off the server once again and replace the disk.

Tuesday, 24 December 2013

Transferring disks along with a controller between servers

Q: A motherboard in a server burned out. There was a RAID5 of 5 disks on an LSI MegaRAID controller. There is reason to assume that the controller and the array were in working order. If I transfer the controller and drives to another server (different model/motherboard), will the array work?

A: A hardware controller with its disks is a self-contained unit, so if you install the required drivers on the other system, the array will work. Just in case, do not swap the drives between controller ports, because in some cases the port order matters.

Friday, 13 December 2013

If wishes were horses...

RAID controllers should have a simplified configuration option which should ask just a single question - how much does it cost to re-create your data from scratch? - and then proceed accordingly. This should prevent people from storing useful data, or worse yet, something which is not possible to re-create, on RAID0s.

Monday, 2 December 2013

Hardware and Storage Spaces

Sadly, the predictions about people using sub-par hardware to build enormous Storage Spaces configs are gradually coming true. Not that the hardware is bad or faulty per se. It is just not up to the task. Any large storage configuration, except maybe some network-based distributed systems (which are designed to be slow, by the way), requires stable hardware. Even a drive failure rate of one failure per year per drive would still be acceptable. On a disk set of ten or more drives, one failure per drive-year is an annoyance, but one can still expect the system to be able to cope. However, in what we are looking at now, with USB configurations of forty drives or larger, the failure rates are closer to one failure per drive per week. This results in systems where the life expectancy is comparable to the time needed to populate the system with the initial data set. Once the data is copied over, the original copy deleted, and the original drives reused, whatever lifetime is left in the Storage Space runs out and it goes down.

If you want storage larger than your typical desktop, say, 10 TB or more, there is still no way around expensive hardware. A large pile of USB external drives from the corner store just does not last long enough.
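To put those failure rates in perspective, here is a minimal back-of-the-envelope sketch. It assumes failures are independent and evenly spread over time, which is a simplification; the function name and the example pool sizes (ten and forty drives, the lower bounds mentioned above) are illustrative only.

```python
def days_between_failures(drive_count, failures_per_drive_per_year):
    """Average number of days between failures anywhere in the pool.

    Assumes failures are independent and evenly spread over the year
    (a simplification for illustration only).
    """
    failures_per_year = drive_count * failures_per_drive_per_year
    return 365.0 / failures_per_year


# Ten drives at one failure per drive-year.
stable_pool = days_between_failures(10, 1)

# Forty USB drives at roughly one failure per drive per week (52 per year).
usb_pool = days_between_failures(40, 52)

print(f"10 drives, 1 failure per drive-year: one failure every {stable_pool:.1f} days")
print(f"40 drives, 1 failure per drive-week: one failure every {usb_pool * 24:.1f} hours")
```

Under these assumptions the first pool sees a failure roughly once a month, while the USB pool sees one every few hours, which is shorter than the time it takes to copy the initial data set.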

Tuesday, 1 October 2013

Current updates

Nothing much to report at the moment, due to generally slow season, work-wise.

Work on the more professional-oriented edition of ReclaiMe software is still progressing as scheduled, more-or-less that is.

Got mentioned today by Jera of Work n Play Custom PC (http://work-n-playcustompc.weebly.com).

Got a test setup of Storage Spaces R2 (or whatever it is), but have yet to investigate it (nothing much has been done except for the initial setup).


 

Wednesday, 4 September 2013

Half-baked ideas

Recently, while surfing the net, I came across a site where people propose various ideas.
First I was interested in this:
The point is to build an SSD out of USB thumb drives. We decided to calculate the economics of this project. It turned out that it is much easier and cheaper to buy a ready-made SSD than to try to build one from a heap of USB thumb drives.
With thumb drives, you get the lowest price per gigabyte if you use 7x 64GB USB thumb drives and one 7-port USB 3.0 hub (the cheapest hub at Newegg). The price of such 448 GB storage is 79 cents per gigabyte, while a 480 GB SSD costs 67 cents per gigabyte.
Next I encountered this idea:
The idea is to combine old computers into a network and create iSCSI RAM disks. It looks really interesting save for one problem.
The problem is that the electric bill very quickly exceeds the price of an SSD. Such a system with a capacity of, say, 500 GB will consume electricity at a rate of about $400/month. Again, in just one month this becomes more expensive than a ready-made SSD.
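The comparison can be checked with a few lines of arithmetic. This is only a sketch built from the per-gigabyte figures quoted above; it assumes the 79 cents per gigabyte already includes the hub, and the function and variable names are made up for the example.

```python
def total_cost_cents(capacity_gb, cents_per_gb):
    """Total price in cents for a given capacity and price per gigabyte."""
    return capacity_gb * cents_per_gb


# 7x 64 GB thumb drives (assumed: the hub is included in the 79 cents/GB figure).
thumb_drive_array = total_cost_cents(7 * 64, 79)

# Ready-made 480 GB SSD at 67 cents/GB.
ready_made_ssd = total_cost_cents(480, 67)

# One month of electricity for the iSCSI RAM disk cluster, about $400.
monthly_power_bill = 400 * 100

for label, cents in [
    ("Thumb drive array", thumb_drive_array),
    ("Ready-made SSD", ready_made_ssd),
    ("RAM disk cluster, one month of power", monthly_power_bill),
]:
    print(f"{label}: ${cents / 100:.2f}")
```

Working in whole cents avoids floating-point rounding in the totals.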

Monday, 19 August 2013

Bad sectors vs. controllers

If you have a drive which has a mild case of bad-sector-itis, and you want to make an image of it, or clone it, make sure you connect the drive to an Intel ICH-series controller. No nVidia, no Marvell, no VIA or whatever else it might be. Where other SATA controllers lock up or cause a reboot, Intel ICH is still reading, reading, reading, slow but fierce.

Monday, 17 June 2013

Why there is no technical support by phone in data recovery

Once in a while people demand that we provide technical support not only by e-mails, but by phone as well. Despite the apparent reasonableness of the request, it is not that simple in practice.

A typical case of technical data recovery support via email looks like the following:
  1.  A client contacts our technical support and describes his or her problem for the first time.
  2.  We send a reply where we ask the questions to clarify obscure points. Usually we have three to five questions (although there may be more) which are organized as a numbered list.
  3. In the following email, the client answers the questions. There are cases when the client needs time to think or do some simple checks to answer the questions.
  4. In response to the message from a client, we send a list of recommended actions or step-by-step instructions as to what he or she should do next. There are cases when boilerplate instructions fit just right; however, almost half of all cases require pondering on our side to adapt typical data recovery instructions to a particular case or to create new ones. It takes about ten minutes or so unless we need to do some long tests.
As you can see, the scheme above doesn't fit into the time frame of a typical phone call. Additionally, email support gives a written plan of actions and a well-structured trace of activity one can consult any time. It's just something that phone support lacks. The conclusion is that it is just not practical to provide any serious support in data recovery (and perhaps in any other technical industry) by phone. Let's stick to emails.

Sunday, 26 May 2013

Intel RAID controller RS2MB044 + Intel RES2SV240 expander = doesn't work

The following setup doesn't work:

  • Intel RAID controller RS2MB044,
  • Intel RES2SV240 expander,
  • 16 WD30EFRX (WD Red) hard drives.
The symptoms are similar to those described here. In our experience, the slowdown appears with both RAID6 and RAID5. Perhaps there is some incompatibility between the controller, the expander, and the drives, in any of the possible combinations.

Monday, 15 April 2013

Quicker fixes are not a good idea (in most cases)

Every once in a while someone comes up with a forum question along the lines of

I have such and such problem with my storage. The data is still accessible, but the storage unit is for some reason in abnormal state. What should I do?

The typical (and proper) answer is
  1. back up data,
  2. test the backup,
  3. make sure the original problem, which caused whatever abnormality there was, is corrected,
  4. rebuild the storage from scratch,
  5. restore data
Often, this is not considered good enough advice: Is there any quicker way to resolve the issue?

Actually, no. To make sure the abnormal state is properly repaired, one needs to identify all the undesired effects and changes to the data. This is plain impossible in all but the simplest cases. So, there is always a risk of missing some important point during a "quick" repair, masking the problem instead of repairing it. The issue might then reappear later in some undesired way.
 

Wednesday, 3 April 2013

Intel RS2MB044 RAID controller

If you have an Intel RS2MB044 controller and get the message

Controller ID: 0 Controller encountered a fatal error and was reset

Flash the latest firmware.

Next thing,

If flashing with the latest firmware fails stating that

Firmware Failed to FLASH flash. Stop!!!
FW error description:
The requested command cannot be completed as the image is corrupted.

Then, start by flashing older firmware versions first. The update from a very old firmware version to the latest one may require several intermediate versions to be flashed before the controller will accept the latest one.

Mk. II testbed storage system

Finally completed the build of the Mark II testbed storage array today, to replace the aging QNAP TS-639 Pro unit.

Intel RS2MB044 controller.
4x OCZ Solid 3 120 GB SSD, directly connected to the controller.
Intel RES2SV240 expander, connected by SFF-8088 to SFF-8087 cable.
16x 3TB WD Red hard drives connected to the expander.

The hard drives are configured as 14x 3TB RAID 6 and 2x hot spare, for a total of 36 decimal TB unformatted capacity.
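The capacity figure follows from the usual RAID 6 rule that two drives' worth of space is used for parity. A minimal sketch of that arithmetic, with a function name invented for the example:

```python
def raid6_capacity_tb(array_drives, drive_size_tb):
    """Unformatted RAID 6 capacity: two drives' worth of space goes to parity."""
    if array_drives <= 2:
        raise ValueError("RAID 6 needs more than two drives")
    return (array_drives - 2) * drive_size_tb


total_drives = 16
hot_spares = 2

# 14 drives in the array, 3 TB (decimal) each.
capacity = raid6_capacity_tb(total_drives - hot_spares, 3)
print(f"{capacity} TB unformatted")
```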

SSD cache to be configured later.

Wednesday, 27 March 2013

Mirrors vs. automatic backups

There was an incident recently which involved the near-loss of an important git repository. The incident involved a distributed system of multiple servers, one being the designated master and the rest slaves. Slaves pulled data automatically, and that was apparently done quite often. You can guess what happened next: the master copy got corrupted, and before anyone knew it, all the slaves had pulled the corrupt copy.

In the follow-up (here) they state that they have a backup system in place which is fundamentally different from a RAID 1. That is obviously not so.

Any system with automatic replication is subject to the following failure mode: the master copy is damaged, and the damage is then automatically replicated to slaves (mirrors). Automatic replication systems are designed around the assumption that all master failures are fail-stop: either the master fails mechanically and ceases to perform completely, or the master can detect any and all cases of corruption in itself and shut itself down. The "grey area" cases, when the data is damaged but the master still works and pushes out that damaged data, are not accounted for.

In most of these systems (except the extreme case of fully duplicated hardware) there is a time window when the synchronization can still be aborted if the corruption is detected in time. In a hardware RAID 1 with rotational hard drives this window is about 10 to 50 milliseconds long. In a weekly backup system, the average time window is half a week. Version-retention systems (which retain multiple previous versions of data) have a longer window of opportunity for recovery. However, an infinite window of opportunity requires infinite storage space.
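The half-a-week figure, and the effect of keeping more versions, can be sketched as follows. This is a simplified model and not a description of any particular backup product: it assumes corruption strikes at a uniformly random moment between two scheduled copies, and that the oldest retained version is dropped every time a new one is made. The function name is invented for the example.

```python
def average_window_days(interval_days, retained_versions=1):
    """Average time between corruption and loss of the last good copy.

    Model assumptions (illustrative only): corruption happens at a
    uniformly random moment between two scheduled copies, and the oldest
    retained version is discarded whenever a new copy is made.
    """
    return (retained_versions - 0.5) * interval_days


# Weekly backup that overwrites a single copy: half a week on average.
print(average_window_days(7))

# Weekly backup keeping the four most recent versions.
print(average_window_days(7, 4))
```

In this model the window grows linearly with the number of retained versions, and so does the storage needed to hold them.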

This is why manual backups should still be considered a valuable addition to automatic strategies. You look at the system and it seems good enough, so you make a copy of it somewhere out of the automatics' reach. Might come in handy one day.

Thursday, 28 February 2013

Why there will be no saved-state feature in ReclaiMe software


On average, ReclaiMe File Recovery delivers data in less than 48 hours. With NTFS you typically see the files within a couple of hours; with HFS or ReFS you have to wait till the end of the scan, which takes about 12 hours for a 2 TB disk (SATA connection). ReclaiMe RAID Recovery scans drives in parallel, so you are limited by the slowest and largest drive: with a 4 TB drive in the set, expect something like 24 hours. So, in 48 hours you definitely get data with ReclaiMe software. If recovery takes more than 48 hours, it is possible that there is a failed drive (with bad sectors) and you should consider imaging the drive in question first.
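The scan-time estimate above can be reproduced with a small sketch. It uses only the quoted figure of about 12 hours per 2 TB and assumes all drives read at the same speed, so the largest drive determines the total; the names are hypothetical and this is not how ReclaiMe itself computes anything.

```python
# From the figure above: about 12 hours for a 2 TB disk over SATA.
HOURS_PER_TB = 12 / 2


def parallel_scan_hours(drive_sizes_tb, hours_per_tb=HOURS_PER_TB):
    """Drives are scanned in parallel, so the largest drive sets the time.

    Assumes every drive reads at the same speed (illustrative only).
    """
    return max(drive_sizes_tb) * hours_per_tb


# A RAID set that includes one 4 TB drive.
print(parallel_scan_hours([2, 3, 4]))
```

A drive with bad sectors reads far slower than this model assumes, which is why a scan running well past the estimate is a hint to image that drive first.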

If the PC running data recovery cannot work for 48 hours non-stop, you should repair hardware first and only then proceed with data recovery. Anyway, with such a PC there is a great chance that data recovery will produce either incorrect data or, even worse, something that seems to be normal files with just "slightly" damaged content, due to subtle memory corruption or a similar problem.

Tuesday, 26 February 2013

PC freezes or suddenly reboots? Hardware, rather than software problem.

In most cases, when there are hardware issues on the client side, the most difficult support task is to persuade a client to look for hardware problems.

Modern Windows operating systems are quite stable, at least when standard hardware is involved, and for the past five years there have been certain events that just do not occur on a hardware-"healthy" PC. These include freezes, when the mouse stops moving on the screen, and unexpected reboots, when a PC suddenly restarts without error messages. If you encounter these sorts of issues, there is almost certainly a hardware problem and you should look for it.

Tuesday, 12 February 2013

SSDs, TRIM, and Storage Spaces

If you delete a Storage Spaces pool on TRIM-capable SSDs, the entire pool gets trimmed, as in poof!, gone.

Tuesday, 29 January 2013

Time spent on data recovery

No one actually tells you how long it takes.
  • Trivial small-size case takes two hours. Let's say it is flash-based media (never a hard drive), less than 200GB.
  • Trivial medium size takes two days. That is, any type of drive with no RAID.
  • Trivial big case takes a week or more to recover. This means RAIDs upwards of 4TB.
The above estimates account for DIY operation, the time needed to hurriedly source whatever parts may be needed, assemble the setup as required, and do the recovery and copying.
However, the estimates do not account for possible mistakes and restarts in the recovery. In a trial-and-error process,
  •  every error-and-restart cycle doubles processing time.
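Taking the doubling rule literally (which is an assumption about what it means: each error-and-restart cycle doubles the total so far), the estimates above can be extended like this. The base times come from the list; the function name is invented for the example.

```python
# Base estimates from the list above, in hours.
BASE_HOURS = {
    "small": 2,         # flash media under 200 GB
    "medium": 2 * 24,   # any single drive, no RAID
    "big": 7 * 24,      # RAIDs upwards of 4 TB, "a week or more"
}


def recovery_hours(base_hours, restarts):
    """Total time if every error-and-restart cycle doubles the processing time.

    Assumes the doubling compounds with each restart (a literal reading
    of the rule of thumb, for illustration only).
    """
    return base_hours * 2 ** restarts


for case, hours in BASE_HOURS.items():
    print(case, [recovery_hours(hours, n) for n in range(4)])
```

Under this reading, two mistakes turn a two-day medium case into more than a week.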
 

Wednesday, 2 January 2013

@DEVOPS_BORAT

Been reading DevOps Borat today. Some infinite wisdom in there

Only 10% in devops are know how of work with Big Data. Only 1% are realize they are need 2 Big Data for fault tolerance.

Worst failure in devops are happen in high availability architectures (also know as Paradox of RAID Controller Failure)

In startup we are allow all developer use 20% of time for write code with not TDD so they can able of actual get shit done