QNAP revisited
For reference, we were using a QNAP TS-639 Pro with six WD20EADS disks.
After about half a year of use, the web interface became slow: we had to wait several minutes just to get the list of disks. Gradually, we noticed that the array performance had also decreased significantly.
The obvious assumption was that one of the member disks had been dropped from the array and the RAID 5 was running in degraded mode. However, all the member disks were marked as GOOD in the web interface.
It was suspicious that the LEDs on the disk bays, which indicate disk state and activity, were blinking unevenly. For RAID 5 one would expect a nearly symmetric load across the disks, but one disk's LED was blinking much more frequently than the others.
Once we ran the bad block check on this disk through the web interface, the disk dropped from the array in less than half an hour and its bay LED turned red. At that point, the web interface began working properly again. The disk pulled from the array would not stay online for more than a couple of minutes when connected to a Windows PC.
RAID 5 is designed with redundancy to improve reliability, but a single disk failure destroys that redundancy. If a second disk fails before the first failed disk is replaced, data is lost. In our case, a disk had failed, yet all the obvious diagnostic tools reported that the RAID was OK, so two weeks passed before we realized the disk had to be replaced.
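The reason a single failure is survivable is that RAID 5 stores the XOR of the data blocks in each stripe as parity. Here is a toy Python sketch (with made-up 8-byte blocks, not the NAS's actual on-disk layout) showing why one lost disk is recoverable but two are not:

from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# One stripe across five data disks plus one parity disk (six-disk RAID 5).
data = [bytes([i] * 8) for i in range(5)]   # toy 8-byte blocks
parity = xor_blocks(data)

# Disk 2 fails: its block is rebuilt from the survivors plus parity.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[2]

# If a second disk fails before the rebuild completes, there are two
# unknowns in a single XOR equation -- that data cannot be reconstructed.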
Hot-swapping did not work properly either. When we inserted the new disk, we expected the array to detect it and start the rebuild. Instead, the web interface hung once again, showing strange messages like "Disks 2, 3, 4, 5, 6 are not present", although read-write operations kept working fine the whole time. A reboot was required before the unit detected the new disk and started the rebuild.
If you happen to have one of these QNAP units, note that the LEDs on the device indicate the state of the array correctly. If there is a discrepancy between the LEDs and what the web interface tells you, trust the LEDs.
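Another way to cross-check the web interface: the QNAP firmware is Linux-based and assembles the array with Linux software RAID (md), so if you can get a shell on the unit over SSH, /proc/mdstat is a more direct source of truth. A minimal sketch, assuming shell access and the standard mdstat format, where an "_" in the member map (e.g. [UUUU_U]) marks a failed or missing disk:

import re

def degraded_arrays(mdstat_path="/proc/mdstat"):
    """Return the names of md arrays with a failed or missing member.

    Must run on the NAS itself; assumes the standard Linux mdstat
    layout, e.g. "md0 : active raid5 ... [6/5] [UUUU_U]".
    """
    with open(mdstat_path) as f:
        text = f.read()
    return [name
            for name, members in re.findall(r"^(md\d+)\s*:.*?\[([U_]+)\]",
                                            text, re.MULTILINE | re.DOTALL)
            if "_" in members]

if __name__ == "__main__":
    for md in degraded_arrays():
        print(f"{md} is degraded -- replace the failed disk")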
Yup, I found exactly the same behaviour, but obviously on a smaller scale, on my Qnap 209.
It could be that low-level disk corruption is causing the problem. If they did some low-level disk repair, like Spinrite, their systems would probably run better.