Wednesday, 27 March 2013

Mirrors vs. automatic backups

There was an incident recently that involved the near-loss of an important git repository. The setup was a distributed system of multiple servers, one being the designated master and the others slaves. The slaves pulled data automatically, and apparently quite often. You can guess what happened next: the master copy got corrupted, and before anyone noticed, all the slaves had pulled the corrupt copy.

In the follow-up (here) they state that they have a backup system in place which is different in principle from RAID 1. That is obviously not so.

Any system with automatic replication is subject to the following failure mode - the master copy is damaged, and the damage is then automatically replicated to the slaves (mirrors). Automatic replication systems are designed around the assumption that all master failures are fail-stop - the master either fails mechanically and ceases to function completely, or it can detect any and all cases of corruption in itself and shut itself down. The "grey area" cases, where the data is damaged but the master still works and pushes out that damaged data, are not accounted for.
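
The difference between the two failure modes can be shown with a toy model. This is only a sketch: the Master and Mirror classes, the MasterDown exception and the run() helper are made up for illustration and do not describe how any real replication system works.

```python
class MasterDown(Exception):
    """Raised when the master has failed in a fail-stop manner."""


class Master:
    def __init__(self, data):
        self.data = data
        self.alive = True

    def read(self):
        if not self.alive:
            raise MasterDown("master is not responding")
        # The data is handed out as-is; the master cannot tell if it is damaged.
        return self.data


class Mirror:
    def __init__(self):
        self.data = None

    def sync(self, master):
        try:
            self.data = master.read()
        except MasterDown:
            pass  # fail-stop case: keep the last copy we already have


def run(failure):
    """Sync once while healthy, inject a failure, sync again, return the mirror's copy."""
    master = Master(b"good data")
    mirror = Mirror()
    mirror.sync(master)
    if failure == "fail-stop":
        master.alive = False        # master stops answering
    elif failure == "silent":
        master.data = b"garbage"    # master keeps answering with damaged data
    mirror.sync(master)
    return mirror.data


print(run("fail-stop"))  # the mirror still holds the good copy
print(run("silent"))     # the mirror has replaced its good copy with the damage
```

In the fail-stop case the second sync does nothing and the mirror is still useful. In the silent case the mirror behaves exactly as designed and destroys its own good copy.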

In most of these systems (except the extreme case of fully duplicated hardware) there is a time window during which the synchronization can still be aborted if the corruption is detected in time. In a hardware RAID 1 with rotational hard drives this window is roughly 10 to 50 milliseconds long. In a weekly backup system, the average time window is half a week. Version-retention systems (which retain multiple previous versions of the data) have a longer window of opportunity for recovery. However, an infinite window of opportunity requires infinite storage space.
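
The trade-off between retention depth and the recovery window can be sketched the same way. Again, this is an illustration under simple assumptions: RetainingMirror and simulate() are hypothetical names, every pull stores a full snapshot, and the corruption is assumed to be recognizable once someone actually looks.

```python
from collections import deque


class RetainingMirror:
    """Mirror that keeps only the last `keep` snapshots pulled from the master."""

    def __init__(self, keep):
        # A bounded deque silently drops the oldest snapshot when it is full.
        self.snapshots = deque(maxlen=keep)

    def pull(self, data):
        self.snapshots.append(data)

    def newest_good(self, is_good):
        # Walk from newest to oldest and return the first snapshot that passes.
        for snap in reversed(self.snapshots):
            if is_good(snap):
                return snap
        return None


def simulate(keep, pulls_before_detection):
    """Return the recoverable snapshot, or None if the good one was already evicted."""
    mirror = RetainingMirror(keep)
    mirror.pull("good")                  # last pull before the master was damaged
    for _ in range(pulls_before_detection):
        mirror.pull("corrupt")           # the damage is replicated on every pull
    return mirror.newest_good(lambda snap: snap == "good")


print(simulate(keep=1, pulls_before_detection=1))  # plain mirror: nothing left to recover
print(simulate(keep=4, pulls_before_detection=3))  # detected inside the window
print(simulate(keep=4, pulls_before_detection=4))  # detected one pull too late
```

With `keep` snapshots retained, the damage has to be noticed within `keep - 1` pulls. Making that window longer means storing more snapshots, which is the storage cost mentioned above.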

This is why manual backups should still be considered a valuable addition to automatic strategies. You look at the system, it seems good enough, so you make a copy of it somewhere out of the automation's reach. It might come in handy one day.