Difference Between Fault Tolerance, High Availability, Disaster Recovery (2014)

notacoward · on Nov 16, 2019

When I was working in this area during relatively early days, the difference was sometimes expressed this way:

* Fault tolerance: near-infinite MTBF

* High availability: near-zero MTTR

With HA there is a blip. It might not be visible to an application because of retries, but it is visible outside of the HA system/component to some degree.
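To make the MTBF/MTTR framing concrete, here is a minimal sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The function name and the hour figures are assumptions chosen purely for illustration, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical "fault tolerant" profile: failures are extremely rare,
# but each one takes hours to repair.
fault_tolerant = availability(mtbf_hours=1_000_000, mttr_hours=4)

# Hypothetical "highly available" profile: failures are far more frequent,
# but recovery takes only a few seconds (0.004 h is about 14 s).
highly_available = availability(mtbf_hours=1_000, mttr_hours=0.004)

print(f"fault tolerant:   {fault_tolerant:.6f}")
print(f"highly available: {highly_available:.6f}")
```

Both profiles have the same MTTR/MTBF ratio and therefore the same availability figure, yet they behave differently: the second one produces a short blip a thousand times as often.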

Special bonus thought: as a guide to designing or implementing an HA/FT system, I always found it helpful to think in terms of what happens to system reliability as size increases. In a traditional system, system reliability goes down because of dependencies between nodes/components. In some systems this degradation is even worse than you'd think because it's tied to the number of connections - O(n^2) rather than O(n). In an HA system, system reliability should go up because of nodes being able to cover for each other.
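A rough sketch of that scaling argument, assuming independent failures and a single per-component success probability `p` (both are simplifying assumptions, and the function names are made up for illustration):

```python
def series_reliability(p: float, n: int) -> float:
    """Traditional system: all n nodes must work, so reliability is p^n."""
    return p ** n


def mesh_reliability(p_link: float, n: int) -> float:
    """Worse case: all n*(n-1)/2 pairwise connections must work."""
    links = n * (n - 1) // 2
    return p_link ** links


def redundant_reliability(p: float, n: int) -> float:
    """Idealized HA system: it works as long as at least one node works."""
    return 1 - (1 - p) ** n


for n in (2, 4, 8, 16):
    print(
        n,
        round(series_reliability(0.99, n), 4),
        round(mesh_reliability(0.99, n), 4),
        round(redundant_reliability(0.99, n), 10),
    )
```

The first two columns shrink as `n` grows (the mesh one much faster), while the last one climbs toward 1. A real HA cluster needs enough surviving capacity, not just one live node, so the last function is an optimistic bound rather than a model of any particular system.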

The key question was always: if X fails, what other part of the system can make up for (not just survive) it? If it's a whole node, what other node(s) can take its workload? If it's a disk, where is another copy of the data? If it's a network, how else can nodes communicate or at least synchronize? That last one was interesting because it led to things like serial lines or pinging through shared disks as a last-resort way to convey cluster state. Fun times.

jmts · on Nov 16, 2019

MTBF: Mean time between failures

MTTR: Mean time to repair

Arnavion · on Nov 16, 2019

I've only heard "mean time to recovery" in the context of software, though Wikipedia does imply "mean time to repair" is also valid for software.

BOOSTERHIDROGEN · on Nov 16, 2019

I’m thinking this is similar to cascade failure?

ndespres · on Nov 16, 2019

I like these analogies a lot. I regularly have to explain to my clients the difference between the backup system, the disaster recovery system, and the file server replication system. None of them is the same as any other, and each component has different levels of redundancy going down the stack (RAID, HA for the virtual machine, shared storage between hosts, etc.). It's no easy task to explain that while component X is redundant, it does not meet the definition of "highly available" or "backed up." So I appreciate any explanation like this article that attempts to simplify and illustrate any of these definitions.

vinay_ys · on Nov 16, 2019

In that plane analogy, assuming the plane's design load capacity requires all 4 engines to be fully functioning, a fully loaded plane suffering a failure of 1 out of 4 engines is dealing with a "degradation scenario".

It will have to jettison some amount of load (proportional to loss of one engine) to save the rest of the load or it risks losing the entire plane.

This is a fault-tolerant system with degradation possibilities.

If this degradation possibility isn't acceptable, then the plane's design load has to be reduced. The plane will carry only as much load as can be safely flown with the minimum number of surviving engines (say, 3 out of 4, or 2 out of 4, or even 1 out of 4).

When the plane is operating in this configuration with redundant online engines, there are interesting challenges with respect to engine efficiency. The 4 engines now have to be at their best efficiency when loaded to only 3/4, 1/2, or 1/4 of their capacity, because that is expected to be their normal operating condition. They should also be able to operate under a higher load (which may be inefficient and stressful) for as long as the plane needs to land safely.
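The per-engine loading in that configuration can be worked out with a small sketch. It assumes identical engines that share the load evenly, and a design load equal to the full capacity of the minimum number of surviving engines; the function names are hypothetical.

```python
from fractions import Fraction


def normal_load_fraction(total_engines: int, min_surviving: int) -> Fraction:
    """Per-engine load with all engines running, as a fraction of engine capacity."""
    return Fraction(min_surviving, total_engines)


def load_after_failures(total_engines: int, min_surviving: int, failed: int) -> Fraction:
    """Per-engine load once `failed` engines are lost and the rest pick up the slack."""
    surviving = total_engines - failed
    if surviving < min_surviving:
        raise ValueError("fewer engines survive than the design load requires")
    return Fraction(min_surviving, surviving)


for k in (3, 2, 1):
    print(f"sized for {k} of 4 engines: normal per-engine load = {normal_load_fraction(4, k)}")

# A plane sized for 2 of 4 engines that loses one engine runs the remaining three at 2/3.
print(load_after_failures(4, 2, 1))
```

The gap between the normal fraction and 1 is the headroom each engine has to absorb when its neighbors fail, which is also the range over which it spends most of its life running below capacity.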

Obviously, the plane that has to operate with spare engine capacity is more expensive. It is worth deploying such a solution only if the cargo being carried is more expensive even when adjusted by the probability of occurrence of this scenario.
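That cost argument is an expected-value comparison. A minimal sketch, assuming a single failure probability per trip and made-up cost figures (every name and number here is illustrative only):

```python
def spare_capacity_is_worth_it(
    extra_cost: float, loss_if_failure: float, failure_probability: float
) -> bool:
    """True when the expected loss avoided exceeds the cost of the spare capacity."""
    expected_loss_avoided = failure_probability * loss_if_failure
    return expected_loss_avoided > extra_cost


# Hypothetical: 1% failure chance, spare capacity costs 10,000 per trip.
print(spare_capacity_is_worth_it(10_000, 5_000_000, 0.01))  # expensive cargo
print(spare_capacity_is_worth_it(10_000, 500_000, 0.01))    # cheaper cargo
```

With these numbers the expected loss avoided is 50,000 in the first case and 5,000 in the second, so only the first justifies the spare capacity.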

These exact same trade-off scenarios exist for design of distributed software systems that have to tolerate machine component failures.