
> The ‘network partitions are not optional’ interpretation of CAP is forgetting about the fact that non-distributed-systems are a thing.

Not only that, a weird network partition is extremely rare and I have never seen one. In most cases there is only one network, and a server is either up and connected to "the" network or it is not connected at all. And I believe engineers have a tendency to over-engineer for this kind of downtime while not thinking about much more probable ones.

I have asked many people who say we need a minimum of three nodes and three pods for each service, and no one could answer why.



I have seen it happen even within a local dual-redundancy cluster, with machines connected by redundant ethernet links.

Some maintenance person disconnected power to the two servers, then reconnected the power assuming they would reliably recover. The admin instructions were to never do this - always boot up in sequence.

They booted up simultaneously, and the fully redundant network switches (Cisco) took too long to finish booting but started passing internet traffic anyway. After a timeout, each server assumed it was the master in a degraded cluster, so each proceeded to make divergent modifications to the DRBD-replicated storage and to serve requests.

I never found out why this happened despite the direct ethernet links between the servers, which they were supposed to use for synchronisation decisions, but it did.

Recovery required manually comparing changes in files and databases to decide whether to merge or discard.

This problem was avoidable but an adequate fix evidently wasn't in place (mea culpa, limited time and budget).

It did not help that Pacemaker+Corosync was used, before Kubernetes was popular, and Ubuntu Server shipped a very buggy alpha version of Corosync that corrupted itself and crashed often, despite upstream warning it was an unreliable version. I had to manually build a different version of those tools from Red Hat source, because it was too late to change distro. This is one of two reasons I don't recommend Ubuntu Server in professional deployments any more, even though I still use it for my own projects.

Three servers, or two servers and a third special something for arbitration, is a standard solution to this problem.

But it's only useful for a stateful distributed system, like a database or filesystem with some level of multi-master or automatic failover.

There's no need for three nodes, or any particular number, for stateless nodes like a web service whose shared state is all held in a database or filesystem on other nodes.

Technically you don't need three servers; it's enough to have a cheap component or low-cost tiny computer to arbitrate. Even sending commands to the network switches to disable ports works (if the switch doesn't behave too strangely, as the Cisco switches did above!), as do IPMI commands to the other server's BMC. Just about anything can be used, even a high-latency, offsite tiny VM, as it isn't needed while the main servers are synchronised.
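If it helps to picture the arbiter's job, here's a minimal sketch (plain Python, hypothetical names, not Pacemaker/Corosync or any real witness protocol): a third party grants a time-limited "you may be master" lease to at most one node, so two nodes that can't see each other can never both promote.

    # Hypothetical arbiter sketch: at most one node holds the promotion lease.
    import time

    class Arbiter:
        def __init__(self, lease_seconds=10):
            self.lease_seconds = lease_seconds
            self.holder = None
            self.expires = 0.0

        def request_promotion(self, node_id, now=None):
            now = time.monotonic() if now is None else now
            # Lease free or expired: hand it to the requester.
            if self.holder is None or now >= self.expires:
                self.holder = node_id
            # Only the current holder gets (or renews) permission.
            if self.holder == node_id:
                self.expires = now + self.lease_seconds
                return True
            return False

    arbiter = Arbiter()
    print(arbiter.request_promotion("node-a"))  # True: node-a may become master
    print(arbiter.request_promotion("node-b"))  # False: node-b stays standby

In a real cluster the losing node would also be fenced before the winner touches shared state, e.g. by cutting its switch port or powering it off over IPMI, which is exactly the kind of thing that would have prevented the DRBD divergence above.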


Minimum of three is generally for one of two reasons:

1) to ensure you can deliberately take one offline for maintenance and still have redundancy in case a single node goes offline

2) in some systems an odd number of nodes is needed to ensure no ties in leader elections or decision votes. Three is the smallest odd number that has any redundancy (see the quorum sketch below).
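To make the second point concrete, here's a rough illustration of majority quorum (a sketch, not any particular system's algorithm): with n voting nodes you need floor(n/2) + 1 votes, so two nodes tolerate no failures, three tolerate one, and four are no better than three.

    # Majority quorum: votes_needed = n // 2 + 1
    def failures_tolerated(n_nodes):
        votes_needed = n_nodes // 2 + 1
        return n_nodes - votes_needed

    for n in (1, 2, 3, 4, 5):
        print(f"{n} nodes -> survives {failures_tolerated(n)} failure(s)")
    # 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2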


1) I have deliberately put a node into maintenance mode many times by launching a second node and draining the first. It's not an issue at all with just one node. For redundancy, is there ever a case in practice where a single node could go down? Nodes are basically EC2 instances, and they can run for years without going down.

2) 1 is an odd number and could win the election ;)


EC2 instances can be rebooted at any time. The underlying hardware they are on fails from time to time and they get moved.

Running three nodes is a rule of thumb for minimally guaranteeing availability, not a hard rule.


> EC2 instances can be rebooted at any time

No, they can't be rebooted at any time. Where are you getting this information from?


https://repost.aws/knowledge-center/ec2-linux-degraded-hardw...

Note in particular:

> For instances that launched from an Amazon EC2 Auto Scaling group, the instance termination and replacement occur immediately
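These scheduled reboot/retirement events should also show up in the instance status, so you can check for them programmatically. A rough sketch using boto3 (region and credentials are assumptions you'd adjust):

    # Sketch: list scheduled maintenance/retirement events for EC2 instances.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region
    resp = ec2.describe_instance_status(IncludeAllInstances=True)
    for status in resp["InstanceStatuses"]:
        for event in status.get("Events", []):
            print(status["InstanceId"], event["Code"], event.get("NotBefore"))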


> Amazon EC2 also sends a notification in your email

Interesting. I had seen EC2 instances running for multiple years and never saw this, so I assumed it wasn't possible. I know for a fact that GCP has live migration, where the service may be degraded for a few seconds but you don't need to do anything, so I assumed AWS had something similar.



