
So, it's impossible to transfer a value from one machine to another if there's no network connection between them? How did this extremely trivial observation become a well-known theorem?


So, when it was originally phrased, the primary thing you would have learned about databases was that they enable ACID transactions. (And if you were a practitioner, you would have learned about the various isolation levels, and about dirty reads and dirty writes.)

But if you wanted to go from this level to implementation, typically you could get a prototype working, but it would be slow. When things are slow, the WD-40 of the programming world is to add a cache. And this is where you get the quip that there are only two hard problems in computing: cache invalidation and naming things. (The “and off-by-one errors” is a later addition.) The problem is that cache invalidation shows up as consistency bugs in your database. Someone does a rare combination of commits and rollbacks, and some cache doesn't get wiped quite as fast as it needs to, or is wiped over-optimistically, causing a pull from uncommitted data, and your isolation level has silently dropped to READ UNCOMMITTED.
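
To make that concrete, here is a minimal sketch of that second failure mode, a cache populated from uncommitted data, in Python. Every name here is hypothetical, not any particular database's API:

    # A read-through cache in front of a toy transactional store.
    class CachedStore:
        def __init__(self):
            self.committed = {}   # durable, committed state
            self.pending = {}     # uncommitted writes in the open transaction
            self.cache = {}       # the cache bolted on for speed

        def write(self, key, value):
            self.pending[key] = value
            # BUG: populating the cache from uncommitted data. If the
            # transaction rolls back, the cache holds a value that was
            # never committed.
            self.cache[key] = value

        def commit(self):
            self.committed.update(self.pending)
            self.pending.clear()

        def rollback(self):
            self.pending.clear()
            # The cache entries written above are never invalidated.

        def read(self, key):
            if key in self.cache:             # fast path
                return self.cache[key]
            return self.committed.get(key)

    store = CachedStore()
    store.write("balance", 100)
    store.rollback()
    print(store.read("balance"))  # 100: a dirty read, READ UNCOMMITTED in effect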

The CAP theorem was originally raised as a conjecture, something like, “once you shard the database, I don't think there's any way to solve these cache problems without one of the replicas just going silent for arbitrarily long pauses while it tries to at least partially synchronize its cache with the other shards.” Phrased that way, you can understand why it was a conjecture: it relies on nobody at MIT having a super clever way to deal with caches and cleverly routing the synchronization around the sharding.

BUT, you can make that statement for many different reasons, and this was not for Pedagogical Reasons; its point was rather Evangelism! The author was attempting to introduce the idea of Eventual Consistency, and to gain adoption by ditching all of the received wisdom about ACID transactions. The antagonism was deliberate: eventual consistency became the E in a rival acronym, BASE. And so the argument was that we could explore a new corner of the design space.

It was later that someone decided they could prove it by coming up with a universal subgraph: “whatever network you've got, it has to contain this: two nodes, fighting over one value, with the connection between them possibly passing through other nodes, but we can abstract all that away.” And then you have a proof, and then you have a bunch of people comparing the proof to the stated claims of various database vendors, and finding that over and over they claim to be both shardable with high availability among the shards, and to support ACID transactions that keep everything consistent. It turns out those statements are usually made assuming a happy path!
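
The core of that proof fits in a few lines. Here is a toy rendering in Python, with a boolean standing in for the partition and every name hypothetical:

    class Replica:
        def __init__(self, name):
            self.name = name
            self.value = 0              # both replicas start out agreeing

    def read_during_partition(choose_consistency):
        a, b = Replica("A"), Replica("B")
        partitioned = True              # the link between A and B is down

        a.value = 42                    # a client writes to A; B never hears of it

        # A second client now reads from B. Under partition there are
        # only two moves available:
        if partitioned and choose_consistency:
            return "error: B cannot confirm it has the latest value"  # give up A
        return b.value                  # answer anyway, with a stale 0: give up C

    print(read_during_partition(choose_consistency=True))   # CP: refuse to answer
    print(read_during_partition(choose_consistency=False))  # AP: possibly stale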

(You also get Paxos and Raft, “here is how to get consistency without the arbitrary latency of two-phase commit, via majority vote”; the Jepsen blog, “you said this had consistency level X, let's fuzz it and see if we can generate a counterexample”; and some interesting exceptions like Datomic saying “this one part is not scalable and it's a single point of failure, sacrificing P for the CAP theorem's sake, but in exchange we can simplify our C and A guarantees so that you can scale the reads of the system consistently.”)
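
The majority-vote intuition those protocols rely on reduces to quorum arithmetic: if writes acknowledge on W of N replicas and reads consult R of them, then R + W > N forces every read quorum to overlap every write quorum, so some replica in the read set has seen the latest committed write. A sketch of just that intuition, not of Paxos or Raft proper (which also handle leader election and log repair):

    N, W, R = 5, 3, 3           # 5 replicas; writes ack on 3, reads ask 3
    assert R + W > N            # the overlap condition

    # Replica state as (value, version). The last write reached only
    # W = 3 replicas; the other two still hold the old value.
    replicas = [("new", 2), ("new", 2), ("new", 2), ("old", 1), ("old", 1)]

    def quorum_read(reps, r):
        # The worst case is an r-replica read set that includes both
        # stale copies; because r + W > N, it must still contain at
        # least one replica that saw the latest write.
        worst_case = reps[-r:]
        return max(worst_case, key=lambda t: t[1])[0]   # highest version wins

    print(quorum_read(replicas, R))  # "new": the overlap guarantees it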


Because, sadly, a lot of system designers want to believe they have a way around it.



