The CAP theorem is great, though it does omit latency, which leads you to its logical extension, PACELC [1]:
If there is a network Partition you must choose Availability or Consistency, Else the tradeoff is Latency vs Consistency.
This offers practical intuition for certain design choices.
For example, Google Spanner [2] is a distributed DB that offers globally consistent reads and writes, but to achieve high performance (low latency) its nodes must synchronize with multiple extremely accurate reference clocks (GPS and atomic) and follow a complex two-phase commit protocol that ensures transactional linearizability using timestamp lower bounds and uncertainty intervals.
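To make the timestamp-uncertainty idea concrete, here's a minimal Python sketch of Spanner's commit-wait rule. The TrueTime-style clock, the 7 ms bound, and the toy store are all illustrative assumptions, not Spanner's actual API or figures:

```python
import time

EPSILON = 0.007  # assumed clock uncertainty; illustrative, not Spanner's figure

def tt_now():
    """Hypothetical TrueTime-style clock: returns an interval
    [earliest, latest] guaranteed to contain the true physical time."""
    t = time.time()
    return t - EPSILON, t + EPSILON

STORE = {}  # toy versioned store: key -> list of (timestamp, value)

def commit(writes):
    # Pick a commit timestamp no earlier than the latest possible "now".
    _, latest = tt_now()
    commit_ts = latest
    for key, value in writes.items():
        STORE.setdefault(key, []).append((commit_ts, value))
    # Commit-wait: block until commit_ts is guaranteed to be in the past
    # on every node's clock, so any transaction starting later gets a
    # larger timestamp. This wait is latency paid for consistency.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts
```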
As another example, your multicore CPU leverages a cache coherency protocol that faces remarkably similar tradeoffs. Perhaps others have made this connection before…it does feel like some sort of universal law of physics.
CAP doesn't omit latency. Availability essentially is latency.
PACELC only says that even during normal operation you still need to make a tradeoff between consistency and latency, whereas CAP only deals with the tradeoff during a partition event.
The A in CAP doesn't mean what people think it means, it has nothing to do with nodes being up, or crashes, or latency, or SLA, or any abnormal dysfunction.
Availability in CAP is purely a software decision; it's an algorithmic property. Roughly speaking, it is the decision to continue accepting requests during a partition, possibly sacrificing Consistency. If you refuse requests during a partition, you preserve Consistency and lose Availability. If you keep accepting requests during a partition, it's the opposite.
High latency is considered a partition in CAP. Any hardware error, any network trouble, any bug, any crash, any unreachable machine is never an Availability issue.
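A toy Python sketch of that algorithmic choice (the `Replica` class and the quorum check are hypothetical stand-ins, not any real system's API):

```python
import random

class Replica:
    """Toy replica whose link randomly drops, simulating a partition."""
    def send(self, key, value):
        return random.random() > 0.5  # ack iff the message got through

def handle_write(key, value, replicas, mode):
    acks = sum(1 for r in replicas if r.send(key, value))
    if acks > len(replicas) // 2:
        return "ok"  # quorum reached: both modes answer the same way
    # Quorum unreachable: the partition case. The branch below is the
    # "A" in CAP -- a pure software decision, nothing to do with crashes.
    if mode == "CP":
        return "error: no quorum"   # refuse -> keep C, sacrifice A
    return "ok, reconcile later"    # accept -> keep A, sacrifice C

print(handle_write("x", 1, [Replica() for _ in range(3)], "CP"))
```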
> If you refuse requests during a partition, you preserve Consistency and lose Availability. If you keep accepting requests during a partition, it's the opposite.
I agree
> High latency is considered a partition in CAP.
Can you support this claim? I don't think this is true unless you're specifically talking about high latency between nodes rather than latency between the sender and the system.
>high latency between nodes rather than latency between the sender and the system
Yes, that's what I meant. To clarify, latency between the sender and the system would still not be an availability issue, but it also wouldn't be a partition. It's just out of scope of the CAP theorem. The user might be on a bad mobile connection, but it doesn't impact the distributed system itself.
I think the CAP 12 Years Later paper [1] covers this aspect of what latency means versus a partition. It describes that at some point a designer has to decide whether nodes have waited long enough and should return a response (A), or whether they should keep waiting (potentially forever) before responding (C).
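That decision point is easy to sketch in Python. Assuming a hypothetical `replies` queue fed by replica responses (not any real client API):

```python
import queue, time

def read(replies, quorum, deadline_s):
    """Collect (timestamp, value) replica replies until a deadline."""
    got = []
    deadline = time.monotonic() + deadline_s
    while len(got) < quorum:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # The designer's choice from the paper: answer now with
            # whatever arrived (choosing A), or keep waiting / fail
            # (choosing C). This sketch chooses A.
            return max(got)[1] if got else None
        try:
            got.append(replies.get(timeout=remaining))
        except queue.Empty:
            pass
    return max(got)[1]  # quorum reached: return the freshest value seen
```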
> every request received by a non-failing node in the system must result in a response
I don't believe that encodes latency
Instead, here's consistency:
> any read operation that begins after a write operation completes must return that value, or the result of a later write operation
"after a write operation completes" feels like where latency kicks in? Because within that space you can play around with completing a write operation to get what you want.
It's impossible to distinguish between high latency and unavailability. You can model unavailability as infinite latency.
In a real system you can wait for a partition event to resolve itself before replying, which preserves consistency at the cost of latency (availability).
> In a real system you can wait for a partition event to resolve itself before replying, which preserves consistency at the cost of latency (availability).
This is true, but PACELC states that even if there is no partition you still have a consistency vs latency tradeoff (because the processes that guarantee consistency eat up latency in the form of network roundtrips).
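That roundtrip cost can be made concrete with standard Dynamo-style quorum arithmetic (N replicas, R read acks, W write acks; nothing PACELC-specific):

```python
# With N replicas, reads waiting for R acks and writes waiting for W
# acks are guaranteed to overlap (strong consistency) iff R + W > N.
# Waiting for more acks means waiting on slower nodes -- latency paid
# even when no partition exists, i.e. the "ELC" half of PACELC.
def is_strongly_consistent(n, r, w):
    return r + w > n

assert not is_strongly_consistent(3, 1, 1)  # fast, but reads can be stale
assert is_strongly_consistent(3, 2, 2)      # consistent, but you wait
```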
100%, I don't think what we're saying conflicts. There is _always_ a tradeoff between consistency and availability whether you're thinking of PAC or PACELC. PACELC just makes it explicit that this tradeoff exists whether you're partitioned or not.
> Perhaps others have made this connection before…it does feel like some sort of universal law of physics.
This has been on my mind too, and I can’t help but think the fundamental concept underpinning it all is causality.
Reading about “consistency as logical monotonicity” (CALM [0]) after diving into Spanner, I’m convinced there’s definitely something to databases beyond CAP.
I’m yet to find a simple and clear law or theorem that captures what you’re hinting at, but it feels like we must be getting close. This has been bouncing around my head for a few years, since I first wrote a toy CRDT DB.
It seems to show up anywhere we have more than one system with independent memory (a place where state can be held) that needs to maintain a shared representation of, or fact about, something.
Within one memory (akin to a reference frame in quantum physics), we can do anything we want. We can mutate state without any interaction or interference from the outside world. This also sounds like a pure function. By itself, this is academic, theoretical: it does not exist. If a tree falls in the woods…
So if we want to interact with any other systems, we then need to coordinate, and the question is how.
The issue and pattern seem to rhyme everywhere: CPUs, HDDs, SSDs, file systems, networks, UIs, people, teams, etc. The best possible collaboration seems to be the one that requires the least coordination. Money is a good example here: someone can string together a series of products from companies that know nothing about each other by coordinating with money as a means of exchange. Not to mention being able to buy complex technology with no idea of the supply chain behind it. I don’t have to coordinate with a mine to use a computer that contains something from said mine.
It sort of looks like distributed databases build a virtual memory on top of many smaller memories, which need to share some parts with each other to protect the system as a whole.
New information may need a few writes before it can be considered a new fact. I think this is an implementation detail, in that it’s irrelevant to the observer (who has no control over it).
This isn’t eventual consistency, which is perhaps the cruder form of this, where the implementation detail above is wide open for all to see. Instead, new information is available immediately; it’s just that your definition of “immediately” is different from the database’s.
It follows, then, that as an observer of the system you cannot violate causality in any way by learning information from the future while you are still in the past.
My understanding from Spanner is that when you ask for a read, you provide or are issued a timestamp which provides the causal information to determine what you are allowed to know.
The system can then maintain both a true and correct representation of what it knows, and an incomplete working memory of what it is trying to know next (the writes which need to be committed into multiple memories).
Memory being anything from ram, ssd, carrier pigeon, lines in sand, etc.
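To make the timestamped-read idea concrete, here’s a toy multi-version read in Python (the store layout and numbers are illustrative assumptions, not Spanner’s actual design):

```python
# Toy multi-version store: each key holds (timestamp, value) versions
# in timestamp order. A read at timestamp t may only see versions with
# ts <= t -- information "from the future" is simply never returned.
VERSIONS = {"x": [(1, "a"), (5, "b"), (9, "c")]}

def snapshot_read(key, read_ts):
    result = None
    for ts, value in VERSIONS.get(key, []):
        if ts > read_ts:
            break  # causally in our future; we're not allowed to know it
        result = value
    return result

assert snapshot_read("x", 4) == "a"  # the write at ts=5 hasn't "happened" yet
assert snapshot_read("x", 9) == "c"
```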
I think where this breaks most of our brains is that it’s a universe simulation.
And both time and causality are fundamental invariants of the stability of the system, but are leaky abstractions that we need to deal with.
In CALM this is abstracted into what is effectively entropy: if your program never moves backward in time/loses entropy, it’s CALM (I think). In earlier work, I think Lamport clocks and vector clocks were used; in Spanner it’s based on very precise measurements of our own time, where the maximum speed of information (i.e. the speed of light) is the greater of the smallest unit of time we can measure (the precision of the GPS clock) and the time it takes for new data to become available in the system.
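For the Lamport clock part, here’s a minimal Python sketch of the “never move backward in time” rule:

```python
class LamportClock:
    """Minimal Lamport clock: a counter that only moves forward, and
    merging on receive keeps causal order monotone."""
    def __init__(self):
        self.t = 0

    def tick(self):  # local event, or stamping an outgoing message
        self.t += 1
        return self.t

    def recv(self, msg_t):  # merge: jump past the sender's timestamp
        self.t = max(self.t, msg_t) + 1
        return self.t

a, b = LamportClock(), LamportClock()
m = a.tick()          # a sends a message stamped m
assert b.recv(m) > m  # b's clock lands causally after the send
```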
The other part where this differs from the real world is that the speed of information, the latency of a request, is different for reads and writes. That’s not true in the quantum world, where everything is a write (I think). Then consider that in our own reference frame we can do a huge amount of work while waiting for a DB read/write, something that would violate the speed of light if not in our virtualised world.
We cannot break causality in the world we live and breathe in, but we do it all the time in our simulated ones.
It “feels” to me like the uncertainty principle. Think of availability as an interval of time within which all nodes have to be connected. If you set A high enough, sure, you can have both C and P to your heart’s delight. As A shrinks, you lose the ability to have both and have to pick one. It’s something like C×P/A > n, where n is a constant within a system.
[1] https://en.m.wikipedia.org/wiki/PACELC_theorem
[2] https://static.googleusercontent.com/media/research.google.c...