This sounds like a report of a bug, but I believe this is not the actual story. It is more a report of a design tradeoff: the authors of those CP systems completely understand what happens, but were not happy to pay this performance price for reads. It is one thing to have a data store that is very limited in write performance but very fast when you need to read; it is another thing to have a data store where both writes and reads are very slow. However, once you read potentially stale data from nodes, many of the advantages of having a CP system are gone. IMHO reverting those systems to a default where reads are applied to the state machine like writes is the sanest thing to do, even if options that allow potentially stale reads are also useful in some contexts.
> This sounds like a report of a bug, but I believe this is not the actual story. It is more a report of a design tradeoff: the authors of those CP systems completely understand what happens, but were not happy to pay this performance price for reads
If the authors were aware of these issues then the documentation was dangerously misleading[1] and they should be docked points for that.
[1] As reported by aphyr; I haven't read through it all myself. I'm thinking primarily of the bit where "read from leader without going through log" is labeled as "consistent".
That's why I think this is a design decision in both cases:
In one of these products (etcd, if I remember correctly) there was a clear statement in the documentation about these semantics, and anyway, anyone who implements Raft knows that for reads to be consistent they need to go down the same path as writes. In the Raft paper you can find a whole section about this.
If you check the paper, the following points are clearly stated:
Leaders can't reply to read queries without doing additional checks, otherwise the reads are not linearizable.
For the reads to be linearizable, leaders must do the following two things.
1) A leader must commit a NOP at the start of its term, which is not a problem from a performance point of view. The problem is "2".
2) A leader needs to check that it is still the leader before every read, and this requires contacting a majority. That's the performance problem of linearizable reads: you pay a latency equal to that of the slowest reply among the N/2+1 acks you need.
However note that even linearizable reads don't require fsync() to be called, so they are still better than writes.
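To put a number on the cost in "2", here is a rough Python sketch of the latency a leader pays per linearizable read. It is only an illustration under stated assumptions: the function names are made up, the leader is assumed to count itself as one of the N/2+1 acks at zero cost, and the round-trip times are invented inputs rather than measurements.

```python
from typing import List


def majority(cluster_size: int) -> int:
    """Number of nodes that form a majority: N/2+1 with integer division."""
    return cluster_size // 2 + 1


def quorum_read_latency(follower_rtts_ms: List[float]) -> float:
    """Latency of the leadership check for one read (hypothetical helper).

    Assumption: the leader's own ack is free, so it only has to wait for
    the (majority - 1) fastest followers. The read is gated by the slowest
    reply among the acks it actually needs, not by the slowest node.
    """
    cluster_size = len(follower_rtts_ms) + 1  # followers plus the leader
    needed_from_followers = majority(cluster_size) - 1
    if needed_from_followers == 0:
        return 0.0  # single-node cluster: nothing to wait for
    fastest_first = sorted(follower_rtts_ms)
    return fastest_first[needed_from_followers - 1]
```

For a 5-node cluster with follower round trips of 1, 2, 50 and 80 ms, the leader needs two follower acks, so the check costs about as much as the second-fastest follower: one network round trip per read, but no disk sync.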
What's your opinion on exposing the option of stale vs. consistent read in the API? I can see cases where I'd be ok with a stale read while for others I'd like the most up-to-date value.
That makes a lot of sense; there are definitely use cases where reading a past value is viable, especially considering the big difference in performance between the two kinds of reads.
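Something like this toy Python sketch is what I have in mind for exposing the choice in the API. Everything here is hypothetical (`ToyLeader`, `NotLeaderError`, the `consistent` flag) and is not the API of etcd or any other real system. Followers are modeled as plain callables that return True if they still acknowledge this leader, and the NOP commit at the start of the term is left out for brevity.

```python
class NotLeaderError(Exception):
    """Raised when a consistent read cannot confirm leadership."""


class ToyLeader:
    def __init__(self, state, followers):
        self.state = dict(state)
        # Hypothetical stand-in for heartbeat RPCs: each callable returns
        # True if that follower still acknowledges this node as leader.
        self.followers = list(followers)

    def _still_leader(self):
        acks = 1  # the leader counts itself
        for ping in self.followers:
            if ping():
                acks += 1
        cluster_size = len(self.followers) + 1
        return acks >= cluster_size // 2 + 1

    def read(self, key, consistent=True):
        # Safe by default: confirm leadership with a majority first.
        # With consistent=False the local state is returned as-is,
        # which may be stale if this node was deposed.
        if consistent and not self._still_leader():
            raise NotLeaderError("could not reach a majority")
        return self.state.get(key)
```

The default is the slow, safe path, and the caller has to opt in to staleness explicitly with `leader.read("k", consistent=False)`, so nobody gets a stale value by accident.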