1. The consensus system you are talking about is Raft (https://raft.github.io/) and it's baked into Consul.
2. There are two ways to interpret parallel infrastructure. I will post my thoughts on both.
2a. Standing up a parallel Consul cluster. This is problematic because what people typically do with Consul is put a Consul agent on every server (or pod), which registers its services for discovery with the Consul servers. When you stand up a parallel Consul cluster, you also need to reconfigure and restart the Consul agents on every other server. They only mention Postgres in this blog post, but there can potentially be a LOT of other servers registered to the Consul cluster.
2b. Standing up everything it takes to run Gitlab in parallel and then diverting traffic. Honestly sounds great. The reason a team wouldn't do this is either a) not having infrastructure code that allows for one-click deployment of whatever it takes to get Gitlab running, or b) it's actually pretty expensive to do if you're not Google or Amazon. The blog post mentions 255 clients (and 5 Consul servers). That's a lot of servers to rebuild!
Now, I would love to hear from anyone else who uses Consul, because I have my own thoughts on how they decided to handle the issue. I will focus my attention entirely on the Consul and not the Postgres portion.
The blog post mentions two limitations: 1) Reloading the configuration of the running service, which worked fine and did not drop connections, but the certificate settings are not included in the reloadable settings for our version of Consul. 2) Simultaneous restarts of various services, which worked, but our tools wouldn't allow us to do that with ALL of the nodes at once.
We don't need to reload. We can run a rolling systemctl restart, which Ansible is perfect for. The nice thing here is that their stop-gap solution is to disable TLS verification. That means servers with TLS verification still ON in the meantime should be able to continue validating certs, while the other servers get a rolling restart that disables TLS verification one server at a time. To minimize downtime we would do every non-leader server in the cluster, then finally the leader, then every client, in serial. With 260 nodes to deal with it would be slow, but it shouldn't break Raft at any point; there is no reason for quorum to be broken. The gossip will still be communicated over TLS, just that some of the servers/clients wouldn't be validating the certs.
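For concreteness, the per-node stop-gap would amount to flipping the agent's verification flags and restarting. A hedged sketch of the relevant agent config keys (these are real options in Consul's flat, pre-1.12 config layout; the file paths are placeholders, not from the blog post):

```json
{
  "verify_incoming": false,
  "verify_outgoing": false,
  "verify_server_hostname": false,
  "ca_file": "/etc/consul.d/ca.pem",
  "cert_file": "/etc/consul.d/agent.pem",
  "key_file": "/etc/consul.d/agent-key.pem"
}
```

With the cert/key/CA files still configured, TLS can remain in use; only the validation of the (expired) certificates is switched off.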
Then, we would follow exactly the same process for rolling out valid certificates with TLS validation turned back on. One non-leader server at a time, then the leader, then every client.
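The restart ordering described above can be sketched as a tiny helper. This is purely illustrative of the ordering logic (every non-leader server, then the leader, then every client, one node at a time); the names are hypothetical and nothing here talks to a real Consul cluster:

```python
def rolling_restart_order(servers, leader, clients):
    """Return a serial restart order that shouldn't break Raft quorum:
    non-leader servers first, then the leader, then every client."""
    followers = [s for s in servers if s != leader]
    return followers + [leader] + list(clients)

# Hypothetical topology matching the blog post's numbers:
# 5 Consul servers, 255 clients.
order = rolling_restart_order(
    servers=["s1", "s2", "s3", "s4", "s5"],
    leader="s3",
    clients=[f"c{i}" for i in range(1, 256)],
)
assert order[:5] == ["s1", "s2", "s4", "s5", "s3"]  # leader is the last server
assert len(order) == 260
```

Restarting serially means only one node is ever down, so a 5-server cluster keeps 4 voting members throughout.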
I could be missing some critical piece here, and it looks like the Gitlab team did run a lab test before making their change in prod. It's easy to miss a possibility when under pressure, and also easy for an online commentator like myself to think they are so much smarter. They still managed to get out of the crisis with no downtime and congrats to the operators who pulled it off!
I think what you are missing is that validating servers/clients would not allow non-validating servers to rejoin the cluster (i.e.: servers/clients with validation enabled will validate both outgoing and incoming connections).
As I see it, by the time you restart the leader (and hence quorum switches to the non-validating portion of the cluster) all of your clients will suddenly fail (they are still validating, and there's no good server for them to connect to). Conversely, if you restarted the clients first they would all become unavailable before the quorum switch happened.
I was speaking more to 2b, and it shouldn’t be expensive because the other servers should spin down after you’ve migrated. You’re maybe paying for a day of overlap if that, and you’re paying for massively reduced risk.
Fair point on the consensus stuff. I was keying in on the Patroni system, but reading up more, it looks like that's less novel than it seemed on first read (the "framework" lines in the library README had me worried).
I had the exact same thought on the consul rolling restarts. I’ve done exactly this before. I’m assuming there was some other issue I’m not getting as to why that wouldn’t have worked for them.