1. The consensus system you are talking about is Raft (https://raft.github.io/) and it's baked into Consul.
2. There are two ways to interpret parallel infrastructure. I will post my thoughts on both.
2a. Standing up a parallel Consul cluster. This is problematic because what people typically do with Consul is put a Consul agent on every server (or pod), which registers its services for discovery with the Consul servers. When you stand up a parallel Consul cluster, you also need to reconfigure and restart the Consul agents on every other server. They only mention Postgres in this blog post, but there can potentially be a LOT of other servers registered to the Consul cluster.
2b. Standing up everything it takes to run Gitlab in parallel and then diverting traffic. Honestly sounds great. The reason a team wouldn't do this is either a) not having infrastructure code that allows for one-click deployment of whatever it takes to get Gitlab running, or b) it's actually pretty expensive to do if you're not Google or Amazon. The blog post mentions 255 clients (and 5 Consul servers). That's a lot of servers to rebuild!
Now, I would love to hear from anyone else who uses Consul, because I have my own thoughts on how they decided to handle the issue. I will focus my attention entirely on the Consul and not the Postgres portion.
The blog post mentions two limitations: 1) Reloading the configuration of the running service, which worked fine and did not drop connections, but the certificate settings are not included in the reloadable settings for our version of Consul. 2) Simultaneous restarts of various services, which worked, but our tools wouldn't allow us to do that with ALL of the nodes at once.
We don't need to reload. We can run a rolling systemctl restart, which Ansible is perfect for. The nice thing here is that their stop-gap solution is to disable TLS verification. That means servers with TLS verification still ON in the meantime should be able to continue validating certs, while the other servers get a rolling restart that disables TLS verification one server at a time. To minimize downtime we would do every non-leader server in the cluster, then finally the leader, then every client, in serial. With 260 nodes to deal with it would be slow, but it shouldn't break Raft at any point; there is no reason for quorum to be broken. The gossip will still be communicated over TLS, just that some of the servers/clients wouldn't be validating the certs.
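For concreteness, the per-node stop-gap would amount to flipping the agent's verification flags and restarting. A hedged sketch of the relevant agent config keys (these are real options in Consul's flat, pre-1.12 config layout; the file paths are placeholders, not from the blog post):

```json
{
  "verify_incoming": false,
  "verify_outgoing": false,
  "verify_server_hostname": false,
  "ca_file": "/etc/consul.d/ca.pem",
  "cert_file": "/etc/consul.d/agent.pem",
  "key_file": "/etc/consul.d/agent-key.pem"
}
```

With the cert/key/CA files still configured, TLS can remain in use; only the validation of the (expired) certificates is switched off.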
Then, we would follow exactly the same process for rolling out valid certificates with TLS validation turned back on. One non-leader server at a time, then the leader, then every client.
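The restart ordering described above can be sketched as a tiny helper. This is purely illustrative of the ordering logic (every non-leader server, then the leader, then every client, one node at a time); the names are hypothetical and nothing here talks to a real Consul cluster:

```python
def rolling_restart_order(servers, leader, clients):
    """Return a serial restart order that shouldn't break Raft quorum:
    non-leader servers first, then the leader, then every client."""
    followers = [s for s in servers if s != leader]
    return followers + [leader] + list(clients)

# Hypothetical topology matching the blog post's numbers:
# 5 Consul servers, 255 clients.
order = rolling_restart_order(
    servers=["s1", "s2", "s3", "s4", "s5"],
    leader="s3",
    clients=[f"c{i}" for i in range(1, 256)],
)
assert order[:5] == ["s1", "s2", "s4", "s5", "s3"]  # leader is the last server
assert len(order) == 260
```

Restarting serially means only one node is ever down, so a 5-server cluster keeps 4 voting members throughout.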
I could be missing some critical piece here, and it looks like the Gitlab team did run a lab test before making their change in prod. It's easy to miss a possibility when under pressure, and also easy for an online commentator like myself to think they are so much smarter. They still managed to get out of the crisis with no downtime and congrats to the operators who pulled it off!
I think what you are missing is that validating servers/clients would not allow non-validating servers to rejoin the cluster (i.e.: servers/clients with validation enabled will validate both outgoing and incoming connections).
As I see it, by the time you restart the leader (and hence quorum switches to the non-validating portion of the cluster) all of your clients will suddenly fail (they are still validating, and there's no good server for them to connect to). Conversely, if you restarted the clients first they would all become unavailable before the quorum switch happened.
I was speaking more to 2b, and it shouldn’t be expensive because the other servers should spin down after you’ve migrated. You’re maybe paying for a day of overlap if that, and you’re paying for massively reduced risk.
Fair point on the consensus stuff. I was keying in on the Patroni system, but reading up more, it looks like that's less novel than it seemed on first read (the "framework" lines in the library README had me worried).
I had the exact same thought on the consul rolling restarts. I’ve done exactly this before. I’m assuming there was some other issue I’m not getting as to why that wouldn’t have worked for them.