> There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.
This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low-level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.) Why on Earth can’t public clouds do it?
The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.
Also, there should be a way to force stop an instance that is already stopping.
>This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low-level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.) Why on Earth can’t public clouds do it?
The issue is far more nuanced than that. These systems are very complex: the hypervisor carries layers of applications and interfaces to allow scaling. The hosts do all have BMCs (last I knew, though some people wanted to get rid of them precisely because BMCs are unreliable, and that is a real issue at scale; I've had to reset countless stuck BMCs and dealt with some that were outright dead).
The hypervisor is certainly capable of killing an instance instantly, but the preferred method is an orderly shutdown. In the case of a reboot, a stop, or a terminate where the EBS volume is not also deleted on termination, it's preferable to avoid data corruption, so the hypervisor attempts an orderly shutdown; after a timeout period, it will just kill the instance if it has not already shut down on its own.
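That timeout-then-kill behavior can be sketched roughly like this. This is a toy simulation, not real hypervisor code: the class and function names are hypothetical, and the real timeout is on the order of minutes, not the short value used here.

```python
import time

class Guest:
    """Toy stand-in for a hypervisor guest (hypothetical, for illustration)."""
    def __init__(self, responds_to_acpi):
        self.responds_to_acpi = responds_to_acpi
        self.state = "running"

    def request_acpi_shutdown(self):
        # A cooperative guest powers itself off; a hung one ignores the signal.
        if self.responds_to_acpi:
            self.state = "stopped"

    def hard_kill(self):
        # Equivalent to killing the QEMU process: no guest cooperation needed.
        self.state = "stopped"

def stop_instance(guest, timeout_s=300.0, poll_s=0.01):
    """Try an orderly shutdown first; hard-kill after the timeout."""
    guest.request_acpi_shutdown()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if guest.state == "stopped":
            return "orderly"
        time.sleep(poll_s)
    guest.hard_kill()
    return "forced"
```

A healthy guest exits via the orderly path almost immediately; a hung one eats the whole timeout before the kill, which is exactly the delay users observe on a stuck stop.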
Furthermore, there's a lot more complexity to the problem than just "kill the guest". There are processes that manage the connection to the EBS backend that provides the interface for the EBS volume, as well as APIs and processes to manage network interfaces, firewall rules, monitoring, and a whole host of other things.

If the monitoring process gets stuck, it may not properly detect an unhealthy host, and external automated remediation may not take action. That same monitoring is often responsible for individual instance health and recovery (i.e. auto-recover), and if it's not functioning properly, it won't take remediation actions to kill the instance and start it up elsewhere. The hypervisor itself may also be unresponsive, so a call from the API won't trigger a shutdown action.

If the control plane and the data plane (in this case, the hypervisor/host) are not syncing or communicating (particularly on a stop or terminate), the API needs to ensure that the state machine is properly preserved and the instance is not running in two places at once. You can then "force" stop or "force" terminate, and/or the control plane will update state in its database and the host will sync later. There is a possibility of data corruption or doubly sent/received data in a force case, which is why it's not preferred. After the timeout (even without the "force" flag), the control plane will go ahead and mark the instance terminated/stopped and sync later; the "force" just tells the control plane to do it immediately, likely because you're not concerned with data corruption on the EBS volume, which may be double-mounted if you start up again while the old instance is not fully terminated.
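The double-mount risk at the end of that paragraph is why the normal attach path refuses to proceed while a volume is still held. A minimal sketch of that guard, with hypothetical names (this is not the real EBS control-plane code, just the shape of the check):

```python
class AttachmentError(Exception):
    pass

class VolumeRegistry:
    """Toy control-plane view of which instance holds each volume."""
    def __init__(self):
        self.attached_to = {}  # volume_id -> instance_id

    def attach(self, volume_id, instance_id, force=False):
        holder = self.attached_to.get(volume_id)
        if holder is not None and holder != instance_id and not force:
            # Normal path: refuse while the old instance may still be writing.
            raise AttachmentError(f"{volume_id} still attached to {holder}")
        # Force path: the caller accepts the risk of two writers on one volume.
        self.attached_to[volume_id] = instance_id

    def detach(self, volume_id):
        self.attached_to.pop(volume_id, None)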
>The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.
It does have a concept where all resources are still held and billed, except CPU and memory; that's effectively what a reboot does. Same with a stop, except you're not billed for compute usage, and network usage will obviously be zero (though if you have an EIP, that still incurs charges). The transition between stopped and running is also fast; the only delays come from the control plane, either via capacity constraints causing issues placing an instance/VM, or via the chosen host not communicating properly. In most cases it's a fast transition: I'm usually up and running in under 20 seconds when I start an existing instance from a stopped state. There's also now a hibernate/sleep state the instance can be put into via the API (if it's Windows), where the instance acts just like the sleep/hibernate state of a regular Windows machine.
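The stop/start cycle being asked for maps onto a lifecycle like the one below. This is a simplified sketch of an EC2-style state machine (the state names follow the documented instance states; note that a reboot never leaves "running", which is why it holds all resources):

```python
# Simplified EC2-style instance lifecycle: which states can follow which.
TRANSITIONS = {
    "pending":       {"running"},
    "running":       {"stopping", "shutting-down"},  # stop vs. terminate
    "stopping":      {"stopped"},
    "stopped":       {"pending", "shutting-down"},   # start again, or terminate
    "shutting-down": {"terminated"},
    "terminated":    set(),                          # terminal state
}

def step(state, target):
    """Advance the instance one state, rejecting illegal transitions."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"cannot go from {state} to {target}")
    return target
```

The stopped state is exactly the "resources held but not running" concept: the only way forward is back to pending/running or on to termination.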
>Also, there should be a way to force stop an instance that is already stopping.
There is; I believe I referred to it in my initial response. It's a flag you can set in the API/SDK/CLI/web console when you select "terminate" or "stop". If the stop/terminate command doesn't execute in a timely manner, you can call the same thing again with a "force" flag and tell the control plane to forcefully terminate, which marks the instance as terminated and asynchronously tries to rectify state when the hypervisor can execute commands. The control plane updates the state (though sometimes it can get stuck and require remediation by someone with operator-level access), is notified that you don't care about data integrity/orderly shutdown, and (once it has updated the state in the control plane, regardless of the state of the data plane) marks the instance "stopped" or "terminated". Then you can either start again, which should land you on a different host (with some exceptions), or, if you terminated, launch a new instance, attach the EBS volume (if you chose not to delete it on termination), and retrieve the data (or use it however you were using that particular volume).
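The flow described, where a forced re-call marks the state immediately and reconciles with the host later, can be sketched like this. All names here are hypothetical; it only illustrates the control-plane/data-plane split, not AWS's actual implementation:

```python
from collections import deque

class ControlPlane:
    """Toy sketch of the stop vs. force-stop flow."""
    def __init__(self):
        self.state = {}            # instance_id -> recorded state
        self.reconcile = deque()   # work to sync with the hypervisor later

    def stop(self, instance_id, force=False):
        current = self.state.get(instance_id, "running")
        if force:
            # Mark stopped immediately; rectify with the host asynchronously.
            self.state[instance_id] = "stopped"
            self.reconcile.append(("ensure-stopped", instance_id))
        elif current == "running":
            # Normal path: wait for the host to confirm the orderly shutdown.
            self.state[instance_id] = "stopping"
        return self.state[instance_id]
```

The first call leaves the instance in "stopping" until the host reports back; calling again with `force=True` flips the recorded state at once and queues the cleanup, which is exactly the asynchronous rectification described above.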
Almost all of that information is actually in the public docs; I only added a little bit of color about how the backend operates.

There are hundreds of programs that run to make sure the hypervisor and control plane are in sync and able to manage resources. If just a few of them hang, or are unable to communicate, or the system runs out of resources (more of a problem on older, non-Nitro hosts, which are a completely different architecture with completely different resource allocations), then the system can become partially functional: just enough that remediation automation won't step in, or can't, because other guests appear to be functioning normally. There are many different failure modes of varying degrees of "unhealthy", and many of them are undetectable or need manual remediation, but they are statistically rare; by and large, most hosts operate normally.

On a normally operating host, forcing a shutdown/terminate works just fine and is fast. Even when some of the programs managing the host are malfunctioning, launch/terminate/stop/start/attach/detach all tend to keep working (along with "force" on detach, terminate, and stop), even if one or two functions of the host do not. It's also possible (and has happened several times) that a particular resource vector stops functioning properly while the rest of the host is fine; in that case, the particular vector can be isolated and the rest of the host keeps working. It's literally these tiny little edge cases, happening maybe 0.5% of the time, that cause things to move slower, and at scale a normal host with a normal BMC would have the same issues. E.g. I've had to clear stuck BMCs on those hosts, and I've dealt with completely dead BMCs. When those states occur alongside a host problem, remediation can't go in and fix host-level problems, which can lead to those control-plane delays as well as the need to call a "force".
Conclusion: it may SEEM like it should be super easy, but there are about a million different moving parts at cloud vendors, and it's not as simple as "kill it with fire and vengeance" (i.e. a QEMU guest kill). BMCs and hypervisors do have an instant kill switch (and guest kill is used on the hypervisor, as is a BMC power-off, in the right remediation circumstances), but you're assuming those things always work. BMCs fail. BMCs get stuck. You likely haven't hit the issue because you're not dealing with enough scale. I've had to reset BMCs manually more times than I can count, and I've also dealt with more than my fair share of dead ones. So "power off immediately" does not always work, which means a disconnect occurs between the control plane and the data plane. There are also deliberate delays in the remediation actions automation takes, to give things enough time to respond to the given commands, which adds additional wait time.
I understand that this complexity exists. But in my experience with Google Compute, this isn’t a 1%-of-the-time problem with something getting stuck. It’s a “GCP lacks the capability” issue. Here’s the API:
Yeah, AWS rarely has significant capacity issues. While capacity utilization typically sits around 90% across the board, they're constantly landing new capacity, recovering broken capacity, and working to fix issues that cause things to get stuck (with lots of alarms and monitoring).
I worked there for just shy of 7 years and dealt with capacity tangentially (knew a good chunk of their team for a while and had to interact with them frequently) across both teams I worked on (support and then inside the EC2 org).
Their capacity methodologies for expanding, while in my opinion antiquated and unenlightened for a long time, were still rather effective; I'm pretty sure that's why they never updated their algorithm for increasing capacity to be more JIT. They have a LOT more flexibility in capacity now that they have resource vectoring, because you no longer have hosts fixed to a single instance size for the entire host (homogeneous). You now have the ability to fit everything like legos as long as it's the same family (e.g. c4 with c4, m4 with m4, etc.), and additional work on cross-family resource vectoring was also in use.
Resource vectors took a LONG time for them to get in place and when they did, capacity problems basically went away.
The old way of doing it was that if you wanted more capacity for, say, c4.xlarge, you'd either have to drop new capacity and build it out so the entire host had ONLY c4.xlarge, or you'd have to rebuild excess capacity within the c4 family in that zone (or even down to the datacenter level) to be specifically built out as c4.xlarge.
Resource vectors changed all that. DRAMATICALLY. Also, reconfiguring a host's recipe now takes minutes, rather than the hours needed to rebuild a host. So capacity is infinitely more fungible than it was when I started there.
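The "fit everything like legos" idea is essentially vector bin-packing: each host exposes a capacity vector, and any mix of sizes within the family fits as long as the vector has room. A minimal sketch (the size table is illustrative, not authoritative AWS specs, and the class is hypothetical):

```python
# Illustrative (vCPU, memory GiB) footprints per instance type.
SIZES = {
    "c4.xlarge":  (4, 7.5),
    "c4.2xlarge": (8, 15),
    "c4.4xlarge": (16, 30),
    "m4.xlarge":  (4, 16),
}

class Host:
    """A host with a free-capacity vector, restricted to one family."""
    def __init__(self, family, vcpus, mem_gib):
        self.family = family
        self.free = [vcpus, mem_gib]

    def place(self, instance_type):
        """Try to pack an instance onto this host; True on success."""
        family = instance_type.split(".")[0]
        cpus, mem = SIZES[instance_type]
        if family != self.family or cpus > self.free[0] or mem > self.free[1]:
            return False
        self.free[0] -= cpus
        self.free[1] -= mem
        return True
```

Under the old homogeneous model, the equivalent host could only ever hold one fixed size; here a c4 host absorbs whatever c4 mix demand happens to produce, which is why reconfiguring for a hot instance type stopped requiring a rebuild.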
Also, I think resource vectoring came on the scene around 2019 or so? I don't think it was there in 2018 when I went to work for EC2, but it was there for a few years before I quit, and I think it was in use before the pandemic, so 2019 sounds about right.
Prior to that, though, capacity was a much more serious issue and much more constrained on certain instance types.