
I did approximately this recently, but on a Linux machine on GCP. It sucked far worse than it should have: apparently GCP cannot reliably “stop” a VM in a timely manner. And you can’t detach a boot disk from a VM that isn’t “stopped”, nor can you multi-attach it, nor can you (AFAICT) convince a VM to boot off an alternate disk.

I used to have this crazy idea that fancy cloud vendors had competent management tools. Like maybe I could issue an API call to boot an existing instance from an alternate disk or HTTPS netboot URL. Or to insta-stop a VM and get block-level access to its disk via API, even if I had to pay for the instance while doing this.

And I’m not sure that it’s possible to do this sort of recovery at all without blowing away local SSD. There’s a “preview” feature for this on GCP, which seems to be barely supported, and I bet it adds massive latency to the process. Throwing away one’s local SSD on every single machine in a deployment sounds like a great way to cause potentially catastrophic resource usage when everything starts back up.

Hmm, I wonder if you’re even guaranteed to be able to get your instance back after stopping it.

WTF. Why can’t I have any means to access the boot disk of an instance, in a timely manner? Or any better means to recover an instance?

Is AWS any better?




AWS really isn't any better on this. In fact, two years ago (to the day!) we had a complete AZ outage in our local AWS region. This resulted in their control plane going nuts and being unable to shut down or start new instances. Then came the capacity problems.


That's happened several times, actually; that's probably just the latest one. The really fun one was when S3 went down in Virginia in 2017. It caused global outages of multiple services, because most services were housed out of Virginia, and when EC2 and other services went offline due to their dependency on S3, everything cascade-failed across multiple regions (in terms of start/stop/delete, i.e. API actions; stuff that was already running was, for the most part, still working in some places).

...I remember that day pretty well. It was a busy day.


> apparently GCP cannot reliably “stop” a VM in a timely manner.

In OCI we made a decision years ago that 15 minutes after sending the ACPI shutdown signal, the instance should be hard powered off. We do the same for VMs and BMs. If you really want to, the shutdown and reboot commands take an optional parameter to bypass this and do an immediate hard power off.

So worst case scenario here, 15 minutes to get it shut down and be able to detach the boot volume to attach to another instance.
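
For reference, a rough sketch of those two paths with the OCI Python SDK, assuming a standard local OCI config (the instance OCID is a placeholder):

    import oci

    # Assumes a standard ~/.oci/config profile; the OCID below is a placeholder.
    compute = oci.core.ComputeClient(oci.config.from_file())
    instance_id = "ocid1.instance.oc1..example"

    # SOFTSTOP sends the ACPI shutdown signal; the instance is hard powered
    # off after the 15-minute grace period described above.
    compute.instance_action(instance_id, "SOFTSTOP")

    # STOP skips the grace period and hard powers off immediately.
    # compute.instance_action(instance_id, "STOP")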


I had this happen to one of my VMs: I was trying to compile something and ran out of memory, then tried to stop the VM, and it only came back after 15 minutes. I think it is a good compromise: long enough to give a clean reboot a chance, but short enough to prevent longer downtimes.

I’m just a free tier user but OCI is quite powerful. It feels a bit like KDE to me where sometimes it takes a while to find out where some option is, but I can always find it somewhere, and in the end it beats feeling limited by lack of options.


We tried shorter time periods back in the earlier days of our platform. Unfortunately, what we found is that the few times we tried to lower it from 15 minutes, we ended up with Windows users experiencing corrupted drives. Our best blind interpretation is that some things common enough on Windows can take up to 14 minutes to shut down under the worst circumstances. So 15 minutes it is!


This sounds appealing. Is OCI the only cloud to offer this level of control?


Based on your description, AWS has another level of stop, the "force stop", which one can use in such cases. I don't have statistics on the time, so I don't know if that meets your criteria of "timely", but I believe it's quick enough (sub-minute, I think).
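
A minimal boto3 sketch of that pattern, assuming the usual credentials setup (the instance ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2")
    instance_id = "i-0123456789abcdef0"  # placeholder

    # Normal stop: ACPI shutdown, then the hypervisor kills the guest after a timeout.
    ec2.stop_instances(InstanceIds=[instance_id])

    # "Force stop": repeat the call with Force=True if it hangs in "stopping".
    ec2.stop_instances(InstanceIds=[instance_id], Force=True)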


There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.
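
Roughly, with boto3 (the volume and instance IDs are placeholders, and the corruption caveat above still applies):

    import boto3

    ec2 = boto3.client("ec2")

    # Force-detach the volume from the stuck instance. Risky: the guest may
    # still be flushing writes while it shuts down.
    ec2.detach_volume(
        VolumeId="vol-0123456789abcdef0",  # placeholder
        InstanceId="i-0123456789abcdef0",  # placeholder stuck instance
        Force=True,
    )

    # Re-attach it to a healthy recovery instance as a secondary device.
    ec2.attach_volume(
        VolumeId="vol-0123456789abcdef0",
        InstanceId="i-0fedcba9876543210",  # placeholder recovery instance
        Device="/dev/sdf",
    )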

As for "throwing away local SSD", that only happens on AWS with instance store volumes which used to be called ephemeral volumes as the storage was directly attached to the host you were running on and if you did a stop/start of an ebs-backed instance, you were likely to get sent to a different host (vs. a restart API call, which would make an ACPI soft command and after a duration...I think it was 5 minutes, iirc, the hypervisor would kill the instance and restart it on the same host).

When the instance got sent to a different host, it got different instance storage: the old instance storage was wiped from the previous host, and new instance storage was provisioned on the new host.

However, EBS volumes travel from host to host across stop/start cycles; they're attached over a very low-latency network connection to the EBS servers and presented to the instance as a local block device. It's not quite as fast as local instance store, but it's fast enough for almost every use case if you get enough IOPS provisioned, either through direct provisioning + the correct instance size, or through a large enough drive + a large enough instance to maximize the connection to EBS (there's a table detailing IOPS, throughput, and instance size in the docs).
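
As a rough illustration of the "direct provisioning" option with boto3 (the AZ and numbers are placeholders; the IOPS you actually see is still capped by the instance type's EBS limits per those docs):

    import boto3

    ec2 = boto3.client("ec2")

    # Provisioned-IOPS volume; actual throughput also depends on the
    # instance's EBS bandwidth limits.
    ec2.create_volume(
        AvailabilityZone="us-east-1a",  # placeholder AZ
        Size=500,                       # GiB
        VolumeType="io2",
        Iops=16000,
    )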

Also, support can detach the volume if the instance is stuck shutting down and doesn't get shut down by the API after the timeout.

None of this is by any means "ideal", but the complexity of these systems is immense and what they're capable of at the scale they operate is actually pretty impressive.

The key is...lots of the things you talk about are doable at small scale, but when you add more and more operations and complexity to the tool stack for interacting with systems, you add a lot of back-end network overhead, which leads to extreme congestion, even in very high-speed networks (it's an exponential scaling problem).

The "ideal" way to deal with these systems is to do regular interval backups off-host (ie. object/blob storage or NFS/NAS/similar) and then just blow away anything that breaks and do a quick restore to the new, fixed instance.

It's obviously easier said than done: most shops still, on some level, think about VMs/instances as pets rather than cattle, or have hurdles that make treating them as cattle much more challenging. But manual recovery in the cloud should, in general, be avoided in favor of spinning up something new and redeploying to it.
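
A minimal boto3 sketch of that backup/restore flow, assuming EBS-backed data volumes (the IDs and AZ are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # Periodic off-host backup: snapshot the data volume on a schedule...
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",  # placeholder
        Description="scheduled backup",
    )

    # ...and on failure, restore into a fresh volume for a newly launched
    # instance instead of trying to repair the broken one.
    ec2.create_volume(
        AvailabilityZone="us-east-1a",     # placeholder AZ
        SnapshotId=snap["SnapshotId"],
    )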


> There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.

This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?

The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.

Also, there should be a way to force stop an instance that is already stopping.


>This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?

The issue is far more nuanced than that. These systems are very complex: it's a hypervisor with layers of applications and interfaces on top of it to allow scaling. The hosts do all have BMCs (last I knew...though I know some people wanted to get rid of the BMC because BMCs are unreliable, which is, yes, a real issue when you deal with scale. I've had to reset countless stuck BMCs and have had some BMCs that were outright dead).

The hypervisor is certainly capable of killing an instance instantly, but the preferred method is an orderly shutdown. In the case of a reboot or a stop (or a terminate where the EBS volume is not also deleted on termination), it's preferable to avoid data corruption, so the hypervisor attempts an orderly shutdown and then, after a timeout period, just kills the instance if it hasn't already shut down in an orderly manner.

Furthermore, there's a lot more complexity to the problem than just "kill the guest". There are processes that manage the connection to the EBS backend that provides the interface for the EBS volume, as well as APIs and processes to manage network interfaces, firewall rules, monitoring, and a whole host of other things. If the monitoring process gets stuck, it may not properly detect an unhealthy host, and external automated remediation may not take action. That same monitoring is often responsible for individual instance health and recovery (i.e. auto-recover), and if it's not functioning properly, it won't take remediation actions to kill the instance and start it up elsewhere. The hypervisor itself may also not be properly responsive, so a call from the API won't trigger a shutdown action.

If the control plane and the data plane (in this case, the hypervisor/host) are not syncing/communicating (particularly on a stop or terminate), the API needs to ensure that the state machine is properly preserved and that the instance is not running in two places at once. You can then "force" stop or "force" terminate, and/or the control plane will update the state in its database and the host will sync later. There is a possibility of data corruption or doubled send/receive of data in the force case, which is why it's not preferred. Also, after the timeout (without the "force" flag), it will go ahead and mark the instance terminated/stopped and sync later; the "force" just tells the control plane to do it immediately, likely because you're not concerned with data corruption on the EBS volume, which may end up double-mounted if you start up again while the old instance is not fully terminated.

>The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.

It does have a concept where all resources are still held and billed except CPU and memory; that's effectively what a reboot does. Same with a stop (except you're not billed for compute usage and network usage will obviously be zero, though an EIP would still incur charges). The transition between stopped and running is also fast; the only delays come from the control plane, either via capacity constraints causing issues placing an instance/VM or via the chosen host not communicating properly, but in most cases it is a fast transition. I'm usually up and running in under 20 seconds when I start an existing instance from a stopped state. There's also now a hibernate/sleep state the instance can be put into via the API (if it's Windows), where the instance acts just like a regular Windows machine going to sleep or hibernating.
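
For what it's worth, a rough boto3 sketch of that hibernate path, assuming the instance was launched with hibernation enabled (the ID is a placeholder):

    import boto3

    # Requires hibernation to have been enabled at launch, plus an encrypted
    # root volume large enough to hold the RAM contents.
    boto3.client("ec2").stop_instances(
        InstanceIds=["i-0123456789abcdef0"],  # placeholder
        Hibernate=True,
    )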

>Also, there should be a way to force stop an instance that is already stopping.

There is. I believe I referred to it in my initial response. It's a flag you can set in the API/SDK/CLI/web console when you select "terminate" or "stop". If the stop/terminate command doesn't execute in a timely manner, you can call the same thing again with a "force" flag, which tells the control plane to forcefully terminate: it marks the instance as terminated and asynchronously rectifies state once the hypervisor can execute commands. The control plane updates the state (though sometimes it can get stuck and require remediation by someone with operator-level access), is notified that you don't care about data integrity/orderly shutdown, and (once it has updated the state in the control plane, regardless of the state of the data plane) marks the instance as "stopped" or "terminated". Then you can either start again, which should kick you over to a different host (there are some exceptions), or, if you terminated, launch a new instance, attach the EBS volume (if you chose not to delete it on termination), and retrieve the data (or use the data, or do whatever you were doing with that particular volume).

Almost all of that information is actually in the public docs; I only added a little bit of color about how the backend operates. There are hundreds of programs that run to make sure the hypervisor and control plane are both in sync and able to manage resources, and if just a few of them hang, are unable to communicate, or the system runs out of resources (more of a problem on older, non-Nitro hosts, which have a completely different architecture with completely different resource allocations), then the system can become only partially functional...enough so that remediation automation won't step in, or can't step in because other guests appear to be functioning normally.

There are many different failure modes of varying degrees of "unhealthy", and many of them are undetectable or need manual remediation, but they are statistically rare; by and large, most hosts operate normally. On a normally operating host, forcing a shutdown/terminate works just fine and is fast. Even when some of the programs managing the host are not functioning properly, launch/terminate/stop/start/attach/detach all tend to keep working (along with the "force" on detach, terminate, and stop), even if one or two functions of the host are broken. It's also possible (and has happened several times) that a particular resource vector is not functioning properly while the rest of the host is fine; in that case, the particular vector can be isolated and the rest of the host keeps working.

It's literally these tiny little edge cases that happen maybe 0.5% of the time that cause things to move slower, and at scale, a normal host with a normal BMC would have the same issues. I've had to clear stuck BMCs on those hosts, and I've dealt with completely dead BMCs. When those states occur, if there's also a host problem, remediation can't go in and remedy host-level problems, which can lead to those control-plane delays as well as the need to call a "force".

Conclusion: it may SEEM like it should be super easy, but there are about a million different moving parts at cloud vendors, and it's not as simple as killing it with fire and vengeance (i.e. a QEMU guest kill). BMCs and hypervisors do have an instant kill switch (and guest kill is used on the hypervisor, as is a BMC power-off, in the right remediation circumstances), but you're assuming those things always work. BMCs fail. BMCs get stuck. You likely haven't had the issue because you're not dealing with enough scale; I've had to reset BMCs manually more times than I can count and have also dealt with more than my fair share of dead ones. So "power off immediately" does not always work, which means a disconnect occurs between the control plane and the data plane. There are also delays built into the remediation automation, to give things enough time to respond to the given commands, which leads to additional wait time.


I understand that this complexity exists. But in my experience with Google Compute, this isn’t a 1%-of-the-time problem with something getting stuck. It’s a “GCP lacks the capability” issue. Here’s the API:

https://cloud.google.com/compute/docs/reference/rest/v1/inst...

AWS does indeed seem more enlightened:

https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_S...
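
Side by side, a rough sketch with the respective Python SDKs (project/zone/instance names and the instance ID are placeholders):

    from google.cloud import compute_v1
    import boto3

    # GCE: stop() takes project/zone/instance and returns an operation to
    # wait on; per the complaint above, there's no force-style knob on this call.
    compute_v1.InstancesClient().stop(
        project="my-project", zone="us-central1-a", instance="my-vm"  # placeholders
    )

    # EC2: the equivalent stop call accepts Force=True to skip the graceful shutdown.
    boto3.client("ec2").stop_instances(
        InstanceIds=["i-0123456789abcdef0"], Force=True  # placeholder
    )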


Yeah, AWS rarely has significant capacity issues. While capacity utilization typically sits around 90% across the board, they're constantly landing new capacity, recovering broken capacity, and working to fix issues that cause things to get stuck (plus lots of alarms and monitoring).

I worked there for just shy of 7 years and dealt with capacity tangentially (knew a good chunk of their team for a while and had to interact with them frequently) across both teams I worked on (support and then inside the EC2 org).

Capacity management, while their methodologies for expanding it were in my opinion antiquated and unenlightened for a long time, was still rather effective. I'm pretty sure that's why they never updated their algorithm for increasing capacity to be more JIT. Now they have a LOT more flexibility in capacity thanks to resource vectoring, because you no longer have hosts with fixed instance sizes for the entire host (homogeneous). You can now fit everything together like Legos as long as it's the same family (i.e. c4 with c4, m4 with m4, etc.), and additional work was being done (and in use) to allow cross-family resource vectoring as well.

Resource vectors took a LONG time for them to get in place and when they did, capacity problems basically went away.

The old way of doing it was: if you wanted more capacity for, say, c4.xlarge, you'd either have to drop in new capacity and build it out so the entire host had ONLY c4.xlarge, or you'd have to rebuild excess capacity within the c4 family in that zone (or even down to the datacenter level) to be specifically built out as c4.xlarge.

Resource vectors changed all that. DRAMATICALLY. Also, reconfiguring a host's recipe now takes minutes, rather than the hours needed to rebuild a host. So capacity is infinitely more fungible than it was when I started there.

Also, I think resource vectoring came on the scene around 2019 or so? I don't think it was there in 2018 when I went to work for EC2...but it was there for a few years before I quit...and I think it was in-use before the pandemic...so, 2019 sounds about right.

Prior to that, though, capacity was a much more serious issue and much more constrained on certain instance types.



