No respect for any CIO who makes their business depend on a single operating system and runs automatic updates of system software without any canaries or phased deployments.
While I believe Linux is a more reasonable operating system than Windows, shit can happen everywhere.
So if you have truly mission-critical systems, you should probably have at least two significantly different systems, each of which can maintain some emergency operations independently. Doing this with two Linux distros is easier than doing it with Linux and Windows. For workstations, Macs could be considered; for servers, BSD.
Probably many companies will accept the risk that everything goes down. (Well, they probably don't say that. They say maintaining a healthy mix is too expensive.)
In that case you need a clearly phased approach to all updates. First update some canaries used by IT. If that goes well, update 10% of production. If that goes well (and you have to wait until affected employees have actually worked for a reasonable time), you can roll out to increasingly larger portions.
No testing in a lab (whether at the vendor or in your own IT) will ever find all problems. If something slips through and affects 10% of your company, that is significantly different from affecting (nearly) everyone.
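To make that concrete, here is a minimal sketch of such gating in Python; the ring names, fractions, soak times, and the deploy/healthy helpers are all hypothetical placeholders, not any vendor's actual rollout mechanism:

    import time

    # Hypothetical rings and thresholds; tune to your own fleet.
    STAGES = [
        # (ring name, fraction of fleet, soak time in seconds)
        ("it-canaries", 0.01, 4 * 3600),   # IT's own machines first
        ("early-ring", 0.10, 8 * 3600),    # ~10% of production, soak a working day
        ("broad-ring", 0.50, 8 * 3600),
        ("everyone", 1.00, 0),
    ]

    def deploy(update_id: str, ring: str, fraction: float) -> None:
        """Placeholder: push the update to this fraction of hosts."""
        print(f"deploying {update_id} to {ring} ({fraction:.0%} of the fleet)")

    def healthy(ring: str) -> bool:
        """Placeholder: check crash telemetry / support tickets for this ring."""
        return True

    def rollout(update_id: str) -> bool:
        for ring, fraction, soak in STAGES:
            deploy(update_id, ring, fraction)
            time.sleep(soak)  # wait until affected employees have actually worked
            if not healthy(ring):
                print(f"halting {update_id}: problems in {ring}")
                return False  # blast radius stays at this ring's fraction
        return True

The point is only that each stage blocks on real-world evidence before the next one, so a bad update stops at 1% or 10% of the fleet instead of at (nearly) everyone.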
What makes you think Windows is the only alternative? Have you never heard of GNU Hurd?
More seriously, I am not saying you should run critical services on MenuetOS or RISC OS, but the BSDs are still alive and kicking, as are illumos and its derivatives. And yes, I think a bit of diversity adds some resilience. It may require more staff, but imho the extra resilience is worth the downsides.
Presumably they do test their updates; maybe the tests just aren't good enough.
The ideal would be to do canary rollouts (1%, then 5%, then 10%, etc.) to minimise blast radius, but I guess that's incompatible with antivirus tools protecting you from 0-day exploits.
While I'm usually a proponent of update waves like that, I know some teams can get lax about the idea if they decide an update isn't worth that level of care.
I'm not saying CS doesn't care enough, but what looks like a minor update to the team that shipped it, one not seen as needing a slow rollout, may be exactly the kind of change that should be supervised that way.
Our worst outage occurred when we were deploying some kernel security patches, grew complacent, and updated the main database and its replica at the same time. We had a maintenance window with downtime scheduled at the same time anyway, so whatever. The update had worked on the other couple hundred systems.
Except, unknown to us, our virtualization provider was having a massive infrastructure issue at exactly that moment, preventing VMs from booting back up... Failing services over to the secondary DC did not make for a fun night.
The issue is the update rollout process, the lack of diversity in these kinds of tools across the industry, and the software industry's absolute failure to make decent software without bugs and security holes.