I also fall on the side of "stagger the rollout" (or "give customers tools to stagger the rollout"), but at the same time I recognize that a lot of customers would not accept delays on the latest malware data.
Before the incident, if you had asked customers whether they would like to get updates faster even if it meant a remote chance of a problem for them... I bet they'd still have wanted updates faster.
I would say that canary releases are an absolute must, 100%. Except I can think of cases where even that might not be enough. So I just don't feel comfortable judging them out of hand. Does all the evidence seem to point against them? For sure. But I just don't feel comfortable giving that final verdict without knowing for sure.
Specifically because this is about fighting against malicious actors, where time can be of the essence to deploy some sort of protection against a novel threat.
If there are deadlines that you can go over and nothing bad happens, then for sure: always have canary releases, perfect QA, and thorough monitoring of everything. But I'm just saying there can be cases where the damage that could be done if you don't act fast enough is just so much worse.
And I don't know that it wasn't the case for them. I just don't know.
> Specifically because this is about fighting against malicious actors, where time can be of the essence to deploy some sort of protection against a novel threat.
This severely overstates the problem: an extra few minutes is not going to be the difference between their customers being compromised or not. Most of the devices they run on are never compromised, because anyone remotely serious has defense in depth.
If it were true, or even close to true, that would make the criticism stronger rather than weaker. If time is of the essence, you invest in things like reviewing test coverage (their most glaring lapse), fuzz testing, and common reliability engineering techniques like having the system roll back to the last known good configuration after it has failed to load. We think of progressive rollouts as common now, but they became mainstream in large part because the Google Chrome team realized rapid updates are important and then asked what they needed to do to make them safe. CrowdStrike’s report suggests that they wanted rapid updates but weren’t willing to invest in the implementation because that isn’t a customer-visible feature, until it very painfully became one.
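To make the "last known good configuration" point concrete, here is a minimal Python sketch. It is not how CrowdStrike's sensor works (I have no idea what their loader looks like); `ConfigLoader` and `parse_rules` are names I made up, and the JSON rules format is purely an assumption for illustration. The idea is only that a new update replaces the active one after it has parsed and validated, so a bad file leaves the previous one in place.

```python
import json
from typing import Any, Callable, Optional


class ConfigLoader:
    """Keeps the last configuration that validated and falls back to it."""

    def __init__(self, validate: Callable[[bytes], Any]) -> None:
        self._validate = validate
        self.active: Optional[Any] = None  # last known good config, if any

    def apply(self, raw: bytes) -> bool:
        """Try a new config; on any failure keep the previous one active."""
        try:
            candidate = self._validate(raw)
        except Exception:
            return False  # bad update rejected, self.active is untouched
        self.active = candidate
        return True


def parse_rules(raw: bytes) -> dict:
    """Hypothetical validator: a JSON object with a 'rules' list."""
    doc = json.loads(raw)
    if not isinstance(doc, dict) or not isinstance(doc.get("rules"), list):
        raise ValueError("malformed rules file")
    return doc


loader = ConfigLoader(parse_rules)
loader.apply(b'{"rules": ["block-x"]}')  # accepted
loader.apply(b'{"rules": ')              # truncated file, rejected
print(loader.active)                     # still the first config
```

A real implementation would also have to survive a crash during load (for example by recording a "pending" marker and reverting on the next boot), which this sketch does not attempt.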
I’d consider staggering a rollout to be the absolute basics of due diligence.
Especially when you’re building a critical part of millions of customer machines.
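For anyone wondering what "staggering" means mechanically, a rough Python sketch follows. Everything in it is an assumption for illustration: the stage sizes, the `deploy` and `healthy` callbacks, and the hash-based bucketing are mine, not anything from CrowdStrike's report. Hosts are assigned to stable buckets, the update goes out in growing waves, and the rollout halts as soon as a wave reports an unhealthy host.

```python
import hashlib
from typing import Callable, Iterable, Sequence, Set, Tuple

# Hypothetical wave sizes in basis points: 1%, 10%, 50%, then everyone.
STAGES: Sequence[int] = (100, 1_000, 5_000, 10_000)


def bucket(host_id: str) -> int:
    """Map a host to a stable bucket in [0, 10000) so waves are reproducible."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 10_000


def staged_rollout(
    hosts: Iterable[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[int] = STAGES,
) -> Tuple[bool, Set[str]]:
    """Deploy wave by wave; stop at the first wave with an unhealthy host."""
    hosts = list(hosts)
    done: Set[str] = set()
    for limit in stages:
        # Each wave is every not-yet-updated host whose bucket is under the limit.
        wave = [h for h in hosts if h not in done and bucket(h) < limit]
        for h in wave:
            deploy(h)
        done.update(wave)
        if not all(healthy(h) for h in wave):
            return False, done  # halt: later waves never receive the update
    return True, done


fleet = [f"host-{i}" for i in range(200)]
ok, touched = staged_rollout(fleet, deploy=lambda h: None, healthy=lambda h: False)
print(ok, len(touched), "of", len(fleet))
```

With a broken update (every health check failing, as in the usage lines above), only the first non-empty wave is affected instead of the whole fleet. In practice you would also wait between waves so health signals have time to arrive; the sketch checks immediately to stay short.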