
The core issue? I'd say it's QA.

Deploy to a QA server fleet first. If stuff is broken, you catch it there. 100% prevention.




It's quite handy that all the things that pass QA never fail in production. :)

On a serious note, we have no way of knowing whether their update passed some QA or not; likely it didn't, but we don't know. Regardless, the post you're replying to, IMHO, correctly makes the point that no matter how good your QA is, it will not catch everything. When something slips, you are going to need good observability and staggered, gradual, rollbackable rollouts.

Ultimately, unless it's a nuclear power plant or something mission-critical with no redundancy, I don't care whether it passes QA; I care that it doesn't cause damage in production.

Had this been halted after bricking 10, 100, 1,000, 10,000, heck, even 100,000 machines or a whopping 1,000,000 machines, it would barely have made it outside of tech news circles.
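
Something like the sketch below is all it would take on the vendor side: push to progressively larger rings and refuse to continue when the failure rate spikes. (The ring sizes, thresholds, and the push_to / count_unhealthy hooks are made up for illustration, not CrowdStrike's actual pipeline.)

    import time

    ROLLOUT_RINGS = [10, 100, 1_000, 10_000, 100_000, 1_000_000]  # machines per stage
    MAX_FAILURE_RATE = 0.01   # halt if more than 1% of deployed machines go dark
    SOAK_SECONDS = 30 * 60    # let telemetry accumulate before judging a ring

    def deploy_update(update, push_to, count_unhealthy):
        deployed = 0
        for ring_size in ROLLOUT_RINGS:
            push_to(update, ring_size)       # push to the next ring only
            deployed += ring_size
            time.sleep(SOAK_SECONDS)         # wait for agents to report back
            failure_rate = count_unhealthy(update) / deployed
            if failure_rate > MAX_FAILURE_RATE:
                # Halt before the blast radius grows any further.
                raise RuntimeError(
                    f"rollout halted after {deployed} machines: "
                    f"{failure_rate:.1%} unhealthy")
        return deployed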


> On a serious note, we have no way of knowing whether their update passed some QA or not

I think we can infer that it clearly did not go through any meaningful QA.

It is very possible for there to be edge-case configurations that get bricked regardless of how much QA was done. Yes, that happens.

That's not what happened here. They bricked a huge portion of internet-connected Windows machines. If not a single one of those machines was represented in their QA test bank, then either their QA is completely useless, or they ignored the results of QA, which is even worse.

There is no possible interpretation here that doesn't make Crowdstrike look completely incompetent.


If there had been a QA process, the kill rate could not have been as high as it was, because there would have had to be at least one system configuration that wasn't subject to the issue.
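
Even a crude pre-release gate over a handful of configurations would have tripped on a defect that bricks essentially every machine it touches. A minimal sketch, assuming a hypothetical configuration list and a boots_after_update check:

    TEST_CONFIGS = [            # hypothetical configurations, for illustration only
        "win10-22H2-default",
        "win11-23H2-default",
        "win11-23H2-bitlocker",
        "server2019-core",
        "server2022-hyperv",
    ]

    def qa_gate(update, boots_after_update):
        failed = [cfg for cfg in TEST_CONFIGS
                  if not boots_after_update(update, cfg)]
        if failed:
            # Any failing configuration blocks the release outright.
            raise RuntimeError(f"release blocked; failing configs: {failed}")

If the defect takes down every configuration in the list, any single row is enough to stop the release.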


I agree that testing can reduce the probability of having huge problems, but there are still many ways in which a QA process can fail silently, or even pass properly, without giving a good indication of what will happen in production due to data inconsistencies or environmental differences.

Ultimately we don't know if they QA'd the changes at all, if this was data corruption in production, or anything really. What we know for sure is that they didn't have a good story for rollbacks and enforced staggered rollouts.
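
On the rollback side, a "good story" can be as simple as keeping the last known good content around and reverting automatically when the agent crash-loops after an update. A minimal sketch, with hypothetical paths and thresholds:

    import shutil
    from pathlib import Path

    CHANNEL_DIR = Path("/opt/agent/channel")      # hypothetical install layout
    BACKUP_DIR = Path("/opt/agent/channel.lkg")   # last known good copy
    MAX_FAILED_STARTS = 3

    def apply_update(new_files: Path, failed_start_count: int) -> None:
        if failed_start_count >= MAX_FAILED_STARTS:
            # The new content keeps crashing the agent: revert automatically
            # instead of leaving the machine wedged until a human shows up.
            shutil.rmtree(CHANNEL_DIR, ignore_errors=True)
            shutil.copytree(BACKUP_DIR, CHANNEL_DIR)
            return
        # Preserve the current, working content before switching over.
        shutil.rmtree(BACKUP_DIR, ignore_errors=True)
        shutil.copytree(CHANNEL_DIR, BACKUP_DIR)
        shutil.rmtree(CHANNEL_DIR)
        shutil.copytree(new_files, CHANNEL_DIR)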


There's also the possibility that they did do QA, had issues in QA, and were pressured to rush the release anyway.


Unsubstantiated (not even going to bother linking to the green-account, heard-it-from-a-friend comment), but supposedly the fault was introduced by a post-QA process.


My understanding of their argument is that they can't afford the time to see if it breaks the QA fleet, which I agree with GP is not a sufficient argument.



