
The core issue? I'd say it's QA.

Deploy to a QA server fleet first. If stuff is broken, you catch it there. 100% prevention.




It's quite handy that all the things that pass QA never fail in production. :)

On a serious note, we have no way of knowing whether their update passed some QA or not; likely it didn't, but we don't know. Regardless, the post you're replying to, IMHO, correctly makes the point that no matter how good your QA is, it will not catch everything. When something slips, you are going to need good observability and staggered, gradual, rollbackable rollouts.

Ultimately, unless it's a nuclear power plant or something mission-critical with no redundancy, I don't care whether it passes QA; I care that it doesn't cause damage in production.

Had this been halted after bricking 10, 100, 1,000, 10,000, heck, even 100,000 machines or a whopping 1,000,000 machines, it would barely have made it outside of tech news circles.
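
Something like the sketch below is all it would take on the vendor side: push to progressively larger rings and refuse to continue when the failure rate spikes. (The ring sizes, thresholds, and the push_to / count_unhealthy hooks are made up for illustration, not CrowdStrike's actual pipeline.)

    import time

    ROLLOUT_RINGS = [10, 100, 1_000, 10_000, 100_000, 1_000_000]  # machines per stage
    MAX_FAILURE_RATE = 0.01   # halt if more than 1% of deployed machines go dark
    SOAK_SECONDS = 30 * 60    # let telemetry accumulate before judging a ring

    def deploy_update(update, push_to, count_unhealthy):
        deployed = 0
        for ring_size in ROLLOUT_RINGS:
            push_to(update, ring_size)       # push to the next ring only
            deployed += ring_size
            time.sleep(SOAK_SECONDS)         # wait for agents to report back
            failure_rate = count_unhealthy(update) / deployed
            if failure_rate > MAX_FAILURE_RATE:
                # Halt before the blast radius grows any further.
                raise RuntimeError(
                    f"rollout halted after {deployed} machines: "
                    f"{failure_rate:.1%} unhealthy")
        return deployed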


> On a serious note, we have no way of knowing whether their update passed some QA or not

I think we can infer that it clearly did not go through any meaningful QA.

It is very possible for there to be edge-case configurations that get bricked regardless of how much QA was done. Yes, that happens.

That's not what happened here. They bricked a huge portion of internet-connected Windows machines. If not a single one of those machines was represented in their QA test bank, then either their QA is completely useless, or they ignored the results of QA, which is even worse.

There is no possible interpretation here that doesn't make Crowdstrike look completely incompetent.


If there had been a QA process, the kill rate could not have been as high as it was, because there would have had to be at least one system configuration that wasn't subject to the issue.
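
Even a crude pre-release gate over a handful of configurations would have tripped on a defect that bricks essentially every machine it touches. A minimal sketch, assuming a hypothetical configuration list and a boots_after_update check:

    TEST_CONFIGS = [            # hypothetical configurations, for illustration only
        "win10-22H2-default",
        "win11-23H2-default",
        "win11-23H2-bitlocker",
        "server2019-core",
        "server2022-hyperv",
    ]

    def qa_gate(update, boots_after_update):
        failed = [cfg for cfg in TEST_CONFIGS
                  if not boots_after_update(update, cfg)]
        if failed:
            # Any failing configuration blocks the release outright.
            raise RuntimeError(f"release blocked; failing configs: {failed}")

If the defect takes down every configuration in the list, any single row is enough to stop the release.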


I agree that testing can reduce the probability of having huge problems, but there are still many ways in which a QA process can fail silently, or even pass properly, without giving a good indication of what will happen in production due to data inconsistencies or environmental differences.

Ultimately we don't know if they QA'd the changes at all, if this was data corruption in production, or anything really. What we know for sure is that they didn't have a good story for rollbacks and enforced staggered rollouts.
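
On the rollback side, a "good story" can be as simple as keeping the last known good content around and reverting automatically when the agent crash-loops after an update. A minimal sketch, with hypothetical paths and thresholds:

    import shutil
    from pathlib import Path

    CHANNEL_DIR = Path("/opt/agent/channel")      # hypothetical install layout
    BACKUP_DIR = Path("/opt/agent/channel.lkg")   # last known good copy
    MAX_FAILED_STARTS = 3

    def apply_update(new_files: Path, failed_start_count: int) -> None:
        if failed_start_count >= MAX_FAILED_STARTS:
            # The new content keeps crashing the agent: revert automatically
            # instead of leaving the machine wedged until a human shows up.
            shutil.rmtree(CHANNEL_DIR, ignore_errors=True)
            shutil.copytree(BACKUP_DIR, CHANNEL_DIR)
            return
        # Preserve the current, working content before switching over.
        shutil.rmtree(BACKUP_DIR, ignore_errors=True)
        shutil.copytree(CHANNEL_DIR, BACKUP_DIR)
        shutil.rmtree(CHANNEL_DIR)
        shutil.copytree(new_files, CHANNEL_DIR)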


There's also the possibility that they did do QA, had issues in QA, and were pressured to rush the release anyway.


Unsubstantiated (not even going to bother linking to the green-account, heard-it-from-a-friend comment), but supposedly the fault was introduced by a post-QA process.


My understanding of their argument is that they can't afford the time to see if it breaks the QA fleet, which I agree with GP is not a sufficient argument.



