
Lack of a gradual, health-mediated rollout is absolutely the core issue here. False-positive signatures, crash-inducing blocks, etc. will always slip through testing at some rate, no matter how good the testing is. The necessary defense in depth is to roll out ALL changes (binaries, policies, etc.) in a staggered fashion with some kind of health check in between (did more than 10% of the endpoints that received the change go down and stay down right after it was pushed?).
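
A rough sketch of what I mean, assuming a hypothetical fleet API with a push_update() call and a still_down() health probe (all names here are made up for illustration, not anything CrowdStrike actually exposes):

    import time

    WAVES = [0.01, 0.05, 0.25, 1.0]   # cumulative fraction of the fleet per stage
    MAX_DOWN_RATIO = 0.10             # halt if more than 10% of a wave stays down

    def rollout(update, fleet, push_update, still_down):
        pushed = 0
        for frac in WAVES:
            cutoff = int(len(fleet) * frac)
            wave = fleet[pushed:cutoff]
            for host in wave:
                push_update(host, update)
            time.sleep(15 * 60)       # give endpoints time to apply it and report health
            down = [h for h in wave if still_down(h)]
            if wave and len(down) / len(wave) > MAX_DOWN_RATIO:
                raise RuntimeError(f"halting rollout: {len(down)}/{len(wave)} endpoints down")
            pushed = cutoff

Even something this crude turns "brick the whole fleet" into "brick 1% of it and stop".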

Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from follow-up conversations it was quite clear that was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.

You can stagger changes out within a reasonable timeframe - the blocks already take hours/days/weeks to come up with; taking an extra hour or two to trickle a change out gradually, with some basic sanity checks between staggers, is a tradeoff everyone would embrace in order to avoid the disaster we're living through today.

They need a reset on their balance point between security and uptime.




Wow!! Good to know the real reason for the non-staggered release of the software...

> Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from follow-up conversations it was quite clear that was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.


There's some irony there, in that the whole point of CrowdStrike itself is that it does behaviour-based interventions, i.e. it notices "unusual" activity over time and can then react to it autonomously. So them telling you they can't engineer it is kind of like them telling you they don't know how to do a core feature they actually sell and market the product as doing.


The core issue? I'd say it's QA.

Deploy to a QA server fleet first. Stuff is broken. 100% prevention.


It's quite handy that all the things that pass QA never fail in production. :)

On a serious note, we have no way of knowing whether their update passed some QA or not, likely it didn't, but we don't know. Regardless, the post you're replying to, IMHO, correctly makes the point that no matter how good your QA is, it will not catch everything. When something slips, you are going to need good observability and staggered, gradual, rollbackable rollouts.

Ultimately, unless it's a nuclear power plant or something mission critical with no redundancy, I don't care if it passes QA, I care that it doesn't cause damage in production.

Had this been halted after bricking 10, 100, 1,000, 10,000, heck, even 100,000 machines or a whopping 1,000,000 machines, it would barely have made it outside of tech news circles.
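
Even a dumb hard cap on crash telemetry would have done it; something like this (crash_reports_for(), pause_rollout() and revert_to_previous() are hypothetical hooks, and the threshold is made up):

    ABSOLUTE_KILL_THRESHOLD = 1_000   # stop the push after this many hosts crash and stay down

    def guard(release_id, crash_reports_for, pause_rollout, revert_to_previous):
        # endpoints that crashed right after receiving this release and never came back
        dead = crash_reports_for(release_id)
        if len(dead) >= ABSOLUTE_KILL_THRESHOLD:
            pause_rollout(release_id)        # stop pushing the release to anyone new
            revert_to_previous(release_id)   # re-push the last known-good content instead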


> On a serious note, we have no way of knowing whether their update passed some QA or not

I think we can infer that it clearly did not go through any meaningful QA.

It is very possible for there to be edge-case configurations that get bricked regardless of how much QA was done. Yes, that happens.

That's not what happened here. They bricked a huge portion of internet-connected Windows machines. If not a single one of those machines was represented in their QA test bank, then either their QA is completely useless, or they ignored the results of QA, which is even worse.

There is no possible interpretation here that doesn't make Crowdstrike look completely incompetent.


If there had been a QA process, the kill rate could not have been as high as it is, because there'd have to be at least one system configuration that's not subject to the issue.


I agree that testing can reduce the probability of having huge problems, but there are still many ways in which a QA process can fail silently, or even pass properly, without giving a good indication of what will happen in production due to data inconsistencies or environmental differences.

Ultimately we don't know if they QA'd the changes at all, if this was data corruption in production, or anything really. What we know for sure is that they didn't have a good story for rollbacks and enforced staggered rollouts.


There's also the possibility that they did do QA, had issues in QA, and were pressured to rush the release anyway.


Unsubstantiated (not even going to bother to link to the green-account-heard-it-from-a-friend comment), but the claim is that the fault was added by a post-QA process.


My understanding of their argument is that they can't afford the time to see if it breaks the QA fleet. Which I agree with GP is not a sufficient argument.


Yes, one of the first steps of this gradual rollout should be rolling out to your own company, in the classic "eat your own dogfood" style.
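
Concretely, that could just be an ordered list of rings with the vendor's own machines as ring zero; a sketch (the ring names and health-gate hooks are invented for illustration):

    # ordered deployment rings; each must stay healthy before the next one starts
    RINGS = [
        ("internal-dogfood", "the vendor's own laptops and servers"),
        ("canary-customers", "opted-in early-adopter fleets"),
        ("broad-1pct",       "a 1% random sample of all remaining endpoints"),
        ("broad-all",        "everyone else"),
    ]

    def deploy(update, ring_members, push_update, ring_is_healthy):
        for name, _desc in RINGS:
            for host in ring_members(name):
                push_update(host, update)
            if not ring_is_healthy(name):
                raise RuntimeError(f"stopping at ring {name}: health check failed")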


If and when there is a US Cyber Safety Review Board investigation of this incident, documents like that are going to be considered with great interest by the parties involved.

Often it is the engineers working for a heavily invested customer at the sharp end of the coal face who get a glimpse underneath the layers of BS and stare into the abyss.

"This doesn’t look good," they say. "It looks fine from up top! Keep shoveling!" comes the reply.


Sure, gradual rollout seems obviously desirable, but think of it from a liability perspective.

You roll out a patch to 1% of systems, and then a few of the remaining 99% get attacked and they sue you for having a solution but not making it available to them. It won't matter that your sales contract explains that this is how it works and the rollout is gradual and random.

Just a thought.


These suing hypotheticals work both ways - they can also sue you for crashing 100% of their computers - so they don't really explain any decision.


Then push it down to the customer; better yet, provide integration points with other patch management software (no idea if you can integrate with WSUS without doing insane crap, but it's not the only system that handles that, etc.).
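
Even a per-customer policy that the vendor honours before pushing content would cover the "push it down to the customer" part. A hypothetical shape - none of these field names correspond to an actual CrowdStrike or WSUS interface:

    # hypothetical per-tenant update policy, checked by the vendor before a push
    CUSTOMER_POLICY = {
        "content_updates": {
            "channel": "n-1",                 # stay one content version behind the latest
            "defer_hours": 4,                 # wait this long after general availability
            "maintenance_window": "02:00-05:00 UTC",
        },
    }

    def eligible_for_push(policy, release_age_hours, in_window):
        cfg = policy["content_updates"]
        return release_age_hours >= cfg["defer_hours"] and in_window(cfg["maintenance_window"])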


Another version of the "fail big" or "big lie" type phenomenon. Impact 1% of your customers and they sue you saying the gradual rollout demonstrates you had prior knowledge of the risk. Impact 100% of your customers and somehow you get off the hook by declaring it a black swan event that couldn't have been foretold.


Don't you think they will be sued now too?


CS recently went through cost-cutting measures, which is likely why there's no QA fleet to deploy to and no investment in improving their engineering processes.


Were they struggling with paying the employees?


In modern terms, you mean they simply weren't willing to babysit longer install timeframes.



