
Lack of a gradual, health-mediated rollout is absolutely the core issue here. False-positive signatures, crash-inducing blocks, etc. will always slip through testing at some rate, no matter how good the testing is. The necessary defense in depth is to roll out ALL changes (binaries, policies, etc.) in a staggered fashion with some kind of health check in between (did more than 10% of the endpoints that received the change go down and stay down right after it was pushed?).
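
A rough sketch of what I mean, assuming a hypothetical fleet API with a push_update() call and a still_down() health probe (all names here are made up for illustration, not anything CrowdStrike actually exposes):

    import time

    WAVES = [0.01, 0.05, 0.25, 1.0]   # cumulative fraction of the fleet per stage
    MAX_DOWN_RATIO = 0.10             # halt if more than 10% of a wave stays down

    def rollout(update, fleet, push_update, still_down):
        pushed = 0
        for frac in WAVES:
            cutoff = int(len(fleet) * frac)
            wave = fleet[pushed:cutoff]
            for host in wave:
                push_update(host, update)
            time.sleep(15 * 60)       # give endpoints time to apply it and report health
            down = [h for h in wave if still_down(h)]
            if wave and len(down) / len(wave) > MAX_DOWN_RATIO:
                raise RuntimeError(f"halting rollout: {len(down)}/{len(wave)} endpoints down")
            pushed = cutoff

Even something this crude turns "brick the whole fleet" into "brick 1% of it and stop".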

Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from follow-up conversations it was quite clear that was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.

You can stagger changes out within a reasonable timeframe - the blocks already take hours/days/weeks to come up with; taking an extra hour or two to trickle a change out gradually, with some basic sanity checks between staggers, is a tradeoff everyone would embrace in order to avoid the disaster we're living through today.

They need a reset on their balance point between security and uptime.




Wow!! Good to know the real reason for the non-staggered release of the software...

> Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from follow-up conversations it was quite clear that was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.


There's some irony there, in that the whole point of CrowdStrike itself is that it does behaviour-based interventions, i.e. it notices "unusual" activity over time and can then react to it autonomously. So them telling you they can't engineer it is kind of like them telling you they don't know how to do a core feature they actually sell and market the product as doing.


The core issue? I'd say it's QA.

Deploy to a QA server fleet first. Stuff is broken. 100% prevention.


It's quite handy that all the things that pass QA never fail in production. :)

On a serious note, we have no way of knowing whether their update passed some QA or not, likely it didn't, but we don't know. Regardless, the post you're replying to, IMHO, correctly makes the point that no matter how good your QA is, it will not catch everything. When something slips, you are going to need good observability and staggered, gradual, rollbackable rollouts.

Ultimately, unless it's a nuclear power plant or something mission critical with no redundancy, I don't care if it passes QA, I care that it doesn't cause damage in production.

Had this been halted after bricking 10, 100, 1,000, 10,000, heck, even 100,000 machines or a whopping 1,000,000 machines, it would barely have made it outside of tech news circles.
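
Even a dumb hard cap on crash telemetry would have done it; something like this (crash_reports_for(), pause_rollout() and revert_to_previous() are hypothetical hooks, and the threshold is made up):

    ABSOLUTE_KILL_THRESHOLD = 1_000   # stop the push after this many hosts crash and stay down

    def guard(release_id, crash_reports_for, pause_rollout, revert_to_previous):
        # endpoints that crashed right after receiving this release and never came back
        dead = crash_reports_for(release_id)
        if len(dead) >= ABSOLUTE_KILL_THRESHOLD:
            pause_rollout(release_id)        # stop pushing the release to anyone new
            revert_to_previous(release_id)   # re-push the last known-good content instead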


> On a serious note, we have no way of knowing whether their update passed some QA or not

I think we can infer that it clearly did not go through any meaningful QA.

It is very possible for there to be edge-case configurations that get bricked regardless of how much QA was done. Yes, that happens.

That's not what happened here. They bricked a huge portion of internet-connected Windows machines. If not a single one of those machines was represented in their QA test bank, then either their QA is completely useless, or they ignored the results of QA, which is even worse.

There is no possible interpretation here that doesn't make Crowdstrike look completely incompetent.


If there had been a QA process, the kill rate could not have been as high as it is, because there'd have to be at least one system configuration that's not subject to the issue.


I agree that testing can reduce the probability of having huge problems, but there are still many ways in which a QA process can fail silently, or even pass properly, without giving a good indication of what will happen in production due to data inconsistencies or environmental differences.

Ultimately we don't know if they QA'd the changes at all, if this was data corruption in production, or anything really. What we know for sure is that they didn't have a good story for rollbacks and enforced staggered rollouts.


There's also the possibility that they did do QA, had issues in QA, and were pressured to rush the release anyway.


Unsubstantiated (not even going to bother to link to the green-account-heard-it-from-a-friend comment), but the claim is that the fault was added by a post-QA process.


My understanding of their argument is that they can't afford the time to see if it breaks the QA fleet. Which I agree with GP is not a sufficient argument.


Yes, one of the first steps of this gradual rollout should be rolling out to your own company, in the classic "eat your own dogfood" style.
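
Concretely, that could just be an ordered list of rings with the vendor's own machines as ring zero; a sketch (the ring names and health-gate hooks are invented for illustration):

    # ordered deployment rings; each must stay healthy before the next one starts
    RINGS = [
        ("internal-dogfood", "the vendor's own laptops and servers"),
        ("canary-customers", "opted-in early-adopter fleets"),
        ("broad-1pct",       "a 1% random sample of all remaining endpoints"),
        ("broad-all",        "everyone else"),
    ]

    def deploy(update, ring_members, push_update, ring_is_healthy):
        for name, _desc in RINGS:
            for host in ring_members(name):
                push_update(host, update)
            if not ring_is_healthy(name):
                raise RuntimeError(f"stopping at ring {name}: health check failed")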


If and when there is a US Cyber Safety Review Board investigation of this incident, documents like that are going to be considered with great interest by the parties involved.

Often it is the engineers working for a heavily invested customer at the sharp end of the coal face who get a glimpse underneath the layers of BS and stare into the abyss.

"This doesn’t look good," they say. "It looks fine from up top! Keep shoveling!" comes the reply.


Sure, gradual rollout seems obviously desirable, but think of it from a liability perspective.

You roll out a patch to 1% of systems, and then a few of the remaining 99% get attacked and they sue you for having a solution but not making it available to them. It won't matter that your sales contract explains that this is how it works and the rollout is gradual and random.

Just a thought.


These suing hypotheticals work both ways - they can also sue you for crashing 100% of their computers - so they don't really explain any decision.


Then push it down to the customer; better yet, provide integration points with other patch management software (no idea if you can integrate with WSUS without doing insane crap, but it's not the only system that handles that, etc.).
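
Even a per-customer policy that the vendor honours before pushing content would cover the "push it down to the customer" part. A hypothetical shape - none of these field names correspond to an actual CrowdStrike or WSUS interface:

    # hypothetical per-tenant update policy, checked by the vendor before a push
    CUSTOMER_POLICY = {
        "content_updates": {
            "channel": "n-1",                 # stay one content version behind the latest
            "defer_hours": 4,                 # wait this long after general availability
            "maintenance_window": "02:00-05:00 UTC",
        },
    }

    def eligible_for_push(policy, release_age_hours, in_window):
        cfg = policy["content_updates"]
        return release_age_hours >= cfg["defer_hours"] and in_window(cfg["maintenance_window"])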


Another version of the "fail big" or "big lie" type phenomenon. Impact 1% of your customers and they sue you saying the gradual rollout demonstrates you had prior knowledge of the risk. Impact 100% of your customers and somehow you get off the hook by declaring it a black swan event that couldn't have been foretold.


Don't you think they will be sued now too?


CS recently went through cost-cutting measures, which is likely why there's no QA fleet to deploy to and no investment in improving their engineering processes.


Were they struggling with paying the employees?


In modern terms, you mean they simply weren't willing to babysit longer install timeframes.



