
They literally half-assed their deployment process - one part enterprisey, one part "move fast and break things".

Guess which part took down much of the corporate world?

from Preliminary Post Incident Review at https://www.crowdstrike.com/falcon-content-update-remediatio... :

"CrowdStrike delivers security content configuration updates to our sensors in two ways: Sensor Content that is shipped with our sensor directly, and Rapid Response Content that is designed to respond to the changing threat landscape at operational speed.

...

The sensor release process begins with automated testing, both prior to and after merging into our code base. This includes unit testing, integration testing, performance testing and stress testing. This culminates in a staged sensor rollout process that starts with dogfooding internally at CrowdStrike, followed by early adopters. It is then made generally available to customers. Customers then have the option of selecting which parts of their fleet should install the latest sensor release (‘N’), or one version older (‘N-1’) or two versions older (‘N-2’) through Sensor Update Policies.

The event of Friday, July 19, 2024 was not triggered by Sensor Content, which is only delivered with the release of an updated Falcon sensor. Customers have complete control over the deployment of the sensor — which includes Sensor Content and Template Types.

...

Rapid Response Content is used to perform a variety of behavioral pattern-matching operations on the sensor using a highly optimized engine.

Newly released Template Types are stress tested across many aspects, such as resource utilization, system performance impact and event volume. For each Template Type, a specific Template Instance is used to stress test the Template Type by matching against any possible value of the associated data fields to identify adverse system interactions.

Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published.

On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production."
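For anyone skimming the terminology above: a Template Type declares which data fields a piece of Rapid Response Content can supply, a Template Instance fills in concrete values, and the Content Validator checks instances before they are published. A minimal sketch of the kind of pre-publish check such a validator could perform is below; the type and function names are illustrative, not CrowdStrike's actual code, and per the review a bug in the real validator let a problematic instance through.

    // Hypothetical sketch of a pre-publish validation check, assuming a
    // Template Type declares how many fields the sensor's interpreter will
    // read and a Template Instance supplies concrete values.
    // Names are illustrative, not CrowdStrike's code.
    #include <cstddef>
    #include <string>
    #include <vector>

    struct TemplateType {
        std::string name;
        std::size_t expected_field_count;  // fields the interpreter will index into
    };

    struct TemplateInstance {
        std::string name;
        std::vector<std::string> fields;   // values supplied by this instance
    };

    // Reject an instance whose field count does not match what the sensor expects;
    // per the review, one instance passed validation despite problematic content.
    bool validate_instance(const TemplateType& type, const TemplateInstance& inst) {
        return inst.fields.size() == type.expected_field_count;
    }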



> one part enterprisey, one part "move fast and break things".

When there's a 0-day, how enterprisey would you like to be while catching that 0-day?


Not sure, but definitely more enterprisey than "release a patch to the entire world at once before running it on a single machine in-house".
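For concreteness, content updates could be gated through the same kind of rings the sensor releases already go through, starting with internal canaries. A rough sketch of such a rollout gate; the ring names, soak times, and function names are entirely made up for illustration:

    // Hypothetical staged-rollout gate: a content update reaches a host only
    // after earlier rings (internal canaries, early adopters) have had time to
    // absorb it without reported crashes. All names and durations are invented.
    #include <chrono>
    #include <cstdint>

    enum class Ring : std::uint8_t { InternalCanary = 0, EarlyAdopter = 1, General = 2 };

    // Minimum time an update must have been live before it reaches a given ring.
    std::chrono::hours soak_time(Ring ring) {
        switch (ring) {
            case Ring::InternalCanary: return std::chrono::hours{0};
            case Ring::EarlyAdopter:   return std::chrono::hours{2};
            case Ring::General:        return std::chrono::hours{24};
        }
        return std::chrono::hours{24};
    }

    // Deliver only once the ring's soak time has elapsed and no earlier ring
    // has reported crashes.
    bool should_deliver(Ring host_ring,
                        std::chrono::hours age_of_update,
                        bool earlier_ring_reported_crashes) {
        return age_of_update >= soak_time(host_ring) && !earlier_ring_reported_crashes;
    }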


So it would be preferable to have your data encrypted, taken hostage unless you pay, and be down for days, instead of just 6 hours of downtime?


Do you seriously believe that all CrowdStrike-on-Windows customers were at such imminent risk of ransomware that taking one or two hours to run this on a single internal setup, and catching the critical error they released, would have been dangerous?

This is a ludicrous position, and it has been proven obviously false by what followed: the systems that were crashed by this critical failure were not, in fact, attacked with ransomware once the CS agent was uninstalled (at great pain).


I'd challenge you to be a CISO :)

You don't want to be in a situation where you're taken hostage and asked for a hundred-million-dollar ransom just because you were too slow to mitigate the situation.


That's a false dichotomy


Crowdstrike exploited their own 0-day. Their market cap went down by several billion dollars.

A patch should, at minimum:

1. Let the app run

2a. Block the offending behaviour

2b. Allow normal behaviour

Part 1 can be assumed if Parts 2a and 2b work correctly.

We know CrowdStrike didn't ensure 2a or 2b, since the patch caused a fault in the app that took the machine down with it.
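Put differently, a defensively written sensor would treat a malformed channel file as untrusted input and keep running on its previous rules instead of faulting. A rough sketch of that fail-safe idea, with hypothetical names and a deliberately simplified file format, not CrowdStrike's actual loader:

    // Hypothetical fail-safe content loading: if a freshly delivered channel
    // file does not validate, refuse to load it and keep the last known-good
    // rules, rather than faulting in the component that enforces them.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <optional>
    #include <vector>

    struct ParsedContent {
        std::vector<std::uint32_t> inputs;
    };

    // Returns parsed content only if the blob is structurally sound.
    std::optional<ParsedContent> parse_channel_file(const std::vector<std::uint8_t>& blob,
                                                    std::size_t expected_inputs) {
        if (blob.size() < expected_inputs * sizeof(std::uint32_t)) {
            return std::nullopt;  // malformed: refuse to load rather than read past the end
        }
        ParsedContent content;
        content.inputs.resize(expected_inputs);
        std::memcpy(content.inputs.data(), blob.data(),
                    expected_inputs * sizeof(std::uint32_t));
        return content;
    }

    // A caller that keeps its last known-good rules when the new file is rejected
    // preserves 1 (the app keeps running) while 2a and 2b continue on the old content.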

CrowdStrike's Root Cause Analysis, https://www.crowdstrike.com/wp-content/uploads/2024/08/Chann..., lists what they're going to do:

====

Mitigation: Validate the number of input fields in the Template Type at sensor compile time

Mitigation: Add runtime input array bounds checks to the Content Interpreter for Rapid Response Content in Channel File 291

- An additional check that the size of the input array matches the number of inputs expected by the Rapid Response Content was added at the same time.

- We have completed fuzz testing of the Channel 291 Template Type and are expanding it to additional Rapid Response Content handlers in the sensor.

Mitigation: Correct the number of inputs provided by the IPC Template Type

Mitigation: Increase test coverage during Template Type development

Mitigation: Create additional checks in the Content Validator

Mitigation: Prevent the creation of problematic Channel 291 files

Mitigation: Update Content Configuration System test procedures

Mitigation: The Content Configuration System has been updated with additional deployment layers and acceptance checks

Mitigation: Provide customer control over the deployment of Rapid Response Content updates

====
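The second mitigation in the list above (runtime input array bounds checks in the Content Interpreter) is the one that most directly addresses the crash: per the RCA, the interpreter tried to read a 21st input value when only 20 were supplied. A minimal sketch of such a check, with illustrative names rather than CrowdStrike's real interpreter API:

    // Hypothetical runtime bounds check: verify the index against the actual
    // input count before dereferencing, instead of trusting that the content
    // supplied as many fields as the interpreter expects.
    #include <cstddef>
    #include <optional>
    #include <string>
    #include <vector>

    std::optional<std::string> read_input(const std::vector<std::string>& inputs,
                                          std::size_t index) {
        if (index >= inputs.size()) {
            return std::nullopt;  // out-of-range access is reported, not performed
        }
        return inputs[index];
    }

    // Without such a check, code along the lines of inputs[index] with index == 20
    // and only 20 elements present (valid indices 0..19) reads past the end of the
    // array; in kernel mode that can take down the whole machine.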



