Let's blame the dev who pressed "Deploy"

Retr0id · 2024-07-21T13:03:33 1721567013

A slightly bewildering fact is that CrowdStrike's terms and conditions say not to use it for critical infrastructure:

> Neither the offerings nor crowdstrike tools are for use in the operation of aircraft navigation, nuclear facilities, communication systems, weapons systems, direct or indirect life-support systems, air traffic control, or any application or installation where failure could result in death, severe physical injury, or property damage.

(Originally in all-caps, presumably to make it sound more legally binding)

https://www.crowdstrike.com/terms-conditions/

lbourdages · 2024-07-21T13:06:23 1721567183

That's a pretty standard clause I've seen in every software ToS I ever gazed my eyes upon.

As far as I know, those applications weren't affected either - airports were affected because the ticketing systems were offline. Companies not using Crowdstrike were able to fly just fine.

Edit: 911 was affected in many places, and that is definitely concerning as it is very much life critical.

Retr0id · 2024-07-21T13:10:56 1721567456

Sure, the small-print is always going to say "anything bad that happens is someone else's fault", but it's a bit embarrassing when you put it next to their marketing materials.

01HNNWZ0MV43FF · 2024-07-21T15:01:31 1721574091

I recently was asked to sign a thing saying it's my fault if I die while getting my hair colored

gibbitz · 2024-07-21T19:57:30 1721591850

Oddly this is a thing: https://en.m.wikipedia.org/wiki/Vertebrobasilar_insufficienc...

01HNNWZ0MV43FF · 2024-07-29T19:36:55 1722281815

Interesting! Happy to report I survived

1vuio0pswjnm7 · 2024-07-22T03:33:58 1721619238

"Sure the small-print is always going to say "anything bad that happens is someone else's fault", ..."

But that's not what this "small-print" says, i.e., the text you quoted and the terms and conditions page linked to in your comment.

Generally, it is not possible to disclaim liability for death, physical injury or property damage. The Crowdstrike disclaimer does not attempt to do so.

Nor do the terms ask the software user to assume any risk of death, physical injury or property damage. (Except for a warning about using "Malware Samples".)

tetris11 · 2024-07-21T13:10:56 1721567456

Do they follow their own terms though, or do they still try to sell this software to hospitals without mentioning the fine print.

throwaway55671 · 2024-07-21T13:25:04 1721568304

Do you mean hospitals, 911 dispatch centers and other critical infrastructure buys and deploys software without having the legal department carefully analyze the terms and conditions, based on the marketing materials only?

soco · 2024-07-21T14:46:20 1721573180

Yes. I just googled and there are a-plenty. Even Crowdstrike has a dedicated offering page for hospitals. Not sure what those legal departments you mentioned are paid for, by the way. Maybe only to silence whistleblowers?

mannykannot · 2024-07-21T13:47:00 1721569620

Somewhat ironically, what CrowdStrike is doing here and what the author of this article seems to want are two manifestations of the same desire: to be compensated for one's expertise and judgement without bearing any responsibility when it goes wrong.

In any reasonably rational system, there should be some sort of balance between the two. That is hard to reach, as the various stakeholders are fundamentally at odds with one another, but it is easily believable that the dev who presses 'deploy' rarely has much of either.

port19 · 2024-07-22T07:34:53 1721633693

This is exactly right. Disclaimer inflation has made disclaimers worthless since noone wants to bear responsibility unless absolutely necessary

belter · 2024-07-21T15:19:59 1721575199

And the Windows license terms, offers no guarantees of any kind, while it seems from this outage, is used at 911 response centers and hospitals emergency and surgery rooms...

" Disclaimer. Neither Microsoft, nor the device manufacturer or installer, gives any other express warranties, guarantees, or conditions. Microsoft and the devicemanufacturer and installerexclude all implied warranties and conditions, including those of merchantability, fitness for a particular purpose, and non-infringement..."

https://www.microsoft.com/en-us/Useterms/Retail/Windows/10/U...

seiferteric · 2024-07-21T13:50:13 1721569813

If you think about it, any software update can make something bad happen. Maybe we just shouldn't be running bare-metal installs for critical infra in the first place. Just run a hypervisor taking snapshots daily or before updates and if it bricks a system it restarts into the last good image. Critical infra should be much more resilient and defensive than it is.

giantg2 · 2024-07-21T14:04:23 1721570663

That only works if it's self contained. As soon as you need to reach out and interact with other systems, version mismatches can still cause issues even if your system hasn't changed. Integration points and input are the biggest points of failure or vulnerabilities. A system that doesn't take input or integrate with others is super easy to maintain (of course most are practically useless too).

Retr0id · 2024-07-21T13:53:43 1721570023

Easier said than done, since a hypervisor is yet another moving part that can fail.

People were running Crowdstrike on AWS (which in most (all?) cases involves a hypervisor), and recovery was still painful for them.

hypeatei · 2024-07-21T13:56:44 1721570204

That, or just simplifying these systems with clear boundaries between them. Why are flight status displays running Windows with endpoint security in the first place? Data could be loaded periodically from one authorized machine/job on the network with all other outbound/inbound network access restricted.

alexlll862 · 2024-07-21T15:29:59 1721575799

your hypervisor needs security updates too.

Zambyte · 2024-07-21T13:54:27 1721570067

Maybe it'd time to concede that Tanenbaum was right.

CoastalCoder · 2024-07-21T14:11:25 1721571085

What does the law say about situations where a salesman's promises are at odds with the EULA?

lanthade · 2024-07-21T18:08:27 1721585307

My bet is all it says is the lawyers will be getting rich.

SebFender · 2024-07-22T11:55:22 1721649322

These days - similar clauses won't hold.

lupire · 2024-07-21T13:10:50 1721567450

Critical infrastructure should not use Internet-enabled machines.

mingus88 · 2024-07-21T13:54:44 1721570084

You will still see security teams pushing endpoint protection with kernel level observability onto air-gapped systems, so this issue still exists

From my own limited experience, those air gapped systems are often no more well managed than anything else. Perhaps having one more hop between the update channel and the secure network is enough to catch crowdstrike, but don’t be surprised if it isn’t.

hypeatei · 2024-07-21T15:35:28 1721576128

> You will still see security teams pushing endpoint protection with kernel level observability onto air-gapped systems

Why though? Is it just "because we do it on every other machine", scared to fail audit, or what? Obviously the regulatory environment is a problem but IT incompetence is also another.

netsharc · 2024-07-21T15:54:49 1721577289

I imagined in the past that super-critical infrastructure like the military would have a parallel network, even down to using different cables in the ground, and wouldn't share any of the IP layers with the public Internet. I guess this was feasible a few decades ago, but not nowadays.

arcfour · 2024-07-21T21:16:02 1721596562

Heard of SIPRNet, NIPRNet, and JIWCS?

worthless-trash · 2024-07-21T13:50:29 1721569829

I've had this discussion with many people involved in country infra. It falls on deaf ears. They talk about 'acceptable risk' and 'minimizing their long term costs' just to stay afloat while posting multi billion dollar profits.

I don't know where the delusion comes from, but some of them to go see the folly of their mistakes in the last few days. They will ride this off in the excuse that 'it affected so many other people, see, its not my fault !'.

AmericanChopper · 2024-07-21T13:53:23 1721570003

How can you run your internet backbone without connecting it to the internet?

worthless-trash · 2024-07-22T11:11:53 1721646713

Infrastructure , ie.. water, power, electricity, traffic, dam control, hydrostations, sewerage etc.

AmericanChopper · 2024-07-22T17:46:08 1721670368

My point is that a lot of critical infrastructure needs to be connected to the internet, including network (as I mentioned) and traffic control systems (as you mentioned).

worthless-trash · 2024-07-23T03:56:40 1721707000

I now see where you stand, and we're going to have to disagree. I've heard all the points that your side can make 20 times over and none of them are convincing. I have also learned that when people use the word "need" they have their minds made up, so no point continuing this discussion, have a good one.

AmericanChopper · 2024-07-23T05:29:22 1721712562

Well you can't run your internet backbone at all unless you want to connect it to the internet. For something like a network of traffic light controllers, traffic sensors, variable speed signs, traffic monitoring cameras, ect... You either need to connect it to the internet, or some sort of incredible large intranet that would be distributed over such a large area as to only marginally reduce the attack surface. Unlike say a power plant, where all of the controllers, and all of the systems they are controlling can be contained within a single facility.

Aurornis · 2024-07-21T13:27:38 1721568458

Responsibility lies on large numbers of processes, teams, and people.

I thought this blog might have some substance about proper postmortem investigations and how to evaluate and address the circumstances that led to a failure like this, but it has none of that. It’s just a very angry rant about CEOs and middle management. The premise is that engineers can’t bear any responsibility for their actions because they don’t get “respect”

This has to be the 10th time I’ve seen arguments that “blame” is the right action in this case, but with the key exception that we’re only allowed to blame people other than the engineers. The last article was a lengthy rant about how it’s actually QA’s fault and engineers shouldn’t be expected to ensure their own code is correct, therefore engineers are blameless.

This is empty calories for people who like ragebait, but nothing more.

mondrian · 2024-07-21T13:35:17 1721568917

> because they don’t get “respect”

Nah, that's not what the blog post says. It says engineers aren't given sufficient responsibility, so that they can assume the respective blame, when something goes wrong.

If an engineer says it would take 1 month to build a feature so it's sufficiently reliable, and the manager says "nah that's ridiculous, you have 1 week", then when the cobbled together feature breaks in production the manager should take the blame, since they effectively took the responsibility away from the engineer, upon themselves.

9dev · 2024-07-21T13:51:00 1721569860

I used to think that way, until I had my own junior developers, and felt like that guy trying to bake a cake with three beagles: They veer off, constantly, in all directions.

Developers deep in the trenches tend to have a bad feeling for business requirements or constraints; coupled with a knack for perfectionism and premature optimisation, that really often results in ridiculous time frames that are just plain unrealistic and would ruin the organisation long term.

I don’t have any profound insights, though: The only sane mantra can be keeping things in balance. Too much management, you drive your devs insane; too much engineer control, and the architecture astronauts reinvent the wheel every other day.

strken · 2024-07-21T14:18:28 1721571508

It's hard to tell whether perfectionism means doing a pointless refactor to cloud microservice event-driven k8s buzzword Rust WASM-on-the-server sharded graph databases, or whether it means spending 30 minutes putting a password on that MongoDB instance your data science team wants to load prod data into.

9dev · 2024-07-21T14:26:35 1721571995

Yeah, exactly. That’s why you want a culture of collaboration between engineers and managers to come together and decide what is important. It’s just hard to keep that up in practice, especially if a company grows.

happymellon · 2024-07-21T14:55:41 1721573741

> Developers deep in the trenches tend to have a bad feeling for business requirements

In my experience, this is because the developers are removed from interacting with the business. How are they supposed to make good decisions if they don't talk to their customers and understand what they are aiming for?

And I don't mean have the business folks show a roadmap once a year.

9dev · 2024-07-21T15:20:04 1721575204

I wasn’t assigning blame here, but rather preaching the virtues of balancing the interests of engineering and business folks.

mondrian · 2024-07-21T17:15:25 1721582125

> They veer off, constantly, in all directions.

I have a similar opinion of being micromanaged. The micromanager is like a chaos demon that keeps pointing me in random directions. I lose all internal vision / intuition and turn into an unhappy task robot.

AmericanChopper · 2024-07-21T13:56:55 1721570215

> If an engineer says it would take 1 month to build a feature so it's sufficiently reliable, and the manager says "nah that's ridiculous, you have 1 week"

This is just standard corporate accountability avoidance on the side of the engineer though. Most people don’t want to be accountable for any risk so they advise against it, or give impractical advice, so that somebody else has to make the decision and hold the accountability.

handwarmers · 2024-07-21T13:33:18 1721568798

The blog responds to a very particular aspect of the fallout from the crowdstrike outage. The "Responsibility lies on large numbers of processes, teams, and people" was actually addressed in the article. It makes the case that executives claim that responsibility is correlated to pay. All the author asks in this case is for them to walk the walk.

Go back to middle management.

skwee357 · 2024-07-21T13:48:55 1721569735

Hey, op here.

The premise of the post was a response to the ridiculous claim that when something goes bad, we need to blame the engineer(s) who pressed the button.

I tried, through rant, demonstrate that there are other people to blame, starting from politicians who are incompetent in what they do, to CEOs who get compensated for taking the risk, to managers who cut corners, etc.

The culmination of the post is that if you want o blame someone, you might as well blame any of the involved parties. But instead, if we want to prevent such issues in the future, we need to understand that the entire process or broken, rather than throwing individuals under the bus.

I hope this clarifies it a bit

moomin · 2024-07-21T17:49:49 1721584189

The only bone I’d pick with the article is blaming regulations. The regulations in question rarely say anything particularly boneheaded. Blanket compliance culture interprets those regulations in boneheaded ways. Because to do it any other way would be much more expensive.

ghaff · 2024-07-21T13:36:29 1721568989

Basically, when a major incident occurs the "correct" action is to throw people in jail. Those people, depending on your political persuasion, are either the person who pushed the button, middle managers (because they're the convenient scapegoats), or the C-suite.

ClumsyPilot · 2024-07-21T13:53:55 1721570035

> Those people, depending on your political persuasion, are either the person who pushed the button

That’s not how it works in any industry, ever. A single person can’t launch nukes, blow up a reactor, collapse a bridge, or otherwise cause billions of dollars in damages by accidents pressing one button.

The systems are designed to prevent that.

ghaff · 2024-07-21T14:46:00 1721573160

I think we're in agreement--a fat finger shouldn't be able to cause a disaster--but there often seems to be a sentiment of kill them all and let god sort them out.

belter · 2024-07-21T15:48:56 1721576936

So there will be congress public hearings on this, and the CEO will probably be called to testify. The company has been heavily outsourcing development to India, and it's not the kind of cultural environment where a Developer is going to push back and request more time for testing.

CEO will of course blame some low level employees who did not follow procedures...

"CrowdStrike Significantly Invests in India Operations to Continue Protecting Businesses from Modern Cyber Attacks" - https://www.crowdstrike.com/press-releases/crowdstrike-inves...

JKCalhoun · 2024-07-21T13:16:59 1721567819

Sure, blame the engineer(s).

And from now on every engineer now has the autonomy to refuse to use certain libraries and software stacks they are unfamiliar with, can refuse to submit a change, will have control over whether to push out software they worked on, etc.

And a huge pay bonus as well since they have all this CEO-like risk/responsibility now.

jfim · 2024-07-21T14:13:00 1721571180

That level of agency exists today. Any software developer can tell their manager that no, they're not doing X. The risk is that they get written up for insubordination or have to look for new employment.

What the author of the article doesn't know is that the structural engineer can also get fired if management deems the engineer too troublesome. The only difference is that the PE has a code of ethics and professional responsibility to uphold, and failing to do so would mean risking their license to practice engineering.

betaby · 2024-07-21T14:25:56 1721571956

In practice PE rarely held accountable in Quebec.

strken · 2024-07-21T14:30:15 1721572215

The most important difference is that in practice the SWE can't go and report the employer to anyone for breaching e.g. occupational health and safety laws, or any others, because those laws don't exist for the majority of software flaws. At best you have privacy laws like the GDPR or industry-specific ones covering e.g. medical devices...but CrowdStrike isn't meant to run in those environments anyway, see, look at this text buried in the license agreement.

mariusor · 2024-07-21T13:53:34 1721570014

Frankly, sign me up for that job, please.

pylua · 2024-07-21T13:08:41 1721567321

The practice of building software would have to completely be flipped on its head to support professional liability. Agile would be the first casualty.

ClumsyPilot · 2024-07-21T13:44:24 1721569464

I think Boeing clearly tried to apply Agile to their planes

synicalx · 2024-07-22T00:32:02 1721608322

And their doors as well apparently, those are VERY agile.

jmclnx · 2024-07-21T13:36:55 1721569015

>and that they need 3 months of development then you better shut the f** up and let them do their job.

This in a way covers the issue. Never in all my decades developing have had an estimate accepted. All the developers get is "It must be done by ...., no exceptions".

So you end up pulling all nighters for weeks, and because you are tired, errors or bad decisions creep in. So yes, the issue is fully with upper management.

I left a company because a big project was about to start and I could see it would be a big cluster**. People who stayed told me that is what happened.

chrisjj · 2024-07-21T11:19:02 1721560742

> it makes sense to run EDR on a mission-critical machine, but on a dumb display of information

Major logic error. See how much chaos ensued when that display went down? That's why it needs protection to keep the display up. Dumb or not is irrelevent. It is mission-critical regardless.

skwee357 · 2024-07-21T11:38:26 1721561906

My point was, why a display at a check in counter needs to run EDR software to begin with? Why can't it run a locked down, slimmed version of Windows in an isolated network that has very low potential to get malware on it?

I know why, because there is, probably, a regulation that says that if you run an airline company, you need to have malware protection on all machines. I bet, some IT guy even tried to question the need to run EDR on a non-mission-critical machine, but he was stopped by a wall of "it is what it is".

passwordoops · 2024-07-21T12:59:54 1721566794

"I know why, because there is, probably, a regulation"

Instead of assuming a regulation and writing a blog about it, do the research and find out. To quote the irreplaceable Benny Hill, "You mustn't assume, because it will make an ass out of you and me."

Also, and more important, why default to regulation and not airline directors pushing ill-advised modernization strategies pushed by M$?

ses1984 · 2024-07-21T13:16:02 1721567762

I've done the research. It’s called first hand experience. I was the guy making the arguments, that controls we already have in place obviate the need for edr everywhere, but I was told it doesn’t matter, gotta check the box.

passwordoops · 2024-07-21T13:52:54 1721569974

Awesome. Is the box there because of government regulation or someone in corporate deciding it's necessary?

betaby · 2024-07-21T14:27:57 1721572077

Corporation and consultants mostly, judging from my experience. If asked about precise law or regulation they just wave hands.

ses1984 · 2024-07-21T15:00:52 1721574052

In my case it was a pci-dss (payment card industry data security standard) audit.

moomin · 2024-07-21T17:58:45 1721584725

The thing is, you read regulations, and they pretty much always tell you to do something, but it’s always heavily principle based. Companies are left with extraordinary leeway as to how these regulations are actually implemented.

ses1984 · 2024-07-21T20:56:47 1721595407

You’re right, which I also used in my argument, but I was shot down by our own people, because their success metrics were based on passing the audit with the least amount of fuss.

We kept our other controls, we just added edr as well, because just having it appeased auditors. If you try to explain to an auditor your other controls, it could change a part of the audit from five minutes to multiple days.

We don’t use crowdstrike, but this was years ago.

edent · 2024-07-21T13:45:15 1721569515

FWIW, I don't think that's by Benny Hill - https://quoteinvestigator.com/2021/02/08/assume/

Diggsey · 2024-07-21T13:38:57 1721569137

As someone who works in this space, I can tell you: it's because big companies buy Cyber Security Insurance, and the insurance forms have a checkbox along the lines of "do you run Endpoint Security Software on all devices connected to your network", and if you check the box you save millions of dollars on the insurance (no exageration here). Similarly, if you sell software services to enterprises, the buyers send out similar due diligence forms which require you as a vendor to attest that you run Endpoint Security Software on all devices, or else you won't make the sale. This propagates down the whole supply chain, with the instigator being the Cyber Security insurance costs, regulation or simply perceived competence depending on the situation.

So it's not necessarily government regulation per se, but a combination of things:

1. It's much safer (in terms of personal liability) for the decision makers at large companies to follow "standard industry practices" (however ridiculous they are). For example, no-one will get fired outside of Crowd Strike for this incident precisely because everyone was affected. "How could we have foreseen this when noone else did?"

2. The Cyber Security Insurance provider may not cover this kind of incident given there was no breach and so as far as they are concerned installing something like Crowd Strike is always profitable.

3. The insurance provider has no way to effectively evaluate the security posture of the enterprise they are insuring, so rely on basic indicators such as this checkbox, which completely eliminates any nuance and leads to worse outcomes (but not to the insurance provider!)

4. "Bad checkboxes" propagate down the supply chain the same way that "good checkboxes" do (eg. there are generally sections on these due diligence questionnaires about modern slavery regulation, and that's something you really want to propagate down the supply chain!)

Overall I would say the main cause of this issue is simply "big organisation problems". At a certain scale it seems to become impossible for everyone within the organization to commicate effectively and to make correct, nuanced decisions. This leads to the people at the top seeing these huge (and potentially real) risks to the business because of their lack of information. The person ultimately in charge of security can't scale to understand every piece of software, and so ends up having to make organisation-wide decisions with next to no information. The entire thing is a house of cards that noone can let fall down because it's simply too big to fail.

Making these large organisations work effectively is a very hard problem, but I think any solution must involve mechanisms to allow parts of the business to fail withing taking everything down. Allowing more decisions to be taken locally, but also the responsibilities and repercussions of those decisions to be felt locally.

hypeatei · 2024-07-21T14:11:14 1721571074

Yes, "cyber insurance" is a common driver behind these awful security and system decisions. For example, my company requires password changes every 90-days even though NIST recommends against that. But hey, we're meeting insurance requirements!

0xBDB · 2024-07-22T21:09:51 1721682591

>My point was, why a display at a check in counter needs to run EDR software to begin with?

Because the thermostat on a fish tank has been used as a critical entry point into a casino network[1], and the point of EDR is not just to prevent that sort of thing if possible but also provide the telemetry into a SIEM for incident responders to know that it has happened after the fact and get the adversary out. So there is value in running it anywhere it can run.

I've seen a lot of contempt on HN threads today for compliance regulations and insurance demands that require things like EDR be installed where possible. As a Red Teamer I used to share that contempt for the non-technical types, but I don't now. It's true compliance is not security, but also true that Chesterton's Fence should apply here: just because you shouldn't be checking the box blindly doesn't mean you shouldn't be either checking it or documenting why not. The people who created the box were (probably) not actually idiots. It's there because somebody else had a very bad day.

1. https://www.washingtonpost.com/news/innovations/wp/2017/07/2...

hug · 2024-07-21T13:41:39 1721569299

Low potential is not no potential, and most everyone is looking for swiss-cheese defense when it comes to these devices.

In the case of a display at a check-in counter:

- The display needs to be on a network, because it needs to collect information from elsewhere to display it.

- It's on a network, so it needs to be kept updated, because a compromised host elsewhere on the same network will be able to compromise it, and anyway the display vendor won't support you if your product is nine versions behind current.

- Since it needs updates for various components, it almost certainly needs some amount of outbound internet access, and it's also vulnerable to supply-chain attacks from those updates.

- Since it is on a network, and has internet access, it needs to be running some kind of EDR or feed for a SIEM, because it is compromisable and the last thing you want is an unmonitored compromised host on your internal network talking back to C2.

Anything that can be used for lateral movement will be used for lateral movement, and if we can get logs from it we want logs from it. A cross-platform EDR solution is perfect for these scenarios.

chrisjj · 2024-07-21T15:51:02 1721577062

Agreed. Re:

"- It's on a network, so it needs to be kept updated, because a compromised host elsewhere on the same network will be able to compromise it"

the suggested solution was "an isolated network".0

The problem there is the operator would have to use SD cards to update the adverts... :)

chrisjj · 2024-07-21T13:34:06 1721568846

Because isolating the display and every machine to which it is necessarily connected obstructs monitoring and greatly increases the cost and delay of fixes if something should go wrong.

Also I doubt any slimmed version of Windows is sufficiently malware proof without added EPS.

sussmannbaka · 2024-07-21T13:43:19 1721569399

the more software you add to the display, including EDR, the higher the chance it will go down. you don’t reduce attack surface by introducing more surface and you don’t improve uptime by introducing additional single point of failures.

chrisjj · 2024-07-21T15:58:36 1721577516

> the more software you add to the display, including EDR, the higher the chance it will go down

I am pretty sure that removing EDR from an internetted Windows airport display would ensure it goes down. And probably then come up with a ransomware demand.

sussmannbaka · 2024-07-22T06:18:54 1721629134

Why would it do that? How would a ransomware demand come up on it? Honestly, please tell me because this sounds utterly outlandish to me.

chrisjj · 2024-07-22T13:16:45 1721654205

It would do that because the internet is awash with malware.

sussmannbaka · 2024-07-24T08:50:36 1721811036

Is the airport display downloading charlixcx.mp3.exe? Is it clicking malware attachments? I seriously want to know.

softwaredoug · 2024-07-21T13:27:25 1721568445

OK but structural engineering and anesthesiology are sciences with a lot of hard data behind them…

… arguing for 100% test coverage or SOLID principles are more like philosophies and anecdotes without a lot of hard data supporting them.

Software engineering looks less like engineering and more like a lot of bike shedding conversations.

ClumsyPilot · 2024-07-21T13:34:57 1721568897

We do have hard data as well, we have safe languages like ADA and Erlang, we have embedded engineers and special operating systems that are embedded with you applications.

It’s just nobody wants to pay for these specialist skills and would rather slap something on top of windows and call it a day

skwee357 · 2024-07-21T13:53:31 1721570011

Structural engineer can refuse to sign off a design that is not 100% fail proof. If he will be pressured into approving such design, he will ask the pressurer to put his own signature and bear the risk.

Now try to pushback on your managers request to “cut this long deploy process just once because this big client wants it fast”.

pylua · 2024-07-21T13:57:58 1721570278

In software development, the edge cases are hardly every fully documented or understood. That is because software is so malleable that there are so many different input / output combinations. Other engineering professions are much more constrained on their input.

dkarter · 2024-07-21T18:12:15 1721585535

Weak post. Agree with others that this is just rage bait lacking substance.

Blaming the developers (or any specific individual/group for that matter) is a cop out, it’s easy and lazy and doesn’t get to the root of the problem, which is more often than not a lack of processes, tools, information and lack of time/desire from leadership to address “technical debt” (for lack of a better term), no matter how many times the devs bring that up.

When you blame an individual or a group you can close the case shut on the post mortem and not get to any substantive improvements, meaning this can and will happen again, just to somebody else.

That’s why blameless postmortem and a blameless culture is so important. This is a good article about that philosophy:

> My summary of blameless culture is: when there is an outage, incident, or escaped bug in your service, assume the individuals involved had the best of intentions, and either they did not have the correct information to make a better decision, or the tools allowed them to make a mistake.

https://www.gybe.ca/a-few-words-about-blameless-culture/

batch12 · 2024-07-21T13:45:17 1721569517

Sure, the dev is wrong, but so is the process that allowed their error to impact the product. If a single person can make this choice on accident, then a single person can make it on purpose either by being malicious themselves or being otherwise compromised. If the company has hung their entire security posture and operational success on the choices of one person, they have a problem. Especially a security company.

hypeatei · 2024-07-21T14:05:26 1721570726

It's very surprising that critical infrastructure has been "infected" by the compliance checkbox obsession rather than just being designed in a way that completely eliminates the need for any shitty off-the-shelf security product.

There is incompetence and complacency showing at every level after the CrowdStrike outage. Both the people selling CrowdStrike and the ones implementing it.

betaby · 2024-07-21T14:23:00 1721571780

There is no law requiring EDR. It's purely constructed nonsense copied by security clowns from one company to another. Now it's kind of mandatory. It has self regulated everything to the ground

potatoman22 · 2024-07-21T14:01:50 1721570510

As a developer, it's nice to shift the blame to other people, and other people do share some of the blame. But we also need to be responsible for our code functioning. It reads like the author is looking for anyone else to blame besides the devs.

gizmo · 2024-07-21T14:06:47 1721570807

The author argues that the CEO of CrowdStrike has failed upwards. I don't agree. CrowdStrike makes compliance software. CrowdStrike's main purpose isn't to provide protection against cyber attacks. Businesses don't care enough about that. Businesses do care about simplifying their compliance burden and limiting their liability when they get hacked. That's where CrowdStrike excels, and that's why CrowdStrike can charge so much for their services. It's not easy to build a 70 billion dollar business and CrowdStrike serves a real business need.

> I remember times when leaders had dignity and self-respect. They would go on stage and apologize

Etiquette rules change over time, but a constant throughout history is that people in power don't take responsibility voluntarily. The "good old days" where leaders had dignity and self-respect never existed.

> [...] delusional claim how software engineers should bear the responsibility for bugs and outages

The /opinion/ that people /should/ be held responsible can't be delusional. Apparently the author believes that software engineers should bear zero responsibility -- even when their software kills people -- because we don't get enough respect. I don't agree with that opinion but it's a bit rich to call other people delusional when making one unfounded claim after another.

The post as a whole is way too angry and too cynical for me.

giantg2 · 2024-07-21T13:55:49 1721570149

'But then the author engages in an absurd rant about how the entire software engineering industry is a “bit of a clusterfuck”,'

I mean, it is a bit of a cluster fuck.

Most of the issues I've seen are due to speed or cost. We don't have time for tests.we don't have time to record the business requirements in a collective place (just look through multiple JIRA stories and piece it together). We don't have money for dedicated QA roles. So of course problems happen. Luckily my team only works on a lower criticality site.

None of this is really a dev's fault. It's leadership and the culture they incentivise.

acdha · 2024-07-21T13:58:50 1721570330

I blame the courts largely buying the argument that companies can disclaim all responsibility with terms of service. Most of these problems would become far less common if the situation was more balanced and managers had to consider the possibility of serious financial consequences for rushing something out without adequate testing.

giantg2 · 2024-07-21T14:44:21 1721573061

I mostly agree, but that still won't fix this issue. Nothing has 100% uptime. Let's say Crowdstrike has an SLA of 99.99%. Even with this outage, they probably meet that depeding on the time frame.

So which mangers bear the blame here? The ones who likely met their SLAs, or the ones who knew their systems relied on software that wasn't 100% reliable and didn't have a backup plan?

acdha · 2024-07-21T21:40:59 1721598059

Pure SLAs matter less than the negligence: this showed critical lack of testing, following robust engineering practices for progressive deployments or fail safe design, etc. If a bank loses $1M they don’t get to argue that they shouldn’t compensate clients because they processed 99.9999% of their transactions correctly.

giantg2 · 2024-07-22T00:44:27 1721609067

In that example, the banks wouldn't compensate people if their payment system was down and couldn't process orders. That really what happened here - loss of sales on the clients' end. If it were securities, then they would have as-of processing to fix it.

But the same argument about lack of testing can be made for the companies using the system. Do they have tests for what happens when pieces of their systems or infrastructure go down? Did they ever do a disaster recovery drill? Honestly, there's blame all around because almost nobody is doing it right. Even the ones who are have failures.

acdha · 2024-07-22T11:26:57 1721647617

> Do they have tests for what happens when pieces of their systems or infrastructure go down? Did they ever do a disaster recovery drill?

Yes, but this forced them to run the worst DR routine short of Microsoft going rogue. The scale of the testing problem is orders of magnitude larger on one side: people trusted them to be minimally competent and they just weren’t.

giantg2 · 2024-07-23T01:52:27 1721699547

Not going to lie, you sound like you've never made any mistakes ever. The problematic culture is on both sides. Anyone who works in infrastructure, SCCM packaging, or data centers knows you roll out updates to a isolated set of machines to test it first. Trust no one. If they really relied on them that much, they should have penalties in their contracts. If not, that would be an example of not being minimally competent - any enterprise sourcing team would look into this.

acdha · 2024-07-24T12:11:39 1721823099

Nobody does much in ops without making mistakes. I’ve done all of the roles you mentioned and learned from mistakes and oversights, which is why I know that you should start by looking at what you assume is available in each scenario. For example, I’ve twice seen prolonged outages in a data center due to failures in the power distribution equipment when doing a failover test – the physical plant guys had checked the UPS/battery systems but didn’t think about what would happen if that component failed and then learned that spare parts were out of stock in Southern California and the manufacturer had to put someone on the plane from Colorado, which meant we had to roll the failover to another data center. All of us had technically known that the redundant hardware didn’t mean n=2 parts could fail simultaneously but had incorrectly assumed the odds of a correlated failure were much lower or that the vendor a couple miles away would be able to fix a failed unit.

I mention that last because that’s what happened to a lot of people here. They had DR plans assuming that they had their management infrastructure or could quickly bring it back online, but then they had things like CrowdStrike taking out the servers holding BitLocker recovery keys and other critical infrastructure. One of the under-appreciated outcomes from the general push to secure things has been that a lot of systems are now less robust because they depend on a few security critical components with no easy path to recovery if those fail. Full disk encryption is great from the perspective of data loss but it also means key management is mission critical in a way senior management probably wasn’t fully appreciative of when setting funding plans.

josephg · 2024-07-21T14:08:20 1721570900

You know, I sort of agree with this article while completely disagreeing with it.

The article says:

> You want software engineers to be accountable for their code, then give them the respect they deserve.

The problem is, respect is something that's taken as much as its something given. We can't even decide for ourselves if software is worthy of respect. Can anyone learn to code from a coding bootcamp, where getting a job is all that matters? Or is it a discipline that takes years to master, where mistakes can and will bring down the global economy? Are we glorified plumbers, or are we mathematicians and civil engineers combined?

If you see yourself as a code monkey, of course you can't be "held responsible" for the results of your work. Coding bootcamps don't teach infosec. Its up to your company to set good practices and your job is just to follow them.

Its only if you personally want to take your role in society seriously that it makes sense to consider not just your job, but the effect your job has on the wider world. I'm personally of the opinion that this mindset is almost always long term positive for your career. Its less "blame the dev for hitting deploy" and more "I'm the dev. No, I won't hit deploy on that code in the state its in."

I didn't go to a coding bootcamp. I went to university. There they forced all of us engineering & CS students to do an ethics course - which was actually fantastic. There they taught us about the Therac-25: a computer controlled radiation machine which killed a bunch of people. The engineers on the ground knew that it needed more testing, but the company insisted it was fine and pushed it out the door.

Here's the question: If you were one of those engineers, what would you do? If you knew, or suspected, that a bug in your code could bring down 911 services and hospitals, ground planes or give people a lethal dose of radiation, do you really trust your manager to make the engineering call? Do you think the CEO understands the risk that is being taken by hitting deploy?

Of course, if we're playing the blame game, the blame ultimately the blame falls on the CEO of the company or something.

But forget the blame game. You won't be fired. You won't lose your cushy job. The question is: Who do you want to be in situations like this? Think about this question now. You won't have time in the moment when your boss tells you to hit deploy, and you have second thoughts.

Me? I want to be someone who would say no.

skwee357 · 2024-07-21T14:16:35 1721571395

I agree.

But in order to be the "person who would say no", the industry needs to understand that your opinion and expertise--matter. It could be a cultural shift, or a gate-keeping style shift where we protect the title "Engineer", like they do in some professions in some countries.

But given the current state, you can't on one hand blame the developers, and on the other hand treat them like spoiled kids who make too much money and in any way AI can replace most of them. It doesn't work this way. A Structural Engineer bears the responsibility because he has the authority, and respect to his knowledge, to refuse to sign off a broken design. This is not the case in software engineering.

mwcampbell · 2024-07-21T17:11:28 1721581888

> treat them like spoiled kids

To what extent is this because we act like spoiled kids? I really do mean "we" here; I probably have acted like that sometimes. I wonder if we, the post-microcomputer generations, are messed up to some extent because we started programming as a fun distraction from the work we were supposed to be doing, rather than learning programming as a serious job from the beginning like our predecessors who learned on mainframes or minicomputers in college.

josephg · 2024-07-21T21:03:15 1721595795

Right. This is my point exactly. It’s hard to have it both ways. We can’t both carry the burden of responsibility for society, and expect truck drivers to learn programming in a 12 week bootcamp. It’s hard to expect programmers to have rigor in our work if we hire programmers who are self taught. And that’s a bitter pill to swallow, because lots of self taught software engineers are really good.

But we don’t need to boil the ocean for things to improve. You personally can still decide you don’t want to make software that harms society. You personally can push back against your company if they want you to sell your users data. Nobody really knows how much they should respect your opinions and skills, so they’ll try things on. If you don’t respect yourself, nobody else will either.

intelVISA · 2024-07-21T15:28:18 1721575698

Imo the root issue is that SWE is such a young profession: it hasn't had time to find a real footing, it's probably somewhere in the barber surgeon era about now.

Therefore the current identity crisis: the avg SWE is expected to be skilled and autonomous, but also report to daily standups and use mythical "story points" to estimate Very Serious Projects.

I would say devs are neither plumbers nor engineers but managers (of automation), but to recognize this is not Agile so instead we'll treat them as cheap cogs and wonder why the best escape to run their own shop where they can recapture the responsibilities and compensation eroded by bloat and grifters.

This is why the corp narrative is so wishful toward AI replacing engineers: an abstract magic box that prints money with no concept of accountability, copyright, liability is a dream for CEOs.

skwirl · 2024-07-21T13:18:04 1721567884

> If a software engineer tells you that this code needs to be 100% test covered, that AI won’t replace them, and that they need 3 months of development—then you better shut the fuck up and let them do their job.

This is so naive that it is impossible to take the author seriously.

mondrian · 2024-07-21T13:43:58 1721569438

The concept that you can't take the blame if you're not given the responsibility is not naive.

dakiol · 2024-07-21T13:42:19 1721569339

Who else is in a better position to give an estimate than the people who design and build the thing?

9dev · 2024-07-21T13:59:23 1721570363

It leads to a conflict of interest: that’s why we have checks and balances in government, and draw a clear line between executive and legislative powers.

If we simply accepted the fact that the wizards of the arcane machines knew what they do and the rest of us is at their mercy, then we would also loose any organisational control. I’m sure that sounds tempting to some engineers, but business people actually do serve a purpose. Acting like the all-knowable developers will surely do right just isn’t a good business strategy.

ses1984 · 2024-07-21T13:18:58 1721567938

It’s called hyperbole.

skwirl · 2024-07-21T13:28:49 1721568529

No, this paragraph was not hyperbole. This was in conclusion and the author’s serious opinion.

skwee357 · 2024-07-21T14:01:39 1721570499

It was a hyperbolic statement related to the fact that if you want developers to take full responsibility, like surgeons, then you need trust their authority.

I haven’t heard of a surgeon who said “this operation will take as much time as needed”, and the hospital manager pressured him to “finish it in 8 hours, and not use too many syringes”.

Kiro · 2024-07-21T13:36:36 1721568996

You're replying to the author but I agree. Too much unfounded speculation. I'm glad the author is being called out for lack of research in a sibling comment.

ses1984 · 2024-07-21T15:26:47 1721575607

The whole post is a rant.

jkercher · 2024-07-21T13:39:55 1721569195

What's naive?