I'm not sure how automated deployments would have solved this problem. In fact, if anything, it would have magnified the impact and fallout of the problem.
Substitute "a developer forgot to upload the code to one of the servers" for "the deployment agent errored while downloading the new binary/code onto the server and a bug in the agent prevented the error from being surfaced." Now you have the same failure mode, and the impact happens even faster.
The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.
> The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.
The blame completely lies with the risk management team.
The market knew there was a terrible problem, Knight knew there was a problem, and yet it took 45 minutes of trying various hotfixes before they ceased trading. Either they didn't have a kill switch, or no one was empowered to pull it because of the opportunity cost (perhaps pulling the switch at the wrong time costs $500k in missed opportunity).
I worked for a competitor to Knight at the time, and we deployed terrible bugs to production all the time, and during post-mortems we couldn't fathom the same thing happening to us. A dozen automated systems would have kicked in to stop individual trades, and any senior trader or operations person could have had a kill switch pulled within 60 seconds of dialogue, without fearing the repercussions. Actually, we captured way less of Knight's $400m than we could have, because our risk systems kept shutting strategies down on the grounds that what was happening was "too good to be true".
It’s nice to see your perspective as someone familiar with better systems.
I have always found this story fascinating; in my junior days I worked at a relatively big adtech platform (ie billions of impressions per day) and as cowboy as we were about lots of things, all our systems always had kill switches that could stop spending money and I could have pulled them with minimal red tape if I suspected something was wrong.
And this was for a platform where our max loss for an hour would have hurt but not killed the business (maybe a six figure loss), I can’t imagine not having layers of risk management systems in HFT software.
They were asleep at the wheel, not unlike all the random brokerages that blew up when the Swiss central bank pulled the CHF peg in 2015.
This is a culture problem - as soon as you load up your trading firm with a bunch of software industry hires, you end up with jiras and change management workflows instead of people on deck that have context for what they're doing. That's the only way to explain reverse scalping for 45 mins straight.
The CHF de-peg wasn't really technology risk. Brokers lost money because they undervalued CHF/EUR risk, undervalued liquidity risk (stop orders were executing FAR worse than expected, or simply failing to execute at all), and didn't pay attention to the legal protections afforded to their customers (customer balances went negative but there was no way to recover that money from the customers). These brokers would have had the same problems even if using pen & paper, they failed to plan (or alternatively, made a conscious bet and lost).
I think it is worth saying that no one saw that de-peg coming. Absolutely no one. Sure, there are some crazies who claim they saw it coming, but that same camp is still waiting for the HKD-USD de-peg. It was a shock to everyone on Wall Street. I am a bit surprised that the Swiss National Bank didn't tip off their own banks before doing it. Both UBS and Credit Suisse were seriously caught off guard when it happened.
That's fair. I wasn't close enough to see how "surprising" it was. The point stands that it looks nothing like HFT active trading risk. Knight Capital created such amazing training material for the industry: what went wrong with their software, how many decisions or practices could have prevented or narrowed the risk, what went wrong at trade time, and how they had a clear window to pull the plug but died through indecision and inaction.
> as soon as you load up your trading firm with a bunch of software industry hires
As a software industry hire at a hedge fund right now... I'd love to see more cross-pollination, because there are so many good things happening on both sides, and so many terrible things happening through just a sheer lack of knowledge.
Change management workflows are great and should be used more in finance. But software companies should implement andon cord systems more often (Amazon does; nowhere else I've worked gives that power to anybody at the company).
I messed around with the idea of a physical big red button kill switch to shut down market making; the IT people thought I was joking - the trading desk just assumed that it was in the design from day 1.
> or because no one was empowered to pull the kill switch because of the opportunity cost (perhaps pulling the switch at the wrong time costs $500k in opportunity).
Isn't the problem that pulling the plug on a trading bot doesn't just have opportunity costs, but may also leave you with open positions that, depending on the kind of trades you're doing and the way the market is moving, could be arbitrarily expensive to unwind?
> Actually, we made way less of Knight's $400m than we could have because our risk systems kept shutting strategies down because what was happening was "too good to be true".
Aren't a lot of trades undone anyway by the authorities after such severe market hiccups?
This is a good question. In my experience, I have only seen exchange trades reversed when there was a major bug in exchange software. If the bug is on the client side, tough luck. And reversing trades done on an exchange is usually a decision for the exchange regulator. It is a major event that only happens every few years -- at most -- for highly developed exchanges.
Reversing or amending a single "fat finger" trade happens all the time and the exchange generally has procedures for this that don't involve a regulator.
Even in the most controversial recent example - LME cancelling a day's worth of nickel trades [0] - I understand it was their call and not any external regulator. That said, while I'd count LME as a "highly developed exchange", it's the Wild West compared to the US NMS.
As I understand it, the LME nickel trade reversal was to prevent a total meltdown due to multiple counterparties going bankrupt at the same time. To me, it was a classic case of poor exchange limits and poor risk control. "If you owe the bank $100 that's your problem. If you owe the bank $100 million, that's the bank's problem." -J. Paul Getty (of course, add some zeroes for today's world)
Also, can you explain more about what this phrase means: "it's the Wild West compared to the US NMS"? Are you saying the risk limits and controls on LME are much worse than in US markets?
> [...] the exchange generally has procedures for this that don't involve a regulator.
That's part of why I vaguely referred to 'the authorities' in my original comment. I wasn't quite sure who's doing the amending and reversing, and it wasn't too important.
Normally trades are undone or amended in price if they are executed far away (say 10%+) from what is determined to be reasonable market prices. And when amended, they get amended to a price that's still in the same direction, so the market taker still loses a little bit compared to the fair price.
KCG traded in such liquid instruments and in such a way that it didn't move the market that much. They lost a hundred dollars a trade on 4 million trades.
The article says some stocks were moved by more than 10%, but as I recall that was a small fraction of them.
Technically KCG didn't exist yet. NITE, aka Knight Capital Group, lost ~$460 million. Getco bought NITE, forming KCG (beating out a bid from Virtu, which later purchased KCG), and the combined firm is now called Virtu Financial (VIRT).
> ... one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added.
Read this part again:
> ... one of Knight’s technicians did not *copy the new code to one of the eight SMARS computer servers*.
Yes, of course a CI/CD pipeline can fail midway through and deploy the code to only some of the servers, but I doubt it would here. And even if that were the case, just off the top of my head I can guarantee that an Ansible playbook would not only have stopped the moment that particular transfer failed, the whole playbook run would therefore have failed, and none of the services would have been restarted (because that would be a final step that was never reached).
This was due to human error and is the very reason CI/CD/automation is a thing.
> Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server
CI/CD would have solved this 100%. A "Pull Request" made against a repository of Ansible code (or whatever your flavour is) would have *PREVENTED* the first technician from ever being able to merge the code into master/main (because you have master/main protected right... right?), completely preventing the entire process from ever rolling out without a review, which would have hopefully caught the misaligned configuration.
DevOps, which is mostly underpinned by CI/CD, would have solved this 100%. I'm very certain of this.
Ansible in my experience will stop trying to run subsequent tasks on a server once one of them fails, but it will go ahead with other servers that match the inventory pattern. So it very well could have successfully updated 7 out of 8 hosts.
Maybe there is a switch that will stop everything if any task on any host fails but it's not the default behavior.
At least it would have logged an error that hopefully would have been looked at.
I think this is an example of hindsight not always being 20/20
If you replace each step of the post mortem with a CI/CD based alternative, you miss out on the fact that CI/CD trivializes designs where this wouldn't have happened.
The easy default here would be to have a runbook that executed on a particular inventory group.
The “linear” execution strategy is the default (which you linked to). By default, if there is an error on one host it will continue executing on all other hosts. You need to set a flag to stop executing on all hosts[1].
The parent process would not be notified of any failures until the end of the run, unless you supplied a custom callback plugin[2].
The problem was that a human forgot to run a step and no one noticed. With automation, the playbook would have failed and the server wouldn't have been online to place orders.
If you read the article, the other servers were fine and did not contribute to the issue.
> So it very well could have successfully updated 7 out of 8 hosts.
The problem was that the feature flag was manually enabled on the host with old code. Presumably with automated deployment the feature flag would never have been toggled if the deployment failed, either because the deployment didn't get that far or because the human spotted the failed deployment.
It is quite likely this would have been solved by a good automated deployment process. However, it is also quite likely that at some point a human error would creep into either the automated deployment process itself, or into code that then gets 100% correctly deployed into production.
At that point, if the error is as serious, Knight would still have gone bankrupt, since they had no way to mitigate these failure conditions.
Being 100% free of bugs is just not a viable way to end up with safe systems.
The blame here may indeed lie with whoever decided that reusing an old flag was a good idea. As anyone who has been in software development for any time can attest, this decision was not necessarily - and perhaps not even likely - made by a "developer."
9 times out of 10, I see developers making the mistakes that everyone seems to want to blame on non-technical people. There is a massive amount of software being written by people with a wide range of capabilities, and a large number of developers never master the basics. It doesn't help that some of the worst tools "win" and offer little protection against many basic mistakes.
A large number of developers never master the basics, that is true. But more interestingly, no programmer can write any substantial amount of code that is free of bugs.
If your road to safety is bugfree code, it will end up in an accident sooner or later, 100% guaranteed.
For a group who so thoroughly despises bosses that operate on 'blame allocation', we spend a lot of time shopping around for permission to engage in reckless behavior. Most people would call that being a hypocrite.
Whereas I would call it... no, hypocrite works just fine.
At the company I work, we have a team that took 3 weeks and multiple tries to get an API response (JSON) capitalized properly (camelCase to PascalCase).
When I tried to talk to the tech lead about it, his response is that SAFe would have prevented the issue (it was discovered by another team who consumes their API).
Throughout the entire thing this tech lead maintained that his team didn't do anything wrong and that the problem was the process.
Yeah, no. I have 25+ years of experience as a developer; it doesn't take 3+ weeks to fix the casing of a JSON property name. I eventually had to be the bad guy and tell them their work was unacceptable, because they themselves couldn't recognize it. When I did, I also ran it up the chain, because if the tech lead doesn't see the problem then I need someone who can help them see the problem.
For some people there's a "responsibility shield" that's so strong you can never get through to them.
Or at least not by a developer who has made that sort of mistake in the past.
I don't know what software engineering programs teach these days, but in the 1980s there was very little inclusion of case studies of things that went wrong. This was unlike the courses in the business school (my undergrad was a CS major + business minor), and, I would presume, unlike what real engineering disciplines teach.
My first exposure to a fuckup in production was a fuckup in production on my first job.
It is very hard to change the overall size of the messages, and there's a lot of pressure to keep them short. So it could have been a bitfield or several similar things, e.g. a value in a char field.
At the very least do it in two deploys - first actually remove the old code that relies on the flag, then repurpose it. It's a giant foot gun to do it all in one, especially without any automated deploys.
That assumes that you have a stable, reliable, quick process to roll out updates. Sounds like they didn't, so maybe they worked on the "oh better add this feature, it's our only chance this month" pattern.
>whoever decided that reusing an old flag was a good idea.
My understanding is that in high frequency trading, minimizing the size of the transmission is paramount. Hence re-purposing an existing flag, rather than adding size to the packet, makes some sense.
Flag recycling is a task that should be measured in months to quarters, and from what I recall of the postmortem they tried to achieve it in weeks, which is just criminally stupid.
It's this detail of the story which flips me from sympathy to schadenfreude. You dumb motherfuckers fucked around and found out.
I doubt any manager or VP cares or knows enough about the technical details of the code to dictate the name that should be used for a feature flag, of all things.
I see this as a problem of not investing enough in the deploy process. (Disclosure: I maintain an open source deploy tool for a living).
Charity Majors gave a talk at Euruko that talked a lot about this. Deploy tooling shouldn't be a bunch of bash scripts in a trench coat; it should be fully staffed, fully tested, and automated within an inch of its life.
If you have a deploy process that has some kind of immutable architecture, tooling to monitor (failed/stuck/incomplete) rollouts, and the ability to quickly roll back to a prior known good stage, then you have layers of protection and an easy course of action for when things do go sideways. It might not have made this problem impossible, but it would have made it harder to happen.
I wrote a tool to automate our hotfix process, and people were somewhat surprised that you could kill the process at any step and start over and it would almost always do the right thing. Like how did you expect it to work? Why replace an error prone process with an error prone and opaque one that you can't restart?
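The property being described is idempotency. A rough sketch of the idea in Python (step names and state are invented, not the actual tool): each step knows how to tell whether it has already been applied, so killing the run at any point and starting over is safe.

    # Minimal sketch of an idempotent, restartable deploy pipeline (hypothetical
    # step names, not the commenter's real tool). Each step can detect whether it
    # has already happened, so re-running after a crash is safe.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Step:
        name: str
        is_done: Callable[[], bool]   # cheap check: has this already happened?
        apply: Callable[[], None]     # the actual work; must be safe to re-run

    def run_pipeline(steps: list) -> None:
        for step in steps:
            if step.is_done():
                print(f"skip  {step.name} (already done)")
                continue
            print(f"apply {step.name}")
            step.apply()
            if not step.is_done():
                raise RuntimeError(f"{step.name} did not converge; aborting")

    # Demo with a dict standing in for real servers/artifacts.
    state = {"artifact_uploaded": False, "service_restarted": False}
    steps = [
        Step("upload artifact",
             lambda: state["artifact_uploaded"],
             lambda: state.update(artifact_uploaded=True)),
        Step("restart service",
             lambda: state["service_restarted"],
             lambda: state.update(service_restarted=True)),
    ]
    run_pipeline(steps)   # kill it anywhere and run again: done steps are skipped

Restarting is safe precisely because the pipeline asks "is this already done?" before doing anything, rather than trying to remember where it left off.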
> the ability to quickly rollback to a prior known good stage
This is vital, but it's often not sufficient just to roll back, say, to a known good Docker image. Database migrations may have occurred that dropped columns that the old code expects to exist; feature flags may need to be changed; multiple services may need to be rolled back individually; data may have accumulated under new assumptions that breaks old assumptions when old code is applied to that new data.
One of the really subtle wins of devops as a discipline is that by allowing/forcing application teams to take responsibility for deployment, they're more exposed to thinking how to solve these things in a maintainable way: for instance, breaking out complex "the meaning of our data is changing"-type changesets/data migrations into individually reversible stages, with the stages merged onto the production branch over the course of multiple days where analysis is done on error rates and live data.
Counterpoint though: that automation in and of itself is more failure surface area.
I can imagine a similar story where the deployment pipeline incorrectly rolled back due to some change in metric format and caused a runaway loss, for example.
The thing with these 1-in-a-million chances is that there are thousands of different hypothetical causes. The more parts, the harder it is to predict an interaction, and we've all been blindsided by something.
I would personally hate the stress of working on such high stakes releases.
Test test test. If that’s not enough, pick better tools. I’m rewriting bash scripts in rust at work because it gives me the ability to make many invalid states impossible to represent in code. Is it overkill? Maybe, but it is such a huge quality of life improvement.
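The "invalid states unrepresentable" idea isn't specific to Rust. A tiny illustration of the principle in Python (the commenter's scripts are in Rust; these names are made up): constrain a deploy target to an enum instead of passing free-form strings around, so a typo fails loudly at the boundary instead of silently doing something else.

    # Illustration of "make invalid states unrepresentable" (hypothetical example,
    # not the commenter's Rust code): the deploy target is an enum, not a string.
    from enum import Enum

    class Environment(Enum):
        STAGING = "staging"
        PRODUCTION = "production"

    def deploy(env: Environment, version: str) -> None:
        # By the time we get here, env can only be one of the enumerated values.
        print(f"deploying {version} to {env.value}")

    deploy(Environment.PRODUCTION, "1.4.2")            # fine
    try:
        deploy(Environment("prodcution"), "1.4.2")     # typo: rejected at the boundary
    except ValueError as err:
        print(f"rejected: {err}")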
Automated things can fail. Sure. But consider that playbooks are just crappy automation run by unreliable meat computers.
Also you can take an iterative approach to automation:
- manual playbook only
- automate one step of the playbook
- if it goes well, move to another. If not, run a retro to figure out how you can improve it and try again.
Stress of failure at a job responsible for deployment architecture is manageable if you have a team and culture built around respecting that stress. There are some areas of code people are more careful around, but largely we make safety a product of our tools and processes and not some heroic “try harder not to screw up” attitude.
I find the impact of helping so many developers and their companies rewarding.
Part of the solution is at the level of attitude, just one more productive than "don't fuck up"
To create a contrived example, say someone reads your note on replacing bash scripts and decides they agree with the principle.
They go into work tomorrow, their fellow engineer agrees on the technical merit, and they reimplement a bunch of bash scripts in Rust with a suite of tests bigger than anyone imagined, and life is great.
... fast forward a few months from now and suddenly a state the bash scripts were hiding flares up and everyone is lost, and type safety didn't help.
A shared culture of "conservation of value" can help in a lot of ways there. That's the attitude that creation of value is always uncertain, so you prioritize potential future value lower than currently provided value:
- instead of looking at the technical merit of the new, we prioritize asking: what specific shortcomings does the old way have? What can we improve downstream so that the value those systems provide is protected from the invalid states we're worried about this tool generating?
- does switching the language reduce the number of people who can work on it? Do we reduce the effect surface area of the team providing value to it? When hair is on fire do we know the sysops guy won't balk?
- when it goes down, with a culture of "conservation of value", your plan A is always rolling back; there's no back and forth on whether we can just roll out this one fix. If you cause the company to lose a million dollar trade, it's already codified that you made the right decision
Obviously these are all extensions of a contrived example, but to me culture is heavily utilized as a way to guide better engineering.
I think these days people tend to think in terms of culture that affirms, as a reaction to cultures that block anyone from accomplishing anything: to me a good engineering culture is one that clashes with what people want to do just enough to be mildly annoying.
The goal with automation is that the number of unidentified corner cases reduces over time.
A manual runbook is a game of "I did step 12, I think I did step 13, so the next step is 14" that plays out every single time you do server work. The thing about the human brain is that when you interrupt a task you've done a million times in the middle, most people can't reliably distinguish this iteration from false memories of the last time they did it.
So unless there are interlocks that prevent skipping a step, it's a gamble every single time. And the effort involved in creating interlocks is a large fraction of the cost of automating.
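For what an interlock can look like, here's a rough sketch in Python (invented step names, not a real tool): completed steps are journaled to disk, so "did I already do step 13?" has a definite answer, and step 14 refuses to run before step 13.

    # Rough sketch of a runbook runner with interlocks (invented step names).
    # Completed steps are journaled, so progress survives interruptions and
    # steps cannot be skipped or run out of order.
    import json, pathlib

    STEPS = ["drain traffic", "copy new code", "restart service", "verify version"]
    JOURNAL = pathlib.Path("runbook_progress.json")

    def load_done() -> list:
        return json.loads(JOURNAL.read_text()) if JOURNAL.exists() else []

    def run_step(name: str) -> None:
        done = load_done()
        if name in done:
            print(f"'{name}' already done, skipping")
            return
        expected = STEPS[len(done)]
        if name != expected:
            raise RuntimeError(f"out of order: next step is '{expected}', not '{name}'")
        print(f"running '{name}' ...")   # the actual (manual or automated) work goes here
        JOURNAL.write_text(json.dumps(done + [name]))

    for step in STEPS:
        run_step(step)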
1. Print the checklist/runbook out on paper with actual empty boxes next to the steps.
2. Laminate the printed checklist and put it in a big folder.
3. Every time you run the checklist, use a sharpie to mark the checkbox after you've done the step.
4. When you are done with the entire process, use whiteboard cleaner to wipe out the checks again and put the checklist back in the big folder with all the other checklists.
This is how every safety critical profession (aviation, shipping, medical, power generation, etc) has worked for decades and unless people are willingly being obtuse it is extremely hard to do it wrong. You just need people to turn off their ego and follow the process instead of trying to show off by doing it from memory. This last part might be more difficult in software settings.
Regarding checklists, people like Hollnagel, Wears, Braithwaite, Dekker, ... have done a fair bit of investigation (Hollnagel mostly in healthcare; Dekker started in the air industry but spread from there). Read the "Safety-I vs Safety-II" paper or the "When a checklist is not enough: How to improve them and what else is needed" paper.
Great point and sometimes checklists are indeed not enough. My previous post was triggered more by the "I think I did step 13," part of the post I was responding to. That is not a flaw with checklists but a flaw in the safety culture of the operator team. It should never happen that you lose track of where you are in the process because human memory is unreliable, rather you should fix that through better processes and outsource the memorizing to paper.
Technically, a flag re-use was the most impactful error, code wise.
A flag of such importance should not be just on/off; the ON should require a positive response / receipt containing the name and version of the code being turned on.
[edit - don't mean for each trade, I mean validation on startup]
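A sketch of that "positive receipt" idea in Python (the flag name and version strings are invented): a flag records which build it was written for, and each server validates that on startup before honoring it.

    # Sketch of flag activation requiring a version receipt (hypothetical names).
    # A flag isn't just on/off: it records the build it was meant for, and a
    # server refuses to start if its own build doesn't match.
    EXPECTED_BUILD_FOR_FLAG = {"rlp_routing": "smars-2012.08.01"}  # set at rollout time

    def validate_flags_on_startup(server: str, build: str, enabled_flags: set) -> None:
        for flag in enabled_flags:
            expected = EXPECTED_BUILD_FOR_FLAG.get(flag)
            if expected != build:
                # A server still running old code refuses to start here, instead of
                # silently reviving whatever that bit used to mean.
                raise SystemExit(f"{server}: flag '{flag}' expects build {expected!r}, "
                                 f"but this binary is {build!r}; refusing to trade")
        print(f"{server}: flags validated against build {build}")

    validate_flags_on_startup("server-01", "smars-2012.08.01", {"rlp_routing"})
    # validate_flags_on_startup("server-08", "smars-2004.03", {"rlp_routing"})  # would refuse to start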
I think the blame is not on either the devops or the developers; it is on the process. If a bug occurs, then there should be at least 5-6 different metrics/alerts that are able to catch the bug.
I think the big improvement would be consistency. Either all servers would be correct or all servers would be incorrect. The step where "Since they were unable to determine what was causing the erroneous orders they reacted by uninstalling the new code from the servers it was deployed to correctly" wouldn't have had a negative impact. They could have even instantly rolled back. Also if they were using the same automated deployment processes for their test environment they might have even caught this in QA.
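A minimal sketch of that kind of post-deploy consistency check in Python (hostnames and version strings are invented): ask every server what it is running and refuse to proceed until the whole fleet agrees.

    # Minimal fleet-consistency check (hypothetical hosts/versions): the rollout
    # only proceeds if every server reports the same version.
    from collections import Counter

    def reported_versions() -> dict:
        # In reality this would query each host (deploy agent, health endpoint, ...).
        versions = {f"server-{i:02d}": "v2-rlp" for i in range(1, 8)}
        versions["server-08"] = "v1-powerpeg"   # the straggler
        return versions

    def check_fleet_consistent() -> None:
        versions = reported_versions()
        counts = Counter(versions.values())
        if len(counts) > 1:
            majority, _ = counts.most_common(1)[0]
            stragglers = {h: v for h, v in versions.items() if v != majority}
            raise SystemExit(f"fleet inconsistent, do not enable the new flag: {stragglers}")
        print("all servers report", counts.most_common(1)[0][0])

    check_fleet_consistent()   # with the data above this aborts, naming server-08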
I agree. It doesn’t matter if you give an inexperienced person a hammer or a saw — they’ll still screw it up.
My biggest pet peeve is that NO ONE ever does failure modeling.
I swear everyone builds things assuming it will work perfectly. Then when you mention if one part fails, it will completely bring down everything, they’ll say that it’s a 1 in a million chance. Yeah, the problem isn’t that it’s unlikely, it’s that when it does happen, you’ve perfectly designed your system to destroy itself.
It's actually quite routine stuff now, in finance at least, to perform some kind of 'fire test' on a regular basis - you shut down some components during the day and switch to backup solutions, to test that everything works smoothly.
> the deployment agent errored while downloading the new binary/code onto the server
In that case the build would never be pushed to production. The worst it would accomplish, and this is if your systems fail, is that it will break your staging area.
Sure, this is in the ideal world where people actually know how to set up their deployment pipelines correctly, so you're likely still right in many cases, but you shouldn't be.
Automated deployments would have allowed you to review the deployment before it happened. A failed deployment could be configured to allow automatic rollbacks. Automated deployments should also handle experiment flags, which could have been toggled to reduce impact. There are a bunch of places where it could have intervened and mitigated/prevented this whole situation.
Imagine a company where the engineering culture hires macho programmers who love bitmasked flags and manual memory management, and who think memory safe languages and json are for sissies.
Imagine they hire lots of graduates who've never worked elsewhere and teach them that their way is best, and everyone who says otherwise doesn't understand real performance. Those 'industry best practices' are written by javascript folks who think a 2 second pageload is fast, and a 200ms pageload is instant, don't you know?
And imagine when they make experienced hires, they look for people who have experience with bitmasked flags and manual memory management - which is reasonable enough, they gotta be able to code review that stuff and coach junior employees on working with it. But has the side effect experienced hires won't rock the boat.
Now, the nontechnical bosses can ask anyone from the highest engineering leadership to the most junior of peons, and they'll all agree that bitmasked flags are the right way of doing things.
Is it so unreasonable for nontechnical bosses to trust the consensus of their engineers on matters of engineering?
Also, API versioning. They weren't running API versioning on it; they called an old method with a new set of parameters, which shouldn't have been possible in the first place.
> why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point
This seems to be exactly the point! For 8 years they left unused code in place, seemingly only bothering to remove it because they wanted to repurpose a flag. If they'd done the right thing 8 years prior and removed code they weren't using, this story plays out very differently. No ancient routines get resurrected, no rogue server.
Maybe Knight Capital wasn't using version control and held onto this code "just in case", but I've seen this same resistance to deleting code in programmers working in repos that are completely under VCS, and it's flabbergasting. If you need it again, you can always bring it back from version control. If you need it again but forget it's there, you'd do the same with the dead code path. Leaving it in the source tree is pure liability.
EDIT: Kevlin Henney gave an excellent talk at GOTO about software reliability and he touches on this, using Knight Capital as the example—he actually cites this very blog post [0]. The whole talk is excellent, but I've linked the three minutes where he talks about Knight Capital.
> The problem is there is no code that is truly dead. It turns out all you need to do is make a small assumption, a change of an assumption and then suddenly it's no longer dead, it's zombie code. It has come back to life and the zombie apocalypse costs money.
> I've seen this same resistance to deleting code in programmers working in repos that are completely under VCS, and it's flabbergasting
I think a lot of developers only know the basics of git. They can check in changes, they can look at history with git log, and maybe they know how to use git blame.
They often don't know how to filter git history. They often don't know about the git pickaxe, or about exclude patterns, and don't even think to question if you can do something like "git log -G'int.*foo\(' -- ':(exclude)directory'" to search for 'foo' in the git log, excluding some directory.
They know how to "grep" within the existing code tree though, so they know if it's not deleted they can find it again with the right grep. If it's deleted, they might not know how to find it in git history.
I sympathize with this to a degree actually. Code in the git log is invisible to a lot of tooling, so for example it won't show up in autocomplete if your text editor might have otherwise suggested it, it won't show up in your library documentation, etc.
If you truly think the code will be used again, I think it's at least defensible to leave it in the tree so that it comes along for refactors, and ends up being found when it's needed.
For cases like Knight capital, where it's obviously never going to be useful again, it's not defensible of course.
I can understand that argument for small, standalone functions. Where it really gets at me is when people insist on leaving whole use cases or subsystems in place, which seems to be what happened here.
You don't want these to be visible to autocomplete, because they're outdated and would need major modification to be correct again. If you do need to resurrect them, they're trivial to find in the git history—just search for the commit named "remove foo"—and they should pass through code review as if they were brand new code, because a lot of stuff will have changed around them in the intervening time.
> and don't even think to question if you can do something like "git log -G'int.*foo\(' -- ':(exclude)directory'"
I do question it, but I know the answer is hard (as you demonstrated), so I don't bother. And I'm now looking at the git log docs - I fail to parse how the exclude works even looking at the docs - I don't find anything about the `:(` construct which houses the exclude keyword. But thanks for -G - that will be useful.
> Code in the git log is invisible to a lot of tooling
This is the issue. When searching code, I expect to have a way to search older commits. But Azure DevOps won't do it. There is no checkbox "Include all commits".
You can see that it includes the exclude keyword, among others.
Since it applies to almost every command (from 'git add -- <pathspec>' to 'git checkout -- <pathspec>'), it's not mentioned as clearly in individual commands.
git log and all the other commands' man pages should really refer back to gitglossary here, then. And they should either name their argument <pathspec>, or specify that the given argument (<path> or <file>) is a pathspec.
> git pickaxe, or about exclude patterns, and don't even think to question if you can do something like "git log -G'int.*foo\(' -- ':(exclude)directory'" to search for 'foo' in the git log, excluding some directory
Learned something new today, thank you! Will find ways to use these in my daily workflow.
"If it works, don't touch it" is something I've heard a lot, especially said by managers who don't understand what they are talking about.
An update to simply "remove old code" might be difficult if someone sees any change as creating a risk of something going wrong. And to be fair, any change is a risk, but so is leaving old code around.
At least now we have this case to point to as a clear example of the risk.
Version control isn’t bulletproof either. All your code history is one git rebase away from being abolished forever.
I hope most orgs have processes around their main branches so this does not occur, but I’ve also been in smaller orgs and accidentally screwed up prod database tables, so the accidental git rebase isn’t impossible to consider…
Yes, you should definitely have branch protection turned on on main, I kind of assumed that went without saying. But to actually lose your entire git history would require both having no branch protection and having every single developer on your staff be in the regular habit of force rebasing their own branches on main. If a single developer does a double take when they're told that their branch has a different history than origin/main, then you probably didn't lose more than a week of work.
Also, it's worth noting that this dead code will almost certainly not get reused wherever you decide to store it, so it's best to keep it as far away from zombification as possible, even if that's not the safest place to store it.
You're very unlikely to lose your version history like that.
Everywhere I've worked has branch protection turned on, and backups. Even if those both fail somehow it's very likely the complete history is on lots of engineers' laptops.
1. You shouldn't be allowing anybody to force-push rebased stuff onto major branches in your main repo (the one builds come from) anyway. This is especially important to support auditing and traceability.
2. Just because the folder is a git-repo isn't an excuse not to have it part of your regular offsite backup set.
There are reasons to want branch protection but audit isn’t one of them. An auditable system would be keeping an immutable log of the git repo actions in a separate, append-only location. It wouldn’t rely on the thing being written to all the time, by users, to never get broken. In your model, your audit history only needs one bug in the branch-protection tool for it to be destroyed.
Think of it like syslog. It’s good to keep a log of events but it’s bad to rely on the /var/log/syslog on your web server. You should be logging to a remote, append-only system.
(However I would concede that if we are talking about lawyer-proof-audit — SOC2, ISO etc — rather than actual security auditing, then branch protection is probably just fine.)
See my comment in the sister thread - I worked at Knight just after the outage.
> For 8 years they left unused code in place, seemingly only bothering to remove it because they wanted to repurpose a flag
There was another issue where they were using a database with only 256 columns. Sometimes they needed a new column, so they would just reuse an old column that "wasn't being used at the time".
IIRC, this was generally acknowledged internally to be "a bad idea" but no one had prioritized cleaning up the old code and/or coming up with a better best practice.
Imo it's easier to quantify the value (in dollars) of adding new code than of removing old code. Things whose value is easier to quantify tend to get higher priority. The same applies to other areas like performance optimization, testing, security, and patching (some of these have high level data, like the cost of incidents/downtime or the cost of a breach).
Some companies have processes in place to try to balance this, like 20% of time spent on tech debt reduction.
No continuous deployment system I have worked with would have blocked this particular bug.
They were in a situation where they were incrementally rolling out, but the code had a logic bug where the failure of one install within an incremental rollout step bankrupted the company.
I’d guard against this with runtime checks that the software version (e.g. git sha) matches, and also add fault injection into tests that invoke the software rollout infrastructure.
No continuous deployment system worth its salt would allow configuration and code to be out of sync. They had a configuration change to turn on a flag that used to enable Power Peg but now enabled something else, plus a code change to reinterpret that flag differently.
The situation was caused by a confluence of multiple issues.
The biggest red-flag is that they chose to repurpose a flag! Why? Is it really difficult to add a new flag for a new feature?
Even if the technician was careful not to let prod be out of sync, it is possible that the deployment isn't instantaneous, and that the old code could have run when the repurposed flag was turned on.
Some of the trickiness that comes from HFT and adjacent things like this is you can be working on tremendously powerful hardware but still be miserly about bits because every extra byte you have to cram into a packet is extra latency. The HFT firm I worked for would "re-use flags" in the sense that each packet literally had an 8-bit section called "flags" (and further down in the message, another 8 bit section called flags2 because of course) and each bit in there was a Boolean that could be on or off - a flag. So we weren't reusing flags as much as we were re-allocating what a high bit at that index in flags meant.
We were very conscious of this kind of error though and we managed them like Scrooge counting his farthings.
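For readers who haven't seen this style, here's a small Python sketch of what an 8-bit "flags" field looks like (the bit assignments are invented): each bit is an independent boolean, and "reusing a flag" means giving an existing bit position a new meaning while the packet itself doesn't change at all, which is exactly why a server still interpreting the old meaning is so dangerous.

    # Sketch of an 8-bit "flags" field in a wire message (bit assignments invented).
    from enum import IntFlag

    class OrderFlags(IntFlag):
        IMMEDIATE_OR_CANCEL = 0x01
        POST_ONLY           = 0x02
        HIDDEN              = 0x04
        # Bit 0x08 used to mean LEGACY_ALGO; it was "reused" to mean NEW_ROUTING.
        NEW_ROUTING         = 0x08

    def encode(flags: OrderFlags) -> int:
        return int(flags)              # packs into a single byte on the wire

    def old_binary_interprets(byte: int) -> None:
        # A server still running the old code sees the same bit and happily
        # enables the legacy behavior.
        if byte & 0x08:
            print("old binary: LEGACY_ALGO enabled!")

    wire_byte = encode(OrderFlags.POST_ONLY | OrderFlags.NEW_ROUTING)
    old_binary_interprets(wire_byte)   # prints the legacy interpretation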
- Config gets generated at deploy time and saved as <version>.json. Code downloads the config file matching its own version or fails to start (this is a nice one since rollbacks become very deterministic and code rollback is the same process as config rollback)
- Deploy code first and verify it's updated and working correctly before changing feature flag config (this one is still prone to errors without automation)
The first one can be done on Kubernetes using the kustomize ConfigMap generator. The system I worked on used object storage for the config file and deploy tooling to generate it from a key value store at deploy time.
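A minimal sketch of that "<version>.json" scheme in Python (paths, names, and versions are all invented): the binary knows its own version and will only start with a config generated for exactly that version, which is what makes code rollback and config rollback the same operation.

    # Sketch of version-locked config (hypothetical names): the service refuses to
    # start unless it finds a config generated for its own build version.
    import json, pathlib, sys

    BUILD_VERSION = "2024.06.1"              # baked into the binary at build time
    CONFIG_DIR = pathlib.Path("configs")     # e.g. synced from object storage at deploy time

    def load_config() -> dict:
        path = CONFIG_DIR / f"{BUILD_VERSION}.json"
        if not path.exists():
            sys.exit(f"no config for build {BUILD_VERSION}; refusing to start")
        config = json.loads(path.read_text())
        if config.get("version") != BUILD_VERSION:
            sys.exit(f"{path} was generated for {config.get('version')!r}; refusing to start")
        return config

    # Demo: the deploy tooling would have written this file alongside the release.
    CONFIG_DIR.mkdir(exist_ok=True)
    (CONFIG_DIR / f"{BUILD_VERSION}.json").write_text(
        json.dumps({"version": BUILD_VERSION, "enable_new_routing": True}))
    print(load_config())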
Wild west times! It's worth noting that things have changed a lot in trading systems since then.
When I started working in this domain (2009), it was pretty crazy how unreliable those systems were, on all sides - banks, brokers, exchanges. Frequently you needed to confirm over the phone what quantities got executed, etc.
I remember when the Italian exchange was rolling out their systems, at some point we did "tests" on a mix of production and UAT - if my memory is correct, we were just changing the IPs to connect to for order passing, to test the upcoming release after the market closed. We couldn't just test in their UAT environment, since it was so buggy and half down most of the time.
And let's not even talk about the Excel spreadsheets, with VBA code that would make ChatGPT swear, that were pricing instruments whose traded volumes had a lot of zeros.
It's very different nowadays, in part thanks to stories like this one. Most things are automated, and there is much less of a cowboy attitude.
There are mandatory kill switches, a lot of layers of risk and trading-activity monitoring (on your side, on the exchange side), and really a lot of hard-learned lessons incorporated into the systems. That's also part of the reason why people sometimes tend to be naive about how hard it is to build a good trading system - the strategies are sometimes not really that smart - it's mostly about how to avoid getting killed by something that's outside of usual conditions.
Literally everyone in quant finance knows about Knight Capital. It even has its own phrase: "pulling a Knight Capital" (meaning cutting corners on mission critical systems, even ones that can bankrupt the company in an instant, and experiencing the consequences).
My team's systems play a critical role for several $100M of sales per day, such that if our systems go down for long enough, these sales will be lost. Long enough means at least several hours and in this time frame we can get things back to a good state, often without much external impact.
We too have manual processes in place, but for any manual process we document the rollback steps (before starting) and monitor the deployment. We also separate deployment of code from deployment of features (which is done gradually behind feature flags). We insist that any new features (or modification of code) require a new feature flag; while this is painful and slow, it has helped us avoid risky situations and panic, and has alleviated our ops and on-call burden considerably.
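To make the flag-gating concrete, here is a rough sketch in Python of the kind of gradual rollout meant here (the flag name and percentages are invented): a deterministic hash decides which customers see the new path, so exposure can be ramped from a few percent to 100% without another code deploy.

    # Rough sketch of a gradual, flag-gated rollout (hypothetical flag name).
    import hashlib

    ROLLOUT_PERCENT = {"new_pricing_path": 5}   # dialed up over days via config, not deploys

    def flag_enabled(flag: str, customer_id: str) -> bool:
        digest = hashlib.sha256(f"{flag}:{customer_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < ROLLOUT_PERCENT.get(flag, 0)   # unknown flags default to off

    def price(customer_id: str) -> str:
        if flag_enabled("new_pricing_path", customer_id):
            return "new code path"
        return "old, known-good code path"

    enrolled = sum(flag_enabled("new_pricing_path", f"cust-{i}") for i in range(1000))
    print(enrolled, "of 1000 customers are on the new path")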
For something to go horribly wrong, it would have to fail many "filters" of defects: 1. code review--accidentally introducing a behavioral change without a feature flag (this can happen, e.g. updating dependencies), 2. manual and devo testing (which is hit or miss), 3. something in our deployment fails (luckily this is mostly automated, though as with all distributed systems there are edge cases), 4. rollback fails or is done incorrectly, 5. missing monitoring to alert us that the issue still hasn't been fixed, 6. we fail to escalate the issue in time to higher levels, 7. enough time passes that we miss out on the ability to meet our SLA, etc.
For any riskier manual changes we can also require two people to make the change (one points out what's being changed over a video call, the other verifies).
If you're dealing with a system where your SLA is in minutes, and changes are irreversible, you need to know how to practically monitor and roll back within minutes, and if you're doing something new and manual, you need to quadruple check everything and have someone else watching you make the change, or it's only a matter of time before enough things go wrong in a row and you can't fix it. It doesn't matter how good or smart you are, mistakes will always happen when people have to manually make or initiate a change, and that chance of making mistakes needs to be built into your change management process.
>My team's systems play a critical role for several $100M of sales per day, such that if our systems go down for long enough, these sales will be lost.
Would they? Or would they just happen later? In a lot of cases in regular commerce, or even B2B, the same sales can often be attempted again by the client a little later; it's not "now or never". As a user I have retried things I wanted to buy when a vendor was down (usually because of a new announcement and big demand breaking their servers) or when my bank had some maintenance issue, and so on.
It's both (though I would lean towards lost for a majority of them). It's also true that the longer the outage, the greater the impact, and you have to take into account knock-on effects such as loss of customer trust. Since these are elastic consumer goods, and ours isn't the only marketplace, customers have choice. Customers will typically compare price, then speed.
It's also probably true that a one-day outage would have a negative net present value (taking into account all future sales) far exceeding the daily loss in sales, due to loss of customer goodwill.
It would be a serious issue for in person transactions like shops, supermarkets, gas stations, etc
Imagine Walmart or Costco or Chevron centralised payment services went down for 30+ mins. You would get a lot of lost sales from those who don't carry enough cash to cover it otherwise. Maybe a retailer might have a zip-zap machine, but lots of cards aren't embossed these days, so that's a non-starter too.
Not just lost sales. I've seen a Walmart lose all ability to do credit card sales and after about 5 minutes maybe 10% of people waiting just started leaving with their groceries in their cart and a middle finger raised to the security telling them to stop.
It depends on the business. It's not uncommon for clients to execute against different institutions' systems, and they can/would re-route flow to someone else if you're down.
Think less "buying a car" and more "buying a pint of milk". If you're buying a car and the store is closed, you might come back the next day. If you're buying milk you will just go to the store down the street.
I imagine it's the same with time-based or opportunistic businesses. If the shopping channel (assuming it runs around the clock) couldn't process orders, they'd have to decide if they want to forgo selling other products to rerun the missed ones.
For certain types of entertainment like movies or sports, the sale may no longer be relevant.
The real issue here (sorry for true Scotsman-ing) is that they were using an untested combination of configuration and binary release. Configuration and binaries can be rolled out in lockstep, preventing this class of issues.
Of course there were other mistakes here etc., but the issue wouldn't have been possible if this weren't the case.
> why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point
It's not the worst mistake in the story, but it's not "not the point." A proactive approach to pruning dead functionality would have resulted in a less complex, better-understood piece of software with less potential to go haywire. Driving relentlessly forward without doing this kind of maintenance work is a risk, calculated or otherwise.
It’s fine to have that kind of responsibility, but it has to actually be your responsibility. Which means you have to be empowered to say “no, we aren’t shipping this until XYZ is fixed” even if XYZ will take another two years to build and the boss wants to ship tomorrow.
As a profit non-taker, what responsibility can a worker even have? Realistically it lies in the range of their monthly paycheck and pending bonuses, and in a moral obligation to operate a failing system until it lands somewhere. Everything above that is a systemic risk for the profit taker, which, if left unaddressed, is absolutely on them. There's no way you can take responsibility for $400M unless you have that money.
It’s not scary when it’s done properly. And done properly can look like an incredibly tedious job. I think it’s for a certain kind of person who loves the process and the tests and the simulators and the redundancy. Where only 1% of the engineering is the code that flies the plane.
It feels anxiety inducing at first, but if you have good controls and monitoring in place, it becomes daily routine. You basically address the points of concern you naturally have, and the more reasonably anxious you are, the better for the business. From my experience with finance, I'd wager that the problem at Knight was 10% tech issues, 90% a CTO-ish person feeling ballsy. In general, that is, not just on that exact day or week.
I don't know if it's like this at every company, but typically there are plenty of humans keeping a close eye on what's going on whenever the software is placing orders on an exchange.
I've worked in various small to medium IT companies, a FAANG and another fortune 500 tech company. 6 months ago I moved to a proprietary trading company/market maker and it's the most interesting and satisfying place I've worked so far.
I hope to continue to "waste my life" for many years to come.
Actually it's one of the few truly intellectually-pure endeavors. Everything else is the same pursuit with extra steps:
Make a trading strategy to make money
vs
Make a cutting edge machine learning classifier to back out latent meaning in search queries to produce better search results to drive more traffic to google to sell ads to make money
You're not wrong, but the problem is those steps are also the steps that produce food, or improve health, or solve climate change or solve any of the innumerable problems we face as a society. As you identify, there are plenty of pursuits other than finance that are not particularly socially useful - it's not a very exclusive club.
I work in finance/tech, but my code doesn't execute million-dollar trades automatically. Cash transactions are reviewed by a human, and it's mostly data analysis type work.
It's one thing to make recommendations or calculations and give the report to a human. It's another to start trading high volume in real time automatically.
Most jobs do in fact contribute to the well being of humanity, however little. It's few jobs, like most in financial trading, that actively reduce the well being of humanity.
Never will you meet a more self-deluded and pathetic set of humans. Desperate money addicts that often become other kinds of addicts. Whole thing should be abolished.
Source: I worked in finance when I was young and dumb.
> Most jobs do in fact contribute to the well being of humanity, however little.
No, they don't. A lot of jobs hold us back, actually. Salespeople selling things people don't wanna buy, finance and tech bros vampirizing third world countries without the safeguards that western countries have on their capital markets, etc.
Things are somewhat different now than 5, 10, 20 years ago.
There has been a wave of "individual accountability regimes" released by pretty much every regulator.
I have worked with the SFC the most, so that's what I will describe here, but all these regulations are pretty much copy/paste of each other anyway.
I was an MIC under the SFC (HK) for various operational and financial responsibilities for approx 8 years, responsible for close to $3B exposure across equities, IRS & FX, and have now been licensed with the FCA (UK) for approx 1 year.
Basically, on top of the usual regulatory framework defining a top level Operating Officer (MOO) and subordinate Responsible Officers (RO), the new individual accountability regime creates the notion of Managers In Charge (MIC).
The MICs fill the gap that, increasingly, a considerable amount of operational responsibility lies in the hands of non-licensed individuals (i.e. tech people).
The SFC defines a number of responsibilities (e.g. DRP/BCP, kill switches, backups, failovers, rollbacks, load testing, etc.) and these responsibilities need to be allocated to one or more of the appointed MICs.
The SFC has a right to reject an appointment of MIC if the individual is not seen as fit and proper (that is assessed generally on an annual basis by a compliance officer, but can be re-assessed on the spot if you end up displaying unfit traits). The SFC also mandates a track record of experience and expertise on the assigned responsibilities, as well as a direct capability by the MIC to have control on his responsibilities. In clear terms, that means you need to have the actual power of saying "no", you need to have the power to hire someone if that is necessary for the safety of the operations, etc.
Once you get appointed as MIC, most of your responsibilities are based on _means_, not _end results_:
If Karen breaks production, that's not much of your problem (regulatorily speaking) as long as you can demonstrate that you had Karen attend 6h of training this year on how not to break production.
In terms of actual developer experience, the _means_ often take the form of trainings, code review, pre prod impact assessment, incident reporting procedures, etc.
So on one hand you have a very heavy personal and professional responsibility. But on the other hand you are at fault only if you did not set up a proper framework for things to work.
In terms of the professional responsibility, there is not much to do if you are deemed guilty. You will most likely be temporarily or permanently barred from having a licensed position. Nobody will hire you anyway.
For the personal responsibility, it is usually limited to single digit millions, and most big asset managers have insurance to protect you (otherwise no one would accept the role).
If you are interested in the actual additional responsibilities that were added after KC, then I suggest you have a look at MiFID II (the European regulation, well written and understandable), especially segment RTS 6, "Technical standards specifying the organisational requirements of investment firms engaged in algorithmic trading":
I feel like the first thing I would build into any automated trading system is a kill switch. Then every single diff or pull request I add would have some sort of automated testing to ensure the kill switch still works. Also, I'd manually flip it on/off once a day to make sure it works for real. That seems like the single most important thing to build and make sure works. Or is the system too complex for something like this and I don't understand the domain well enough?
Most systems do. At pastjob we had a few different levels:
- halt - just stop trading
- exit-only - only exit positions (but do so according to our alphas, no hurry)
- flatten - exit in a hurry but obey certain limits (often if liquidity was thin we would just "journal" the shares - move them to a long-term-hold (meaning more than the current day) account to exit in the opening auction the next day)
- market exit - get the fuck out, now, no matter what the cost.
Depending on what you're doing, a straight "pull the plug and stop trading" could leave you with eg unhedged positions that blow through your risk limits. But when your ability to actually execute those trades sensibly is broken regardless, yeah, you're still going to want to hit that button.
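A toy sketch of how those levels might be wired up in Python (the level names come from the list above; everything else is invented). The important property is that each level is a single, pre-tested action anyone on the desk can trigger.

    # Toy sketch of escalating kill-switch levels (implementation details invented).
    from enum import Enum, auto

    class KillLevel(Enum):
        HALT = auto()         # stop sending new orders
        EXIT_ONLY = auto()    # only reduce positions, at our leisure
        FLATTEN = auto()      # exit quickly but within limits
        MARKET_EXIT = auto()  # get out now, whatever the cost

    class TradingEngine:
        def __init__(self):
            self.level = None

        def trigger(self, level: KillLevel, who: str) -> None:
            # One call, no confirmation dialogs: speed matters more than ceremony.
            self.level = level
            print(f"{who} triggered {level.name}")

        def allow_order(self, increases_position: bool) -> bool:
            if self.level is None:
                return True                    # normal trading
            if self.level is KillLevel.HALT:
                return False                   # nothing goes out
            return not increases_position      # exit-only / flatten / market-exit

    engine = TradingEngine()
    engine.trigger(KillLevel.HALT, who="ops")
    print(engine.allow_order(increases_position=True))   # False: no new risk after HALT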
Hold on. Are we blaming the plane crash on the pilot here? It seems there was so much other stuff wrong with this company in the first place that such a deployment would tank it.
No kill switch - there literally needs to be a power switch and a trader who runs to the room and flips it. A ridiculously small amount of cash for the trading volume, and no way to borrow more to stay in business (or rather, borrowing required manual intervention not accessible to the trading system). Obviously also the decision to leave that code in there, and for there to be a config setting to bring it back.
Then the devops stuff - rollback plans, approvals, pairing on deployments, etc.
I would argue the real issue was the lack of an automated system (or multiple automated systems) that would hit the kill switch if the trading activity didn’t look right.
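A deliberately simple sketch in Python of what such a watchdog can look like (all thresholds are invented; real systems layer many such checks on both the firm and exchange side): compare realized activity against sanity limits and trip the switch automatically.

    # Sketch of an automated watchdog that trips the kill switch when trading
    # activity doesn't look right (all thresholds are invented).
    from dataclasses import dataclass

    @dataclass
    class ActivitySnapshot:
        orders_last_minute: int
        notional_last_minute: float   # dollars traded in the last minute
        realized_pnl_today: float

    LIMITS = {
        "max_orders_per_minute": 5_000,
        "max_notional_per_minute": 50_000_000.0,
        "max_daily_loss": -2_000_000.0,
    }

    def breached(snap: ActivitySnapshot) -> list:
        reasons = []
        if snap.orders_last_minute > LIMITS["max_orders_per_minute"]:
            reasons.append("order rate")
        if snap.notional_last_minute > LIMITS["max_notional_per_minute"]:
            reasons.append("notional rate")
        if snap.realized_pnl_today < LIMITS["max_daily_loss"]:
            reasons.append("daily loss")
        return reasons

    def watchdog(snap: ActivitySnapshot, trip_kill_switch) -> None:
        reasons = breached(snap)
        if reasons:
            trip_kill_switch(reasons)   # err on the side of stopping; humans investigate

    watchdog(ActivitySnapshot(120_000, 9e8, -30e6),
             lambda reasons: print("KILL SWITCH:", ", ".join(reasons)))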
Yes definitely, one has to assume that from time to time, bugs will reach the prod servers, no amount of tests and code review can completely prevent that.
Hopefully the kill switch system is reasonably easy to code review and test :-)
Sure, but there is always the possibility that you then shut down trading when things _aren't_ broken.
There are always two error rates.
Defining behavior is great for retrospective analysis but would you really feel comfortable putting hard cuts into production based on the answers to those questions? I’m genuinely asking, because IME I wouldn’t be.
That last nine in a trading system uptime has exponentially low value unless you have customers who care quite a lot.
Seriously, suppose you have a truly awesome system making $100B per year of revenue. If you unnecessarily shut down 0.1% of the time, that’s only $100M per year lost, and an 0.1% unnecessary shutdown rate seems pretty high.
Estimate what a real human can do in a day, and use that as the limits. Verify that the system behaves ok for some time, then scale up the desired trading volume and limits, observe, scale, repeat.
But you don't do it by making a (bad) guess up front and then just leaving it at that.
There's definitely more to this story. Why was there a fixed number of "flags" so that they needed to be reused? I wish there was a true technical explanation.
I can only think that it was some kind of fixed binary blob of 1/0 flags where all the positions had been used umpteen times over the years and nobody wanted to mess with the system to replace it with something better.
this is what stood out to me reading the story. i wonder if there was a reason why they opted for this, however half-baked.
it reads less to me like a case for devops as it does a case for better practices at every stage of development. how arrogant or willfully ignorant do you have to be to operate like this considering what’s at stake?
They probably already had a bitfield of feature flags, maybe it was a 16-bit integer and full, and someone notices "hey this one is old, we can reuse it and not have to change the datatype"
This incident highlights a problem that is often overlooked in the debate about feature branches versus feature toggles.
I've worked with both feature branches and feature toggles, and while long lived feature branches can be painful to work with what with all the conflicts, they do have the advantage that problems tend to be uncovered and resolved in development before they hit production.
When feature toggles go wrong, on the other hand, they go wrong in production -- sometimes, as was the case here, with catastrophic results. I've always been nervous about the fact that feature toggles and trunk based development mean merging code into main that you know for a fact to be buggy, immature, insufficiently tested and in some cases knowingly broken. If the feature toggles themselves are buggy and don't cleanly separate your production code from your development code, you're asking for trouble.
This particular case had an additional problem: they were repurposing an existing feature toggle for something else. That's just asking for trouble.
That's interesting. Whenever I have an issue with a flag it gets picked up on dev/test/uat environments (everything gets tested, especially around the code behaving the same as before with the flag off). The code change never reaches production. And if for some reason the code under the flag is wrong and has reached production (something unexpected, unseen), undoing the change takes however long it takes to switch the flag back (plus however long the cache takes to update, if you have one).
That's a good approach if you can cleanly separate out the old code from the new code, and if you can make sure that you've got all the old functionality behind the switch. Unfortunately this can be difficult at times. Feature toggles involving UI elements, third party services or legacy code can be difficult to test automatically, for example. Another risk is accidental exposure: if a feature toggle gets switched on prematurely for whatever reason, you'll end up with broken code in production.
The cases where I've experienced problems with feature toggles have been where we thought we were swapping out all the functionality but it later turned out that due to some subtleties or nuances with the system that we weren't familiar with, we had overlooked something or other.
Feature toggles sound like a less painful way of managing changes, but you really need to have a disciplined team, a well architected codebase, comprehensive test coverage and a solid switching infrastructure to avoid getting into trouble with them. My personal recommendation is to ask the question, "What would be the damage that would happen if this feature were switched on prematurely?" and if it's not a risk you're prepared to take, that's when to move to a separate branch.
Having worked in some Fortune 500 financial firms and low rent “fintech” upstarts, I am not surprised this happened. Decades of bandaid fixes, years of rotating out different consultants/contractors, and software rot. Plus years of emphasizing mid level management over software quality.
As others have mentioned, I don't think "automation of deployment" would have prevented this company's inevitable downfall. If it wasn't this one incident in 2012, then it would have been another incident later on.
It's an entire industry built on adrenaline, bravado, and let's be honest: testosterone. How could their IT discipline be described as anything other than "YOLO"?
Trading is mostly based on a book that, like the waterfall model, was meant to be a cautionary tale on how not to do things. Liar's Poker had the exact opposite effect of Silent Spring. Imagine if Rachel Carson's book came out and people decided that a career in pesticides was more glamorous than being a doctor or a lawyer, we made movies glorifying spraying pesticides everywhere and on everything, and we told anyone who thought you were crazy that they're a jealous loser and to fuck off.
They were probably worth much more than $400M before the failure so it was a good investment opportunity. They would have been a money printing machine aside from this one major fuckup.
The nuance is a) what happens to existing equity stakeholders and b) does the bailout have to be repaid.
If the answer is nothing and no, then it’s a bailout philosophically. If the existing investors get diluted then they’re in part paying for the new capital injection.
A government bail out isn't the exclusive use of the phrase "bail out", it was both a bail out and an opportunity for investors to get great terms on equity.
> Had Knight implemented an automated deployment system – complete with configuration, deployment and test automation – the error that cause the Knightmare would have been avoided.
Would it have been avoided though? Configuration, deployment and test automation mean nothing if they don't do what they are supposed to do. Regardless of how many tests you have, if you don't test for the right stuff it's all useless.
The specific part is configuration as code. So the config change (flag activation) and code change (flag calling) would have been synchronized.
And there wouldn't have been one server out of 8 running a different build for any meaningful time; and even if the deploy had failed on that one server, it would have been obvious.
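A hedged sketch of what that synchronization might look like in practice: a pre-flight check that refuses to flip the flag until every server reports a build that actually implements it. The flag name, server names and build numbers are all made up:

    # Hypothetical pre-flight check for a config-as-code pipeline: the flag flip and
    # the code implementing it ship together, and the flip is blocked while any
    # server still reports an older build.
    MIN_BUILD_FOR_FLAG = {"use_new_router": 4821}    # flag -> first build that implements it

    def safe_to_enable(flag, reported_builds):
        """reported_builds maps server name -> build number it is actually running."""
        required = MIN_BUILD_FOR_FLAG[flag]
        stale = {srv: build for srv, build in reported_builds.items() if build < required}
        if stale:
            # One stale server out of eight is exactly the Knight failure mode; here it
            # blocks the flag flip instead of being silently reinterpreted.
            print(f"refusing to enable {flag}: stale servers {stale}")
            return False
        return True

    safe_to_enable("use_new_router",
                   {"srv1": 4821, "srv2": 4821, "srv3": 4821, "srv4": 4821,
                    "srv5": 4821, "srv6": 4821, "srv7": 4821, "srv8": 4799})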
That's based on the assumption that someone would have thought about testing that particular flag for that particular scenario.
In my view this would only have been caught by a deployment to an identical copy of production, with running, simulated transactions and high level functional testing. Testing each individual config value and every scenario where it may be used is playing whack-a-mole. Basically, I'd make a clone of prod, simulate everything that happens externally (APIs, etc.) and observe transaction KPIs and other high level business indicators. Testing for tech is ensuring that the tech works, and sometimes that means testing that it's broken.
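Roughly what I have in mind, as a sketch; the KPI names and ranges are invented:

    # Hypothetical KPI gate for a prod clone running simulated order flow: instead of
    # enumerating every flag/scenario combination, compare coarse business indicators
    # against ranges derived from recent real days.
    EXPECTED_RANGES = {
        "orders_per_minute":          (500, 20_000),
        "fill_ratio":                 (0.10, 0.90),
        "gross_notional_per_minute":  (100_000, 5_000_000),
    }

    def kpis_look_sane(observed):
        for name, (lo, hi) in EXPECTED_RANGES.items():
            value = observed.get(name)
            if value is None or not (lo <= value <= hi):
                print(f"KPI {name}={value} outside expected range [{lo}, {hi}]")
                return False
        return True

    # A runaway re-ordering bug shows up here as an absurd order rate, even if every
    # individual order is perfectly well-formed.
    kpis_look_sane({"orders_per_minute": 4_000_000, "fill_ratio": 0.99,
                    "gross_notional_per_minute": 900_000_000})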
2. As I mentioned above, I went to work at Knight as a DevOps engineer on a team that dealt directly with the team mentioned in the blog post.
There are lots of stories around this but I will share this one:
Late 2012 is when Apple rolled out the "emergency weather notification" function. I was in the office and the notification went off on multiple people's phones. Knight was also experimenting with call notifications.
So when the alert goes off, someone yells "God damn it! Not again!!" (thinking there was another big outage)
3. People outside of finance have no idea of the different types of outage that can happen due to all sorts of factors.
4. In finance in general, the amount of legacy code that behaves in weird ways or was written by someone 10 years ago who is no longer with the firm is ASTOUNDING.
Couple that with the billions of combinations of regulations, internal controls, multiple countries and jurisdictions, etc., and accounting for every single edge case becomes impossible. To use an infosec term, the "attack surface" of possible user actions that could lead to bugs is enormous.
Typical case:
- User says they want to see reports for a couple days worth of trading for all securities
- User also says they want to see FULL history for one security
- User never says they might want to see FULL history for ALL securities at the same time
- This being HN, someone will say "you should have thought of that"
- Sure, but then they pull only some of the history for a Ukrainian bond that has a 182-day (not 180 like most) period. This is the only example of this type of bond. Ever. Did you think of that? What should the system have done?
- And oh, btw, this system was pushed out quickly due to regulatory pressure, etc.
I would be interested to read these stories, but the twitter links only show a single tweet ending in the phrase "A thread." Perhaps this is a new feature of X whereby only logged-in users can see a tweet and its replies.
Much as I enjoy articles that reinforce my existing beliefs, high-frequency trading is a pretty extreme example when it comes to how badly things can go in a short time.
Their issue was neglecting an automated SCRAM system that would halt all the trading, or at the very least alerting plus manual intervention. The article touches on that. There was no excuse why the system wasn't halted by 9:32, which would've avoided most of the kerfuffle.
>They had 48-hours to raise the capital necessary to cover their losses (which they managed to do with a $400 million investment from around a half-dozen investors).
I'm very curious about this bit. How exactly do you raise $400m of "investment" to cover such a massive footgun, in 48 hours, when you haven't even had time to understand what happened or whether it would happen again?
Why are people stumping up hundreds of millions of cash here?
It is funny, but in one company I was working for, the more people they added the more they neglected all basics, such as backups. There were heavy processes for many things and they were followed very well, but for whatever reasons some really basic things went unnoticed for many years.
I refuse to believe that failed deployment can bring a company down. That is just a symptom. The root cause has to be a whole big collection of decisions and processes/systems built over years.
I see a lot of criticism of the deployment, but why did the developers "repurpose an old flag" that activates code that had been dead for 8 years, that you haven't deleted, and whose current functionality is completely unknown? That seems like the strangest decision made in this debacle.
To save time, I guess. They deleted the inactive code, so, why not, they thought. But then they forgot to deploy that change (to one server).
Bugs and configuration errors will happen from time to time, and might look silly in retrospect. But the real problem was, I think, that there was no kill switch (something managers and tech leads should have decided to add long ago).
This has nothing to do with “DevOps”, and I am getting tired of this word.
This mistake could have been prevented on multiple levels, and in my experience, deployments that involve major architectural changes are rarely repeatable or fully automatable.
Changes we make to software and hardware infrastructure are essentially hypotheses. They're backed by evidence suggesting that these modifications will achieve our intended objectives.
What's crucial is to assess how accurately your hypothesis reflects the reality once it's been implemented. Above all, it's important to establish an instance that would definitively disprove your hypothesis - an event that wouldn't occur if your hypothesis holds true.
Harnessing this viewpoint can help you sidestep a multitude of issues.
> (why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point).
Actually it's a big part of the point: they have a system that works with dead code in it. If you remove that dead code perhaps it unwittingly breaks something else.
That kind of Chesterton's-fence thinking is good practice.
Chesterton's Fence states that you shouldn't make a change until you understand something's current state. Removing code because it's dead is folly, if you don't understand 1) why it's there, and 2) why nobody else removed it yet.
As this is a postmortem, it was proven dead code. There is nothing in the text that mentions that they didn't know what the code did (which then wouldn't be dead code).
It may not be obvious that it's dead code - in a lot of popular interpreted languages, it's impossible to tell if a given function can be called or not
Your original comment is somewhat unclear. Are you advocating for leaving old code in because the system works and it's more stable that way, or taking it out to force the necessary refactoring steps and understanding that will bring?
I'm sorry I wasn't clear: I re-read my comment and couldn't think of a decent edit.
It was the author whom I was quoting as saying "why would someone have old code lying around." It seems obvious why that's a good idea and it seems commenters in this thread (including you) agree with me and not the author.
I don't exactly understand what this has to do with continuous delivery, but maybe I just don't know enough about this topic.
Wouldn't it have been best to set up a 'shadow infrastructure' and route every trade into it for several weeks/months to verify the correctness of the system?
I worked in fintech for a few years. I'll never again work on software that's responsible for trading, you could offer $1M/year and I wouldn't take it. By far the most stress I've ever experienced at a job.
While nice, automated deployment is the wrong lesson here; the real lessons are the failure to anticipate backwards incompatibility, and poor alerting and incident training.
Flags should never be reused and should be retired after they're no longer useful.
> Flags should never be reused and should be retired after they're no longer useful.
That's such a "no-brainer," that I don't think it's even written down, anywhere.
When I read that, I was like, "Whut?"
In the Days of Yore, when we hammered programs directly into the iron as Machine Code, we would do stuff like that, but I can't even imagine doing that with any halfway modern language. They don't say, but it's probably C++. I know that's popular for HFT.
Not so simple.
The company was then used as a building block to create another entity, which was then acquired for over a billion dollars.
"The company agreed to be acquired by Getco LLC in December 2012 after an August 2012 trading error lost $460 million. The merger was completed in July 2013, forming KCG Holdings.
...On April 20, 2017, KCG announced that it had agreed to be acquired by Virtu Financial for $20 per share in cash in a deal valued at approximately $1.4 billion."
Focusing on deployments is too narrow. Deployment can be automatic but still have a botched config.
In this context it's more useful to think in terms of production principles. The principle that was poorly followed was defence in depth. There was no line of defence after the deployment.
This is the Ur “devops fuckup” tale - I’ve told this to junior engineers who’ve bodged a deploy to make them feel better. I’ve been in this field for 20 years, and I can’t imagine I’ll ever have a day as bad as the engineers who got bit by this fuckup.
Not removing old code is akin to never throwing away food, even after it reaches its expiration date. Sure, you'll have it around next time you need it, but putting year-old yeast into your baguettes is, well, a recipe for disaster.
With git, yes. When I end up working with programmers who come from using other vcs, I find they are often the ones who don't delete code, or at best comment it out. I encourage them to trust git. It takes effort (barring a system failure out of git's control) to lose code. It can take some digging to find code in the git history, but it's there. Even if you run `git gc`, the objects are still there in the repo, by default. Even in extremis, if the central repo is gone, whoever has the most recent checkout still has history.
Imagine there was some way for a trading company to execute billions of dollars of trades and they say "ooops, sorry, that was all a mistake" can you not see how that would be abused?
Now, the story also says that within a minute of the market opening, the experienced traders knew something was wrong. Do they bear any culpability for jumping on those trades, making their money off of something they knew couldn't be intentional?
This isn’t really correct. Typically exchanges have safety parameters which market makers can set according to how they wish to trade, and if you exceed those your orders will no longer be accepted and existing orders may also be pulled.
Obviously there are false positives occasionally and there is typically communication between the exchange and the market maker to ensure those don’t reoccur.
Automation is not a silver bullet. Automation is still designed by humans. Peer reviews, acceptance test procedures, promotion procedures, etc all would have helped. And yes some of those things are manual. Sandbox environments, etc
Sometimes I wonder whether these events are more sinister than they appear to be. But then I heard that another MM is using Access applications to make markets for options, and I think it's just incompetence.
Ah, Knight Capital. The warning story for every quant trader / engineer.
This is what people don't realize when they say HFT (high frequency trading) is risk-free, leeching off people, etc.
You make a million every day with very little volatility (the traditional way of quantifying "risk" in finance) but one little mistake, and you're gone. The technical term is "picking up pennies in front of a steamroller (train)". Selling options is also like that.
In this case KCG was doing the opposite of making markets --- they were taking --- they were eating the spread over and over and over again until they ran out of money.
This entire story is about a trading firm that lost 400m trying to provide market liquidity. Which part of the loads of risk isn't clear in this context?
If a seller and a buyer are in market within seconds of each other, they would have traded successfully without a third party taking some of their money. As I understand it, HFTs are trying to avoid taking meaningful long-term positions (which is why latency matters to only them).
What risk are they taking exactly? Bugs ruining the business isn't meaningful risk for the customer. It isn't like day traders are at risk of going bankrupt due to that after all.
They claim liquidity is their value but given how they act they don't seem to be providing measurable liquidity, either in terms of price or volume. (Yes they increase volume by getting in the middle of trades but that isn't useful volume...)
Market risk isn't the only type of risk. Many businesses in other industries don't have market risk, that isn't abnormal. Even businesses that you would expect to be exposed to market risk aren't, since they hedge most or all of it.
There's operational risk, like what brought down Knight Capital, that's a type of risk. Or the risk that you will be put out of business by competition because you were too slow to innovate while burning through all your cash runway. HFT firms face the same risks that other types of businesses face. Smaller HFT firms fail often, and larger firms tend to stay around (although sometimes they also fail and often they shrink), which is similar to many mature competitive industries.
> given how they act they don't seem to be providing measurable liquidity
I'm not sure "How they act" should inform one's perspective on the empirical question of whether or not they are adding to liquidity. There is a lot of serious debate and research that has gone into that question.
How they act is the hyper-focus on being first to market: HFT wants to have the first buy or sell order at price X.
Being first to market does not impact liquidity availability. After all someone else has an order at that price already.
My point about risk is that, beyond going long or short for a meaningful amount of time (certainly not seconds, probably not minutes), trading quickly isn't hugely impactful on end users. Thus all of the downsides of trading quickly aren't offset by any risk reduction for them.
Depends on whether they truly take on the risk. Interestingly I can’t clearly tell from a quick google who exactly ended up holding the bag here, and what became of upper management.
Most people confuse market making/risk holding with high frequency statistical arbitrage strategies. I'm not totally sure exactly what Knight Capital was running, but generally the only "little" mistakes that would cause HFT market takers such as Jump (for the most part) to blow up are some type of egregious technical error like this, or some type of assumption violation outside of market conditions (legal, structural, etc.). Compare this to market makers like Jane Street who hold market risk in exchange for EV, and thus could lose money just based off of market swings (not to blowup levels if they know what they're doing), and you can see the difference between the styles.
I'm a proponent of both. But generally I hold more respect for actual market makers who hold positions and can warehouse risk.
There are plenty of bad options traders, particularly retail...but this is an oversimplification. You can buy an index fund, and it can go to zero (however unlikely). You're not guaranteed any return, whereas at least selling an option has some guaranteed fixed premium.
Professional options traders are incredibly sophisticated, and most of the tail risk is offloaded to people who are always long biased. Options as a whole massively improve price discovery in markets.
lol. No. Deployments were not the issue. At any given time an automated deployment system could have had a mistake introduced that resulted in bad code being sent to the system. It does not matter if it was old or new code. Any code could have had this bug.
The issue, and it's one that I see often: firstly, no visibility into the system. Not even a dashboard showing the software's running version. How often I see people ship software without a banner posting its version and/or an endpoint that simply reports the version.
Secondly no god damn kill switch. You are working with money!! Shutting down has to be an option.
Oh god. I just realized this is a PM. A blight on software engineering. People who play technical, and "take the requirements from the customer to the engineer". What's worse is when they play engineer too.
I mean, it makes no sense. Without even reading the article, just by working in IT, I can tell you that if you're one deployment away from bankruptcy then you're either doing it wrong or in the wrong business.
They were market makers, which is different. They help ensure that when you push sell on E*trade you actually get a price somewhat close to your order in a relatively short time. No need to call up a broker who will route the order to a guy shouting on the floor.
But ChatGPT would have fixed the issue faster in 45 mins than a human would. /s
A high-risk situation like this makes using an LLM a non-starter; I'm saying that before someone puts out a 'use case' for an LLM fixing this issue.
I'm sorry to preempt the thought of it in advance, but it would not have.
No they're preempting someone coming along and claiming this. Haven't seen it in the replies yet but there's typically one (or a lot in some cases) person(s) claiming ChatGPT will bring Jesus back from the dead sort of thing.
That's an odd non sequitur strawman you constructed to knock down. Did someone suggest LLMs as the solution, or are you just asserting superiority over an imaginary guy?
As expected, it seems many here (even you) couldn't figure out what the '/s' means, even though I preempted it in advance before anyone came along and tried to claim it anyway.
So even putting the '/s' to denote sarcastic intent doesn't work on HN. Can't even take a joke here.
Substitute "a developer forgot to upload the code to one of the servers" for "the deployment agent errored while downloading the new binary/code onto the server and a bug in the agent prevented the error from being surfaced." Now you have the same failure mode, and the impact happens even faster.
The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.