I'm not sure how automated deployments would have solved this problem. In fact, if anything, it would have magnified the impact and fallout of the problem.
Substitute "a developer forgot to upload the code to one of the servers" for "the deployment agent errored while downloading the new binary/code onto the server and a bug in the agent prevented the error from being surfaced." Now you have the same failure mode, and the impact happens even faster.
The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.
> The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.
The blame completely lies with the risk management team.
The market knew there was a terrible problem, Knight knew there was a problem, yet it took 45 minutes of trying various hotfixes before they ceased trading, either because they didn't have a kill switch, or because no one was empowered to pull it given the opportunity cost (perhaps pulling the switch at the wrong time costs $500k in opportunity).
I worked for a competitor to Knight at the time, and we deployed terrible bugs to production all the time, and during post-mortems we couldn't fathom the same thing happening to us. A dozen automated systems would have kicked in to stop individual trades, and any senior trader or operations person could have had a kill switch pulled with 60 seconds of dialogue, without fearing the repercussions. Actually, we made way less off Knight's $400m than we could have, because our risk systems kept shutting strategies down because what was happening was "too good to be true".
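To give a sense of scale, a guard like that is almost trivially simple. A minimal Python sketch with invented names and thresholds (not our actual risk system): halt a strategy when results deviate wildly from expectations in either direction, because suspiciously good gets treated the same as suspiciously bad.

```python
# Minimal sketch of a "too good to be true" guard -- fields, names, and
# thresholds are invented for illustration, not a real risk system.
from dataclasses import dataclass

@dataclass
class StrategyStats:
    expected_pnl_per_min: float    # what the model predicts under normal conditions
    realized_pnl_per_min: float    # what we are actually seeing right now
    open_position_notional: float  # absolute exposure currently on the book

def should_halt(stats: StrategyStats,
                pnl_deviation_limit: float = 10.0,
                max_notional: float = 5_000_000.0) -> bool:
    """Halt if results deviate wildly from expectations in EITHER direction,
    or if exposure blows past a hard cap."""
    deviation = abs(stats.realized_pnl_per_min - stats.expected_pnl_per_min)
    tolerance = pnl_deviation_limit * max(abs(stats.expected_pnl_per_min), 1.0)
    return deviation > tolerance or abs(stats.open_position_notional) > max_notional

# "Too good to be true": making 800x the expected PnL trips the halt just like losing it.
print(should_halt(StrategyStats(50.0, 40_000.0, 1_000_000.0)))  # True
```

The manual kill switch is then just the same halt path, triggered by a person instead of the monitor.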
It’s nice to see your perspective as someone familiar with better systems.
I have always found this story fascinating; in my junior days I worked at a relatively big adtech platform (i.e. billions of impressions per day) and as cowboy as we were about lots of things, all our systems always had kill switches that could stop spending money, and I could have pulled them with minimal red tape if I suspected something was wrong.
And this was for a platform where our max loss for an hour would have hurt but not killed the business (maybe a six figure loss), I can’t imagine not having layers of risk management systems in HFT software.
They were asleep at the wheel, not unlike all the random brokerages that blew up when the Swiss central bank pulled the CHF peg in 2015.
This is a culture problem - as soon as you load up your trading firm with a bunch of software industry hires, you end up with jiras and change management workflows instead of people on deck that have context for what they're doing. That's the only way to explain reverse scalping for 45 mins straight.
The CHF de-peg wasn't really technology risk. Brokers lost money because they undervalued CHF/EUR risk, undervalued liquidity risk (stop orders were executing FAR worse than expected, or simply failing to execute at all), and didn't pay attention to the legal protections afforded to their customers (customer balances went negative but there was no way to recover that money from the customers). These brokers would have had the same problems even if using pen & paper, they failed to plan (or alternatively, made a conscious bet and lost).
I think it is worth saying that no one saw that de-peg coming. Absolutely no one. Sure, there are some crazies who saw it coming, but that same camp is still waiting for the HKD-USD de-peg. It was a shock to everyone on Wall Street. I am a bit surprised that the Swiss National Bank didn't tip off their own banks before doing it. Both UBS and Credit Suisse were seriously caught off guard when it happened.
That's fair. I wasn't close enough to see how "surprising" it was. The point stands that it looks nothing like HFT active trading risk. Knight Capital created such amazing training material for the industry: what went wrong with their software, how many decisions or practices could have prevented or narrowed the risk, what went wrong at trade time, and how they had a clear window to pull the plug but died due to indecision and inaction.
> as soon as you load up your trading firm with a bunch of software industry hires
As a software industry hire at a hedge fund right now... I'd love to see more cross-pollination, because there are so many good things happening on both sides, and so many terrible things happening through just a sheer lack of knowledge.
Change management workflows are great and should be used more in finance. But software companies should implement andon cord systems more often (Amazon does; nowhere else I've worked gives that power to anybody at the company).
I messed around with the idea of a physical big red button kill switch to shut down market making; the IT people thought I was joking - the trading desk just assumed that it was in the design from day 1.
> or because no one was empowered to pull the kill switch because of the opportunity cost (perhaps pulling the switch at the wrong time costs $500k in opportunity).
Isn't the problem that pulling the plug on a trading bot doesn't just have opportunity costs, but may also leave you with open positions that, depending on the kind of trades you're doing and the way the market is moving, could be arbitrarily expensive to unwind?
> Actually, we made way less of Knight's $400m than we could have because our risk systems kept shutting strategies down because what was happening was "too good to be true".
Aren't a lot of trades undone anyway by the authorities after such severe market hiccups?
This is a good question. In my experience, I have only seen exchange trades reversed when there was a major bug in exchange software. If the bug is on the client side, tough luck. And reversing trades done on an exchange is usually a decision for the exchange regulator. It is a major event that only happens every few years -- at most -- for highly developed exchanges.
Reversing or amending a single "fat finger" trade happens all the time and the exchange generally has procedures for this that don't involve a regulator.
Even in the most controversial recent example - LME cancelling a day's worth of nickel trades [0]- I understand it was their call and not any external regulator. That said, while I'd count LME as a "highly developed exchange", it's the Wild West compared to the US NMS.
As I understand it, the LME nickel trade reversal was to prevent total meltdown due to multiple counterparties going bankrupt at the same time. To me, it was a classic case of exchange limits and poor risk control. "If you owe the bank $100, that's your problem. If you owe the bank $100 million, that's the bank's problem." -J. Paul Getty (of course, add some zeroes for today's world)
Also, can you explain more about what this phrase means: "it's the Wild West compared to the US NMS"? Are you saying the risk limits and controls on LME are much worse than US markets?
> [...] the exchange generally has procedures for this that don't involve a regulator.
That's part of why I vaguely referred to 'the authorities' in my original comment. I wasn't quite sure who's doing the amending and reversing, and it wasn't too important.
Normally trades are undone or amended in price if they are executed far away (say 10%+) from what is determined to be reasonable market prices. And when amended, they get amended to a price that's still in the same direction, so the market taker still loses a little bit compared to the fair price.
KCG traded in such liquid instruments and in such a way that it didn't move the market that much. They lost a hundred dollars a trade on 4 million trades.
The article says some stocks were moved by more than 10%, but as I recall that was a small fraction of them.
Technically KCG didn't exist yet. NITE, aka Knight Capital Group, lost ~$460 million. Getco bought NITE to form KCG (beating out Virtu's bid), Virtu later purchased KCG, and the combined firm is now called Virtu Financial (VIRT).
> ... one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added.
Read this part again:
> ... one of Knight’s technicians did not *copy the new code to one of the eight SMARS computer servers*.
Yes, of course a CI/CD pipeline can fail midway through and only partially deploy the code to a subset of the servers, but I doubt it. And even if that were the case, just off the top of my head I can guarantee an Ansible playbook would not only have stopped the moment that particular transfer failed, the whole playbook run would therefore have failed, and none of the services would have been restarted (because that would be a final step that was never reached).
This was due to human error and is the very reason CI/CD/automation is a thing.
> Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server
CI/CD would have solved this 100%. A "Pull Request" made against a repository of Ansible code (or whatever your flavour is) would have *PREVENTED* the first technician from ever being able to merge the code into master/main (because you have master/main protected right... right?), completely preventing the entire process from ever rolling out without a review, which would have hopefully caught the misaligned configuration.
DevOps, which is mostly underpinned by CI/CD, would have solved this 100%. I'm very certain of this.
Ansible in my experience will stop trying to run subsequent tasks on a server once one of them fails, but it will go ahead with other servers that match the inventory pattern. So it very well could have successfully updated 7 out of 8 hosts.
Maybe there is a switch that will stop everything if any task on any host fails but it's not the default behavior.
At least it would have logged an error that hopefully would have been looked at.
I think this is an example of hindsight not always being 20/20.
If you replace each step of the post mortem with a CI/CD based alternative, you miss out on the fact that CI/CD trivializes designs where this wouldn't have happened.
The easy default here would be to have a playbook that executed on a particular inventory group.
The “linear” execution strategy is the default (which you linked to). By default, if there is an error on one host it will continue executing on all other hosts. You need to set a flag to stop executing on all hosts[1].
The parent process would not be notified of any failures until the end of the run, unless you supplied a custom callback plugin[2].
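To make the difference concrete, here's a toy Python simulation of the two behaviors (this is not Ansible itself, just the semantics; the host names are invented and the keyword argument is named after Ansible's any_errors_fatal setting):

```python
# Toy simulation of the two rollout behaviors described above -- not Ansible,
# just the semantics. Host names are invented.
def rollout(hosts: list[str], fails_on: str, any_errors_fatal: bool = False) -> list[str]:
    updated = []
    for host in hosts:
        if host == fails_on:
            print(f"{host}: copy FAILED")
            if any_errors_fatal:
                print("aborting the rollout on all remaining hosts")
                break
            continue  # default-style behavior: drop this host, keep going on the rest
        updated.append(host)
        print(f"{host}: updated")
    return updated

hosts = [f"smars{i:02d}" for i in range(1, 9)]
print(rollout(hosts, fails_on="smars03"))                          # 7 of 8 updated
print(rollout(hosts, fails_on="smars03", any_errors_fatal=True))   # stops after 2 of 8
```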
The problem was that a human forgot to run a step and no one noticed: the playbook would have failed and the server wouldn't have been online to route orders.
If you read the article, the other servers were fine and did not contribute to the issue.
> So it very well could have successfully updated 7 out of 8 hosts.
The problem was that the feature flag was manually enabled on the host with old code. Presumably with automated deployment the feature flag would never have been toggled if the deployment failed, either because the deployment didn't get that far or because the human spotted the failed deployment.
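Something like this minimal sketch is what I have in mind, assuming an automated pipeline; the helper functions, host names, and version string are hypothetical stand-ins, not Knight's actual tooling:

```python
# Minimal sketch: deploy to every host first, then only flip the feature flag
# once all hosts confirm the new version. Helpers and names are hypothetical.
NEW_VERSION = "2012.07-rlp"
HOSTS = [f"smars{i:02d}" for i in range(1, 9)]

def copy_release(host: str, version: str) -> bool:
    print(f"copying {version} to {host}")
    return True  # pretend the transfer succeeded

def reported_version(host: str) -> str:
    return NEW_VERSION  # in reality: ask the host what it is actually running

def set_flag(host: str, flag: str, enabled: bool) -> None:
    print(f"{host}: {flag} -> {enabled}")

def deploy_and_enable(hosts: list[str]) -> None:
    for host in hosts:
        if not copy_release(host, NEW_VERSION):
            raise RuntimeError(f"copy failed on {host}; aborting, flag stays off everywhere")
    # Verify what is actually running, not what we think we shipped.
    stale = [h for h in hosts if reported_version(h) != NEW_VERSION]
    if stale:
        raise RuntimeError(f"old code still live on {stale}; flag stays off everywhere")
    # Only now is it safe to enable the repurposed flag, and on every host at once.
    for host in hosts:
        set_flag(host, "RLP", True)

deploy_and_enable(HOSTS)
```

The ordering is the whole point: the flag flip is gated on every host positively confirming the new version, so a failed or partial copy can never coexist with an enabled flag.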
It is quite likely this would have been solved by a good automated deployment process. However, it is also quite likely that at some point a human error would creep into either the automated deployment process itself, or into code that then gets 100% correctly deployed into production.
At that point, if the error is as serious, Knight would still have gone bankrupt, since they had no way to mitigate these failure conditions.
Being 100% free of bugs is just not a viable way to end up with safe systems.
The blame here may indeed lie with whoever decided that reusing an old flag was a good idea. As anyone who has been in software development for any time can attest, this decision was not necessarily - and perhaps not even likely - made by a "developer."
9 times out of 10, I see developers making the mistakes that everyone seems to want to blame on non-technical people. There is a massive amount of software being written by people with a wide range of capabilities, and a large number of developers never master the basics. It doesn't help that some of the worst tools "win" and offer little protection against many basic mistakes.
A large number of developers never master the basics, that is true. But more interestingly, absolutely zero programmers can write a good amount of code that is free of bugs.
If your road to safety is bugfree code, it will end up in an accident sooner or later, 100% guaranteed.
For a group who so thoroughly despises bosses that operate on 'blame allocation', we spend a lot of time shopping around for permission to engage in reckless behavior. Most people would call that being a hypocrite.
Whereas I would call it... no, hypocrite works just fine.
At the company where I work, we have a team that took 3 weeks and multiple tries to get an API response (JSON) capitalized properly (camelCase to PascalCase).
When I tried to talk to the tech lead about it, his response was that SAFe would have prevented the issue (it was discovered by another team who consumes their API).
Throughout the entire thing this tech lead maintained his team didn't do anything and that the problem was the process.
yeah, no. I have 25+ years of experience as a developer; it doesn't take 3+ weeks to fix the casing of a JSON property name. I eventually had to be the bad guy and tell them their work was unacceptable, because they themselves couldn't recognize it. Except when I did, I ran it up the chain, because if the tech lead doesn't see the problem then I need someone who can help them see the problem.
For some people there's a "responsibility shield" that's so strong you can never get through to them.
Or at least not by a developer who has made that sort of mistake in the past.
I don't know what software engineering programs teach these days, but in the 1980s there was very little inclusion of case studies of things that went wrong. This was unlike the courses in the business school (my undergrad was a CS major + business minor), and unlike, I would presume, what real engineering disciplines teach.
My first exposure to a fuckup in production was a fuckup in production on my first job.
It is very hard to change the overall size of the messages, and there's a lot of pressure to keep them short. So it could have been a bitfield or several similar things... e.g. a value in a char field.
At the very least have two deploys: first actually remove the old code that relies on it, and then repurpose it. Giant foot gun to do it all in one, especially without any automated deploys.
That assumes that you have a stable, reliable, quick process to roll out updates. Sounds like they didn't, so maybe they worked on the "oh better add this feature, it's our only chance this month" pattern.
>whoever decided that reusing an old flag was a good idea.
My understanding is that in high frequency trading, minimizing the size of the transmission is paramount. Hence re-purposing an existing flag, rather than adding size to the packet makes some sense.
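To illustrate why that particular optimization is such a foot gun, here's a contrived Python sketch (flag names and bit values are invented, not the real order-message fields): the same wire bit means two completely different things depending on which code base happens to parse it.

```python
# Contrived sketch of the flag-reuse hazard -- names and bit values are invented.
from enum import IntFlag

class OldOrderFlags(IntFlag):      # what the retired code still understands
    IOC       = 0x01
    POST_ONLY = 0x02
    POWER_PEG = 0x08               # dead feature, but the old parser still honors it

class NewOrderFlags(IntFlag):      # what the new code understands
    IOC       = 0x01
    POST_ONLY = 0x02
    RLP       = 0x08               # same wire bit, completely different behavior

wire_flags = 0x08                  # the byte that actually arrives in the fixed-size message

print(NewOrderFlags(wire_flags).name)  # 'RLP'       -- what the sender intended
print(OldOrderFlags(wire_flags).name)  # 'POWER_PEG' -- what a stale host resurrects
```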
Flag recycling is a task that should be measured in months to quarters, and from what I recall of the postmortem they tried to achieve it in weeks, which is just criminally stupid.
It's this detail of the story which flips me from sympathy to schadenfreude. You dumb motherfuckers fucked around and found out.
I doubt any manager or VP cares or knows enough about the technical details of the code to dictate the name that should be used for a feature flag, of all things.
I see this as a problem of not investing enough in the deploy process. (Disclosure: I maintain an open source deploy tool for a living).
Charity Majors gave a talk in Euruko that talked a lot about this. Deploy tooling shouldn’t be a bunch of bash scripts in a trench coat, it should be fully staffed, fully tested, and automated within an inch of its life.
If you have a deploy process that has some kind of immutable architecture, tooling to monitor (failed/stuck/incomplete) rollouts, and the ability to quickly rollback to a prior known good stage, then you have layers of protection and an easy course of action for when things do go sideways. It might not have made this problem impossible, but it would have made it harder to happen.
I wrote a tool to automate our hotfix process, and people were somewhat surprised that you could kill the process at any step and start over and it would almost always do the right thing. Like how did you expect it to work? Why replace an error prone process with an error prone and opaque one that you can't restart?
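The shape of it is roughly this simplified Python sketch (step names and the state file are made up, and a real tool obviously also needs locking and logging): every step is idempotent and the runner records what's already done, so killing it at any point and rerunning is safe.

```python
# Simplified sketch of a resumable, idempotent step runner -- names are made up.
import json, os
from typing import Callable

STATE_FILE = "hotfix_state.json"

def load_done() -> set[str]:
    if not os.path.exists(STATE_FILE):
        return set()
    with open(STATE_FILE) as f:
        return set(json.load(f))

def mark_done(done: set[str], name: str) -> None:
    done.add(name)
    with open(STATE_FILE, "w") as f:
        json.dump(sorted(done), f)

def run(steps: list[tuple[str, Callable[[], None]]]) -> None:
    done = load_done()
    for name, action in steps:
        if name in done:
            print(f"skipping {name}: already recorded as done")
            continue
        action()               # each action must itself be safe to re-run
        mark_done(done, name)

run([
    ("build",   lambda: print("building hotfix")),
    ("stage",   lambda: print("copying to staging hosts")),
    ("verify",  lambda: print("running smoke tests")),
    ("promote", lambda: print("promoting to production")),
])
```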
> the ability to quickly rollback to a prior known good stage
This is vital, but it's often not sufficient just to roll back, say, to a known good Docker image. Database migrations may have occurred that dropped columns that the old code expects to exist; feature flags may need to be changed; multiple services may need to be rolled back individually; data may have accumulated under new assumptions that breaks old assumptions when old code is applied to that new data.
One of the really subtle wins of devops as a discipline is that by allowing/forcing application teams to take responsibility for deployment, they're more exposed to thinking how to solve these things in a maintainable way: for instance, breaking out complex "the meaning of our data is changing"-type changesets/data migrations into individually reversible stages, with the stages merged onto the production branch over the course of multiple days where analysis is done on error rates and live data.
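Concretely, one of those "meaning of our data is changing" changes broken into individually reversible stages looks something like the sketch below; the column names and stages are invented, and each row would be its own deploy, watched for a while before moving on.

```python
# Sketch of an expand/contract migration as individually reversible stages.
# Column names and stage descriptions are invented for illustration.
STAGES = [
    # (stage,      forward action,                              rollback action)
    ("expand",     "add nullable routing_mode column",          "drop routing_mode column"),
    ("dual-write", "write legacy flag AND routing_mode",        "stop writing routing_mode"),
    ("backfill",   "derive routing_mode for existing rows",     "nothing to undo; extra data is inert"),
    ("read-new",   "switch readers to routing_mode",            "switch readers back to legacy flag"),
    ("contract",   "stop writing, then drop the legacy flag",   "hardest to reverse; do it last, days later"),
]

for stage, forward, rollback in STAGES:
    print(f"{stage:>10} | forward: {forward:<42} | rollback: {rollback}")
```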
Counterpoint though: that automation is, in and of itself, more surface area for failure.
I can imagine a similar story where the deployment pipeline incorrectly rolled back due to some change in metric format and caused the infinite loss, for example.
The thing with these being a 1 in a million chance is that there are thousands of different hypothetical causes. The more parts, the harder it is to predict an interaction, and we've all been blindsided by something.
I would personally hate the stress of working on such high stakes releases.
Test test test. If that’s not enough, pick better tools. I’m rewriting bash scripts in rust at work because it gives me the ability to make many invalid states impossible to represent in code. Is it overkill? Maybe, but it is such a huge quality of life improvement.
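For anyone who hasn't seen the "make invalid states unrepresentable" idea, here's a rough Python analogue of what I mean (in Rust the compiler enforces the exhaustiveness; in Python it's only a runtime check, and the routing modes here are invented):

```python
# Rough Python analogue of "make invalid states unrepresentable" -- in Rust the
# compiler enforces this; here it is only checked at runtime. Modes are invented.
from enum import Enum

class RoutingMode(Enum):
    RLP = "rlp"
    DARK = "dark"
    DISABLED = "disabled"

def route(order_id: str, mode: RoutingMode) -> str:
    # A free-form string or a recycled bit can silently mean "old dead code path";
    # an enum means every value is one we deliberately defined.
    if mode is RoutingMode.RLP:
        return f"{order_id}: send to the retail liquidity program"
    if mode is RoutingMode.DARK:
        return f"{order_id}: send to the dark pool"
    return f"{order_id}: reject, routing disabled"

print(route("ord-1", RoutingMode.RLP))
print(route("ord-2", RoutingMode("disabled")))   # unknown strings raise ValueError here
```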
Automated things can fail. Sure. But consider that playbooks are just crappy automation run by unreliable meat computers.
Also you can take an iterative approach to automation:
- manual playbook only
- automate one step of the playbook
- if it goes well, move to another. If not, run a retro to figure out how you can improve it and try again.
Stress of failure at a job responsible for deployment architecture is manageable if you have a team and culture built around respecting that stress. There are some areas of code people are more careful around, but largely we make safety a product of our tools and processes and not some heroic “try harder not to screw up” attitude.
I find the impact of helping so many developers and their companies rewarding.
Part of the solution is at the level of attitude, just one more productive than "don't fuck up"
To create a contrived example, say someone reads your note on replacing bash scripts and decides they agree with the principle.
They go into work tomorrow, their fellow engineer agrees on the technical merit, and they reimplement a bunch of bash scripts in Rust with a suite of tests bigger than anyone imagined, and life is great.
... fast forward a few months from now and suddenly a state the bash scripts were hiding flares up and everyone is lost, and type safety didn't help.
A shared culture of "conservation of value" can help in a lot of ways there. That's the attitude that creation of value is always uncertain, so you prioritize potential future value lower than currently provided value:
- instead of looking at the technical merit of the new, we prioritize asking: What specific shortcomings does the old way have? What can we improve downstream so that the value those systems provide is protected from invalid states we're worried about this tool generating?
- does switching the language reduce the number of people who can work on it? Do we reduce the effect surface area of the team providing value to it? When hair is on fire do we know the sysops guy won't balk?
- when it goes down, with a culture of "conservation of value", your plan A is always rolling back; there's no back and forth on whether we can just roll out this one fix. If you cause the company to lose a million dollar trade, it's already codified that you made the right decision
Obviously these are all extensions of a contrived example, but to me culture is heavily utilized as a way to guide better engineering.
I think these days people tend to think in terms of culture that affirms, as a reaction to cultures that block anyone from accomplishing anything: to me a good engineering culture is one that clashes with what people want to do just enough to be mildly annoying.
The goal with automation is that the number of unidentified corner cases reduces over time.
A manual runbook is a game of "I did step 12, I think I did step 13, so the next step is 14" that plays out every single time you do server work. The thing with the human brain is that when you interrupt a task you've done a million times in the middle, most people can't reliably discern between this iteration and false memories from the last time they did it.
So unless there are interlocks that prevent skipping a step, it's a gamble every single time. And the effort involved in creating interlocks is a large fraction of the cost of automating.
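Even a trivially simple software interlock beats memory. A sketch with invented steps: the runner won't show you step N until you've explicitly confirmed step N-1, so "I think I did step 13" can't arise.

```python
# Sketch of a software interlock for a runbook -- steps are invented for
# illustration. You cannot reach step N without confirming step N-1.
RUNBOOK = [
    "Drain traffic away from the target server",
    "Stop the order-routing service",
    "Copy the new release onto the server",
    "Start the service and check the reported version",
    "Re-enable traffic to the server",
]

def run_with_interlock(steps: list[str]) -> None:
    for number, step in enumerate(steps, start=1):
        while input(f"Step {number}: {step}\nType 'done' to continue: ").strip().lower() != "done":
            print("Not confirmed; staying on this step.")
    print("Runbook complete.")

if __name__ == "__main__":
    run_with_interlock(RUNBOOK)
```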
1. Print the checklist/runbook out on paper with actual empty boxes next to the steps.
2. Laminate the printed checklist and put it in a big folder.
3. Every time you run the checklist, use a sharpie to mark the checkbox after you've done the step.
4. When you are done with the entire process, use whiteboard cleaner to wipe out the checks again and put the checklist back in the big folder with all the other checklists.
This is how every safety critical profession (aviation, shipping, medical, power generation, etc) has worked for decades and unless people are willingly being obtuse it is extremely hard to do it wrong. You just need people to turn off their ego and follow the process instead of trying to show off by doing it from memory. This last part might be more difficult in software settings.
Regarding checklists, people like Hollnagel, Wears, Braithwaite, Dekker, ... have done a bit of investigation (Hollnagel mostly on healthcare; Dekker started in the air industry but spread from there). Read the "Safety-I vs Safety-II" paper or the "When a checklist is not enough: How to improve them and what else is needed" paper.
Great point and sometimes checklists are indeed not enough. My previous post was triggered more by the "I think I did step 13," part of the post I was responding to. That is not a flaw with checklists but a flaw in the safety culture of the operator team. It should never happen that you lose track of where you are in the process because human memory is unreliable, rather you should fix that through better processes and outsource the memorizing to paper.
Technically, a flag re-use was the most impactful error, code wise.
A flag of such importance should not be just on/off; the ON should require a positive response / receipt containing the name and version of the code being turned on.
[edit - don't mean for each trade, I mean validation on startup]
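Something like this sketch is what I have in mind (component names, versions, and hosts are invented; receipt_from() stands in for asking the running process to identify itself):

```python
# Sketch of the "positive receipt" idea -- names, versions, and hosts are invented.
EXPECTED = ("SMARS-RLP", "1.0.0")

def receipt_from(host: str) -> tuple[str, str]:
    # Pretend one host never got the new code (the Knight scenario).
    return ("PowerPeg", "0.9.3") if host == "smars08" else ("SMARS-RLP", "1.0.0")

def arm_flag(hosts: list[str]) -> None:
    for host in hosts:
        component, version = receipt_from(host)
        if (component, version) != EXPECTED:
            raise RuntimeError(
                f"{host} reports {component} {version}, expected {EXPECTED[0]} {EXPECTED[1]}; "
                "refusing to arm the flag on ANY host")
        print(f"{host}: receipt ok ({component} {version})")
    print("all receipts match -- flag armed")

try:
    arm_flag([f"smars{i:02d}" for i in range(1, 9)])
except RuntimeError as err:
    print(f"flag NOT armed: {err}")
```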
I think the blame is not on either the devops or the developers, it is on the process. If a bug occurs then there should be at least 5-6 different metrics/alerts that should be able to catch the bug.
I think the big improvement would be consistency. Either all servers would be correct or all servers would be incorrect. The step where "Since they were unable to determine what was causing the erroneous orders they reacted by uninstalling the new code from the servers it was deployed to correctly" wouldn't have had a negative impact. They could have even instantly rolled back. Also if they were using the same automated deployment processes for their test environment they might have even caught this in QA.
I agree. It doesn’t matter if you give an inexperienced person a hammer or a saw — they’ll still screw it up.
My biggest pet peeve is that NO ONE ever does failure modeling.
I swear everyone builds things assuming it will work perfectly. Then when you mention if one part fails, it will completely bring down everything, they’ll say that it’s a 1 in a million chance. Yeah, the problem isn’t that it’s unlikely, it’s that when it does happen, you’ve perfectly designed your system to destroy itself.
It's actually quite routine stuff now, in finance at least, to perform some kind of 'fire test' on a regular basis: you shut down some components during the day and switch to backup solutions, to test that everything works smoothly.
> the deployment agent errored while downloading the new binary/code onto the server
In that case the build would never be pushed to production. The worst it would accomplish, and this is if your systems fail, is that it will break your staging area.
Sure, this is in the ideal world where people actually know how to set up their deployment pipelines correctly, so you're likely still right in many cases, but you shouldn't be.
Automated deployments would have allowed you to review the deployment before it happened. A failed deployment could be configured to allow automatic rollbacks. Automated deployments should also handle experiment flags, which could have been toggled to reduce impact. There are a bunch of places where it could have intervened and mitigated/prevented this whole situation.
Imagine a company where the engineering culture hires macho programmers who love bitmasked flags and manual memory management, and who think memory safe languages and json are for sissies.
Imagine they hire lots of graduates who've never worked elsewhere and teach them that their way is best, and everyone who says otherwise doesn't understand real performance. Those 'industry best practices' are written by javascript folks who think a 2 second pageload is fast, and a 200ms pageload is instant, don't you know?
And imagine when they make experienced hires, they look for people who have experience with bitmasked flags and manual memory management - which is reasonable enough, they gotta be able to code review that stuff and coach junior employees on working with it. But has the side effect experienced hires won't rock the boat.
Now, the nontechnical bosses can ask anyone from the highest engineering leadership to the most junior of peons, and they'll all agree that bitmasked flags are the right way of doing things.
Is it so unreasonable for nontechnical bosses to trust the consensus of their engineers on matters of engineering?
Also, API versioning. They weren't running API versioning on it; they called an old method with a new set of parameters, which shouldn't have been possible in the first place.
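A sketch of what explicit versioning buys you, with invented field names: the handler dispatches on a version field and rejects anything it doesn't recognize, instead of silently feeding new parameters through old logic.

```python
# Sketch of version-checked message handling -- field names are invented.
def handle_order(message: dict) -> str:
    version = message.get("version")
    if version == 1:
        if "rlp" in message:  # a v2-only field showing up in a v1 message
            raise ValueError("v1 message carries v2-only field 'rlp'; rejecting")
        return "route with legacy logic"
    if version == 2:
        return "route via RLP" if message.get("rlp") else "route normally"
    raise ValueError(f"unknown message version: {version!r}")

print(handle_order({"version": 2, "rlp": True}))   # route via RLP
print(handle_order({"version": 1}))                # route with legacy logic
# handle_order({"version": 1, "rlp": True}) would raise instead of reviving old behavior
```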
Substitute "a developer forgot to upload the code to one of the servers" for "the deployment agent errored while downloading the new binary/code onto the server and a bug in the agent prevented the error from being surfaced." Now you have the same failure mode, and the impact happens even faster.
The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.