I'm not sure how automated deployments would have solved this problem. In fact, if anything, it would have magnified the impact and fallout of the problem.
Substitute "a developer forgot to upload the code to one of the servers" for "the deployment agent errored while downloading the new binary/code onto the server and a bug in the agent prevented the error from being surfaced." Now you have the same failure mode, and the impact happens even faster.
The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.
> The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.
The blame completely lies with the risk management team.
The market knew there was a terrible problem, Knight knew there was a problem, and yet it took 45 minutes of trying various hotfixes before they ceased trading. Either they didn't have a kill switch, or no one was empowered to pull it because of the opportunity cost (perhaps pulling the switch at the wrong time costs $500k in missed opportunity).
I worked for a competitor to Knight at the time, and we deployed terrible bugs to production all the time, and during post-mortems we couldn't fathom the same thing happening to us. A dozen automated systems would have kicked in to stop individual trades, and any senior trader or operations person could have had a kill switch pulled within 60 seconds of dialogue, without fearing the repercussions. Actually, we captured way less of Knight's $400m than we could have, because our risk systems kept shutting strategies down on the grounds that what was happening was "too good to be true".
It’s nice to see your perspective as someone familiar with better systems.
I have always found this story fascinating; in my junior days I worked at a relatively big adtech platform (ie billions of impressions per day) and as cowboy as we were about lots of things, all our systems always had kill switches that could stop spending money and I could have pulled them with minimal red tape if I suspected something was wrong.
And this was for a platform where our max loss for an hour would have hurt but not killed the business (maybe a six figure loss), I can’t imagine not having layers of risk management systems in HFT software.
They were asleep at the wheel, not unlike all the random brokerages that blew up when the Swiss central bank pulled the CHF peg in 2015.
This is a culture problem - as soon as you load up your trading firm with a bunch of software industry hires, you end up with jiras and change management workflows instead of people on deck that have context for what they're doing. That's the only way to explain reverse scalping for 45 mins straight.
The CHF de-peg wasn't really technology risk. Brokers lost money because they undervalued CHF/EUR risk, undervalued liquidity risk (stop orders were executing FAR worse than expected, or simply failing to execute at all), and didn't pay attention to the legal protections afforded to their customers (customer balances went negative but there was no way to recover that money from the customers). These brokers would have had the same problems even if using pen & paper, they failed to plan (or alternatively, made a conscious bet and lost).
I think it is worth saying that no one saw that de-peg coming. Absolutely no one. Sure, there are some crazies who claim they saw it coming, but that same camp is still waiting for the HKD-USD de-peg. It was a shock to everyone on Wall Street. I am a bit surprised that the Swiss National Bank didn't tip off their own banks before doing it. Both UBS and Credit Suisse were seriously caught off guard when it happened.
That's fair. I wasn't close enough to see how "surprising" it was. The point stands that it looks nothing like HFT active trading risk. Knight Capital created such amazing training material for the industry: what went wrong with their software, how many decisions or practices could have prevented or narrowed the risk, what went wrong at trade time, and how they had a clear window to pull the plug but died through indecision and inaction.
> as soon as you load up your trading firm with a bunch of software industry hires
As a software industry hire at a hedge fund right now... I'd love to see more cross-pollination, because there are so many good things happening on both sides, and so many terrible things happening through just a sheer lack of knowledge.
Change management workflows are great and should be used more in finance. But software companies should implement andon cord systems more often (Amazon does; nowhere else I've worked gives that power to anybody at the company).
I messed around with the idea of a physical big red button kill switch to shut down market making; the IT people thought I was joking - the trading desk just assumed that it was in the design from day 1.
> or because no one was empowered to pull the kill switch because of the opportunity cost (perhaps pulling the switch at the wrong time costs $500k in opportunity).
Isn't the problem that pulling the plug on a trading bot doesn't just have opportunity costs, but may also leave you with open positions that, depending on the kind of trades you're doing and the way the market is moving, could be arbitrarily expensive to unwind?
> Actually, we made way less of Knight's $400m than we could have because our risk systems kept shutting strategies down because what was happening was "too good to be true".
Aren't a lot of trades undone anyway by the authorities after such severe market hiccups?
This is a good question. In my experience, I have only seen exchange trades reversed when there was a major bug in exchange software. If the bug is on the client side, tough luck. And reversing trades done on an exchange is usually a decision for the exchange regulator. It is a major event that only happens every few years -- at most -- for highly developed exchanges.
Reversing or amending a single "fat finger" trade happens all the time and the exchange generally has procedures for this that don't involve a regulator.
Even in the most controversial recent example - LME cancelling a day's worth of nickel trades [0] - I understand it was their call and not any external regulator. That said, while I'd count LME as a "highly developed exchange", it's the Wild West compared to the US NMS.
As I understand it, the LME nickel trade reversal was to prevent a total meltdown due to multiple counterparties going bankrupt at the same time. To me, it was a classic case of poor exchange limits and poor risk control. "If you owe the bank $100 that's your problem. If you owe the bank $100 million, that's the bank's problem." -J. Paul Getty (of course, add some zeroes for today's world)
Also, can you explain more about what this phrase means: "it's the Wild West compared to the US NMS"? Are you saying the risk limits and controls on LME are much worse than in US markets?
> [...] the exchange generally has procedures for this that don't involve a regulator.
That's part of why I vaguely referred to 'the authorities' in my original comment. I wasn't quite sure who's doing the amending and reversing, and it wasn't too important.
Normally trades are undone or amended in price if they are executed far away (say 10%+) from what is determined to be reasonable market prices. And when amended, they get amended to a price that's still in the same direction, so the market taker still loses a little bit compared to the fair price.
KCG traded in such liquid instruments and in such a way that it didn't move the market that much. They lost a hundred dollars a trade on 4 million trades.
The article says some stocks were moved by more than 10%, but as I recall that was a small fraction of them.
Technically KCG didn't exist yet. NITE, aka Knight Capital Group, lost ~$460 million. Getco bought NITE, forming KCG (beating out a bid from Virtu, which later purchased KCG), and the combined firm is now called Virtu Financial (VIRT).
> ... one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added.
Read this part again:
> ... one of Knight’s technicians did not *copy the new code to one of the eight SMARS computer servers*.
Yes, of course a CI/CD pipeline can fail midway through and deploy the code to only some of the servers, but I doubt it would here. And even if that were the case, just off the top of my head I can guarantee that an Ansible playbook would not only have stopped the moment that particular transfer failed, the whole playbook run would therefore have failed, and none of the services would have been restarted (because that would be a final step that was never reached).
This was due to human error and is the very reason CI/CD/automation is a thing.
> Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server
CI/CD would have solved this 100%. A "Pull Request" made against a repository of Ansible code (or whatever your flavour is) would have *PREVENTED* the first technician from ever being able to merge the code into master/main (because you have master/main protected right... right?), completely preventing the entire process from ever rolling out without a review, which would have hopefully caught the misaligned configuration.
DevOps, which is mostly underpinned by CI/CD, would have solved this 100%. I'm very certain of this.
Ansible in my experience will stop trying to run subsequent tasks on a server once one of them fails, but it will go ahead with other servers that match the inventory pattern. So it very well could have successfully updated 7 out of 8 hosts.
Maybe there is a switch that will stop everything if any task on any host fails but it's not the default behavior.
At least it would have logged an error that hopefully would have been looked at.
I think this is an example of hindsight not always being 20/20
If you replace each step of the post mortem with a CI/CD based alternative, you miss out on the fact that CI/CD trivializes designs where this wouldn't have happened.
The easy default here would be to have a runbook that executed on a particular inventory group.
The “linear” execution strategy is the default (which you linked to). By default, if there is an error on one host it will continue executing on all other hosts. You need to set a flag to stop executing on all hosts[1].
The parent process would not be notified of any failures until the end of the run, unless you supplied a custom callback plugin[2].
The problem was that a human forgot to run a step and no one noticed. With automation, the playbook would have failed and the server wouldn't have been online to place orders.
If you read the article, the other servers were fine and did not contribute to the issue.
> So it very well could have successfully updated 7 out of 8 hosts.
The problem was that the feature flag was manually enabled on the host with old code. Presumably with automated deployment the feature flag would never have been toggled if the deployment failed, either because the deployment didn't get that far or because the human spotted the failed deployment.
It is quite likely this would have been solved by a good automated deployment process. However, it is also quite likely that at some point a human error would creep into either the automated deployment process itself, or into code that then gets 100% correctly deployed into production.
At that point, if the error is as serious, Knight would still have gone bankrupt, since they had no way to mitigate these failure conditions.
Being 100% free of bugs is just not a viable way to end up with safe systems.
The blame here may indeed lie with whoever decided that reusing an old flag was a good idea. As anyone who has been in software development for any time can attest, this decision was not necessarily - and perhaps not even likely - made by a "developer."
9 times out of 10, I see developers making the mistakes that everyone seems to want to blame on non-technical people. There is a massive amount of software being written by people with a wide range of capabilities, and a large number of developers never master the basics. It doesn't help that some of the worst tools "win" and offer little protection against many basic mistakes.
A large number of developers never master the basics, that is true. But more interestingly, no programmer can write any substantial amount of code that is free of bugs.
If your road to safety is bugfree code, it will end up in an accident sooner or later, 100% guaranteed.
For a group who so thoroughly despises bosses that operate on 'blame allocation', we spend a lot of time shopping around for permission to engage in reckless behavior. Most people would call that being a hypocrite.
Whereas I would call it... no, hypocrite works just fine.
At the company I work, we have a team that took 3 weeks and multiple tries to get an API response (JSON) capitalized properly (camelCase to PascalCase).
When I tried to talk to the tech lead about it, his response is that SAFe would have prevented the issue (it was discovered by another team who consumes their API).
Throughout the entire thing this tech lead maintained that his team didn't do anything wrong and that the problem was the process.
Yeah, no. I have 25+ years of experience as a developer; it doesn't take 3+ weeks to fix the casing of a JSON property name. I eventually had to be the bad guy and tell them their work was unacceptable, because they themselves couldn't recognize it. When I did, I also ran it up the chain, because if the tech lead doesn't see the problem then I need someone who can help them see the problem.
For some people there's a "responsibility shield" that's so strong you can never get through to them.
Or at least not by a developer who has made that sort of mistake in the past.
I don't know what software engineering programs teach these days, but in the 1980s there was very little inclusion of case studies of things that went wrong. This was unlike the courses in the business school (my undergrad was a CS major + business minor), and, I would presume, unlike what real engineering disciplines teach.
My first exposure to a fuckup in production was a fuckup in production on my first job.
It is very hard to change the overall size of the messages, and there's a lot of pressure to keep them short. So it could have been a bitfield or several similar things, e.g. a value in a char field.
At the very least do it in two deploys - first actually remove the old code that relies on the flag, then repurpose it. It's a giant foot gun to do it all in one, especially without any automated deploys.
That assumes that you have a stable, reliable, quick process to roll out updates. Sounds like they didn't, so maybe they worked on the "oh better add this feature, it's our only chance this month" pattern.
>whoever decided that reusing an old flag was a good idea.
My understanding is that in high frequency trading, minimizing the size of the transmission is paramount. Hence re-purposing an existing flag, rather than adding size to the packet, makes some sense.
Flag recycling is a task that should be measured in months to quarters, and from what I recall of the postmortem they tried to achieve it in weeks, which is just criminally stupid.
It's this detail of the story which flips me from sympathy to schadenfreude. You dumb motherfuckers fucked around and found out.
I doubt any manager or VP cares or knows enough about the technical details of the code to dictate the name that should be used for a feature flag, of all things.
I see this as a problem of not investing enough in the deploy process. (Disclosure: I maintain an open source deploy tool for a living).
Charity Majors gave a talk at Euruko that talked a lot about this. Deploy tooling shouldn't be a bunch of bash scripts in a trench coat; it should be fully staffed, fully tested, and automated within an inch of its life.
If you have a deploy process that has some kind of immutable architecture, tooling to monitor (failed/stuck/incomplete) rollouts, and the ability to quickly roll back to a prior known good stage, then you have layers of protection and an easy course of action for when things do go sideways. It might not have made this problem impossible, but it would have made it harder to happen.
I wrote a tool to automate our hotfix process, and people were somewhat surprised that you could kill the process at any step and start over and it would almost always do the right thing. Like how did you expect it to work? Why replace an error prone process with an error prone and opaque one that you can't restart?
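The property being described is idempotency. A rough sketch of the idea in Python (step names and state are invented, not the actual tool): each step knows how to tell whether it has already been applied, so killing the run at any point and starting over is safe.

    # Minimal sketch of an idempotent, restartable deploy pipeline (hypothetical
    # step names, not the commenter's real tool). Each step can detect whether it
    # has already happened, so re-running after a crash is safe.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Step:
        name: str
        is_done: Callable[[], bool]   # cheap check: has this already happened?
        apply: Callable[[], None]     # the actual work; must be safe to re-run

    def run_pipeline(steps: list) -> None:
        for step in steps:
            if step.is_done():
                print(f"skip  {step.name} (already done)")
                continue
            print(f"apply {step.name}")
            step.apply()
            if not step.is_done():
                raise RuntimeError(f"{step.name} did not converge; aborting")

    # Demo with a dict standing in for real servers/artifacts.
    state = {"artifact_uploaded": False, "service_restarted": False}
    steps = [
        Step("upload artifact",
             lambda: state["artifact_uploaded"],
             lambda: state.update(artifact_uploaded=True)),
        Step("restart service",
             lambda: state["service_restarted"],
             lambda: state.update(service_restarted=True)),
    ]
    run_pipeline(steps)   # kill it anywhere and run again: done steps are skipped

Restarting is safe precisely because the pipeline asks "is this already done?" before doing anything, rather than trying to remember where it left off.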
> the ability to quickly rollback to a prior known good stage
This is vital, but it's often not sufficient just to roll back, say, to a known good Docker image. Database migrations may have occurred that dropped columns that the old code expects to exist; feature flags may need to be changed; multiple services may need to be rolled back individually; data may have accumulated under new assumptions that breaks old assumptions when old code is applied to that new data.
One of the really subtle wins of devops as a discipline is that by allowing/forcing application teams to take responsibility for deployment, they're more exposed to thinking how to solve these things in a maintainable way: for instance, breaking out complex "the meaning of our data is changing"-type changesets/data migrations into individually reversible stages, with the stages merged onto the production branch over the course of multiple days where analysis is done on error rates and live data.
Counterpoint though: that automation in and of itself is more failure surface area.
I can imagine a similar story where the deployment pipeline incorrectly rolled back due to some change in metric format and caused a runaway loss, for example.
The thing with these 1-in-a-million chances is that there are thousands of different hypothetical causes. The more parts, the harder it is to predict an interaction, and we've all been blindsided by something.
I would personally hate the stress of working on such high stakes releases.
Test test test. If that’s not enough, pick better tools. I’m rewriting bash scripts in rust at work because it gives me the ability to make many invalid states impossible to represent in code. Is it overkill? Maybe, but it is such a huge quality of life improvement.
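The "invalid states unrepresentable" idea isn't specific to Rust. A tiny illustration of the principle in Python (the commenter's scripts are in Rust; these names are made up): constrain a deploy target to an enum instead of passing free-form strings around, so a typo fails loudly at the boundary instead of silently doing something else.

    # Illustration of "make invalid states unrepresentable" (hypothetical example,
    # not the commenter's Rust code): the deploy target is an enum, not a string.
    from enum import Enum

    class Environment(Enum):
        STAGING = "staging"
        PRODUCTION = "production"

    def deploy(env: Environment, version: str) -> None:
        # By the time we get here, env can only be one of the enumerated values.
        print(f"deploying {version} to {env.value}")

    deploy(Environment.PRODUCTION, "1.4.2")            # fine
    try:
        deploy(Environment("prodcution"), "1.4.2")     # typo: rejected at the boundary
    except ValueError as err:
        print(f"rejected: {err}")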
Automated things can fail. Sure. But consider that playbooks are just crappy automation run by unreliable meat computers.
Also you can take an iterative approach to automation:
- manual playbook only
- automate one step of the playbook
- if it goes well, move to another. If not, run a retro to figure out how you can improve it and try again.
Stress of failure at a job responsible for deployment architecture is manageable if you have a team and culture built around respecting that stress. There are some areas of code people are more careful around, but largely we make safety a product of our tools and processes and not some heroic “try harder not to screw up” attitude.
I find the impact of helping so many developers and their companies rewarding.
Part of the solution is at the level of attitude, just one more productive than "don't fuck up"
To create a contrived example, say someone reads your note on replacing bash scripts and decides they agree with the principle.
They go into work tomorrow, their fellow engineer agrees on the technical merit, and they reimplement a bunch of bash scripts in Rust with a suite of tests bigger than anyone imagined, and life is great.
... fast forward a few months from now and suddenly a state the bash scripts were hiding flares up and everyone is lost, and type safety didn't help.
A shared culture of "conservation of value" can help in a lot of ways there. That's the attitude that creation of value is always uncertain, so you prioritize potential future value lower than currently provided value:
- instead of looking at the technical merit of the new, we prioritize asking: what specific shortcomings does the old way have? What can we improve downstream so that the value those systems provide is protected from the invalid states we're worried about this tool generating?
- does switching the language reduce the number of people who can work on it? Do we reduce the effect surface area of the team providing value to it? When hair is on fire do we know the sysops guy won't balk?
- when it goes down, with a culture of "conservation of value", your plan A is always rolling back; there's no back and forth on whether we can just roll out this one fix. If you cause the company to lose a million dollar trade, it's already codified that you made the right decision
Obviously these are all extensions of a contrived example, but to me culture is heavily utilized as a way to guide better engineering.
I think these days people tend to think in terms of culture that affirms, as a reaction to cultures that block anyone from accomplishing anything: to me a good engineering culture is one that clashes with what people want to do just enough to be mildly annoying.
The goal with automation is that the number of unidentified corner cases reduces over time.
A manual runbook is a game of "I did step 12, I think I did step 13, so the next step is 14" that plays out every single time you do server work. The thing about the human brain is that when you interrupt a task you've done a million times in the middle, most people can't reliably distinguish this iteration from false memories of the last time they did it.
So unless there are interlocks that prevent skipping a step, it's a gamble every single time. And the effort involved in creating interlocks is a large fraction of the cost of automating.
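For what an interlock can look like, here's a rough sketch in Python (invented step names, not a real tool): completed steps are journaled to disk, so "did I already do step 13?" has a definite answer, and step 14 refuses to run before step 13.

    # Rough sketch of a runbook runner with interlocks (invented step names).
    # Completed steps are journaled, so progress survives interruptions and
    # steps cannot be skipped or run out of order.
    import json, pathlib

    STEPS = ["drain traffic", "copy new code", "restart service", "verify version"]
    JOURNAL = pathlib.Path("runbook_progress.json")

    def load_done() -> list:
        return json.loads(JOURNAL.read_text()) if JOURNAL.exists() else []

    def run_step(name: str) -> None:
        done = load_done()
        if name in done:
            print(f"'{name}' already done, skipping")
            return
        expected = STEPS[len(done)]
        if name != expected:
            raise RuntimeError(f"out of order: next step is '{expected}', not '{name}'")
        print(f"running '{name}' ...")   # the actual (manual or automated) work goes here
        JOURNAL.write_text(json.dumps(done + [name]))

    for step in STEPS:
        run_step(step)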
1. Print the checklist/runbook out on paper with actual empty boxes next to the steps.
2. Laminate the printed checklist and put it in a big folder.
3. Every time you run the checklist, use a sharpie to mark the checkbox after you've done the step.
4. When you are done with the entire process, use whiteboard cleaner to wipe out the checks again and put the checklist back in the big folder with all the other checklists.
This is how every safety critical profession (aviation, shipping, medical, power generation, etc) has worked for decades and unless people are willingly being obtuse it is extremely hard to do it wrong. You just need people to turn off their ego and follow the process instead of trying to show off by doing it from memory. This last part might be more difficult in software settings.
Regarding checklists, people like Hollnagel, Wears, Braithwaite, Dekker, ... have done a fair bit of investigation (Hollnagel mostly in healthcare; Dekker started in the air industry but spread from there). Read the "Safety-I vs Safety-II" paper or the "When a checklist is not enough: How to improve them and what else is needed" paper.
Great point and sometimes checklists are indeed not enough. My previous post was triggered more by the "I think I did step 13," part of the post I was responding to. That is not a flaw with checklists but a flaw in the safety culture of the operator team. It should never happen that you lose track of where you are in the process because human memory is unreliable, rather you should fix that through better processes and outsource the memorizing to paper.
Technically, a flag re-use was the most impactful error, code wise.
A flag of such importance should not be just on/off; the ON should require a positive response / receipt containing the name and version of the code being turned on.
[edit - don't mean for each trade, I mean validation on startup]
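A sketch of that "positive receipt" idea in Python (the flag name and version strings are invented): a flag records which build it was written for, and each server validates that on startup before honoring it.

    # Sketch of flag activation requiring a version receipt (hypothetical names).
    # A flag isn't just on/off: it records the build it was meant for, and a
    # server refuses to start if its own build doesn't match.
    EXPECTED_BUILD_FOR_FLAG = {"rlp_routing": "smars-2012.08.01"}  # set at rollout time

    def validate_flags_on_startup(server: str, build: str, enabled_flags: set) -> None:
        for flag in enabled_flags:
            expected = EXPECTED_BUILD_FOR_FLAG.get(flag)
            if expected != build:
                # A server still running old code refuses to start here, instead of
                # silently reviving whatever that bit used to mean.
                raise SystemExit(f"{server}: flag '{flag}' expects build {expected!r}, "
                                 f"but this binary is {build!r}; refusing to trade")
        print(f"{server}: flags validated against build {build}")

    validate_flags_on_startup("server-01", "smars-2012.08.01", {"rlp_routing"})
    # validate_flags_on_startup("server-08", "smars-2004.03", {"rlp_routing"})  # would refuse to start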
I think the blame is not on either the devops or the developers; it is on the process. If a bug occurs, then there should be at least 5-6 different metrics/alerts that are able to catch the bug.
I think the big improvement would be consistency. Either all servers would be correct or all servers would be incorrect. The step where "Since they were unable to determine what was causing the erroneous orders they reacted by uninstalling the new code from the servers it was deployed to correctly" wouldn't have had a negative impact. They could have even instantly rolled back. Also if they were using the same automated deployment processes for their test environment they might have even caught this in QA.
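A minimal sketch of that kind of post-deploy consistency check in Python (hostnames and version strings are invented): ask every server what it is running and refuse to proceed until the whole fleet agrees.

    # Minimal fleet-consistency check (hypothetical hosts/versions): the rollout
    # only proceeds if every server reports the same version.
    from collections import Counter

    def reported_versions() -> dict:
        # In reality this would query each host (deploy agent, health endpoint, ...).
        versions = {f"server-{i:02d}": "v2-rlp" for i in range(1, 8)}
        versions["server-08"] = "v1-powerpeg"   # the straggler
        return versions

    def check_fleet_consistent() -> None:
        versions = reported_versions()
        counts = Counter(versions.values())
        if len(counts) > 1:
            majority, _ = counts.most_common(1)[0]
            stragglers = {h: v for h, v in versions.items() if v != majority}
            raise SystemExit(f"fleet inconsistent, do not enable the new flag: {stragglers}")
        print("all servers report", counts.most_common(1)[0][0])

    check_fleet_consistent()   # with the data above this aborts, naming server-08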
I agree. It doesn’t matter if you give an inexperienced person a hammer or a saw — they’ll still screw it up.
My biggest pet peeve is that NO ONE ever does failure modeling.
I swear everyone builds things assuming it will work perfectly. Then when you mention if one part fails, it will completely bring down everything, they’ll say that it’s a 1 in a million chance. Yeah, the problem isn’t that it’s unlikely, it’s that when it does happen, you’ve perfectly designed your system to destroy itself.
It's actually quite routine stuff now, in finance at least, to perform some kind of 'fire test' on a regular basis - you shut down some components during the day and switch to backup solutions, to test that everything works smoothly.
> the deployment agent errored while downloading the new binary/code onto the server
In that case the build would never be pushed to production. The worst it would accomplish, and this is if your systems fail, is that it will break your staging area.
Sure, this is in the ideal world where people actually know how to set up their deployment pipelines correctly, so you're likely still right in many cases, but you shouldn't be.
Automated deployments would have allowed you to review the deployment before it happened. A failed deployment could be configured to allow automatic rollbacks. Automated deployments should also handle experiment flags, which could have been toggled to reduce impact. There are a bunch of places where it could have intervened and mitigated/prevented this whole situation.
Imagine a company where the engineering culture hires macho programmers who love bitmasked flags and manual memory management, and who think memory safe languages and json are for sissies.
Imagine they hire lots of graduates who've never worked elsewhere and teach them that their way is best, and everyone who says otherwise doesn't understand real performance. Those 'industry best practices' are written by javascript folks who think a 2 second pageload is fast, and a 200ms pageload is instant, don't you know?
And imagine when they make experienced hires, they look for people who have experience with bitmasked flags and manual memory management - which is reasonable enough, they gotta be able to code review that stuff and coach junior employees on working with it. But has the side effect experienced hires won't rock the boat.
Now, the nontechnical bosses can ask anyone from the highest engineering leadership to the most junior of peons, and they'll all agree that bitmasked flags are the right way of doing things.
Is it so unreasonable for nontechnical bosses to trust the consensus of their engineers on matters of engineering?
Also, API versioning. They weren't running API versioning on it; they called an old method with a new set of parameters, which shouldn't have been possible in the first place.
> why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point
This seems to be exactly the point! For 8 years they left unused code in place, seemingly only bothering to remove it because they wanted to repurpose a flag. If they'd done the right thing 8 years prior and removed code they weren't using, this story plays out very differently. No ancient routines get resurrected, no rogue server.
Maybe Knight Capital wasn't using version control and held onto this code "just in case", but I've seen this same resistance to deleting code in programmers working in repos that are completely under VCS, and it's flabbergasting. If you need it again, you can always bring it back from version control. If you need it again but forget it's there, you'd do the same with the dead code path. Leaving it in the source tree is pure liability.
EDIT: Kevlin Henney gave an excellent talk at GOTO about software reliability and he touches on this, using Knight Capital as the example—he actually cites this very blog post [0]. The whole talk is excellent, but I've linked the three minutes where he talks about Knight Capital.
> The problem is there is no code that is truly dead. It turns out all you need to do is make a small assumption, a change of an assumption and then suddenly it's no longer dead, it's zombie code. It has come back to life and the zombie apocalypse costs money.
> I've seen this same resistance to deleting code in programmers working in repos that are completely under VCS, and it's flabbergasting
I think a lot of developers only know the basics of git. They can check in changes, they can look at history with git log, and maybe they know how to use git blame.
They often don't know how to filter git history. They often don't know about the git pickaxe, or about exclude patterns, and don't even think to question if you can do something like "git log -G'int.*foo\(' -- ':(exclude)directory'" to search for 'foo' in the git log, excluding some directory.
They know how to "grep" within the existing code tree though, so they know if it's not deleted they can find it again with the right grep. If it's deleted, they might not know how to find it in git history.
I sympathize with this to a degree actually. Code in the git log is invisible to a lot of tooling, so for example it won't show up in autocomplete if your text editor might have otherwise suggested it, it won't show up in your library documentation, etc.
If you truly think the code will be used again, I think it's at least defensible to leave it in the tree so that it comes along for refactors, and ends up being found when it's needed.
For cases like Knight capital, where it's obviously never going to be useful again, it's not defensible of course.
I can understand that argument for small, standalone functions. Where it really gets at me is when people insist on leaving whole use cases or subsystems in place, which seems to be what happened here.
You don't want these to be visible to autocomplete, because they're outdated and would need major modification to be correct again. If you do need to resurrect them, they're trivial to find in the git history—just search for the commit named "remove foo"—and they should pass through code review as if they were brand new code, because a lot of stuff will have changed around them in the intervening time.
> and don't even think to question if you can do something like "git log -G'int.*foo\(' -- ':(exclude)directory'"
I do question it, but I know the answer is hard (as you demonstrated), so I don't bother. And I'm now looking at the git log docs - I fail to parse how the exclude works even looking at the docs - I don't find anything about the `:(` construct which houses the exclude keyword. But thanks for -G - that will be useful.
> Code in the git log is invisible to a lot of tooling
This is the issue. When searching code, I expect to have a way to search older commits. But Azure DevOps won't do it. There is no checkbox "Include all commits".
You can see that it includes the exclude keyword, among others.
Since it applies to almost every command (from 'git add -- <pathspec>' to 'git checkout -- <pathspec>'), it's not mentioned as clearly in individual commands.
git log and all the other commands' man pages should really refer back to gitglossary here, then. And they should either name their argument <pathspec>, or specify that the given argument (<path> or <file>) is a pathspec.
> git pickaxe, or about exclude patterns, and don't even think to question if you can do something like "git log -G'int.*foo\(' -- ':(exclude)directory'" to search for 'foo' in the git log, excluding some directory
Learned something new today, thank you! Will find ways to use these in my daily workflow.
"If it works, don't touch it" is something I've heard a lot, especially said by managers who don't understand what they are talking about.
An update to simply "remove old code" might be difficult if someone sees any change as creating a risk of something going wrong. And to be fair, any change is a risk, but so is leaving old code around.
At least now we have this case to point to as a clear example of the risk.
Version control isn’t bulletproof either. All your code history is one git rebase away from being abolished forever.
I hope most orgs have processes around their main branches so this does not occur, but I’ve also been in smaller orgs and accidentally screwed up prod database tables, so the accidental git rebase isn’t impossible to consider…
Yes, you should definitely have branch protection turned on on main, I kind of assumed that went without saying. But to actually lose your entire git history would require both having no branch protection and having every single developer on your staff be in the regular habit of force rebasing their own branches on main. If a single developer does a double take when they're told that their branch has a different history than origin/main, then you probably didn't lose more than a week of work.
Also, it's worth noting that this dead code will almost certainly not get reused wherever you decide to store it, so it's best to keep it as far away from zombification as possible, even if that's not the safest place to store it.
You're very unlikely to lose your version history like that.
Everywhere I've worked has branch protection turned on, and backups. Even if those both fail somehow it's very likely the complete history is on lots of engineers' laptops.
1. You shouldn't be allowing anybody to force-push rebased stuff onto major branches in your main repo (the one builds come from) anyway. This is especially important to support auditing and traceability.
2. Just because the folder is a git-repo isn't an excuse not to have it part of your regular offsite backup set.
There are reasons to want branch protection but audit isn’t one of them. An auditable system would be keeping an immutable log of the git repo actions in a separate, append-only location. It wouldn’t rely on the thing being written to all the time, by users, to never get broken. In your model, your audit history only needs one bug in the branch-protection tool for it to be destroyed.
Think of it like syslog. It’s good to keep a log of events but it’s bad to rely on the /var/log/syslog on your web server. You should be logging to a remote, append-only system.
(However I would concede that if we are talking about lawyer-proof-audit — SOC2, ISO etc — rather than actual security auditing, then branch protection is probably just fine.)
See my comment in the sister thread - I worked at Knight just after the outage.
> For 8 years they left unused code in place, seemingly only bothering to remove it because they wanted to repurpose a flag
There was another issue where they were using a database with only 256 columns. Sometimes they needed a new column, so they would just reuse an old column that "wasn't being used at the time".
IIRC, this was generally acknowledged internally to be "a bad idea" but no one had prioritized cleaning up the old code and/or coming up with a better best practice.
Imo it's easier to quantify the value (in dollars) of adding new code than of removing old code. Things whose value is easier to quantify tend to get higher priority. The same applies to other areas like performance optimization, testing, security, and patching (some of these have high level data, like the cost of incidents/downtime or the cost of a breach).
Some companies have processes in place to try to balance this, like 20% of time spent on tech debt reduction.
No continuous deployment system I have worked with would have blocked this particular bug.
They were in a situation where they were incrementally rolling out, but the code had a logic bug where the failure of one install within an incremental rollout step bankrupted the company.
I’d guard against this with runtime checks that the software version (e.g. git sha) matches, and also add fault injection into tests that invoke the software rollout infrastructure.
No continuous deployment system worth its salt would allow configuration and code to be out of sync. They had a configuration change to turn on a flag that used to enable Power Peg but now enabled something else, plus a code change to reinterpret that flag differently.
The situation was caused by a confluence of multiple issues.
The biggest red-flag is that they chose to repurpose a flag! Why? Is it really difficult to add a new flag for a new feature?
Even if the technician was careful not to let prod be out of sync, it is possible that the deployment isn't instantaneous, and that the old code could have run when the repurposed flag was turned on.
Some of the trickiness that comes from HFT and adjacent things like this is you can be working on tremendously powerful hardware but still be miserly about bits because every extra byte you have to cram into a packet is extra latency. The HFT firm I worked for would "re-use flags" in the sense that each packet literally had an 8-bit section called "flags" (and further down in the message, another 8 bit section called flags2 because of course) and each bit in there was a Boolean that could be on or off - a flag. So we weren't reusing flags as much as we were re-allocating what a high bit at that index in flags meant.
We were very conscious of this kind of error though and we managed them like Scrooge counting his farthings.
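For readers who haven't seen this style, here's a small Python sketch of what an 8-bit "flags" field looks like (the bit assignments are invented): each bit is an independent boolean, and "reusing a flag" means giving an existing bit position a new meaning while the packet itself doesn't change at all, which is exactly why a server still interpreting the old meaning is so dangerous.

    # Sketch of an 8-bit "flags" field in a wire message (bit assignments invented).
    from enum import IntFlag

    class OrderFlags(IntFlag):
        IMMEDIATE_OR_CANCEL = 0x01
        POST_ONLY           = 0x02
        HIDDEN              = 0x04
        # Bit 0x08 used to mean LEGACY_ALGO; it was "reused" to mean NEW_ROUTING.
        NEW_ROUTING         = 0x08

    def encode(flags: OrderFlags) -> int:
        return int(flags)              # packs into a single byte on the wire

    def old_binary_interprets(byte: int) -> None:
        # A server still running the old code sees the same bit and happily
        # enables the legacy behavior.
        if byte & 0x08:
            print("old binary: LEGACY_ALGO enabled!")

    wire_byte = encode(OrderFlags.POST_ONLY | OrderFlags.NEW_ROUTING)
    old_binary_interprets(wire_byte)   # prints the legacy interpretation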
- Config gets generated at deploy time and saved as <version>.json. Code downloads the config file matching its own version or fails to start (this is a nice one since rollbacks become very deterministic and code rollback is the same process as config rollback)
- Deploy code first and verify it's updated and working correctly before changing feature flag config (this one is still prone to errors without automation)
The first one can be done on Kubernetes using the kustomize ConfigMap generator. The system I worked on used object storage for the config file and deploy tooling to generate it from a key value store at deploy time.
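A minimal sketch of that "<version>.json" scheme in Python (paths, names, and versions are all invented): the binary knows its own version and will only start with a config generated for exactly that version, which is what makes code rollback and config rollback the same operation.

    # Sketch of version-locked config (hypothetical names): the service refuses to
    # start unless it finds a config generated for its own build version.
    import json, pathlib, sys

    BUILD_VERSION = "2024.06.1"              # baked into the binary at build time
    CONFIG_DIR = pathlib.Path("configs")     # e.g. synced from object storage at deploy time

    def load_config() -> dict:
        path = CONFIG_DIR / f"{BUILD_VERSION}.json"
        if not path.exists():
            sys.exit(f"no config for build {BUILD_VERSION}; refusing to start")
        config = json.loads(path.read_text())
        if config.get("version") != BUILD_VERSION:
            sys.exit(f"{path} was generated for {config.get('version')!r}; refusing to start")
        return config

    # Demo: the deploy tooling would have written this file alongside the release.
    CONFIG_DIR.mkdir(exist_ok=True)
    (CONFIG_DIR / f"{BUILD_VERSION}.json").write_text(
        json.dumps({"version": BUILD_VERSION, "enable_new_routing": True}))
    print(load_config())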
Wild west times! It's worth noting that things have changed a lot in trading systems since then.
When I started working in this domain (2009), it was pretty crazy how unreliable those systems were, on all sides - banks, brokers, exchanges. Frequently you needed to confirm over the phone what quantities got executed, etc.
I remember when the Italian exchange was rolling out their systems, at some point we did "tests" on a mix of production and UAT - if my memory is correct, we were just changing the IPs to connect to for order passing, to test the upcoming release after the market closed. We couldn't just test in their UAT environment, since it was so buggy and half down most of the time.
And let's not even talk about the Excel spreadsheets, with VBA code that would make ChatGPT swear, that were pricing instruments whose traded volumes had a lot of zeros.
It's very different nowadays, in part thanks to stories like this one. Most things are automated, and there is much less of a cowboy attitude.
There are mandatory kill switches, a lot of layers of risk and trading-activity monitoring (on your side, on the exchange side), and really a lot of hard-learned lessons incorporated into the systems. That's also part of the reason why people sometimes tend to be naive about how hard it is to build a good trading system - the strategies are sometimes not really that smart - it's mostly about how to avoid getting killed by something that's outside of usual conditions.
Literally everyone in quant finance knows about Knight Capital. It even has its own phrase: "pulling a Knight Capital" (meaning cutting corners on mission critical systems, even ones that can bankrupt the company in an instant, and experiencing the consequences).
My team's systems play a critical role for several $100M of sales per day, such that if our systems go down for long enough, these sales will be lost. Long enough means at least several hours and in this time frame we can get things back to a good state, often without much external impact.
We too have manual processes in place, but for any manual process we document the rollback steps (before starting) and monitor the deployment. We also separate deployment of code from deployment of features (which is done gradually behind feature flags). We insist that any new features (or modification of code) require a new feature flag; while this is painful and slow, it has helped us avoid risky situations and panic, and has alleviated our ops and on-call burden considerably.
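To make the flag-gating concrete, here is a rough sketch in Python of the kind of gradual rollout meant here (the flag name and percentages are invented): a deterministic hash decides which customers see the new path, so exposure can be ramped from a few percent to 100% without another code deploy.

    # Rough sketch of a gradual, flag-gated rollout (hypothetical flag name).
    import hashlib

    ROLLOUT_PERCENT = {"new_pricing_path": 5}   # dialed up over days via config, not deploys

    def flag_enabled(flag: str, customer_id: str) -> bool:
        digest = hashlib.sha256(f"{flag}:{customer_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < ROLLOUT_PERCENT.get(flag, 0)   # unknown flags default to off

    def price(customer_id: str) -> str:
        if flag_enabled("new_pricing_path", customer_id):
            return "new code path"
        return "old, known-good code path"

    enrolled = sum(flag_enabled("new_pricing_path", f"cust-{i}") for i in range(1000))
    print(enrolled, "of 1000 customers are on the new path")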
For something to go horribly wrong, it would have to fail many "filters" of defects: 1. code review--accidentally introducing a behavioral change without a feature flag (this can happen, e.g. updating dependencies), 2. manual and devo testing (which is hit or miss), 3. something in our deployment fails (luckily this is mostly automated, though as with all distributed systems there are edge cases), 4. rollback fails or is done incorrectly, 5. missing monitoring to alert us that the issue still hasn't been fixed, 6. we fail to escalate the issue in time to higher levels, 7. enough time passes that we miss out on the ability to meet our SLA, etc.
For any riskier manual changes we can also require two people to make the change (one points out what's being changed over a video call, the other verifies).
If you're dealing with a system where your SLA is in minutes, and changes are irreversible, you need to know how to practically monitor and roll back within minutes, and if you're doing something new and manual, you need to quadruple check everything and have someone else watching you make the change, or it's only a matter of time before enough things go wrong in a row and you can't fix it. It doesn't matter how good or smart you are, mistakes will always happen when people have to manually make or initiate a change, and that chance of making mistakes needs to be built into your change management process.
>My team's systems play a critical role for several $100M of sales per day, such that if our systems go down for long enough, these sales will be lost.
Would they? Or would they just happen later? In a lot of cases in regular commerce, or even B2B, the same sales can often be attempted again by the client a little later; it's not "now or never". As a user I have retried things I wanted to buy when a vendor was down (usually because of a new announcement and big demand breaking their servers) or when my bank had some maintenance issue, and so on.
It's both (though I would lean towards lost for a majority of them). It's also true that the longer the outage, the greater the impact, and you have to take into account knock-on effects such as loss of customer trust. Since these are elastic consumer goods, and ours isn't the only marketplace, customers have choice. Customers will typically compare price, then speed.
It's also probably true that a one-day outage would have a negative net present value (taking into account all future sales) far exceeding the daily loss in sales, due to loss of customer goodwill.
It would be a serious issue for in person transactions like shops, supermarkets, gas stations, etc
Imagine Walmart or Costco or Chevron centralised payment services went down for 30+ mins. You would get a lot of lost sales from those who don't carry enough cash to cover it otherwise. Maybe a retailer might have a zip-zap machine, but lots of cards aren't embossed these days, so that's a non-starter too.
Not just lost sales. I've seen a Walmart lose all ability to do credit card sales and after about 5 minutes maybe 10% of people waiting just started leaving with their groceries in their cart and a middle finger raised to the security telling them to stop.
It depends on the business. It's not uncommon for clients to execute against different institutions' systems, and they can/would re-route flow to someone else if you're down.
Think less "buying a car" and more "buying a pint of milk". If you're buying a car and the store is closed, you might come back the next day. If you're buying milk you will just go to the store down the street.
I imagine it's the same with time-based or opportunistic businesses. If the shopping channel (assuming it runs around the clock) couldn't process orders, they'd have to decide if they want to forgo selling other products to rerun the missed ones.
For certain types of entertainment like movies or sports, the sale may no longer be relevant.
The real issue here (sorry for true Scotsman-ing) is that they were using an untested combination of configuration and binary release. Configuration and binaries can be rolled out in lockstep, preventing this class of issues.
Of course there were other mistakes here etc., but the issue wouldn't have been possible if this weren't the case.
> why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point
It's not the worst mistake in the story, but it's not "not the point." A proactive approach to pruning dead functionality would have resulted in a less complex, better-understood piece of software with less potential to go haywire. Driving relentlessly forward without doing this kind of maintenance work is a risk, calculated or otherwise.
It’s fine to have that kind of responsibility, but it has to actually be your responsibility. Which means you have to be empowered to say “no, we aren’t shipping this until XYZ is fixed” even if XYZ will take another two years to build and the boss wants to ship tomorrow.
As a profit non-taker, what responsibility can a worker even have? Realistically it lies in the range of their monthly paycheck and pending bonuses, and in a moral obligation to operate a failing system until it lands somewhere. Everything above that is a systemic risk for the profit taker, which, if left unaddressed, is absolutely on them. There's no way you can take responsibility for $400M unless you have that money.
It’s not scary when it’s done properly. And done properly can look like an incredibly tedious job. I think it’s for a certain kind of person who loves the process and the tests and the simulators and the redundancy. Where only 1% of the engineering is the code that flies the plane.
It feels anxiety inducing at first, but if you have good controls and monitoring in place, it becomes daily routine. You basically address the points of concern you naturally have, and the more reasonably anxious you are, the better for the business. From my experience with finance, I'd wager that the problem at Knight was 10% tech issues, 90% a CTO-ish person feeling ballsy. In general, that is, not just on that exact day or week.
I don't know if it's like this at every company, but typically there are plenty of humans keeping a close eye on what's going on whenever the software is placing orders on an exchange.
I've worked in various small to medium IT companies, a FAANG and another fortune 500 tech company. 6 months ago I moved to a proprietary trading company/market maker and it's the most interesting and satisfying place I've worked so far.
I hope to continue to "waste my life" for many years to come.
Actually it's one of the few truly intellectually-pure endeavors. Everything else is the same pursuit with extra steps:
Make a trading strategy to make money
vs
Make a cutting edge machine learning classifier to back out latent meaning in search queries to produce better search results to drive more traffic to google to sell ads to make money
You're not wrong, but the problem is those steps are also the steps that produce food, or improve health, or solve climate change or solve any of the innumerable problems we face as a society. As you identify, there are plenty of pursuits other than finance that are not particularly socially useful - it's not a very exclusive club.
I work in finance/tech, but my code doesn't execute million-dollar trades automatically. Cash transactions are reviewed by a human, and it's mostly data analysis type work.
It's one thing to make recommendations or calculations and give the report to a human. It's another to start trading high volume in real time automatically.
Most jobs do in fact contribute to the well being of humanity, however little. It's few jobs, like most in financial trading, that actively reduce the well being of humanity.
Never will you meet a more self-deluded and pathetic set of humans. Desperate money addicts that often become other kinds of addicts. Whole thing should be abolished.
Source: I worked in finance when I was young and dumb.
> Most jobs do in fact contribute to the well being of humanity, however little.
No, they don't. A lot of jobs hold us back, actually. Salespeople selling things people don't wanna buy, finance and tech bros vampirizing third world countries without the safeguards that western countries have on their capital markets, etc.
Things are somewhat different now than 5, 10, 20 years ago.
There has been a wave of "individual accountability regimes" released by pretty much every regulator.
I have worked with the SFC the most, so that's what I will describe here, but all these regulations are pretty much copy/paste of each other anyway.
I was an MIC under the SFC (HK) for various operational and financial responsibilities for approx 8 years, responsible for close to $3B exposure across equities, IRS & FX, and have now been licensed with the FCA (UK) for approx 1 year.
Basically, on top of the usual regulatory framework defining a top level Operating Officer (MOO) and subordinate Responsible Officers (RO), the new individual accountability regime creates the notion of Managers In Charge (MIC).
The MICs fill the gap that, increasingly, a considerable amount of operational responsibility lies in the hands of non-licensed individuals (i.e. tech people).
The SFC defines a number of responsibilities (e.g. DRP/BCP, kill switches, backups, failovers, rollbacks, load testing, etc.) and these responsibilities need to be allocated to one or more of the appointed MICs.
The SFC has a right to reject an appointment of MIC if the individual is not seen as fit and proper (that is assessed generally on an annual basis by a compliance officer, but can be re-assessed on the spot if you end up displaying unfit traits). The SFC also mandates a track record of experience and expertise on the assigned responsibilities, as well as a direct capability by the MIC to have control on his responsibilities. In clear terms, that means you need to have the actual power of saying "no", you need to have the power to hire someone if that is necessary for the safety of the operations, etc.
Once you get appointed as MIC, most of your responsibilities are based on _means_, not _end results_:
If Karen breaks production, that's not much of your problem (regulatorily speaking) as long as you can demonstrate that you had Karen attend 6h of training this year on how not to break production.
In terms of actual developer experience, the _means_ often take the form of trainings, code review, pre prod impact assessment, incident reporting procedures, etc.
So on one hand you have a very heavy personal and professional responsibility. But on the other hand you are at fault only if you did not set up a proper framework for things to work.
In terms of the professional responsibility, there is not much to do if you are deemed guilty. You will most likely be temporarily or permanently barred from having a licensed position. Nobody will hire you anyway.
For the personal responsibility, it is usually limited to single digit millions, and most big asset managers have insurance to protect you (otherwise no one would accept the role).
If you are interested in the actual additional responsibilities that were added after KC, then I suggest you have a look at MiFID II (the European regulation, well written and understandable), especially segment RTS 6, "Technical standards specifying the organisational requirements of investment firms engaged in algorithmic trading":
I feel like the first thing I would build into any automated trading system is a kill switch. Then every single diff or pull request I add would have some sort of automated testing to ensure the kill switch still works. Also, I'd manually flip it on/off once a day to make sure it works for real. That seems like the single most important thing to build and make sure works. Or is the system too complex for something like this and I don't understand the domain well enough?
Most systems do. At pastjob we had a few different levels:
- halt - just stop trading
- exit-only - only exit positions (but do so according to our alphas, no hurry)
- flatten - exit in a hurry but obey certain limits (often if liquidity was thin we would just "journal" the shares - move them to a long-term-hold (meaning more than the current day) account to exit in the opening auction the next day)
- market exit - get the fuck out, now, no matter what the cost.
Depending on what you're doing, a straight "pull the plug and stop trading" could leave you with eg unhedged positions that blow through your risk limits. But when your ability to actually execute those trades sensibly is broken regardless, yeah, you're still going to want to hit that button.
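A toy sketch of how those levels might be wired up in Python (the level names come from the list above; everything else is invented). The important property is that each level is a single, pre-tested action anyone on the desk can trigger.

    # Toy sketch of escalating kill-switch levels (implementation details invented).
    from enum import Enum, auto

    class KillLevel(Enum):
        HALT = auto()         # stop sending new orders
        EXIT_ONLY = auto()    # only reduce positions, at our leisure
        FLATTEN = auto()      # exit quickly but within limits
        MARKET_EXIT = auto()  # get out now, whatever the cost

    class TradingEngine:
        def __init__(self):
            self.level = None

        def trigger(self, level: KillLevel, who: str) -> None:
            # One call, no confirmation dialogs: speed matters more than ceremony.
            self.level = level
            print(f"{who} triggered {level.name}")

        def allow_order(self, increases_position: bool) -> bool:
            if self.level is None:
                return True                    # normal trading
            if self.level is KillLevel.HALT:
                return False                   # nothing goes out
            return not increases_position      # exit-only / flatten / market-exit

    engine = TradingEngine()
    engine.trigger(KillLevel.HALT, who="ops")
    print(engine.allow_order(increases_position=True))   # False: no new risk after HALT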
Hold on. Are we blaming the plane crash on the pilot here? It seems there was so much other stuff wrong with this company in the first place that such a deployment would tank it.
No kill switch - there literally needs to be a power switch and a trader who runs to the room and flips it. A ridiculously small amount of cash for the trading volume, and no way to borrow more to stay in business (or rather, borrowing required manual intervention not accessible to the trading system). Obviously also the decision to leave that code in there, and for there to be a config setting to bring it back.
Then the devops stuff - rollback plans, approvals, pairing on deployments, etc.
I would argue the real issue was the lack of an automated system (or multiple automated systems) that would hit the kill switch if the trading activity didn’t look right.
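A deliberately simple sketch in Python of what such a watchdog can look like (all thresholds are invented; real systems layer many such checks on both the firm and exchange side): compare realized activity against sanity limits and trip the switch automatically.

    # Sketch of an automated watchdog that trips the kill switch when trading
    # activity doesn't look right (all thresholds are invented).
    from dataclasses import dataclass

    @dataclass
    class ActivitySnapshot:
        orders_last_minute: int
        notional_last_minute: float   # dollars traded in the last minute
        realized_pnl_today: float

    LIMITS = {
        "max_orders_per_minute": 5_000,
        "max_notional_per_minute": 50_000_000.0,
        "max_daily_loss": -2_000_000.0,
    }

    def breached(snap: ActivitySnapshot) -> list:
        reasons = []
        if snap.orders_last_minute > LIMITS["max_orders_per_minute"]:
            reasons.append("order rate")
        if snap.notional_last_minute > LIMITS["max_notional_per_minute"]:
            reasons.append("notional rate")
        if snap.realized_pnl_today < LIMITS["max_daily_loss"]:
            reasons.append("daily loss")
        return reasons

    def watchdog(snap: ActivitySnapshot, trip_kill_switch) -> None:
        reasons = breached(snap)
        if reasons:
            trip_kill_switch(reasons)   # err on the side of stopping; humans investigate

    watchdog(ActivitySnapshot(120_000, 9e8, -30e6),
             lambda reasons: print("KILL SWITCH:", ", ".join(reasons)))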
Yes definitely, one has to assume that from time to time, bugs will reach the prod servers, no amount of tests and code review can completely prevent that.
Hopefully the kill switch system is reasonably easy to code review and test :-)
Sure, but there is always the possibility that you then shut down trading when things _aren't_ broken.
There are always two error rates.
Defining behavior is great for retrospective analysis but would you really feel comfortable putting hard cuts into production based on the answers to those questions? I’m genuinely asking, because IME I wouldn’t be.
That last nine in a trading system uptime has exponentially low value unless you have customers who care quite a lot.
Seriously, suppose you have a truly awesome system making $100B per year of revenue. If you unnecessarily shut down 0.1% of the time, that’s only $100M per year lost, and an 0.1% unnecessary shutdown rate seems pretty high.
Estimate what a real human can do in a day, and use that as the limits. Verify that the system behaves ok for some time, then scale up the desired trading volume and limits, observe, scale, repeat.
But you don't do it by making a (bad) guess up front and then just leaving it at that.
There's definitely more to this story. Why was there a fixed number of "flags" so that they needed to be reused? I wish there was a true technical explanation.
I can only think that it was some kind of fixed binary blob of 1/0 flags where all the positions had been used umpteen times over the years and nobody wanted to mess with the system to replace it with something better.
this is what stood out to me reading the story. i wonder if there was a reason why they opted for this, however half-baked.
it reads less to me like a case for devops as it does a case for better practices at every stage of development. how arrogant or willfully ignorant do you have to be to operate like this considering what’s at stake?
They probably already had a bitfield of feature flags, maybe it was a 16-bit integer and full, and someone notices "hey this one is old, we can reuse it and not have to change the datatype"
This incident highlights a problem that is often overlooked in the debate about feature branches versus feature toggles.
I've worked with both feature branches and feature toggles, and while long lived feature branches can be painful to work with what with all the conflicts, they do have the advantage that problems tend to be uncovered and resolved in development before they hit production.
When feature toggles go wrong, on the other hand, they go wrong in production -- sometimes, as was the case here, with catastrophic results. I've always been nervous about the fact that feature toggles and trunk based development mean merging code into main that you know for a fact to be buggy, immature, insufficiently tested and in some cases knowingly broken. If the feature toggles themselves are buggy and don't cleanly separate your production code from your development code, you're asking for trouble.
This particular case had an additional problem: they were repurposing an existing feature toggle for something else. That's just asking for trouble.
That's interesting. Whenever I have an issue with a flag it gets picked up on dev/test/uat environments (everything gets tested, especially around the code behaving the same as before with the flag off). The code change never reaches production. And if for some reason the code under the flag is wrong and has reached production (something unexpected, unseen), undoing the change takes however long it takes to switch the flag back (plus however long the cache takes to update, if you have one).
That's a good approach if you can cleanly separate out the old code from the new code, and if you can make sure that you've got all the old functionality behind the switch. Unfortunately this can be difficult at times. Feature toggles involving UI elements, third party services or legacy code can be difficult to test automatically, for example. Another risk is accidental exposure: if a feature toggle gets switched on prematurely for whatever reason, you'll end up with broken code in production.
The cases where I've experienced problems with feature toggles have been where we thought we were swapping out all the functionality but it later turned out that due to some subtleties or nuances with the system that we weren't familiar with, we had overlooked something or other.
Feature toggles sound like a less painful way of managing changes, but you really need to have a disciplined team, a well architected codebase, comprehensive test coverage and a solid switching infrastructure to avoid getting into trouble with them. My personal recommendation is to ask the question, "What would be the damage that would happen if this feature were switched on prematurely?" and if it's not a risk you're prepared to take, that's when to move to a separate branch.
Having worked in some Fortune 500 financial firms and low rent “fintech” upstarts, I am not surprised this happened. Decades of bandaid fixes, years of rotating out different consultants/contractors, and software rot. Plus years of emphasizing mid level management over software quality.
As others have mentioned, I don't think "automation of deployment" would have prevented this company's inevitable downfall. If it wasn't this one incident in 2012, then it would have been another incident later on.
It's an entire industry built on adrenaline, bravado, and let's be honest: testosterone. How could their IT discipline be described as anything other than "YOLO"?
Trading is mostly based on a book that, like the waterfall model, was meant to be a cautionary tale on how not to do things. Liar's Poker had the exact opposite effect of Silent Spring. Imagine if Rachel Carson's book came out and people decided that a career in pesticides was more glamorous than being a doctor or a lawyer, we made movies glorifying spraying pesticides everywhere and on everything, and we told anyone who thought you were crazy that they're a jealous loser and to fuck off.
They were probably worth much more than $400M before the failure so it was a good investment opportunity. They would have been a money printing machine aside from this one major fuckup.
The nuance is a) what happens to existing equity stakeholders and b) does the bailout have to be repaid.
If the answer is nothing and no, then it’s a bailout philosophically. If the existing investors get diluted then they’re in part paying for the new capital injection.
A government bail out isn't the exclusive use of the phrase "bail out", it was both a bail out and an opportunity for investors to get great terms on equity.
> Had Knight implemented an automated deployment system – complete with configuration, deployment and test automation – the error that cause the Knightmare would have been avoided.
Would it have been avoided though? Configuration, deployment and test automation mean nothing if they don't do what they are supposed to do. Regardless of how many tests you have, if you don't test for the right stuff it's all useless.
The specific part is configuration as code. So the config change (flag activation) and code change (flag calling) would have been synchronized.
And there wouldn't have been one server out of 8 running a different build for any meaningful time; and even if the deploy had failed on that one server, it would have been obvious.
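A hedged sketch of what that synchronization might look like in practice: a pre-flight check that refuses to flip the flag until every server reports a build that actually implements it. The flag name, server names and build numbers are all made up:

    # Hypothetical pre-flight check for a config-as-code pipeline: the flag flip and
    # the code implementing it ship together, and the flip is blocked while any
    # server still reports an older build.
    MIN_BUILD_FOR_FLAG = {"use_new_router": 4821}    # flag -> first build that implements it

    def safe_to_enable(flag, reported_builds):
        """reported_builds maps server name -> build number it is actually running."""
        required = MIN_BUILD_FOR_FLAG[flag]
        stale = {srv: build for srv, build in reported_builds.items() if build < required}
        if stale:
            # One stale server out of eight is exactly the Knight failure mode; here it
            # blocks the flag flip instead of being silently reinterpreted.
            print(f"refusing to enable {flag}: stale servers {stale}")
            return False
        return True

    safe_to_enable("use_new_router",
                   {"srv1": 4821, "srv2": 4821, "srv3": 4821, "srv4": 4821,
                    "srv5": 4821, "srv6": 4821, "srv7": 4821, "srv8": 4799})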
That's based on the assumption that someone would have thought about testing that particular flag for that particular scenario.
In my view this would only have been caught by a deployment to an identical copy of production, with running, simulated transactions and high level functional testing. Testing each individual config value and every scenario where it may be used is playing whack-a-mole. Basically, I'd make a clone of prod, simulate everything that happens externally (APIs, etc.) and observe transaction KPIs and other high level business indicators. Testing for tech is ensuring that the tech works, and sometimes that means testing that it's broken.
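Roughly what I have in mind, as a sketch; the KPI names and ranges are invented:

    # Hypothetical KPI gate for a prod clone running simulated order flow: instead of
    # enumerating every flag/scenario combination, compare coarse business indicators
    # against ranges derived from recent real days.
    EXPECTED_RANGES = {
        "orders_per_minute":          (500, 20_000),
        "fill_ratio":                 (0.10, 0.90),
        "gross_notional_per_minute":  (100_000, 5_000_000),
    }

    def kpis_look_sane(observed):
        for name, (lo, hi) in EXPECTED_RANGES.items():
            value = observed.get(name)
            if value is None or not (lo <= value <= hi):
                print(f"KPI {name}={value} outside expected range [{lo}, {hi}]")
                return False
        return True

    # A runaway re-ordering bug shows up here as an absurd order rate, even if every
    # individual order is perfectly well-formed.
    kpis_look_sane({"orders_per_minute": 4_000_000, "fill_ratio": 0.99,
                    "gross_notional_per_minute": 900_000_000})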
2. As I mentioned above, I went to work at Knight as a DevOps engineer on a team that dealt directly with the team mentioned in the blog post.
There are lots of stories around this but I will share this one:
Late 2012 is when Apple rolled out the "emergency weather notification" function. I was in the office and the notification went off on multiple people's phones. Knight was also experimenting with call notifications.
So when the alert goes off, someone yells "God damn it! Not again!!" (thinking there was another big outage)
3. People outside of finance have no idea of the different types of outage that can happen due to all sorts of factors.
4. In finance in general, the amount of legacy code that behaves in weird ways or was written by someone 10 years ago who is no longer with the firm is ASTOUNDING.
Couple that with the billions of combinations of regulations, internal controls, multiple countries and jurisdictions, etc., and accounting for every single edge case becomes impossible. To use an infosec term, the "attack surface" of possible user actions that could lead to bugs is enormous.
Typical case:
- User says they want to see reports for a couple days worth of trading for all securities
- User also says they want to see FULL history for one security
- User never says they might want to see FULL history for ALL securities at the same time
- This being HN, someone will say "you should have thought of that"
- Sure, but then they pull only some of the history for a Ukrainian bond that has a 182-day (not 180 like most) period. This is the only example of this type of bond. Ever. Did you think of that? What should the system have done?
- And oh, btw, this system was pushed out quickly due to regulatory pressure, etc.
I would be interested to read these stories, but the twitter links only show a single tweet ending in the phrase "A thread." Perhaps this is a new feature of X whereby only logged-in users can see a tweet and its replies.
Much as I enjoy articles that reinforce my existing beliefs, high-frequency trading is a pretty extreme example when it comes to how badly things can go in a short time.
Their issue was neglecting an automated SCRAM system that would halt all the trading, or at the very least alerting plus manual intervention. The article touches on that. There was no excuse why the system wasn't halted by 9:32, which would've avoided most of the kerfuffle.
>They had 48-hours to raise the capital necessary to cover their losses (which they managed to do with a $400 million investment from around a half-dozen investors).
I'm very curious about this bit. How exactly do you raise $400m of "investment" to cover such a massive footgun, in 48 hours, when you haven't even had time to understand what happened or whether it would happen again?
Why are people stumping up hundreds of millions of cash here?
It is funny, but in one company I was working for, the more people they added the more they neglected all basics, such as backups. There were heavy processes for many things and they were followed very well, but for whatever reasons some really basic things went unnoticed for many years.
I refuse to believe that failed deployment can bring a company down. That is just a symptom. The root cause has to be a whole big collection of decisions and processes/systems built over years.
I see a lot of criticism of the deployment, but why did the developers "repurpose an old flag" that activates code that had been dead for 8 years, that you haven't deleted, and whose current functionality is completely unknown? That seems like the strangest decision made in this debacle.
To save time, I guess. They deleted the inactive code, so, why not, they thought. But then they forgot to deploy that change (to one server).
Bugs and configuration errors will happen from time to time, and might look silly in retrospect. But the real problem was, I think, that there was no kill switch (something managers and tech leads should have decided to add long ago).
This has nothing to do with “DevOps”, and I am getting tired of this word.
This mistake could have been prevented on multiple levels, and in my experience, deployments that involve major architectural changes are rarely repeatable or fully automatable.
Changes we make to software and hardware infrastructure are essentially hypotheses. They're backed by evidence suggesting that these modifications will achieve our intended objectives.
What's crucial is to assess how accurately your hypothesis reflects the reality once it's been implemented. Above all, it's important to establish an instance that would definitively disprove your hypothesis - an event that wouldn't occur if your hypothesis holds true.
Harnessing this viewpoint can help you sidestep a multitude of issues.
> (why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point).
Actually it's a big part of the point: they have a system that works with dead code in it. If you remove that dead code perhaps it unwittingly breaks something else.
That kind of Chesterton's-fence thinking is good practice.
Chesterton's Fence states that you shouldn't make a change until you understand something's current state. Removing code because it's dead is folly, if you don't understand 1) why it's there, and 2) why nobody else removed it yet.
As this is a postmortem, it was proven dead code. There is nothing in the text that mentions that they didn't know what the code did (which then wouldn't be dead code).
It may not be obvious that it's dead code - in a lot of popular interpreted languages, it's impossible to tell if a given function can be called or not
Your original comment is somewhat unclear. Are you advocating for leaving old code in because the system works and it's more stable that way, or taking it out to force the necessary refactoring steps and understanding that will bring?
I'm sorry I wasn't clear: I re-read my comment and couldn't think of a decent edit.
It was the author whom I was quoting as saying "why would someone have old code lying around." It seems obvious why that's a good idea and it seems commenters in this thread (including you) agree with me and not the author.
I don't exactly understand what this has to do with continuous delivery, but maybe I just don't know enough about this topic.
Wouldn't it have been best to set up a 'shadow infrastructure' and route every trade into it for several weeks/months to verify the correctness of the system?
I worked in fintech for a few years. I'll never again work on software that's responsible for trading, you could offer $1M/year and I wouldn't take it. By far the most stress I've ever experienced at a job.
While nice, automated deployment is the wrong lesson here; the real lessons are the failure to anticipate backwards incompatibility, and poor alerting and incident training.
Flags should never be reused and should be retired after they're no longer useful.
> Flags should never be reused and should be retired after they're no longer useful.
That's such a "no-brainer," that I don't think it's even written down, anywhere.
When I read that, I was like, "Whut?"
In the Days of Yore, when we hammered programs directly into the iron as Machine Code, we would do stuff like that, but I can't even imagine doing that with any halfway modern language. They don't say, but it's probably C++. I know that's popular for HFT.
Not so simple.
The company was then used as a building block to create another entity, which was then acquired for over a billion dollars.
"The company agreed to be acquired by Getco LLC in December 2012 after an August 2012 trading error lost $460 million. The merger was completed in July 2013, forming KCG Holdings.
...On April 20, 2017, KCG announced that it had agreed to be acquired by Virtu Financial for $20 per share in cash in a deal valued at approximately $1.4 billion."
Focusing on deployments is too narrow. Deployment can be automatic but still have a botched config.
In this context it's more useful to think in terms of production principles. The principle that was poorly followed was defence in depth. There was no line of defence after the deployment.
This is the Ur “devops fuckup” tale - I’ve told this to junior engineers who’ve bodged a deploy to make them feel better. I’ve been in this field for 20 years, and I can’t imagine I’ll ever have a day as bad as the engineers who got bit by this fuckup.
Not removing old code is akin to never throwing away food, even after it reaches its expiration date. Sure, you'll have it around next time you need it, but putting year-old yeast into your baguettes is, well, a recipe for disaster.
With git, yes. When I end up working with programmers who come from using other vcs, I find they are often the ones who don't delete code, or at best comment it out. I encourage them to trust git. It takes effort (barring a system failure out of git's control) to lose code. It can take some digging to find code in the git history, but it's there. Even if you run `git gc`, the objects are still there in the repo, by default. Even in extremis, if the central repo is gone, whoever has the most recent checkout still has history.
Imagine there was some way for a trading company to execute billions of dollars of trades and they say "ooops, sorry, that was all a mistake" can you not see how that would be abused?
Now, the story also says that within a minute of the market opening, the experienced traders knew something was wrong. Do they bear any culpability for jumping on those trades, making their money off of something they knew couldn't be intentional?
This isn’t really correct. Typically exchanges have safety parameters which market makers can set according to how they wish to trade, and if you exceed those your orders will no longer be accepted and existing orders may also be pulled.
Obviously there are false positives occasionally and there is typically communication between the exchange and the market maker to ensure those don’t reoccur.
Automation is not a silver bullet. Automation is still designed by humans. Peer reviews, acceptance test procedures, promotion procedures, etc all would have helped. And yes some of those things are manual. Sandbox environments, etc
Sometimes I wonder whether these events are more sinister than they appear to be. But then I heard that another MM is using Access applications to make markets for options, and I think it's just incompetence.
Ah, Knight Capital. The warning story for every quant trader / engineer.
This is what people don't realize when they say HFT (high frequency trading) is risk-free, leeching off people, etc.
You make a million every day with very little volatility (the traditional way of quantifying "risk" in finance) but one little mistake, and you're gone. The technical term is "picking up pennies in front of a steamroller (train)". Selling options is also like that.
In this case KCG was doing the opposite of making markets --- they were taking --- they were eating the spread over and over and over again until they ran out of money.
This entire story is about a trading firm that lost 400m trying to provide market liquidity. Which part of the loads of risk isn't clear in this context?
If a seller and a buyer are in market within seconds of each other, they would have traded successfully without a third party taking some of their money. As I understand it, HFTs are trying to avoid taking meaningful long-term positions (which is why latency matters to only them).
What risk are they taking exactly? Bugs ruining the business isn't meaningful risk for the customer. It isn't like day traders are at risk of going bankrupt due to that after all.
They claim liquidity is their value but given how they act they don't seem to be providing measurable liquidity, either in terms of price or volume. (Yes they increase volume by getting in the middle of trades but that isn't useful volume...)
Market risk isn't the only type of risk. Many businesses in other industries don't have market risk, that isn't abnormal. Even businesses that you would expect to be exposed to market risk aren't, since they hedge most or all of it.
There's operational risk, like what brought down Knight Capital, that's a type of risk. Or the risk that you will be put out of business by competition because you were too slow to innovate while burning through all your cash runway. HFT firms face the same risks that other types of businesses face. Smaller HFT firms fail often, and larger firms tend to stay around (although sometimes they also fail and often they shrink), which is similar to many mature competitive industries.
> given how they act they don't seem to be providing measurable liquidity
I'm not sure "How they act" should inform one's perspective on the empirical question of whether or not they are adding to liquidity. There is a lot of serious debate and research that has gone into that question.
How they act is the hyper-focus on being first to market: HFT wants to have the first buy or sell order at price X.
Being first to market does not impact liquidity availability. After all someone else has an order at that price already.
My point about risk is that, beyond going long or short for a meaningful amount of time (certainly not seconds, probably not minutes), trading quickly isn't hugely impactful on end users. Thus all of the downsides of trading quickly aren't offset by any risk reduction for them.
Depends on whether they truly take on the risk. Interestingly I can’t clearly tell from a quick google who exactly ended up holding the bag here, and what became of upper management.
Most people confuse market making/risk holding with high frequency statistical arbitrage strategies. I'm not totally sure exactly what Knight Capital was running, but generally the only "little" mistakes that would cause HFT market takers such as Jump (for the most part) to blow up are some type of egregious technical error like this, or some type of assumption violation outside of market conditions (legal, structural, etc.). Compare this to market makers like Jane Street who hold market risk in exchange for EV, and thus could lose money just based off of market swings (not to blowup levels if they know what they're doing), and you can see the difference between the styles.
I'm a proponent of both. But generally I hold more respect for actual market makers who hold positions and can warehouse risk.
There are plenty of bad options traders, particularly retail...but this is an oversimplification. You can buy an index fund, and it can go to zero (however unlikely). You're not guaranteed any return, whereas at least selling an option has some guaranteed fixed premium.
Professional options traders are incredibly sophisticated, and most of the tail risk is offloaded to people who are always long biased. Options as a whole massively improve price discovery in markets.
lol. No. Deployments were not the issue. At any given time an automated deployment system could have had a mistake introduced that resulted in bad code being sent to the system. It does not matter if it was old or new code. Any code could have had this bug.
The issue, and it's one that I see often: firstly, no visibility into the system. Not even a dashboard showing the software's running version. How often I see people ship software without a banner posting its version and/or an endpoint that simply reports the version.
Secondly no god damn kill switch. You are working with money!! Shutting down has to be an option.
Oh god. I just realized this is a PM. A blight on software engineering. People who play technical, and "take the requirements from the customer to the engineer". What's worse is when they play engineer too.
I mean, it makes no sense. Without even reading the article, just by working in IT, I can tell you that if you're one deployment away from bankruptcy then you're either doing it wrong or in the wrong business.
They were market makers, which is different. They help ensure that when you push sell on E*trade you actually get a price somewhat close to your order in a relatively short time. No need to call up a broker who will route the order to a guy shouting on the floor.
But ChatGPT would have fixed the issue faster in 45 mins than a human would. /s
A high-risk situation like this makes using an LLM a non-starter; I'm saying that before someone puts out a 'use case' for an LLM fixing this issue.
I'm sorry to preempt the thought of it in advance, but it would not have.
No they're preempting someone coming along and claiming this. Haven't seen it in the replies yet but there's typically one (or a lot in some cases) person(s) claiming ChatGPT will bring Jesus back from the dead sort of thing.
That's an odd non sequitur strawman you constructed to knock down. Did someone suggest LLMs as the solution, or are you just asserting superiority over an imaginary guy?
As expected, it seems many here (even you) couldn't figure out what the '/s' means, even though I preempted it in advance before anyone came along and tried to claim it anyway.
So even putting the '/s' to denote sarcastic intent doesn't work on HN. Can't even take a joke here.
Substitute "a developer forgot to upload the code to one of the servers" for "the deployment agent errored while downloading the new binary/code onto the server and a bug in the agent prevented the error from being surfaced." Now you have the same failure mode, and the impact happens even faster.
The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.