See title - I think a collection of failure stories would be a useful learning resource.
Edit: I’m thinking primarily about technical decisions, though of course these are made in a wider context (e.g. choosing tech that’s impossible to hire for).
I know of a company that was pretty successful with a VB6 app they wrote in the late 90s, but twenty years later they needed to modernize. They spent two years on a from-scratch rewrite that was never completed (second-system syndrome).
The company abandoned the rewrite, acquired a competitor, and rebranded it as the company's new product. It turns out the competitor's code base had serious deficiencies (couldn't scale to the company's size, had important bugs), and was very unpleasant to develop against. In the 6 months following the acquisition, well over half of the engineering team left the company (heavily skewed towards the highest-caliber engineers) because they couldn't stand what their jobs had become. Many long-time customers threatened to leave because of user-facing technical issues in the acquired product.
A telling incident: the company needed to deploy an emergency fix to the mobile app. The only machine authorized to publish to the app store was a laptop that the owner of the acquired company had left at home overseas. To complete the deploy, he needed to call his wife and walk her through the publication process over the phone.
The company managed to fix enough of the problems that they're still around today, but for a while there was a lot of uncertainty around the company's future.
> They spent two years on a from-scratch rewrite that was never completed
Wow, that’s doomed to fail hard.
Where I work, we’re slowly rewriting a legacy Perl code base into Go HTTP services, and the best decision we made was to allow interoperability between the two systems. This way we can migrate the legacy system part by part with minimal impact. It’s a slow process (in 5 years we’ve migrated maybe 3/4 of the code) but very reliable. And we actually have something to show for it.
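In case it helps anyone attempting the same thing: the interop layer can be as simple as a routing proxy in front of both systems. Here's a minimal Go sketch (the path prefixes, ports, and the `newRouter` helper are made up for illustration, not our actual setup):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// migrated lists the URL path prefixes already ported to the new Go services.
var migrated = []string{"/api/users", "/api/orders"}

func isMigrated(path string) bool {
	for _, p := range migrated {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

// newRouter proxies migrated paths to the new services and
// everything else to the legacy Perl app.
func newRouter(legacy, modern *url.URL) http.Handler {
	return &httputil.ReverseProxy{
		Director: func(req *http.Request) {
			target := legacy
			if isMigrated(req.URL.Path) {
				target = modern
			}
			req.URL.Scheme = target.Scheme
			req.URL.Host = target.Host
		},
	}
}

func main() {
	// Stand-in backends so the sketch is self-contained.
	legacy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "perl")
	}))
	defer legacy.Close()
	modern := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "go")
	}))
	defer modern.Close()

	lu, _ := url.Parse(legacy.URL)
	mu, _ := url.Parse(modern.URL)
	front := httptest.NewServer(newRouter(lu, mu))
	defer front.Close()

	for _, path := range []string{"/api/users/42", "/login"} {
		resp, _ := http.Get(front.URL + path)
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s -> %s\n", path, body)
	}
}
```

As each endpoint gets ported, you add its prefix to the migrated list; the legacy app never needs to know the proxy exists, which is what makes the part-by-part migration low-risk.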
I think you're probably asking about mistakes like deleting a database? But more often companies, orgs, or teams dying has to do with engineering management in my experience. I'll name a few I have experienced and you all can decide whether or not these count as "engineering mistakes".
- Saying nothing about it, but "secretly" moving the team to another country by changing the manager and director and no longer hiring in this country, even when team members quit. It took a while, but I figured out that the plan was to let natural attrition take over and for every person who quits from here, hire their replacement there.
- Different company: after being acquired, the acquiring company communicated for eight months that no changes in staffing would take place, and that once the acquisition closes, everyone will be put on exciting new projects and a new bright future will emerge. Then, after the acquisition closed, laying off a third of the company immediately. Another third chose the exit soon after. Eventually, the company died.
- Third company: after being acquired by a large competitor, the director of technology for this third company promised them a new product for four years. I'm led to understand by those closer to the situation than myself that this director then proceeded to coast on vaporware demos for those four years, claiming the need to "pivot" or "reboot" the product as necessary, and promising more and more pie-in-the-sky fantasies until finally the jig was up and he was fired. That subsidiary also officially closed its doors eventually.
Ironically, I'm going through the opposite of #1 - not a bad company, but one that decided it needed to move countries. They made a big announcement, said basically "in 3 months you're all out the door, but keep up the hard work and help with the transition!" I think they _just_ hired their first couple of FTEs in the new location.
You can imagine the general reaction - many left, those with vacation time are taking it, progress ground to a halt, morale and motivation are through the floor. So basically, they're burning three months of salary instead of doing exactly the thing you were bothered by: hiring new teams in the new locale, scaling down work in the old one, and then shutting down the old office once things are stable in the new home.
It's truly baffling to me: they're moving to a more expensive location, yet they're confused by the higher hiring costs, the slowed progress, and the doubling of expenses.
End of the day it seemed like the best way to set several million dollars on fire - I imagine an executive saved their bacon somehow, but I really don't imagine there will be much of a bonus after the math is all done and tallied. I'm really curious how the new teams will react, realizing that there was this massive talent dump and nobody seems to know where anything came from - the transition is being handled as professionally as possible by those that remain, but I'm pretty sure I wouldn't take an FTE offer from them at this point given the track record.
I have heard some companies hide the truth of relocating to another city and announce it right before the move. As an employee, you either move with the company or find a new job.
It's probably better to build a team in the new location and then announce the move in advance. It will be a shock either way.
>But more often companies, orgs, or teams dying has to do with engineering management in my experience.
Nothing sinks a product 10X faster than bad engineering management.
I once joined an established Enterprise software company where leadership had decided they wanted to make a new product. A lot of the justification was they saw the need with existing customers, and they could make $200M+ ARR by simply getting their existing customers to use it. They invested a lot of money in marketing, created a lot of hype, and built up a huge worldwide org.
One of the premises of the product was that if you had a fleet of machines using this product, it would be a force multiplier for extracting value out of a particular broadly defined use case.
Except, engineering and product management was so busy getting the engineers to build something that could scale to infinity, they never bothered to improve the basic stuff that let somebody get value during a trial period. Customers would struggle to deploy and integrate this thing onto one machine. And when they did, they didn't see the value.
Leadership always parroted the same thing: slides about how, once you have thousands of machines using this, the value is there. But over 3 years nobody really implemented a nice way of integrating 5 machines, let alone 1000.
3 years later the org went from 3000 employees worldwide to a few sales people in every region and development mostly being done in India. Comically enough, the lack of financial resources meant that the new management had to focus on things that actually mattered to customers. So the product actually got a lot better. But by that time all the momentum and trust was gone.
I didn’t mean simple mistakes, but failures at design/architecture level (I was thinking more about technical leadership than management/staffing). This is a good list though!
> I figured out that the plan was to let natural attrition take over and for every person who quits from here, hire their replacement there.
Honestly, if the company has decided to move, that seems like a good way to do it, no? Nobody gets laid off, they avoid the ill will of axing a team outright or asking you to train your replacements; it becomes a simple gradual turn-over.
Yeah, so in hindsight I feel that there's a better way to resolve this than the two options of "don't tell anyone" or "tell everyone."
The way I would do it, is tell people that we need to move their team to {{other country}} and therefore we would like to offer them either a severance package to find a new job _or_ work out a plan for them to transition to some other team that is in their location.
Saw someone buy a lot of bumper stickers to promote a political issue, without paying sufficient attention to the printing details. The ink was not UV tolerant. Used outside, the stickers became blank white within a week.
As I recall, part of the slogan was "long term thinking".
When I've seen companies sink for technology reasons, it's usually been because of the cumulative effect of tech debt accumulating over a couple years. Engineers need to hit a launch deadline, so they throw in some global variables or make a few private APIs public in the interest of expediency, figuring that they'll clean it up later. Later comes, and the engineers move onto new projects instead of cleaning up the previous mess. New code gets written that depends on the previous hacks. Eventually you end up with a big ball of mud that breaks whenever anybody touches it, and nobody can launch any new features. Now the company is forced into the position of rewriting everything from scratch, but can't do so strategically, they just need to do it now because they don't have any other options.
Companies failing for "big" single decisions - like rewriting their code from scratch, or a poor tech stack choice, or a poor initial architecture - are much more rare, for the simple reason that most experienced technical leadership knows that these are risky decisions and tends to put a lot of thought into them, and then there's a lot more organizational commitment to following through on the decision once it's been made. Also, if done from a position of market leadership, you can usually recover from them - many startups start with a poor tech stack and poor architecture, and just assume that they'll rewrite everything once they get lots of funding and a suitable lead over competitors. The insidiousness of the "creeping tech debt" scenario is that you often don't realize you're screwed until you've fallen behind the competitors, which means you enter the "rewrite and tech switch" scenario from the position of being market laggards, which can kill you.
Well, it didn't end the company but only because the company was bought and bought again - then killed.
The new head of engineering had no experience in software. He developed a dislike for the language the product had originally been built in for his own reasons. He ordered a complete rewrite of the software in a different language. He hired contractors since we had no in house skills in the new language.
This lack of in house skills also meant that oversight of the code the contractors produced was poor. Several of us raised alarms over this but we were ignored. Eventually, the day came when the first paying customer was signed for the new platform. It was a complete disaster. The code was very unstable and full of bugs. The deadline kept getting pushed farther and farther out.
The new owners concluded the entire division was a failure. They fired the head of engineering who had been in charge of the disaster and his boss too. Then they started work on a new version of the platform using entirely different people. The ones left behind on the old platform (in the old language) had nothing to do but provide support for the dwindling number of open contracts.
Millions of dollars went down the drain, years of time were wasted, customers were badly served and a lot of people were incredibly frustrated, all because one executive decided to ignore the in-house talent and experience and follow his own inflated ego instead.
> He ordered a complete rewrite of the software in a different language. He hired contractors since we had no in house skills in the new language.
Ignoring for a moment the rest of it: What was the end state supposed to be? Like, pretend the rewrite went perfectly - now what? You still have nobody in-house that knows the language, let alone being familiar with the codebase!
I've actually seen a (Sales) executive, without a second's worth of technical experience, bad-mouth Ruby / Rails in meetings. He would say that it was a competitive disadvantage to use it. He would say that if our customers found out we used Ruby on Rails that they'd leave us. He had a few other things he would repeatedly say about it. It was all so bizarre.
FWIW, he was fired after about 18 months on the job (for reasons unknown to me).
As a Ruby dev this completely makes sense to me, based on the Java developers I have worked with. I'm willing to bet he never even considered JRuby as a happy compromise - again based on my experience with those same Java developers.
I've also noticed that most Java devs dislike Ruby just as much as I (a ruby dev) dislike Java. But the guy who made this decision had never done development in Ruby, Java, or anything else. He was a former db administrator who went into management.
In the early 2000's the consulting firm I worked for got a customer who was building a "revolutionary" (their words) new medical software suite. They were having difficulties shipping the product so we were hired to help. Turns out they were insanely paranoid about IP theft, so they had 20-30 programmers working on the code, each had zero access to any other programmer's slice (i.e. source code) of the app, only being allowed to use libraries and APIs to interact with. No one except the execs had access to the entire codebase, thus no code reviews, no unified architecture or design agreements; basically it was an app made up of 20-30 independent apps, all doing things differently without any coordination. After we got hired they fired all of their programmers and gave us the entire source code, it took months just to be able to build the app at all from source, but it was such a mess it was impossible to make something remotely shippable.
One day they vanished and owed us nearly $800,000. It was enough that our parent company just shut us down a few months later. Oddly enough we had a really good group of developers and likely could have rebuilt the whole app ourselves, but our parent insisted on trying to recover the money instead of just taking the IP.
My guess is that the software is so boring that it attracts weak engineers. But the complexity is really high due to the regulations. So in the end it’s a bad mix.
It’s both an operational and a technical decision, because it results in inadequate tooling and one-off specials.
A service I know well had an issue, and in response they started doing what modern services do - nuking things, assuming everything would come back clean as fresh versions came up.
But unknown to the ops team, they nuked a bunch of custom stuff that no one knew how to build or really what they contained. Developers with direct access to production had rolled them out.
Seven days of partial outage later, they covered it up by hiding parts of the app until they could straighten things out.
I think this community and the tech world in general overvalues engineering and I say this as an engineer myself. In my experience engineering or technology rarely seem to be the reason for a company's success or failure. There are certainly outliers in which a tech is so incredible that it alone can build a company. There are also some engineering mistakes that can be too costly to fix later and people can always be negligent and not do something obvious like proper backups. However those situations are usually extremely rare. The most common engineering mistakes can be fixed with more money. Typically some other factors are what make or break a company. That is what leads to not having the money to fix those engineering mistakes.
This seems like such a short-sighted take. For a company to be labelled a "tech" company, the tech itself should be the product. For example Apple: people buy iPhones and Apple stuff because they perform well (e.g. fantastic camera, displays, M1...) and they work well with each other. Certainly at this point Apple's name gets people to pay a premium, but that's still grounded in the quality of their stuff. You can pick similar examples across other tech companies too. For example, Netflix got their head start because of their streaming tech.
> The most common engineering mistakes can be fixed with more money.
That applies to pretty much all mistakes.
> Typically some other factors are what make or break a company.
Tech is still pretty much at the top of that list, especially for "tech" companies.
I don't think that many people truly are choosing between an iPhone or Android phone based off the quality of the camera. I also don't think that Android phones are universally worse from an engineering perspective. Remember it wasn't that long ago that Apple designed a phone that lost reception if you held it the way that many people instinctually hold their phones. I think the differences between Apple and its competitors mostly comes down to different priorities and motivations. A product from a company that is committed to vertical integration is going to be vastly different than a product from a company that only exists to get you to use software which only exists to sell you ads.
Also Netflix didn't have a head start because of the quality of their streaming tech. They had a head start because of the business decision to invest in streaming before their competitors did. That early decision led to the quality of their library which is what attracted customers which is what allowed them to reinvest in their tech.
There are plenty of other examples. Twitter and Reddit come to mind as companies that had seemingly awful technology that would regularly fail. They succeeded despite that. Companies like Facebook have certainly produced some good tech, but almost all of that has come after the company became a success. That is similar to Microsoft, who got their big start selling someone else's tech.
I think this is generally true and agree 90% of the time, although I've once seen the opposite: an otherwise successful small SaaS that severely underinvested in product. They had an MVP that they left basically unchanged for years and years while ramping up sales and marketing, which worked for a while before competitors started popping up with better design/UX/perf/features etc.
- Telling the engineers it's an MVP and then, when they're done, releasing and selling it not even as a beta but as v1 production code. I see this all the time, and I take a lot of money from these companies to rework and retrofit these systems to the quality they should have had before going live.
- Choosing load-bearing tech that you're unsure will meet 100% of your needs based on hype. For example there are tons of companies with marketing websites that for one reason or another can't have useful user analytics software attached or run A/B tests.
- Letting the engineers define the entire product. This leaves you with a "perfect" solution which then can't be explained let alone sold because the customer/user perspective was not properly considered. I've seen more than one innovative (and desperately needed) startup with patented tech fail this way despite having a groundbreaking solution at hand. Product design matters.
- Dividing your org into "good" teams and "bad" teams by funneling productive engineers towards important problems and not redistributing them once those problems are solved. This "good" team eventually spends all their time fixing the broken parts of the system that had been relegated to the "bad" teams (because they are so degraded that they are now the most important problems). This then causes those "good" engineers to quit because they don't like pure maintenance work, and the resulting rapid loss of knowledge cripples the business.
- Wasting large quantities of dev hours on things that won't ever make your cost of labor back. Obvious examples include companies that spend more to support IE than they bring in from IE users (I've observed this regularly doubling implementation times on a per-task basis).
- Native code avoidance. Everybody I know that has spent > 2y on a React Native project eventually switches to native code and wishes they had started that way. This is a sample size of 10+ real $MM projects. I've seen the same for many Electron-style apps. The resulting "stop the presses" rewrite is almost always started too late to save the day thanks to simple sunk cost fallacy.
... the list goes on and on. Statistics don't lie, the road to failure is wide and welcoming ;-)
I worked for a company that did an SAP modernization project. The IBM consultants did a large part of converting all the custom ABAP stuff. The idea was to get back to as vanilla SAP as possible and included a ton of Business Objects and data warehouse work as well to convert old reporting etc.
They were constantly behind and decided to just push the load testing off the road map to hit the CIO's arbitrary go-live date. Within three days the data volume got large enough to grind the entire system to a halt, and the company couldn't take, bill, or fulfill orders. Of course the consultants were well out the door by that point. I spent months unwinding the stupid crap they did on the Business Objects reporting side.
Worked at a startup that built a 3D modelling tool in the browser. They had begun a major refactoring that was still ongoing; at that point the core team of 5 had churned through 10 engineers in 1 year. So when I started my job, I learned that you could neither create a working build nor run any of the tests (not to speak of the CI pipeline). Eventually the build could be fixed, as well as the few tests. But as it turned out, every piece of basic user functionality had to be rewritten, and the structural refactoring was still in progress.
IMHO the refactoring sounded very reasonable, since the old code base was unmanageable. But there was a culture of extreme pressure from the CTO, making everyone rush through tasks and preventing anyone from doing "proper engineering". By that time the CTO was already more or less permanently absent. Afterwards the team lead left because of burnout, I also left, and later on I heard they closed down the company.
Actually I've seen similar destructive refactorings at other places. At one they also had a lot of subtle problems leading to many user complaints. It could be fixed but by then it was already too late.
IMHO refactorings are great but it's always necessary to keep regressions in check all the time and really understand the design decisions of existing code.
Falling into the CMS trap[1] at a sensitive time in a startup can kill the business. In brief: when you try to build out too much complexity up front, everything easy becomes hard, everything hard becomes impossible, and generally any change takes too long. And since sometimes you don't have the luxury/runway to retry, you are stuck with it.
To protect their identity I won't go into specifics, but not implementing anti-tampering on local and remote backups i.e. protection from root. Backups only residing on live systems. Not protecting systems from bad automation. Not deprecating old automation frameworks and continually adding new automation frameworks. I am intentionally excluding specific incidents.
I think the implication of the sum of the things listed is that an old backup system overwrote/mangled newer backups because nobody turned it off and the two systems were targeting the same storage.
Backups sent to a remote system should be append only so that if a machine is compromised, a malicious actor cannot delete or corrupt previous backups.
An easy way is to stream the backups to tape, and have someone swap the tapes. Typically people who do this take the recorded tapes offsite, but to prevent against an electronic attack it’s adequate to have the removed tapes even in the same room, just not in the tape drive.
Separate storage of file hashes would make tampering detectable. Actually preventing root from modifying a file probably has to be done at the hardware level.
Not engineering, but CEO banking the entire future of the company on a partnership, where we were dependent on the partner company for future financing, but the success metrics of the partnership were entirely in the hands of their sales department.
Never bet the future of your company on metrics that are entirely out of your control.
Not sure if it will sink the company, but moving from very good native apps to a single pwa/electron style app that was poorly written.
The business justification seemingly makes sense: a single development train. But the execution was horrible. Bugs, horrible performance, inconsistent features copied over from the various native apps, etc.
It was released as a major upgrade, but it was almost alpha grade in reality.
That org probably saved a lot in engineering, but the customer backlash was immense and likely cost them more.
(It was a very popular and beloved note taking app. that tried to become an all-in-one day planner/calendar/todo/note platform).
Attempting to rewrite a multi-million-dollar application at the core of the business. It was never designed as an embrace-extend-extinguish style migration, so either it would all work perfectly, or the whole business would be sunk.
I have seen a couple of startups that wanted to build and launch their product as fast as they could. The product was supposed to be a SaaS, but in that rush they chose technologies that made adding features very difficult. While they were deliberating and procrastinating on adding features, competitors started to pop up and take over the niche.
The lesson is that, if you are just planning to make an MVP make sure you have your scaling figured out. The blue ocean gets red faster than you can imagine.
Company gave the Product team full rein and always "deferred" paying off tech debt to chase the next quarterly goal or business pivot. Eventually the tech stack was a giant, messy tarball with no test coverage and layers of hacks, but everyone could still deliver features because the original team was still there (this also provided justification to the Product team that tech debt payoff was unnecessary or could be "deferred" again).
After a while, business stagnation from constant pivots and new initiatives resulted in attrition and a downward spiral of budget cuts and reduced morale. Attrition and knowledge loss caused velocity to drop, which caused more of the team to leave, which made the business stagnation worse, which caused morale to go down more and budgets further cut, etc. Eventually the company couldn't recruit the same level of talent to replace people who left, hiring standards dropped dramatically, and it was now impossible to pay off tech debt or even really run the tarball reliably anymore (a Ruby monolith running an ancient, unsupported version of Rails with a million security holes and bugs).
Engineering leadership made a mass exodus, the few people that were left ended up on a death watch as the move to an outsourced engineering org from India was implemented on the way towards a full migration to a 3rd party vendor platform. Software engineering was completely eliminated from the company and the lesson the company leadership took from this was that "we never should have built the platform in the first place" along with a dose of "external business factors outside of our control caused the decline in revenue, forcing us to make hard decisions".
Palantir is operating under the idea that "The AI will figure it out once we can access all the data"
This is a mistake because:
1. A lot of the data at big orgs is garbage or only understandable within a certain context by specific people.
2. Internal politics within lots of organizations prevents access to this data.
3. The AI cannot just figure it out. You need tons of humans in the mix which brings you back to #1 and #2.
Word on the street is that a couple of years back, Palantir knocked a good chunk of SOCOM's intelligence capabilities offline for a few weeks.
Intelligence analysts like to use Communities of Interest "COIs" to keep track of things. These are kind of like wikis. A COI on the Middle East may have groups like Al-Qaida and ISIS, and people like MBS, and countries like Saudi Arabia and links between them. They can also have hierarchies, like "planes" split into "prop" and "jet" which in turn splits into "military" and "civilian," think decision trees.
One problem with these is they can take a while to stand up. COIs need at least a few hundred objects in them before they become useful. Palantir came along and said, "well, we can help stand them up with our AI."
They did their work, apparently didn't properly test it and just did things in production (maybe SOCOM doesn't have a dev environment, who knows?), and when they hit run, it created hundreds of thousands of objects throughout the COIs, with levels in the hierarchy that didn't make any sense. Like, imagine "747" falls under "Helicopter", which falls under "Cessna", which is under the "Nation" category.
It took weeks for SOCOM to rebuild things. I'm not sure whether they had problems with their backups or simply didn't have things properly backed up at all.
We recently saw the near-downfalls of Intel, Apple, and Boeing because of grave engineering mistakes, ultimately caused by grave management mistakes.
Apple seems to have jumped ship successfully by switching to ARM; their keyboard and OS are still unusable though. I wonder how their China adventure will turn out.
Intel could be bought by Nvidia or AMD, but since the government invested so much into their backdoors there, they will be kept alive. But no chance that their architectural problems can be solved at all.
Boeing is unfixable since the hostile McDonnell Douglas management takeover and the technical decline since then. Now even their stable flagship product line fell down in a straight line.
In my experience, most successful people (and by extension, people who run successful companies) don't really know why they were successful and often attribute it to their own actions (eg, I write great code), rather than luck or networking or having solid code reviewers or some external factor. Because of this, when they are successful, they often double down on whatever it is that they think they did the first time, and that frequently doesn't work a second time.
Maybe the first idea hit a niche that wasn't being satisfied by the market, and the second attempt tried to break into a heavily saturated market. Maybe they just lucked out by being in the right place at the right time, or having the right connection that could bring in a multi-million dollar contract, or...
From a broader perspective, I would say that the biggest and most common mistake that people make (engineers too!) is not to spend time examining the hows and whys of the success that you've had so far. Were you successful because you had brilliant ideas, or was it because you had a mediocre idea that filled an underserved niche market? Were you successful because you used Postgres or Ruby or Kafka or Elasticsearch? Were you successful because you created a culture of innovation and learning and team players? Or were you successful because you happened upon a fantastic solution for the specific problem at hand but can't generalize it to larger problems?
If you don't know why you were successful in the first place, it's hard to continue to be successful.
TL;DR: lack of introspection and evaluation of success criteria over time
Been there, done that. They were bursting out of an ancient VMware cluster on private hardware in traditional hosting, with a residential router facing the internet. Hard yikes.
A colleague accidentally swapped every product photo on the shop website of one Premier League football team with a kit photo of another Premier League team. He pushed up the change (with his testing image...) and went home, not realising until the angry phone calls started coming in.
1) Elective rewrite of a working and performant system without an understanding or any testing mechanism to know if the rewrite was producing similar output.
2) Solving theoretical scaling problems with every possible technology du jour.
1) Employing only the cheapest contracting shops for several years and being baffled when all progress ground to a halt due to a mountain of spaghetti.
2) then deciding to rewrite