You are of course right, but in practice I've seen many more incidents caused by changes made to make things more robust than by simple incremental product changes, which usually are feature-flagged and so on. At deeper levels of the stack there's usually less ability to do that kind of containment (or it's too expensive, or takes too long, or people are lazy), and so I often wonder whether it's better to make the trade-off at all, or to just keep things simple and accept only the "simple", easy-to-fix sources of downtime.
For example, the classic rule is to always have a minimum of 3 or 5 nodes for every stateful system. But in some companies, 1 hour of planned downtime on Monday mornings from 7AM to 8AM for operations and upgrades (which you only use when needed), plus eating the times when a machine actually dies, would add up to less downtime than all the times you'd go down because of problems caused by the very thing that's supposed to make you more robust. An incident here because replication lag was too high, an incident there because the quorum-keeping system ran out of disk, etc., and you're probably already behind. And then there's Kubernetes. At some point it does make sense, once you have enough people and enough complexity to handle it properly, but usually we adopt it too early.
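Purely as a back-of-envelope sketch with invented numbers (not measurements from any real system), the comparison I have in mind looks something like this:

    # Back-of-envelope with made-up numbers: annual downtime of a single-node
    # setup with a weekly maintenance window vs. an HA cluster that sometimes
    # trips over its own redundancy machinery.

    # Single node: assume ~10 maintenance windows actually used per year (1h each)
    # plus one dead machine per year taking 4h to restore.
    single_node_hours = 10 * 1.0 + 1 * 4.0   # 14h/year

    # HA cluster: assume 6 redundancy-related incidents per year
    # (replication lag, quorum issues, failover bugs), ~3h each.
    ha_cluster_hours = 6 * 3.0               # 18h/year

    print(single_node_hours, ha_cluster_hours)  # 14.0 vs 18.0 -> "simple" wins here

Obviously you can pick numbers where the cluster wins, and at some scale it clearly does. The point is just that people rarely run even this crude arithmetic before reaching for the more complex setup.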