This week we had a few minutes of downtime on an internal service because of a node rotation that triggered an alert. The responding engineer started to put together a plan to make the service HA (which would have tripled the cost to serve). I asked how frequently the service went down and how many people would be inconvenienced if it did. They didn't know, but when we checked the metrics it had single-digit minutes of downtime this year and fewer than a dozen daily users. We bumped the threshold on the alert to longer than it takes for a pod to be re-scheduled and resolved the ticket.
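For a rough sense of the numbers involved, here is a back-of-the-envelope sketch. The figures are assumptions, not the actual metrics: 9 minutes stands in for the upper bound of "single-digit minutes" per year, and the 120-second reschedule time and 180-second threshold are hypothetical values chosen only to illustrate an alert threshold longer than a pod reschedule.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(downtime_minutes: float,
                 period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the fraction of the period during which the service was up."""
    return 1.0 - downtime_minutes / period_minutes


def should_alert(outage_seconds: float, threshold_seconds: float) -> bool:
    """Fire only when an outage has lasted longer than the alert threshold."""
    return outage_seconds > threshold_seconds


# Assumption: 9 minutes is the worst case for "single-digit minutes" per year.
print(f"availability: {availability(9):.4%}")

# Hypothetical values: a pod takes ~120 s to be re-scheduled and the alert
# threshold is raised to 180 s, so a routine node rotation no longer alerts.
print(should_alert(outage_seconds=120, threshold_seconds=180))  # False
print(should_alert(outage_seconds=600, threshold_seconds=180))  # True
```

Even in that worst case the service is already above "four nines" on a single replica, which is the kind of number worth knowing before tripling the cost to serve.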
This is the most sensible thing I’ve read on here in a while. Engineers’ obsession with tinkering and perfection is the slow death of many startups. If you’re doing something important like banking or air traffic control, fair enough, but a CRUD app for booking hair appointments will survive a bit of downtime.