

loire280 on Aug 9, 2024 | on: How we migrated onto K8s in less than 12 months

This week we had a few minutes of downtime on an internal service because of a node rotation that triggered an alert. The responding engineer started to put together a plan to make the service HA (which would have tripled the cost to serve). I asked how frequently the service went down and how many people would be inconvenienced if it did. They didn't know, but when we checked the metrics, it had single-digit minutes of downtime this year and fewer than a dozen daily users. We bumped the threshold on the alert to longer than it takes for a pod to be re-scheduled and resolved the ticket.

jack_riminton on Aug 10, 2024

This is the most sensible thing I’ve read on here in a while. Engineers’ obsession with tinkering and perfection is the slow death of many startups. If you’re doing something important like banking or air traffic control, fair enough, but a CRUD app for booking hair appointments will survive a bit of downtime.
