
First step is redundancy: having backups, failover, overprovisioning. Essentially prepared "plan Bs".
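To make the "plan B" idea concrete, here's a minimal sketch of failover in application code. The `call_with_failover` helper and the notion of "backends" as plain callables are my own illustration, not something from the SRE book; real failover usually lives in load balancers, DNS, or the database layer.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in order and return the first successful result.

    `backends` is a hypothetical list of zero-argument callables, ordered
    from primary to last-resort fallback.
    """
    errors = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # broad on purpose: any failure triggers plan B
            errors.append(exc)
    # Every prepared plan failed; surface all collected errors for debugging.
    raise RuntimeError(f"all {len(errors)} backends failed: {errors!r}")
```

Usage would look like `call_with_failover([read_from_primary, read_from_replica])`, where both function names are placeholders. The point is only that the fallback is prepared ahead of time rather than improvised during an outage.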

Next step is introspection: aggregate monitoring and enough detail to figure out if there are issues.

Next step is being notified when things break, i.e. anomaly detection and alerting.
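As a rough illustration of anomaly detection, here's a sketch of a rolling z-score check over a metric such as request latency. The class name, window size, and threshold are assumptions I picked for the example; production systems typically rely on the alerting rules of their monitoring stack instead.

```python
from collections import deque
from statistics import mean, stdev


class ZScoreAlert:
    """Flag samples that deviate strongly from a rolling window of recent values."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        # Window size and threshold are illustrative defaults, not recommendations.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Record `value` and return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= 2:  # stdev needs at least two data points
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            # A perfectly flat window has sigma == 0; this sketch skips it
            # rather than dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        # The sample is kept either way, so a sustained shift becomes the new normal.
        self.samples.append(value)
        return anomalous
```

You'd call `observe()` on each new data point and page someone (or open a ticket) when it returns True. A fixed z-score is crude, but it shows the shape of the thing: compare the present against a baseline and notify on large deviations.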

Then, debuggability. Enough detail to solve issues. Disaster recovery testing is part of ensuring you actually have this, and not just believe you do.

Aside from that, there's CI/CD, automated scaling, and automated isolation of bad actors. There are so many things one could do, but how much is worthwhile depends on how large the team is. I'd argue that this type of automation isn't that important if it's just one person.

The SRE book(s) [1] contain many of these high-level ideas. Don't try to do them all at once. :) (Bias: Niall, one of the editors, was my manager when I joined Google SRE.)

[1] https://sre.google/books/


