
True, it seems like a simple concept to me.

1. Define a reliability target in advance (best expressed as SLOs), along with the steps to take if it is not reached.
2. If the service fails to reach the target, carry out the reliability steps agreed in step 1.
3. Repeat at regular intervals.

The point, I think, is that things are arranged in advance, not after some shit happens, because people get very subjective about "their own" service. The target is there, so let's try to reach it. We have an error budget as well, so let's use it. If you don't have any target (as I've seen in a lot of places), or only a wishful 100% reliability, I'm absolutely sure you'll have major reliability problems.
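
To make the error budget part concrete, here is a rough sketch in Python. The names (`Slo`, `review`) and the request-based way of counting are my own assumptions for illustration, not something taken from the SRE book; a real setup would likely measure over a rolling time window and pull the counts from monitoring.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    # Hypothetical SLO: fraction of requests that should succeed,
    # e.g. 0.999 means 99.9%.
    target: float

    def error_budget(self, total_requests: int) -> float:
        """Number of failed requests the target tolerates in this window."""
        return (1.0 - self.target) * total_requests

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        """Budget left after subtracting the failures actually observed."""
        return self.error_budget(total_requests) - failed_requests


def review(slo: Slo, total_requests: int, failed_requests: int) -> str:
    """Periodic check: the decision is mechanical because it was agreed in advance."""
    if slo.budget_remaining(total_requests, failed_requests) < 0:
        return "budget exhausted: do the reliability steps agreed in advance"
    return "within budget: carry on as usual"


if __name__ == "__main__":
    slo = Slo(target=0.999)
    print(review(slo, total_requests=1_000_000, failed_requests=200))
    print(review(slo, total_requests=1_000_000, failed_requests=1_500))
```

Note what happens with a wishful 100% target: `Slo(target=1.0)` has a budget of zero, so a single failed request already exhausts it.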

So the SRE book tries to give you a solution to a lot of the headaches that medium to large companies might be facing.



