Site Reliability Engineer (SRE). Broadly speaking, Google divides software engineering into two roles:
- software engineers, who create software, handle its initial maintenance, and develop long-term maintenance plans for it
- site reliability engineers, who keep existing products healthy, deal with unexpected complications in real time, and continuously engineer improvements to how they do their jobs, so that each SRE can support more and more running software.
We are. Given Google's sheer size, it does have user-visible outages, but the ratio of outages to (uptime × services supported) is pretty much best in the industry.
... it has to be, because at that scale, outages are catastrophic for a lot of people.