I work on something at a similar scale to 1.1.1.1; if we had this kind of setup our oncall would never be asleep (well, that is almost already the case, but alas). It's easy to say "just implement X monitor and you'd have caught this," but there's a real human cost, and you have to be extremely vigilant about deleting monitors or you'll be absolutely swamped with endless false-positive pages. I don't think a 5 minute delay is unreasonable for a service at this scale. A rough back-of-the-envelope sketch of that trade-off is below.
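To make the trade-off concrete, here's a back-of-the-envelope sketch with numbers I made up entirely for illustration (nothing to do with how 1.1.1.1 actually alerts): if each one-minute sample can trip a threshold on noise alone, requiring the condition to hold for several consecutive minutes before paging cuts false pages dramatically, and the price is exactly the detection delay people are complaining about.

```python
# Back-of-the-envelope only: assume each 1-minute sample independently trips
# the threshold on noise with probability p, and we page only after M
# consecutive bad samples. Rough overlapping-window estimate of noise pages.
def false_pages_per_day(p: float, consecutive_minutes: int) -> float:
    samples_per_day = 24 * 60
    return samples_per_day * p ** consecutive_minutes

for m in (1, 3, 5):
    # a 2% per-minute noise trip rate is an arbitrary illustrative guess
    print(f"{m} min persistence -> ~{false_pages_per_day(0.02, m):.4g} false pages/day")
```

With one-minute persistence that's dozens of bogus pages a day; with five minutes it's essentially never, but you've also accepted a roughly five-minute floor on how fast you can detect anything.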
This just seems kinda fundamental: the entire service was basically down, and it took 6+ minutes to notice? I'm just increasingly perplexed at how that could be. This isn't an advanced monitor; it's perhaps the first and most important monitor I'd expect to implement (based on no closely relevant experience).
I don't want to reduce this to an argument from authority, but there are a lot of trade-offs in monitoring systems, especially at that scale. Among other things, aggregation takes time at scale, and with enough metrics and numbers coming in, your variance is all over the place. A core fact about distributed systems at this scale is that something is always broken somewhere in the stack - the law of averages demands it - so if you fire an all-hands alarm any time part of the system isn't working, you've got alarms going off 24/7. Actually detecting that an actual incident is actually happening, on a system of the size and complexity we're talking about, within 5 minutes is absolutely fantastic.
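For anyone curious what that looks like mechanically: most alerting stacks implement some version of a "pending, then firing" rule, where the aggregated signal has to stay bad for a whole evaluation window before anyone gets paged (Prometheus spells it `for:`). Here's a minimal sketch of that pattern; the class, thresholds, and sample values are all made up for illustration, not anything Cloudflare actually runs.

```python
from collections import deque

class SustainedAlert:
    """Page only when the aggregated signal stays bad for a full window
    (the 'pending -> firing' pattern; Prometheus calls it `for:`)."""

    def __init__(self, threshold: float, window_samples: int):
        self.threshold = threshold                  # e.g. global success-rate floor
        self.recent = deque(maxlen=window_samples)  # last N evaluation results

    def observe(self, success_rate: float) -> bool:
        """Feed one aggregated per-minute sample; True means fire the page."""
        self.recent.append(success_rate < self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)

# One evaluation per minute: a single noisy dip never pages, a sustained
# outage does, but only after the full window has elapsed.
alert = SustainedAlert(threshold=0.95, window_samples=5)
for minute, rate in enumerate([0.99, 0.90, 0.99, 0.40, 0.41, 0.39, 0.38, 0.37]):
    if alert.observe(rate):
        print(f"page on-call at minute {minute}")
```

The one noisy dip at minute 1 is ignored; the real outage starting at minute 3 pages at minute 7. That window is exactly where a ~5 minute detection floor comes from, and shrinking it is how you end up with the 24/7 alarms described above.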