
CloudFlare thinks it had to do with bad BGP routes:

https://blog.cloudflare.com/cloudflares-view-of-the-rogers-c...

Not much technical detail from the Rogers CEO:

> "We now believe we've narrowed the cause to a network system failure following a maintenance update in our core network, which caused some of our routers to malfunction early Friday morning,"

https://www.cbc.ca/news/business/rogers-outage-interac-debit...




Bad BGP routes are an externally visible symptom of the outage -- they're how Rogers was telling other ISPs that its own networks were unreachable. But knowing that the networks were unreachable doesn't explain why.


Read the Cloudflare post:

=======

Cloudflare Radar shows a near complete loss of traffic from Rogers ASN, AS812, that started around 08:45 UTC (all times in this blog are UTC).

What happened? Cloudflare data shows that there was a clear spike in BGP (Border Gateway Protocol) updates after 08:15, reaching its peak at 08:45.

=======

BGP storm first, outage second.
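
For a sense of what that "clear spike in BGP updates" looks like in data terms, here's a minimal sketch of per-minute bucketing over a feed of timestamped UPDATE messages. The input format and the 10x-median threshold are illustrative assumptions, not Cloudflare's actual pipeline:

    # Minimal sketch: flag minutes where the BGP UPDATE count far exceeds
    # the baseline. Input format (one epoch timestamp per observed UPDATE)
    # and the 10x-median threshold are illustrative assumptions.
    from collections import Counter
    from statistics import median

    def find_update_spikes(timestamps, factor=10):
        """Return the start (epoch seconds) of each anomalous minute."""
        per_minute = Counter(int(ts) // 60 for ts in timestamps)
        baseline = median(per_minute.values())
        return sorted(m * 60 for m, n in per_minute.items()
                      if n > factor * baseline)

    # A steady trickle of updates, then a storm packed into one minute.
    quiet = list(range(0, 3600, 30))           # 2 updates/min for an hour
    storm = [3600] * 500                       # 500 updates in minute 60
    print(find_update_spikes(quiet + storm))   # -> [3600]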


If an internal route goes down, the information is propagated to adjacent networks via BGP. So the BGP storm could have been caused by routes becoming unreachable (possibly intermittently).
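
As a toy illustration of that fan-out, here's a simulation assuming a made-up four-AS peering graph and naive flooding -- none of real BGP's policy, MRAI timers, or path hunting, all of which add further churn:

    # Toy model: each AS that learns of a withdrawal sends an UPDATE to
    # every peer; a flapping route multiplies the message count. This is
    # naive flooding, not real BGP, but it shows how one internal failure
    # becomes a storm of updates on the outside.
    def propagate_withdrawal(graph, origin):
        seen, frontier, messages = {origin}, [origin], 0
        while frontier:
            nxt = []
            for asn in frontier:
                for peer in graph[asn]:
                    messages += 1            # one UPDATE on this session
                    if peer not in seen:
                        seen.add(peer)
                        nxt.append(peer)
            frontier = nxt
        return messages

    graph = {                                # hypothetical peering graph
        "AS812":  ["AS174", "AS6939"],
        "AS174":  ["AS812", "AS6939", "AS3356"],
        "AS6939": ["AS812", "AS174", "AS3356"],
        "AS3356": ["AS174", "AS6939"],
    }
    flaps = 5                                # withdraw/re-announce cycles
    per_event = propagate_withdrawal(graph, "AS812")
    print(per_event * 2 * flaps)             # -> 100 UPDATEs for one prefix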


>BGP storm first, outage second.

Not really. If you overlay the two graphs, you see that the traffic loss happened almost in lockstep with the BGP storm. The two events are correlated, but that doesn't imply causation: there could be a common cause behind both of them, which is what the parent poster was saying.

https://i.imgur.com/OnP4Kcz.png


Also, bad routes can certainly cause outages, but one lasting this long?!

That raises other questions about their ability to recover from what isn’t an entirely unique situation (bad routes).


It’s [time to live](https://en.m.wikipedia.org/wiki/Time_to_live). Once a bad route is published and propagates, TTL determines how long it remains cached.

The real issue is that it took Rogers engineers so long to diagnose and address the problem, allowing incorrect route maps to propagate to the far corners of the internet.

Anecdotally, a friend in Greece roaming with a Canadian Rogers SIM card was unable to use her phone. Perhaps other Canadian Rogers subscribers, traveling on far continents can confirm?

My guess (pure speculation) is that Rogers' TTL values have been dramatically decreased by now, which will increase the bandwidth requirements for network state management. That's also not great news for Rogers customers and peer networks.


Generally speaking, BGP route propagation does not include a TTL cache-invalidation mechanism: routes are never invalidated and rechecked. A BGP session's ongoing existence is what validates already-learned routes, and additions and deletions are conducted in-band. When you do hear "TTL" in the context of BGP, it means that routes should not be propagated onward indefinitely, like TTL for IP packets.
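
In code terms, the bookkeeping looks something like this sketch (class, method, and prefix names are invented): a learned route lives until it's explicitly withdrawn in-band or the session that carried it dies, and nothing expires on a per-route timer.

    # Sketch of session-scoped route lifetime in a BGP speaker's Adj-RIB-In.
    # Routes leave only via an explicit withdraw or the death of the whole
    # session (hold-timer expiry) -- there is no per-route TTL.
    class AdjRibIn:
        def __init__(self):
            self.routes = {}                 # (peer, prefix) -> attributes

        def on_update(self, peer, prefix, attrs):
            self.routes[(peer, prefix)] = attrs    # implicit replace, in-band

        def on_withdraw(self, peer, prefix):
            self.routes.pop((peer, prefix), None)  # explicit removal, in-band

        def on_session_down(self, peer):
            # The only other way routes disappear: the session itself ends.
            self.routes = {k: v for k, v in self.routes.items()
                           if k[0] != peer}

    rib = AdjRibIn()
    rib.on_update("AS812", "24.100.0.0/16", {"as_path": [812]})
    rib.on_session_down("AS812")             # every route from that peer goes
    print(rib.routes)                        # -> {}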


> a friend in Greece roaming with a Canadian Rogers SIM card was unable to use her phone

I've heard this from several individuals. It makes sense: Rogers, being totally dark, couldn't authenticate any roaming sessions.

I wonder how fast a roaming provider boots you off their network though. Roaming usually tunnels your traffic to your home provider, so I could see data halting immediately. Maybe local calls work for a while.

This effect sank in when they said they couldn't receive any SMS 2FA codes. Even people with a local SIM setup (highly recommended!) still need to slide in their home SIM to log into some things.


I don't know if they were from Canada, but I photographed a family on vacation here in Minnesota, and they said their "phones don't work in this area." They seemed surprised.

It was an extended family of 14 people and there are plenty of cell phone towers of various sorts that reach where we were.



