Someone following the news more closely, please fact-check me, but AFAIK, the chain of events:
- The Optus network (the second largest telco network in Australia) went down
- Optus executives initially couldn't coordinate anything because they were... all on Optus
- The network outage lasted ~14 hours
- The network outage affected triple-zero, the country's emergency number, because Optus was rebooting towers, causing phones to stay connected to them instead of falling back to another carrier to dial 000
- After the network was finally back up, Optus blamed "a 3rd party" for the outage
- The "3rd party" then turned out to be Singtel, Optus' parent company
- Singtel issued a statement to basically say Optus was wrong
- Optus then issued another statement saying the outage was caused by them using default configuration files on some of their Cisco routers
- The Australian senate summoned the Optus CEO for a hearing
Networking people will be shocked, SHOCKED, to hear that the root cause appears to have been a broken BGP push. Taking 14 hours to recover from that is all on Optus though.
And the damage went well beyond spotty emergency calls: for example, if you run a small business that relies on credit card payments, you were fucked if your terminals were on the Optus network. The situation was so bad that prepaid SIM cards for Telstra (the main competitor) were selling out in much of the country.
Causes: A Border Gateway Protocol (BGP) routing problem played a role in the outage. Public data from CloudFlare showed a spike in BGP route announcements from the Optus network around the time the outage occurred — over 940,000 announcements in an hour from a node that normally makes less than 3,000 announcements per hour — indicative of a BGP routing problem.
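To put those numbers in perspective, a spike like that is trivially detectable with a simple threshold check over per-hour announcement counts. A minimal sketch, assuming you've already pulled hourly counts from some public source; the figures below are placeholders loosely based on the numbers above, not real data:

    # Toy threshold check over hourly BGP announcement counts for one network.
    # The counts are invented for illustration; the real observation came from
    # public Cloudflare data as described above.

    BASELINE_PER_HOUR = 3_000   # "normally less than 3,000 announcements per hour"
    SPIKE_FACTOR = 10           # arbitrary alerting multiplier (an assumption)

    hourly_announcements = {
        "03:00": 2_400,
        "04:00": 941_000,   # roughly the reported spike during the outage window
        "05:00": 650_000,   # made-up tail-off value
    }

    for hour, count in hourly_announcements.items():
        status = ("likely routing incident"
                  if count > BASELINE_PER_HOUR * SPIKE_FACTOR
                  else "within normal range")
        print(f"{hour}: {count:>9,} announcements -- {status}")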
[snip] committee describes the outage as a gradual event triggered by loss of connectivity between neighbouring computer networks. The report suggests that approximately 90 edge provider routers disconnected as an automated protective measure against routing update overload. The failures occurred following a software upgrade at a North American Singtel exchange that caused one of the routers to disconnect. This, in turn, caused Optus's routers to rapidly update their own routing tables, exceeding the pre-configured default threshold limits set by Cisco Systems and triggering the shutdown. The tabled report and Singtel stressed that the software upgrade was not the cause of the fault.
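To make the cascade mechanism a bit more concrete, here's a toy model of that self-protection behaviour: every router shares the same default update limit, so once the flood of routing changes arrives they all trip at roughly the same time. This is a simplified sketch with placeholder numbers, not a description of Cisco's actual prefix-limit implementation:

    # Toy model of edge routers self-isolating when a routing-update limit is
    # exceeded. Simplified illustration only; the limit, router count and
    # behaviour are placeholders, not the vendor's real implementation.

    DEFAULT_UPDATE_LIMIT = 100_000   # stand-in for a vendor default threshold

    class EdgeRouter:
        def __init__(self, name: str, limit: int = DEFAULT_UPDATE_LIMIT):
            self.name = name
            self.limit = limit
            self.updates_received = 0
            self.isolated = False

        def receive_updates(self, count: int) -> None:
            if self.isolated:
                return
            self.updates_received += count
            if self.updates_received > self.limit:
                # Protective shutdown: drop the session rather than risk
                # exhausting memory/CPU on an oversized routing table.
                self.isolated = True

    # ~90 PE routers, all on the same default limit, all hit by the same flood.
    routers = [EdgeRouter(f"pe-{i:02d}") for i in range(90)]
    for r in routers:
        r.receive_updates(150_000)   # a burst well beyond the default limit

    print(sum(r.isolated for r in routers), "of", len(routers), "routers self-isolated")

The point is simply that a uniform default limit turns one upstream event into a correlated, fleet-wide shutdown rather than an isolated failure.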
- The CEO told the senate hearing that she now carried Telstra and Vodafone SIM cards (maybe practical, terrible optics)
- Transport for NSW is heavily reliant on Optus, so public transport was badly impacted (not entirely Optus' fault, but public outrage means someone had to be the scapegoat)
The sad thing about this happening to Optus is that this sort of cascading technical fault _could_ have happened to any of the big telcos, but because it happened to Optus on the back of the data breach, all the focus is on Optus, not on the wider lessons that ought to be learned.
Would’ve been great to come out of an event like this looking at whether catastrophic fault conditions like this exist elsewhere in our national infra, but it feels like all that’ll happen is Optus gets the shit kicked out of them while other providers count their blessings.
What annoys me isn't so much that they had this outage.
It's that they took so long to restore. Carrier systems are supposed to be multiply fault tolerant.
Even if you have to put warm bodies physically in front of them, you should have enough of those people around to get the core networks up and going again within an hour or so by rolling back to a known-good configuration.
Even if they've somehow managed to brick the systems, they should have enough hot/cold spares, and the ability to call $VENDOR to hand-deliver new ones if need be.
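For what "rolling back to a known-good configuration" might look like at fleet scale, here's a minimal sketch. The directory layout, device names and push_config() transport are all assumptions for illustration, not anyone's actual tooling; in reality you'd use whatever out-of-band console or config-management system the carrier already runs:

    # Hypothetical bulk rollback to archived known-good configs.
    # Inventory, file layout and push_config() are assumptions for
    # illustration -- not a description of any carrier's real tooling.

    from pathlib import Path

    KNOWN_GOOD_DIR = Path("/var/backups/network-configs/known-good")  # assumed layout

    def push_config(device: str, config_text: str) -> None:
        # Placeholder for whatever transport is actually reachable during an
        # outage (out-of-band console, NETCONF, vendor management system, ...).
        print(f"[{device}] pushing {len(config_text)} bytes of known-good config")

    def rollback(devices: list[str]) -> None:
        for device in devices:
            backup = KNOWN_GOOD_DIR / f"{device}.cfg"
            if not backup.exists():
                print(f"[{device}] no archived known-good config -- needs manual attention")
                continue
            push_config(device, backup.read_text())

    rollback(["pe-01", "pe-02", "pe-03"])  # hypothetical device names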
The model for network delivery has changed. Networks like Rogers in Canada, Optus in Australia, and Dish in the USA outsource the core knowledge needed to recover their networks to vendors like Nokia and Cisco.
The employees locally lack the knowledge and the access to restore the network without guidance from external vendors outside the country. From an operating cost perspective, this is the right choice, but for reliability and sovereignty it's terrible.
Sadly, with RCS, 5G Standalone and other new technologies that require operating servers with leading edge software, operators repeatedly choose to outsource the entire stack to an external vendor like Nokia rather than replicate that knowledge locally.
I have a FirstNet SIM for my phone (the first responder network). I've never experienced it, and it only applies during designated events (i.e. a switch has to be flipped at the carrier, so it might not have worked here depending on the outage), but during said events, while my phone is nominally on AT&T, a few things are meant[1] to happen:
Voice calls should be prioritized over other network traffic. Data should be the same. And my phone should roam from AT&T to VZW (or even TMO).
[1] having said that, I've not looked too closely, and it sounds like (unsurprisingly) that might not exactly be the reality...
Optus' official position on the events, as tabled to the Australian parliament, is at [1]. In summary:
> "This unexpected overload of IP routing information occurred after a software upgrade at one of the Singtel internet exchanges (known as STiX) in North America, one of Optus’ international networks. During the upgrade, the Optus network received changes in routing information from an alternate Singtel peering router. These routing changes were propagated through multiple layers of our IP Core network. As a result, at around 4:05am (AEDT), the pre-set safety limits on a significant number of Optus network routers were exceeded. Although the software upgrade resulted in the change in routing information, it was not the cause of the incident."
> "It is now understood that the outage occurred due to approximately 90 PE routers
automatically self-isolating in order to protect themselves from an overload of IP routing
information. These self-protection limits are default settings provided by the relevant
global equipment vendor (Cisco)."
> "Several hypotheses and paths to restoration were explored over the period up to 10.30am."
And then in later statements:
> "Nokia is our managed services partner for our network, and they were involved from the very beginning in managing the incident and recovering the network; their staff are based in India in two locations"[2]
One of the key problems appears to be heavy reliance on outsourced Nokia staff in India, who seemingly would have been disconnected from Optus' systems in Australia. Within Australia, local Optus staff perhaps had Optus-provided mobile phones and couldn't be reached while the mobile network was down. At a minimum, you'd like to think that on-call operational staff exist near all PE routers and have multiple communication means, such as mobile phones on other carriers, satellite phones, and fixed Internet connectivity not provided by Optus.
The total outage broke down as 6.5 hours to diagnose the problem and a further 3.5 hours to get 98% of connectivity re-established. Resolving the problem once diagnosed required physical presence at 14 sites across Australia to reset 90 PE routers (as part of "100+ devices").
This seems like it was driven by Singtel in an effort to just make everyone forget about the recent issues. Namely, a massive leak of customer PII, and a 14 hour outage of their core networks.
Optus's issues are long running. My understanding is that they're largely driven by Singtel's desire to keep cutting everything to the bone.
I've got no idea about Rosmarin's effectiveness as a CEO.
Sure, it didn't help that she was a no-show for most of the outage, and when she did do interviews the responses were weak and not exactly confidence-building; it wasn't clear they knew what the problem was or how to fix it.
Even so, I don't see how throwing a new CEO at it is going to result in any meaningful change while Singtel is still the owner and there are no legislative changes to protect consumers and critical infrastructure.
Large parts of the business are driven entirely by short-term metrics, which can be fiddled with to hide problems. The use of contracting firms (both onshore and off) to manage/build/maintain what should, for a telco, be core strengths only seems to make this worse.
Having worked with Singaporean telcos in a past life, I can attest that the standard action item for a postmortem where someone is at fault is identifying a scapegoat and firing them.
Opinion based on a few anecdotes, but I'm sure the blame lies with Singtel, especially given that there was initial finger-pointing that was then retracted. Classic "protect the castle, sacrifice the farmhouse (administrator)".
I've heard from multiple sources that Optus in Australia is somewhat of a cash cow for Singtel, slightly undercutting Telstra's exorbitant pricing whilst absolutely minimising support and administrative costs. The 2022 security breach[0] is potentially a symptom of this.
Interestingly, the article specifies this as the cause:
On Friday, Optus confirmed the outage was due to a configuration issue with more than 90 Cisco routers, which could not cope with changes to routing information supplied from Singtel Internet Exchange (STiX) after a routine software upgrade.
On the HN thread of the day that this happened I had a comment downvoted for suggesting the fault originated with Singtel (my fault for not expanding - it was in response to a rhetorical question that had 'Australia' as the 'obvious answer').
Regardless of the fault origins, this is a justified resignation.
Optus (the buck stops with the CEO) totally and utterly borked the PR handling of this from the get-go.
They had multiple opportunities to get out in front, to shape the story, to make a statement, to answer a few obvious questions, etc.
I've rarely seen a large company like this just ditch the playbook for handling PR in the face of a screwup and sit things out for hours with no comment.
Was it eight hours before any kind of official response of substance?
WTF was going on internally, and where the hell was the PR dept?
It did seem strange that any response took such a long time, and even after that amount of time the response was effectively a shrug emoji. I mean, yes, it's lying, but standard playbook is to say something along the lines of "we've identified the cause and the resolution is proving trickier than initially estimated, but we're expecting full network service back up within X"
I have to assume the CEO knew this was a final straw after the 2022 data breach, and this shit-show PR was essentially a free-kick since the writing was already on the wall.
She was probably also told: you know you're fired, but not until you've fronted the senate committee; we're not going to subject our new CEO to that as their first order of business.
Optus has been a basket case for many years, and it was only made worse by appointing a number of ex-banking executives who don't understand the role Optus' products play in their customers' lives.
Here we are: the CEO has resigned.
EDIT:
To add more context, just over a year ago (in Sept 2022), Optus, under the same CEO's leadership, had a massive data breach: https://en.wikipedia.org/wiki/2022_Optus_data_breach