We have done a similar operation routinely on databases under pretty write-intensive workloads (tens of thousands of inserts per second). It is so routine that we have automation to adjust to planned changes in volume, and we do so about a dozen times a month. It has been very robust for us. Our apps are designed for it and use AWS’s JDBC wrapper.
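For anyone curious what "designed for it" looks like in practice, here is a minimal sketch of the kind of connection setup I mean, using the AWS Advanced JDBC Wrapper with its failover plugin enabled. The host, credentials, driver class name, and property names here are placeholders written from memory, so check them against the wrapper's documentation rather than treating this as production code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class FailoverAwareConnection {
    public static Connection open() throws Exception {
        // Register the wrapper driver (class name from memory -- verify against the docs).
        Class.forName("software.amazon.jdbc.Driver");

        Properties props = new Properties();
        props.setProperty("user", "app_user");            // hypothetical credentials
        props.setProperty("password", "app_password");
        props.setProperty("wrapperPlugins", "failover");  // enable the cluster-aware failover plugin

        // "jdbc:aws-wrapper:" wraps the underlying PostgreSQL driver; host and db are placeholders.
        String url = "jdbc:aws-wrapper:postgresql://my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:5432/appdb";
        return DriverManager.getConnection(url, props);
    }
}
```

The point is that the app treats a writer change as a reconnect-and-retry rather than an error, which is what makes the routine scaling operations a non-event.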
My company has been intentionally causing attrition in the US by moving to effectively a 996 style schedule. As people quit, their positions are moved to the India office. It is not an officially communicated policy. I have just surmised this based on private conversations with the executives and what is actually happening.
Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.
The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.
Good reminder that you are only as strong as your weakest link.
This reminds me of the time Google’s Paris data center flooded and caught fire a few years ago. We weren’t actually hosting compute there, but we were hosting compute in a nearby AWS EU datacenter, and it just so happened that the DNS resolver for our Google services elsewhere was hosted in Paris (or, more accurately, it routed to Paris first because it was the closest). The temp fix was pretty fun; that was the day I found out that the /etc/hosts of deployments can be globally modified in Kubernetes easily, AND it was compelling enough to want to do that. Normally you would never want an /etc/hosts entry controlling routing in kube like this, but this temporary kludge shim was the perfect level of abstraction for the problem at hand.
Probably. This was years ago so the details have faded, but I do recall that we weighed about 6 different valid approaches of varying complexity in the war room before deciding this /etc/hosts hack was the right approach for our situation.
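For anyone who wants to do the same thing today: the mechanism I assume the parent used is the pod spec's `hostAliases` field, which appends entries to /etc/hosts in every container of the pod. A sketch under that assumption (the deployment name, IP, and hostname are made up):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      hostAliases:                       # appended to /etc/hosts in every container of the pod
        - ip: "203.0.113.10"             # a healthy endpoint, instead of the one routing to Paris
          hostnames:
            - "service.example.googleapis.com"   # hypothetical hostname being pinned
      containers:
        - name: app
          image: example/app:latest
```

The nice part is that it is declarative and applies to the whole workload at once, and it is trivially reverted by deleting the stanza once the incident is over.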
I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.
Wasn't there an issue where they required physical access to the data center to fix the network, which meant having to tap in with a keycard to get in, which didn't work because the keycard server was down, due to the network being down?
Way back when I worked at eBay, we once had a major outage and needed datacenter access. The datacenter process normally took about 5 minutes per person to verify identity and employment, and then scan past the biometric scanners.
On that day, the VP showed up and told the security staff, "just open all the doors!". So they did. If you knew where the datacenter was, you could just walk in and mess with eBay servers. But since we were still a small ops team, we pretty much knew everyone who was supposed to be there. So security was basically "does someone else recognize you?".
Well, you put a lot of trust in the individuals in this case.
A disgruntled employee can just let the bad guys in on purpose, saying "Yes they belong here".
That works until they run into a second person. In a big corp where people don't recognize each other you can also let the bad guys in, and once they're in nobody thinks twice about it.
Way back when DCs were secure but not _that secure_, I social engineered my way close enough to our rack without ID to hit a reset button before getting thrown out.
Late reply but, no, I really needed to hit the button and didn't have valid ID at the time. My driver's license was expired and I couldn't get it renewed because of outstanding tickets, IIRC. I was able to talk my way in; I had been there many times before, so I knew my way around and what words to say. I was able to do what I needed before another admin came up and told me that without valid ID they had no choice but to ask me to leave (probably an insurance thing). I was being a bit dramatic when I said "getting thrown out"; the datacenter guys were very nice and almost apologetic about asking me to leave.
There's some computer lore out there about someone tripping a fire alarm by accident, or some other event that triggered a gas system used to put out fires without water but that isn't exactly compatible with life. The story goes that some poor sysadmin had to stand there with their finger on something like a pause button until the fire department showed up to disarm the system. If they released the button, the gas would flood the whole DC.
My point is that while the failure rate may be low, the failure mode is that a dude burns to death in a locked server room. Even classified-room protocols place safety of personnel over safety of data in an emergency.
It wasn't Equinix, but I think the vendor was acquired by them. I don't actually blame them, I appreciated their security procedures. The five minutes usually didn't matter.
I remember hearing that Google, early in its history, had some sort of emergency backup codes that they encased in concrete to keep them from becoming a casual part of the process, and that they needed a jackhammer and a couple of hours when the supposedly impossible happened after only a couple of years.
> To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.
Classic.
In my first job I worked on ATM software, and we had a big basement room full of ATMs for test purposes. The part the money is stored in is a modified safe, usually with a traditional dial lock. On the inside of one of them I saw the instructions on how to change the combination. The final instruction was: "Write down the combination and store it safely", then printed in bold: "Not inside the safe!"
> It took an additional hour for the team to realize that the green light on the smart card reader did not, in fact, indicate that the card had been inserted correctly. When the engineers flipped the card over, the service restarted and the outage ended.
There is a video from the LockPickingLawyer where he receives a padlock in the mail wrapped in so much tape that it takes him whole minutes to unpack it.
Concrete is nice; other options are piles of soil or brick in front of the door. There is probably a sweet spot where enough concrete slows down an excavator and enough bricks mixed into the soil slow down the shovel. Extra points if there is no place nearby to dump the rubble.
Probably one of those lost in translation or gradual exaggeration stories.
If you just want recovery keys that are secure from being used in an ordinary way, you can use Shamir secret sharing to split the key across a few hard copies stored in safety deposit boxes at a couple of different locations.
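For anyone who hasn't seen how the scheme works, here is a rough sketch in plain Java (BigInteger arithmetic over a prime field): the key is the constant term of a random polynomial, each hard copy gets one point on it, and any threshold-many points recover the key via Lagrange interpolation. This is only to illustrate the math, not a vetted implementation; a real deployment should use an audited library or tool.

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

public class ShamirSketch {
    private static final SecureRandom RNG = new SecureRandom();
    // Field prime; must exceed the secret and be agreed on by all share holders.
    private static final BigInteger P = BigInteger.probablePrime(256, RNG);

    /** Split 'secret' into n shares (x, f(x)), any t of which can reconstruct it. */
    static List<BigInteger[]> split(BigInteger secret, int t, int n) {
        // f(x) = secret + a1*x + ... + a_{t-1}*x^{t-1}  (mod P), coefficients random
        BigInteger[] coeff = new BigInteger[t];
        coeff[0] = secret;
        for (int i = 1; i < t; i++) coeff[i] = new BigInteger(P.bitLength() - 1, RNG);

        List<BigInteger[]> shares = new ArrayList<>();
        for (int x = 1; x <= n; x++) {
            BigInteger bx = BigInteger.valueOf(x), y = BigInteger.ZERO;
            for (int i = t - 1; i >= 0; i--) y = y.multiply(bx).add(coeff[i]).mod(P); // Horner
            shares.add(new BigInteger[]{bx, y});
        }
        return shares;
    }

    /** Reconstruct the secret from any t shares via Lagrange interpolation at x = 0. */
    static BigInteger combine(List<BigInteger[]> shares) {
        BigInteger secret = BigInteger.ZERO;
        for (BigInteger[] sj : shares) {
            BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
            for (BigInteger[] sm : shares) {
                if (sm[0].equals(sj[0])) continue;
                num = num.multiply(sm[0]).mod(P);                 // prod of x_m
                den = den.multiply(sm[0].subtract(sj[0])).mod(P); // prod of (x_m - x_j)
            }
            secret = secret.add(sj[1].multiply(num).multiply(den.modInverse(P))).mod(P);
        }
        return secret;
    }

    public static void main(String[] args) {
        BigInteger key = new BigInteger(128, RNG);                      // pretend recovery key
        List<BigInteger[]> shares = split(key, 3, 5);                   // 3-of-5 split
        System.out.println(key.equals(combine(shares.subList(0, 3)))); // prints true
    }
}
```

Any two shares alone reveal nothing about the key, which is what makes the "hard copies in different safety deposit boxes" arrangement safe.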
The Data center I’m familiar with uses cards and biometrics but every door also has a standard key override. Not sure who opens the safe with the keys but that’s the fallback in case the electronic locks fail.
The memory is hazy since it was 15+ years ago, but I'm fairly sure I knew someone who worked at a company whose servers were stolen this way.
The thieves had access to the office building but not the server room. They realized the server room shared a wall with a room that they did have access to, so they just used a sawzall to make an additional entrance.
My across-the-street neighbor had some expensive bikes stolen this way. The thieves just cut a hole in the side of their garage from the alley; the security cameras were facing the driveway, with nothing covering the alley side. We (the neighborhood) think they were targeted specifically for the bikes, as nothing else was stolen and your average crackhead isn't going to make that level of effort.
I assume they needed their own air supply because the automatic poison gas system was activating. Then they had to dodge lasers to get to the one button that would stop the nuclear missile launch.
Add a bunch of other pointless sci-fi and evil villain lair tropes in as well...
Most datacenters are fairly boring to be honest. The most exciting thing likely to happen is some sheet metal ripping your hand open because you didn't wear gloves.
Still have my "my other datacenter is made of razorblades and hate" sticker. \o/
One year I had a summer job in a hospital's data center, and an electrician managed to trigger the halon system; we all had to evacuate and wait for the process to finish and the gas to vent. The four firetrucks and the station master who showed up were both annoyed and relieved it was not real.
Not sure if you’re joking but a relatively small datacenter I’m familiar with has reduced oxygen in it to prevent fires. If you were to break in unannounced you would faint or maybe worse (?).
Not quite - while you can reduce oxygen levels, they have to be kept within 4 percentage points of normal, so at worst it will make you light-headed. Many athletes train at the same levels, though, so it’s easy to overcome.
That'd make for a decent heist comedy - a bunch of former professional athletes get hired to break in to a low-oxygen data center, but the plan goes wrong and they have to use their sports skills in improbable ways to pull it off.
Halon was used back in the day for fire suppression but I thought it was only dangerous at high enough concentrations to suffocate you by displacing oxygen.
Not an active datacenter, but I did get to use a fire extinguisher to knock out a metal-mesh-reinforced window in a secure building once because no one knew where the keys were for an important room.
Management was not happy, but I didn’t get in trouble for it. And yes, it was awesome. Surprisingly easy, especially since the fire extinguisher was literally right next to it.
Nothing says ‘go ahead, destroy that shit’ like money going up in smoke if you don’t.
P.S. Don’t park in front of fire hydrants, because firefighters will have a shit-eating grin on their faces when they destroy your car - ahem - clear the obstacle - when they need to use the hydrant to stop a fire.
Not to speak for the other poster, but yes, they had people experiencing difficulties getting into the data centers to fix the problems.
I remember seeing a meme for a cover of "Meta Data Center Simulator 2021" where hands were holding an angle grinder with rows of server racks in the background.
"Meta Data Center Simulator 2021: As Real As It Gets (TM)"
Yes, for some insane reason Facebook had EVERYTHING on a single network. The door access not working when you lose BGP routes is especially bad, because normal door access systems cache access rules on the local door controllers and thus still work when they lose connectivity to the central server.
Depends. Some have a paranoid mode without caching, because then a physical attacker cannot snip a cable and then use a stolen keycard as easily, or something. We had an audit force us to disable caching, which promptly went south during a power outage two months later, when the electricians couldn't get into the switch room anymore. The door was easy to overcome, however; just a little fiddling with a credit card, no heroic hydraulic press story ;)
If you aren't going to cache locally, then you need redundant access to the server, like LTE, and a plan for unlocking the doors if you lose access to the server.
This sounds similar to AWS services depending on DynamoDB, which sounds like what happened here. Even if parts of AWS depend on Dynamo under the hood, it should be a walled-off instance, separate from the Dynamo that customers access via us-east-1.
I was there at the time, for anyone outside of the core networking teams it was functionally a snow day. I had my manager's phone number, and basically established that everyone was in the same boat and went to the park.
Core services teams had backup communication systems in place prior to that though. IIRC it was a private IRC on separate infra specifically for that type of scenario.
I remember working for a company that insisted all teams use whatever the corp instant messaging/chat app was, but our sysadmin+network team maintained a Jabber server plus a bunch of core documentation synchronized to a VPS on totally different infrastructure, just in case. Sure enough, there was a day it came in handy.
Ah, but have they verified how far down the turtles go, and has that changed since they verified it?
In the mid-2000s most of the conference call traffic started leaving copper T1s and going onto fiber and/or SIP switches managed by Level3, Global Crossing, Qwest, etc. Those companies combined over time into CenturyLink, which was then rebranded Lumen.
That's similar to the total outage of all Rogers services in Canada back on July 7th 2022. It was compounded by the fact that the outage took out all Rogers cell phone service, making it impossible for Rogers employees to communicate with each other during the outage. A unified network means a unified failure mode.
Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!
That's kind of a weird ops story, since SRE 101 for oncall is to not rely on the system you're oncall for to resolve outages in it. This means if you're oncall for communications of some kind, you must have some other independent means of reaching each other (even if it's a competitor's phone network).
That is heavily contingent on the assumption that the dependencies between services are well documented and understood by the people building the systems.
Rogers is perhaps best described as a confederacy of independent acquisitions. In working with their sales team, I have had to tell them where their facilities are, as the sales engineers don't always know about all of the assets that Rogers owns.
There's also the insistence that Rogers employees should use Rogers services. Paying for every Rogers employee to have a Bell cell phone would not sit well with their executives.
Incorrect risk assessments of the changes being made to the router configuration also contributed to the outage.
There is always that point you reach where someone has to get on a plane with their hardware token and fly to another data centre to reset the thing that maintains the thing that gives keys to the thing that makes the whole world go round.
Not sure if this fully counts as 'distributed' here, but we (Authentik Security) help many companies self-host authentik multi-region, or across private cloud and on-prem, to allow for quick IAM failover and more reliability than IAMaaS.
There's also "identity orchestration" tools like Strata that let you use multiple IdPs in multiple clouds, but then your new weakest link is the orchestration platform.
Disclosure: I work for FusionAuth, a competitor of Authentik.
Curious: is your solution active-active or active-passive? We've implemented multi-region active-passive CIAM/IAM in our hosted solution[0]. We've found that meets the needs of many of our clients.
I'm only aware of one CIAM solution that seems to have active-active: Ory. And even then I think they shard the user data[1].
Ory’s setup is indeed true multi-region active-active; not just sharded or active-passive failover.
Each region runs a full stack capable of handling both read and write operations, with global data consistency and locality guarantees.
We’ll soon publish a case study with a customer that uses this setup that goes deeper into how Ory handles multi-region deployments in production (latency, data residency, and HA patterns). It’ll include some of the technical details missing from that earlier blog post you linked.
Keep an eye out!
Wow, you really *have* to exercise the region failover to know if it works, eh? And that confidence gets weaker the longer it’s been since the last failover I imagine too. Thanks for sharing what you learned.
You should assume it will not work unless you test it regularly. That's a big part of why having active/active multi-region is attractive, even though it's much more complex.
Sure it was, you just needed to log in to the console via a different regional endpoint. No problems accessing systems from ap-southeast-2 for us during this entire event; we just couldn’t access the management planes that are hosted exclusively in us-east-1.
It's actually a good reminder that if you don't test the failover process, you have no failover process. The CTO or VP of Engineering should be held accountable for not making sure that the failover process is tested multiple times a month and is seamless.
Too much armor makes you immobile. Will your security org be held to task for this? This should permanently slow down all of their future initiatives because it’s clear they have been running “faster than possible” for some time.
Totally ridiculous that AWS wouldn't by default make it multi-region and warn you heavily that your multi-region service is tied to a single region for identity.
I always find it interesting how many large enterprises have all these DR guidelines but fail to ever test. Glad to hear that everything came back alright
People will continue to purchase Multi-AZ and multi-region even though you have proved what a scam it is. If the east region goes down, ALL of Amazon goes down; feel free to change my mind. STOP paying double rates for multi-region.
DHH wrote a blog post complaining about how there are fewer "native Brits" in London, and then linked to Wikipedia's article about the number of white people in London. He also brought up a march by Tommy Robinson, but framed it as just a couple of exceedingly normal guys out for a walk, and not a bunch of nationalists.
It came off as xenophobic and racist, so sponsors pulled funding while others (some quite high profile) refuse to work with DHH. There's a non-zero amount of reading between the lines, so here's the blog post so everyone can decide for themselves:
I have been a Linux desktop user for 20+ years. It is incredible how far it has come. There is nothing Microsoft can do that will drive the normies away though. Microsoft knows this and that is why we are where we are.
I have been running Linux since 2011, and so much more stuff is in the “Just Works” category, especially if you have AMD graphics. When I installed NixOS on my Thinkpad about a year ago, it was almost comical how easy it was for me; I had gotten used to having to waste an entire day messing with drivers and fixing issues in 2012-2015, so it felt kind of weird for stuff to work as expected immediately.
I am trying very hard to get my parents to use something like Linux Mint because the Windows 11 auto-update on my mom’s computer actually prevented it from booting (making me waste an entire day remotely having them flash a live USB so I could rsync over her files to me…thanks MS!), so this might be enough of a final straw for them.
I have tried switching family members over after malware incidents. The most success was setting my 80 year old grandmother up with Lubuntu. She had no issue picking it up. I don’t think she even really noticed vs Windows. Lasted a few years until she went to an iPad for accessibility reasons.
Very interesting. I found myself nodding YES the whole way through the post. Something like this could lead to a large shift in how we manage infrastructure. We split terraform configs for more reasons than just splitting state of course, but something like this could make other approaches to organizing things more viable. Really cool and will be keeping an eye on this.
The market seems bad right now. Companies are offshoring everything they can and squeezing both sides.
At my company, we only hire in India now and the executives are intentionally causing "attrition" in the US by running people into the ground with demands that amount to 996 style work.
That is unfortunate. Not because I think they should have to, but because they eventually will have to if it gets big enough. Never underestimate the ability of your users to hold it wrong.
The default install only binds to loopback, so I am sure it is pretty common to just slap OLLAMA_HOST=0.0.0.0 and move on to other things. I know I did at first, but my host isn't publicly routable and I went back the same night and added IPAddressDeny/Allow rules (among other standard/easy hardening).
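In case it saves someone a search, this is roughly the shape of the drop-in I mean. The unit name, override path, and LAN range are just examples for a typical Linux install; adjust them for your setup.

```ini
# /etc/systemd/system/ollama.service.d/override.conf  (example path)
[Service]
# Listen on all interfaces so other machines on the LAN can reach it...
Environment="OLLAMA_HOST=0.0.0.0"
# ...but have systemd filter which peers are actually allowed to connect.
IPAddressDeny=any
IPAddressAllow=localhost 192.168.0.0/16
```

Then `systemctl daemon-reload` and restart the service. IPAddressDeny/IPAddressAllow are enforced per-unit by systemd, so they hold regardless of whether the service itself has any auth.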
I love this website. It is a throwback to the old internet I grew up with. It has it all. Packed with esoteric information gathered and curated by a passionate group. Designed for desktop only with its own unique aesthetic. Not covered with ads and cookie banners and newsletter popups. I remember spending many evenings exploring such things at 33.6kbps.
On topic: I have watched every episode of TNG more than once and never noticed this. How embarrassing!
I only noticed it because I hang out with A/V people. It was nice to find this page and have it not just confirmed but fully detailed. Babylon 5 does this a lot too.
Just one more thing to worry about I guess…