Major data center power failure (again): Cloudflare Code Orange tested (cloudflare.com)
214 points by gmemstr on April 8, 2024 | 46 comments



As always, it's really impressive to see how much technical detail they release publicly in their RCAs. It sets a good example for the industry.

Also — quite impressive to make major infrastructure and architecture changes in a few months. Not every organization can pull that off.


> quite impressive to make major infrastructure and architecture changes in a few months

And have it work first time round.


I feel there’s a sweet spot where if you do it too quickly, it’s a bad sign. And if it takes years, risk just keeps going up and up until it becomes basically impossible to do smoothly.


The blog post contains no details at all about how they achieved high availability. I'm disappointed.


There is not much that is impressive here. Their architecture seems to rely on a couple of datacenters, and on whether somebody has flipped the right switches in the right places. This means you will continue to hear about regular outages and maybe see nice postmortem blogs. I don't see a fundamental approach to reliability based on concepts like blast radius, expecting anything to fail at any time, and bulkhead patterns.

Here is a free hint: by talking so much about where their data centers are located, in my view they have already failed item one on my checklist. Principle number one of physical security is that you don't say where your data centers are, except of course to a very restricted need-to-know group.

Since predictions have no value unless they are made prior to events, I predict the next outage will involve some common core component, with some on/off type of configuration, that "could not be foreseen". Then on to the next blog... on to the next outage...


Two major outages less than half a year apart, but with wildly different outcomes. It definitely shows their engineering effort was targeted at the correct outcomes.

Would definitely be interested to see the detailed RCA on the power side of things. Not many people really think about Layer 0 of the stack.


Cloudflare is in EdgeConneX Portland; you can try poking around, but I haven't seen an RCA of what happened. Details are usually only shared with direct customers because it's bad for the brand.

https://www.edgeconnex.com/wp-content/uploads/2018/10/ECX-22...


They called out Flexential in the post?


Yes


Someone set up the breakers incorrectly way back when, and they were never adjusted. I'll bet it's not possible to adjust those without powering off the downstream equipment.

It reminds me of the Amazon guy discovering that there was no way to fail back power without an outage, then them going off and building their own equipment.


> there was no way to fail back power without an outage, then them going off and building their own equipment.

Anywhere I can read about that?



Maybe [1], which is about per-rack UPSes to reduce the blast radius of UPS failures?

Pretty sensible IMHO - I live in a country with a reliable electricity grid, and outages due to UPS malfunction are about as common as power outages.

[1] https://www.datacenterdynamics.com/en/news/aws-develops-its-...


In large data centers, rack-level UPSes are impractical for many reasons, like cost and efficiency, but the big problem is that modern power densities are so high that you want rows to fail if cooling isn't available.

It doesn't take long without cooling to cook equipment to the point of failure or reduced reliability.

7 to 16kW per rack is common even in these older colo facilities.

And no amount of UPS capacity would have made up for not having enough replacement breakers on site.


> It doesn't take long without cooling to cook equipment to the point of failure or reduced reliability.

IIRC, it's about 30 minutes, irrespective of room size.

This is because while smaller rooms have less air volume as a 'buffer', they also can't fit much gear; and while huge data centres have huge volumes of air, they also have lots of gear.

So there's a coincidental proportional relationship whereby (as a rule of thumb) you have about 30 minutes before things cook, regardless of room size. (At least for 'general purpose' computing: GPUs are probably another matter.)
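
To make that concrete, here's a toy back-of-envelope sketch (every number is an assumption for illustration, not a measurement of any real facility): the time to reach a given temperature rise depends only on the per-rack ratio of thermal buffer to heat load, so the rack count cancels out.

    # Toy model: once cooling stops, heat-up time depends only on per-rack ratios.
    # All figures are assumptions for illustration, not measured values.
    def minutes_until_delta_t(racks, kw_per_rack, buffer_kj_per_rack_per_k, delta_t_k):
        """Minutes for the room to rise by delta_t_k kelvin after cooling is lost.

        buffer_kj_per_rack_per_k lumps together whatever thermal mass each rack
        'owns' (its share of room air, the gear itself, some structure) into one
        rough per-rack number.
        """
        total_buffer_kj = racks * buffer_kj_per_rack_per_k * delta_t_k
        total_heat_kw = racks * kw_per_rack
        return (total_buffer_kj / total_heat_kw) / 60  # kJ / kW = seconds

    # Same per-rack assumptions, very different room sizes:
    small = minutes_until_delta_t(10, 10, 900, 20)    # 10 racks
    large = minutes_until_delta_t(1000, 10, 900, 20)  # 1000 racks
    print(small, large)  # 30.0 30.0 -- the rack count cancels out

Whether the real number is 30 minutes or 5 depends entirely on how much thermal mass you credit each rack with, but the room-size independence falls out either way.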


Google has been using rack-level UPS for many years, so I'm curious to hear why you think that is impractical specifically in large datacenters.

https://cloud.google.com/blog/products/compute/google-joins-....


Probably because Google designs the entire hardware (it's not a Dell rack server) and the data center together. They are also responsible for the entire workload if something goes wrong, and they likely also have data-center-level redundancy for the entire workload in case of a total power loss.

In third-party data centers you're co-located with other customers running mostly commodity server hardware that won't have those options. The facilities generally (at least in my experience) provide UPS power as part of the service, so you have it already, and something like a fire, if it did happen, would impact lots of other customers who likely don't have redundancy between data centers or fire compartments.

Additionally, depending on your exact setup (it's possible you have dark fibre directly out of the facility, but often not), you're likely to be uplinked through powered equipment at the same facility, which wouldn't have the extra UPS power either.

Plus, as others mentioned, the loss of cooling is the biggest problem. In fairness, maybe that would buy you a few minutes for the kind of automatic switchover they talked about. But it takes quite a bit of coordination to have your own power, to not rely on anyone else being powered, and then to be able to coordinate your shutdown before the room gets too hot - and not be at risk of overheating the room for others.

Plus there are safety issues like emergency power-off for firefighters etc., which again I'm sure Google could deal with at a facility-wide level, but if you're a small fish in a much bigger data center it's harder to coordinate that kind of thing.

So while I'm sure it's all possible, there are obvious headwinds in many directions.


Your typical company, even if it has hundreds of racks, usually isn't operating at the scale, or with the workloads, that Google is, where you can unplug racks and not really notice it.


But isn't the UPS only expected to last for 15 minutes or so, to give the backup generators time to start up? Or to perform a fast-but-graceful migration when the generator doesn't start up?

I thought most DCs just pause the cooling until the generator comes up, rather than running the cooling on battery power?


> But isn't the UPS only expected to last for 15 minutes or so, to give the backup generators time to start up?

They are expected to last that long, but if the batteries are on year 4 of their 5-year life, that may not happen. What also may not happen is the generator starting up.

Or the automatic transfer switch (ATS) not working properly: it should be on either input Feed A or Feed B, but when it tries to throw itself over (making a loud kah-chunk sound), it gets stuck in between—this happened to us once.

"The perversity of the universe always tends toward a maximum." — Finagle's Law of Dynamic Negatives


I know our DC does load tests on the batteries and generators. There are also 5 generators on site. I get all the emails about these tests since they are marked low risk.

Fun story: a small IT company had a natural gas generator installed at the new office they built out. The power went out for the first time: lights go out, generator kicks on, lights turn on, lights turn off. Long story short, the electrician didn't have the right breaker size in the truck and used a smaller one. The building was empty when it was tested, which is why bigger DCs have a load generator to test at load.


Per-rack UPS is not allowed in our local Flexential DC. The reason is that in case of a fire, the whole room needs to go dark for fire control. We do have two redundant AC circuits on two different breakers. But our DC came from a company that got bought out by Flexential, so maybe, hopefully, it's set up differently.


It’s a good idea not to have per-rack UPS in a colo. Customers are really bad at maintaining equipment. Making sure customers replace batteries every five years might be difficult.

Also, in my opinion, all UPS manufacturers make horrible UPS equipment. There can be so many different types of failures, problems, or glitches. You don’t want a UPS leaking or exploding. Also, there are laws in some states that only allow so many pounds of batteries in a building.


I can confirm that the vast majority of UPSes are garbage. I happily pay more to source Eaton whenever I can.


> Pretty sensible IMHO - I live in a country with a reliable electricity grid, and outages due to UPS malfunction are about as common as power outages.

Hence modern servers and their associated networking equipment have dual power supplies, which can be connected to two separate UPS systems. It would be very unlikely for both to fail at once. In a less important home/small-business scenario, typically one supply is connected to the UPS and the other to the wall via a surge protector.


You normally reduce the blast radius of UPS failures by having two supplies for each server. So I don't think that's it.
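
As a toy illustration (the probabilities below are made-up placeholders, not figures from any incident report): with two genuinely independent feeds, losing both at once is vanishingly unlikely, but a shared upstream fault, like the four switchboards in this incident, dominates the math.

    p_feed = 1e-4    # assumed chance one feed fails in a given hour (placeholder)
    p_common = 1e-5  # assumed chance of a shared upstream fault, e.g. a switchboard

    p_both_independent = p_feed ** 2              # 1e-08: dual supplies work well
    p_both_with_common = p_common + p_feed ** 2   # ~1e-05: the shared fault dominates

    print(p_both_independent, p_both_with_common)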

The link someone else posted about overriding faults might be it, if the OP misremembered the problem.


Well in principle, sure.

And yet here we are, reading an article about "a total loss of power [...] following a reportedly simultaneous failure of four [...] switchboards serving all of Cloudflare’s cages. This meant both primary and redundant power paths were deactivated across the entire environment."


I'm not sure what you mean. Wouldn't we have the same level of outage with per-rack UPSes?


I would guess the devices in question are some sort of electric trip unit, along these lines:

https://www.eaton.com/content/dam/eaton/products/electrical-...

I don’t see anything in the manual about needing to power it down for reconfiguration. All the relevant buttons, set screws, etc should be accessible without removing any dead fronts, so they can be safely accessed while the system is live.


> "When one or more of these breakers tripped, a cascading failure of the remaining active CSB boards resulted, thus causing a total loss of power serving Cloudflare’s cage and others on the shared infrastructure."

Background note to HN readers:

Almost zero SaaS providers (or even CDNs) using the term "our datacenter" or showing their datacenters on maps etc. have their own datacenters. It's universal and normal. In general they have a server, a rack, or a cage in shared space, subject to others' policies and practices, and to their neighbors.

This can adjust your mental model for accountability and your designs for resilience. You can even exploit this by colo-ing at the same addresses to get LAN latencies to your SaaS provider, CDN, or (sometimes) even cloud provider.


A single k8s cluster spanning multiple datacenters feels mind-boggling to me. I know it's not exactly uncommon for HA, even if you just have a little one in your cloud provider of choice, but I'm sure it's a totally different beast from the toy ones I've created.


At its core, Kubernetes is mostly an HTTP API used to sync state between nodes. I see no reason that part shouldn't work even across the world, albeit at the cost of slower state syncing (e.g. pods taking longer to be created after a scale-up).

That HTTP API is backed by etcd, which uses Raft, and that is where running over a large area is more likely to cause problems. One approach would be to keep the etcd instances in one region (and probably also co-locate the scheduler and controller manager there) while having far-flung worker nodes. This creates a risk of losing the control plane, but in most cases services would keep running (though they would be unable to react to any further issues until the control plane recovers).

You would also want to carefully design your workloads with topology constraints and region-specific services to avoid high application-layer latencies, though.

Overall it's a fun thought experiment. In practical terms I think cross-datacenter in a small geographic area would work fine but I probably wouldn't want to run a single worldwide cluster, both for the reasons above and for other scaling reasons.
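
A minimal sketch of why the stretched-etcd part is the painful bit (the RTTs are made-up placeholders, and the model ignores processing time and etcd's own timeouts): a Raft leader needs acks from a majority before a write commits, so commit latency is roughly the RTT to the slowest follower in the fastest-responding majority.

    def raft_commit_latency_ms(follower_rtts_ms):
        """Commit needs a majority of (leader + followers); the leader's own
        'ack' is free, so we only wait for the k fastest followers."""
        cluster_size = len(follower_rtts_ms) + 1
        majority = cluster_size // 2 + 1
        followers_needed = majority - 1
        return sorted(follower_rtts_ms)[followers_needed - 1]

    print(raft_commit_latency_ms([1, 2]))     # 3 members in one metro: ~1 ms per write
    print(raft_commit_latency_ms([80, 150]))  # stretched across regions: ~80 ms per write

Keeping etcd (and the rest of the control plane) in one region, as suggested above, keeps that number small, at the cost of losing the ability to make changes if that region goes down.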


You could implement geographic taints and tolerations and constrain certain workloads to certain regions. Lots of places span clusters across several AZs and have the same problem in theory. However, I don’t personally like it, because you’re engineering a cross-datacenter failure domain, which is usually a bad design decision unless done for specific reasons. And as for the control plane, for it to reliably span multiple locations while also tolerating failures, you now have to deal with all kinds of weird split-brain scenarios, so most just run multiple clusters instead of rewriting k8s for this kind of design.


It’s no different from running EKS in a multi-AZ configuration, except for the cross-AZ egress fees…


I guess if you have really fast fibre connections between them then the separation isn’t all that separate even with distance


Single Point of Failure

Is PDX still a single point of failure for Cloudflare services?

It was 5 months ago [0], and if I understand the post, it sounds like it still is.

If anyone knows, I'd be curious to hear.

[0] https://news.ycombinator.com/item?id=38113503


It’s complicated, and it’s not supposed to be. PDX is where the config plane for the edge network and services is stored. Some of this information is transported to the edge via QuickSilver [1] (e.g. auth tokens). This information is replicated to Europe, and failover is possible. The challenge with the previous outage was a combination of things (as it always is) whereby certain services had a dependency that relied on PDX as a single point of failure (if I recall correctly). That underpinned enough services that a good chunk of Cloudflare’s config plane went down. Additionally, the cutover to Europe didn’t go smoothly because it was instantaneous instead of gradual, which resulted in traffic amplification as retries for previous requests & current requests were shuttled into the online data center, resulting in a thundering herd.

What this blog post talks about is how this time nothing went down (or at least it cut over within minutes, presumably due to automated systems noticing & acting), with the exception of analytics data, which does have a single dependency, and that’s determined to be “ok” (i.e. it’s an acceptable failure mode for that product & that product only).

There are additional failure-handling complications because PDX (& Europe) is composed of several independent data centers, which aren’t supposed to all fail simultaneously. What’s pretty clear from the implied “we’re not happy either” is that Cloudflare isn’t pleased with their vendor’s separately located data centers still having correlated failures.

[1] https://blog.cloudflare.com/introducing-quicksilver-configur...


BTW, Portland is a bad place to have single-point-of-failure data centers. Portland is vulnerable to a Cascadia subduction zone quake. The probability is 37% over the next 50 years, which is low in any given year, but the risk needs to be considered. It would knock out all the local data centers, and power may not be restored for months.


> Cloudflare isn’t pleased with their vendor’s separately located data centers still having correlated failures.

I hear you, but …

… what’s now core fabric of the internet (Cloudflare) shouldn’t be susceptible to a single vendor’s data center going down knocking out their service.

It creates a situation where an attacker now knows all they need to do is target that single data center (PDX), cloudflare will go down, and now they can attack any website previously protected by cloudflare.

I’m not trying to be a hater.

I’m just trying to call out that it’s a bit unfair to solely point the finger at this data center provider when, architecturally, this issue shouldn’t exist in the first place.


Just to be clear - only the config plane was affected (e.g. making changes to how products work at the edge, creating new tokens for some services, etc.). The edge data plane itself is immune to this kind of failure (i.e. no websites protected by Cloudflare went down as a result of this).


>cloudflare will go down, and now they can attack any website previously protected by cloudflare.

Neither this outage nor the previous one broke Cloudflare's protection.


From what I understand of what I read, it was a single point of failure for one product instead of 15


...and they were already working on fixing that one.


I can very definitely empathise with the experience of having worked hard at fixing the issues underpinning high priority incidents, then noticing that what previously would have taken hours to fix is now only visible as a blip on a graph.


Interesting, I didn't even hear about that second outage


Same. My employer was highly impacted by the November 2nd outage but this latest one didn't appear to have affected us at all.



