> Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency. As a result, certain Cloudflare products that rely on KV service to store and disseminate information are unavailable [...]
Surprising, but not entirely implausible for a GCP outage to spread to CF.
Probably unintentional. "We just read this config from this URL at startup" can easily snowball into "if that URL is unavailable, this service will go down globally, and all running instances will fail to restart when the devops team tries to do a pre-emptive rollback".
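To make that failure mode concrete, here is a minimal sketch of one common mitigation: keep a last-known-good copy of the config on local disk and fall back to it when the URL is unreachable. This is not how Cloudflare does it, and the names (`load_config`, `fetch_remote_config`, the cache path) are hypothetical and only there for illustration.

```python
import json
import os
import tempfile
import urllib.request


def fetch_remote_config(url: str, timeout: float = 2.0) -> dict:
    """Fetch and parse a JSON config document from a URL (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def load_config(url: str, cache_path: str, fetch=fetch_remote_config) -> dict:
    """Load config at startup without making the remote URL a hard dependency.

    On success the config is cached locally. On failure the last cached copy
    is used, so a restart during an upstream outage can still come up.
    """
    try:
        config = fetch(url)
    except (OSError, ValueError) as exc:
        # urllib's URLError and socket timeouts are OSError subclasses;
        # malformed JSON raises ValueError (json.JSONDecodeError).
        try:
            with open(cache_path, encoding="utf-8") as f:
                return json.load(f)
        except (OSError, ValueError):
            # No usable cache either: fail loudly instead of guessing.
            raise RuntimeError("config unavailable and no cached copy") from exc

    # Write the cache atomically so a crash mid-write cannot corrupt it.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(config, f)
    os.replace(tmp_path, cache_path)
    return config
```

The trade-off is that a restarted instance may run on stale config for the duration of the outage, which is usually better than not starting at all, but it is a design decision rather than a free win.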
After reading about Cloudflare's infra in post mortems, it has always been surprising how immature their stack is. Like, they used to run their entire global control plane in a single failure domain.
I'm not sure who is running the show there, but the whole thing seems kinda shoddy given Cloudflare's position as the backbone of a large portion of the internet.
I personally work at a place with a smaller market cap than Cloudflare, and we were hit by the exact same incident (datacenter power went out) and had almost no downtime, whereas the entire Cloudflare API was down for nearly a day.
Nice job keeping your app up during the outage but I'm not sure you can say "the whole thing seems kinda shoddy" when they're handling the amount of traffic they are.
What's the alternative here? Do you want them to replicate their infrastructure across different cloud providers with automatic fail-over? That sounds -- heck -- I don't know if modern devops is really up to that. It would probably cause more problems than it would solve...
I was really surprised. Depending on another company's cloud services is risky in general, and pretty much everyone does it these days, but I didn't expect Cloudflare to.
AWS has Outpost racks that let you run AWS instances and services in your own datacenter managed like the ones running in AWS datacenters. Neat but incredibly expensive.
> What's the alternative here? Do you want them to replicate their infrastructure
Cloudflare advertises themselves as _the_ redundancy / CDN provider. Don't ask me for an "alternative" but tell them to get their backend infra shit in order.
There are roughly 20-25 major IaaS providers in the world that should have close to no dependency on each other. I'm almost certain that Cloudflare believed that was their posture, and that the action items coming out of this post mortem will be to make sure that this is actually the case.