Hacker News | eastdakota's comments

That’s not accurate. As with any incident response, there were a number of theories of the cause we were working in parallel. The feature file failure was identified as a potential cause in the first 30 minutes. However, the theory that seemed the most plausible, based on what we were seeing (intermittent errors, initially concentrated in the UK, spiking for certain API endpoints) as well as what else we’d been dealing with (a botnet that had escalated DDoS attacks from 3Tbps to 30Tbps against us and others like Microsoft over the last 3 months), was that we were under attack. We worked multiple theories in parallel. After an hour we ruled out the DDoS theory. We had other theories also running in parallel, but at that point the dominant theory was that the feature file was somehow corrupt. One thing that made us initially question that theory was that nothing in our changelogs seemed like it would have caused the feature file to grow in size. It was only after the incident that we realized the database permissions change had caused it, but that was far from obvious.

Even after we identified the problem with the feature file, we did not have an automated process to roll the feature file back to a known-safe previous version. So we had to shut down the reissuance and manually insert a file into the queue. Figuring out how to do that took time and waking people up, as there are lots of security safeguards in place to prevent an individual from easily doing that. We also needed to double check we wouldn’t make things worse. The propagation then takes some time, especially because there are tiers of caching of the file that we had to clear. Finally, we chose to restart the FL2 processes on all the machines that make up our fleet to ensure they all loaded the corrected file as quickly as possible. That’s a lot of processes on a lot of machines.

So I think the best description is that it took us an hour for the team to coalesce on the feature file being the cause and then another two to get the fix rolled out.
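(Illustrative aside: the comment above describes the missing piece as an automated way to fall back to a known-safe previous version of the feature file. Below is a minimal, hypothetical Rust sketch of that idea. It is not Cloudflare's FL2 code; the FeatureFile type, the validation, and the file paths are invented for illustration.)

    use std::fs;
    use std::path::Path;

    // Hypothetical stand-in for the real feature file structure.
    struct FeatureFile {
        features: Vec<String>,
    }

    // Parse and sanity-check a candidate file; any error means rejection.
    fn parse_and_validate(raw: &str) -> Result<FeatureFile, String> {
        let features: Vec<String> = raw.lines().map(|l| l.to_string()).collect();
        if features.is_empty() {
            return Err("empty feature file".into());
        }
        Ok(FeatureFile { features })
    }

    // Load the freshly published file, falling back to the last-known-good copy
    // instead of failing when the new one doesn't validate.
    fn load_with_fallback(new: &Path, known_good: &Path) -> Result<FeatureFile, String> {
        let raw = fs::read_to_string(new).map_err(|e| e.to_string())?;
        match parse_and_validate(&raw) {
            Ok(file) => {
                // Promote the new file to "last known good" for the next cycle.
                fs::copy(new, known_good).map_err(|e| e.to_string())?;
                Ok(file)
            }
            Err(err) => {
                eprintln!("new feature file rejected ({err}); falling back");
                let raw = fs::read_to_string(known_good).map_err(|e| e.to_string())?;
                parse_and_validate(&raw)
            }
        }
    }

    fn main() {
        match load_with_fallback(Path::new("feature-file.new"), Path::new("feature-file.good")) {
            Ok(f) => println!("loaded {} features", f.features.len()),
            Err(e) => eprintln!("no usable feature file: {e}"),
        }
    }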

Thank you for the clarification and insight; with that context it does make more sense to me. Is there anything you think can be done to improve the ability to identify issues like this more quickly in the future?

Any "limits" on system should be alerted... like at 70% or 80% threshold.. it might be worth it for a SRE to revisit the system limits and ensuring threshold based alerting around it..

There’s lots of things we did while we were trying to track down and debug the root cause that didn’t make it into the post. Sorry the WARP takedown impacted you. As I said in a comment above, it was the result of us (wrongly) believing that this was an attack targeting WARP endpoints in our UK data centers. That turned out to be wrong but based on where errors initially spiked it was a reasonable hypothesis we wanted to rule out.

Thanks!

* published less than 12 hours from when the incident began. Proud of the team for pulling together everything so quickly and clearly.

That's all well & good, but I'm curious...

> Spent some time after we got things under control talking to customers. Then went home.

What did sama / Fidji say? ;) Turnstile couldn't have been worth that.


Next time open your dev console in your window and look at how much is going on in the background.

Well… we have a culture of transparency we take seriously. I spent 3 years in law school that many times over my career has seemed like a waste, but days like today prove useful. I was in the triage video bridge call nearly the whole time. Spent some time after we got things under control talking to customers. Then went home. I’m currently in Lisbon at our EUHQ. I texted John Graham-Cumming, our former CTO and current Board member whose clarity of writing I’ve always admired. He came over. Brought his son (“to show that work isn’t always fun”). Our Chief Legal Officer (Doug) happened to be in town. He came over too.

The team had put together a technical doc with all the details. A tick-tock of what had happened and when. I locked myself on a balcony and started writing the intro and conclusion in my trusty BBEdit text editor. John started working on the technical middle. Doug provided edits here and there on places we weren’t clear. At some point John ordered sushi, but from a place with limited delivery options, and I’m allergic to shellfish, so I ordered a burrito. The team continued to flesh out what happened. As we’d write we’d discover questions: how could a database permission change impact query results? Why were we making a permission change in the first place? We asked in the Google Doc. Answers came back.

A few hours ago we declared it done. I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. I sent a draft to Michelle, who’s in SF. The technical teams gave it a once over. Our social media team staged it to our blog. I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did. That was the process.


> I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did

Damn, corporate karma farming is ruthless: only a couple-minute SLA before taking ownership of the karma. I guess I'm not built for this big-business SLA.


We're in a Live Fast Die Young karma world. If you can't get a TikTok ready within 2 minutes of the post-mortem drop, you might as well quit and become a barista instead.

> I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate.

I'm so jealous. I've written postmortems for major incidents at a previous job: a few hours to write, then a week of bikeshedding by marketing and communications and tech writers and ... over any single detail in my writing. Sanitizing (hide a part), simplifying (our customers are too dumb to understand), etc., so that the final writing was "true" in the sense that it "was not false", but definitely not what I would call "true and accurate" as an engineer.


You call this transparency, but fail to answer the most important questions: what was in the burrito? Was it good? Would you recommend?


Chicken burrito from Coyo Taco in Lisbon. I am not proud of this. It’s worse than ordering from Chipotle. But there are no Chipotles in Lisbon… yet.


There are a lot of good food places in Lisbon that you might not be familiar with yet. Enjoy your stay.

I DON'T see this as transparency. There is ZERO mention of the burrito in the post-mortem document itself.

0/10, get it right the first time, folks. (/s)


A very human and authentic response. Love to see it.

Fantastic for recruiting, too.


> He didn’t reply after a few minutes so I did

I'd consider applying based on this alone


Appreciate the extra transparency on the process.


A postmortem postmortem, I love it. Transparency to the power of 2.

I really appreciate this level of transparency. Thank you for being a good person in such a powerful position in the world.

I'm not sure I've ever read something from someone so high up in a company that gave me such a strong feeling of "I'd like to work for these people". If job posts could be this informal and open-ended, this post could serve as one, in the form of a personality-fit litmus test.

How do you guys handle redaction? I'm sure even when trusted individuals are in charge of authoring, there's still the potential for accidental leakage, which would probably be best mitigated by a team specifically looking for any slip-ups.

Thanks for the insight.


The team has a good sense, typically. In this case, the names of the columns in the Bot Management feature table seemed sensitive. The person who included that in the master document we were working from added a comment: “Should redact column names.” John and I usually catch anything the rest of the team may have missed. For me, it pays to have gone to law school, but it also pays to have studied Computer Science in college and to be technical enough to still understand both the SQL and Rust code here.

Could you elaborate a bit on how going to law school helped? Was it because it made it easier for you to communicate and align with your CLO?

Probably because he could check the legalities of a release himself without counsel. It's probably equivalent to educating yourself on your rights and the law, so that if you get pulled over by a cop who may try to do things that you can legally refuse, you can say no.

that's very cool, thanks

Can attest: not a single LLM used. Couldn’t if I tried. Old school. And not entirely proud of that.


Based CEO


That’s correct.


Is it actually consul-template? (I have post-consul-template stress disorder).


I'd love to hear any commentary on Consul if anyone else has it.


I think Consul is great, for what it's worth; we were just abusing it.

https://fly.io/blog/a-foolish-consistency/

https://fly.io/blog/corrosion/


Did you know: PCTSD affects more than 2 in 5 engineers.


Because we initially thought it was an attack. And then, when we figured it out, we didn’t have a way to insert a good file into the queue. And then we needed to reboot processes on (a lot of) machines worldwide to get them to flush their bad files.


Thanks for the explanation! This definitely reminds me of the CrowdStrike outages last year:

- A product depends on frequent configuration updates to defend against attackers.

- A bad data file is pushed into production.

- The system is unable to easily/automatically recover from bad data files.

(The CrowdStrike outages were quite a bit worse though, since they took down entire computers and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours.)


It might remind you of CrowdStrike because of the scale.

Outages are in a large majority of cases caused by change, either deployments of new versions or configuration changes.


Zone your deployments first: blue/green. Have a small blue zone and test it out. If it works, then expand to the green deployments.

A configuration file should not grow! That's a design failure; I want to understand it.
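A minimal sketch of the staged rollout idea from the comment above, with invented host names and stubbed-out push and health checks (not any real deployment tooling):

    // Hypothetical staged rollout: push the new file to a small canary slice of
    // the fleet, verify it loads cleanly there, and only then promote it fleet-wide.
    fn staged_rollout(
        fleet: &[&str],
        canary_fraction: f64,
        push: impl Fn(&str) -> Result<(), String>,
        healthy: impl Fn(&str) -> bool,
    ) -> Result<(), String> {
        let canary_count = ((fleet.len() as f64) * canary_fraction).ceil() as usize;
        let (canary, rest) = fleet.split_at(canary_count.min(fleet.len()));

        for host in canary {
            push(host)?;
            if !healthy(host) {
                return Err(format!("canary {host} unhealthy; halting rollout"));
            }
        }
        // Canary slice is healthy: expand to the rest of the fleet.
        for host in rest {
            push(host)?;
        }
        Ok(())
    }

    fn main() {
        let fleet = ["edge-1", "edge-2", "edge-3", "edge-4"];
        let result = staged_rollout(&fleet, 0.25, |h| { println!("pushing to {h}"); Ok(()) }, |_| true);
        println!("{result:?}");
    }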


Richard Cook #18 (and #10) strikes again!

https://how.complexsystems.fail/#18

It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like: are you tabletopping this scenario? Are teams building out runbooks for how to quickly resolve this? What's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly, and maybe even speculatively, restoring this part of the system to a known good state in an outage"?


This document by Dr. Cook remains _the standard_ for systems failure. Thank you for bringing it into the discussion.

Why was WARP in London disabled temporarily? That change wasn't discussed in the RCA despite being called out in an update.

For London customers, this temporarily made the impact more severe.


We incorrectly thought at the time it was attack traffic coming in via WARP into LHR. In reality, it was just that the failures started showing up there first because of how the bad file propagated and where in the world it was working hours.

Probably because it was the London team that was actively investigating the incident, and they initially came to the conclusion that it might be a DDoS while being unable to authenticate to their own systems.

Given the time of the outage that makes sense; they'd mostly be within their workday (if such a thing applies to us anymore).

Question from a casual bystander: why not have a virtual/staging mini node that receives these feature file changes first and catches errors, to veto a full production push?

Or do you have something like this, but the specific DB permission change in this context only failed in production?


I think the reasoning behind this is the nature of the file being pushed; from the post-mortem:

"This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly."


In this case, the file fails quickly. A pretest that consists of just attempting to load the file would have caught it. Minutes is more than enough time to perform such a check.
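As a sketch of that pretest idea, the check can be as small as loading the file the same way the consuming process would and vetoing the publish if that fails. The one-entry-per-line format and the file name below are assumptions, not the real schema or pipeline:

    // Hypothetical pre-publish gate: attempt to load the candidate file and
    // veto publication if the load fails.
    fn pretest_load(raw: &str) -> Result<usize, String> {
        if raw.trim().is_empty() {
            return Err("file is empty".into());
        }
        for (i, line) in raw.lines().enumerate() {
            if line.trim().is_empty() {
                return Err(format!("blank entry at line {}", i + 1));
            }
        }
        Ok(raw.lines().count())
    }

    fn main() {
        let raw = std::fs::read_to_string("candidate-feature-file").unwrap_or_default();
        match pretest_load(&raw) {
            Ok(n) => println!("loaded {n} entries cleanly; safe to publish"),
            Err(e) => eprintln!("vetoing publish: {e}"),
        }
    }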


Just asking out of curiosity, but roughly how many staff would've been involved in some way in sorting out the issue? Either outside regular hours or redirected from their planned work?


Is there some way to check the sanity of the configuration change, monitor it, and then revert to an earlier working configuration if things don't work out?

Yeah, I can imagine that this insertion was some high-pressure job.


The computer science equivalent of choosing between the red, green and blue wires when disarming a nuke with 15 seconds left on the clock

Is it though? Or is it an "oh, this is such a simple change that we really don't need to test it" attitude? I'm not saying this applies to TFA, but some people are so confident that no pressure is felt.

However, you forgot that the only lighting is the red glow of the klaxons, so you really can't differentiate the colors of the wires.


Thx for the explanation!

Side thought as we're working on 100% onchain systems (for digital assets security, different goals):

Public chains (e.g. EVMs) can act as a tamper-evident gate that only promotes a new config artifact if (a) a delay or multi-sig review has elapsed, and (b) a succinct proof shows the artifact satisfies safety invariants like ≤200 features, deduped, schema X, etc.

That could have blocked propagation of the oversized file long before it reached the edge :)
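Setting the on-chain part aside, the invariants listed there can be expressed as a plain predicate that any gate (CI job, canary, or otherwise) could run before promoting an artifact. A small sketch, with the schema check reduced to a stand-in and made-up example entries:

    use std::collections::HashSet;

    // Sketch of the listed safety invariants: at most 200 features, no
    // duplicates, and every entry matching an (illustrative) schema.
    fn satisfies_invariants(features: &[String]) -> Result<(), String> {
        const MAX_FEATURES: usize = 200;
        if features.len() > MAX_FEATURES {
            return Err(format!("{} features exceeds the limit of {MAX_FEATURES}", features.len()));
        }
        let unique: HashSet<&String> = features.iter().collect();
        if unique.len() != features.len() {
            return Err("duplicate feature entries".into());
        }
        // Stand-in schema check: every entry must be a non-empty identifier.
        if let Some(bad) = features.iter().find(|f| f.trim().is_empty()) {
            return Err(format!("entry {bad:?} violates the schema"));
        }
        Ok(())
    }

    fn main() {
        // Made-up example entries, not real feature names.
        let candidate = vec!["example_feature_a".to_string(), "example_feature_b".to_string()];
        match satisfies_invariants(&candidate) {
            Ok(()) => println!("all invariants hold; artifact can be promoted"),
            Err(e) => eprintln!("invariant violated: {e}"),
        }
    }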


We don’t know. Suspect it may just have been a big uptick in load and a failure of its underlying infrastructure to scale up.


The status page is hosted on AWS CloudFront, right? It sure looks like CloudFront was overwhelmed by the traffic spike, which is a bit concerning. Hope we'll see a post from their side.


CloudFront has quotas[0] and they likely just hit those quota limits. Requesting higher quotas requires a service ticket. If they have access logs enabled in CloudFront, they could see what the exact error was.

And since it seems this is hosted by Atlassian, this would be up to Atlassian.

[0] https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...


Yes, probably a bunch of automated bots decided to check the status page when they saw failures in production.


It looks a lot like a CloudFront error we randomly saw today from one of our engineers in South America. I suspect there was a small outage in AWS but can't prove it.


Probably a non-zero number of companies use CloudFront and other CDNs as a fallback for Cloudflare, or run a blended CDN, so it's not surprising to see other CDNs hit with a thundering herd when Cloudflare went down.


This situation reminds me of risk assessment, where you sometimes assume two rare events are independent, but later learn they are actually highly correlated.

That definitely is one not-wrong conclusion.

