
dps · on July 12, 2019

(Stripe CTO here)

Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.

In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however we lacked an efficient way to fully validate this change on the order of minutes. We're investing in building tooling to increase robustness in rapid response mechanisms and to help responding engineers understand the potential impact of configuration changes or other remediation efforts they're pushing through an accelerated process.

I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.
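As a rough illustration of the "rollout in stages" idea mentioned above, here is a minimal sketch. It is not Stripe's deploy system: the function name `staged_rollout`, the `deploy` and `healthy` callbacks, and the stage fractions are all hypothetical, and it shows only the staging logic, not blue/green switching.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Sequence[float],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to a growing share of hosts, halting at the first unhealthy stage.

    Returns the hosts that received the new version (hypothetical helper).
    """
    deployed: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be on the new version
        # once this stage completes (always at least one).
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(deployed):target]:
            deploy(host)
            deployed.append(host)
        # Check everything deployed so far before widening the rollout.
        if not all(healthy(h) for h in deployed):
            break
    return deployed


if __name__ == "__main__":
    fleet = [f"h{i}" for i in range(1, 11)]
    done = staged_rollout(fleet, (0.1, 0.5, 1.0), lambda h: None, lambda h: h != "h3")
    print(done)  # the rollout halts after the stage that included h3
```

A real system would also roll back the hosts already deployed when a stage fails; that step is left out here to keep the sketch short.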

ssalazars · on July 12, 2019

Thank you for taking the time to respond to my questions. I believe the high potential of causing a follow-up incident was left out of the post (or maybe I missed it?).

I hope that lessons are learned from this operational event, and that you invest in building metrics and tooling that allow you to, first of all, prevent issues, and second, shorten the outage/mitigation times in the future.

I'm happy you guys are being open about the issue, and taking feedback from people outside your company. I definitely applaud this.

tus88 · on July 12, 2019

> ship hundreds of deploys a week safely

That seems like a lot of change in a week, or does deploys mean something else like customer websites being deployed?

tschwimmer · on July 12, 2019

They very likely have continuous deployment. So each change could potentially be released as a separate deploy. If the changes include changes to the data model, they gotta run a migration. So hundreds seems reasonable to me.

nialldalton · on July 12, 2019

From the outside it sounds like, whatever the database is, it has far too many critical services tightly bound within it. E.g. leader election implemented internally instead of as a service with separate lifecycle management - pushing the database query processor minor version forward forcing me to move the leader election code or replica config handling forwards... ick.

From the description/comment it also sounds like the database operates directly on files rather than file leases as there's no notion of a separate local - cluster-scoped - byte-level replication layer below it. Harder to shoot a stateful node.. And sounds like it's tricky to externally cross-check various rates, i.e. monitor replication RPCs and notice that certain nodes are stepping away from the expected numbers without depending on the health of the nodes themselves.

Hopefully the database doesn't also mix geo-replication for local access requirements / sovereignty in among the same mechanisms too.. rather than separating out into some aggregation layers above purely cluster-scoped zones!

Of course, this is all far far easier said than done given the available open source building blocks. Fun problems while scaling like crazy :)
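The external cross-check of replication rates described above could look roughly like the sketch below. It assumes per-node RPC rates are collected by something outside the database nodes (for example, from network-level metrics); the function name `find_outlier_nodes`, the node names, and the 50% tolerance are illustrative assumptions, not anything from the RCA.

```python
from statistics import median
from typing import Dict, List


def find_outlier_nodes(rpc_rates: Dict[str, float], tolerance: float = 0.5) -> List[str]:
    """Return nodes whose replication RPC rate strays from the fleet median.

    `tolerance` is the allowed relative deviation (0.5 means 50%).
    The rates are assumed to be measured externally, so a sick node
    cannot hide the problem by misreporting its own health.
    """
    if not rpc_rates:
        return []
    mid = median(rpc_rates.values())
    if mid == 0:
        # With a zero median, any node doing replication at all stands out.
        return sorted(node for node, rate in rpc_rates.items() if rate != 0)
    return sorted(
        node for node, rate in rpc_rates.items() if abs(rate - mid) / mid > tolerance
    )


if __name__ == "__main__":
    observed = {"db-1": 100.0, "db-2": 98.0, "db-3": 103.0, "db-4": 5.0}
    print(find_outlier_nodes(observed))  # db-4 is far below the fleet median
```

Comparing against the median rather than the mean keeps one badly behaving node from shifting the baseline it is judged against.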

dps · on July 12, 2019

I'm Stripe's CTO and wrote a good deal of the RCA (with the help of others, including a lot of the engineers who responded to the incident). If you've any specific feedback on how to make this more useful, I'd love to hear it.

davidw · on July 12, 2019

I don't think either one is particularly "useful" to me as a consumer of the business, other than knowing that "we have top people working on it right now" and there's a plan in place to try and avoid future problems.

What's fun for a software person is that there's a lot of interesting digressions and stuff to learn in the Cloudflare one. The whole explanation of the regexp at the end is something that no one cares about from the business side, but is an interesting read in and of itself.

It's worth noting that yours came out a bit more than a week faster than theirs, which jgrahamc clearly spent a lot of time writing. No idea if anyone cares about the speed with which these things are released...

hibikir · on July 12, 2019

Hi Dave, you probably won't remember me (we only spent about 2 months together at Stripe), but I bet Mr Larson remembers.

The first question is who is this written for: It lacks the detail I would write for the incident review meeting audience, while lacking a simpler story for the non technical. As it is at the time I read it, I don't think it aims at any audience very well.

I understand that the level of detail of the internal report might be excessive for a public report, but if technical readers are the target, some more details would have helped. For example the monitoring details that Will described in another thread are a key missing detail that, if anything, would make Stripe look better, as problems like that happen all the time. I bet there are more details that are equally useful that would be in an internal report that would not reveal delicate information. In general, the only reason I could follow the document well is that I remember how the Stripe storage system worked last year, and I could handwave a year's worth of changes. Since this part of the Stripe infrastructure is relatively unique, it's difficult to understand from the outside, and looks as if it doesn't have enough information.

In particular, the remediations say very little that is understandable from the outside: Most of the text could apply to pretty much any incident on a storage or queuing subsystem I was ever a part of: More alerts, an extra chart in an ever growing dashboard, some circuit breakers to deal with the specific failure shape... It's all real, but without details, it says very little.

I understand why you might not want to divulge that level of detail though. If we want fewer details, then the article could cut all kinds of low-information sections, and instead focus more on the response, and the things that will be changed in the future. The most interesting bit about this is the quick version rollback, which, in retrospect, might not have been the right call. A more detailed view of the alternatives, and of why the actions that ultimately led to the second incident were taken, would be enlightening, and would humanize the piece.

Thank you for not just providing a public root cause analysis, but coming here to discuss it on HN.

patio11 · on July 12, 2019

I work at Stripe, on the marketing team, and assisted a bit here. My last major engineering work was writing the backend to a stock exchange.

If anyone on HN knows anyone who has the sort of interesting life story where they both know what can cause a cluster election to fail and like writing about that sort of thing, we would eagerly like to make their acquaintance.

luizfelberti · on July 12, 2019

Maybe Kyle Kingsbury (aka @aphyr) is the person you are looking for?

https://jepsen.io/services#consulting

wbronitsky · on July 13, 2019

Kyle used to work at Stripe and left. I don’t think he would come back unfortunately. That guy is absolutely amazing, especially with regards to distributed DBs and writing about them.

Ocha · on July 12, 2019

For starters maybe provide more details besides the vague information that some feature of some database didn't work as expected. Imagine you are giving this to your employees (especially new ones) to learn something. How much actual useful knowledge is being shared here to learn from?

chacham15 · on July 12, 2019

Unexpected things are bound to happen. But, one thing that stuck out to me is that you don't seem to have a safe way to test changes (which would have prevented the second failure). Are there no other environments to test changes on? Is there no way to incrementally roll-out? Is there not another environment which can step in in place of a failing one while you investigate? These seem like fairly common industry practices which help you deal with unexpected failures, but I don't see a mention of if/why these practices failed and if/how that is being remediated.

jabart · on July 12, 2019

It would be great if, in these types of situations, the CC tokens' validity period were extended, or at least made known, as the documentation only states that it is short. For our app, if the tokens were valid longer, we could write this up as a non-event and retry when things were better.

dps · on Dec 6, 2018

(Stripe head of engineering here) I hosted the open house in Dublin, thanks for coming and sorry to hear about your experience; even if it doesn't end up being a fit, we always want candidates to feel respected rather than demoralized. Without knowing the specifics of your situation, I do want to be clear that there are obviously a lot of factors for us to consider when evaluating candidates, but age isn't one of them.

dps · on April 22, 2018

Does the combination of hypothermic state and medically induced coma (“hibernation”) have a prolonging impact on life expectancy? The lower metabolic rate would intuitively suggest that’s a possibility. It’s interesting to think about how this technology, even ahead of use in travel to deep space could be used as a forward only time machine.

dps · on May 14, 2016

If you already use Pocket, then Pocket 2 Kindle https://p2k.co/manage/home is a good service to send stuff from your reading list, nicely formatted, to your Kindle.

I wrote something similar (taking lists of URLs) a couple of years ago and it's still live at www.kindlized.com but when I thought about the next step of grabbing my Pocket article list I found Pocket 2 Kindle and never bothered to update my own - kind of fun and sad at the same time to discover that someone has already built the thing you just thought of :-)

dps · on Jan 9, 2016

I've made a preview PDF available at singleton.io/alpha/Journal.pdf also.

dps · on Nov 30, 2014

Posting to see what HN make of this. A friend backed the Lantern Indiegogo campaign yesterday [https://www.indiegogo.com/projects/lantern-one-device-free-d...] which sounded exciting and useful. I love the idea here - core knowledge, news, crisis response stuff broadcast globally.

While I'm excited about this and will probably back the Lantern myself, the current Outernet could really use help with content. I downloaded the whole of the Outernet [328 MB compressed] and most of the stuff is of pretty dubious interest: "Two Randomized Trials Provide No Consistent Evidence for Nonmusical Cognitive Benefits of Brief Preschool Music Enrichment", or advances a pretty niche non-mainstream media outlook (the Corbett report podcast).

dps · on Nov 30, 2014

The broadcast currently contains 758 articles, of which 676 are Wikipedia pages.

The remainder are:

  - 15 Gutenberg texts (e.g. Moby Dick, Fairy Tales by the Brothers Grimm, ...)

  - 10 open access Harvard papers

  - 54 dw.de news articles

  and...

“Making (Up) an Archive: What Could Writing History Look Like in a Digital Age?” - outernet://dash.harvard.edu/handle/1/11297828

There Can Be No Turing-Test--Passing Memorizing Machines - outernet://dash.harvard.edu/handle/1/11684156

Corbett Report Episode 293 - The Ebola Effect - outernet://www.corbettreport.com/episode-293-the-ebola-effect/

The Place of the Gospel of Philip in the Context of Early Christian Claims about Jesus’s Marital Status - outernet://dash.harvard.edu/handle/1/11041837

The Activity of Reason - outernet://dash.harvard.edu/handle/1/3415961

Corbett Report Episode 293 - The Ebola Effect - outernet://www.corbettreport.com/episode-293-the-ebola-effect/

Civilization Starter Kit v0.01 - outernet://opensourceecology.org/outernet://opensourceecology.org/outernet://opensourceecology.org/Civilization_Starter_Kit_v0.01.pdf

Two Randomized Trials Provide No Consistent Evidence for Nonmusical Cognitive Benefits of Brief Preschool Music Enrichment - outernet://dash.harvard.edu/handle/1/11276120

The Collage of Humanity - https://collage.outernet.is/

Civilization Starter Kit v0.01 - outernet://opensourceecology.org/outernet://opensourceecology.org/Civilization_Starter_Kit_v0.01.pdf

Civilization Starter Kit v0.01 - outernet://opensourceecology.org/Civilization_Starter_Kit_v0.01.pdf

The Parable of Google Flu: Traps in Big Data Analysis - outernet://dash.harvard.edu/handle/1/12016836

"Learning from utopia: contemporary architecture and the quest for political and social relevance." - outernet://dash.harvard.edu/handle/1/10579145

Middle-Period Discourse on the Zhong Guo: The Central Country - outernet://dash.harvard.edu/handle/1/3629313

Be Careful What You Ask For: Reconciling a Global Internet and Local Law - outernet://dash.harvard.edu/handle/1/9696322

Thinking about prestige, quality, and open access - outernet://dash.harvard.edu/handle/1/4322577

dps · on Nov 23, 2014

Isn't this what Accept: image/webp is for?
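For context, a server can negotiate on the `Accept` request header roughly as in the sketch below. The function name `pick_image_format` and the JPEG fallback are assumptions for illustration; a production server would normally rely on its framework's content negotiation and also send `Vary: Accept` so caches keep the variants apart.

```python
def pick_image_format(accept_header: str) -> str:
    """Return "webp" if the client's Accept header allows image/webp, else "jpeg"."""
    for part in accept_header.split(","):
        media_type, _, params = part.strip().partition(";")
        if media_type.strip().lower() != "image/webp":
            continue
        # A quality value of 0 means the client explicitly refuses this type.
        quality = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        if quality > 0:
            return "webp"
    return "jpeg"


if __name__ == "__main__":
    print(pick_image_format("image/webp,image/apng,*/*;q=0.8"))
    print(pick_image_format("image/png,*/*;q=0.8"))
```

The sketch only checks for an explicit `image/webp` entry and deliberately ignores wildcards such as `*/*`, since a wildcard alone does not show that the client can decode WebP.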

dps · on May 11, 2014

Could pick a slightly more inspiring example chat for the landing page!

bluehex · on May 11, 2014

I agree it's a terrible screenshot. Without any context I have no idea what those icon -> icon dividers mean. I guess it's someone changing their icon?

dps · on May 11, 2014

I have a r/g color blind friend who has a pair of these.

He said: "They are amazingly totally worth the money and work very well. I have video of myself realizing that Starbucks is green and not brown for the first time. Mindfuck I thought brown like coffee."