
This was the first time we'd had this class of outage. Many things were in a very bad state at once, and most of those symptoms were ones we were already familiar with, so we spent time ruling them out before realising that webserver CPU was closer to the root cause than the other symptoms.

We roll back by reverting to a previous release on the load balancers, which is usually pretty instant. The previous releases were bad and had themselves been rolled back, which is a rare situation for us, so there was a bit of scrambling through the chat logs to determine a safe (non-rolled-back) release we could roll back to. Then the high CPU made our rollback really, really slow. Then we still had old processes from the bad release running, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post, and I'm still simplifying a lot.
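
For anyone unfamiliar with that style of rollback, here is a rough sketch of the general idea (the paths, proxy, and release layout are assumptions for illustration, not the actual setup described above): each release lives in its own directory, the load balancer serves whatever a "current" symlink points at, and rolling back means repointing that symlink and reloading the proxy.

    import os
    import subprocess
    import sys

    RELEASES_DIR = "/srv/releases"   # assumed layout: one immutable directory per release
    CURRENT_LINK = "/srv/current"    # the load balancer config references this symlink

    def rollback(release_id: str) -> None:
        target = os.path.join(RELEASES_DIR, release_id)
        if not os.path.isdir(target):
            sys.exit("unknown release: " + release_id)

        # Build the new symlink under a temporary name, then rename it over the
        # old one so the switch is atomic and there is never a missing release.
        tmp_link = CURRENT_LINK + ".tmp"
        if os.path.lexists(tmp_link):
            os.unlink(tmp_link)
        os.symlink(target, tmp_link)
        os.replace(tmp_link, CURRENT_LINK)

        # Graceful reload so in-flight connections are not dropped.
        subprocess.run(["nginx", "-s", "reload"], check=True)

    if __name__ == "__main__":
        rollback(sys.argv[1])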




What I don't get is why you didn't see the relatively low CPU usage on the database server and the super high usage on the webservers immediately in a Nagios (or similar) dashboard.


They were distracted by previous experience of these issues coming from elsewhere.


And apparently there were no alarms in place for this kind of thing.


Apparently a lot of parts of the system were in an alarm state.


It's because they don't have a simple rollup dashboard where you can see that at a glance, like most places. Can you imagine if your car just showed you an event log for door open, low oil, turn signals on, etc.? That's what most monitoring systems are like these days.


Rollbacks are in chat logs? I'd assume your scripts would record what they do when they do it, including rollbacks.

Also, when only deploying twice a day, it's harder to tell which of the included changes caused the problem. That's an argument for more frequent deploys!
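
Something like the sketch below is what I mean by record-keeping (the file path, field names, and "action" values are made up): every deploy and rollback appends a JSON line, so finding the newest release that was never rolled back is a query against a file rather than a trawl through chat history.

    import json
    import time

    DEPLOY_LOG = "/var/log/deploys.jsonl"  # hypothetical append-only deploy log

    def record(action: str, release_id: str) -> None:
        """Append one entry per action ('deploy' or 'rollback')."""
        entry = {"ts": time.time(), "action": action, "release": release_id}
        with open(DEPLOY_LOG, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def last_known_good() -> str | None:
        """Newest deployed release with no rollback entry against it."""
        with open(DEPLOY_LOG) as f:
            entries = [json.loads(line) for line in f]
        rolled_back = {e["release"] for e in entries if e["action"] == "rollback"}
        for e in reversed(entries):
            if e["action"] == "deploy" and e["release"] not in rolled_back:
                return e["release"]
        return None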


Seems like pretty ambitious logging if it tripped the servers! I'll be careful with my own logging next time :)


Out of curiosity, why are you deploying to all your web servers simultaneously? Could you not do a partial rollout to make sure something like this doesn't happen?


I doubt a partial rollout would have helped in this particular case, since the issue only appears under high load and they roll out new code twice a day.


Correct. We don't roll out during peak load either.


Considered at least starting your release canary during peak load?


We have talked about it. It's unlikely to have helped with an event like this, and I don't recall an event where it would have. It also has the downside of extending our deployment cycle by a lot. Notably, we do run a canary internally, and that had no issues, which actually threw us off for a while: the app was partially down for users but working fine for us, and that hasn't happened to us in a while.
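
For reference, the kind of load-aware partial rollout being discussed here might look roughly like the sketch below (hostnames, the CPU threshold, the soak time, and the deploy/rollback/metrics helpers are all hypothetical placeholders, not how the site actually deploys): push the release to a small canary slice, watch webserver CPU during a soak window, and only continue to the rest of the fleet if it stays sane.

    import time

    CANARY_HOSTS = ["web-01", "web-02"]                  # small slice of the fleet
    REMAINING_HOSTS = [f"web-{i:02d}" for i in range(3, 21)]
    CPU_LIMIT = 80.0                                     # percent, picked arbitrarily
    SOAK_SECONDS = 600                                   # how long to watch the canaries

    def deploy(hosts, release_id):    # placeholder for the real deploy step
        raise NotImplementedError

    def rollback(hosts, release_id):  # placeholder for the real rollback step
        raise NotImplementedError

    def cpu_percent(host) -> float:   # placeholder for a metrics/monitoring query
        raise NotImplementedError

    def canary_rollout(release_id: str) -> bool:
        deploy(CANARY_HOSTS, release_id)
        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            if any(cpu_percent(h) > CPU_LIMIT for h in CANARY_HOSTS):
                rollback(CANARY_HOSTS, release_id)
                return False          # bad release never reaches the rest of the fleet
            time.sleep(30)
        deploy(REMAINING_HOSTS, release_id)
        return True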



