
This was the first time we'd had this class of outage. Many things were in a very bad state at once, and most of those symptoms were ones we were already familiar with, so we spent time ruling them out before realising that webserver CPU was closer to the root cause than the other symptoms.

We roll back by reverting to a previous release on the load balancers, which is usually pretty instant. The previous releases were bad and had themselves been rolled back, which is a rare situation for us, so there was a bit of scrambling through the chat logs to determine a safe (non-rolled-back) release we could roll back to. Then the high CPU made our rollback really, really slow. Then we still had old processes from the bad release running, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post, and I'm still simplifying a lot.
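
For anyone unfamiliar with that style of rollback, here is a rough sketch of the general idea (the paths, proxy, and release layout are assumptions for illustration, not the actual setup described above): each release lives in its own directory, the load balancer serves whatever a "current" symlink points at, and rolling back means repointing that symlink and reloading the proxy.

    import os
    import subprocess
    import sys

    RELEASES_DIR = "/srv/releases"   # assumed layout: one immutable directory per release
    CURRENT_LINK = "/srv/current"    # the load balancer config references this symlink

    def rollback(release_id: str) -> None:
        target = os.path.join(RELEASES_DIR, release_id)
        if not os.path.isdir(target):
            sys.exit("unknown release: " + release_id)

        # Build the new symlink under a temporary name, then rename it over the
        # old one so the switch is atomic and there is never a missing release.
        tmp_link = CURRENT_LINK + ".tmp"
        if os.path.lexists(tmp_link):
            os.unlink(tmp_link)
        os.symlink(target, tmp_link)
        os.replace(tmp_link, CURRENT_LINK)

        # Graceful reload so in-flight connections are not dropped.
        subprocess.run(["nginx", "-s", "reload"], check=True)

    if __name__ == "__main__":
        rollback(sys.argv[1])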




What I don't get is why you didn't see the relatively low CPU usage on the database server and the super high usage on the webservers immediately in a Nagios (or similar) dashboard.


They were distracted by previous experience of these issues coming from elsewhere.


And apparently there were no alarms in place for this kind of thing.


Apparently a lot of parts of the system were in an alarm state.


It's because they don't have a simple rollup dashboard where you can see that at a glance, like most places. Can you imagine if your car just showed you an event log for door open, low oil, turn signals on, etc.? That's what most monitoring systems are like these days.


Rollbacks are in chat logs? I'd assume your scripts would record what they do when they do it, including rollbacks.

Also, when only deploying twice a day, it's harder to tell which of the included changes caused the problem. That's an argument for more frequent deploys!
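
Something like the sketch below is what I mean by record-keeping (the file path, field names, and "action" values are made up): every deploy and rollback appends a JSON line, so finding the newest release that was never rolled back is a query against a file rather than a trawl through chat history.

    import json
    import time

    DEPLOY_LOG = "/var/log/deploys.jsonl"  # hypothetical append-only deploy log

    def record(action: str, release_id: str) -> None:
        """Append one entry per action ('deploy' or 'rollback')."""
        entry = {"ts": time.time(), "action": action, "release": release_id}
        with open(DEPLOY_LOG, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def last_known_good() -> str | None:
        """Newest deployed release with no rollback entry against it."""
        with open(DEPLOY_LOG) as f:
            entries = [json.loads(line) for line in f]
        rolled_back = {e["release"] for e in entries if e["action"] == "rollback"}
        for e in reversed(entries):
            if e["action"] == "deploy" and e["release"] not in rolled_back:
                return e["release"]
        return None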


Seems like pretty ambitious logging if it tripped the servers! I'll be careful with my own logging next time :)


Out of curiosity, why are you deploying to all your web servers simultaneously? Could you not do a partial rollout to make sure something like this doesn't happen?


I doubt a partial rollout would have helped in this particular case, since the issue only appears under high load and they roll out new code twice a day.


Correct. We don't roll out during peak load either.


Considered at least starting your release canary during peak load?


We have talked about it. It's unlikely to have helped with an event like this, and I don't recall an event where it would have. It also has the downside of extending our deployment cycle by a lot. Notably, we do run a canary internally, and that had no issues, which actually threw us off for a while: the app was partially down for users but working fine for us, and that hasn't happened to us in a while.
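
For reference, the kind of load-aware partial rollout being discussed here might look roughly like the sketch below (hostnames, the CPU threshold, the soak time, and the deploy/rollback/metrics helpers are all hypothetical placeholders, not how the site actually deploys): push the release to a small canary slice, watch webserver CPU during a soak window, and only continue to the rest of the fleet if it stays sane.

    import time

    CANARY_HOSTS = ["web-01", "web-02"]                  # small slice of the fleet
    REMAINING_HOSTS = [f"web-{i:02d}" for i in range(3, 21)]
    CPU_LIMIT = 80.0                                     # percent, picked arbitrarily
    SOAK_SECONDS = 600                                   # how long to watch the canaries

    def deploy(hosts, release_id):    # placeholder for the real deploy step
        raise NotImplementedError

    def rollback(hosts, release_id):  # placeholder for the real rollback step
        raise NotImplementedError

    def cpu_percent(host) -> float:   # placeholder for a metrics/monitoring query
        raise NotImplementedError

    def canary_rollout(release_id: str) -> bool:
        deploy(CANARY_HOSTS, release_id)
        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            if any(cpu_percent(h) > CPU_LIMIT for h in CANARY_HOSTS):
                rollback(CANARY_HOSTS, release_id)
                return False          # bad release never reaches the rest of the fleet
            time.sleep(30)
        deploy(REMAINING_HOSTS, release_id)
        return True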



