The biggest thing I would add is a question: why does CircleCI have only one ops person handling the day-to-day operations of all of these moving parts? There are a considerable number of them, and one person fighting to keep everything from falling over leaves no room for other people, with other operations experience, to look at solutions that could at least provide a bit of breathing room. For example, there are a few ways of throttling based on IP; limiting requests from GitHub to, say, a kilobyte per second of bandwidth would have slowed that incoming tide enough to let the queue start to drain.
This is my biggest frustration with cloud-based services. Just because you don't have physical systems to maintain doesn't mean you don't need people who are comfortable climbing in and through the systems and network levels of the stack to ensure everything's working as well as it could be. A good, solid ops department gives a different perspective, and its focus on other layers means that developers don't have to worry about those pieces. Ops is not a profanity, and just because your devices are virtualized doesn't mean you don't need ops people for the rest of the things they do.
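For what it's worth, the per-IP throttling idea above can be sketched as a token bucket keyed by source address. This is only a rough illustration, not anything CircleCI actually runs: the class name `PerSourceThrottle`, its parameters, and the example addresses are all made up here, and in practice this kind of limit usually lives in a proxy or firewall rather than in application code. Passing the request size in bytes as `cost` turns it into a bandwidth cap instead of a request-count cap.

```python
import time


class PerSourceThrottle:
    """Hypothetical token-bucket limiter, one bucket per source address."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second (e.g. bytes/s)
        self.burst = burst    # bucket capacity, i.e. the largest allowed burst
        self.clock = clock    # injectable so the behaviour can be tested
        self._buckets = {}    # source -> (tokens_left, time_of_last_check)

    def allow(self, source, cost=1.0):
        """Return True if `source` may spend `cost` tokens right now."""
        now = self.clock()
        # A source we have never seen starts with a full bucket.
        tokens, last = self._buckets.get(source, (self.burst, now))
        # Refill for the time elapsed, never beyond the bucket capacity.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= cost:
            self._buckets[source] = (tokens - cost, now)
            return True
        # Not enough tokens: remember the refill, reject (or queue) the request.
        self._buckets[source] = (tokens, now)
        return False


if __name__ == "__main__":
    # Roughly "a kilobyte per second" for one noisy source (assumed numbers).
    throttle = PerSourceThrottle(rate=1024, burst=4096)
    print(throttle.allow("192.0.2.1", cost=2048))  # fits in the initial burst
    print(throttle.allow("192.0.2.1", cost=4096))  # bucket is too empty now
```

Rejected requests are not lost if the sender retries, so the effect is to stretch the incoming burst out over time, which is what gives the queue a chance to drain.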
I enjoyed this piece. One piece of hard-learned advice: add exponential backoff on failure to your clients, now. It takes only a small amount of work, and it will save you from the inevitable self-inflicted DDoS when your ingestion endpoint hiccups and your clients' buffered data creates a load so large that you may not be able to recover.
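A minimal sketch of what that client-side backoff might look like, assuming a `send` callable that raises `ConnectionError` when the endpoint is unhealthy. The function names, defaults, and the choice of "full jitter" (sleeping a random fraction of the exponential ceiling so clients don't all retry in lockstep) are illustrative assumptions, not anything taken from the article.

```python
import random
import time


def backoff_delays(base=0.5, cap=60.0, retries=6, rng=random.random):
    """Yield one sleep duration per retry: exponential growth, capped, with jitter."""
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))  # 0.5, 1, 2, 4, ... up to cap
        yield rng() * ceiling                      # random point in [0, ceiling)


def send_with_backoff(send, payload, sleep=time.sleep, **backoff_kwargs):
    """Call send(payload), backing off between failures instead of hammering."""
    for delay in backoff_delays(**backoff_kwargs):
        try:
            return send(payload)
        except ConnectionError:
            sleep(delay)
    # Last try after all the delays are used up; let any error propagate.
    return send(payload)


if __name__ == "__main__":
    attempts = []

    def flaky_send(payload):
        # Hypothetical endpoint that fails twice and then recovers.
        attempts.append(payload)
        if len(attempts) < 3:
            raise ConnectionError("ingestion endpoint unavailable")
        return "accepted"

    print(send_with_backoff(flaky_send, {"metric": 1}, base=0.01))
```

The cap matters as much as the exponent: without it, a long outage leaves clients sleeping for hours, and without the jitter they all wake up at the same moment and recreate the spike.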
One thing about CircleCI's postmortem is that it doesn't really specify what kind of DB they use. It sounds like they implemented something on their own. (I'm betting it's not Cassandra.)