> One of the biggest customer-facing effects of this delay was that status.github.com wasn't set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.
Amazon could learn a thing or two from Github in terms of understanding customer expectations.
I recently stepped into a role with a devops component, and one of my first surprises was just how slow status.aws.amazon.com was to update about ongoing issues. I had to scramble to find confirmation on Twitter and external forums for the client.
What's even worse is that when Amazon finally updates their status page, it's usually still a green icon with a little "i" badge for "information", even if it was a partial outage. It takes a lot for the icons to go red, which is what you'd look for if you're experiencing issues.
I do the same thing, often searching Twitter for "aws" or "outage" and finding people complaining about the problem, which confirms my suspicions. It's a sad state of affairs when you have to do this, and Amazon doesn't seem interested in fixing it.
The most recent issue that affected me was when all EC2 instances in VPCs couldn't connect to S3. At all.
It wasn't indicated on the status page until after it was fixed. And it was indicated as a green check in a sea of green checks. With a small "i" in the corner to represent the outage.
I love AWS. It's not without fault but overall I think it's been well architected, well documented, and well implemented.
But the status page has got to be the ultimate example of what not to do.
Huh, I wonder if the status page is in fact based on any automated monitoring at all, or just manual updates? I guess probably automated monitoring, just not very good automated monitoring.
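The external-probe half of that isn't the hard part; here's a minimal sketch of the idea in plain Ruby (placeholder URL, arbitrary threshold, not anyone's actual monitoring):

```ruby
require "net/http"
require "uri"

# Hypothetical external probe: hit the public site from outside your own
# network and count consecutive failures before calling it an outage.
ENDPOINT  = URI("https://status-probe-target.example/health")  # placeholder URL
THRESHOLD = 3   # consecutive failures before flipping to "red"

failures = 0
loop do
  begin
    res = Net::HTTP.start(ENDPOINT.host, ENDPOINT.port,
                          use_ssl: true, open_timeout: 5, read_timeout: 5) do |http|
      http.get(ENDPOINT.path)
    end
    failures = res.is_a?(Net::HTTPSuccess) ? 0 : failures + 1
  rescue StandardError
    failures += 1
  end

  # Replace this with whatever actually drives the status page.
  puts(failures >= THRESHOLD ? "status: red" : "status: green")
  sleep 30
end
```

The hard part, judging by the rest of this thread, is the policy question of letting a script like that flip the public icon to red on its own.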
If you have a support agreement with them then file a ticket requesting better customer communication and link back here as an example of how to do it right.
I think everyone complains in forums and online but doesn't actually file tickets about it. These things are worth tickets too.
I take it you have no experience filing tickets with them. A typical ticket goes something like this:
1. File ticket.
2. Wait. Then wait some more. Even if you pay big money for a support contract, they take a long time to respond (often > 1 hour).
3. Get a response from a first-level rep who has no access to anything, has little dev experience, and asks some inane questions which I'm convinced are a purposeful stalling tactic.
4. Play the dumb question/obvious response dance, waiting an hour or more for a response each time.
5. If you are lucky (usually a couple hours in now) they acknowledge there's some problem (but never give you any detail) and escalate your ticket to a higher level internal team. If you are unlucky, you are calling up your account rep (do you even have one??) and getting them to harass tech support.
6. Usually around now the problem "magically" disappears if you haven't already fixed it yourself.
7. If you are lucky, a few hours, days, weeks later you get a response asking if you are still having the problem? You, of course, are NOT having the problem since you long ago solved it yourself. If you are really unlucky they try to schedule a meeting with one of their "solution architects" who is then going to waste an hour of your time telling you how to properly "design" your software for the cloud (i.e. trying to sell you on even more of their services).
8. Ticket is closed having never gotten to the bottom of the problem, maybe get a survey.
I've never seen this go down differently. Filing more tickets isn't going to change this. You want to really change things?
STOP PAYING THEM!
If a few mid-sized customers stop paying them and make a big-stink when they do it, then I guarantee you things will change! Until then, they have little incentive to improve and the big customers have a direct line to Amazon so they can circumvent all this crap. It's up to the small and mid-sized customers to push for change and the most effective way to do this would be to spend your money elsewhere.
To be honest, I've always found their support to be really good. Sometimes it can be a little slow to start, but I regularly encounter technicians who go way above what I would expect to assist me and deliver a great outcome. If other companies in Australia were as responsive as them (e.g. telcos), I'd be a very happy man. EDIT: I'm on Business Support, so maybe that's your issue?
I'm also in Australia and have nothing but good things to say about AWS support; my issues are usually solved by the first responder (not necessarily on the first response). The technical skill has generally been pretty good.
But it's not specific to us down under - the support contacts come from all over the globe. We dropped from Business to Developer support when the $A tanked in order to save a buck, and it just takes a little longer is all - no real drop in quality. I wish other large companies had their level of support quality.
I'm on business support too and generally am talking to a rep in minutes. They aren't always able to find the problem before I do, but I always get follow up details later on the how / why that they did determine.
I wish our experience was like this. We used to have business level but we dropped it because we weren't getting value for it. Our experience was slightly better when we had it but we still ended up either fixing most problems on our own or waiting them out.
Which is unacceptable for one of the largest infrastructure providers. So many times we were sitting around twiddling our thumbs waiting for our expensive amazon support to get back to us when things were broken.
Same experience here. But: I've had luck complaining with a few well-chosen hashtags and mentions on twitter, getting the attention of a tech lead related to a particular AWS service.
One example: redshift. Had an expensive temporary cluster that couldn't be deleted, for days. Was stuck "pending" or "rebuilding". The assigned account rep would take forever to respond, just didn't understand, and would forward directions for using the AWS console. Yeah, DOESN'T WORK. After a week I decided to try getting attention on twitter, and got it fixed in about 12 hours.
>2. Wait. Then wait some more. Even if you pay big money for a support contract, they take a long time to respond (often > 1 hour).
My experiences don't reflect this; perhaps we are familiar with different levels of support contracts. I use AWS for work only, so I can only speak to one level of their support.
>3. Get a response from a first-level rep who has no access to anything, has little dev experience, and asks some inane questions which I'm convinced are a purposeful stalling tactic.
>4. Play the dumb question/obvious response dance, waiting an hour or more for a response each time.
I can't agree with this either. I almost always use their chat option and a rep is usually available within 15m unless there is an AWS outage.
I do however completely agree with 5 and 6, but I don't let it bother me. They can't expose too much info about their infrastructure. I'm usually just looking for confirmation of whether an issue is on their side or not, which they have always been willing to provide.
If you're using aws for business and are unhappy with their current level of support, maybe you should talk with their sales folks to find out about higher-tier support plans.
I think a lot of folks feel that it's a useless endeavor, so they don't bother. Amazon's been operating this way for years, and they're quite a large company; it seems unlikely to me that fundamental change can happen inspired by customer tickets, even if you're paying for support.
Basically, if Netflix isn't the source of the complaint, they're not going to give two fucks.
/me suspects that netflix engineers get outage notifications through some other avenue than the status page.
Day to day I mostly write software, but I also help manage our infrastructure (we're a small company - 9 people total, 4 engineers, and I'm one of the 2 that understand managing servers well enough to support it). We were on Linode up until about a year and change ago and switched to AWS/OpsWorks to both decrease our infrastructure bill and increase our ability to scale horizontally quickly (for unfortunately long definitions of quickly - "running setup...")
Both Linode and Amazon suck at their status pages (though linode was quite informative about their DDoS outages that started on Christmas). Every amazon issue we've had, the status page only changed once they'd more or less fixed it. As far as I'm concerned their status page is basically useless unless it's an extended outage, at which point it's still basically useless...
> Amazon could learn a thing or two from Github in terms of understanding customer expectations.
Do you mean that "the cloud provider that is bigger than the next 14 combined and whose jargon has spread through the community" doesn't understand what customers are interested in and delivering on that?
Gonna speak up to defend OP here: I've worn the devops hat for products across multiple "Large Companies" (Amazon and larger scale) and found that for small products where it was me and a few other devs keeping the lights on, we would have outage alerts on status pages/twitter typically _before_ public users even realized something was wrong, since we were all very high touch on the project.
The bigger a project gets, the less prioritized something like a status page often seems to get. Larger entities certainly _have_ them but I often see more things interfering as scale grows (this isn't only a MS thing, let me make clear) whether it be domain switches between engineering and social management (status is often via twitter), feeding the status page via a long telemetry/monitoring platform that has some lag, or a high threshold for what "outage" means to avoid flappy notices (at the cost of some false negatives).
I'm not even going to make a value judgement on the tradeoff of these costs at this point (I certainly wouldn't dismiss it offhand as a net negative, although equally it's not all roses), but at the very least I'd observe that something like a status page _CAN_ be serviced very well by an up-and-comer (for as much as Github is that any more), and it's far from true that bigCos can't take lessons on improving customer happiness from newer entities. (In fact, I wish that were a more common practice!)
There's no mention of why they don't have redundant systems in more than one datacenter. As they say, it is unavoidable to have power or connectivity disruptions in a datacenter. This is why reliable configurations have redundancy in another datacenter elsewhere in the world.
Given the dependency in question is Redis, the situation is probably exacerbated by the fact that Redis hasn't really had a decent HA solution.
This is also hidden by the fact that Redis is really reliable (in my experience at least). It usually takes an ops event (like adding more RAM to the Redis machine) to reveal where a crutch has developed on Redis in critical paths.
A lot of tools and services people use either don't have HA at all or don't have native support for true distributed HA. But that doesn't stop people from building some HA-like solution. I'm not sure what they use Redis for, but along the lines of caching and key-value storage they must have figured out how to invalidate data, otherwise they'd be running only a single instance of Redis. In other words, they are running "HA" just within a single data center, so logically speaking that shouldn't be difficult to port over to another data center.
I'm not familiar enough with Redis's clustering features to speak to the exact issues with what you're proposing, but generally speaking, HA is almost a completely different problem than disaster recovery (DR). Sure, the protocol is the protocol, but you wouldn't want to cluster local and remote nodes together for several reasons, primarily latency, security, and resiliency. Performance will suffer if they're clustered together and a single issue could take down nodes in both data centers, which kind of defeats the purpose.
What you really want is a completely separate cluster running in a different data center (site). It should be isolated on its own network and ideally it should have different admin rights/credentials and a different software maintenance (patching) schedule. A completely empty site isn't much use so you'll need some kind of replication scheme. Naturally, these isolating steps make site replication difficult. You might patch one site and now the replication stream is incompatible with the other site. (You can't patch both sites at the same time because the patch might take down the cluster.) Or whatever you're using to replicate the sites, which has credentials to both sites, breaks and blows everything up. You need a way to demote and promote sites and a constraint on only one site being the "master" at a time. What happens if network connectivity is lost between sites? What happens if one site is down for an extended period of time? Maybe you need a third, tie-breaking site?
Once you work through these issues, you are still exposed to user error. Your replication scheme might be perfect... perfect enough that an inadvertently dropped table (or whatever) is instantly replicated to the other site and is now unrecoverable without going to tape. Maybe you introduce a delay in the replication to catch these oopsies, but now your RPO is affected. Anyway, it's a bit of a shell game of compromises and margins of error.
Source: 10 years designing and building HA/DR solutions for Discover Card.
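To make the promote/demote step concrete, here's a rough sketch assuming the redis-rb gem and made-up hostnames; a real cross-site failover also needs the fencing, health checks, and tie-breaking discussed above:

```ruby
require "redis"  # redis-rb gem

# Hypothetical hosts; in practice these come from service discovery/config.
PRIMARY_SITE = { host: "redis.site-a.internal", port: 6379 }
STANDBY_SITE = { host: "redis.site-b.internal", port: 6379 }

# Promote the standby site: stop replicating and start accepting writes.
standby = Redis.new(**STANDBY_SITE)
standby.slaveof("no", "one")   # SLAVEOF NO ONE -> become a master

# Demote the old primary (if reachable) so only one site ever takes writes.
begin
  Redis.new(**PRIMARY_SITE).slaveof(STANDBY_SITE[:host], STANDBY_SITE[:port])
rescue Redis::CannotConnectError
  # Old site is down; it must be re-pointed before it's allowed back in.
end
```

The "only one master at a time" constraint is exactly why the rescue branch matters: if the old primary can't be demoted now, something has to guarantee it never silently takes writes when it comes back.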
I was also wondering what they are using Redis for; found this article [1] from a while ago discussing Redis at Github; presumably the architecture has moved on a bit since then, but this may shed a bit of light on the subject.
Of course Sentinel does not make Redis conceptually different from what it is from the point of view of consistency guarantees during failures. It performs a best-effort attempt to select the best slave to retain writes, but under certain failure modes it's possible to lose writes during a failover.
This is common with many failover solutions of *SQL systems as well btw. It depends on your use case if this is an affordable risk or not. For most Redis use cases, usually the risk of losing some writes after certain failovers is not a big issue. For other use cases it is, and a store that retains the writes during all the failure scenarios should be used.
>Redis is really reliable (in my experience at least)
Redis has been demonstrated[0][1] to lose data under network partitions. This is particularly concerning when discussing the type of partial failure that GitHub reported.
Not sure how your comment refutes the contention of reliability. Seems to me to be more a condemnation of failures that do happen (which is of course worthy of concern, but irrelevant in a conversation about stability).
I am using "reliability" in the sense of RAS[0]. An HA datastore which erroneously ACK's writes has lowered reliability, as there are known cases where it gives incorrect outputs.
> There's no mention of why they don't have redundant systems in more than one datacenter
sometimes reading comments on hn makes me laugh out loud.
there's only one reason to not do this, and that's cost. what do you expect them to say about that? i mean really, you think they're going to put that in a blog post:
"Well, the reason we don't have an entire replica of our entire installation is because it costs way too much. In fact, more than double! And so far our uptime is actually 99.99% so there's no way it's worth it! You can forget about that spend! Sorry bros."
This is not only obviously true, I think it is also a completely reasonable calculus. They just proved that if the entire Redis cluster goes down they can get it back in 2.5 hours. It's almost certainly a caching layer, so there is no permanent data loss. If they fix the application bootstrap dependency on a Redis connection, and they add monitoring to more easily see in the future when the Redis cluster is the problem, next time that time period will probably be way shorter.
So, a very small risk of an hour or so of downtime sometime in the future which will not cause data loss, or tens of thousands of dollars a month for a failover cluster? I wouldn't replicate it either.
>It's almost certainly a caching layer, so there is no permanent data loss.
People who use Redis rarely end up using it solely as a caching layer. It often also takes on the role of an RPC facilitator and pseudo-database. GitHub's post also mentions that their engineering team had to replicate Redis' dataset before they could get the alternative hardware running, which implies that they do need some data in there before the site is operational.
Personally, one of my pet peeves is people throwing mission-critical data in Redis and acting like it's hunky-dory. It happens all the time and seems really difficult to get people to not do. There's a reason we have a real ACID-compliant database storing non-disposable data; it's ridiculous to ignore that just because it's easier to stuff it in Redis.
I think it's reasonable to have a dependency on a Redis server, but I don't think it's reasonable to depend on any data in particular being stored in that server. It should be used as a caching/acceleration layer for data that can be easily and automatically regenerated.
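A minimal sketch of that pattern with redis-rb (`expensive_lookup` is a stand-in for whatever canonical source the data can be regenerated from):

```ruby
require "redis"
require "json"

REDIS = Redis.new  # assumes a local Redis; configure for your environment

# Cache-aside: Redis only ever holds a copy that can be rebuilt from the
# canonical store, so losing it costs latency, not data.
def cached_fetch(key, ttl: 300)
  if (hit = REDIS.get(key))
    return JSON.parse(hit)
  end
  value = yield                       # regenerate from the source of truth
  REDIS.set(key, JSON.generate(value), ex: ttl)
  value
rescue Redis::BaseError
  yield                               # Redis down? Serve slower, not broken.
end

# cached_fetch("user:42:profile") { expensive_lookup(42) }
```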
Just a thought on something I've learned over a few years. Sometimes the most correct way isn't necessarily the best. An example here might be that the Redis DB is being used to store data which is constantly being read. While keeping it in a MySQL instance might be the most correct method, the end result might actually be slower. This is just my naive guess, but the point is that sometimes, given a particular context, the value of taking a hacky/less correct solution becomes great enough to use it.
It's solely about the effort; it's a lot easier to just say redis.set('some_random_name', value) than it is to figure out where something should go in the schema of an RDBMS. If the data needs to persist, it needs to be written to a database that provides good guarantees about data integrity. If someone wants to load the results of a query into Redis, more power to them, but I've come across a lot of people who just stuff things in memory-backed K-V stores with the apparent expectation that nothing could ever happen to that data. Developers have told me "Well, Redis writes to disk on shutdown, right?" and acted like that was good enough for permanent storage of mission-critical data.
I have no fundamental opposition to K-V stores or NoSQL databases, but I do think most developers favor them because it's easier to stuff them with data up front. There are big tradeoffs down the road, though, which companies don't seem to understand well, and which they aren't really equipped to handle.
I'm unfortunately not equipped with knowledge about how people use/abuse Redis-like storage mechanisms. But that bit about how NoSQL is used as a point of upfront convenience is spot on. The biggest reason people have given me when I ask them why they want Mongo is "easier to add columns".
Maybe they do feel it's a reasonable business decision. In that case they shouldn't be surprised if a lot of their users make the equally reasonable business decision to reduce their exposure to Github.
A lot of people have started depending on github for more than just stashing source code some place centrally accessible as they're working on it. If github takes a lax attitude toward uptime then I suspect people will start looking for alternatives.
It's shocking that they don't at least have a read replica of their system in another 'AZ'. That's cloud hosting 101, and being self-hosted isn't an excuse to skimp on this.
If an outage caused 2 hours of read-only access to repos it would still be moderately impactful, but at least we could still build our Go code.
Yeah, they gloss over it but at its heart, keeping mission-critical servers in a single datacenter with no redundancy is among the most common and amateur infrastructure failures. Many would expect a company like GitHub to have anticipated and prevented it. GitHub should have a process to ensure that all services are redundant before they get pushed to production.
For all that work to be done in just two hours is amazing, especially with degraded internal tools, and both hardware and ops teams working simultaneously.
We should collectively be using incidents like this as an opportunity to learn, much like the GitHub team does. Our entire industry is held back by the lack of knowledge sharing when it comes to problem response and the fact that so many companies are terrified of being transparent in the face of failure.
This is very well written retrospective that gives us a glimpse into the internal review that they conducted. Imagine how much we could collectively learn if everyone was fearless about sharing.
Is there a timeline to how long it took them to figure out Redis was down? Because having experienced the same, you get an alert. Cool. HA-Proxy says app servers are down. Ok. You SSH in and see that everything looks ok but the processes are bouncing. You tail the logs to find out why (obviously lots of these steps could be optimized). Within a few seconds you spot the error connecting to Redis. A minute later you've verified the Redis hosts are offline.
That's the first 5 minutes after getting to a computer.
After that it doesn't really matter why they're down. You failover, get the site back up and worry about it later.
Are these systems on a SAN? That's probably the first mistake if so. Redis isn't HA. You're not going to bounce its block devices over to another server in the event of a failure. That's just a complex, very expensive strategy that introduces a lot of novel ways to shoot yourself in the face. If you're hosting at your own data-center, you use DAS with Redis. Cheaper, simpler. I've never seen an issue where a cabinet power loss caused a JBOD failure (I'm sure it happens, but it's far from a common scenario IME), but then again, locality matters. Don't get overly clever and spread logical systems across cabinets just because you can.
Being involved with this sort of thing more frequently than I'd like to admit, I don't know the exact situation here, but 2h6m isn't necessarily anything to brag about without a lot more context.
What's pretty shameful is that a company with GitHub's resources isn't drilling failover procedures, is ignoring physical segmentation as an availability target (or maybe just got really really unlucky; stuff happens), and doesn't have a backup data-center with BGP or DNS failover. This is all stuff that (in theory if not always in practice), many of their clients wearing a "PCI Compliant" badge are already doing on their own systems.
Is it really "shameful"? Running systems like this at scale is hard. We're not talking about redundant power systems for an ICU Ward in a hospital. We're talking about a website which powers a sliver of the first world.
You bet they busted their ass to get this fixed and shared their learnings with us. I'm extremely grateful for this and yeah it inconvenienced my morning but nothing more.
You make it sound so easy. If it takes the Github folks 2 hours, I can bet it would've taken us much longer.
I learned they had an unfortunate power outage. Then it took about two hours to determine that the Redis servers weren't booting, and to failover.
That's honestly pretty unimpressive no matter how you slice it.
Setting up a Redis server is easy. Sharding it is easy. Setting up slave systems is easy.
What's hard is, like most things in life and tech, planning. Planning for failure. Practicing failure. Not by insisting you need a monkey-army, which sounds cool and fun, but by having a staging environment which is mundane and boring. Pull the power plug on a server. See how long it takes to identify the culprit and recover. Figure out what you could have done to identify the issue quicker. Figure out how you could recover quicker.
You're running Redis, and you haven't even set up slave systems in different cabinets? You've never tried pulling the plug on the JBOD to see what might happen? These servers apparently weren't even pinging. Why did it take more than 10 seconds to find out they weren't running? Why wasn't there a dashboard for basic system/process status across all services? Is GitHub's operations budget really that tight?
Setting up a pair of OpenBSD boxes with pf and HA-Proxy is easy. Making sure CARP picks up on the failover system when the primary fails is pretty easy. Scheduling it, testing it so you know it's actually going to work when you need it to, is the hard part. Holding people accountable for its uptime when the log says that hasn't been done this month is the hard part.
Setting up some FreeBSD boxes with DAS is easy. Giving each shard its own slaves is easy. Monitoring the slaves to make sure they're actually useful is hard. Making sure you have a one-way failover script fire when the CARP interface fails over is easy. Snapshotting the ZFS filesystems is easy. Simulating breaks by pulling a power plug is hard. Write some garbage to the FS. Unplug the DAS. Unplug the host. Force "up but non-responsive" situations, figure out how to identify them, and how to recover from them.
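Even the "are the slaves actually useful" check can start crude and still catch the worst cases; a sketch with redis-rb, hypothetical hostnames, and an arbitrary staleness threshold:

```ruby
require "redis"  # redis-rb gem

# Hypothetical replicas, one per shard, ideally in different cabinets.
REPLICAS = [
  { host: "redis-shard1-replica.internal", port: 6379 },
  { host: "redis-shard2-replica.internal", port: 6379 },
]

REPLICAS.each do |node|
  begin
    info = Redis.new(**node).info("replication")
    healthy = info["role"] == "slave" &&
              info["master_link_status"] == "up" &&
              info["master_last_io_seconds_ago"].to_i < 30   # arbitrary threshold
    warn "STALE OR DETACHED replica #{node[:host]}" unless healthy
  rescue Redis::BaseError => e
    warn "UNREACHABLE replica #{node[:host]}: #{e.class}"
  end
end
```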
Doing anything reliable with iSCSI is hard. Even Amazon has a poor reputation for it. Do your best to avoid it, and never ever ever buy into vendor promises that it'll allow you to just remount your block devices on a different host and never have to worry about data or downtime. IME. YMMV.
Using runit to keep your Ruby processes up: Easy. Proper logging, monitoring and alerts because processes are just bouncing constantly? Well, it's not rocket science. But it definitely takes discipline.
There are people a hell of a lot smarter than me that have been doing this for a lot longer. But some of this stuff is just flat out not acceptable. I've been on those angry-client phone calls. "Inadvertent" should be a trigger word to anyone in Ops. It really means somebody didn't do something they should have and it bit them in the ass (at least that's what it meant when I said it). Would the app have started fine if boot.rb didn't attempt to connect to Redis? That seems pretty far-fetched. So that sounds like a red herring. But what do I know.
Setting these things up is not the hard part. Identifying that a host isn't responding to ping isn't the hard part. Planning, procedures, discipline and execution, day in and day out when things aren't on fire to prepare for the day that they are, that's the hard part.
Maybe Github has a much smaller Ops team and budget than I would've imagined. Maybe this should be a wake-up call to the CEO. I don't know. These are just my observations from what was said, and having deja-vu thinking about lessons I had to learn the hard way.
BTW, I'm not necessarily trying to bag on the guy getting paged at 3AM. Been there, done that. It sucks. You do the best you can. If anything that I'm saying has a ring of truth to it, then it's a leadership issue. And it's not about "give smart people things and get out of their way blah blah bullshit". This sort of stuff doesn't materialize out of thin air and good intentions.
OTOH it's crazy that Github is single-homed and can't afford an F5. :shrug:
BTW:
> I can bet it would've taken us much longer
One of the worst experiences of my professional life was 72 sleepless hours, mostly in a 50F data-center, trying to figure out why the SAN would sporadically drop off servers every other minute. Turns out somebody, not me for a change, set the MTU on the switches to 9000. So whenever a max-frame was used BOOM.
But yeah, I've been through Redis failures. It didn't take me two hours to get things going again. Load-balancer failures. Database failures (backup sure, but never ever plan to actually use it; it's plan Z at best). NFS failures as well. Though those might be pushing it with heads that take almost 10 minutes just to reboot.
From the outside 2 hours seems like a very long time to identify and recover from a Redis failure. (Knock on wood.) And it sounds like they didn't have Warm systems standing by for failover? That's bad...
In addition to recognizing the speed at which a non-obvious downtime was remedied, I would personally like to thank GitHub for the detailed technical report being released. Far too many companies release statements that were clearly written by or edited by PR people. Most companies just piss off their customers by releasing generic press releases that don't give us any idea of what happened. Downtime is inevitable; what matters is being open and honest about such problems, and offering insight into what can be improved for the future.
So thank you GitHub, please keep up the good work!
I don't know enough about server infrastructure to comment on whether or not Github was adequately prepared or reacted appropriately to fix the problem.
But wow it is refreshing to hear a company take full responsibility and own up to a mistake/failure and apologize for it.
Like people, all companies will make mistakes and have momentary problems. It's normal. So own up to it and learn how to avoid the mistake in the future.
As I said in another comment, the fact that they found an 8 minute delay from outage to status page update to be unacceptable speaks volumes to how much they value their relationship with their customers.
As an aside, I feel that I'm quite fortunate to work in the EST timezone, as their outage apparently started at about 7pm my time. We have a general rule at my company to not deploy after 6pm unless an emergency fix absolutely needs to go up.
I saw the title of the story and said to myself, what outage? :P
Does Github run anything like Netflix's Simian Army against its services? As a company by engineers, for engineers, at the scale that github has reached, I'm a bit surprised they don't have a bit more redundancy. Though they may not need the uptime of netflix, an outage of more than a few minutes on github could affect businesses that rely on the service.
Google "Netflix downtime" for evidence that Netflix also has outages. Google has outages, sometimes very significant ones of Google Apps. Facebook has outages.
Complex systems fail. Period. All the time. Things like the Simian Army are fantastic tools that help you identify a host of problems and remediate them in advance, but they cannot test every combinatorial possibility in a complex distributed system.
At the end of the day, the best defense is to have skilled people who are practiced at responding to problems. GitHub has those in spades, which is why they could respond to a widespread failure of their physical layer in just over 2 hours.
The biggest win with the Simian Army isn't that it improves your redundancy. It's that it gives your people opportunities to _practice_ responses.
More than practicing responses, Chaos Monkey and Failure Injection Testing allow us to verify that we don't have unexpected hard dependencies. Sometimes you find out that your service can't start if another one becomes latent, in which case you can plan for it by adding redundancy/extra capacity, fallbacks or working in degraded mode.
Every time I read about a massive systems failure, I think of Jurassic Park and am mildly grateful that the velociraptor paddock wasn't depending on the system's operation.
This just shows how difficult it is to avoid hidden dependencies without a complete, cleanly isolated testing environment of sufficient scale to replicate production operations and run strange system fault scenarios somewhere that won't kill production.
It turns out that it's hard even then. Complex systems, by their very nature, fail in unexpected and unpredictable ways. If that weren't bad enough, hindsight bias makes it way too easy for us to look back with perfect knowledge and opine, "That was so obvious, how could they have missed such a rudimentary issue?"
I'm not sure what part of servers failing to POST is especially complex or related to distributed computing.
For all the fawning over being provided technical details, this article was pretty light on them.
I don't think Github going down for a couple hours is that big of a deal TBH. But it does seem to expose a few really basic failings in their DR planning IMO.
I also think it's ridiculous that some commenters are trying to frame this as a distributed computing problem. It's not even a clustering problem (apparently). It's just looking at the iDRAC or whatever to see why the server isn't getting past POST and putting your recovery plan into action.
This is white box vanilla stuff that happens to everybody.
That servers had to be rebuilt as part of DR says a lot.
The fact that there was a Redis dependency during bootstrap? Probably a good thing. You know as well as anyone, I'm sure, that the last thing you want is a bunch of processes that only look like they're up. And even if they could start without erroring when their Redis connections are unavailable, if Redis is used for caching, what's that going to do to availability? Would it be a good thing to have the processes up if they can only handle 10% of the usual load?
Those are details that aren't there.
But complex distributed computing problem this is not. Not as it was presented anyways.
Just a note: if you don't indent your links they'll be made clickable by the markup engine, which is convenient in general and especially for those of us on smart phones. :)
It is not precisely wrong, and thanks for tricking me into opening an obscene picture at work, asshole.
The internet is designed to be highly fault tolerant, because it was based on an arpanet project to design a network that would NOT go down, even if there was damage to a significant percentage of nodes.
The "asshole" in this case is JWZ, [randomly?] switching on the Referer header. Apparently he has a hard-on for HN; he's not the only one, but I won't be linking to his site again. (Although, is that really "obscene"? It doesn't do anything for me?) Try this instead, since Stanford are unlikely to engage in such shenanigans:
It's funny, my original comment had the links in plaintext so copying-and-pasting was required and Referer wasn't involved. I changed that on request. b^)
> ... Updating our tooling to automatically open issues for the team when new firmware updates are available will force us to review the changelogs against our environment.
That's an awesome idea. I wish all companies published their firmware releases as simple RSS feeds, so everyone could easily integrate them with their trackers.
(If someone's bored, that may be a nice service actually ;) )
This was one of the toughest things about admining hardware clusters. Firmware updates (and firmware issues) are so hard to track down. It's so annoying. I remember spending a week tracking down an issue with a RAID controller and then spending another day or two on the phone with the vendor trying to get a firmware update so we did not have 2 racks of hardware sitting on a ticking time-bomb.
I played with the idea of an automated software update reporting site ages ago - it'd read RSS feeds and scrape websites for the required info. It'd probably need adjustments for each hardware manufacturer / product though, and regular updating. But that could possibly be part of an open source project, giving the firmware maintainers the opportunity to help out too.
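A minimal version of the feed-reading half might look something like this (Ruby's stdlib rss plus the octokit gem; the feed URL, repo name, and state file are placeholders, and the scraping fallback for vendors without feeds is where the real per-vendor work would live):

```ruby
require "rss"        # stdlib
require "open-uri"   # stdlib
require "octokit"    # GitHub API client gem

# Placeholders: a vendor's firmware release feed and the internal repo
# where ops review issues get filed.
FEED_URL = "https://vendor.example/firmware/releases.rss"
REPO     = "example-org/ops-tracker"
SEEN     = "seen_firmware_items.txt"

seen   = File.exist?(SEEN) ? File.readlines(SEEN, chomp: true) : []
github = Octokit::Client.new(access_token: ENV["GITHUB_TOKEN"])

feed = RSS::Parser.parse(URI.open(FEED_URL).read)
feed.items.each do |item|
  next if seen.include?(item.link)
  github.create_issue(REPO,
                      "Review firmware release: #{item.title}",
                      "Changelog: #{item.link}\n\nCheck against our fleet.")
  File.open(SEEN, "a") { |f| f.puts(item.link) }
end
```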
> Remote access console screenshots from the failed hardware showed boot failures because the physical drives were no longer recognized.
I'm getting flashbacks. All of the servers in the DC reboot and NONE of them come online. No network or anything. Even remotely rebooting them again we had nothing. Finally getting a screen (which is a pain in itself) we saw they were all stuck on a grub screen. Grub detected an error and decided not to boot automatically. Needless to say we patched grub and removed this "feature" promptly!
You can very clearly see two kinds of people posting on this thread: those who have actually dealt with failures of complex distributed systems, and those who think it's easy.
"We identified the hardware issue resulting in servers being unable to view their own drives after power-cycling as a known firmware issue that we are updating across our fleet."
Tell us which vendor shipped that firmware, so everyone else can stop buying from them.
And/or they want to maintain a working relationship with said vendor. Going nuclear is a good way of getting _exactly_ the minimum level of service that your SLA specifies.
It doesn't stop you from getting sued, though, it merely stops you from losing. It's pretty reasonable to want to avoid a lawsuit you're absolutely certain you could win.
Vendors very seldom sue customers for publicly saying their product is defective. The negative publicity tends to backfire. Legal action can backfire even worse. If the vendor claims the product isn't defective, they have to prove that in court to win a libel action. That means discovery and examination of the company's internal documents and the complaints of other customers, all on the record.
I feel it was a good incident for the Open Source community, to see how dependent we are on GitHub today. I feel sad whenever I see another large project like Python moving to GitHub, a closed-source company. I know GitLab is there as an alternative, but I would love to see all the big Open Source projects putting pressure on GitHub to open their source code, as right now they are a big player in open source, like it or not.
Git is a distributed version control system. Github is simply a place to host a repository and some issues. There is nothing stopping anyone from pushing to another remote hub for redundancy.
So you want Github to open source where they put your git repo and issues? Who cares about that? It's unimportant because regardless they're still the central endpoint to many open source projects, opened or closed source. If you want open source use Gitlab or any other service that sprinkles extra features around git.
I'll never understand this outrage of dependence on Github when you have a distributed version control system. It's not like it should be on github to setup third party repositories for you.
From a developers point of view, you're right. But there are package management systems and other stuff depending on being able to download from Github.
Of course, Github isn't to blame for this, rather the ones who thought Github would be great to use as a CDN.
I agree with you that this is an issue for the package manager. A package manager should never have such a single point of failure. I know bower, for example, would fail (at least last I checked, unless they've added a caching system since then), but it at least can point to specific git repositories, so it can be moved if necessary.
Honestly if anyone is working on something important enough where they must be able to rebuild at a moment's notice then they should either be checking in dependencies from these package managers or setting up their own copies of what they need. But that's like backing up; most don't realize they needed to be doing that until they have an issue so I can understand the anger when Github does go down.
During the outage I was not able to push and that's all. I can build everything locally and push to Heroku from my machine. You can always add an upstream repository to Bitbucket or something like that.
In my case at least it's not outrage; it's just that I would love to feel that they are more responsive to their community, and I think I'm not the only one who thinks the GitHub community would gladly help make the product even better. I see it as a win-win situation. Maybe not, and it's just business... dunno.
If github open sourced all of their stuff, this still wouldn't prevent issues like this for the projects that want to use a hosted service instead of hosting it themselves, and many projects don't want to host these services themselves anymore.
When I worry about dependency on GitHub, I'm thinking about not the inconvenient hours of downtime but the larger threat that they might disappear or turn evil.
What I would like to see, even more than an open-source GitHub, would be a standard for spreading over more services. For instance, syncing code, issues, pull requests, wiki, pages, etc. between self-hosted GitLab and gitlab.com, or between gitlab.com and github.com. Further, I'd like to see it be easier to use common logins across services.
I don't think we can rely on Github giving us this, but if GitLab would add it between gitlab.com and gitlab ce, that would be a compelling reason to think of switching.
The problem with git is that it's decentralized but has no model for automatic consistency. Which means we've basically ignored the issue in tooling.
That said we probably don't need to solve it - we just need a way to keep read remotes online that's separate from any one server. The DHT git project was a good move in the right direction.
Was it a good incident to see how dependent we are on GitHub? Every time there's a GitHub outage, a vocal group of people will voice their opinions that we are too dependent on GitHub, we should be using open source alternatives, GitHub should be open source, etc. Then, within a few days, everybody goes silent and we return to our normal lives.
I don't think outages at GitHub are very frequent. This one was lengthy, so it's definitely been on a lot of peoples' minds, but this conversation always comes up when it happens.
And yet, some of the entitlement around this outage is incredible. It's as though the community's desire to see Github online is far more relevant than the lack of SLAs and thousand-dollar service fees.
Of course it was. I don't know if everybody goes silent after a few days; it's the first outage I'm aware of. But some people at university made me see the hypocrisy of using GitHub for open source projects, and I feel that if there is a community strong enough to make some impact on GitHub, it could be Hacker News. Maybe I'm wrong.
If you look back through the years and find a few other stories of "GitHub is down", you'll see that this conversation happens every time. Some people tread into the HackerNews thread and say "More people should be using self-hosted GitLab instances" or "if GitHub would just open source their code, we wouldn't need to be so dependent."
But then the conversation stops within days because, the fact is, hosting your own git servers and getting people to actually use them is a huge pain in the ass. More simply put: people just like using GitHub.
Furthermore, GitHub's a business. They're selling private repositories. They do open source quite a bit of code, but they're not going to open source their actual product.
> hosting your own git servers and getting people to actually use them is a huge pain in the ass. More simply put: people just like using GitHub.
Given that this is the case, I feel like GitHub is entrenched enough that they could open source their codebase and not lose any customers. People are paying them for the convenience of someone else hosting their git repository.
Technically you don't need a server. You just need access to remotes. That could be your team's repos. Then PRs just mean pulling from a coworker to your repo.
But yes, most of us are most comfortable with the central repository model.
Sure! And I think it's a great business and I'm a happy customer too. My comment wasn't against the enterprise at all!
I just wanted to point out that now that you guys are well established and have a huge impact on the open source community, adopting a more open approach with your end users could be very beneficial for both GitHub and the user base. I'm sure that a lot of people would contribute to your codebase, and things like this: https://github.com/dear-github/dear-github would be less frequent and notorious.
But if the enterprise edition is really the source of income, open sourcing it doesn't make any sense. I agree on that. Maybe another way to be more open to contributions from the community? I dunno.
> I feel sad whenever I see another large project like Python moving to GitHub, a closed-sourced company.
What would you rather have? A dependency on a bunch of projects with variable hosting of whatever means, or all your dependencies hosted with the uptime of GitHub? Having an install fail because some host is down somewhere deep in your nest of dependencies is going to happen a lot more if you have more hosts to worry about.
It must be nice to know that the majority of your customers are familiar enough with the nature of your work that they'll actually understand a relatively complex issue like this. Almost by definition, we've all been there.
If only Bitbucket could give such comprehensive reports. A few months back outages seemed almost daily. Things are more stable now. I hope for the long term.
Isn't BB's problem basically that there are too many users? GH's outage writeup is cool, because it's a one off and it can be analysed. When BB is just overloaded for a long time and needs more power, it's not going to be very interesting.
(unless I missed some specific non capacity related outages?)
Maybe. BitBucket was also an acquisition, so for some time I believe there was a lack of resources provided to them and a huge technical debt/integration effort required. Even now, I don't know if Atlassian actually cares much about BitBucket; they are probably more concerned about delivering Stash than BitBucket. My wild guess.
I was an active BB user a couple of years ago, and the project I worked on would hg clone from BB many times a day, so I would be the first one to notice a 503 or whatever error coming from their service. Typically I would see one or two outages per month; some lasted a few minutes, some lasted several hours. Most of the time the outage impacted git/hg checkout, so I think that was their technical bottleneck.
We use Stash and it is surprisingly not bad at all. Github is much more polished but for code browsing and review, it does that which it is supposed to do.
Generally speaking, I'd recommend AGAINST running UPSes in racks that are managed by top-tier data centres. I've had way more trouble with UPSes misbehaving than I ever have with data centres losing power. EDIT: I'd also point out that 2 hours is a long time to be running on in-rack UPSes. I've usually seen them designed to withstand about an hour, but not much more.
UPSs don't always cover everything. There are systems that are considered critical that are on UPS, and others that are considered restartable that might not be. There are a lot of tradeoffs in a data center. Having full UPS and generator backup capacity for everything gets very expensive.
I have multiple experiences with high-end DCs with dual UPS and diesel gensets experiencing power failures.
Once it involved fire alarms, which trigger safety shutdowns within a suite. The other involved a failed static switch panel - i.e., the things that aren't meant to be able to fail.
> Over the past week, we have devoted significant time and effort towards understanding the nature of the cascading failure which led to GitHub being unavailable for over two hours.
I don't mean to be blasphemous, but from a high level, are the performance issues with Ruby (and Rails) that necessitate close binding with Redis (i.e., lots of caching) part of the issue?
It sounds like the fundamental issue is not Ruby, nor Redis, but the close coupling between them. That's sort of interesting.
I don't think that Ruby/Rails has anything to do with this, really. If you want to scale any app, you're going to want to do some caching somewhere. What this boils down to is that their app has an initializer with a hard dependency on Redis. Without a connection to Redis, it will flap.
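The usual fix is to make the initializer lazy: build the client, don't connect. A rough Rails-style sketch (connection_pool gem assumed, names hypothetical, not GitHub's actual code):

```ruby
# config/initializers/redis.rb (sketch, not GitHub's actual code)
#
# Record how to build the client; don't connect here. redis-rb connects
# lazily on the first command, so booting the app no longer requires Redis
# to be reachable -- only the code paths that actually use it do.
require "redis"
require "connection_pool"  # connection_pool gem

REDIS_POOL = ConnectionPool.new(size: 5, timeout: 1) do
  Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379/0"),
            connect_timeout: 1, read_timeout: 1, write_timeout: 1)
end
```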
As someone with a fair bit of ruby+rails+redis experience, I don't think this is blasphemous, but I also don't think the performance issues of ruby/rails have anything to do with the failure. Generally you would cache/store something in redis not because your programming language or framework is slow, but because a query to another database is slow (or at least, slower than redis), or because redis data structures happen to be a good/quick way to store certain kinds of data.
I believe the fundamental issue was just that redis availability was taken for granted by app servers so that certain code paths/requests would fail if it wasn't available, rather than merely be slower.
Not all processes that involve GitHub are development processes. I've seen automated deployments fail inside a corporate network when the resident HTTP proxy had a bad day and could not connect to github.com.
So, while it sounds like they have reasonable HA, they fell down on DR.
Unrelated, but I could not comprehend what this means:
> technicians to bring these servers back online by draining the flea power to bring
I assume they mean completely disconnect the equipment from ALL external power sources. Typically even when a piece of equipment is offline in a data center, it continues to draw power, and will often keep running systems like DRAC and other management/status tools (since the whole concept of a data center is NEVER having to get up out of your chair, so even a "shutdown" system needs to be able to be remotely started).
Since the firmware had a bug, bad state could be stored, completely removing power may clear that state and appears to have done so in this case. They may have also needed to pull the backup battery, and reset the firmware settings, but I wouldn't presume that just from the term "flea power."
No, it sounds good, because it's realistic and then you can build mitigation strategies.
I was recently involved in an outage that occurred because the same datacenter was hit by lightning three times in a row. Everything was redundant up the wazoo and handled the first two hits just fine, but by the time the power went out for the third time within N minutes, there wasn't enough juice left in some of the batteries!
Now would it be possible to build an automated system that can withstand this? Probably. But would your time & money be better spent worrying about other failure modes? Almost certainly.
If your plan to avoid downtime is to prevent power outages, you're going to have downtime. All their sentence says is they can't prevent power outages. That's fine, because the other 1/nth of your servers are on a different power grid in a different state.
Whose datacenter are they in? This is the second time in less than two weeks that they've suffered a power-related issue. My company is in 4 different sites around the world and we've never lost power ever - and, if one circuit did go out, we'd still be up and running because all of our servers have redundant power supplies on separate infeed circuits.
"...but we can take steps to ensure recovery occurs in a fast and reliable manner. We can also take steps to mitigate the negative impact of these events on our users."
The lessons that giants like Netflix have learned about running massive distributed applications show that you cannot avoid failure, and instead must plan for it.
Now, having a single datacenter is not a good plan if you want to give any sort of uptime guarantee, but that's a different point to make.
My point is: they shouldn't ONLY plan on ensuring recovery occurs fast; they should also plan on having multiple data centers, which to me is more important. It's frightening to know that such an important service is only operating in a single data center.
However, their recovery report didn't mention anything about such a plan.
I completely agree that geo-redundancy is a hard requirement for a site as critical to the functioning of the internet as Github.
A generous reading of "We can also take steps to mitigate the negative impact of these events on our users." would include improvements of that sort.
That said, I also didn't spot any concrete proposals for geo-redundancy in the post-mortem. Perhaps that's a detail that will be figured out in a following exercise, or perhaps they really don't have any plans for GR, in which case the generous reading would be unwarranted.
Why not? There's no reason in principle that you can't have hot standbys that are switched over to immediately when the primary fails. Or even a no-primary setup with each cluster being master for some data and slave for others (a la Cassandra's replication model).
There may be specific aspects of Github's usecase that make this difficult, but please don't pretend that geo-redundancy is impossible. Look at Netflix's architecture for an example of a site that services traffic from multiple AZs.
I'm going to guess that these are Dell R730xd boxes with PERC H730 Mini controllers (LSI MegaRAID SAS-3 3108).
A failed/failing drive present during cold boot could cause the controller to believe there were no drives present. To add insult to injury, on early BIOS versions this made the UEFI interface inaccessible. The only way to recover from this state was to re-seat the RAID controller.
There were also two bizarre cases where the operating system SSD RAID1 would be wiped and replaced with a NTFS partition after upgrading the controller firmware (and more) on an affected system (hanging/flapping drives). Attempts to enter UEFI caused a fatal crash, but reinstall (over PXE) worked fine. BIOS upgrade from within fresh install restored it.
From the changelog:
Fixes:
- Decreased latency impact for passthrough commands on SATA disks
- Improved error handling for iDRAC / CEM storage functions
- Usability improvements for CTRL-R and HII utilities
- Resolved several cases where foreign drives could not be imported
- Resolved several issues where the presence of failed drives could lead to controller hangs
- Resolved issues with managing controllers in HBA mode from iDRAC / CEM
- Resolved issues with displayed Virtual Disk and Non-RAID Drive counts in BIOS boot mode
- Corrected issue with tape media on H330 where tape was not being treated as sequential device
- resolved an issue where Inserted hard drives might not get detected properly.
> We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code.
I seem to recall a recent post on here about how you shouldn't have such hard dependencies. It's good advice.
Incidentally, this type of dependency is unlikely to happen if you have a shared-nothing model (like PHP has, for instance), because in such a system each request is isolated and tries to connect on its own.
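The same effect is possible in a long-running app if every Redis call is treated as optional; a small sketch assuming redis-rb (the feature-flag example is made up):

```ruby
require "redis"  # redis-rb gem

REDIS = Redis.new(connect_timeout: 0.5, read_timeout: 0.5)

# Soft dependency: if Redis is unreachable, fall back to a slower or
# degraded path instead of failing the request (or the boot).
def feature_flags_for(user_id)
  REDIS.smembers("flags:#{user_id}")   # hypothetical key scheme
rescue Redis::BaseError
  []   # degraded mode: no cached flags, defaults apply
end
```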
> Because we have experience mitigating DDoS attacks, our response procedure is now habit and we are pleased we could act quickly and confidently without distracting other efforts to resolve the incident.
The thing that fixed the last problem doesn't always fix the current problem.
A power outage in the DC brought many machines down. Redis servers failed to come back up owing to disk issues (drives not recognized after the power-cycle). The reboot of the remaining machines uncovered an unknown dependency: the app servers needed the Redis cluster to be up in order to boot.
There were other learning points, such as immediately going into anti-DDoS mode, and human communication issues where the problem wasn't recognised or escalated until some time after the issues started occurring.
No CI/test process was in place for critical systems to ensure that they had no external dependencies.
Takeaway: If you run any complex system, ensure that each component is tested for its response to various degrees of failure in peer services, including but not limited to totally unavailable, intermittent connectivity, reduced bandwidth, lossy links, power-cycling peers.
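For the "totally unavailable" case, even a crude boot-time test catches the exact failure mode in this postmortem; a sketch in plain Ruby (hypothetical server command and health endpoint; a fuller harness would also inject latency and packet loss):

```ruby
# Boot the app with its Redis dependency pointed at a dead port and make
# sure it still comes up and answers a health check. The server command and
# endpoint are hypothetical; adapt to the service under test.
require "net/http"

DEAD_REDIS = "redis://127.0.0.1:6390/0"   # nothing listening here
pid = spawn({ "REDIS_URL" => DEAD_REDIS }, "bin/rails server -p 3999",
            out: File::NULL, err: File::NULL)

begin
  sleep 15   # crude; a real harness would poll until the port opens
  res = Net::HTTP.get_response(URI("http://127.0.0.1:3999/up"))
  abort "FAIL: app did not boot without Redis" unless res.is_a?(Net::HTTPSuccess)
  puts "OK: app boots and answers health checks with Redis down"
ensure
  Process.kill("TERM", pid)
  Process.wait(pid)
end
```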
No CI/test process was in place for hardware/firmware combos to ensure they recovered fine from power loss.
Takeaway: If you run a decent-sized cluster, ensure all new hardware ingested is tested through various power state transitions multiple times, and again after firmware updates. With software defined networking now the norm, we have little excuse not to put a machine through its paces on an automated basis before accepting it to run critical infrastructure.
No CI/test process was in place for status advisory processes to ensure they were sufficiently rapid, representative, and automated.
Takeaway: Test your status update processes as you would test any other component service. If humans are involved, drill them regularly.
Infrastructure was too dependent on a single data center.
Takeaway: Analyze worst case failure modes, which are usually entire-site and power, networking or security related. Where possible, never depend on a single site. (At a more abstract level of business, this extends to legal jurisdictions). Don't believe the promises of third party service providers (SLAs).
PS. I am available for consulting, and not expensive.
You could google "HA", click the Wikipedia link that shows all the things "HA" may refer to, and deduce that the most logical thing in the list, given the context, would be this link: https://en.wikipedia.org/wiki/High_availability.
The statement about having poor deductive logic skills was the more insulting part of the post (compared to ignorance of an initialism, which I think you are correct in thinking is insignificant).
I seriously doubt this version of the story. While it's possible for hardware/firmware to fail in all your datacenters, for it all to fail at the same time is highly unlikely. This may just be PR spin so people don't think they're vulnerable to security attacks.
While this was happening at Github, I noticed several other companies facing the same issue at the same time. Atlassian was down for the most part. It could have been an issue with a service Github uses, but they won't admit that. Notice they never said what the firmware issue was, instead blaming it on "hardware".
I think they should be transparent with people about such vulnerability, but I suspect they would never say so because then they would lose revenue.
They're not hosted in multiple datacenters; there was a power interruption in their single datacenter that exposed this firmware bug. The point of this postmortem isn't the initial power interruption but rather its repercussions, why it took so long to recover from and how they can improve their response and communications in the future.
Ok... so this is another PR piece without admitting the issue. I don't know Github's infrastructure, but they have a single point of failure? Last I knew, every place these days has backup power, especially a datacenter... so those weren't working either? My point is that it's much better to be upfront sometimes. In fact, Github didn't have to say anything about the whole thing since everyone had forgotten already...