Anybody having similar issues with GitHub Actions again? Jobs are failing to connect to external services, or taking an extremely long time to do so, for me.
> Update - Git Operations is experiencing degraded performance. We are continuing to investigate.
I hate that companies say "degraded performance" when they mean "it's not working at all" (even if only for some people). That's not degraded performance, that's broken.
At least the status grid says Partial Outage, but stop with the degraded performance bullshit.
Well, I wasn't able to push to one of my repositories, even though I tried many times. So it was effectively out for me, not "degraded". Partial Outage, like the status grid said, seems much more accurate.
But this goes beyond just this particular case. Remember when half the internet was down because of an AWS issue? Yet they still said "degraded performance" or similar.
Degraded performance is what it says. Things that took 20s are currently taking 60s. Things that took 60s are now taking 180s. Things are going slower. Maybe by a constant factor. Maybe exponentially.
Where this sometimes turns into a "partial outage", is that there might be response timeouts somewhere in the backend of the stack, meaning that things that took 60s are now taking [100s and then a 504 error response]. ("Because," the backend dev thinks, "why would clients want a response after 100s? They've probably given up and gone home already at that point.")
Or, equally likely, there's a response deadline somewhere in your client—where you figured "their API will never take more than 100s to respond", so even though the API is still chugging away at the 100s mark, your client cuts it off, drops the connection, and displays an error.
Either way, the literal problem at hand is still, simply, "things are going slower."
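As a concrete sketch of the client-side half of this (Python with the requests library; the URL and the 100-second figure are placeholders echoing the numbers above):

```python
import requests

# Hypothetical upstream endpoint; the 100s figure just mirrors the example above.
UPSTREAM = "https://api.example.com/v1/report"

try:
    # A client-side deadline: if no response has arrived after 100 seconds,
    # requests raises Timeout even though the upstream may still be working.
    resp = requests.get(UPSTREAM, timeout=100)
    resp.raise_for_status()
except requests.Timeout:
    # To the user this looks like "the service is down", even though the
    # backend is merely slower than it used to be.
    print("upstream timed out; surfaced to the user as an error")
```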
"Degraded performance" can help you predict whether you'll be one of the customers encountering failures, in a way that "partial outage" cannot. Were your particular querying workloads to the API fast before? You'll probably be okay. Were they slow before? Then you might experience the further slowdown as an outage.
If you want to design for fault-tolerance of this case, it's important to know how APIs you talk to deal with long-running workloads. Do they have timeouts? Are they customizable? Are they different per plan level? These are things the API's documentation should communicate.
Once you know these things—and you have metrics on how long requests to a given upstream API are taking to complete—you'll know exactly whether "degraded performance" will affect you or not, before actually getting customer reports.
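For instance, a toy version of that check (the latency samples and the 100-second documented timeout are made up):

```python
import statistics

# Illustrative numbers only: recent durations (seconds) of your calls to one
# upstream API, pulled from whatever metrics you already collect.
recent_durations_s = [18.2, 22.5, 19.7, 61.0, 58.3, 64.9, 71.2, 75.8, 69.4]

# Assumption: the upstream's docs say requests are cut off after 100 seconds.
UPSTREAM_TIMEOUT_S = 100.0

p95 = statistics.quantiles(recent_durations_s, n=20)[18]  # rough p95
slowdown_factor = 3.0  # "degraded performance" guess, e.g. 60s -> 180s

at_risk = p95 * slowdown_factor > UPSTREAM_TIMEOUT_S
print(f"p95={p95:.1f}s, headroom={UPSTREAM_TIMEOUT_S - p95:.1f}s, "
      f"would a {slowdown_factor:g}x slowdown push us past the timeout? {at_risk}")
```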
A waitress can't keep telling a customer their food is coming when they've actually lost the order ticket and have no clue what was ordered anymore, just because other customers aren't having the same issue.
Well, I guess she can in this case, and other customers will happily come to her defense!
One thing about large systems, even ones that are quite reliable overall, is that if they reported an outage whenever a single customer saw a problem, their dashboards would literally never be green. That'd be true in some sense but not very useful.
A teammate once made a "problem" light for Gmail as a joke and coded it to be lit when there were any alerts firing at any severity in the internal monitoring system. I think most people would say that Gmail is pretty reliable, but if the bulb ever went out, you could safely assume there was a problem with the bulb or the monitoring system.
That's why big systems have SLOs like (to give a simple example) 99.9% of requests succeeding rather than 100%, or 99.999% of customer data being currently available rather than 100%. (Sometimes they can get more complex like at least x% of customers have no more than y% of time periods in which more than z% of requests fail.) And if they have per-customer SLOs, those are generally much less strict than site-wide SLOs and require a certain traffic level to take effect.
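To put rough numbers on why a "green" site-wide SLO and some very unhappy individual customers can coexist (the traffic volume below is invented, not GitHub's actual target or volume):

```python
# Back-of-the-envelope error-budget math for a 99.9% success SLO over 30 days.
slo = 0.999
requests_per_day = 50_000_000
window_days = 30

total_requests = requests_per_day * window_days
allowed_failures = total_requests * (1 - slo)

print(f"{allowed_failures:,.0f} failed requests still count as 'meeting SLO'")
# ~1.5 million failures per month: plenty of individual customers can have a
# miserable day while the site-wide dashboard stays green.
```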
In today's case, I think GitHub's problems are widespread enough that whatever SLO they've set is being violated, but the point stands in general.
> Degraded performance is what it says. Things that took 20s are currently taking 60s. Things that took 60s are now taking 180s. Things are going slower. Maybe by a constant factor. Maybe exponentially.
It wasn't a performance issue though. It was literally just plain not working. This is the error I got when trying to push:
! [remote rejected] master -> master (failure)
Not after 10s, 30s, 60s -- instantly, every single time.
In really complex systems, there's the ugly case where some middleware needs to fetch something from another service to start up... and then that fetch times out. So the middleware ends up in an invalid state.
This is a distributed-systems design faux-pas. Systems that need information to start up should have some sort of burned-in or persistently-cached stale version of that information, such that they can start up without being able to reach out for it. They should then probe for the newest info until they have it. (See e.g. Elixir's tzdata library.)
If you don't do this, and instead put a synchronous remote call in the critical path for a service's startup, then you're turning "degraded performance" into a real outage condition, entirely unnecessarily.
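A minimal sketch of the cached-startup pattern (the URL, file path, and function names are invented for illustration; tzdata does something similar with bundled timezone data):

```python
import json
import urllib.request

# Invented names: a config endpoint and a cache file that ships with the
# service, or persists across restarts.
CONFIG_URL = "https://config.internal.example/v1/settings"
CACHE_PATH = "/var/lib/myservice/settings.cache.json"


def load_startup_config():
    """Startup path: read the persisted copy; never block on the network."""
    with open(CACHE_PATH) as f:
        return json.load(f)


def try_refresh_config():
    """After startup: probe for fresher settings; keep the stale copy on failure."""
    try:
        with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
            fresh = json.load(resp)
    except OSError:
        return None  # network slow or down -- keep running on cached settings
    with open(CACHE_PATH, "w") as f:
        json.dump(fresh, f)
    return fresh
```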
It's actually not that common, because you need to get to a certain scale before things like this start happening, and that scale brings with it hiring people who know about scenarios like this. But sadly, it does still happen sometimes.
> In really complex systems, there's the ugly case where some middleware needs to fetch something from another service to start up... and then that fetch times out. So the middleware ends up in an invalid state.
That's not a performance issue anymore, though. It may have been triggered by a performance issue, but for downstream services or users, it's a plain old outage.
Besides, who cares if it's "technically" a performance issue if it's unusable for you? It's the end result that's important, and if I can't use the service, it makes no difference to me what the technicalities are, only that I cannot use it. And my customers certainly don't care if my product stops working because Microsoft or Amazon or whoever have performance issues; they only care that my product stops working.
I didn't say that this case is not an outage. I was just trying to explain why GitHub wasn't reporting an outage as an outage.
Let me try again:
The reason sites report "performance degradation" in these cases is that the underlying performance degradation is what metrics can usually detect immediately and automatically, while the fact that some process got unexpectedly wedged because of it usually goes undetected for quite some time, until customers start reporting the outage. And before the dashboard can be switched to "suspected outage", it first has to be determined whether any of those customer reports describe a real "outage", and not just a "my own client is programmed to give up on the request after 100s" problem.
Each of these weird edge-case distributed-systems failures is usually fixed — stopped from ever happening again — after the first time it gets triggered. Which means that every time one of these happens, it's the first time the system has failed in that particular way. So there's never any automated monitoring in place to notice that the system has failed in that particular way — as that would imply that the failure was in some way predictable.
As such, while yes, the outage is real, I don't blame a company for not reporting the outage as such immediately, or for quite a while after it happens. Sometimes it's not clear right up until the root cause is resolved, that it actually was a partial customer-facing outage. Sometimes all your third-party downtime detector apps just aren't doing the right things to hit the code path / get routed through the middlebox that's failing for some customers; and all the real-customer reports getting through to you are vague and muddled.
(One thing that can be immensely helpful to a SaaS's ops team, is to have at least one highly-technical company as a customer who reliably only reports "upstream fault" problems, rather than PEBKAC problems. If that canary customer is complaining, then the outage is likely real. They're better than any downtime detector service — and, surprisingly, often faster, too! Sadly, most SaaS businesses have literally no single such customer.)
> ("Because," the backend dev thinks, "why would clients want a response after 100s? They've probably given up and gone home already at that point.")
Or worse: they might have given up on this request and issued another one. When you have a problem that might be caused by (or exacerbated by) overload and might cause heavy retrying (by the user and/or automatically at one or more layers), you want to give up on the requests that are least likely to matter. Deadlines are a standard way to do that. Other ways include propagating cancellation through the stack, sheddable request categories, and queue disciplines like adaptive LIFO.
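As a toy, single-threaded sketch of deadline-based shedding combined with adaptive LIFO (the backlog threshold is arbitrary, and a real implementation would be concurrent and far more careful):

```python
import time
from collections import deque

# Shed requests whose deadline has already passed, and switch to LIFO when a
# backlog builds up, so the freshest requests (whose callers are most likely
# still waiting) get served first.
BACKLOG_THRESHOLD = 100  # arbitrary; the "adaptive" part of adaptive LIFO

_queue = deque()  # entries are (deadline, request)


def enqueue(request, deadline):
    _queue.append((deadline, request))


def next_request():
    while _queue:
        # Under heavy backlog serve newest-first (LIFO), otherwise FIFO.
        deadline, request = _queue.pop() if len(_queue) > BACKLOG_THRESHOLD else _queue.popleft()
        if time.monotonic() < deadline:
            return request
        # Past its deadline: the caller has likely retried or given up, so
        # doing this work now would only add load. Drop it.
    return None
```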
Anyway, I agree with everything you wrote, and I want to point out that whatever deadlines they have in the system are probably helping more than they're hurting. Of course I can't be sure of that without knowing the nature of the outage.
Please disregard the following off-topic rant; I'm not sure if I'm just getting older and more jaded, or if GitHub (and the web at large, if I might be frank) is having these issues more often as of late.
Is it that what we are doing is becoming more complex, or are we, as a web, moving faster and breaking more things? I sometimes wonder whether the advancement and affordability of our hardware, and of the systems we can run on it, has eroded optimization and correctness as requirements for shipping software. Basic functions used to have to be combined with great rigor to run on the same machine for swaths of users. Now we are so far from that, with 1000 layers of abstraction, that it isn't hard to imagine how lots of tiny inefficiencies stacked on top of each other may be adding up to bigger issues.
In my org we have a term for this, where you're encumbered by faults deep in the toolchain - getting 'middlewared'.
On the other hand, I don't look at a temporary road closure due to maintenance or an accident as an indictment of civil engineering or the highway system.
The analogy falls apart when you say 'maintenance'. Maintenance tends to have fallback detour paths, is expected, and isn't a 'deal breaker' the way this kind of outage is.
This is closer to an 'accident'. And if accidents are frequent at a road location, it _is_ considered an indictment of the civil engineering of that location.
Good point. But we have brought in so many systems that the analogy might be: we are routinely using the highway system to move content from one room to another in our home.
The issue here is one of scale, not necessarily intentional complexity - though admittedly one often leads to the other.
In a world of single servers - or even multiple servers load-balanced over a single database - you only have a single point of failure. And while that is often touted as a terrible thing (once it goes down, it's game over), in some ways our scaled-up distributed systems have a worse problem.
The issue is we envision our distributed systems optimistically - as isolated little bubbles that receive and emit clean, efficient, fault tolerant packets. When one goes down, only a few others get affected, and the broken module can be easily re-deployed with very little fuss.
In reality, the people with the vision quickly become horrified with what is produced. The authentication system touches everything. The stock management system gets hooked directly into the homepage. Suddenly, every single part of your system is hinged on at least 5 servers, and if any of them go down, it's game over until one of them gets brought back up, and by that point the job management queue is so backed up with critical tasks (that should probably have never been entrusted to a background jobs queue) that it takes hours for the organism to re-balance itself.
And so, these systems go down more often, because instead of one single point of failure, you have 5 - while telling yourself you have none.
The question is, how can we do this better? I've seen big teams and small teams alike struggle with this. I guess Netflix, with their chaos monkey strategy, wins on this front, but even that is not infallible - as you say, with 1000 layers of abstraction (and goodness knows how many unique combinations of interactions and potential simultaneous failures), something is bound to be missed from time to time.
So I don't think the problem is we're moving faster or breaking more things. It's just that we don't know how broken our systems are until they break.
While the web is quite complex now, I think the expectation of 100% uptime for anything is unreasonable.
My local taco shop ran out of rice a few weeks ago. A few days ago they had some unspecified kitchen issue and had to close. They don't open on Sundays.
Occasional downtime is normal. We can try to prevent it, as devs, working with machines, but it is inevitable. Nothing continues working forever. All industries have downtime, it needs to be accepted and planned for.
Even in the old days, with the simplest possible thing that needed a db connection, your traffic could outgrow that db, or your bandwidth, or whatever, and the site would crash. It was easier back then in a lot of ways - 1000 people visit your new site? Crash. Before auto-scaling infra you had to either deal with it or pay for too much hardware. I'm sure there are many other examples.
> All industries have downtime, it needs to be accepted and planned for.
And those that can't - e.g. those with lives at risk if something fails - have very slow, conservative, and burdensome processes for changes and approvals, processes that would never be accepted if the risk didn't demand them.
Indeed, tech moves too quickly for those processes generally. Those who have great processes in tech - and I guess Github is one of those with great processes despite the occasional downtime - pay a lot for the uptime.
Critical industries plan for failure. Pilots and astronauts plan for catastrophic failure (and it happens). Hospitals sometimes have fires or run out of oxygen or beds or have equipment failures and have to plan for that.
Everybody assumes tech will just keep working in the SaaS world. At the end of the day though, the costs of the average service having downtime (even if you calculate that Google makes 2 million per minute) are generally pretty low and are usually recouped after the service comes back up.
> My local taco shop ran out of rice a few weeks ago. A few days ago they had some unspecified kitchen issue and had to close. They don't open on sundays.
And mine has never run out of rice. They are closed on Sunday and Monday. I don’t monitor them enough to know how many 9’s, but I’ve probably eaten there 200 times in 9 years and never experienced an outage.
Some places suck more than others. But my expectation is 100% uptime when it comes to rice at my taco shop.
I think assuming competence is healthy, as long as I’m not a jerk when it happens to fail. I love my taco shop and would survive if they didn’t have an ingredient once. But they work really hard to keep everything working.
Not to ignore the truth in the other comments or yours, but CI/CD services are also in the midst of a battle with crypto mining. As long as people can turn energy and CPU cycles into cash, they are going to abuse any run-what-you-like service like GitHub Actions.
There have been multiple incidents longer than an hour in just the last few months.
It's worse in some ways than, say, 5 years ago, because... GH offers more services. And more people end up relying on those, so there's more that can go wrong. And when it does, it usually has a bigger impact.
Some client projects have tied everything to GH/Actions/CI/deploy/etc. When there's any issue with GH, something is affected. Other clients have more distributed services (bitbucket/gh/gitlab/etc), which is ... sometimes less convenient, but usually means something is still up and functioning even if something else is interrupted.
The title of the incident has been updated to: "Incident with GitHub Actions, API Requests, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks". Basically all core functionality is down at the moment.
I also noticed that starring repos was not working. At that time, a couple of hours ago, only GitHub Actions was listed as having problems on githubstatus. So I was unsure whether starring was broken for more people or just me. And I was also thinking about how much cruft has been added to the GitHub web UI lately, ever since the acquisition of GitHub by Microsoft. And I yearn for GitHub as it used to be. But the greatest value in GitHub for me is browsing and starring stuff that others have made, so it's not like I can just set up my own instance either. But I wonder about the future of GitHub. I think it will be more and more geared towards big enterprise users, and less and less the place it used to feel like for individual developers.
I was on the phone with them getting a demo of Enterprise Managed Users at the time and the demo was failing too, ha. So we looked at the status page together and saw all yellow.
Doesn't matter, I wanted EMUs without the demo anyway. Still was funny to happen when I was talking to an engineer at the time.
I could see things like this messing up launch date/times occasionally, but what workplace falls apart if you can't push code to github for a few hours? Surely you can do other work while waiting for them to fix things.
One finding itself in a particular rush to put out a fix, I suppose. I’d say “you should have back up ways to deploy” but I can certainly see why a lot of workflows get centralized on a provider like GitHub.
I know you're being flippant, but if this was Subversion I couldn't commit, checkout other branches, or really do anything. With git I can do everything except push and pull from Github.
Wait what!? Can't you just reconfigure your .git configs to temporarily use another repo locally?
Seriously: for example, copying and then transforming any one of your local .git repos into a "bare" repository takes like two commands. Anyone with a Linux machine can do it. Then all the devs can temporarily use that repo to push/pull/whatever.
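Something like this, roughly (written out via Python's subprocess purely to spell out the git invocations; you'd normally just type them into a shell, and the paths/host here are invented):

```python
import subprocess

LOCAL_CLONE = "/home/me/project"            # any dev's existing working copy
BARE_PATH = "/srv/git/project.git"          # where the shared bare repo will live
FALLBACK_URL = "devbox.example.com:/srv/git/project.git"

# 1. On the shared machine: make a bare copy of an existing clone.
subprocess.run(["git", "clone", "--bare", LOCAL_CLONE, BARE_PATH], check=True)

# 2. On each dev machine: add a temporary remote and keep pushing/pulling.
subprocess.run(["git", "remote", "add", "fallback", FALLBACK_URL], check=True)
subprocess.run(["git", "push", "fallback", "master"], check=True)
```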
I think if you're to the point where you're emailing patches, it's worth it.
Emailing patches is a classic git flow, used e.g. by the Linux kernel. Git even includes commands to format patches as emails and send them. Nothing wrong with it!
Is it not possible to use 'runner' to run GitHub Actions locally? The docs for this seem extremely sparse - probably because it competes with the paid option?
I really wish there was some pressure to have these status pages be somewhat "realistic" and responsive rather than hoping that temporary problems go away so they don't have SLA questions.
I checked after none of my pushed commits showed up in PRs, and it was "Webhooks" only. That's not useful.
Which seems like an understatement on their part - shouldn't everything be 'red' if all services are down? If not, what kind of outage does it take to get to red?
I was invited to a party Saturday hosted by the GH CEO, and while this is likely just a huge coincidence, I wonder if some key engineer(s) came down with covid. The friend who invited me said there would be rapid testing at the party.