Anybody having similar issues with GitHub Actions again? Jobs are failing to connect to external services, or taking an extremely long time to do so, for me.
> Update - Git Operations is experiencing degraded performance. We are continuing to investigate.
I hate that companies say "degraded performance" when they mean "it's not working at all" (even if only for some people). That's not degraded performance, that's broken.
At least the status grid says Partial Outage, but stop with the degraded performance bullshit.
Well, I wasn't able to push to one of my repositories, even though I tried many times. So it was effectively out for me, not "degraded". Partial Outage, like the status grid said, seems much more accurate.
But this goes beyond just this particular case. Remember when half the internet was down because of an AWS issue? Yet they still said "degraded performance" or similar.
Degraded performance is what it says. Things that took 20s are currently taking 60s. Things that took 60s are now taking 180s. Things are going slower. Maybe by a constant factor. Maybe exponentially.
Where this sometimes turns into a "partial outage", is that there might be response timeouts somewhere in the backend of the stack, meaning that things that took 60s are now taking [100s and then a 504 error response]. ("Because," the backend dev thinks, "why would clients want a response after 100s? They've probably given up and gone home already at that point.")
Or, equally likely, there's a response deadline somewhere in your client—where you figured "their API will never take more than 100s to respond", so even though the API is still chugging away at the 100s mark, your client cuts it off, drops the connection, and displays an error.
Either way, the literal problem at hand is still, simply, "things are going slower."
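As a concrete sketch of the client-side half of this (Python with the requests library; the URL and the 100-second figure are placeholders echoing the numbers above):

```python
import requests

# Hypothetical upstream endpoint; the 100s figure just mirrors the example above.
UPSTREAM = "https://api.example.com/v1/report"

try:
    # A client-side deadline: if no response has arrived after 100 seconds,
    # requests raises Timeout even though the upstream may still be working.
    resp = requests.get(UPSTREAM, timeout=100)
    resp.raise_for_status()
except requests.Timeout:
    # To the user this looks like "the service is down", even though the
    # backend is merely slower than it used to be.
    print("upstream timed out; surfaced to the user as an error")
```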
"Degraded performance" can help you predict whether you'll be one of the customers encountering failures, in a way that "partial outage" cannot. Were your particular querying workloads to the API fast before? You'll probably be okay. Were they slow before? Then you might experience the further slowdown as an outage.
If you want to design for fault-tolerance of this case, it's important to know how APIs you talk to deal with long-running workloads. Do they have timeouts? Are they customizable? Are they different per plan level? These are things the API's documentation should communicate.
Once you know these things—and you have metrics on how long requests to a given upstream API are taking to complete—you'll know exactly whether "degraded performance" will affect you or not, before actually getting customer reports.
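For instance, a toy version of that check (the latency samples and the 100-second documented timeout are made up):

```python
import statistics

# Illustrative numbers only: recent durations (seconds) of your calls to one
# upstream API, pulled from whatever metrics you already collect.
recent_durations_s = [18.2, 22.5, 19.7, 61.0, 58.3, 64.9, 71.2, 75.8, 69.4]

# Assumption: the upstream's docs say requests are cut off after 100 seconds.
UPSTREAM_TIMEOUT_S = 100.0

p95 = statistics.quantiles(recent_durations_s, n=20)[18]  # rough p95
slowdown_factor = 3.0  # "degraded performance" guess, e.g. 60s -> 180s

at_risk = p95 * slowdown_factor > UPSTREAM_TIMEOUT_S
print(f"p95={p95:.1f}s, headroom={UPSTREAM_TIMEOUT_S - p95:.1f}s, "
      f"would a {slowdown_factor:g}x slowdown push us past the timeout? {at_risk}")
```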
A waitress can't keep telling a customer their food is coming when they've actually lost the order ticket and have no clue what was ordered anymore, just because other customers aren't having the same issue.
Well, I guess she can in this case, and other customers will happily come to her defense!
One thing about large systems, even ones that are quite reliable overall, is that if they reported an outage whenever a single customer saw a problem, their dashboards would literally never be green. That'd be true in some sense but not very useful.
A teammate once made a "problem" light for Gmail as a joke and coded it to be lit when there were any alerts firing at any severity in the internal monitoring system. I think most people would say that Gmail is pretty reliable, but if the bulb ever went out, you could safely assume there was a problem with the bulb or the monitoring system.
That's why big systems have SLOs like (to give a simple example) 99.9% of requests succeeding rather than 100%, or 99.999% of customer data being currently available rather than 100%. (Sometimes they can get more complex like at least x% of customers have no more than y% of time periods in which more than z% of requests fail.) And if they have per-customer SLOs, those are generally much less strict than site-wide SLOs and require a certain traffic level to take effect.
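To put rough numbers on why a "green" site-wide SLO and some very unhappy individual customers can coexist (the traffic volume below is invented, not GitHub's actual target or volume):

```python
# Back-of-the-envelope error-budget math for a 99.9% success SLO over 30 days.
slo = 0.999
requests_per_day = 50_000_000
window_days = 30

total_requests = requests_per_day * window_days
allowed_failures = total_requests * (1 - slo)

print(f"{allowed_failures:,.0f} failed requests still count as 'meeting SLO'")
# ~1.5 million failures per month: plenty of individual customers can have a
# miserable day while the site-wide dashboard stays green.
```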
In today's case, I think GitHub's problems are widespread enough that whatever SLO they've set is being violated, but the point stands in general.
> Degraded performance is what it says. Things that took 20s are currently taking 60s. Things that took 60s are now taking 180s. Things are going slower. Maybe by a constant factor. Maybe exponentially.
It wasn't a performance issue though. It was literally just plain not working. This is the error I got when trying to push:
! [remote rejected] master -> master (failure)
Not after 10s, 30s, 60s -- instantly, every single time.
In really complex systems, there's the ugly case where some middleware needs to fetch something from another service to start up... and then that fetch times out. So the middleware ends up in an invalid state.
This is a distributed-systems design faux-pas. Systems that need information to start up should have some sort of burned-in or persistently-cached stale version of that information, such that they can start up without being able to reach out for it. They should then probe for the newest info until they have it. (See e.g. Elixir's tzdata library.)
If you don't do this, and instead put a synchronous remote call in the critical path for a service's startup, then you're turning "degraded performance" into a real outage condition, entirely unnecessarily.
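A minimal sketch of the cached-startup pattern (the URL, file path, and function names are invented for illustration; tzdata does something similar with bundled timezone data):

```python
import json
import urllib.request

# Invented names: a config endpoint and a cache file that ships with the
# service, or persists across restarts.
CONFIG_URL = "https://config.internal.example/v1/settings"
CACHE_PATH = "/var/lib/myservice/settings.cache.json"


def load_startup_config():
    """Startup path: read the persisted copy; never block on the network."""
    with open(CACHE_PATH) as f:
        return json.load(f)


def try_refresh_config():
    """After startup: probe for fresher settings; keep the stale copy on failure."""
    try:
        with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
            fresh = json.load(resp)
    except OSError:
        return None  # network slow or down -- keep running on cached settings
    with open(CACHE_PATH, "w") as f:
        json.dump(fresh, f)
    return fresh
```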
It's actually not that common, because you need to get to a certain scale before things like this start happening, and that scale brings with it hiring people who know about scenarios like this. But sadly, it does still happen sometimes.
> In really complex systems, there's the ugly case where some middleware needs to fetch something from another service to start up... and then that fetch times out. So the middleware ends up in an invalid state.
That's not a performance issue anymore, though. It may have been triggered by a performance issue, but for downstream services or users, it's a plain old outage.
Besides, who cares if it's "technically" a performance issue if it's unusable for you? It's the end result that's important, and if I can't use the service, it makes no difference to me what the technicalities are, only that I cannot use it. And my customers certainly don't care if my product stops working because Microsoft or Amazon or whoever have performance issues; they only care that my product stops working.
I didn't say that this case is not an outage. I was just trying to explain why GitHub wasn't reporting an outage as an outage.
Let me try again:
The reason sites report "performance degradation" in these cases is that the underlying performance degradation is what metrics can usually detect immediately and automatically, while the fact that some process got unexpectedly wedged because of it usually goes undetected for quite some time, until customers start reporting the outage. And before the dashboard can be switched to "suspected outage", it first has to be determined whether any of those customer reports describe a real "outage", and not just a "my own client is programmed to give up on the request after 100s" problem.
Each of these weird edge-case distributed-systems failures is usually fixed — stopped from ever happening again — after the first time it gets triggered. Which means that every time one of these happens, it's the first time the system has failed in that particular way. So there's never any automated monitoring in place to notice that the system has failed in that particular way — as that would imply that the failure was in some way predictable.
As such, while yes, the outage is real, I don't blame a company for not reporting the outage as such immediately, or for quite a while after it happens. Sometimes it's not clear right up until the root cause is resolved, that it actually was a partial customer-facing outage. Sometimes all your third-party downtime detector apps just aren't doing the right things to hit the code path / get routed through the middlebox that's failing for some customers; and all the real-customer reports getting through to you are vague and muddled.
(One thing that can be immensely helpful to a SaaS's ops team, is to have at least one highly-technical company as a customer who reliably only reports "upstream fault" problems, rather than PEBKAC problems. If that canary customer is complaining, then the outage is likely real. They're better than any downtime detector service — and, surprisingly, often faster, too! Sadly, most SaaS businesses have literally no single such customer.)
> ("Because," the backend dev thinks, "why would clients want a response after 100s? They've probably given up and gone home already at that point.")
Or worse: they might have given up on this request and issued another one. When you have a problem that might be caused by (or exacerbated by) overload and might cause heavy retrying (by the user and/or automatically at one or more layers), you want to give up on the requests that are least likely to matter. Deadlines are a standard way to do that. Other ways include propagating cancellation through the stack, sheddable request categories, and queue disciplines like adaptive LIFO.
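As a toy, single-threaded sketch of deadline-based shedding combined with adaptive LIFO (the backlog threshold is arbitrary, and a real implementation would be concurrent and far more careful):

```python
import time
from collections import deque

# Shed requests whose deadline has already passed, and switch to LIFO when a
# backlog builds up, so the freshest requests (whose callers are most likely
# still waiting) get served first.
BACKLOG_THRESHOLD = 100  # arbitrary; the "adaptive" part of adaptive LIFO

_queue = deque()  # entries are (deadline, request)


def enqueue(request, deadline):
    _queue.append((deadline, request))


def next_request():
    while _queue:
        # Under heavy backlog serve newest-first (LIFO), otherwise FIFO.
        deadline, request = _queue.pop() if len(_queue) > BACKLOG_THRESHOLD else _queue.popleft()
        if time.monotonic() < deadline:
            return request
        # Past its deadline: the caller has likely retried or given up, so
        # doing this work now would only add load. Drop it.
    return None
```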
Anyway, I agree with everything you wrote, and I want to point out that whatever deadlines they have in the system are probably helping more than they're hurting. Of course I can't be sure of that without knowing the nature of the outage.
Please disregard the following off-topic rant; I'm not sure if I'm just getting older and more jaded, or if GitHub (and the web at large, if I might be frank) is having these issues more often as of late.
Is it that what we are doing is becoming more complex, or are we, as a web, moving faster and breaking more things? I sometimes wonder whether the advancement and affordability of our hardware, and of the systems we can run on it, has eroded optimization and correctness as requirements for shipping software. Basic functions used to have to be combined with great rigor to run on the same machine for swaths of users. Now we are so far from that, with 1000 layers of abstraction, that it isn't hard to imagine how lots of tiny inefficiencies stacked on top of each other may be adding up to bigger issues.
In my org we have a term for this, where you're encumbered by faults deep in the toolchain - getting 'middlewared'.
On the other hand, I don't look at a temporary road closure due to maintenance or an accident as an indictment of civil engineering or the highway system.
The analogy falls apart when you say 'maintenance'. Maintenance tends to have fallback detour paths, is expected, and isn't a 'deal breaker' the way this kind of outage is.
This is closer to an 'accident'. And if accidents are frequent at a road location, it _is_ considered an indictment of the civil engineering of that location.
Good point. But we have brought in so many systems that the analogy might be: we are routinely using the highway system to move content from one room to another in our home.
The issue here is one of scale, not necessarily intentional complexity - though admittedly one often leads to the other.
In a world of single servers - or even multiple servers load-balanced over a single database - you only have a single point of failure. And while that is often touted as a terrible thing (once it goes down, it's game over), in some ways our scaled-up distributed systems have a worse problem.
The issue is we envision our distributed systems optimistically - as isolated little bubbles that receive and emit clean, efficient, fault tolerant packets. When one goes down, only a few others get affected, and the broken module can be easily re-deployed with very little fuss.
In reality, the people with the vision quickly become horrified with what is produced. The authentication system touches everything. The stock management system gets hooked directly into the homepage. Suddenly, every single part of your system is hinged on at least 5 servers, and if any of them go down, it's game over until one of them gets brought back up, and by that point the job management queue is so backed up with critical tasks (that should probably have never been entrusted to a background jobs queue) that it takes hours for the organism to re-balance itself.
And so, these systems go down more often, because instead of one single point of failure, you have 5 - while telling yourself you have none.
The question is, how can we do this better? I've seen big teams and small teams alike struggle with this. I guess Netflix, with their chaos monkey strategy, wins on this front, but even that is not infallible - as you say, with 1000 layers of abstraction (and goodness knows how many unique combinations of interactions and potential simultaneous failures), something is bound to be missed from time to time.
So I don't think the problem is we're moving faster or breaking more things. It's just that we don't know how broken our systems are until they break.
While the web is quite complex now, I think the expectation of 100% uptime for anything is unreasonable.
My local taco shop ran out of rice a few weeks ago. A few days ago they had some unspecified kitchen issue and had to close. They don't open on Sundays.
Occasional downtime is normal. We can try to prevent it, as devs, working with machines, but it is inevitable. Nothing continues working forever. All industries have downtime, it needs to be accepted and planned for.
Even in the old days, with the simplest possible thing that needed a db connection, your traffic could outgrow that db, or your bandwidth, or whatever, and the site would crash. It was easier back then in a lot of ways - 1000 people visit your new site? Crash. Before auto-scaling infra you had to either deal with it or pay for too much hardware. I'm sure there are many other examples.
> All industries have downtime, it needs to be accepted and planned for.
And those that can't - e.g. those with lives at risk if something fails - have very slow, conservative, and burdensome processes for changes and approvals, processes that would never be accepted if the risk didn't demand them.
Indeed, tech moves too quickly for those processes generally. Those who have great processes in tech - and I guess Github is one of those with great processes despite the occasional downtime - pay a lot for the uptime.
Critical industries plan for failure. Pilots and astronauts plan for catastrophic failure (and it happens). Hospitals sometimes have fires or run out of oxygen or beds or have equipment failures and have to plan for that.
Everybody assumes tech will just keep working in the SaaS world. At the end of the day though, the costs of the average service having downtime (even if you calculate that Google makes 2 million per minute) are generally pretty low and are usually recouped after the service comes back up.
> My local taco shop ran out of rice a few weeks ago. A few days ago they had some unspecified kitchen issue and had to close. They don't open on sundays.
And mine has never run out of rice. They are closed on Sunday and Monday. I don’t monitor them enough to know how many 9’s, but I’ve probably eaten there 200 times in 9 years and never experienced an outage.
Some places suck more than others. But my expectation is 100% uptime when it comes to rice at my taco shop.
I think assuming competence is healthy, as long as I’m not a jerk when it happens to fail. I love my taco shop and would survive if they didn’t have an ingredient once. But they work really hard to keep everything working.
Not to ignore the truth in the other comments or yours, but CI/CD services are also in the midst of a battle with crypto mining. As long as people can turn energy and CPU cycles into cash, they are going to abuse any run-what-you-like service like GitHub Actions.
There have been multiple incidents longer than an hour in just the last few months.
It's worse in some ways than, say, 5 years ago, because... GH offers more services. And more people end up relying on those, so there's more that can go wrong. And when it does, it usually has a bigger impact.
Some client projects have tied everything to GH/Actions/CI/deploy/etc. When there's any issue with GH, something is affected. Other clients have more distributed services (bitbucket/gh/gitlab/etc), which is ... sometimes less convenient, but usually means something is still up and functioning even if something else is interrupted.
The title of the incident has been updated to: "Incident with GitHub Actions, API Requests, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks". Basically all core functionality is down at the moment.
I also noticed that starring repos was not working. At that time, a couple of hours ago, only GitHub Actions was listed as having problems on githubstatus. So I was unsure whether starring was broken for more people or just me. And I was also thinking about how much cruft has been added to the GitHub web UI lately, ever since the acquisition of GitHub by Microsoft. And I yearn for GitHub as it used to be. But the greatest value in GitHub for me is browsing and starring stuff that others have made, so it's not like I can just set up my own instance either. But I wonder about the future of GitHub. I think it will be more and more geared towards big enterprise users, and less and less the place it used to feel like for individual developers.
I was on the phone with them getting a demo of Enterprise Managed Users at the time and the demo was failing too, ha. So we looked at the status page together and saw all yellow.
Doesn't matter, I wanted EMUs without the demo anyway. Still was funny to happen when I was talking to an engineer at the time.
I could see things like this messing up launch date/times occasionally, but what workplace falls apart if you can't push code to github for a few hours? Surely you can do other work while waiting for them to fix things.
One finding itself in a particular rush to put out a fix, I suppose. I’d say “you should have back up ways to deploy” but I can certainly see why a lot of workflows get centralized on a provider like GitHub.
I know you're being flippant, but if this was Subversion I couldn't commit, checkout other branches, or really do anything. With git I can do everything except push and pull from Github.
Wait what!? Can't you just reconfigure your .git configs to temporarily use another repo locally?
Seriously: for example, copying and then transforming any one of your local .git repos into a "bare" repository takes like two commands. Anyone with a Linux machine can do it. Then all the devs can temporarily use that repo to push/pull/whatever.
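Something like this, roughly (written out via Python's subprocess purely to spell out the git invocations; you'd normally just type them into a shell, and the paths/host here are invented):

```python
import subprocess

LOCAL_CLONE = "/home/me/project"            # any dev's existing working copy
BARE_PATH = "/srv/git/project.git"          # where the shared bare repo will live
FALLBACK_URL = "devbox.example.com:/srv/git/project.git"

# 1. On the shared machine: make a bare copy of an existing clone.
subprocess.run(["git", "clone", "--bare", LOCAL_CLONE, BARE_PATH], check=True)

# 2. On each dev machine: add a temporary remote and keep pushing/pulling.
subprocess.run(["git", "remote", "add", "fallback", FALLBACK_URL], check=True)
subprocess.run(["git", "push", "fallback", "master"], check=True)
```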
I think if you're to the point where you're emailing patches, it's worth it.
Emailing patches is a classic git flow, used e.g. by the Linux kernel. Git even includes commands to format patches as emails and send them. Nothing wrong with it!
Is it not possible to use 'runner' to run GitHub Actions locally? The docs for this seem extremely sparse - probably because it competes with the paid option?
I really wish there was some pressure to have these status pages be somewhat "realistic" and responsive rather than hoping that temporary problems go away so they don't have SLA questions.
I checked after none of my pushed commits showed up in PRs, and it was "Webhooks" only. That's not useful.
Which seems like an understatement on their part - shouldn't everything be 'red' if all services are down? If not, what kind of outage does it take to get to red?
I was invited to a party Saturday hosted by the GH CEO, and while this is likely just a huge coincidence, I wonder if some key engineer(s) came down with covid. The friend who invited me said there would be rapid testing at the party.