The Azure status page has more information. I suggest updating the link.
> A severe weather event, including lightning strikes, occurred near one of the South Central US datacenters. This resulted in a power voltage increase that impacted cooling systems. Automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power down process.
Wow, I didn't tie these two events together until reading this comment. The flash floods last night were quite awful. mySA, a local news site (which I don't necessarily trust), has said that the daily rainfall total was 3x its historic record in the 1800s. [0]
It's always quite fascinating whenever cloud platforms like this have "leaky abstractions." GCP had a very long storage service degradation today, as well. [1] I don't know if it's related.
Hmmm. That's quite the jump. Daily rainfall of 6.07" vs 1889 record of 1.76". But hey, just a fluke, no doubt. Or evidence for iffy reporting.
> SAN ANTONIO - Heavy rain caused flash flooding Monday night in northern Bexar County and extending into Comal County. Some areas had up to 9 inches of rain, and water rose on some roadways, including Interstate 10.
> Tuesday morning, the National Weather Service said via social media that parts of San Antonio had up to 6.07 inches of rain, which "smashes" the daily rainfall record of 1.76 inches from 1889.
> Over 9 inches of rain was observed around Stone Oak Parkway, and more than 8 inches between Shavano Park and Camp Bullis, according to the NWS website.
Visual Studio Online has been offline all day. They say it is due to the same Azure outage. This has had a productivity impact.
If Microsoft didn't own GitHub, this might have prompted a move, but since they do, it seems a little redundant given that GitHub will likely be on Azure too before long.
In my experience git is excellent at being offline. I'm not sure whether we would have had problems if our Team Foundation setup had ever gone offline, but we moved to git long ago because it's what every single IT-based student learns in my country. With every potential hire knowing git and almost none of them knowing TFS/VSTS, it seemed silly not to migrate, since both platforms work well for what we need and git was cheaper overall.
You can use git outside of GitHub and Microsoft, though. I mean, you could always use Bitbucket.
So much this. All the pipeline and delivery stuff is glued into GitHub's API. Most companies aren't really using git anymore, they're using GitHub, and git is just a dependency. I've even seen developers who are unaware that git can exist without GitHub.
I incidentally think this is a completely terrible outcome.
It doesn't really matter what service provider you use. You can't do anything that requires a sync if the server is down. You can do a bit of local development, but things like automated tests and deployment to staging servers will tend to break if they can't access the source.
I think it's unlikely. GitHub was on the cloud and then moved to their own infrastructure a while back. In fact, they have their own provisioning framework and all that fun stuff.
I doubt they will move back to Azure, or any cloud for that matter. It's the same story with Dropbox and similar companies. Once you're past a certain point in scale, and depending on the case (for example, needing to control data security to obtain certain certifications), it's essential to have your own infrastructure.
Microsoft owns Azure, Microsoft owns GitHub. Azure is Microsoft's own infrastructure. Remember, kids: there is no cloud, it's just somebody else's computer.
For me, it's even crazier to rely on a third party provider to be able to do my work.
I wouldn't be comfortable at all knowing that my development infrastructure is beyond my control.
Not intended to be snarky: why is this crazy to you? All cloud providers have had downtime incidents, as have the major hosted VCS providers and SaaS products. Downtime is a fact of life in tech.
Hrm. I've worked at a large tech firm for more than a decade, and there has never been a day when VCS or the build farm was down all day. It's notable when either is down for more than twenty minutes.
As counterpoint, I’ve seen banks down for extended periods of time, hours occasionally stretching to a day or two (TSB, LLoyds, Bank Of America, BankSimple [BBVA]). Downtime is a fact of life. Google, Amazon, IBM, and Microsoft have had major cloud outages. GitLab nuked their production DB. Slack and Reddit are frequently down.
Unless it’s life critical (911, air traffic control), if it’s down its only going to hamper productivity, but it’ll be back eventually. Time to stretch and get a coffee, and if it’s all day, going home and we’ll start fresh tomorrow.
We’re not saving lives, we’re just building websites. Downtime isn’t shameful, it happens to all of us.
If a single bank is down, or Reddit (lol), then the impact is fairly limited, but if one of the 3 major cloud providers, which powers large parts of the internet, is down for an entire day, then the impact is a little bit more critical, I would say ;).
There's a reason why Azure has an SLA and Reddit doesn't ;)
Also, if you start comparing the big companies with GitLab, then we don't have to continue talking. It's not OK to nuke your production DB, that's why everyone in the tech scene laughs about GitLab, and comparing them to Azure is like comparing a Lego house to brick and mortar.
I received a notification just now in my Visual Studio 2017 Community Edition saying that I "have a license for Visual Studio Enterprise 2017. Get it now". When I click on it, the web page asks me to sign in, which returns a 502 error code.
They have some services that are "global", i.e. not tied to a given region. Those services' requests are actually processed all over the place, but South Central is a big datacenter, the 9th biggest in the world, apparently. When it lost cooling and shut down, everything routed around it as planned... but it caused so much extra traffic that it overwhelmed the connections to other datacenters. The backlog of requests is tremendous, of course, so even after they got South Central back up, all the other datacenters are way over their traffic capacity. They've got the datacenter back up, and are now restoring storage and storage-dependent services.
Honestly it's hard to imagine a good mitigation for this. "Build more datacenters" is already happening as fast as it can. "Keep enough spare capacity around to handle losing one of the biggest datacenters in the world" is pretty unreasonable.
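One thing that does help on the client side is making sure retries don't pile onto whatever capacity survives: capped exponential backoff with jitter. A minimal sketch (the call_service callable is just a placeholder for whatever request is being retried):

    import random
    import time

    def call_with_backoff(call_service, max_attempts=6, base=0.5, cap=30.0):
        """Retry a flaky call with capped exponential backoff and full jitter,
        so synchronized retries don't hammer already-overloaded capacity."""
        for attempt in range(max_attempts):
            try:
                return call_service()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts; surface the failure
                # Sleep a random amount between 0 and min(cap, base * 2^attempt).
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

    # e.g. result = call_with_backoff(lambda: client.get_blob("some-key"))

It doesn't fix the capacity problem, but it does stop every queued caller from retrying in lockstep the moment things come back.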
If you, as a customer, are uptime-focused enough that it's worth paying extra, then the sensible practice has always been cross-cloud infrastructure/failovers, at least since the Amazon Easter failure of 2011. That's what giants like Netflix do.
> "Keep enough spare capacity around to handle losing one of the biggest datacenters in the world" is pretty unreasonable.
Err what?
It's entirely reasonable to expect Azure to handle the loss of a single DC and not have a 14+ hour global outage. I don't care how big the DC is, losing one should not take out the world, especially not for the length of time this one has been going on.
With that said, though, it sounds like the size of this datacenter is way out of scale compared to the rest of their DCs. They are really going to need to break apart the services they host there to make sure that DC-to-DC and region-to-region failover works correctly.
> It's entirely reasonable to expect Azure to handle the loss of a single DC and not have a 14+ hour global outage.
(I apologize if the following sounds snarky. I don't mean it that way, I just can't find better wording.)
Microsoft has repeatedly violated my sense of "reasonable" in the past, including in recent times with Windows 10. Therefore this kind of glitch isn't very shocking to me.
> Honestly it's hard to imagine a good mitigation for this.
Besides the one that AWS and GCP have implemented? That is, to have at least N+1 datacenters? Actually, I think N+1 is the old Google prod regime. I suspect that GCP is at least N+1 per continental region, and I'd be surprised if AWS isn't as well.
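As a back-of-the-envelope check, N+1 just means a region's steady-state load has to fit on whatever capacity survives losing its largest datacenter. A toy sketch with made-up numbers, to show why one outsized DC is the problem:

    def utilization_after_loss(capacities, lost_dc, load):
        """Fraction of surviving capacity consumed after one datacenter is lost."""
        surviving = sum(c for name, c in capacities.items() if name != lost_dc)
        return load / surviving

    # Hypothetical region: one outsized DC plus three ordinary ones.
    region = {"dc-a": 100, "dc-b": 40, "dc-c": 40, "dc-d": 40}
    load = 120  # steady-state demand, in the same arbitrary units

    print(utilization_after_loss(region, "dc-b", load))  # ~0.67: plenty of headroom
    print(utilization_after_loss(region, "dc-a", load))  # 1.00: zero headroom, requests queue

Lose a normal-sized DC and the region absorbs it; lose the outsized one and there is no headroom left, which is roughly what the backlog description upthread sounds like.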
The worst part has been the poor communication. If they gave clearer insight from the get-go, that'd give me more confidence and patience. Saying "check back in 2 hours" isn't useful.
Having worked for a cloud provider: the reason they are saying that is that they are actively working to understand and fix the problem but haven't come to a well-reasoned solution, and thus they cannot give you a decent time estimate, because you will probably get even more mad if they under- or over-estimate the time it takes to fix it.
If they said exactly that, "they are actively working to understand and fix the problem but haven't come to a well-reasoned solution, and thus they cannot give you a decent time estimate, because you will probably get even more mad if they under- or over-estimate the time it takes to fix it", I would thank and applaud them. Tell me what it is you're doing, at least. Why don't you understand the problem? What are you investigating? Some transparency goes a long way for me.
I used to think like this too - e.g. I was happy when our national rail started announcing the cause of delays. But then a friend of mine was complaining that they did this, because he didn't want to be troubled with their internal problems - "just tell me what to do".
When your customer's demands are so directly opposed, you're somewhat caught between a rock and a hard place.
Hmm, yeah, I guess in this case that is true. Probably a matter of not having been in this situation often, and thus not having designed the status page that extensively.
Exactly, this happens even in a normal production failure. I don't know what else they could have said/communicated. Not to mention this is the 7th largest data center in the world; resolving the problem likely took/is taking a long, long time just because there are so many machines. I was lucky that the only outage effect I've suffered is that my storage is locked, which means I can't add new files/edit code in production... but that's much better than it being down completely. My databases are geo-redundant, so that was a blessing today.
How often do we do this for our customers? Honestly, if a customer wanted to know that much detail about my process, I would be quite bothered, and I suspect most people would be as well. Wait for the recap, everything will become clear then. You don't need to know everything in the moment.
I'm not asking for every detail. It's horrible to wait 20 hours and counting for unknown reasons and to not have something to work towards. I only ask for more transparency is all.
It's akin to waiting for surgery and the doctor saying "we're working on it". I don't want/need the details for the surgery, but tell me everything is ok, what comes next and give me some estimates to set expectations.
That is a top 'did you know' factoid that I am sure I will tell others.
But do AT&T engineers carry T-Mobile phones?
If yes, then they should put together a deal so that none of the on-call engineers have to worry about running up big bills using their phones. When there are freak weather events, they are all in it together.
And Amazon. I recall that during their most recent S3 outage, Amazon's status page was green across the board, because somehow all the assets that were supposed to be displayed when things went wrong were themselves hosted on the thing that was down.
Some thought they were spreading eggs across multiple baskets by replicating across regions and trusting Azure was doing the same for their “global” services.
As a consumer, the lesson here is that Azure is one big basket. It would probably be prudent to think of AWS and GCS as single baskets too.
PM for VSTS here. The final scale unit in South Central US was brought back online a few hours ago, which means that the final accounts that were affected should now be operational. We're still restoring package management to some accounts, but otherwise, you should be back to working. Please feel free to reach out to me if you're still having trouble. Email is my HN username @ microsoft.com. We're _very_ sorry for this very significant outage.
Like any major service, I'm sure they do. But, also like any other service, how well all that resiliency holds up in the real world is a separate question. And today we have an answer.
Anyway I have a good guess as to what most of the employees there are going to be doing for the next six months.
Just today I was having issues with the Prettier extension in VS Code, and I uninstalled it to see if that would fix it (I read that usually fixes the issues I was having). Then I realized that I couldn't install it again because VS Marketplace was down.
This was like 8 hours ago and still no signs of recovery. Of course, all my builds are failing because of some stupid formatting issue that Prettier usually would solve, so yeah..thanks MSFT.
When you have a style guide, test for it. If the test fails, fail the style-linter job and don't allow the change to be accepted.
It has failed to meet your code acceptance criteria, after all.
If you find you are making your code unreadable just to pass, then your style guide is wrong. That needs fixing, not the CI job.
If you hit an urgent "this needs to merge, style rules be damned" change, allow your senior team members to overrule the style CI job and merge it anyway.
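Something like this minimal gate is what I mean (Python only for illustration; prettier --check assumes a reasonably recent Prettier, and ALLOW_STYLE_OVERRIDE is a made-up escape hatch for the senior override):

    import os
    import subprocess
    import sys

    def style_gate():
        """Fail the CI job when files don't match the Prettier config,
        unless a senior reviewer has explicitly set the override flag."""
        if os.environ.get("ALLOW_STYLE_OVERRIDE") == "1":
            print("Style gate overridden by a senior reviewer; letting the change through.")
            return 0
        # Runs the locally installed Prettier, so the gate doesn't depend on
        # the VS Code extension or the VS Marketplace being up.
        result = subprocess.run(["npx", "prettier", "--check", "src/"])
        if result.returncode != 0:
            print("Formatting doesn't match the style guide; run prettier --write locally.")
        return result.returncode

    if __name__ == "__main__":
        sys.exit(style_gate())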
I assume their build system checks formatting and will raise an error if it doesn't conform, and this person would use the VS Code extension to auto-format their code.
I'm questioning the soundness of such a setup in general, and especially if it means that losing connection to a third-party prettifier makes you unable to work on your own codebase.
Hmm, the alternatives are not enforcing a shared code style at all, or enforcing it earlier on (e.g. on commit). I can understand why they would not want the former, and the latter is more annoying when experimenting, i.e. when code style does not matter that much yet. Thus, enforcing it in CI sounds like the right choice.
I can add an ignore-formatting line, but then people would ask why I did that. We have a style guide to increase readability and make code reviews easier.
Is it a sound setup? I personally think it is. It promotes good standards and can sometimes even help catch some issues related to structure and code flow.
The outage appears to be ongoing, and it's having ripple effects in other regions (the management portal is unresponsive, and autoscale and other services are not firing in West US, at least for me).
I'm also unable to lodge a support ticket because the portal fails to identify me as having paid support (that API request appears to time out).
As of 07:15 UTC on 5 September 2018, the Azure status message read:
"NEXT UPDATE: The next update will be provided by 07:00 UTC 05 Sep 2018 or as events warrant."
As I finished writing this, they finally updated with essentially the same message, except stalling for an additional two hours.
So, if you're thinking "well, surprises happen": yeah, and Microsoft is not actually prepared for them at all, so it sucks to be their customer, I guess?
Google doesn't have anywhere near the cloud presence of Amazon and Microsoft... maybe one day when they do, we can properly compare them. Given Google's small size/role in the space, it's impossible to gauge whether this is true.
Google Apps is not GCS. If you wanted to compare that number, you'd have to throw Office 365 into the Azure number as well, and it would dwarf Google.
Google has ~3% market share compared to Microsoft's ~28% and Amazon's ~40%. Not even in the same league at the moment. Google is more on par with IBM and Rackspace, for now. Google will undoubtedly make strides in the space, but they haven't been tested.
> Google has ~3% market share compared to Microsoft's ~28% and Amazon's ~40%.
Where does this number come from?
If it is based on the revenue reported, be very careful with Microsoft's numbers. They report a lot of products as "Azure intelligent cloud", including Office suite subscriptions, on-premise server licences, and software (Windows, SQL Server) licensing revenue from other cloud providers in that number.
Pretty soon their claimed growth is going to flatten out, because they won't be able to find any more revenue, like PC hardware, to report as "Azure intelligent cloud" ...
People have been throwing this conspiracy theory around for years...even though you can look at Microsoft's earnings reports and see they are broken out as different line items.
AWS and Azure have had "big" outages because people actually use them. Rackspace and IBM are almost neck and neck with Google's best efforts (3% market share vs. 30%/40% for Azure/AWS) [0].
The difference I see is that AWS doesn't seem to have suffered a non-region-specific service outage yet.
Whether or not it's related to the South Central US region outage, today Azure Active Directory was down for ~4h. Because it's not running in a particular region, no HA or DR strategy, other than having your services be deployed to multiple cloud providers, would have mitigated the outage.
I'm sure services I use run on all of the major cloud providers, but if that page was supposed to be enlightening, I only recognized one brand from the first page of customers.
I refreshed a couple times, and sure, I saw more (on average 1 or 2) that I recognized on each page. But I don't think your response is particularly persuasive. Are you suggesting that the services that I use that run on AWS are in fact, not services I actually use?
Or am I not a consumer? I'm confused.
Edit: Do you hold any Alphabet/Google stock? I've noticed your comment history trends toward dismissing criticism of Google, praising their products, and taking opportunities to speak about the flaws of their top competitors.
In the future, if you or anyone runs across an account that has been abusing HN in this way, it would be much better to email us at hn@ycombinator.com so we can look into it. I happened to randomly see your comment, but the odds of that are low.
>I'm sure services I use run on all of the major cloud providers, but if that page was supposed to be enlightening, I only recognized one brand from the first page of customers.
So the first page was supposed to be indicative of all of the popular consumer-facing services they host? Here, let me help you out: Spotify, eBay, Twitter, Apple iCloud, Verizon, Vimeo, Netflix, etc.
>I refreshed a couple times, and sure, I saw more (on average 1 or 2) that I recognized on each page. But I don't think your response is particularly persuasive. Are you suggesting that the services that I use that run on AWS are in fact, not services I actually use?
What popular consumer services were on AWS again?
>Edit: Do you hold any Alphabet/Google stock? I've noticed your comment history trends toward dismissing criticism of Google, praising their products, and taking opportunities to speak about the flaws of their top competitors.
Do you own Microsoft stock? Because quite a few of your posts seem to praise their products and services. Do you work for them?
We've banned this account. All it has done is aggressively post pro-Google comments and diss Google competitors.
Single-purpose accounts are not allowed here, especially not when pushing an agenda, and most of all not when pushing corporate propaganda. Of all the things that make HN users angry, that's at the top. And I agree with them.
Most of the time we tell HN users that they're not allowed to accuse each other of astroturfing. When we do find a clear-cut case of abuse that's been getting away with it for this long, I get pretty steamed.
You've also frequently broken the site guidelines by being uncivil, so much so that we've warned you at least half a dozen times. That's more than enough reason to ban you in its own right.
Disclaimer: I work at GCP but am not speaking for them. I also wish a speedy recovery for our colleagues at Azure: an outage like this can only result from many things going sideways simultaneously, and both the cause and recovery can be complicated in ways that flippant "well why didn't you just N+1 it" commenters here on HN can only guess at.
To be clear, by "global" (around the whole globe/earth), I mean that a specific service is broken in every public region. You seem to have a wider interpretation of "global".
"Google Compute Engine VM instances allocated with duplicate internal IP addresses, stopped instances networking are not coming up when started." - 22 hours
Newly-launched instances, or instances that were stopped and started, received duplicate IP addresses. A mitigation was provided 4.5 hours in, but the issue was only resolved after 22 hours, and customers may still have had to fix individual instances. As far as I remember, this was global, and there is nothing on the status page indicating it was limited to one region or a subset of regions. So, for 4.5 hours, if you needed to create a VM with working networking, you couldn't, anywhere on GCP, and no mitigation was available. Do you not consider this to be "global"?
"Instances using Local SSD might experience VM failures. This affects GCE VMs globally. No data corruption has been observed." - 5 hours
The original claim was:
> So AWS has had some big outages, as has Azure. Has GCP had any big outages yet?
I said:
> GCP has had multiple many-hour (6+) GLOBAL outages in the past year. I think it's at about 3 so far this year.
So, maybe it's only 2 major global outages, or maybe it's 3 global problems of 5+ hours each, but the only way anyone can claim that Google hasn't had any big outages is if they don't have enough market share for a global outage to affect many websites or end-users.
The biggest AWS outage in the past 3 years was S3 being down in one region for 2 hours in Feb 2017 ...
No other regions were affected, except for global APIs (e.g. create S3 bucket), which you shouldn't rely on in your critical path.
Many new customer features have been delivered to allow mitigation of this kind of failure (e.g. cross-region S3 replication).
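Cross-region replication, for instance, is a one-time bucket configuration. A rough boto3 sketch (bucket names and the IAM role ARN are placeholders, and versioning needs to be enabled on both buckets first, if I remember correctly):

    import boto3

    s3 = boto3.client("s3")

    # Replicate every object written to the source bucket into a bucket in
    # another region, so a single-region S3 incident isn't fatal.
    s3.put_bucket_replication(
        Bucket="my-source-bucket",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Prefix": "",
                    "Status": "Enabled",
                    "Destination": {"Bucket": "arn:aws:s3:::my-replica-bucket"},
                }
            ],
        },
    )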
AWS had a power failure at one data center in us-east-1 earlier this year, which had very little impact (basically only customers who didn't have sufficient redundancy across other AZs were affected).
We all ditched the Unix model of a central server with dumb terminals because Microsoft told us the future was everybody having a full OS on their workstation. Now they tell us the future is going back 30 years and having all of our data and programs on somebody else's machines (theirs).
https://azure.microsoft.com/en-us/status/