Microsoft Azure suffers outage after cooling issue (datacenterdynamics.com)
210 points by pcunite on Sept 4, 2018 | 108 comments



The Azure status page has more information. I suggest updating the link.

> A severe weather event, including lightning strikes, occurred near one of the South Central US datacenters. This resulted in a power voltage increase that impacted cooling systems. Automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power down process.

https://azure.microsoft.com/en-us/status/


I wonder if that's one of their facilities down here in San Antonio. Was getting flash flood alerts on my phone all night and morning.


Wow, I didn't tie these two events together until reading this comment. The flash floods last night were quite awful. mySA, a local news site (which I don't necessarily trust), has said that the daily rainfall total was 3x its historic record in the 1800s. [0]

It's always quite fascinating whenever cloud platforms like this have "leaky abstractions." GCP had a very long storage service degradation today, as well. [1] I don't know if it's related.

[0] https://www.mysanantonio.com/news/weather/article/Several-re...

[1] https://status.cloud.google.com/incident/storage/18003

edited for formatting


Hmmm. That's quite the jump. Daily rainfall of 6.07" vs 1889 record of 1.76". But hey, just a fluke, no doubt. Or evidence for iffy reporting.

> SAN ANTONIO - Heavy rain caused flash flooding Monday night in northern Bexar County and extending into Comal County. Some areas had up to 9 inches of rain, and water rose on some roadways, including Interstate 10.

> Tuesday morning, the National Weather Service said via social media that parts of San Antonio had up to 6.07 inches of rain, which "smashes" the daily rainfall record of 1.76 inches from 1889.

> Over 9 inches of rain was observed around Stone Oak Parkway, and more than 8 inches between Shavano Park and Camp Bullis, according to the NWS website.

https://www.mysanantonio.com/news/weather/article/mysananton...


Yes, it is in San Antonio.


Visual Studio Online has been offline all day. They say it is due to the same Azure outage. This has had a productivity impact.

If Microsoft didn't own GitHub, this might have prompted a move, but since they do, it seems a little redundant given that GitHub will likely be on Azure too before long.

https://blogs.msdn.microsoft.com/vsoservice/?p=17405


VSTS being down all day has had a non-negligible impact on us. We have most of our deployment pipelines there, and they've been inaccessible all day.


In my experience git is excellent at being offline. I'm not sure whether we would have had problems if our Team Foundation setup had ever gone offline, but we moved to git long ago because it's what every single IT-based student learns in my country. With every potential hire knowing git and almost none of them knowing TFS/VSTS, it seemed silly not to migrate, since both platforms work well for what we need and git was cheaper overall.

You can use git outside of GitHub and Microsoft though, I mean, you could always use bitbucket.
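For example, keeping a second remote as a warm fallback is a couple of commands (the remote name and URL below are made up):

    git remote add fallback git@bitbucket.org:ourteam/ourrepo.git
    git push fallback --all     # mirror all branches to the fallback host
    git pull fallback master    # and pull from it when the primary is down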


In my experience, most companies that use "github" (and not "git") cannot get much done if github is down.


So much this. All the pipeline and delivery stuff is glued into GitHub's API. Most companies aren't using git any more but GitHub, and git is just a dependency. I've even seen developers who are unaware that git can exist without GitHub.

I incidentally think this is a completely terrible outcome.


It doesn't really matter what service provider you use. You can't do anything that requires a sync if the server is down. You can do a bit of local development, but things like automated tests and deployment to staging servers will tend to break if they can't access the source.


Or a badly configured CI that re-downloads dependencies on every run instead of caching them locally.
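For what it's worth, the fix is usually a few lines of CI config; a sketch in GitLab-CI-style YAML (key names are from memory, so check your CI system's docs):

    # cache dependency directories between runs instead of re-downloading
    cache:
      key: "$CI_COMMIT_REF_SLUG"   # one cache per branch
      paths:
        - node_modules/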


I think it's unlikely. GitHub was on the cloud and then moved to its own infrastructure a while back. In fact they have their own provisioning framework and all that fun stuff.

I doubt they will move back to Azure, or any cloud for that matter. It's the same story with Dropbox and similar companies. Once past a certain scale, and depending on the case (for example, needing control over data security to hold certain certifications), it's essential to have your own infrastructure.


Microsoft owns Azure, Microsoft owns GitHub. Azure is Microsoft's own infrastructure. Remember, kids: there is no cloud, it's just somebody else's computer.


You're missing the point. Github wants to have control over its hardware. It's as easy as that.


Having VSTS down all day meant I got exactly 0 done today. Completely crazy to me.


For me, it's even crazier to rely on a third party provider to be able to do my work. I wouldn't be comfortable at all knowing that my development infrastructure is beyond my control.


Not intended to be snarky: why is this crazy to you? All cloud providers have had downtime incidents, major hosted VCS providers, SaaS products. Downtime is a fact of life in tech.


Hrm. I've worked at a large tech firm for more than a decade and there has never been a day when the VCS or the build farm was down all day. It's notable when it's down for more than twenty minutes.


As a counterpoint, I've seen banks down for extended periods of time, hours occasionally stretching to a day or two (TSB, Lloyds, Bank of America, BankSimple [BBVA]). Downtime is a fact of life. Google, Amazon, IBM, and Microsoft have had major cloud outages. GitLab nuked their production DB. Slack and Reddit are frequently down.

Unless it's life critical (911, air traffic control), if it's down it's only going to hamper productivity, but it'll be back eventually. Time to stretch and get a coffee; and if it's down all day, go home and start fresh tomorrow.

We’re not saving lives, we’re just building websites. Downtime isn’t shameful, it happens to all of us.


If a single bank is down, or Reddit (lol), then the impact is fairly limited, but if one of the 3 major cloud providers, which powers large parts of the internet, is down for an entire day, then the impact is a little more critical, I would say ;).

There's a reason why Azure has an SLA and Reddit doesn't ;)

Also, if you start comparing the big companies with GitLab, then we don't have to continue talking. It's not OK to nuke your production DB, and that's why everyone in the tech scene laughs at GitLab; comparing them to Azure is like comparing a Lego house to brick and mortar.


At GitLab we are always trying to iterate and we learned a lot from our incident with the database.

The one thing we are proud of is our transparency. The community really appreciates our openness and we are happy about it.

Here's one example: https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-...


I received a notification just now in my Visual Studio 2017 Community Edition that I "have a license for Visual Studio Enterprise 2017. Get it now". When I click on it, the web page asks me to sign in, which returns a 502 error code.


Edit: <snipped> out my rant.

It's been a long day because of this. Just going to leave it at that.


They have some services that are "global", i.e. not tied to a given region. Those services' requests are actually processed all over the place, but South Central is a big datacenter. The 9th biggest in the world, apparently. When it lost cooling and shut down, everything routed around it as planned... but that caused so much extra traffic that it overwhelmed the connections to other datacenters. The backlog of requests is tremendous, of course, so even after they got South Central back up, all the other datacenters were way over their traffic capacity. They've got the datacenter back up, and are now restoring storage and storage-dependent services.

Honestly it's hard to imagine a good mitigation for this. "Build more datacenters" is already happening as fast as it can. "Keep enough spare capacity around to handle losing one of the biggest datacenters in the world" is pretty unreasonable.

If you, as a customer, are uptime-focused enough that it's worth paying extra, then the sensible practice has always been cross-cloud infrastructure/failovers. At least since the Amazon Easter failure of 2011. That's what giants like Netflix do.


> "Keep enough spare capacity around to handle losing one of the biggest datacenters in the world" is pretty unreasonable.

Err what?

It's entirely reasonable to expect Azure to handle the loss of a single DC and not have a 14+ hour global outage. I don't care how big the DC is, losing one should not take out the world, especially not for the length of time this one has been going on.


Indeed. This article by AWS VP James Hamilton gives a unique insight into how Amazon approaches the problem of sizing data centers for redundancy:

https://perspectives.mvdirona.com/2017/04/how-many-data-cent...


Here I was hoping this was a reference to James Mickens: https://blogs.microsoft.com/ai/james-mickens-the-funniest-ma...


Agreed.

With that, though, it sounds like the size of this datacenter is way out of scale compared to the rest of their DCs. They are really going to need to break apart the services that they host there to make sure that DC-to-DC and region-to-region failover works correctly.


> It's entirely reasonable to expect Azure to handle the loss of a single DC and not have a 14+ hour global outage.

(I apologize if the following sounds snarky. I don't mean it that way, I just can't find better wording.)

Microsoft has repeatedly violated my sense of "reasonable" in the past, including in recent times with Windows 10. Therefore this kind of glitch isn't very shocking to me.


> Honestly it's hard to imagine a good mitigation for this.

Besides the one that AWS and GCP have implemented? That is, to have at least N+1 datacenters? Actually, I think N+1 is the old Google prod regime. I suspect that GCP is at least N+1 per continental region, and I'd be surprised if AWS isn't as well.
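For anyone who hasn't seen the arithmetic: N+1 just means provisioning one more site's worth of capacity than peak load requires, so that losing any single site is survivable.

    With N+1 sites, cap each site's steady-state utilization at N/(N+1).
    E.g. N = 3 (four sites): run each at <= 75%; lose one, and the three
    survivors absorb the load at 75% * 4/3 = 100% -- tight, but it holds.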


But you are basically saying that Azure cannot deliver the redundancy they charge their customers for?


Just want to point out that once Azure launches their submersible datacenter units Azure Functions may literally become dead in the water.


One of the demos at Build was Azure Stack running on oil rigs, so you're not far off.


I, for one, salute you, sir. Well played.


The worst part has been the poor communication. If they gave clearer insight from the get-go, that would give me more confidence and patience. Saying "check back in 2 hours" isn't useful.


> Saying "check back in 2 hours" isn't useful.

Having worked for a cloud provider: the reason they say that is because they are actively working to understand and fix the problem but haven't arrived at a well-reasoned solution, and thus cannot give you a decent time estimate, because you will probably get even angrier if they under- or over-estimate the time it takes to fix it.


If they said this: "because they are actively working to understand and fix the problem but haven't arrived at a well-reasoned solution, and thus cannot give you a decent time estimate, because you will probably get even angrier if they under- or over-estimate the time it takes to fix it," I would thank and applaud them. Tell me what it is you're doing, at least. Why don't you understand the problem? What are you investigating? Some transparency goes a long way for me.


I used to think like this too - e.g. I was happy when our national rail started announcing the cause of delays. But then a friend of mine was complaining that they did this, because he didn't want to be troubled with their internal problems - "just tell me what to do".

When your customers' demands are so directly opposed, you're somewhat caught between a rock and a hard place.


You can easily reflect both positions in your status page.

Those who don't need to be bothered with the details can refrain from reading them.


I really don't understand why everything has to be black or white with any of this stuff.

All this would take to keep both sides happy is a little link with "more info" below it.

Why things like this are so difficult, I will never understand.


Hmm yeah, I guess in this case that is true. Probably a matter of not having been in this situation often, and thus not having designed the status page that extensively.


Exactly; this happens even in a normal production failure. I don't know what else they could have said/communicated. Not to mention this is the 7th largest data center in the world; resolving the problem likely took/is taking a long, long time just because there are so many machines. I was lucky that the only outage effect I've suffered is that my storage is locked, which means I can't add new files or edit code in production... but that's much better than it being down completely. My databases are geo-redundant, so that was a blessing today.
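(For anyone wondering what that looks like to set up: with the az CLI it's roughly one call per database, though I may be misremembering the exact flags, and all the names below are placeholders.)

    # create a readable geo-secondary in another region (flags approximate)
    az sql db replica create \
        --name mydb \
        --resource-group my-rg \
        --server my-primary-server \
        --partner-server my-dr-server \
        --partner-resource-group my-dr-rg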


Were you able to perform operations given ARM services were down? Were you using your secondary DB for read-only or did you fail-over some other way?


How often do we do this for our customers? Honestly, if a customer wanted to know that much detail about my process, I would be quite bothered, and I suspect most people would be as well. Wait for the recap, everything will become clear then. You don't need to know everything in the moment.


I'm not asking for every detail. It's horrible to wait 20 hours and counting for unknown reasons and to not have something to work towards. I only ask for more transparency is all.

It's akin to waiting for surgery and the doctor saying "we're working on it". I don't want/need the details for the surgery, but tell me everything is ok, what comes next and give me some estimates to set expectations.


We're affected by this issue. And we had our alerts system in Azure as well, so we didn't get alerts about the outage (welp).


That's why T-Mobile's on-call engineers carry around AT&T phones.

(source: friend who's an engineer at T-Mobile)


That is a top 'did you know' factoid that I am sure I will tell others.

But do AT&T engineers carry T-Mobile phones?

If yes, then they should put together a deal so that none of the on-call engineers have to worry about running up big bills using their phones. When there are freak weather events, they are all in it together.


They would use Verizon if anything


"factoid": I don't think it means what you think it means.

https://www.merriam-webster.com/dictionary/factoid


The second definition there fits perfectly.


You made the same mistake as Microsoft, who seem to have put their Azure status page on Azure:

https://www.reddit.com/r/AZURE/comments/9cvgn2/is_there_a_st...

Their status page is back up now, but my stuff's still broken. :\


And Amazon. I recall that during their most recent S3 outage, Amazon's status page was green across the board, because somehow all the assets that were supposed to be displayed when things went wrong were themselves hosted on the thing that was down.


Yes, changing the status page required updating an S3 bucket located in us-east-1, which is where the outage was, so they couldn't change it.


And now, S3 has cross-region replication ...

And so does DynamoDB, and Aurora ...
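For reference, S3 cross-region replication is just bucket configuration now; a rough sketch (bucket names and role ARN are placeholders, and both buckets also need versioning enabled):

    aws s3api put-bucket-replication --bucket my-source-bucket \
      --replication-configuration '{
        "Role": "arn:aws:iam::123456789012:role/my-replication-role",
        "Rules": [{
          "Prefix": "",
          "Status": "Enabled",
          "Destination": { "Bucket": "arn:aws:s3:::my-backup-bucket" }
        }]
      }'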


And thus the lesson of all eggs in one basket has been taught again.


Some thought they were spreading eggs across multiple baskets by replicating across regions and trusting Azure was doing the same for their “global” services.

As a consumer, the lesson here is that Azure is one big basket. It would probably be prudent to think of AWS and GCP as single baskets too.


Thanks for sharing. A big outage like this does have the benefit of making things better in the future.


[citation needed]


Presumably some people will be setting up Nagios servers outside of Azure now. So that makes things better.
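The essential part really is that small; a minimal sketch of an external check in Python (the endpoint URL is a placeholder, and you'd wire the alert to mail/SMS/whatever you use):

    # run from a box *outside* your main cloud provider, e.g. via cron
    import urllib.request

    URL = "https://myapp.example.com/health"  # placeholder endpoint

    def healthy(url, timeout=10):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:          # DNS failure, timeout, non-2xx, etc.
            return False

    if not healthy(URL):
        print("ALERT: health check failed for", URL)  # swap in real alerting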


VSTS is still down for us. With our code repos hosted in TFS and our entire bug system on VSTS, no work is being done.

I suspect we are gonna have to wait at least another day for this to resolve. Meanwhile my local code drifts even further out of sync.

I’m probably just gonna spin up a git repo on my local machine and use that to share code with my team.
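That works, for what it's worth; git doesn't need a server product, just something reachable (hostnames and paths below are made up):

    # expose a repo read-only on the LAN from my machine
    git daemon --base-path=/home/me/repos --export-all

    # teammates add it as a stopgap remote
    git remote add stopgap git://my-machine.local/ourrepo.git
    git fetch stopgap

    # or skip the network and hand commits around as files
    git bundle create fix.bundle master
    git pull fix.bundle master   # run from the receiving clone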


PM for VSTS here. The final scale unit in South Central US was brought back online a few hours ago, which means that the final accounts that were affected should now be operational. We're still restoring package management to some accounts, but otherwise, you should be back to working. Please feel free to reach out to me if you're still having trouble. Email is my HN username @ microsoft.com. We're _very_ sorry for this very significant outage.


Kind of surprised by the lack of redundancy, especially for their first-party products. Shouldn't they be deploying to more than one failure zone?


Like any major service, I'm sure they do. But also like any other service, how well all that resiliency is tested in the real world is a separate question. And today we have an answer.

Anyway I have a good guess as to what most of the employees there are going to be doing for the next six months.


The Visual Studio Marketplace is also down: https://marketplace.visualstudio.com/


Just today I was having issues with the Prettier extension in VS Code, and I uninstalled it to see if that would fix them (I read that usually fixes the issues I was having). Then I realized that I couldn't install it again because the VS Marketplace was down. That was like 8 hours ago and there are still no signs of recovery. Of course, all my builds are failing because of some stupid formatting issue that Prettier would usually solve, so yeah... thanks, MSFT.


> Of course, all my builds are failing because of some stupid formatting issue that Prettier usually would solve, so yeah..thanks MSFT.

Is failing builds due to formatting issues really a sound setup?


Yes.

When you have a style guide, test for it; if the test fails, fail the style-linter job and don't allow the change to be accepted.

It's failed to meet your acceptable code criteria after all.

If you find you are making your code unreadable just to pass, then your style guide is wrong. That needs fixing, not the CI job.

If you find an urgent "this needs to merge, style rules be damned" change, allow your senior team members to overrule the style CI job and merge it anyway.
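Concretely, the gate can be a one-line CI step; e.g. for a JS codebase using Prettier (the glob is illustrative):

    # exits non-zero -- failing the job -- if any file isn't formatted
    npx prettier --list-different "src/**/*.js"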


I assume their build system checks for formatting and will raise an error if it doesn't conform. And this person would use the VS Code extension to auto-conform their code.


I'm questioning the soundness of such a setup in general, and especially if it means that losing connection to a third-party prettifier makes you unable to work on your own codebase.


Hmm, the alternatives are not enforcing a similar code style, or enforcing it earlier on (e.g. on commit). I can understand why they would not want the former, and the latter is more annoying when experimenting, i.e. when code style does not matter that much yet. Thus, in CI sounds like the right choice.


I'd go for a pre-commit hook or similar, but not a huge deal.
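For the record, that's just an executable script at .git/hooks/pre-commit; a sketch (again, the glob is illustrative):

    #!/bin/sh
    # .git/hooks/pre-commit -- refuse the commit if formatting is off
    if ! npx prettier --list-different "src/**/*.js"; then
        echo "Formatting check failed; run prettier before committing."
        exit 1
    fi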


They're not unable to work, they just don't have a "format this according to the rules our codebase has set out" button.

They can also just format their code according to the rules by hand.


I can add an ignore formatting line, but then people would ask why I did that. We have a styleguide to increase readability and make code-reviews easier. Is it a sound setup? I personally think it is. It promotes good standards and can sometimes even help catch some issues related to structure and code flow.


This is very inconvenient. I'm not able to download an extension that I need to work on a project.


It's back online again!


The outage appears to be ongoing, and it's having ripple effects in other regions (management portal unresponsive, autoscale and other services not firing in West US, at least for me).

Also unable to lodge a support ticket, because the portal fails to identify me as having paid support (that API request appears to time out).


Yeah our company's website was down all day due to this. We are looking at ways to mitigate in the future.



As of 07:15 UTC on the 5th of September 2018, the Azure status message reported:

"NEXT UPDATE: The next update will be provided by 07:00 UTC 05 Sep 2018 or as events warrant."

As I finished writing this they finally updated with essentially the same message except stalling for an additional two hours.

So, if you're thinking "Well, surprises will happen". Yeah, and Microsoft is not actually prepared for that at all, so, sucks to be their customer I guess?


So AWS has had some big outages, as has Azure. Has GCP had any big outages yet?



Outages are part of life, but Google's cloud is the most resilient in my experience.


Google doesn't have anywhere near the cloud presence of Amazon and Microsoft... maybe one day when they do, we can properly compare them. Given Google's small size/role in the space, it's impossible to gauge whether this is true.


Google is fairly near MS (Azure did $1.9B in Q1, and GCP+Google Apps did $1.7B). But AWS is ~2.5x the other two.


Google Apps is not GCP. If you wanted to compare that number you'd have to throw Office 365 in with Azure, and that would dwarf Google.

Google has ~3% market share compared to Microsoft's ~28% and Amazon's ~40%. Not even in the same league at the moment. Google is more on par with IBM and Rackspace, for now. Google will undoubtedly make strides in the space, but they haven't been tested.


> Google has ~3% market share compared to Microsoft's ~28% and Amazon's ~40%.

Where does this number come from?

If it is based on the revenue reported, be very careful with Microsoft's numbers. They report a lot of products as "Azure intelligent cloud", including Office suite subscriptions, on-premise server licences, and software (Windows, SQL Server) licensing revenue from other cloud providers in that number.

Pretty soon their claimed growth is going to flatten out, because they won't be able to find any more revenue to report as "Azure intelligent cloud", like PC hardware ...


People have been throwing this conspiracy theory around for years... even though you can look at Microsoft's earnings reports and see these are broken out as separate line items.


Depends whose figures you go by. As per [1], they're at 6%, double your source's number.

[1] https://www.google.com.au/amp/s/www.cnbc.com/amp/2018/04/27/...


It may have grown, but it's still a fraction of Microsoft and Amazon which is the entire point.



AWS and Azure have had "big" outages because people actually use them. Rackspace and IBM are almost neck and neck with Google's best efforts (3% market share vs. 30%/40% for Azure/AWS) [0].

[0] https://www.skyhighnetworks.com/cloud-security-blog/microsof...


The difference I see is that AWS doesn't seem to have suffered a non-region-specific service outage yet. Whether or not it's related to the South Central US region outage, today Azure Active Directory was down for ~4h. Because it's not running in a particular region, no HA or DR strategy, other than having your services deployed to multiple cloud providers, would have mitigated the outage.


The biggest and most popular services run on Google Cloud.

https://cloud.google.com/customers/

You know....services consumers actually use.


I'm sure services I use run on all of the major cloud providers, but if that page was supposed to be enlightening, I only recognized one brand from the first page of customers.

I refreshed a couple times, and sure, I saw more (on average 1 or 2) that I recognized on each page. But I don't think your response is particularly persuasive. Are you suggesting that the services that I use that run on AWS are in fact, not services I actually use?

Or am I not a consumer? I'm confused.

Edit: Do you hold any Alphabet/Google stock? I've noticed your comment history trends toward dismissing criticism of Google, praising their products, and taking opportunities to speak about the flaws of their top competitors.


In the future, if you or anyone runs across an account that has been abusing HN in this way, it would be much better to email us at hn@ycombinator.com so we can look into it. I happened to randomly see your comment, but the odds of that are low.


>I'm sure services I use run on all of the major cloud providers, but if that page was supposed to be enlightening, I only recognized one brand from the first page of customers.

So the first page was supposed to be indicative of all of the popular consumer facing services they host? Here, let me help you out: Spotify, eBay, Twitter, Apple iCloud, Verizon, Vimeo, Netflix, etc

>I refreshed a couple times, and sure, I saw more (on average 1 or 2) that I recognized on each page. But I don't think your response is particularly persuasive. Are you suggesting that the services that I use that run on AWS are in fact, not services I actually use?

What popular consumer services were on AWS again?

>Edit: Do you hold any Alphabet/Google stock? I've noticed your comment history trends toward dismissing criticism of Google, praising their products, and taking opportunities to speak about the flaws of their top competitors.

Do you own Microsoft stock? Because quite a few of your posts seem to praise their products and services. Do you work for them?


We've banned this account. All it has done is aggressively post pro-Google comments and diss Google competitors.

Single-purpose accounts are not allowed here, especially not when pushing an agenda, and most of all not when pushing corporate propaganda. Of all the things that make HN users angry, that's at the top. And I agree with them.

Most of the time we tell HN users that they're not allowed to accuse each other of astroturfing. When we do find a clear-cut case of abuse that's been getting away with it for this long, I get pretty steamed.

You've also frequently broken the site guidelines by being uncivil, so much so that we've warned you at least half a dozen times. That's more than enough reason to ban you in its own right.


And, um, are any of those "most popular services" listed on the page you linked?


GCP has had multiple many-hour (6+) GLOBAL outages in the past year. I think it's at about 3 so far this year.

But, it doesn't make the headlines like a 2-hour S3 outage in a single region, which must mean something ...


GCP's status history would seem to disagree, unless you have unusual definitions of "outage" and/or "global": https://status.cloud.google.com/summary

The last incident I'd personally classify as major lasted 39 minutes and was widely reported: https://status.cloud.google.com/incident/cloud-networking/18...

Disclaimer: I work at GCP but am not speaking for them. I also wish a speedy recovery for our colleagues at Azure: an outage like this can only result from many things going sideways simultaneously, and both the cause and recovery can be complicated in ways that flippant "well why didn't you just N+1 it" commenters here on HN can only guess at.


To be clear, by "global" (around the whole globe/earth), I mean that a specific service is broken in every public region. You seem to have a wider interpretation of "global".

Besides the one you listed:

* https://status.cloud.google.com/incident/compute/18005

"Google Compute Engine VM instances allocated with duplicate internal IP addresses, stopped instances networking are not coming up when started." - 22 hours

Newly-launched instances, or instances that were stopped and started, received duplicate IP addresses. 4.5 hours in, a mitigation was provided, but it was only resolved after 22 hours, and customers may still have had to fix individual instances. As far as I remember, this was global, and there is nothing on the status page indicating it was limited to one region or a subset of regions. So, for 4.5 hours, if you needed to create a VM with working networking, you couldn't, anywhere on GCP, and no mitigation was available. Do you not consider this to be "global"?

* https://status.cloud.google.com/incident/compute/18009

"Instances using Local SSD might experience VM failures. This affects GCE VMs globally. No data corruption has been observed." - 5 hours

The original claim was:

> So AWS has had some big outages, as has Azure. Has GCP had any big outages yet?

I said:

> GCP has had multiple many-hour (6+) GLOBAL outages in the past year. I think it's at about 3 so far this year.

So, maybe it's only 2 major global outages, or maybe it's 3 5-hour+ global problems, but the only way anyone can claim that Google hasn't had any big outages is if they don't have enough market share for a global outage to affect many websites or end-users.


The biggest AWS outage in the past 3 years was S3 being down in one region for a few hours in Feb 2017 ...

No other regions were affected, except for global APIs (e.g. create S3 bucket), which you shouldn't rely on in your critical path.

Many new customer features have been delivered to allow mitigation of this kind of failure (e.g. cross-region S3 replication).

AWS had a power failure at one data-center in us-east-1 earlier this year, which had very little impact (basically only customers who didn't have sufficient redundancy in other AZs were affected).


We all ditched the Unix model of a central server with dumb terminals because Microsoft told us the future was everybody having a full OS on their workstation. Now they tell us the future is going back 30 years, and having all of our data and programs on somebody else's machines (theirs).

Sometimes I can't understand people.


You can use a VPN to view the status page if it's erroring out for you, I'd also suggest trying to clear your local DNS cache.


And here I was wondering why my App Gateways weren't deploying last night.


The extensions marketplace for VS Code also still seems to be offline.



