AWS S3 Outage
268 points by gschier on Aug 10, 2015 | 131 comments
Seeing huge numbers of 503s from the S3 API in us-east-1. Anyone else having problems? I've only found one other report on Twitter: https://twitter.com/cperciva/status/630641484677558273



I'm seeing it as well - the majority of our connections are being dropped at the moment

  The Amazon S3 team recently completed some maintenance   
  changes to Amazon S3’s DNS configuration for the US STANDARD region on 
  July 30th, 2015.
    
  You are receiving this email because we noticed that your bucket
  is still receiving requests on the IP addresses which were removed 
  from DNS rotation. These IP addresses will be disabled on August 
  10th at  11:00 am PDT, at which time any requests still using
  those addresses will receive an HTTP 503 response status code.
  
  Applications should use the published Amazon S3 DNS names for 
  US STANDARD: either s3.amazonaws.com or s3-external-2.amazonaws.com
  with their associated time to live (TTL) values. Please refer to 
  our documentation at: 
  http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region 
  for more information on Amazon S3 DNS names.
Something to do with that, perhaps? AWS sent us that last Thursday.
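A quick way to check which addresses a JVM is actually resolving for the published endpoint is a one-liner with InetAddress; a minimal sketch (the hostname is the US STANDARD name from the email above, everything else is illustrative):

    import java.net.InetAddress;

    public class CheckS3Resolution {
        public static void main(String[] args) throws Exception {
            // Print every address currently resolved for the published
            // US STANDARD endpoint named in the email above.
            for (InetAddress addr : InetAddress.getAllByName("s3.amazonaws.com")) {
                System.out.println(addr.getHostAddress());
            }
        }
    }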


Java by default can cache DNS forever, which may be why many people are seeing problems. Set networkaddress.cache.ttl to adjust this.

http://javaeesupportpatterns.blogspot.ie/2011/03/java-dns-ca... has more detail.
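If setting a startup flag isn't convenient, the same security property can be set programmatically before any lookups happen; a minimal sketch, with the 60-second TTL being an arbitrary example value:

    import java.security.Security;

    public class DnsCacheConfig {
        public static void main(String[] args) {
            // Cache successful DNS lookups for 60 seconds instead of
            // indefinitely (the behavior some JVMs default to, e.g. when a
            // SecurityManager is installed).
            Security.setProperty("networkaddress.cache.ttl", "60");
            // Also limit how long failed lookups are cached.
            Security.setProperty("networkaddress.cache.negative.ttl", "10");
        }
    }

The equivalent JVM-level knob is often exposed as -Dsun.net.inetaddr.ttl, but that one is implementation-specific.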


Reading that article, it seems that starting with Java 1.6 the default TTL became 30s.


That's not true as far as I can tell. You need this property set or it caches DNS entries forever, at least on the JVM we run (Oracle, 1.8u51).


If it is related, someone screwed up badly. I didn't receive such an email, and I'm seeing the same error rate with IPs I resolved five minutes ago.


I wonder what it would take for Amazon to show one of the yellow icons on their status page? Has it ever happened? Would a datacenter have to fall into the ocean?


Yeah, it bothers me that their status messages for major outages are simply "elevated error rates".


What's frustrating is when you have customers who are also down because of the outage - but when you say Amazon is experiencing a severe outage causing 50% of our requests to be dropped and there's not much we can do, it makes us look pretty bad when they go to the Amazon dashboard and only see "Elevated Error Rates."


"Elevated Error rates" probably does not qualify as breach of their SLA


In these cases I suggest saying "we are being severely affected by an Amazon outage"


Oftentimes it's just that, though. Just because many customers are experiencing something doesn't mean ALL customers are experiencing something. When I worked there, what the media would describe as a major outage was really less than 1% of one region... this particular instance seems pretty odd though.


I'm sure "elevated error rates" is the first alarm which goes off. And once they've put that description onto the status page, they're probably more worried about getting it fixed than going back and changing the wording.


They should worry about that a lot. Amazon are notoriously bad at communicating during outages. They've gotten better, but they're big enough that it should have priority.


I agree. I was speculating, not trying to defend. :-)


Amazon isn't a couple of guys in a garage. They have hordes of administrative personnel who could be tasked to update a status page.

Companies 1/1000th the size of Amazon can manage it.


Yes and no. Sure, Amazon has administrative personnel. Sure, some of those administrative personnel would probably be happy to get paid extra to carry a pager and be summoned to work at 3AM to update a status page.

But the last thing you want to do is put inaccurate information onto a status page; so mere administrative personnel aren't enough -- you'd need people who understand enough about the system to be able to write about it without introducing errors.

I'm guessing that the intersection of "administrative personnel", "willing to carry pagers" and "understand the internals of AWS services" is a very small set.


Except "willing to carry pagers" is currently the basis of employment at Amazon, and not just for AWS but for whole chunks of their technical business. It's one of the many reasons why they have a pretty dire reputation (see plenty of discussions on here from former Amazon employees).

They also claim to have "customer obsession" as a leadership principle; this whole thread is an excellent example of that principle failing in a big way.


This is a situation where "we pay you, now do as you're told" comes in handy.

Not every job can be full of self-directed aspirational spiritual awakenings. If that were the case, nobody would deliver my dinner on a bike when it's -20ºF outside.


>I'm guessing that the intersection of "administrative personnel", "willing to carry pagers" and "understand the internals of AWS services" is a very small set

Being a non-engineer doesn't mean they don't know anything about the technology. And they don't need to know the internals, just enough to convey information from the engineers and managers to the public.

Plenty of other organizations manage resolving issues while transmitting information about the issue to other stakeholders.

Also, most administrative personnel have far fewer job opportunities than engineers. If they can get the engineers to carry pagers, they can get a PR minion to carry one.


You would be surprised to learn just how few people run your favorite web service.


It's yellow now. I'm pretty sure the datacenter is still there. I think they go yellow/red after a certain time has passed or someone manually changes it (probably rarely).


Seems that "S3 offline" is the AWS equivalent of a datacenter falling into the ocean. Gotta wonder how many services are using S3 as a faux message queue?


I think most services that use a message queue use the actual message queue service (SQS), but everyone has to store their files somewhere.


They generally only use those icons in hindsight. As long as they're still "investigating" they stick to green with a note attached.


There's one there now for "Amazon Simple Storage Service (US Standard)"


Latest Update from http://status.aws.amazon.com/:

1:52 AM PDT We are actively working on the recovery process, focusing on multiple steps in parallel. While we are in recovery, customers will continue to see elevated error rate and latencies.


The icon went yellow. I repeat: YELLOW


It only goes red when a nuclear event occurs, obliterating most of humanity and only the machines remain.


That would actually turn it green again, I think.


maybe true – but if a server gets no requests, does it really exist?

or would the inter-machine chatter continue ad infinitum? would they run out of IPs or successfully transition to IPv6?

So many questions.


Cory Doctorow's When Sysadmins Ruled the Earth touches on this - what happens to internet activity during a global crisis?

http://craphound.com/overclocked/Cory_Doctorow_-_Overclocked...


this is excellent, hadn't seen it before! thx


Eventually the undefined behavior of this chatter will result in unallocated memory slowly churning in the garbage collection of time. Some day one of the sectors of unallocated memory will be executed resulting in a self replicating program. This program will evolve and multiply, pondering on the vastness of the S3verse, forever in search of Root.


my service looks pretty red now


The EC2 launch thing is still green even though all launches fail for us.


Could you expand on what you mean? Is it because your launch is trying to fetch something from S3?


There was a period in which I couldn't launch instances. Meaning the instance state never reached "running" according to the console, and the machines were not responsive to initial ssh attempts (knife timed out after 5 minutes and the machines were still unavailable after several more).


Maybe related to AMI retrieval from S3?


Went to green again and it seems to be resolved.


Next update is live

  2:38 AM PDT We continue to execute on our recovery plan
  and have taken multiple steps to reduce latencies and error
  rates for Amazon S3 in US-STANDARD. Customers may continue 
  to experience elevated latencies and error rates as we 
  proceed through our recovery plan.


Can't pull Docker images from the hub either, and their status page currently shows S3 problems: https://status.docker.com/


http://status.docker.com/ - they are "Investigating issue with high load"; it's now yellow, but it was red before.

Can't pull any images either.


Can't docker login or pull images from the hub. Both operations hang.


Seeing the same thing. Got back from vacation an hour ago, probably related. :)


At least it didn't happen while you were on vacation. :)


Welcome back!


Could this be a reason why Heroku is misbehaving? https://status.heroku.com/incidents/792K


I can't help but be a little surprised that Heroku's entire build system is disabled by an S3 failure in one region. Now I'm unable to add a notice about the issues to my site's HTML...


True. Given the business Heroku is in, they should have redundancies in multiple regions if not multiple service providers.


Since S3 is the de facto artifact delivery system for most people that run on AWS, it's not much of a surprise. For the most part, very isolated incidents aside, S3 is rock solid. Even EC2 relies on S3 for launching non-EBS instances.


Heroku build logs spat this out:

    Unable to fetch source from: https://s3-external-1.amazonaws.com/heroku-sources-production/heroku.com/<some-uuid>?AWSAccessKeyId=<some-access-key>&Signature=<some-signature>&Expires=1439198046


"1:08 AM PDT We believe we have identified the root cause of the elevated error rates and latencies for requests to the US-STANDARD Region and are working to resolve the issue."

Looks like the cavalry is coming.


GitHub is having release download issues, possibly due to this. https://status.github.com/



Feels like the internet is no longer a distributed thing where, if one website/node goes down, the others keep working...

Feels like in the future, if a cloud provider goes down, the whole internet will stop working :)


This is pretty much the case. Years of evangelising the idea that (a) everybody should be on Amazon and (b) everybody should be on the cheapest regions of Amazon mean that while the underlying datacentres are probably much better managed, individually speaking, than the tapestry of colos that made up the world a decade ago, an outage has much more wide-ranging effects than you'd get at that point.


  [3:25] AM PDT We are still working through our recovery plan.
Man, I'd love to see that plan.


I'd also love to see a post mortem on this, but I highly doubt they'll release anything about it.


Seems HashiCorp may be affected by this as well.

    $ vagrant up
    Bringing machine '...' up with 'virtualbox' provider...
    ==> ...: Box 'debian/jessie64' could not be found.
    ...
    ...: Downloading: https://atlas.hashicorp.com/debian/boxes/jessie64/versions/8.1.0/providers/virtualbox.box
    An error occurred while downloading the remote file. The error message, if any, is reproduced below. Please fix this error and try again.
    The requested URL returned error: 500 Internal Server Error
EDIT: not Markdown.


VagrantCloud, now known as Atlas, is merely a redirector service for Vagrant box management. I host all my own boxes (on S3), but still use them for an easy means of sharing without having to remember a long-ass URL.


As of 10:29:33 UTC, everything is back to normal as far as I can measure.


Should be back to normal now. The latest update is:

3:46 AM PDT Between 12:08 AM and 3:40 AM PDT, Amazon S3 experienced elevated error rates and latencies. We identified the root cause and pursued multiple paths to recovery. The error has been corrected and the service is operating normally.


Open-source library request: a library that lets you use S3 and Google Cloud Storage simultaneously and fail over to the other if one has problems.

There are many use cases where paying 2x for storage is a reasonable tradeoff for higher availability and provider independence.
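For the read path, the idea can be sketched in a few lines with the AWS SDK for Java and the Google Cloud Storage client: try S3 first, and on any error fall back to a mirrored GCS bucket. This is only a rough sketch; the bucket and object names are hypothetical, and keeping the two stores in sync on writes is the harder part.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.util.IOUtils;
    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class DualCloudReader {
        // Region here is just an example; credentials come from the default chains.
        private final AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1").build();
        private final Storage gcs = StorageOptions.getDefaultInstance().getService();

        // Try S3; on any failure, fall back to the mirrored GCS bucket.
        public byte[] read(String s3Bucket, String gcsBucket, String key) {
            try {
                return IOUtils.toByteArray(
                        s3.getObject(s3Bucket, key).getObjectContent());
            } catch (Exception e) {
                return gcs.readAllBytes(BlobId.of(gcsBucket, key));
            }
        }

        public static void main(String[] args) {
            // Hypothetical bucket and object names.
            byte[] data = new DualCloudReader()
                    .read("my-assets-s3", "my-assets-gcs", "images/logo.png");
            System.out.println(data.length + " bytes");
        }
    }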


Depending on your use case, it may be slightly easier to accomplish this with s3 event notifications + AWS Lambda to write to a different region or service.

Importantly, make sure you CNAME your bucket under your own domain so that you can switch services.

edit: Much easier than AWS Lambda, actually: http://aws.amazon.com/about-aws/whats-new/2015/03/amazon-s3-...
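If you do go the Lambda route mentioned above, the handler itself can be tiny; a minimal sketch in Java, assuming a hypothetical backup bucket in another region (a real version would also need to URL-decode keys and handle retries):

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.S3Event;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;

    public class ReplicateOnPut implements RequestHandler<S3Event, String> {
        // Hypothetical backup bucket living in a different region.
        private static final String BACKUP_BUCKET = "my-backup-bucket-us-west-2";
        // Client in the destination region; the COPY request is sent to the
        // destination bucket and pulls the source object across regions.
        private final AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-west-2").build();

        @Override
        public String handleRequest(S3Event event, Context context) {
            for (S3EventNotificationRecord rec : event.getRecords()) {
                String sourceBucket = rec.getS3().getBucket().getName();
                String key = rec.getS3().getObject().getKey();
                s3.copyObject(sourceBucket, key, BACKUP_BUCKET, key);
            }
            return "done";
        }
    }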


https://jclouds.apache.org/

Apache jclouds® is an open source multi-cloud toolkit for the Java platform that gives you the freedom to create applications that are portable across clouds while giving you full control to use cloud-specific features.
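To give a sense of the portability, here's a minimal sketch of reading an object through the jclouds BlobStore view; the provider id is the only cloud-specific piece, and the credentials, bucket, and object names are placeholders:

    import org.jclouds.ContextBuilder;
    import org.jclouds.blobstore.BlobStore;
    import org.jclouds.blobstore.BlobStoreContext;
    import org.jclouds.blobstore.domain.Blob;

    public class JcloudsRead {
        public static void main(String[] args) {
            // Swap "aws-s3" for another provider (e.g. "google-cloud-storage")
            // without changing the rest of the code.
            BlobStoreContext context = ContextBuilder.newBuilder("aws-s3")
                    .credentials("ACCESS_KEY", "SECRET_KEY")
                    .buildView(BlobStoreContext.class);
            try {
                BlobStore store = context.getBlobStore();
                Blob blob = store.getBlob("my-bucket", "images/logo.png");
                System.out.println(blob.getMetadata().getName());
            } finally {
                context.close();
            }
        }
    }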


The GCS command line tool, gsutil, can talk to both S3 and GCS. That might be a nice place to start.


There used to be this https://deltacloud.apache.org/ (not exactly what you want + it seems to be dead).


What's stopping you from writing it?


Time and priorities. There is a difference between "I wish this thing existed, so I can use it/contribute to it" and "I need it so badly that I'm willing to spend a lot of time to make it production ready".

So far S3 seems to be reliable enough...


Might be something to integrate into Libcloud [1] instead of rolling your own.

[1] https://libcloud.apache.org/


Looks like I picked a bad week to stop sniffing glue.


Looks like this brought down typekit too. "Font Network is experiencing issues caused by an outage at our storage provider." http://status.typekit.com/


From http://status.aws.amazon.com/:

12:36 AM PDT We are investigating elevated errors for requests made to Amazon S3 in the US-STANDARD Region.


That would explain why my console was not performing well even though http://status.aws.amazon.com/ said "Service is operating normally". Good thing their API seems to be functioning during the outage, for me at least.


Good thing I built myself a local game streaming server instead of putting that in a remote GPU instance.


I started receiving lots of alerts from my side project https://StatusGator.io which monitors status pages. It's astonishing to me how many services depend on AWS directly or indirectly.


The most ironic thing I've seen in a while: their homepage features icons of the services they monitor. Many of them are failing with 503s because they're hosted on CloudFront.


As has happened before, the AWS status page is lying to us.

S3 is yellow, which means "performance issues". But not being able to download files from many buckets is clearly a "service disruption" (red).


It is actually possible if you keep trying.


It must take many tries, because I have not succeeded and I'm still trying.


This happens from time to time, especially when replacing files frequently! I upload things to S3 every day; if you're uploading a chunk of files you'll get errors every now and then when replacing them.


It seems to have come back for me now. Could someone else confirm?


Update: We are beginning to see recovery in error rates and latencies for Amazon S3.


Yes, I noticed things getting better around 3:30 AM Pacific.


I'm assuming this is also why I can't start any instances.

> 12:51 AM PDT We are investigating increased error rates for the EC2 APIs and launch failures for new EC2 instances in the US-EAST-1 Region.


Same here


I'm also having issues connecting to buckets based in Ireland (eu-west-1). Just hangs at authentication stage. Tried from 3 different internet connections, all having the same problem.


Ditto. The API does not answer at all, latency is high, and customers are not amused. The outage has lasted for hours already.

edit: doesn't seem to affect all buckets though. Only one of ours is experiencing this; the others are fine.


For what it's worth, eu-west-1 buckets are working fine for me here, via s3cmd and aws cli.


Looks like it's getting better on our side now


Now updates about ELB scaling and Lambda failures.


  3:36 AM PDT Customers should start to see declines in elevated errors and
  latencies in the Amazon S3 service.
Fixed?


We are seeing lots of 503s, empty response bodies, and peer reset / dropped connections.


I'm also getting problems with CloudFront attached to an S3 bucket.


Same here; first it only affected a dev deployment (S3 + CDN), now it has spilled over to other buckets :-/


I see this thread as a list of services depending on S3 being healthy.


Also a good list of developers and devops people not getting any sleep tonight :-/


Yup!

3am "Dry Run" Staging on Heroku...fail 7am Deployment to Production....fail

Now: "We have confirmed elevated latencies affecting our SendEmail, SendRawEmail and SendSMTPEmail APIs in the US-EAST-1 Region and are working to address the problem."

Which is perfect, since most of our PO orders are placed between 6am and 10am.

Tonight, adult beverages will be needed after everything is resolved.


Same here. Our S3 services are reporting similar 503s and network timeouts. A few of our partners are already down as well with their own 500s. Another stormy night in the cloud.


+1. For us it's the CDN (CloudFront) - only HTTP 503 responses.


Can't launch instances in EC2 in US-East-1 at the moment.


It appears EC2 is affected as well now:

12:51 AM PDT We are investigating increased error rates for the EC2 APIs and launch failures for new EC2 instances in the US-EAST-1 Region.


Probably just fallout from the fact that EBS snapshots are stored in S3. If you can't create an EBS volume, you won't be able to launch an EC2 instance from it.


It appears that S3-based AMIs totally fail to launch as well, showing as unavailable.


Yep. Can't even get a response to an s3cmd command.


The AWS API still works while the AWS web console does not.


Why is it behaving like this every other day?


When was the last time AWS had two major outages within, maybe, 30 days?



+1. It was also down during the leap second: http://mashable.com/2015/06/30/aws-disruption


To be fair, that was not really AWS's fault, nor (apparently) a leap second issue:

http://www.bgpexpert.com/article.php?article=167

https://twitter.com/Axcelx/status/616058414746202113


Current list of additional services affected:

  CloudSearch
  Elastic Compute Cloud
  Elastic Load Balancing
  Elastic MapReduce
  Relational Database Service
  CloudTrail
  Config
  Lambda
  OpsWorks


Yep, we're seeing timeouts and 404s for images stored on S3 :(

Good luck to the on-call engineers at Amazon!


Even EC2 is down:

Amazon Elastic Compute Cloud (N. Virginia) Increased API Error Rates 12:51 AM PDT We are investigating increased error rates for the EC2 APIs and launch failures for new EC2 instances in the US-EAST-1 Region.


EC2 is not down.


It's up, but launching new instances doesn't work due to the S3 issue.


I can still launch new instances.


Loading fonts on AWS pages is slow as hell because of this: https://aws.amazon.com/ses/


What a wonderful beginning of a week! Thanks AWS.


Lots more complaints on Twitter...


Yep, seeing the same.


oh... seeing the same


wtf, aws again


Still not fixed yet, huh.


It's been two hours. Luckily it happened around 12:00 AM, otherwise we'd be having a bad day.

Whoever is a DevOps or SysAdmin probably cannot sleep tonight :(.

In our case, we put Fastly on top of our assets/images so that only a portion of requests get errors. The objects cached on Fastly are still fine.


Actually, in some parts of the world it's not 12:00 AM, you know, and there are sysadmins/devops there too :-)


11:40am: Currently can't push to Heroku :(


yeah, forgot that :(. Hope it comes back soon.


Our European customers are not amused.


Haha, if two hours is nothing to you, that's pretty lucky. We work 24x7 (global customers; yes, the US is a bit busier but not by a lot) and we're an Australian company, so not a whole lot of fun.


Yeah, sorry to hear that :(. I mean, it's already been two hours, which is a long time.

However, if it doesn't come back before 5AM I think we're screwed :(



