Seeing huge numbers of 503s from the S3 API in us-east-1. Anyone else having problems? I only found one other on Twitter: https://twitter.com/cperciva/status/630641484677558273
I'm seeing it as well - majority of connections are being dropped for us atm
The Amazon S3 team recently completed some maintenance changes to Amazon S3’s DNS configuration for the US STANDARD region on July 30th, 2015.

You are receiving this email because we noticed that your bucket is still receiving requests on the IP addresses which were removed from DNS rotation. These IP addresses will be disabled on August 10th at 11:00 am PDT, at which time any requests still using those addresses will receive an HTTP 503 response status code.

Applications should use the published Amazon S3 DNS names for US STANDARD: either s3.amazonaws.com or s3-external-2.amazonaws.com with their associated time to live (TTL) values. Please refer to our documentation at http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region for more information on Amazon S3 DNS names.
Something to do with that, perhaps? AWS sent us that last Thursday.
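For anyone hitting this: the usual way to end up stuck on retired S3 IPs is a client that resolves the hostname once and then ignores DNS TTLs. The JVM is a common culprit, since it can cache resolved addresses far longer than the record's TTL (indefinitely under a security manager) unless the cache is bounded. A minimal sketch, assuming a Java client; the 60- and 10-second values are purely illustrative:

    // Sketch: bound JVM DNS caching so S3 lookups pick up rotation changes.
    // "networkaddress.cache.ttl" is a standard java.security property; these
    // calls need to run before the first hostname lookup in the process.
    import java.net.InetAddress;
    import java.security.Security;

    public class DnsTtlCheck {
        public static void main(String[] args) throws Exception {
            Security.setProperty("networkaddress.cache.ttl", "60");          // successful lookups
            Security.setProperty("networkaddress.cache.negative.ttl", "10"); // failed lookups

            // Always connect by the published hostname, never by a stored IP.
            InetAddress addr = InetAddress.getByName("s3.amazonaws.com");
            System.out.println("s3.amazonaws.com currently resolves to " + addr.getHostAddress());
        }
    }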
I wonder what it would take for amazon to show one of the yellow icons on their status page? Has it ever happened? Would a datacenter have to fall in the ocean?
What's frustrating is when you have customers who are also down because of the outage - but when you say Amazon is experiencing severe outages causing 50% of our requests to be dropped and there's not much we can do, it makes us look pretty bad when they go to the Amazon dashboard and only see "Elevated Error Rates."
Oftentimes it's just that, though. Just because many customers are experiencing something doesn't mean ALL customers are experiencing something. When I worked there, what the media would describe as a major outage was really less than 1% of one region... this particular instance seems pretty odd though.
I'm sure "elevated error rates" is the first alarm which goes off. And once they've put that description onto the status page, they're probably more worried about getting it fixed than going back and changing the wording.
They should worry about that a lot. Amazon are notoriously bad at communicating during outages. They've gotten better, but they're big enough that it should have priority.
Yes and no. Sure, Amazon has administrative personnel. Sure, some of those administrative personnel would probably be happy to get paid extra to carry a pager and be summoned to work at 3AM to update a status page.
But the last thing you want to do is put inaccurate information onto a status page; so mere administrative personnel isn't enough -- you'd need people who understand enough about the system to be able to write about it without introducing errors.
I'm guessing that the intersection of "administrative personnel", "willing to carry pagers" and "understand the internals of AWS services" is a very small set.
Except "willing to carry pagers" is currently the basis of employment at Amazon, and not just for AWS but for whole chunks of their technical business. It's one of the many reasons why they have a pretty dire reputation (see plenty of discussions on here from former Amazon employees).
They also claim to have "customer obsession" as a leadership principle; this whole thread is an excellent example of that failing in a big way.
This is a situation where "we pay you, now do as you're told" comes in handy.
Not every job can be full of self-directed aspirational spiritual awakenings. If that were the case, nobody would deliver my dinner on a bike when it's -20ºF outside.
>I'm guessing that the intersection of "administrative personnel", "willing to carry pagers" and "understand the internals of AWS services" is a very small set
Being a non-engineer doesn't mean they don't know anything about the technology. And they don't need to know the internals, just enough to convey information from the engineers and managers to the public.
Plenty of other organizations manage resolving issues while transmitting information about the issue to other stakeholders.
Also, most administrative personnel have far fewer job opportunities than engineers. If they can get the engineers to carry pagers, they can get a PR minion to carry one.
It's yellow now. I'm pretty sure the datacenter is still there. I think they go yellow/red after a certain time has passed or someone manually changes it (probably rarely)
Seems that "S3 offline" is the AWS equivalent of a datacenter falling into the ocean. Gotta wonder how many services are using S3 as a faux message queue?
1:52 AM PDT We are actively working on the recovery process, focusing on multiple steps in parallel. While we are in recovery, customers will continue to see elevated error rate and latencies.
Eventually the undefined behavior of this chatter will result in unallocated memory slowly churning in the garbage collection of time. Some day one of the sectors of unallocated memory will be executed resulting in a self replicating program. This program will evolve and multiply, pondering on the vastness of the S3verse, forever in search of Root.
There was a period in which I couldn't launch instances. Meaning the instances never reached "running" according to the console and were not responsive to initial ssh attempts (knife timed out after 5 minutes and the machines were still unavailable after several more).
2:38 AM PDT We continue to execute on our recovery plan and have taken multiple steps to reduce latencies and error rates for Amazon S3 in US-STANDARD. Customers may continue to experience elevated latencies and error rates as we proceed through our recovery plan.
I can't help but be a little surprised that Heroku's entire build system is disabled by an S3 failure in one region. Now I'm unable to add a notice about the issues to my site's HTML...
Since S3 is the de facto artifact delivery system for most people that run on AWS, it's not much of a surprise. For the most part, very isolated incidents aside, S3 is rock solid. Even EC2 relies on S3 for launching non-EBS instances.
"1:08 AM PDT We believe we have identified the root cause of the elevated error rates and latencies for requests to the US-STANDARD Region and are working to resolve the issue."
This is pretty much the case. Years of evangelising the idea that (a) everybody should be on Amazon and (b) everybody should be on the cheapest regions of Amazon mean that while the underlying datacentres are probably much better managed, individually speaking, than the tapestry of colos that made up the world a decade ago, an outage has much more wide-ranging effects than it would have back then.
$ vagrant up
Bringing machine '...' up with 'virtualbox' provider...
==> ...: Box 'debian/jessie64' could not be found.
...
...: Downloading: https://atlas.hashicorp.com/debian/boxes/jessie64/versions/8.1.0/providers/virtualbox.box
An error occurred while downloading the remote file. The error message, if any, is reproduced below. Please fix this error and try again.
The requested URL returned error: 500 Internal Server Error
VagrantCloud, now known as Atlas, is merely a redirector service for Vagrant box management. I host all my own boxes (on S3), but still use them for an easy means of sharing without having to remember a long-ass URL.
Should be back to normal now. The latest update is:
3:46 AM PDT Between 12:08 AM and 3:40 AM PDT, Amazon S3 experienced elevated error rates and latencies. We identified the root cause and pursued multiple paths to recovery. The error has been corrected and the service is operating normally.
Depending on your use case, it may be slightly easier to accomplish this with s3 event notifications + AWS Lambda to write to a different region or service.
Importantly, make sure you CNAME your bucket under your own domain so that you can switch services.
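For what it's worth, the notification-plus-Lambda handler can be tiny. A rough sketch in Java, assuming the aws-lambda-java-events and AWS SDK dependencies, an ObjectCreated notification wired to the function, a made-up destination bucket (named so it can be CNAMEd under your own domain as suggested above), and eu-west-1 as an example failover region:

    // Sketch: copy each newly created object into a bucket in another region.
    import com.amazonaws.regions.Region;
    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.S3Event;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;

    public class CrossRegionCopyHandler implements RequestHandler<S3Event, String> {

        // Client pinned to the failover region (eu-west-1 is illustrative).
        private final AmazonS3Client s3;

        public CrossRegionCopyHandler() {
            s3 = new AmazonS3Client(); // default credential provider chain
            s3.setRegion(Region.getRegion(Regions.EU_WEST_1));
        }

        @Override
        public String handleRequest(S3Event event, Context context) {
            for (S3EventNotificationRecord record : event.getRecords()) {
                String sourceBucket = record.getS3().getBucket().getName();
                String key = record.getS3().getObject().getKey();
                // Server-side copy; the object bytes never pass through the Lambda function.
                s3.copyObject(sourceBucket, key, "assets-backup.example.com", key);
            }
            return "ok";
        }
    }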
Apache jclouds® is an open source multi-cloud toolkit for the Java platform that gives you the freedom to create applications that are portable across clouds while giving you full control to use cloud-specific features.
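To make that concrete, the portable BlobStore API looks roughly like this. A minimal sketch: the "aws-s3" provider id is standard jclouds, while the credentials and container name are placeholders.

    // Sketch: upload a blob through jclouds' provider-neutral BlobStore API.
    // Swapping "aws-s3" for another provider id is the point of the abstraction.
    import org.jclouds.ContextBuilder;
    import org.jclouds.blobstore.BlobStore;
    import org.jclouds.blobstore.BlobStoreContext;
    import org.jclouds.blobstore.domain.Blob;

    public class PortableUpload {
        public static void main(String[] args) {
            BlobStoreContext context = ContextBuilder.newBuilder("aws-s3")
                    .credentials("ACCESS_KEY_PLACEHOLDER", "SECRET_KEY_PLACEHOLDER")
                    .buildView(BlobStoreContext.class);
            try {
                BlobStore blobStore = context.getBlobStore();
                Blob blob = blobStore.blobBuilder("hello.txt")
                        .payload("hello from a portable client")
                        .build();
                // "my-container" is a hypothetical, pre-existing bucket/container.
                blobStore.putBlob("my-container", blob);
            } finally {
                context.close();
            }
        }
    }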
Time and priorities. There is a difference between "I wish this thing existed, so I can use it/contribute to it" and "I need it so badly that I'm willing to spend a lot of time to make it production ready".
Looks like this brought down Typekit too.
"Font Network is experiencing issues caused by an outage at our storage provider."
http://status.typekit.com/
That would explain why my console was not performing well even though http://status.aws.amazon.com/ says "Service is operating normally". Good thing their API seems to be functioning during the outage, for me at least.
I started receiving lots of alerts from my side project https://StatusGator.io which monitors status pages. It's astonishing to me how many services depend on AWS directly or indirectly.
The most ironic thing I've seen in a while: their homepage features icons of services they monitor. Many of them 503-fail - they are hosted on CloudFront.
This happens from time to time, especially when replacing files frequently! I submit things to S3 every day; if you're uploading a chunk of files you'll get errors every now and then when replacing them.
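The standard answer for those sporadic 500/503s is to retry with backoff. The AWS SDKs already retry throttling and 5xx responses with exponential backoff by default; here is a minimal Java sketch that just raises the retry ceiling during flaky periods (the bucket, key, and retry count are all illustrative):

    // Sketch: raise the SDK's built-in retry limit for transient S3 errors.
    import java.io.File;

    import com.amazonaws.ClientConfiguration;
    import com.amazonaws.services.s3.AmazonS3Client;

    public class RetryingUpload {
        public static void main(String[] args) {
            ClientConfiguration config = new ClientConfiguration()
                    .withMaxErrorRetry(10);             // arbitrary, not a recommendation
            AmazonS3Client s3 = new AmazonS3Client(config); // default credential chain

            s3.putObject("my-bucket", "report.csv", new File("report.csv"));
        }
    }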
I'm also having issues connecting to buckets based in Ireland (eu-west-1). Just hangs at authentication stage. Tried from 3 different internet connections, all having the same problem.
3am "Dry Run" Staging on Heroku...fail
7am Deployment to Production....fail
Now: "We have confirmed elevated latencies affecting our SendEmail, SendRawEmail and SendSMTPEmail APIs in the US-EAST-1 Region and are working to address the problem."
Which is perfect since most of our PO orders are being placed between 6am-10am.
Tonight, adult beverages will be needed after everything is resolved.
Same here. Our S3 services are reporting similar 503s and network timeouts. A few of our partners are already down as well with their own 500s. Another stormy night in the cloud.
Probably just fallout from the fact that EBS snapshots are stored in S3. If you can't create an EBS volume, you won't be able to launch an EC2 instance from it.
Amazon Elastic Compute Cloud (N. Virginia)
Increased API Error Rates
12:51 AM PDT We are investigating increased error rates for the EC2 APIs and launch failures for new EC2 instances in the US-EAST-1 Region.
Haha, if two hours is nothing to you, that's pretty lucky. We work 24x7 (global customers; yes, the US is a bit busier, but not by a lot) and we're an Australian company, so not a whole lot of fun.