"Amazon EBS sucks. I just lost all my data" (amazonwebservices.com)
111 points by cemerick on May 10, 2010 | 66 comments



Although I sometimes get downvoted for this, I'll say it again:

You can't outsource your liability.

If your product is a webapp, then the underlying messy bits of backups, hardware, availability and redundancy also require some amount of conscious thought on your part. Not every site/app needs its own mini-datacenter, and you might not even need your own dedicated server (though you probably do when you reach a certain minimal amount of scale). But you DO need to have someone who is thinking about backups and availability, and a valid solution is not to assume that the smart folks at Amazon or Rackspace or any other hosting provider are going to be completely and consistently working with your best interests and uptime in mind.

EVERYTHING fails at some point. Every server, every generator, every upstream connection, every hosting provider big or small. And in this case I mean fail as in: goes dark for some period of time not covered by backups or hot spares.

So, plan accordingly.


You can't outsource your liability.

Of course you can. That is the entire reason the insurance industry exists.

More practically for the instant case, I use a provider who has a turnkey backup option, rather than one which would force me to spend expensive engineer time rolling my own only to discover that I really suck at thinking through all of the design challenges of backup solutions. (Something which always seems to get discovered at the most inconvenient of times.)


What is your "insurance" for hosting? I'm not aware of any SLAs that actually pay out anything near equal value to an outage they caused. If you're paying a hosting provider $500/mo for a sever that you generate an average of $1000/hr in web sales from and they have a 3 hour outage, you'll get an SLA credit for a couple of bucks applied to your next months service.

The Internet is filled with stories of hosting-provider-supplied backups being unusable, for any number of reasons, at inopportune times.

You should have a backup copy of your code/databases in your control, on a machine that is completely independent from whatever you are doing your production hosting on. You should have done a "warm metal" install and test of that code on another server to be sure that you can recover operations in a reasonable amount of time (whatever is appropriate for your case).

For your scale (based on your posts here, my assumption: single developer, or developer with a couple of contractors; production site; 1-5K visits per month; non-time-sensitive/mission-critical service) you probably don't need high-availability auto-failover. But you SHOULD have your DNS hosted separately from your hosting provider, you SHOULD have low TTLs, you SHOULD have a backup server in a warm state at some other provider, and you SHOULD know how to at least do a basic DNS update to redirect traffic over to a backup site that either runs the service or puts up a basic, friendly "OOPS, BRB" page.
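Purely as an illustration (a hypothetical BIND-style zone snippet, not anything from the original post), a low TTL on the record you would flip during a failover looks something like this:

    www   300   IN   A   192.0.2.10   ; 300-second TTL so a manual change propagates quickly

With a 5-minute TTL, once you repoint the record at your warm backup, most resolvers pick up the change within minutes instead of hours.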

I've often thought of a startup that would basically human-automate these things for guys like you. You still wouldn't be 100% self-sufficient, but you would be able to outsource SOME of your reliance on a single provider. You'd then need two tiers of total failure (your provider, and this service) before you hit complete downtime.


> More practically for the instant case, I use a provider who has a turnkey backup option, rather than one which would force me to spend expensive engineer time rolling my own

EBS lets you create backup snapshots in S3 with a single command. The problem the (I think the epithet is warranted here) idiot who posted the original article had is that he didn't use it.


10 years later you realize you aren't covered for that cancer surgery because you didn't check off the box saying you were a smoker on the 100 page form.


I agree with this 100%. It seems that an awful lot of companies like to pass the blame as soon as something unwanted happens. Yes, it may be the underlying fault of a service provider but, to your customers, you are the only company they deal with and it's your reputation that will be impacted no matter how many posts you put on your site and all over the internet.

Using service companies does not excuse poor customer service.


This could well turn out to be a fatal flaw for enterprise adoption of third-party clouds: when corporate IT staff screw up like this, or just mess around with a control panel (check the AWS forums, where too often an "It disappeared!" complaint gets the reply "It was deleted at thus-and-so time and date using the control panel"), are they going to accept the blame or shift it to the third party?


This already happens every day when the blame is passed through to Oracle, Gartner, etc. without any acknowledgement of the fact that, bad as the software truly is, it was selected and purchased at great cost and [usually] considerable delay by the people disclaiming responsibility.

In other words, this will only be a fatal flaw if the business customers stop giving IT departments a free pass for being bad at technology.


That's true, I agree 100%.

I'd add to that though, if you're a small company sometimes you must recognize that a larger 3rd party can use scale to provide a more reliable service than you can.


"You can't outsource your liability" I have to disagree. You can define any part of your business as a "liability", so in that case you don't have to outsource anything at all. In fact a big reason people outsource tasks is because of liability. AWS promises physical redundancy, which apparently doesn't mean crap!


Saying it "doesn't mean crap" is an exaggeration. AWS's physical redundancy offers a marginal increase in reliability (roughly 3.5%, according to their stats) at potentially lower cost than providing the same system in-house because of economies of scale. They never promised to be 100% fail-safe (and would be foolish to do so.)


The point is that they are not backing up your data the way they advertise it. The whole point of physical redundancy is to eliminate single points of failure, yet from their email it seems that such a single point still exists. Also, if adding physical redundancy improves reliability by only 3.5%, it means that they have a different definition of the term.


The whole point of physical redundancy is to eliminate single points of failure, yet from their email it seems that such a single point still exists.

Which bit of "multiple failures of the underlying hardware components" leads you to think that a single point of failure still exists?


AWS promises you physical redundancy; what they don't offer you is an ironclad guarantee that nobody, customer included, will ever screw up. So you need backups. It's really simple.


Amen.

It's quite amazing how many people think they can point the finger at some third party because they 'handed it off'. After all, if it's amazon then it's as safe as the Bank of England, right?

But even banks can burn down and even amazon does not owe you a penny if they lose your stuff.

They'll say sorry, really nicely, and maybe they'll offer you some store credit. Comes in great when you have lots of free time; you can get some books, maybe on how to back up your data or so.

But if you are halfway competent you smile and take that backup and you continue your business, because the value of your business outweighs amazon's liability by a huge factor.

Backing up is risk mitigation. If the risk to you is small then you can afford to live without backups; it means that the cost of backing up outweighed the cost of re-creating the data.

If the risk is larger then you probably should accept that, and go do something about it.

Either way, the responsibility is yours.


Why is this on HN? AWS provides a great way to back up EBS volumes called snapshots. Snapshots only store the deltas from the previous snapshot, and all the work to create one is done by AWS, not the server it is attached to.

This guy didn't read the docs and did not use AWS snapshots. It was the equivalent of not having a backup strategy for your local hard drive.
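For anyone who hasn't used them: with the EC2 API tools, taking a snapshot is a one-liner (the volume ID below is made up for illustration):

    ec2-create-snapshot vol-12345678    # kicks off a point-in-time snapshot to S3
    ec2-describe-snapshots              # check that it eventually shows "completed"

The snapshot can later be turned back into a fresh EBS volume (ec2-create-volume with the --snapshot option), which is exactly the recovery path the original poster was missing.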


I agree--this is just noise. The comment in the parent post saying "I run datacenters, and things don't fail" should be crossposted to thedailywtf. EVERYTHING fails. Plan for it.


At first I thought the same thing. But I think this is a useful discussion to have on HN, since not everyone here understands probabilities or the realities of 'cloud' services.

Maybe by having this the top rated story for a bit it will make someone think twice about their virtual infrastructure and examine their assumptions for errors. Do it for the data ;)


I submitted it because stories like this are common, but the underlying issues and misunderstandings aren't often aired out as they have been nicely here. Lots of people don't understand what they are and are not getting with "cloud" providers, never mind grokking specifics like MTBF stats (and similar).


This post happens about once a week on the Amazon forums. I've watched it play out dozens of times on the S3 and Cloudfront forums too, and every single time it turns out to be operator error.

In this case, the guy didn't realize he needed to take snapshots of his volumes. It's not surprising, really, since the documentation isn't so great for AWS, and it's probably even more painful knowing that it would have been a single button click to back up his volume using Amazon's tools.

But in the end, there's nothing to see here. Just like the guy who wakes up in the morning to find all his S3 files mysteriously gone (after he 'renamed' his bucket the previous night by dropping and recreating it), it always turns out to be the user shooting himself in the foot.

And in the cases when Amazon actually does something wrong, they're always on top of it immediately and back with a public explanation within hours. (from my experience)


OTOH, if operators keep making the same mistake over and over again, maybe the UI should be changed. I think there's a fundamental mismatch in the EC2 control panel because it looks like anyone should be able to use it, but you have to be a competent sysadmin to use it safely.


Reuven Cohen makes a more detailed version of this point: "[Amazon] expect a certain level of knowledge of both system administration as well as how AWS itself has been designed to be used. Newbies need not apply or should use at you're own risk. Which isn't all that clear to a new user, who hears that cloud computing is safe and the answer to all your problems. ... You need the late adopters for the real revenue opportunities, but these same late adopters require a different more gentle kind of cloud service... As IaaS matures it is becoming obvious that the "Über Geek" developers who first adopted the service is not where the long tail revenue opportunities are." http://www.elasticvapor.com/2010/05/failure-as-service.html


> "expect an annual failure rate (AFR) of between 0.1% –0.5%, where failure refers to a complete loss of the volume"

Well, I think the OP has just experienced a sample from the probability distribution characterized above.


There are people in the world who do not understand probability. Like the OP:

EBS is either reliable or not. You cannot be a little pregnant.

Even the OP's metaphor is broken! The saying "you cannot be a little bit pregnant" may be traditional, but it too is only an approximation. There are these things called miscarriages. They happen all the time, often before a woman realizes she is pregnant. Then there are the false pregnancies:

http://en.wikipedia.org/wiki/False_pregnancy

And these are just the common edge cases.


You can be "a little bit pregnant" when you're talking about large groups of people. Of the women in your country, how many are pregnant at any given time? It's probably "a little bit". Obviously someone is confusing a statistical sampling of 1 with the kind of volume Amazon must deal with.

I find it informative that a rare event such as this is being given so much press. If this was more routine, this wouldn't be news. Amazon seems to be doing better than 0.1% based on that alone.


I failed statistics. If out of a million hard drives 5000 die in a year and take 15 minutes to swap, what are the odds of 2 failing on the same machine?


Your assumptions are unreasonable, insufficiently well specified, and are asking the wrong question.

If in a year out of a million hard drives only 5000 die, then you're projecting a 200 year average lifetime per disk drive. No real disk has that. A more reasonable 5 year average lifespan gives you 200,000 failures per year. Which is much worse.

Next, you're asking about the odds of 2 failing on the same machine. How many disks are on a machine? 1? 10? 100? Are failures independent events? It makes a huge difference. In fact they are not independent because when the motherboard craps out you lose access to all disks on that machine at once. At their scale it is too much work to figure out whether some of that data is recoverable - you just assume there is another copy somewhere and throw away the stale data. If you're wrong, then oops.

You are also throwing out the 15 minute disk replacement time. It may take 15 minutes to replace a disk, but that figure is irrelevant. To replace a disk you have to locate the machine, and it has to matter enough to you to send a person out. I guarantee you that the time before a person gets involved is going to average more than 15 minutes. Generally a lot more than 15 minutes. (Google famously takes the attitude that it is generally more work than it is worth to find the broken machine, and lets most dead machines sit there indefinitely. I wouldn't be surprised if other cloud providers imitate this.)

Next you have to consider that the end user shouldn't care about machines. For the purpose of redundancy Amazon is not going to keep multiple copies of the same data on the same machine. They are going to put them in different machines, and hopefully in different places. That will reduce the odds of a single failure losing your data.

All of that said, I am somewhat shocked that Amazon would advertise a 0.1%-0.5% rate of data loss as acceptable. I don't know Google's actual failure rate, but I'd be willing to bet large amounts of money that it is much lower than that.

For instance search for "gmail lost data". The only significant gmail data loss that turns up was in 2006. (See http://techcrunch.com/2006/12/28/gmail-disaster-reports-of-m... for more.) A grand total of 60 accounts got wiped out. Subsequently most of the lost data was restored from backup. (I doubt that the error was at the data storage layer.)

That's not just better than what Amazon delivers. That is ridiculously better.


That failure rate is for EBS drives, which are essentially hot disks. EBS drives can almost instantly snapshot onto S3 backup stores.

Amazon has said they've never heard of anyone experiencing data loss on S3. So if you use EC2+EBS+S3 properly, you should not ever experience data loss except data captured since your last snapshot to S3 (which should happen extremely frequently).


To be fair you'd have to include failures of EBS snapshots and failures across multiple datacenters to gain parity with Google in your comparison. I'm sure the gmail app doesn't use its storage subsystem naively no matter what the numbers are. You're absolutely right in general though.


I'm positive that there is nothing naive in how Gmail uses storage. However I suspect that they are using standard best practices that are common throughout Google.


Lots of people who didn't fail statistics assume that events are independent. A bit like: the chance of my machine catching fire is 1 in 1000, my machine did catch fire and I lost both drives, so the odds of that happening must be 1 in a million!


What's more relevant is that the odds of either drive failing are identical. The likelihood that the first drive will fail doesn't change because the second one failed -- which is the classic statistical blunder. ("I got 42 heads in a row, the next one's BOUND to be tails!")

Every single drive in the datacenter typically has the same likelihood of failure (since they're usually the same make and model, and similar production runs), so the odds that a drive somewhere in a large data center is failing RIGHT NOW are in reality rather high.
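(Rough illustration with made-up numbers: at a 4% AFR, a facility with 100,000 drives should expect around 4,000 failures a year, i.e. on the order of ten drives dying somewhere in the building every single day.)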

The Big Mac ended up being a good example of this. The likelihood of having a memory error on a normal PC is low, because the likelihood of a single DIMM having a memory error is low.

Punch that up to the 22,000 or thereabouts DIMMs that populated the Big Mac cluster, and now you're looking at a very HIGH likelihood that a large-scale computation using the cluster will experience memory errors, and therefore produce invalid results -- which is why Virginia Tech ended up replacing the entire cluster with ECC-equipped machines in short order. Until they did that, researchers had to run simulations multiple times and compare results to make sure that their simulations weren't contaminated by memory errors.


>The likelihood that the first drive will fail doesn't change because the second one failed

No, that's the point: hardware failures on the same machine/rack/PSU/site are not independent. A power supply spike that kills drive 0 in a RAID will probably kill the mirror drive 1 as well - that's why RAID isn't a backup strategy.


You missed the point -- a power supply spike isn't a hard drive failure, even if it kills a couple of drives, it's a power supply failure.

I was explaining a common misconception about statistics, and using hard drives and DIMMs as examples.


If amazon is advertising that single failures can't cause data losses, then a single exploding power supply (for instance) breaks that agreement.


If by "exploding" power supply you mean the potential for a power supply to explode with eruptive force and destroy other equipment in the rack, I believe that the other equipment being destroyed would be considered an multiple failure. It's very unlikely that a power supply could do that (although I guess anything is possible). A proper datacenter built with Datacenter-grade equipment will have multiple redundant power supplies in a blade enclosure, fed by different rails, which come off of different main lines in to the building. So a single failed source of power won't cause a failure.


The percentages they give aren't for a single drive failure, they're for data loss (multiple drive failure). So the odds are .1-.5%.



Very low. But that assumes they are independent events, which has never been the case in my experience.


No, he experienced a 'user error'.

And those happen far more frequently.


The OP's point:

Either you are reliable or not.

If you claim that the data is backed up and so on, then you are giving a guarantee. Now yes, if all of amazon's datacenters burn down, or the one where your data is, yes it will be lost, but that should be the corner case that you prepare for. That's when you do an offsite backup in your house daily/weekly to ensure that at least there is no one place that burns down = company down the tube.

No matter the solution there is always the probability that shit will happen to all of it.

However if you are selling a reliable service, don't sell an unreliable reliable service. Reliable should mean that at least you back it up so that if one hard drive rack blows up, the data is not gone.


They didn't say it was bulletproof. They stated that data was replicated to more than one device to ensure that a single component failure would not result in data loss.

They didn't say it was bulletproof or impossible to lose data. They explain fairly well how things work, what the failure rates are, and give you the tools to do the same risk analysis and cost/benefit calculations you would do anyway, whether using a cloud service or rolling your own.

Multiple equipment failure can happen. Even across data centers and availability zones. The larger the entire AWS system gets, the higher the chances of eventually seeing edge cases where the wrong equipment at just the wrong time screws up to lose data.

If you want a bulletproof data storage system that has such a ridiculously low failure rate that you are guaranteed not to lose data for a hundred years, you can get it - but it's going to cost a heck of a lot more than anything Amazon is selling you.


There was a pretty lively exchange on twitter last night regarding this. I strongly disagree with the AWS forum poster. EBS does not suck. In fact, EBS and other services from AWS and Rackspace provide the building blocks to allow you to construct incredibly scalable, available systems.

However, you have to accept that when you use IaaS you are taking on some of the operational responsibility and you have to know what you are doing or find someone who does. If this user had been snapshotting regularly to S3, the worst thing they would have experienced is a couple of hours of downtime. All of their data would have been safe and easily recovered.

They didn't do that, and the worst case scenario that AWS clearly describes in its docs (failure of MULTIPLE devices) happened. And it will happen again, someday. Accept that and accept that failure is a feature when systems are designed properly.


The Amazon EBS page states (which the author quotes):

> As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives.

Nowhere within that does it say 0.00% failure rate, and later in the page they even describe how to mitigate the risk of losing data due to disk failure using snapshots, mirrored across availability zones.


Hard drives suck just as bad. I have a RAID-1 built from three disks out of separate batches. Somehow, I wasn't paying attention to bad sectors the RAID software couldn't fix, and all the disks failed.

Cheap 1TB disks and cheap cloud storage like EBS means that it's now cheaper than ever to lose a shit-ton of data. (I didn't actually lose anything important, the corrupted areas were not important files. But still; three drive failures in a week!)

My fatal mistake, BTW, was ordering from Newegg. Apparently they do not ship OEM drives correctly, and they are almost guaranteed to fail. I was a little suspicious when I saw a raw drive in a plastic shell with some packing peanuts around it. When I had the drives replaced, they did not come from the factory that way!


I have bought hundreds of OEM drives from NewEgg, all of them have had the plastic shell around them, and so far I've seen a 2% failure rate.

The plastic shell is the way the manufacturers ship them direct to NewEgg, that is not NewEgg's doing.


No, the plastic shelled disks are packed in solid foam in boxes holding something like 20 disks. Companies like NewEgg and Amazon break those boxes but do not follow the manufacturer's requirements (e.g. for shipping a failed disk back to them) when they repack them for individual orders.

Last time I bought disks ZipZoomFly properly repacked them by putting the shelled disks into individual foam boxes.


Drives generally come packed either in those plastic shells, or in anti-static wrap, and then packed into specially cut foam (foam with slots, like a very parallel toaster) in a box with foam at the bottom and foam at the top. If you're not buying the drives in bulk, you don't get the box with foam.


Yeah, when I get the replacements from the drive manufacturer, that's how I receive them. But I've ordered pillows that are better packed than what I get from Newegg.


Most people who don't have extended experience with large-scale data stores do not understand a basic principle: redundancy decreases the probability of data loss, but it never eliminates it completely. All massive data stores slowly bleed data; it's just that they bleed it so slowly that it's acceptable for most scenarios. In the case of this specific example, once the number of users is large enough, there will always be somebody who lost their volume.

To illustrate this: think about an a-la-GFS data store randomly triplicating blobs across 1000 nodes. Once enough data is put in (let's say 100M blobs), for any given triplet of nodes there will be some blob stored only on those three nodes. In other words, the simultaneous loss of any 3 nodes out of 1000 will always result in data loss. ("Simultaneous" in the sense of "faster than the time to detect the failure and recover".) Of course failures are not limited to node loss; there is also corruption in transit, hard drive loss, bad sectors, rack-level failures. As the volume of data and the number of nodes grows it all adds up, so even if the mean time to data loss for each particular blob is astronomically high, the probability of losing some blob on any given day is very real.


...because I didn't have my own backup."


Is there any way to set EBS to auto-snapshot on a specified time period through the existing control panel interface? Are snapshots possible through the API?


This article includes a script that will perform a consistent snapshot of an XFS-formatted EBS drive (on Linux).

http://alestic.com/2009/09/ec2-consistent-snapshot

Consistent snapshots are needed when the disk contains multiple files that must be kept in sync with each other - such as a database's data files and the journal. If your DB gets its disk snapshotted while these files are out of sync, the DB will need to recover the missing data by replaying the journal. Not all DB's can do this reliably for all types of changes (e.g. MySQL's MyISAM engine is fragile in this regard). So you need to make sure that the files in the snapshot all reflect a consistent state of the disk.


There is not; however, it is really easy to write a script to freeze the volume and execute a snapshot.

  # xfs_freeze -f /data
  # ec2-create-snapshot vol-######
  # xfs_freeze -u /data
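To run this on a schedule (which is what the question upthread was asking about), a rough sketch is to wrap the same commands in a script and call it from cron; the paths and volume ID here are placeholders:

    #!/bin/sh
    # snapshot-data.sh: freeze the XFS filesystem, kick off a snapshot, unfreeze
    xfs_freeze -f /data
    ec2-create-snapshot vol-12345678
    xfs_freeze -u /data

    # example crontab entry for hourly snapshots:
    # 0 * * * * /usr/local/bin/snapshot-data.sh

One caveat: make sure the unfreeze still runs if the snapshot command fails (a trap in the script, for example), otherwise writes to the volume will hang until someone notices.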


So that's an XFS feature, but is this really necessary on a journaling filesystem? I guess for ext3 I'd replace xfs_freeze with sync.


One of my clients has 2 windows instances. I will need to look at the windows tools to see if there is something equivalent there.


At least you didn't have to pay thousands of dollars to delete your data!


Odd, my car lost my coffee when I put it on the roof on the way to work. Good thing there was backup coffee at the office.


Do you also have RAID (Redundant Array of Independent Donuts)?


Sounds like he did not make S3 snapshots of his EBS volumes. Ouch. I feel very confident about the robustness of data that I store on AWS because I can make an S3 snapshot, and recover from that snapshot on a fresh EC2 instance to test the backup. BTW, I changed the way I use AWS: now I always make bootable EBS images, increasing the size beyond the 10 GB limit, so I snapshot my OS setup, data and apps all at the same time.


There is an easy rule of thumb.

Make a backup of your important stuff often and regularly no matter how many redundancies are in place (see Murphy's law).

Right now? Yes, like really right now if you didn't.

EDIT: Spelling


Can this also happen to an S3 bucket? How do I backup an S3 bucket? Any ideas?


I suppose it's possible, but S3 is designed for nine 9's with respect to durability. There are > 100 billion objects and I'm not aware of any being lost due to AWS fault.
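If you want a copy under your own control anyway, one low-tech sketch (assuming s3cmd is installed and configured with your keys; the bucket name is made up) is to sync the bucket down to a disk you own:

    s3cmd sync s3://my-bucket/ /backup/my-bucket/    # pull every object to local storage

That won't save you from every scenario, but it does give you a copy that a fat-fingered bucket operation in the AWS console can't touch.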


Actually, I'm wrong. It's eleven 9's. http://bit.ly/ageV9D


Also curious to know of any S3 bucket failure stories out there. I would assume this is uncommon, but I also assumed that the EBS failure rate was much lower prior to reading this.





