How to use Amazon’s S3 web service for Scaling Image Hosting

d2viant · on July 14, 2010

"We could push our images to Amazon, and never have to worry about backing them up, or keeping extra copies in case of hardware failure"

Perhaps a clarification, but Amazon does not guarantee your data as part of S3. They will do their best job (and have a great track record), but just like any other service -- failures will occur. Ultimately their SLA may provide for reimbursement of downtime, but that won't get your data back.

That being said, S3 is still (probably) magnitudes better than what you can homegrow.

johnswamps · on July 14, 2010

Related: http://news.ycombinator.com/item?id=1334187. Some guy was storing his data on Amazon EBS and it was all lost (not sure how EBS compares to S3 in terms of probability of data loss).

timf · on July 15, 2010

Werner Vogels says S3 is designed "for 99.999999999% durability." That is eleven nines.

(From http://www.allthingsdistributed.com/2010/05/amazon_s3_reduce... )

EBS volumes have a theoretical "annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume."

(From http://aws.amazon.com/ebs/ )

Considering how many EBS volumes there are, that is not that low of a percentage for us to see many cases. It happens and the case at that link is not the only one.

But that person didn't take the advice, he did not snapshot his EBS volumes to S3. It says it right there in the description (as well as in the user's guide): "The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot."

timf · on July 15, 2010

S3 is like a giant RAID device to me, similar caveats apply from the "RAID is not a proper backup" arguments. There is not protection against deletions (malicious or accidental), writing over a file with a corrupted file, etc.

Further, there is not protection against your account becoming errant in Amazon's eyes resulting in you getting locked out. And d2viant's point remains, Amazon could mess up or be subject to something out of their control.

I'm not trying to diminish the excellent service and good track record they have, I use it myself as a piece of my total backup solution. But just a piece.

mattmanser · on July 14, 2010

Hehe, I was about to post exactly the same comment.

swindsor · on July 14, 2010

True, there is guarantee. But it's a pretty good track record - and the underline technology holds up pretty well. It's more reliable than what we'd be able to build with a small startup budget (or pay for offsite backups/storage).

mattmanser · on July 14, 2010

It's a good track record unless it happens to you, and then you're screwed if you didn't do something else to save it. I'm sure at least one person on HN has posted about losing all their data 'cause a machine failed.

It's the equivalent of not getting insurance. Yes, it saves you money, but on the other hand if you're really unlucky the consequences are truly dire.

swindsor · on July 14, 2010

http://aws.amazon.com/s3/#protecting I am totally okay with 99.999999999% durability. This is way better than I'll be able to build/afford on my own.

In case of an epic S3 disaster, we could backup our S3 buckets to another offsite storage, but we're not going to improve much further than what their giving us.

zepolen · on July 14, 2010

I've been doing the stuff in this article for almost a year now but instead of x-sendfile, I use nginx's proxy_cache which does the 'get from S3, save to disk, serve subsequents from disk' for you.

That way deploying more photo servers for speed just requires nginx and a few lines of config.

A couple notes:

1. Make sure you partition your data on your local disk properly so that there are never more than ~5000 files per directory, you can use nginx rewrite rules to do this for you (and make nginx save the actual url to nested folders).

2. If possible, make thumbnails smaller than 4kb.

swindsor · on July 14, 2010

1. Good point - I didn't bring this up in the article. When we store the images locally, we create divide end of the unique ID of the image into two levels of subdirectories. This gives us "random enough" distribution in subdirectories, and an easy way to look up the files on disk. This logic exists in our rails app, but we didn't want to expose this pathing out to end users. This allows us to change it later if we need to create further subdirectories.

2. Good tip - we try to use our judgement here balancing good design/good performance. We tend to opt for a pretty website first, then go back and make it faster with optimization.

zepolen · on July 15, 2010

Regarding "This logic exists in our rails app, but we didn't want to expose this pathing out to end users". I'm not sure if you understood, but I'm saying exactly this. nginx will take a /whatever request and convert it to /w/wh/wha/whatever on disk transparently - you can (and I have needed to) change this whenever you like with no front facing changes.

swindsor · on July 15, 2010

Oh, that's cool - I didn't know you could use matches from a location in an alias. Is that a new feature in nginx?

zepolen · on July 15, 2010

I think it's always been a feature of nginx:

    location ~ "/(.*)" {
        root /whatever/$1;

You could even do:

        set $subdir $1;

so that it can be used in another location handler.

blasdel · on July 14, 2010

I use http://github.com/markevans/dragonfly for this purpose, using rack/cache to hold on to the generated images (also with no expiration necessary).

Right now I'm working on doing streaming intermediate resizes that never load the full decompressed image into memory, because I need to be able to take uploads of 50 megapixel images and be able to respond with new versions based on user input in a timely manner. RMagick may be an extremely shitty implementation, but even a more solid one is stymied by the algorithm.

The solution is to just do something else -- for JPEGs, libjpeg has the capability to do a cheap streaming resize to 1/2, 1/4, or 1/8 by partially sampling the 8x8 DCT blocks and for PNGs MediaWiki's image server includes a utility for doing streaming resizes, though it's finicky about the exact scale factors it'll allow: http://svn.wikimedia.org/viewvc/mediawiki/trunk/pngds/

elektronaut · on July 14, 2010

Thanks for the link, I haven't seen that plugin before.

We use a plugin I wrote (http://github.com/elektronaut/dynamic_image/), which seems to be based on similar ideas on syntax. For caching we just do regular page caching, with no expiration. Since the request doesn't have to hit the backend at all (except for the first view), this is pretty damn fast.

swindsor · on July 14, 2010

Dragonfly looks pretty cool. I like the approach of using rack (our app was written pre-rack, but it's been upgraded to newer versions of rails, so we could make it rack-enabled).

Have people written different datastores (like S3?). I love the idea of a modular, pluggable image server that can choose different backends based on business need/scaling.

Luckily, we deal with pretty small pictures since they are just pictures of our users & their classes. We don't really have any need for handling larger sizes, and our biggest use cases surround generating thumbnails (similar to many social networking web apps).

jasonkester · on July 15, 2010

You're not done.

You've done 99% of the work to get your images serving quickly to your users, but then you stopped. Why??? Cloudfront takes exactly four minutes to set up, and it's a full-blown CDN.

Nobody, not even Amazon, recommends serving content directly from S3 anymore. Either this article is 2 years out of date or the author doesn't understand his subject as well as he thinks.

brlewis · on July 15, 2010

For whatever reason they want to serve arbitrary sizes on the fly. They do mention cloudfront in their future optimizations section, but complain about the price of CDNs in general. Cloudfront looks awfully cheap unless I'm missing something.

jasonkester · on July 15, 2010

Unless their business revolves around serving arbitrarily sized versions of user images, they're probably doing that wrong too. The "solved" way to deal with this is to generate all the sizes your app will need and store them on the CDN. Resizing images is pretty much the most expensive thing you do in a run-of-the-mill web app. Best to do it once and be done with it, especially now that cloud storage is so cheap.

And speaking of cheap, Cloudfront is as close to free as S3 itself. It will roughly double your S3 bill, leaving you paying a grand total of $8.04 per month for your image hosting on a ~1,000,000 unique/month site. Remember the 90's when that would run you $400/month? It's so cheap that it's not worth calculating. Just flip on Cloudfront, configure your CNAME and get on with your life.

ant5 · on July 15, 2010

For whatever reason they want to serve arbitrary sizes on the fly.

The primary advantage here is that you don't have to coordinate generation of new images just because someone, somewhere requires a specific size in some client of your image system.

Of course, just stick a CDN in front of your on-demand scaling implementation, potentially backed by S3, and you're done.

swindsor · on July 20, 2010

We wanted to be able to serve arbitrary sizes so we can handle different demands for thumbnails and new use cases. Having to resize all of your images because you've re-designed your homepage sucks, and is no fun (especially if you have a large set of images).

Cloudfront is really enticing, though. For a cheap CDN, if we really had the need, we'd probably migrate to cloudfront, then come up with a task to batch resize our S3 images, then stick them back into S3. This could probably be done on a one-off task on ec2 by spinning up a few instances, or even with Hadoop.

If anyone's taken this approach, I'd love to see it!

minalecs · on July 14, 2010

nice write up.. just today I asked for a service that would do this http://news.ycombinator.com/item?id=1514991 ( sorry for thread hijack) but basically I didn't want to do image processing on the same instance as the app is running because of the constraints of cloud instances.

Based on your steps listed here 1.Handle request 2.Fetch original source image from S3 3.Resize/apply effects 4. Return result back to user. Why not just resize/apply at time of upload and store in S3... are you resizing and applying effects on every request for an image ?

jrnkntl · on July 14, 2010

From the article:

"For performance, each of our image servers cache the source, and any resize, locally to disk. Since images are never updated (only created), and get a unique ID for each one, we don’t have to worry about cache invalidation, only expiration. We can then write a simple script to remove images from this disk cache with files of an access time greater than a certain threshold (say 30 days). That way, if we change from one size thumbnail to another, eventually the old thumbnail sizes will get purged."

daryn · on July 14, 2010

Like jrnkntl said, we cache all resized copies.

The reason we resize and apply effects at request time instead of at upload time is that our design needs may change over time, and this way we don't need to go back and batch reprocess things.

cracell · on July 14, 2010

I've been wondering for a bit if there's enough potential profit to build an image upload processing service. Image uploading and manipulation is one of the very common pain points to Rails applications and I assume a lot of web applications in general.

A nice pain free, embed the uploader and set some settings and then the 3rd party processes the images and then uploads them to your S3 and pings you. With the ability to reprocess old images to a new size and integrate with attachment_fu or paperclip (or has it's own similar plugin)

sstrudeau · on July 14, 2010

Yes. One reply mentioned transloadit.com just launched. This is also in Drop.io's vision for where they're going in the future. Transloadit.com preprocesses the images, which are great for use cases where you expect you'll need all or most of your processed images most of the time and you don't frequently change your image size requirements.

For use cases where you expect only to need a fraction of the possible image "shapes" for any given asset but can't easily predict it ahead of time, on-the-fly generation is very attractive.

I've been back-of-the-mind sketching my ideal service along these lines for a year or more -- if my imaginary ideal existed, it would save me a ridiculous amount of grief.

bittersweet · on July 16, 2010

Just found http://www.uploadjuicer.com/ that does this as well.

edit: Sorry just found out they only do resizing/rescaling, you still have to take care of the initial upload yourself.

stympy · on July 20, 2010

I have just published a gem and a sample Rails app that show how you can directly upload to S3, then call the API at uploadjuicer.com:

http://github.com/uploadjuicer/

niels · on July 14, 2010

transloadit just launched. They do just that. Looks pretty cool.

swindsor · on July 14, 2010

I could see where that would be useful. Image uploads in our case are actually handled by our existing web app, so our image servers can be ignorant of user authentication/etc. We end up using swfupload, which is a nice client experience, and then in the callback we call directly into our image servers. Since it's an ajax experience, it's pretty nice, but if we had very large images, or if we had to pre-process out a number of different sizes/etc I could see where kicking off an async job would be useful.

It would be really cool to have an EC2 instance handle these resizes and put the images in new buckets for you - Might be a really neat implementation.