Ask HN: Service Idea: Cloud based image hosting

jacquesm · on Sept 8, 2010

I built exactly that about a year ago, serving up billions of images (not billions of different images but billions of requests) every day.

The way I ended up doing the scaling (we didn't need any other format conversions) was to use the retrieval URL plus a 404 handler that checks the requested size against a list of 'allowed' sizes (so that it doesn't become a potential attack vector). So if you request a file in an allowed size that hasn't been made yet, it gets created on the fly.
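A minimal sketch of that 404 path, under stated assumptions: the directory layout (`root/client/size/id`), the size names, the digits-only id check, and the `resize` callable are all hypothetical and not taken from the system described above. The real handler presumably called an image library where `resize` is injected here.

```python
import os
import tempfile

# Hypothetical whitelist: only these variants may ever be generated.
ALLOWED_SIZES = {"thumb": (100, 100), "medium": (640, 480)}


def handle_missing(root, client, size_name, image_id, resize):
    """Run when the requested file does not exist yet (the 404 path).

    Returns the path of the freshly created variant, or None when the
    request should stay a genuine 404.
    `resize(src_path, dst_path, (width, height))` is supplied by the caller.
    """
    # Refuse sizes that are not whitelisted, so arbitrary URLs cannot
    # make the server generate unbounded variants.
    if size_name not in ALLOWED_SIZES:
        return None
    # Assumption: ids are purely numeric, which also rules out "../" tricks.
    if not image_id.isdigit():
        return None
    original = os.path.join(root, client, "original", image_id)
    if not os.path.isfile(original):
        return None

    target_dir = os.path.join(root, client, size_name)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, image_id)

    # Write to a temporary file and rename it into place, so a concurrent
    # request never sees a half-written image.
    fd, tmp = tempfile.mkstemp(dir=target_dir)
    os.close(fd)
    resize(original, tmp, ALLOWED_SIZES[size_name])
    os.replace(tmp, target)
    return target
```

Once the variant exists on disk, later requests for the same URL never reach the handler; they are served as plain static files (and, with a cache in front, mostly not even that).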

The whole thing has been up and running across 9 servers for a year now. It has triple replication and a bunch of Varnish servers on the front end to make it fast.

We have two ways of putting data on there: one through an API that accesses the servers directly, the other using a queuing mechanism.

To improve the legibility of the URLs we used a virtual path rather than a bunch of parameters.

so http://mycdn.com/storage/client/format/id/id/id/id/id/id/id....

where the 'id' bits are 2 digits from the image identifier.
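As a rough illustration of that layout, here is one way the path could be built. The function name, the `/storage` prefix default, and the left-padding of odd-length ids are assumptions made for the sketch, not details from the system above.

```python
def virtual_path(client, fmt, image_id, prefix="/storage"):
    """Build a path of the form prefix/client/format/id/id/id/...

    Each 'id' segment holds two digits of the image identifier.
    """
    digits = str(image_id)
    if not digits.isdigit():
        raise ValueError("image_id must be numeric")
    # Assumption: odd-length ids are left-padded with a zero so that
    # every segment is exactly two digits wide.
    if len(digits) % 2:
        digits = "0" + digits
    parts = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return "/".join([prefix, client, fmt] + parts)


print(virtual_path("client", "thumb", 12345678))
# /storage/client/thumb/12/34/56/78
```

Splitting the id this way also keeps any single directory from holding more than 100 entries per level, which tends to matter once there are millions of files on disk.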

The nodes have 4TB of storage each. Originally we used XFS, but deletion was too much of a bottleneck, so we ended up switching the system to EXT3 after it was already live, which improved performance quite a bit.

I'm sure that if you build this 'properly' (as in nicely abstracted, multi-user, with redundancy by using multiple locations and so on) there is a market for it, but I'm not sure how big that market would be.

So yes, this probably has legs.

al_james · on Sept 8, 2010

Sounds very similar to what we have here, except we are storing the original files on S3 to avoid replication / redundancy issues.

I guess getting the pricing right is key to working out demand.

ritonlajoie · on Sept 8, 2010

Is what you built public or private?

jacquesm · on Sept 8, 2010

Private. Building this taught me that a lot of stuff I thought was 'easy' is actually pretty hard when you need to do it often enough :)

I always thought live video was hard, it turns out large numbers of images is actually much harder. That really surprised me.

arfrank · on Sept 8, 2010

Google App Engine just released something similar to this a few weeks ago:

Announcement: http://googleappengine.blogspot.com/2010/08/multi-tenancy-su...

Docs: http://code.google.com/appengine/docs/python/images/function...

It could be set up to do image resizing on the fly per URL parameters you pass to it, and storage/bandwidth is cheaper than S3 if I recall correctly. It's based on the same infrastructure as Picasa.

Edit: In fact it could be easily used to create such a service rather than having to build out the functionality oneself.

al_james · on Sept 8, 2010

That's very interesting. Thanks!

arfrank · on Sept 8, 2010

No problem, let me know how it goes; my email is in my profile. Your question piqued my interest in building such a service on top of GAE with just basic API access and billing for usage. As long as you cover the hosting cost Google charges you, it'd seem to be relatively straightforward. I'm just not sure if there is enough control built in to determine what bandwidth went where.

nl · on Sept 8, 2010

GAE (can) serve images out of the blobstore, so you can monitor statistics when you tell it to get the image out. Docs are here: http://code.google.com/appengine/docs/java/images/overview.h...

(I expect you'd probably want to use memcache to cache images rather than hit the blobstore every time, though)

(Note that you pretty much have to keep the images in the blobstore because you don't have filesystem access. You might be able to keep them in the datastore if you wanted, but those are the only two App Engine options)

al_james · on Sept 9, 2010

Hmmmm.... The 1 MB limit in and out of the image service could be a problem.

nl · on Sept 9, 2010

The image service is nice, but really only needed if you want to do transforms. You can serve raw image data out of the blobstore.

BobbyH · on Sept 8, 2010

WordPress does this, but rather than serve the images from S3, it uses S3 to store the images and populate a self-hosted Varnish cache: http://blog.apokalyptik.com/2007/10/10/so-you-wanna-see-an-i... This reduces the S3 bill by an order of magnitude (http://ma.tt/2007/10/s3-news/), so you may want to consider this approach.

In fact, using this approach, you could use S3 (just storage) to undercut S3 (storage+bandwidth) on cost and get lots of customers! I'd be in! :-)
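The idea of a self-hosted cache in front of the storage backend can be sketched as a read-through cache. This is only a toy model: the class and the `fetch_origin` callable are hypothetical, an in-memory dict stands in for Varnish, and a real setup would also need eviction and expiry.

```python
class ReadThroughCache:
    """Serve from a local cache; go to the origin only on a miss."""

    def __init__(self, fetch_origin):
        # fetch_origin(key) -> bytes; stands in for a GET against the
        # storage backend (hypothetical, injected by the caller).
        self._fetch_origin = fetch_origin
        self._store = {}
        self.origin_fetches = 0  # how often the origin was actually hit

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._fetch_origin(key)
            self.origin_fetches += 1
        return self._store[key]


# Toy origin holding one image.
origin = {"a.jpg": b"image-bytes"}
cache = ReadThroughCache(origin.__getitem__)
for _ in range(1000):
    cache.get("a.jpg")
print(cache.origin_fetches)  # 1
```

In this model a thousand requests cost one origin transfer, which is the effect that shrinks the bandwidth part of the bill while the storage part stays the same.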