
> My business is such that external events cause immediate spikes. My traffic might double because Apple released a new firmware, or might go up 10x because someone released a jailbreak without warning me. That capacity requirement quickly trickles down and settles to its original levels over the next two months until it spikes again.

Have you actually had this occur in real life, where you had to spin up new instances during these spikes? What kind of database configuration were you using such that it could accommodate all those new application server instances? Do you also add new database slaves on the fly?

When this article made the point that this "sounds good in theory, but never happens in reality", that was my experience too. We were on PostgreSQL, and the notion that we'd just "add 20 instances" when we had a load spike was ridiculous. I'm just curious who is actually doing this, and whether they are also using relational databases.




Here is a graph I generated a few weeks ago: we've since had yet another major traffic spike due to the release of Absinthe 2.0 with Rocky Racoon (an untethered jailbreak for iOS 5.1.1), which is actually one of the most intense spikes yet (but I am on my iPhone and can't make new graphs).

http://test.saurik.com/hackernews/absinthe.png

I over-allocate the database server for Cydia, but spin up new web servers on demand. I then keep as much of the CPU-intensive work off the database as I can, store as many static assets as possible on services such as S3, and use distributed queued logging (RELP).

For JailbreakQA's database (where downtime isn't that important), I stop the instance, change its instance type (say, from m1.large to c1.xlarge), start it again, and have a drastically different machine with only a minute of downtime. EC2 is a godsend (for me).
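
That stop / resize / start dance is also easy to script. Here's a rough sketch using boto3 (the instance ID is a placeholder, and the waiters just block until EC2 reports each state change):

    import boto3

    ec2 = boto3.client("ec2")
    instance_id = "i-0123456789abcdef0"  # placeholder instance ID

    # Stop the instance and wait until it is fully stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Change the instance type while it is stopped (e.g. m1.large -> c1.xlarge).
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": "c1.xlarge"},
    )

    # Start it back up and wait until it is running again.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])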


It's significantly more difficult to scale a traditional relational database (although not impossible!) than to scale the web/app layer that sits in front of it. Snapshot + clone + some kind of sync middleware (like pgpool for Postgres) can probably get you 80-90% of the way there. Rearchitecting so that your db server is not the bottleneck should help as well.

Maybe you need a master/slave setup, and on huge load you flip the slave over to an instance type with quadruple the RAM and CPUs for a few hours, then back to a single-core, low-memory instance to keep the data sync flowing. There's a million ways to skin this cat.
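
To make the read/write split concrete, here's a rough sketch of doing it at the application layer with psycopg2 (pgpool does the same thing transparently in the middleware; the hostnames and table here are made up for illustration):

    import psycopg2

    # Hypothetical DSNs; in practice these come from config or DNS.
    PRIMARY_DSN = "host=db-primary dbname=app user=app"
    REPLICA_DSN = "host=db-replica dbname=app user=app"

    primary = psycopg2.connect(PRIMARY_DSN)
    replica = psycopg2.connect(REPLICA_DSN)

    def run_query(sql, params=None, write=False):
        """Send writes to the master, reads to the (resizable) slave."""
        conn = primary if write else replica
        with conn, conn.cursor() as cur:
            cur.execute(sql, params or ())
            return cur.fetchall() if cur.description else None

    # Reads hit the slave, which can be flipped to a bigger instance type under load.
    rows = run_query("SELECT id, name FROM items WHERE active = %s", (True,))

    # Writes always go to the master so replication keeps flowing one way.
    run_query("UPDATE items SET hits = hits + 1 WHERE id = %s", (42,), write=True)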

If your database itself is the bottleneck, then, yeah, on the fly flexibility might be difficult to achieve.

In his case, a relational database probably isn't the bottleneck at all, and scaling out caches, web front ends, etc. is all fairly straightforward. There are huge numbers of folks taking advantage of this kind of flexibility.

Hell, Amazon has a whole API you can integrate with that handles it for you (it even has $ references, so you don't accidentally spend yourself into bankruptcy because of a TC story).
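
For reference, that's presumably EC2 Auto Scaling plus CloudWatch billing alarms. A minimal boto3 sketch (the group name, SNS topic ARN, and dollar threshold are all invented for illustration):

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # billing metrics live in us-east-1

    # Scale the web tier on average CPU: add instances as load rises, shed them later.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-tier",  # hypothetical group name
        PolicyName="cpu-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 60.0,
        },
    )

    # The "$ references": alarm on estimated charges so a traffic story can't bankrupt you.
    cloudwatch.put_metric_alarm(
        AlarmName="billing-guardrail",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,
        EvaluationPeriods=1,
        Threshold=1000.0,  # invented dollar threshold
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder ARN
    )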


My company provides dynamic content in emails, and as such gets large traffic spikes when 10 million emails get sent at once and everyone begins opening them. The content's configuration (in postgres) is trivially cacheable, but our app servers render different content based on the user's context.

So we have a bunch of shared-nothing app servers that we can spin up and down based on the emails we know are going out. Automatically detecting spikes and spinning up new instances between the send and the peak is much harder, though.
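
Since the send times are known in advance, the scale-up can be scheduled rather than purely reactive. A rough sketch with boto3 scheduled actions (the group name and capacity numbers are made up):

    from datetime import datetime, timedelta, timezone

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Hypothetical campaign send time a couple of hours from now.
    send_time = datetime.now(timezone.utc) + timedelta(hours=2)

    # Scale the app tier up shortly before the blast goes out...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="render-tier",  # hypothetical group name
        ScheduledActionName="pre-send-scale-up",
        StartTime=send_time - timedelta(minutes=15),
        DesiredCapacity=40,
    )

    # ...and back down once the open-rate curve has tailed off.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="render-tier",
        ScheduledActionName="post-send-scale-down",
        StartTime=send_time + timedelta(hours=6),
        DesiredCapacity=4,
    )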


Sounds fascinating! Do you use centralized logging? If so, how do you manage that?


Yeah, we're using Cassandra for logging. Not quite as simple to scale up, but it's write-only in the request cycle and hasn't been anywhere near a bottleneck yet.
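
A minimal sketch of that kind of write-only logging path, using the DataStax Python driver (the contact points, keyspace, and table are invented for illustration; the async execute is one way to keep the write from blocking the response):

    import uuid
    from datetime import datetime, timezone

    from cassandra.cluster import Cluster

    # Hypothetical contact points and keyspace.
    cluster = Cluster(["cassandra-1", "cassandra-2"])
    session = cluster.connect("logs")

    insert = session.prepare(
        "INSERT INTO request_log (id, logged_at, path, user_context) VALUES (?, ?, ?, ?)"
    )

    def log_request(path, user_context):
        # Fire-and-forget: the async write never blocks rendering the response.
        session.execute_async(
            insert, (uuid.uuid4(), datetime.now(timezone.utc), path, user_context)
        )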



