
Most of this is relatively straightforward and unsurprising. But the one part that grabbed me is the bit about "jittering". They insert random delays into timed events (the example given is cache expiration) to prevent a thundering herd problem when all parts of the distributed system see the event at the same time (and, for popular content, presumably repopulate the cache from the backend simultaneously).

This is simple enough once described, but it's not a technique I've seen applied much in practice or discussed in the community. I'm wondering if it's something that gets reinvented for every project that needs it, or if it's secret sauce known only at YouTube. Regardless, I thought it was pretty insightful.
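
For anyone who hasn't seen it in code, here's a minimal sketch of the idea (the names and the 10% spread are my own choices, not from the talk):

    import random

    BASE_TTL = 300   # seconds; hypothetical base expiration time
    JITTER = 0.10    # spread expirations across +/-10% of the TTL

    def jittered_ttl(base_ttl=BASE_TTL, jitter=JITTER):
        # Randomize the TTL so keys written at the same moment
        # don't all expire (and get repopulated) at the same moment.
        return base_ttl * random.uniform(1 - jitter, 1 + jitter)

    # e.g. cache.set(key, value, ttl=jittered_ttl())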




As a data point, we use jitter a lot in cache infrastructure at Facebook.


The Adblock Plus blog details the thundering herd problems they faced. Their ad-blocking lists checked for updates every five days. Eventually, many users' update schedules converged on Mondays, because office computers don't run over the weekend: updates that had been scheduled for Saturday or Sunday would spill over to Monday.

https://adblockplus.org/blog/downloading-a-file-regularly-ho...
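
One generic fix in the spirit of this thread (I'm not claiming it's exactly what Adblock Plus shipped) is to jitter the next check time instead of using a fixed five-day interval:

    import random, time

    CHECK_INTERVAL = 5 * 24 * 3600  # five days, per the blog post

    def next_check_time(now=None):
        # Add up to a day of random jitter so fleets of clients
        # drift apart instead of converging on the same Monday.
        now = time.time() if now is None else now
        return now + CHECK_INTERVAL + random.uniform(0, 24 * 3600)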


Windows does much the same thing w.r.t. policy refresh (which sucks down files from domain controllers) and update of "occasionally updated" metadata like last logon timestamp.


A related technique is to use distinct prime numbers for periodicities in the first place:

http://www.stdlib.net/~colmmacc/2009/11/26/period-pain-3/

(http://www.stdlib.net/~colmmacc/2009/09/27/period-pain-part-... http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/ for the background material).

The reasons primes are optimal should be obvious :)
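
A quick worked example of why: two periodic jobs coincide every lcm(p, q) ticks, and co-prime periods push that coincidence as far out as possible for periods of a given size:

    from math import lcm  # Python 3.9+

    print(lcm(60, 90))  # 180  -> the jobs line up every few runs
    print(lcm(59, 89))  # 5251 -> nearby primes almost never line up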


As an interesting counterpoint, Jeff Dean at Google says they prefer to take a known hit every once in a while, so they have all the cron jobs go off at the same time rather than introducing jitter. It perhaps has to do with their emphasis on reducing variability to control the long tail of the latency distribution.


A nice extension of this is to use exponentially distributed delays. Recurring events then form Poisson processes, which are easy to reason about when composed: e.g., if you have N servers firing events at the same exponentially distributed rate, the event times are distributed as if a single server were firing events at N times the rate and handing jobs uniformly at random to the N servers.

http://en.wikipedia.org/wiki/Poisson_process
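
A sketch of that, with a made-up rate (random.expovariate draws exponentially distributed delays, so the resulting event stream is a Poisson process):

    import random

    RATE = 1 / 60.0  # hypothetical: one event per minute on average

    def next_delay(rate=RATE):
        # Exponential inter-event delays make the event stream a
        # Poisson process; N servers at rate r compose to rate N*r.
        return random.expovariate(rate)

    # e.g. time.sleep(next_delay()); fire_event()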


A counterexample is that the Linux kernel tries to schedule timer events for the same deadline time. That allows the processor to sleep longer because the kernel doesn't need to wake up as often just to handle 1 or 2 timer events.


It's kind of a weird way to describe it. The better way is that you want your caching to be probabilistic rather than deterministic. In general, you want to avoid anything that would get the nodes in a distributed system to harmonize.

The other way to solve the problem, though, is to handle that circumstance cleanly. There are ways to resolve a thundering herd without creating a scalability problem.
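
One way to make the expiry probabilistic (my sketch of the general idea, not necessarily what the parent has in mind): let each reader refresh early with a probability that grows as the deadline approaches, so recomputation is spread across readers instead of stampeding at expiry:

    import random, time

    def should_refresh_early(expires_at, window=30.0):
        # With `window` seconds left, a given reader refreshes with
        # probability rising from 0 to 1 as expiry approaches.
        remaining = expires_at - time.time()
        if remaining <= 0:
            return True
        return random.random() > remaining / window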


I've heard it called "splay" before e.g. `chef-client` will take an interval option and a splay option. Splay is a random amount of time added to the interval to prevent thundering herds.
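
Roughly, that combination behaves like this (a sketch of the idea, not chef-client's actual implementation; the parameter values are made up):

    import random, time

    def run_periodically(task, interval=1800, splay=300):
        # Sleep for the fixed interval plus a random splay, so a
        # fleet of nodes on the same interval doesn't fire in lockstep.
        while True:
            time.sleep(interval + random.uniform(0, splay))
            task()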


On this topic, an interesting paper from 1994 on adding randomization to network traffic sources:

http://ee.lbl.gov/papers/sync_94.pdf


Fascinating paper, even if it is 18 years old now. Thanks.


In the past I've seen jitter used in all sorts of applications, from cron jobs to configuration management to memcache key expiration. Any time you have a crapload of nodes that all need to do an operation at a specific time, you rely on jitter to keep resources from bottlenecking. It's probably used anywhere the systems or network have a minuscule amount of resource headroom (like cluster nodes that run at 93% utilization).


Same in multicast DNS which is used in zeroconf:

http://files.multicastdns.org/draft-cheshire-dnsext-multicas...

Search for "random". Cool spec written by Stuart Cheshire of Apple.


I hadn't thought about it in the general case, but I do frequently find myself adding a sleep for PID mod some appropriate constant to the beginning of big distributed batch jobs in order to keep shared resources (NFS, database, etc) from getting hammered all at once.
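
Something like this at the top of each job (the one-minute window is arbitrary):

    import os, time

    STAGGER = 60  # hypothetical: spread starts across a one-minute window

    # Stagger startup by PID so concurrent batch jobs don't hit the
    # shared resource (NFS, database, ...) at the same instant.
    time.sleep(os.getpid() % STAGGER)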


Why is this not done in electronic exchanges to render HFT operations pointless?


First come, first served is one of the rules that makes the exchanges the exchanges.



