
Most of this is relatively straightforward and unsurprising. But the one part that grabbed me is the bit about "jittering". They insert random delays into timed events (the example given is cache expiration) to prevent a thundering herd problem when all parts of the distributed system see the event at the same time (and, for popular content, presumably repopulate the cache from the backend simultaneously).

This is simple enough once described, but it's not a technique I've seen applied much in practice or discussed in the community. I'm wondering if it's something that gets reinvented for every project that needs it, or if it's secret sauce known only at YouTube. Regardless, I thought it was pretty insightful.
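
For anyone who hasn't seen it in code, here's a minimal sketch of the idea (the names and the 10% spread are my own choices, not from the talk):

    import random

    BASE_TTL = 300   # seconds; hypothetical base expiration time
    JITTER = 0.10    # spread expirations across +/-10% of the TTL

    def jittered_ttl(base_ttl=BASE_TTL, jitter=JITTER):
        # Randomize the TTL so keys written at the same moment
        # don't all expire (and get repopulated) at the same moment.
        return base_ttl * random.uniform(1 - jitter, 1 + jitter)

    # e.g. cache.set(key, value, ttl=jittered_ttl())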




As a data point, we use jitter a lot in cache infrastructure at Facebook.


The Adblock Plus blog details the thundering herd problems they faced. Their ad-blocking lists checked for updates every five days. Eventually, many users' update schedules converged on Mondays, because office computers don't run over the weekend: updates that had been scheduled for Saturday or Sunday would spill over to Monday.

https://adblockplus.org/blog/downloading-a-file-regularly-ho...
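
One generic fix in the spirit of this thread (I'm not claiming it's exactly what Adblock Plus shipped) is to jitter the next check time instead of using a fixed five-day interval:

    import random, time

    CHECK_INTERVAL = 5 * 24 * 3600  # five days, per the blog post

    def next_check_time(now=None):
        # Add up to a day of random jitter so fleets of clients
        # drift apart instead of converging on the same Monday.
        now = time.time() if now is None else now
        return now + CHECK_INTERVAL + random.uniform(0, 24 * 3600)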


Windows does much the same thing w.r.t. policy refresh (which sucks down files from domain controllers) and update of "occasionally updated" metadata like last logon timestamp.


A related technique is to use distinct prime numbers for periodicities in the first place:

http://www.stdlib.net/~colmmacc/2009/11/26/period-pain-3/

(http://www.stdlib.net/~colmmacc/2009/09/27/period-pain-part-... http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/ for the background material).

The reasons primes are optimal should be obvious :)
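
A quick worked example of why: two periodic jobs coincide every lcm(p, q) ticks, and co-prime periods push that coincidence as far out as possible for periods of a given size:

    from math import lcm  # Python 3.9+

    print(lcm(60, 90))  # 180  -> the jobs line up every few runs
    print(lcm(59, 89))  # 5251 -> nearby primes almost never line up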


As an interesting counterpoint, Jeff Dean at Google says they prefer to take a known hit every once in a while, so they have all the cron jobs go off at the same time rather than introducing jitter. It perhaps has to do with their emphasis on reducing variability to control the long tail of the latency distribution.


A nice extension of this is to use exponentially distributed delays. Recurring events then form Poisson processes, which are easy to reason about when composed: e.g., if you have N servers firing events at the same exponentially distributed rate, the event times are distributed as if a single server were firing events at N times the rate and handing jobs uniformly at random to the N servers.

http://en.wikipedia.org/wiki/Poisson_process
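
A sketch of that, with a made-up rate (random.expovariate draws exponentially distributed delays, so the resulting event stream is a Poisson process):

    import random

    RATE = 1 / 60.0  # hypothetical: one event per minute on average

    def next_delay(rate=RATE):
        # Exponential inter-event delays make the event stream a
        # Poisson process; N servers at rate r compose to rate N*r.
        return random.expovariate(rate)

    # e.g. time.sleep(next_delay()); fire_event()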


A counterexample is that the Linux kernel tries to schedule timer events for the same deadline time. That allows the processor to sleep longer because the kernel doesn't need to wake up as often just to handle 1 or 2 timer events.


It's kind of a weird way to describe it. The better way is that you want your caching to be probabilistic rather than deterministic. In general, you want to avoid anything that would get the nodes in a distributed system to harmonize.

The other way to solve the problem, though, is to handle that circumstance cleanly. There are ways to resolve a thundering herd without creating a scalability problem.
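
One way to make the expiry probabilistic (my sketch of the general idea, not necessarily what the parent has in mind): let each reader refresh early with a probability that grows as the deadline approaches, so recomputation is spread across readers instead of stampeding at expiry:

    import random, time

    def should_refresh_early(expires_at, window=30.0):
        # With `window` seconds left, a given reader refreshes with
        # probability rising from 0 to 1 as expiry approaches.
        remaining = expires_at - time.time()
        if remaining <= 0:
            return True
        return random.random() > remaining / window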


I've heard it called "splay" before e.g. `chef-client` will take an interval option and a splay option. Splay is a random amount of time added to the interval to prevent thundering herds.
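
Roughly, that combination behaves like this (a sketch of the idea, not chef-client's actual implementation; the parameter values are made up):

    import random, time

    def run_periodically(task, interval=1800, splay=300):
        # Sleep for the fixed interval plus a random splay, so a
        # fleet of nodes on the same interval doesn't fire in lockstep.
        while True:
            time.sleep(interval + random.uniform(0, splay))
            task()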


On this topic, an interesting paper from 1994 on adding randomization to network traffic sources:

http://ee.lbl.gov/papers/sync_94.pdf


Fascinating paper, even if it is 18 years old now. Thanks.


In the past I've seen jitter used in all sorts of applications, from cron jobs to configuration management to memcache key expiration. Any time you have a crapload of nodes that all need to do an operation at a specific time, you rely on jitter to keep resources from bottlenecking. It's probably used anywhere the systems or network have a minuscule amount of resource headroom (like cluster nodes that run at 93% utilization).


Same in multicast DNS which is used in zeroconf:

http://files.multicastdns.org/draft-cheshire-dnsext-multicas...

Search for "random". Cool spec written by Stuart Cheshire of Apple.


I hadn't thought about it in the general case, but I do frequently find myself adding a sleep for PID mod some appropriate constant to the beginning of big distributed batch jobs in order to keep shared resources (NFS, database, etc) from getting hammered all at once.
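
Something like this at the top of each job (the one-minute window is arbitrary):

    import os, time

    STAGGER = 60  # hypothetical: spread starts across a one-minute window

    # Stagger startup by PID so concurrent batch jobs don't hit the
    # shared resource (NFS, database, ...) at the same instant.
    time.sleep(os.getpid() % STAGGER)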


Why is this not done in electronic exchanges to render HFT operations pointless?


First come, first served is one of the rules that makes the exchanges the exchanges.



