Why a lookup table? Why not just have the client remember the last backoff, and multiply it by a constant and apply a ceiling?
BACKOFF_CONSTANT = 2   # growth factor per retry (example value)
BACKOFF_CEILING = 64   # cap on the backoff (example value, same units as the backoff)

def getNewBackoff(oldBackoff: int) -> int:
    newBackoff = oldBackoff * BACKOFF_CONSTANT
    if newBackoff > BACKOFF_CEILING:
        newBackoff = BACKOFF_CEILING
    return newBackoff
If you bit-align the ceiling value you can replace the conditional with a bitwise AND mask, and if you use 2 as your multiplier you can replace the multiplication with a left shift.
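For example, a minimal sketch of both tricks (the getNewBackoffBitwise name and the AND-plus-subtract clamp are just one way to do it, assuming the backoff starts at a power of two no larger than the ceiling and the ceiling itself is a power of two, as above):

def getNewBackoffBitwise(oldBackoff: int) -> int:
    # Left shift doubles the backoff; the masked subtraction removes the
    # overshoot so the value saturates at BACKOFF_CEILING without a branch.
    # Assumes oldBackoff is a power of two <= BACKOFF_CEILING.
    return (oldBackoff << 1) - (oldBackoff & BACKOFF_CEILING)

# With BACKOFF_CEILING = 64: 1 -> 2 -> 4 -> ... -> 64 -> 64 -> 64 ...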
It's also a good idea to introduce jitter: don't just set the new backoff to the old backoff times a constant, but add a random component so the result is spread around the desired new value. This helps prevent pile-ons when a bunch of clients notice at about the same time that the server became unresponsive; if they all run the same backoff algorithm, they're all going to hit the server again at about the same time. Spreading that out helps avoid overloading the server during restarts, cold starts, and the like.
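Something along these lines, reusing the constants from the snippet above (the getNewBackoffJittered name and the ±20% band are arbitrary):

import random

def getNewBackoffJittered(oldBackoff: float) -> float:
    target = min(oldBackoff * BACKOFF_CONSTANT, BACKOFF_CEILING)
    # Randomize around the target so a fleet of clients that noticed the
    # outage together doesn't retry in lockstep.
    return target * random.uniform(0.8, 1.2)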
The best results I ever got came from server-side hinting to avoid pile-ons when client retries all clustered around the same time.
When my server got into an overloaded state, the auth path for clients short-circuited and returned an "overloaded" message with a randomized suggested retry value. The server can't make the clients honor it (we could in my case because we wrote the clients), but if the clients do honor it you can spread reconnects out over a period instead of having them all land at a single point in time, which dampens or prevents oscillating overloads.
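A rough sketch of that shape (handle_auth, do_auth, the field names, and the 30-120 second window are all illustrative, not the actual protocol):

import random

RETRY_HINT_MIN_S = 30    # illustrative bounds for the suggested retry window
RETRY_HINT_MAX_S = 120

def do_auth(request) -> dict:
    # Stub standing in for the normal authentication work.
    return {"status": "ok"}

# Server side: short-circuit the auth path when overloaded and hand back a
# randomized retry hint instead of doing the real work.
def handle_auth(request, overloaded: bool) -> dict:
    if overloaded:
        return {"status": "overloaded",
                "retry_after_s": random.uniform(RETRY_HINT_MIN_S, RETRY_HINT_MAX_S)}
    return do_auth(request)

# Client side: honor the hint if it's present, otherwise fall back to the
# client's own backoff.
def next_retry_delay(response: dict, local_backoff: float) -> float:
    return response.get("retry_after_s", local_backoff)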
Similar to your suggestion, I implemented a wrapper around retries in Go a while back, and despite liking the lookup-table version quite a lot, I think it's more important to be able to parametrize the initial back-off duration and the rate at which it grows, so that's what I focused on.
I'm not sure how my example fares on a simplicity scale, but it feels a little more readable to me:
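Roughly, the shape of it (sketched in Python rather than the actual Go; backoff_delays and its default values are illustrative):

import random

def backoff_delays(initial: float = 0.5, growth: float = 2.0,
                   ceiling: float = 60.0, jitter: float = 0.2):
    # Yield successive retry delays: start at `initial`, multiply by `growth`
    # on each attempt, clamp at `ceiling`, and randomize by +/- `jitter`.
    delay = initial
    while True:
        yield delay * random.uniform(1 - jitter, 1 + jitter)
        delay = min(delay * growth, ceiling)

You pull one value per failed attempt: next(delays) gives the time to sleep before the next try.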
A server cannot make a client behave a certain way. Explicit retry intervals don't matter in this case, and as many of the comments said, most of the time the best thing for the server is to spread retry load randomly over an interval. Explicit intervals are bad for the server.
The only thing (IMO) explicit retries are good for is making client state more deterministic, but there are many ways of doing that other than hard-coding retry intervals. For example, use asynchronous patterns for networking APIs and allow querying the state of API calls in progress.
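A minimal sketch of what queryable in-progress calls might look like (ApiClient and its method names are illustrative):

import asyncio

class ApiClient:
    # Track in-flight calls so callers can ask "where is that request?"
    # instead of relying on hard-coded retry intervals.
    def __init__(self) -> None:
        self._in_flight: dict[str, asyncio.Task] = {}

    def start_call(self, name: str, coro) -> asyncio.Task:
        # Must be called from within a running event loop.
        task = asyncio.create_task(coro)
        self._in_flight[name] = task
        return task

    def call_state(self, name: str) -> str:
        task = self._in_flight.get(name)
        if task is None:
            return "unknown"
        if not task.done():
            return "in-progress"
        if task.cancelled():
            return "cancelled"
        return "failed" if task.exception() else "done"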