Basically, it lets you cache for longer while still leaving users able to reach your website.
Right now, you can try resolving common hosts, and you will see that they often return several addresses in response to a lookup. What the browser does with those IPs is up to the browser; the standard does not define what to do. What the administrator that sets up that record wants is "send to whichever one of these seems healthy", and some browsers do do that. Other browsers just pick one at random and report failure if that one happens to be down, so your redundancy actually makes the system more likely to break.
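For example, a quick Node script (the hostname below is just a placeholder; try anything you suspect uses round-robin DNS) will show a single lookup coming back with several addresses:

```typescript
// resolve.ts — print every A record, plus its TTL, that one lookup returns.
import { promises as dns } from "dns";

async function main() {
  // "example.com" is a placeholder; substitute any host you want to inspect.
  const records = await dns.resolve4("example.com", { ttl: true });
  for (const { address, ttl } of records) {
    console.log(`${address} (ttl ${ttl}s)`);
  }
}

main().catch(console.error);
```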
What I want is a way to define what to do in this case. Maybe you want to try them all in parallel and pick the first to respond (at the TCP connection level). Maybe you want to try them sequentially. Maybe you want to open a connection to all of them and send 1/n requests to each. Right now, there is no way to know what the service intends, so the browser has to guess. And each one guesses differently.
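To make the first option concrete, here is a rough sketch of "dial them all in parallel and keep whichever TCP connection completes first" — not what any particular browser actually does, just an illustration, assuming you already have the address list:

```typescript
// race-connect.ts — dial every address in parallel, keep the first socket
// that completes the TCP handshake, and close the rest.
import * as net from "net";

function connect(host: string, port: number): Promise<net.Socket> {
  return new Promise((resolve, reject) => {
    const socket = net.connect({ host, port });
    socket.once("connect", () => resolve(socket));
    socket.once("error", reject);
  });
}

async function raceConnect(addresses: string[], port: number): Promise<net.Socket> {
  const attempts = addresses.map((addr) => connect(addr, port));
  // Promise.any resolves with the first successful connection and only
  // rejects if every attempt fails.
  const winner = await Promise.any(attempts);
  // Best effort: tear down the losers once (if) they connect.
  for (const attempt of attempts) {
    attempt.then((s) => { if (s !== winner) s.destroy(); }).catch(() => {});
  }
  return winner;
}
```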
(You will notice that people like Google and Cloudflare deliberately respond with only one record with a 5 minute TTL. That keeps the behavior of the browser well defined, but it also means one bad reply can eat their entire year of 99.999% uptime, since five nines allows only about 5 minutes of downtime per year. Your systems had better be very reliable if DNS issues can eat a year's worth of error budget.)
> (You will notice that people like Google and Cloudflare deliberately respond with only one record with a 5 minute TTL. That keeps the behavior of the browser well defined, but it also means one bad reply can eat their entire year of 99.999% uptime, since five nines allows only about 5 minutes of downtime per year. Your systems had better be very reliable if DNS issues can eat a year's worth of error budget.)
This chapter in the Google SRE book explains how our load balancing DNS works:
I skimmed through it; not a bad idea. Instead of using a reverse proxy, you are basically doing a poor man's multicast by letting many servers answer a request. And instead of rewriting the packets, you encapsulate them, which should be lighter and faster.
It might be a little more resilient than even a very minimal nginx, but more than that, I think it must give you more control over what happens when a packet is not "answered" within some set amount of time: you write off whoever should have been the answerer and resend that same packet to another server. Keep a buffer of packets, scrape them from the buffer when ACK'ed by the answerer, and resend them to another answerer if not ACK'ed after some set amount of time.
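Something like this very rough sketch is what I'm picturing (all names and timeouts made up, and surely not how the real thing works):

```typescript
// retry-buffer.ts — naive stateful retransmit buffer for the design guessed at above.
type Packet = { id: number; payload: Buffer };

class RetryBuffer {
  private pending = new Map<number, { packet: Packet; timer: NodeJS.Timeout }>();

  constructor(
    private send: (packet: Packet, backend: string) => void,
    private pickAnotherBackend: (failed: string) => string,
    private timeoutMs = 200,
  ) {}

  // Forward a packet and remember it until the chosen backend ACKs it.
  forward(packet: Packet, backend: string) {
    this.send(packet, backend);
    const timer = setTimeout(() => {
      // No ACK in time: write off this backend and resend elsewhere.
      this.pending.delete(packet.id);
      this.forward(packet, this.pickAnotherBackend(backend));
    }, this.timeoutMs);
    this.pending.set(packet.id, { packet, timer });
  }

  // Scrape the packet from the buffer once the answerer ACKs it.
  ack(id: number) {
    const entry = this.pending.get(id);
    if (entry) {
      clearTimeout(entry.timer);
      this.pending.delete(id);
    }
  }
}
```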
Am I guessing correctly?
It seems a bit overcomplicated for normal use cases, but adequate for something at Google's scale.
The design you propose is stateful, and if you read the chapter closely, you can see we spend a lot of effort to make things stateless.
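Very roughly, the stateless idea is that the forwarding decision is a pure function of the packet, for example a hash over the connection 5-tuple, so that any balancer instance picks the same backend without sharing connection tables. A toy sketch of that idea (illustrative only, not the actual Maglev hashing):

```typescript
// stateless-pick.ts — pick a backend as a pure function of the 5-tuple.
// Illustrative only; real consistent hashing is much more careful about
// how traffic redistributes when the backend list changes.
import { createHash } from "crypto";

type FiveTuple = {
  srcIp: string;
  srcPort: number;
  dstIp: string;
  dstPort: number;
  protocol: "tcp" | "udp";
};

function pickBackend(t: FiveTuple, backends: string[]): string {
  const key = `${t.protocol}:${t.srcIp}:${t.srcPort}:${t.dstIp}:${t.dstPort}`;
  const digest = createHash("sha256").update(key).digest();
  // First 4 bytes of the digest as an unsigned int, modulo the pool size.
  return backends[digest.readUInt32BE(0) % backends.length];
}
```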
The main thing I wanted to respond to in this thread, a single bad server destroying your yearly SLO, is addressed in the first paragraph of the section on load balancing at the virtual IP address.
Sorry I couldn't find a clear rationale in the link. Why does Google prefer a stateless load balancer? Is it infeasible to maintain state at that scale?
> What the administrator that sets up that record wants is "send to whichever one of these seems healthy"
In the rrDNS, remove the A records of hosts that fail health checks or whose load is too high.
> Maybe you want to try them all in parallel and pick the first to respond (at the TCP connection level).
That's something GeoIP at your DNS can do; certainly not as good as doing it in the client, but it should be decent enough.
> Your systems had better be very reliable if DNS issues can eat a year's worth of error budget
Or, if you aren't Google or Cloudflare, use a 30 to 60 second TTL in rrDNS, with health checks that selectively remove IPs that fail, on pools splitting your servers by region with GeoIP. That way, if 1/10 of your east coast servers fail, nobody from APAC is impacted, only 1/10th of your US east users are, and only for the duration of the TTL. (I'm glossing over ISPs that cache for too long, but you already mitigate a lot of the problem there.)
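A minimal sketch of that reconcile loop — the updateRecords call and /healthz path stand in for whatever your DNS provider and servers actually expose, and the addresses are placeholders:

```typescript
// healthcheck.ts — keep one regional rrDNS record limited to servers that
// answer a health endpoint quickly. Requires Node 18+ for global fetch.
const serverIps = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]; // placeholder pool

async function isHealthy(ip: string): Promise<boolean> {
  try {
    const res = await fetch(`http://${ip}/healthz`, { signal: AbortSignal.timeout(2000) });
    return res.ok;
  } catch {
    return false;
  }
}

async function updateRecords(name: string, ips: string[]): Promise<void> {
  // Hypothetical: replace the A records for `name` with `ips` via your DNS
  // provider's API, keeping the 30-60s TTL discussed above.
  console.log(`would set ${name} -> ${ips.join(", ")}`);
}

async function reconcile() {
  const checks = await Promise.all(serverIps.map(isHealthy));
  const healthy = serverIps.filter((_, i) => checks[i]);
  // Never publish an empty record set; a degraded pool beats an empty one.
  await updateRecords("www.eastcoast.yoursite.com", healthy.length ? healthy : serverIps);
}

setInterval(reconcile, 15_000);
```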
I can see how it would be easier to handle that in the browser, but you may already be able to do it today with some JS: estimate the latency, store the result in a cookie, and have that cookie cause a reload to www.eastcoast.yoursite.com whenever the user lands on www.yoursite.com; if the user later travels home and new measurements say the stored choice is not optimal, update the cookie and send them to www.apac.yoursite.com instead.
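Roughly like this, as a browser-side sketch (the hostnames, the /ping path, and the cookie name are all made up):

```typescript
// region-picker.ts — runs in the browser on www.yoursite.com. Measures latency
// to each regional endpoint, remembers the winner in a cookie, and moves the
// user to that region's hostname when the stored choice looks wrong.
const regions = ["www.eastcoast.yoursite.com", "www.apac.yoursite.com"];

async function measure(host: string): Promise<number> {
  const start = performance.now();
  try {
    // no-cors keeps the response opaque, but the round-trip time still counts.
    await fetch(`https://${host}/ping`, { cache: "no-store", mode: "no-cors" });
    return performance.now() - start;
  } catch {
    return Infinity;
  }
}

async function pickRegion() {
  const latencies = await Promise.all(regions.map(measure));
  const best = regions[latencies.indexOf(Math.min(...latencies))];
  const current = document.cookie.match(/preferredRegion=([^;]+)/)?.[1];
  if (best !== current) {
    // Remember the new measurement and move the user to the closer region.
    document.cookie = `preferredRegion=${best}; max-age=86400; domain=.yoursite.com; path=/`;
    if (location.hostname !== best) location.hostname = best;
  }
}

pickRegion();
```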
I am kind of OK with this solution, and it is in fact my plan for rolling out HTTP/3 on my personal sites. I wrote https://github.com/jrockway/nodedns to update a DNS record to contain the IP addresses of all schedulable nodes in my cluster. I can then serve HTTP/3 on a well-known port, and it is probable that many requests will reach me successfully. (I had to do this because my cloud provider's load balancer doesn't support UDP, and I don't have access to "floating IPs"; basically my node IPs change whenever the cluster topology needs to change.)
I don't really like it because it still means a minute of downtime when the topology does change. I would prefer telling the browser what strategy to use to try a new node, rather than relying on heuristics and defaults.