
That code was in turn a loose port of the dial function from Plan 9 from User Space, where I added TCP_NODELAY to new connections by default in 2004 [1], with the unhelpful commit message "various tweaks". If I had known this code would eventually be of interest to so many people maybe I would have written a better commit message!

I do remember why, though. At the time, I was working on a variety of RPC-based systems that ran over TCP, and I couldn't understand why they were so incredibly slow. The answer turned out to be TCP_NODELAY not being set. As John Nagle points out [2], the issue is really a bad interaction between delayed acks and Nagle's algorithm, but the only option on the FreeBSD system I was using was TCP_NODELAY, so that was the answer. In another system I built around that time I ran an RPC protocol over ssh, and I had to patch ssh to set TCP_NODELAY, because at the time ssh only set it for sessions with ptys [3]. TCP_NODELAY being off is a terrible default for trying to do anything with more than one round trip.

When I wrote the Go implementation of net.Dial, which I expected to be used for RPC-based systems, it seemed like a no-brainer to set TCP_NODELAY by default. I have a vague memory of discussing it with Dave Presotto (our local networking expert, my officemate at the time, and the listed reviewer of that commit) which is why we ended up with SetNoDelay as an override from the very beginning. If it had been up to me, I probably would have left SetNoDelay out entirely.
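
For readers who haven't used it, here is a minimal sketch of what that override looks like from the caller's side (the host and port are placeholders): net.Dial turns TCP_NODELAY on for new TCP connections, and SetNoDelay(false) is the escape hatch that turns Nagle's algorithm back on for a particular connection.

    package main

    import (
        "log"
        "net"
    )

    func main() {
        // net.Dial enables TCP_NODELAY on new TCP connections by default,
        // so small writes are sent immediately.
        conn, err := net.Dial("tcp", "example.com:443")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // SetNoDelay is the override discussed above: passing false
        // re-enables Nagle's algorithm for this one connection, which can
        // make sense for long-lived streams of many tiny writes.
        if tc, ok := conn.(*net.TCPConn); ok {
            if err := tc.SetNoDelay(false); err != nil {
                log.Fatal(err)
            }
        }
    }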

As others have pointed out at length elsewhere in these comments, it's a completely reasonable default.

I will just add that it makes no sense at all that git-lfs (lf = large file!) should be sending large files 50 bytes at a time. That's a huge number of system calls that could be avoided by doing larger writes. And then the larger writes would work better for the TCP stack anyway.
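
As a sketch of the kind of fix that helps regardless of TCP_NODELAY (the function name and buffer size here are illustrative, not anything git-lfs actually does): wrap the connection in a buffered writer so the kernel sees large writes instead of thousands of 50-byte ones.

    package sketch

    import (
        "bufio"
        "io"
        "net"
    )

    // sendFile copies r to conn through a 64 KiB buffer, so data reaches
    // the socket in large writes rather than thousands of tiny ones.
    func sendFile(conn net.Conn, r io.Reader) error {
        w := bufio.NewWriterSize(conn, 64*1024)
        if _, err := io.Copy(w, r); err != nil {
            return err
        }
        // Flush writes out whatever is still sitting in the buffer.
        return w.Flush()
    }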

And to answer the question in the article:

> Much (all?) of Kubernetes is written in Go, and how has this default affected that?

I'm quite confident that this default has greatly improved the default server latency in all the various kinds of servers Kubernetes has. It was the right choice for Go, and it still is.

[1] https://github.com/9fans/plan9port/commit/d51419bf4397cf13d0...

[2] https://news.ycombinator.com/item?id=34180239

[3] http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-65...


The basic problem, as I've written before [1][2], is that, after I put in Nagle's algorithm, Berkeley put in delayed ACKs. Delayed ACKs delay sending an empty ACK packet for a short, fixed period based on human typing speed, maybe 100ms. This was a hack Berkeley put in to handle large numbers of dumb terminals going into time-sharing computers using terminal-to-Ethernet concentrators. Without delayed ACKs, each keystroke sent a datagram with one payload byte, and got a datagram back with no payload, just an ACK, followed shortly thereafter by a datagram with one echoed character. So they got a 30% load reduction for their TELNET application.

Those two algorithms should never both be on at the same time. But they usually are.

Linux has a socket option, TCP_QUICKACK, to turn off delayed ACKs. But it's very strange. The documentation is kind of vague, but apparently you have to re-enable it regularly.[3]
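
For what it's worth, a rough sketch of setting it from Go on Linux using golang.org/x/sys/unix (the function name is hypothetical); per the discussion in [3], the flag is not sticky, so code that relies on it ends up re-applying it, typically around reads.

    //go:build linux

    package sketch

    import (
        "net"

        "golang.org/x/sys/unix"
    )

    // setQuickAck enables TCP_QUICKACK on conn. The kernel can quietly
    // clear the flag again as the connection runs, which is why callers
    // have to keep re-enabling it, as described above.
    func setQuickAck(conn *net.TCPConn) error {
        raw, err := conn.SyscallConn()
        if err != nil {
            return err
        }
        var serr error
        if err := raw.Control(func(fd uintptr) {
            serr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_QUICKACK, 1)
        }); err != nil {
            return err
        }
        return serr
    }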

Sigh.

[1] https://news.ycombinator.com/item?id=10608356

[2] https://developers.slashdot.org/comments.pl?cid=14515105&sid...

[3] https://stackoverflow.com/questions/46587168/when-during-the...


> To avoid network congestion, the TCP stack implements a mechanism that waits for the data up to 0.2 seconds so it won’t send a packet that would be too small. This mechanism is ensured by Nagle’s algorithm, and 200ms is the value of the UNIX implementation.

Sigh. If you're doing bulk file transfers, you never hit that problem. If you're sending enough data to fill up outgoing buffers, there's no delay. If you send all the data and close the TCP connection, there's no delay after the last packet. If you do send, reply, send, reply, there's no delay. If you do bulk sends, there's no delay. If you do send, send, reply, there's a delay.
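
In code, the difference between the last two patterns is whether the application hands the stack one buffer or two before waiting for the reply. A hypothetical request with a small header and small body could be coalesced like this (a sketch only; the function name is made up):

    package sketch

    import "net"

    // sendRequest illustrates avoiding the "send, send, reply" pattern.
    // Two separate small writes (header, then body) can stall behind the
    // peer's delayed ACK when Nagle's algorithm is on; handing the stack
    // both buffers at once turns it into "send, reply".
    func sendRequest(conn net.Conn, header, body []byte) error {
        //   conn.Write(header) // first small segment goes out
        //   conn.Write(body)   // second one may wait for the ACK
        bufs := net.Buffers{header, body}
        _, err := bufs.WriteTo(conn)
        return err
    }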

The real problem is ACK delays. The 200ms "ACK delay" timer is a bad idea that someone at Berkeley stuck into BSD around 1985 because they didn't really understand the problem. A delayed ACK is a bet that there will be a reply from the application level within 200ms. TCP continues to use delayed ACKs even if it's losing that bet every time.

If I'd still been working on networking at the time, that never would have happened. But I was off doing stuff for a startup called Autodesk.

John Nagle


Ford Aerospace was one of the first commercial BSD Unix sites. The licensing was complicated. We had to buy Unix 32V from AT&T first. That transaction got routed onto the path used for major corporate documents. AT&T and Ford Motor had a cross-licensing agreement. Eventually, I got a no-cost license agreement embossed with the corporate seals of both the Ford Motor Company and the American Telephone and Telegraph Corporation. Made a copy and taped it onto a VAX. Then I drove up to Berkeley from Palo Alto and Bill Joy gave me a BSD tape.

BSD didn't have networking at that point. We bought 3COM's UNET.[1] That was TCP/IP, written by Greg Shaw. $7,300 for a first CPU. $4,300 for each additional CPU. It didn't use "sockets"; you opened a connection by opening a pseudo-device. UNET itself was in user space, talking to the other end of those pseudo-devices.

Once we got that going, we had it on VAX machines, some PDP-11 machines, and some Zilog Z8000 machines. (The Zilog Z8000 was roughly similar to a PDP-11.) All of these, along with some other weird machines including a Symbolics LISP machine, eventually interoperated. We had some of Dave Mills' Fuzzballs as routers [2], and a long-haul link to another Ford location that connected to the ARPANET. Links included 10Mb/s Ethernet, a DEC device called a DMC that used triaxial coax cables, and serial lines running SLIP. A dedicated 9600 baud serial synchronous line to Detroit was a big expense.

My work in congestion control came from making all this play well together. These early TCP/IP implementations did not play well with others. Network interoperability is assumed now, but it was a new, strange idea back then, in an era when each major computer maker had their own networking protocols. UNET as delivered was intended to talk only to other UNET nodes. I had to write UDP and ICMP, and do a major rewrite on TCP.

When BSD got networking, it was initially intended to talk only over Ethernet, to other BSD implementations. When 4.3BSD came out, it would only talk to some other implementations during alternate 4 hour intervals. I had to fix the sequence number arithmetic, which wrapped incorrectly.

And finally, it all worked. For a few years, it was said of the TCP/IP Internet that it took "too many PhDs per packet." One day, on the Stanford campus, I saw a big guy with a tool belt carrying an Ethernet bridge (a sizable box in those days) under his arm, and thought, this is finally a working technology.

[1] https://archive.org/details/bitsavers_3Com3ComUN_1019199/pag...

[2] https://eecs.engin.umich.edu/stories/remembering-alum-david-...

