What a great post, about something I've been working on almost as long as Avery. I'm sad I missed it the first time it came around.
I was going to write a long response here, but I think I'll save that for a blog post (short summary: I disagree vehemently with what I believe the premise here to be, think that people shouldn't be waiting for the IETF to give them permission to build new network layers, am fairly certain there's no such thing as a "layering violation", and think overlay networks are making, or will ultimately make, IPv6 irrelevant). So on this thread I'll just pick some nits.
Ethernet networking is not as gross as it's made out here. ARP isn't entirely pointless! For instance, at the ISP I ran tech for in the 1990s, I was able to pretty seamlessly move our "data center" and corporate offices across Chicago without renumbering just by exploiting ARP (I wrote a dumb proxy ARP policy router). We did similar things to route traffic to the particular terminal servers customers were dialing into, or to the ISDN router whose PRI happened to service a particular customer. An IP purist would object that we weren't using OSPF the way God intended us to, but it worked and was probably more reliable than the bona fide routing protocols we eventually replaced it with.
This narrative also, I think, gives short shrift to DHCP, which does a lot more than pick out IP addresses for new endpoints: it pretty much fully configures their IP connection. If you've had to do tech support for 10,000 random customers in the era before DNS servers were transparently assigned at connection time, you're not pining for the simple elegance of RARP.
Also: nobody should care about "IGMP-snooping bridges", since IP multicast is and always was hopeless.
I was going to say the same thing, so I won't :-). That said, I was in the Sun Systems group when Bob Hinden ("Boss Bob"; there were three Bobs in the group) of the network group was proposing SIPP as the "next generation IP." It has been illustrative (but, alas, I don't think educational) to see how much more easily that protocol could have been implemented and deployed.
That said, as Thomas points out (indirectly) in the parent to this comment, the Internet was deployed across a pre-existing network (the telephone switching network) without any co-operation from the people who defined or wrote or deployed the protocols that implement telephone switching. As long as the connection from point A to point B worked, the packets could figure out how to get from A to B. There is absolutely nothing preventing a suitably motivated group from creating their own elegant "network" that they layer on top of the existing broadband networks of today, without having to either consult, or get permission from, any standards organization.
> There is absolutely nothing preventing a suitably motivated group from creating their own elegant "network" that they layer on top of the existing broadband networks of today, without having to either consult, or get permission from, any standards organization.
That is essentially what most SD-WAN devices do: treat the Internet as an 'underlay' network. Most of them use proprietary code to create their own network infrastructure that isn't standards-based.
It generally is standards-based. Their customers demand it to be so. IPSec tunnel overlays, usually if not always full mesh. The non-standard part is tiny, insignificant tweaks to IPSec that render it unacceptable to standards-speaking endpoints, so you can't coordinate with your open source IPSec device. Stupid myopia, because these systems depend on proprietary orchestration anyway.
+1 for VeloCloud. SD-WAN mesh between all your devices, and they provide a cloud gateway that lets you connect to any compatible IPsec device without having to backhaul all the data to one specific endpoint.
ARP is also nice and abstract and well-defined; it can bridge from any multi-endpoint subnet's layer-2 address to an IP address. Not sure if anyone actually uses it for non-802, but the generality has forced a clean design.
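To see that generality in the packet format itself, here's a rough sketch in Python (the addresses are placeholders) of building an ARP request per RFC 826. The htype/ptype fields plus the explicit hardware/protocol address lengths are what let the same layout map any link-layer address family to any network-layer one:

    import struct

    def build_arp_request(sender_mac, sender_ip, target_ip):
        # RFC 826 layout; only the field values below are Ethernet/IPv4 specific.
        htype, ptype = 1, 0x0800        # hardware type (Ethernet), protocol type (IPv4)
        hlen, plen = 6, 4               # hardware / protocol address lengths
        oper = 1                        # 1 = request, 2 = reply
        return (struct.pack("!HHBBH", htype, ptype, hlen, plen, oper)
                + sender_mac + sender_ip      # sender hardware + protocol address
                + b"\x00" * hlen              # target hardware address: unknown
                + target_ip)                  # target protocol address

    pkt = build_arp_request(bytes.fromhex("001122334455"),
                            bytes([192, 168, 1, 10]),
                            bytes([192, 168, 1, 1]))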
To add to your praises of DHCP - it can also configure routers, and is in fact the standard solution for that in IPv6. Instead of giving you one or several addresses for NAT through DHCP, it gives the router an address for itself, and also a prefix to assign to clients on its internal network. Super neat stuff, and a boon to administrators.
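A tiny sketch of what that buys the router, in Python (the delegated /56 and interface names here are made-up examples, not any particular implementation): one prefix comes down via DHCPv6 prefix delegation, and the router just carves per-LAN /64s out of it, no NAT involved.

    import ipaddress

    # Hypothetical prefix handed to the router by the ISP via DHCPv6-PD.
    delegated = ipaddress.ip_network("2001:db8:abcd:ee00::/56")

    # Carve one /64 out of the delegation for each internal network.
    lans = ["lan0", "guest0", "iot0"]
    subnets = delegated.subnets(new_prefix=64)
    assignments = {lan: next(subnets) for lan in lans}

    for lan, prefix in assignments.items():
        print(f"{lan}: advertise {prefix} to clients")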
Also to nitpick your summary, because nitpicking is what I do - layering violations are a thing, but only in the same way that violating software abstraction barriers are a thing. Not a hard-and-fast rule, and sometimes if you're doing weird enough stuff you just have to do it.
Got to disagree. The level of abstraction is very useful as a means of swapping out one layer without changing the other technologies - e.g. running IP over point-to-point fiber, or AlohaNet, or 802, or carrier pigeon. Or running Ethernet over a phy with whatever ridiculous number of Mbps is the latest thing. (802.11, of course, has effectively zero phy/link distinction, but anything that has to deal with such high packet drop rates and negotiation of physical layer between endpoints is going to be a mess.)
There's an issue with the specific OSI layering, but that's higher in the stack: it has waaay too many layers at the top. Everything up to maybe the transport layer (TCP/UDP/SCTP) is very well delinked in most implementations, but the session/presentation/application layer distinctions are total BS.
They're a useful tool for understanding the mindset of the original developers, but as you go "up" in the layers, the division of responsibilities becomes more and more arbitrary, with a very sharp uptick after "layer 3".
But more importantly, the notion that routing and forwarding "belongs" in IP, because that's the layer 3 protocol --- that's just false. There's no validity to it, and lots of systems have built overlays with layer 3 function on top of UDP (which in the "layering" model is a "layer 4" protocol, but is really best thought of as an escape hatch with which to build any new system you want on top of IP).
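As a toy illustration of that escape hatch (the 4-byte overlay header, the port number, and the addresses below are all made up, not any real system's wire format): define whatever new "layer 3" addressing you like, prepend it to the payload, and let plain UDP/IP carry it between overlay nodes.

    import socket
    import struct

    OVERLAY_PORT = 4242                 # arbitrary port for this toy overlay

    def encapsulate(overlay_dst_id, inner_packet):
        # Made-up 4-byte overlay header: a 32-bit destination id in the
        # overlay's own address space, followed by whatever we're carrying.
        return struct.pack("!I", overlay_dst_id) + inner_packet

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    inner = b"anything: an IP packet, a frame, a brand-new protocol"
    sock.sendto(encapsulate(7, inner), ("198.51.100.5", OVERLAY_PORT))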
1. layers are a thing (and while any given piece of hardware or software can be serving as an amalgam of any contiguous sequence of layers, you can still analyze the behavior of such a component as if it were N separate abstract components, one for each layer it embodies);
2. layering and layering violations are a thing, in the particular sense of code that intermingles and entangles the concerns of different network layers being automatically a design smell (e.g. OpenVPN smells because, rather than building a clean layer-1 circuit abstraction on top of a layer-4/5/7 stream, and then running a regular substrate-oblivious layer-2 on top, OpenVPN runs a "dirty" layer-2 implementation directly on top of a layer-7 protocol (HTTP), where the layer-2 implementation knows things about HTTP and uses HTTP features to signal layer-2 data, such that it can no longer freely interoperate with other layer-2 implementations);
3. but just going down the layer stack, repeating layers, is not a layering violation. You can build all the way up to a circuit-switching abstraction like TCP, and then put PPP on that to go down to layer 2, and come back up again, and that's not even bad engineering.
"1. layers are a thing (and while any given piece of hardware or software can be serving as an amalgam of any contiguous sequence of layers, you can still analyze the behavior of such a component as if it were N separate abstract components, one for each layer it embodies);"
* Path MTU discovery: For proper operation, TCP needs to know a link-layer property for each of the links between a source and destination.
This bypasses the IP layer, because IP fragmentation does not play well with TCP. On the other hand, TCP does not even see the concept of a "path" between the source and destination; IP may route each segment uniquely.
* TCP over wireless links: TCP makes the assumption that segment loss implies congestion; wireless links have the propensity to drop packets for a plethora of reasons that have nothing to do with congestion. Hey, it's a bad assumption, and there's work on congestion controls that don't make that assumption, but maybe we ought to ask Van Jacobson if life mightn't be easier if the link could tell the transport protocol, "My bad! That was me, I did that?"
* Path MTU discovery: that's part of the IP contract. IP provides an unreliable datagram service with an MTU that varies based on destination endpoint but will never be below a guaranteed floor (1280 bytes in IPv6; IPv4 only guaranteed 576). IPv6 also wisely doesn't let routers fragment packets in flight; sizing your packets correctly is the job of layer 4 (a rough sketch of how layer 4 learns the path MTU follows after these bullets).
* TCP over wireless links: TCP's congestion control mechanism is a heuristic based on ever-evolving understanding of the characteristics of links in the wild. There are things that layer 3 can do that unambiguously get in layer 4's way (bufferbloat makes low-latency response infeasible), but it's layer 4's job to deal with reliability and congestion control. (By the way, unlike LFNs, WiFi is actually not a pathological case for TCP congestion control and buffering. A good mental model for those periodic WiFi drops is of an Ethernet cable being disconnected and reconnected with a different one picked at random from a supply closet. In a lot of very common cases, when traffic gets passed again it will not be at the same throughput as before, so the endpoints need to rediscover the available throughput.)
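On the PMTU point above, here's a rough Linux-only sketch in Python (the destination is a placeholder, and the numeric option values are the usual Linux ones in case this build of the socket module doesn't export them) of how a layer-4 endpoint learns the path MTU: set don't-fragment, try to send something big, and read back what the kernel has cached from any ICMP too-big feedback.

    import socket

    IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
    IP_PMTUDISC_DO  = getattr(socket, "IP_PMTUDISC_DO", 2)
    IP_MTU          = getattr(socket, "IP_MTU", 14)

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("192.0.2.1", 9))        # placeholder destination
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)  # DF on, no local fragmentation

    try:
        s.send(b"\x00" * 4000)         # bigger than most link MTUs
    except OSError:
        pass                           # EMSGSIZE once a smaller MTU is known

    print("path MTU estimate:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))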
To your more general suggestions about alternative designs: generally, schemes that have the link layer communicate with the endpoints using them scale BADLY to large internetworks, and the global internet is the largest.
What does "on UDP" mean? UDP is just a means of running an arbitrary datagram protocol that rides on top of IP; it's how you'd build a system that treats IP the way IP treats Ethernet.
Sure, but you mentioned protocols that have "built overlays with layer 3 function on top of UDP". What are the examples you're referring to?
EDIT: My comment in reply to the sibling comment, which mentioned vxlan:
That's more of a recursive version of the lower layers; using layers 1-4 of one instance of the OSI model as layer 2 of another instance. If anything, this demonstrates just how useful the clear abstraction barrier between layer 2 and layer 3 is; you can have a very complicated software package (like a VPN) as a layer 2 instead of a physical network and all the code from layer 3 up doesn't even need to know.
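For the vxlan case specifically, the recursion is visible right in the wire format. A minimal sketch (per RFC 7348; the VNI, frame, and peer address are placeholders): take an entire Ethernet frame, prepend an 8-byte VXLAN header, and hand it to an ordinary UDP socket on port 4789.

    import socket
    import struct

    VXLAN_PORT = 4789

    def vxlan_encap(vni, ethernet_frame):
        # 8-byte VXLAN header: flags (0x08 = VNI present) + 24 reserved bits,
        # then the 24-bit VNI + 8 reserved bits; the whole inner L2 frame follows.
        header = struct.pack("!II", 0x08 << 24, vni << 8)
        return header + ethernet_frame

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    frame = b"..."                     # a real Ethernet frame would go here
    sock.sendto(vxlan_encap(42, frame), ("203.0.113.7", VXLAN_PORT))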
There are other models of modularity that make it easy to separate transport, routing, link, and physical protocols without starting from the assumption that "layer X can only interact with the minimum common denominator interface for layers X-1 and X+1". That assumption leads to everything from the PMTU discovery silliness to the pain of getting TCP to work correctly over links like wireless where packet loss does not imply congestion.
I've heard some folks talk about TLS as a "session" layer, and it is fortunate that we no longer have to translate between ASCII and EBCDIC underneath the application, so the "presentation" layer now seems like it is mis-named. Ah how times change.
In the early to mid 90s, "layer 3 switching" was becoming a thing, and each switch vendor had their own method of implementation. Cabletron was a large switch vendor then, and their method of layer 3 switching depended upon ARP. Each host would be assigned a /32 IP address, and their default gateway would be their own IP address. There was a registry setting available on Windows NT Server that would cause the DHCP server to provide hosts with DHCP address and router assignments that met these requirements.
Ports that had routers connected to them were designated as router ports and needed to have proxy arp enabled.
Whenever a host wanted to talk to any IP address which was not already in its ARP cache, it would send an ARP request. The management system of the switch, which in this case was software running on a server outside the switch, would look up in its tables whether it knew the IP address from another switch port. If so, and if all policies allowed the host sending the request to speak to the port the destination was associated with, the manager would respond to the ARP request with the MAC of the destination. If the requested IP address didn't exist in its tables, the request would be flooded out all router ports.
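Roughly the decision that out-of-band manager was making per ARP request, as a toy sketch (the table contents, port names, and policy hook are purely illustrative, not Cabletron's actual logic):

    # Toy model of the external switch manager answering ARP requests.
    known_hosts = {                       # IP -> (switch port, MAC) learned elsewhere
        "10.0.0.5": ("port7", "00:11:22:33:44:55"),
    }
    router_ports = ["port1", "port2"]     # ports with real routers behind them

    def handle_arp_request(src_port, requested_ip, policy_allows):
        entry = known_hosts.get(requested_ip)
        if entry and policy_allows(src_port, entry[0]):
            return f"proxy-reply on {src_port} with {entry[1]}"
        return f"flood request out {router_ports}"   # unknown: let the routers sort it out

    print(handle_arp_request("port3", "10.0.0.5", lambda src, dst: True))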
One issue, though, with "nobody should care about IGMP-snooping bridges": I so wish this were true, but (first-hand knowledge) tons of infrastructure these days utilizes IP multicast, including building lighting, HVAC, intercom, VoIP, etc.
Out of curiosity, what do you mean by this? Are you referring to all multicast solutions? Can I just be specific -- what do you think of Dante, Audio/Video-over-IP, or other time-sensitive and synced services that use multicast?
Isn't plain old UDP already an unstoppable DDoS tool? Multicast doesn't make it that much harder to stop. In fact, using it as a DDoS tool seems a bit problematic, since the victim would need to join the groups to receive the traffic. Yes, a piece of malware on the victim's computer could go and attempt to join every single multicast source on the internet, but it's a self-correcting problem, since they wouldn't be able to maintain their subscriptions with their link totally saturated. Much easier to stop than normal DDoS attacks.
The problem is that we have never figured out a multicast routing solution that would work at Internet scale. Especially one that can be implemented in hardware on routers.
> we have never figured out a multicast routing solution that would work at Internet scale
Sure we did, it's called bittorrent. Ok, it isn't really multicast and you probably have to sacrifice ordered delivery, but for many of the use-cases where multiple-delivery would have been a good idea, bittorrent has proven to be a very successful "minimum viable multicast".
Bittorrent succeeded while decades of "multicast" research/experiments failed because bittorrent realized the multi-delivery problem was really about managing peers, which isn't solvable at layer-3.
edit: by which I mean: previous attempts at multicasting assumed it was a packet routing problem, when peer management is actually a question for the application layer.
Bittorrent is the opposite of multicast. Instead of aggregating the data into a single channel to save bandwidth, it splits the data up across every single recipient in a huge NxN graph.
This also illustrates the other problem with multicast on the Internet: It's mostly saving bandwidth on the backbone and at the server. The backbone has plenty of bandwidth to spare, and servers are often in data centers these days where bandwidth is not a huge concern.
The use case where someone does video production in their basement and broadcasts it out to millions of people across the internet over their home cable modem connection is just not compelling enough for ISPs and the backbone providers to make Multicast happen. Just put it on Youtube and let Google sort it out.
hmm. Multicast is often used for, like, IPTV. That's a very different task from BitTorrent. Torrents are indeed about managing peers. IPTV is centralized, not p2p, the benefit of multicast for IPTV is that the routers in between the source (ISP) and your client only carry one copy of the stream instead of one stream per client.
At internet scale.. well, it would be nice to have this efficiency for Twitch and YouTube Live. Which are also pretty centralized (CDN) so I don't see how this is about managing peers.
Bittorrent has a P2P streaming protocol called Bittorrent Live which was used to operate a TV service for several years but I have no idea how efficient it is compared to IPTV multicasting or central servers+CDN.
How exactly? Sources have to pass an RPF check following the unicast path, and receivers have to follow the path either to the RP or to the source, or the packets don't get there.
It's also, effectively, a promise to maintain Internet-wide routing table entries for every page on the web rather than every host (which is something we also can't really do today).
Multicast for everything is difficult. But would it be all that difficult to have 100k or 1M entries?
Something that would definitely be doable today is an IP header that stores 25 or 50 extra destination addresses. But it seems like nobody really cares. Just make streaming services send out a thousand packets with identical data.
Well, it could be done based on microtransactions. To set up your mcast tree you need to pay. The slots are auctioned off every X minutes on a DAG-chain-block-thing.