Yes, before hardware inline kTLS offload, we were limited to 200Gb/s or so with Naples. With Rome, it's a bit higher. But hardware inline kTLS with the Mellanox CX6-DX eliminates memory bandwidth as a bottleneck.
The current bottleneck is IO-related, and it's unclear what the issue is. We're working with the hardware vendors to try to figure it out. We should be getting about 390Gb/s.
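For anyone wondering what "inline kTLS" looks like at the API level: the TLS handshake still happens in userspace, but the session keys are then pushed into the kernel, and if the NIC supports inline offload the record encryption happens on the way out the DMA path instead of on the CPU. Here's a rough sketch against the Linux kTLS interface, purely for illustration (FreeBSD's kTLS plumbing differs in the details, and the helper name is made up):

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/tcp.h>
    #include <linux/tls.h>

    /* Illustrative sketch: after the userspace handshake, hand the negotiated
     * TLS 1.2 AES-128-GCM transmit keys to the kernel. From here on the kernel
     * (or a NIC with inline TLS offload) produces the TLS records. */
    int enable_ktls_tx(int sock,
                       const unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE],
                       const unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE],
                       const unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE],
                       const unsigned char seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE])
    {
        struct tls12_crypto_info_aes_gcm_128 ci;

        memset(&ci, 0, sizeof(ci));
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci.key, key, sizeof(ci.key));
        memcpy(ci.iv, iv, sizeof(ci.iv));
        memcpy(ci.salt, salt, sizeof(ci.salt));
        memcpy(ci.rec_seq, seq, sizeof(ci.rec_seq));

        /* attach the kernel TLS ULP to the TCP socket, then install TX keys */
        if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;
        return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }

After that, a plain write() or sendfile() on the socket goes out as encrypted TLS records with no extra work in userspace.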
> But hardware inline kTLS with the Mellanox CX6-DX eliminates memory bandwidth as a bottleneck.
For a while now I had operated under the assumption that CPU-based crypto with AES-GCM was faster than most hardware offload cards. What makes the Mellanox NIC perform better?
I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
> We're working with the hardware vendors to try to figure it out. We should be getting about 390Gb/s
Something I explained to a colleague recently is that a modern CPU gains or loses more computing power from a 1 °C temperature difference in the room's air than my first four computers had combined.
You're basically complaining about missing a mere 10% of the expected throughput. But put in absolute terms, that's 40 Gbps, which is about 10x more than what a typical server in 2020 can put out on the network. (Just because you have 10 Gbps NICs doesn't mean you can get 10 Gbps! Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice)
> For a while now I had operated under the assumption that CPU-based crypto with AES-GCM was faster than most hardware offload cards. What makes the Mellanox NIC perform better?
> I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
It may depend on what you're sending. Netflix's use case is generally sending files. If you're doing software encryption, you would load the plain text file into memory (via the filesystem/unified buffer cache), then write the (session-specific) encrypted text into separate memory, then give that memory to the NIC to send out.
If the NIC can do the encryption, you would load the plain text into memory, then tell the NIC to read from that memory to encrypt and send out. That saves at least a write pass, and probably a read pass. (256 MB of L3 cache on latest EPYC is a lot, but it's not enough to expect cached reads from the filesystem to hit L3 that often, IMHO)
If my guesstimate is right, a cold file would go from hitting memory 4 times to hitting it twice. And a file in the disk cache would go from 3 times to once; with NIC encryption, the CPU doesn't need to touch the memory at all if the file is in the disk cache.
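To put rough numbers on that guess: at roughly 400 Gb/s of TLS payload (call it ~50 GB/s of data), those passes work out to something like

    cached file, software crypto:  3 passes x 50 GB/s ~ 150 GB/s of memory traffic
    cached file, NIC crypto:       1 pass   x 50 GB/s ~  50 GB/s
    cold file,   software crypto:  4 passes x 50 GB/s ~ 200 GB/s
    cold file,   NIC crypto:       2 passes x 50 GB/s ~ 100 GB/s

so software crypto starts fighting the memory controllers long before you run out of NIC bandwidth.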
Note that this is a totally different case from encrypting dynamic data that's necessarily touched by the CPU.
> You're basically complaining about missing a mere 10% of the expected throughput. But put in absolute terms, that's 40 Gbps, which is about 10x more than what a typical server in 2020 can put out on the network. (Just because you have 10 Gbps NICs doesn't mean you can get 10 Gbps! Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice)
I had no problem serving 10 Gbps of files on a dual Xeon E5-2690 (v1, a 2012 CPU), although that CPU isn't great at AES, so I think it only did 8 Gbps or so with TLS. The next round of servers for that role had 2x 10G and 2690 v3 or v4 (2014 or 2016; I can't remember when we got them), and thanks to better AES instructions they were able to do 20 G (and a lot more handshakes/sec too). If your 2020 servers aren't as good as my circa-2012 servers were, you might need to work on your stack. OTOH, bulk file serving for many clients can be different than a single-connection iperf.
> If my guesstimate is right, a cold file would go from hitting memory 4 times to hitting it twice. And a file in the disk cache would go from 3 times to once; with NIC encryption, the CPU doesn't need to touch the memory at all if the file is in the disk cache.
> I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
I assume NF's software pipeline is zero-copy, so if TLS is done in the NIC, data only gets read from memory once, when it is DMA'd to the NIC. With software TLS you need to read the data from memory (assuming it's not already in cache, which given the size of data NF deals with is unlikely), encrypt it, then write it back out to main memory so it can be DMA'd to the NIC. I know Intel has some fancy tech (DDIO) that can DMA directly to/from the CPU's cache, but I don't think AMD has that capability (yet).
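For a concrete picture of what that zero-copy path might look like, here's a sketch against the Linux sendfile(2)/kTLS interface (details differ on FreeBSD, and the helper name is invented): once TX keys are installed on the socket, the payload never passes through userspace.

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Illustrative sketch: stream a whole file out a socket that already has
     * kTLS TX enabled. Pages go from the page cache to the socket without a
     * userspace copy; with software kTLS the kernel still encrypts them on
     * the CPU, with NIC inline TLS the NIC encrypts as it DMAs. */
    ssize_t send_file_ktls(int sock, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
            if (n <= 0)
                break;  /* error or EOF; real code would handle EAGAIN/retries */
        }
        close(fd);
        return off;     /* bytes actually handed to the socket */
    }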
> Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice
Easy line rate if you crank the MTU all the way to 9000 :D
> modern CPU gains or loses more computing power from a 1 °C temperature difference in the room's air
Only if you're using the boost algorithm rather than a static overclock, and only when that boost is thermally limited rather than current-limited. With a good cooler it's not too hard to always have thermal headroom.
> Easy line rate if you crank the MTU all the way to 9000 :D
In my experience jumbo frames provide at best an improvement of about 20%, and only in rare cases, such as ping-pong UDP protocols like TFTP or Citrix PVS streaming.