ExaLink Fusion – Ultra low latency switch

robin_reala · on Oct 30, 2014

Out of interest, what’s the performance of current switches in this market segment? I don’t have anything to relate 110ns to.

deadgrey19 · on Oct 30, 2014

"Latency from 550ns to 2 microseconds" (http://www.arista.com/en/products/7250x-series)

"Consistent latency as low as 350ns for all packet sizes" (http://www.arista.com/en/products/7150-series)

So this is about 3x faster than current devices.

myrandomcomment · on Oct 30, 2014

So Arista has a switch with an FPGA - 7124FX. The market for the HFT has crashed. When Arista shipped the 7124S the HFT guys where using Cisco 4900 at 4ms and the 7124S was at 600ns. It was quite a change. The markets went crazy. The chip was called Bali by Fulcrum Microsystems - Intel bought them. The follow on chip Alta was very very very late. Arista did the 7124SX which got the latency down to 500ns. The Bali itself was 300ns but the PHY chips added the additional latency (in and out). This switch replaced the 7124S but there was not a mad rush to upgrade. Going from 4ms to 600ns was an order of magnitude but going from 600ns to 500ns, not so much. Cisco who got killed by Arista in this market spent ~$100M and came up with the Nexus 3548. It had "warp mode" that could do ~50ns. This mode however required lots of pre-planning and was fixed. Market data feeds are multicast. The handoff from the exchange was 1G or even low as 100mb in the Asian markets back then. If you added up every feed you could buy it was around ~3G. You would never do this on one link. The servers for looking at the data would join groups to get the feed. The servers used 10G NICs when the traffic load on them was only around ~100MB. Once again for latency. Serialization delay was the key here. The market order would go back up to the market on another path. The idea was the HFT guys would want to process the data faster then the rest of the people. The link down to them and the link up to order was the same for everyone. If you went into the NYSE colo the cable length to the router was the same if you where in the rack next to it or on the other side of the datacenter. Anyway back to switches. So Arista shipped the 7150 around the same time as Cisco shipped the 3548. It was around 350ns using the Alta chip. The reality was that this was low enough and the traders started to look for other places to tweak.

Calling this the fastest switch in the world and then printing 5ns is misleading. It is 5ns as a Layer 1 patch panel. Not really what you want for market data which needs multicast, PIM, IGMP snooping, BPG, ACL, etc.. For the 110s is this multicast with all features enabled? Let me know if I missed a link.

BTW, if you look back when the 7124S was released there were others that built a switch based on the Bali chip like BNT. An important point to note is that the chip is line rate multicast but that is not enough. Processing the joins/leaves and programing the chip is a function of the software and that depends on the quality of the code and the CPU system in the switch. Arista won because of this. Here is a link to a bake off from 2010.

http://www.networkworld.com/article/2241525/virtualization/a...

coreyoconnor · on Oct 30, 2014

I used to work for Fulcrum Microsystems. Super fun. Your summary of the market around that is great. Thanks! I had not thought about them in a while.

The technology involved in enabling 64 octet frames to be shoved around in 300ns is fascinating. The software control of these systems was just as fascinating. These chips had a high level of programmability in the frame handler. How to use that programmability was an open question when I left Fulcrum.

myrandomcomment · on Oct 30, 2014

The thing that make the Fulcrum stuff so cool at the time was the chips being asynchronous, which is very very strange. Fulcurm had to built their own tools to validate the layout etc. Amazingly complex. Most people do not realize how amazingly complex this ASIC stuff is and how hard it is to get right without major bugs.

One very interesting bug, but one that did not really matter, was handing 1G at line rate. For those that do not know a bit of a background.

So you have a 10G chip that connects to a PHY chip.

http://en.wikipedia.org/wiki/SerDes

There are 4 x 3.25G lanes to the PHY. Now lets say you want to put a 1G SFP in that port. Well the path uses only one of the 3.25G lanes, not all 4. All the of logic and timing is really built around the packets going over the 4 links.

So anyway, if you ran a test at 100% line rate with a 1G optic there would be drops if your packet size was not divisible by 4. The stand packets sizes that are commonly used for RFC2544 testing are 64,128,256,512,1024,1280,1518. So all the sizes work expect for 1518. You could do 10G at line rate @ 1518 packet size but not 1G! It is very important for everyone to understand that this only matters in lab testing and has ZERO impact in a real world environment. If you changed the IFG on the link it would run at line rate.

http://en.wikipedia.org/wiki/Interpacket_gap

The Fulcurm chips were amazing. I very much enjoyed using them.

sargun · on Oct 30, 2014

What happened to Fulcrum? I know Intel bought them. The FM6000 is still their top of the line chip. When can I get a new switch that's competitive to T2, with FlexPipe?

wmf · on Oct 30, 2014

There are tidbits of info about the FM10000 going around but I haven't seen speeds and feeds. It seems to integrate the NICs into the switch; I'm not sure what the point of that is.

deadgrey19 · on Oct 30, 2014

> Calling this the fastest switch in the world and then printing 5ns is misleading.

The claim in the (original) story title is 110ns layer 2+ switching. I'm sitting in the summit right now, no details on what 2+ means, but, the target market is market data distribution (and this was discussed in the slides) so I would assume that they've thought about this.

5ns is additional functionality available in previous products (http://exablaze.com/exalink-50) which is also in this product.

From the sides, the idea is that they can set up multiple broadcast groups with the 5ns device to distribute market data, and then aggregate responses together with with 110ns (or in some cases 100ns) device, all in one box.

myrandomcomment · on Oct 30, 2014

So the exalink-50 number is as a Layer1 switch.

So the comment "can setup multiple broadcast groups" sounds like the Cisco Nexus 3548 Warp Mode. The market did not take to that so not sure how this is going to change that. Like I said, I have not reviewed all the data yet.

An important step for verification for all this is to put it in the hands of David Newman for a test, either paid via his private company Network Test, or publicly via Network World. A testing house like EANTC would also be great.

As you can might have guessed from my post I know a bit about this world (I am not involved anymore). At this point it is more of a historical curiosity on my part more then anything (I still have a number of friends that run HFT networks at the banks). Given what I saw in testing and claims during the early HFT rush I always want to see 3rd party testing from a reputable testing house showing a real market data feature set being used.

Here is a starting list:

1. BGP running 2. Mulitcast w/PIM 3. IGMP snooping 4. ACL configured 5. 1 port to max port of switch fanout. 6. 1% to 100% line rate on the feed. 7. Packet size ranges, both fixed and mixed. 8. Join/leave time. 9. Max groups joined.

deadgrey19 · on Oct 30, 2014

Agreed. Independent verification is definitely needed. Although I think that this is why they've announced this at STAC which does this (https://stacresearch.com/about). I've moved over to academia now, but I came from tan HFT prop-trading background. We would have jumped at the chance to get one of these in our hands, but this was 3 years ago now.

Can I ask why you'd want BGP running in a private LAN? This seems like an odd thing to do, especially in Colo environments.

myrandomcomment · on Oct 30, 2014

The handoff of the market data feed is over multicast but the connection at L3 to the exchange is BGP.

Since everyone has the same latency from the feed, the goal is to take the feed in and fanout to host as fast as possible. So ideally one port on this switch is to the exchange and the others direct to servers or if you need more scale to another lay of low latency switches that the server farms connect to. For all the money spent on this it is really pretty simple from a networking point of view.

SEJeff · on Oct 30, 2014

Layer 1 and switch are mutually exclusive. A hub or patch panel are layer 1 devices. Switching, by using the mac address, is fundamentally a layer 2 technology.

Am I missing something here? This thing is fast when used as a patch panel, but that is cheating and isn't really switching.

deadgrey19 · on Oct 30, 2014

Yes. There's a layer one patch panel inside, with 48x3 ports. 48 for the front and 48x2 for inside. Inside they have two "application" bays and in one of them is a layer2+ switching device. So it's a layer 1+2+ device in one package.

bio4m · on Oct 30, 2014

The really interesting bit for me is that theres a plugin module with a FPGA that can be programmed by the end user.

Which means simple apps wont even have their traffic leave the switch. I can think of a ton of uses for that, especially from a security perspective.

lrm242 · on Oct 30, 2014

Exablaze also sells NICs with an FPGA as does SolarFlare. You can get an FPGA in an Arista switch as well. Very cool stuff indeed.

deadgrey19 · on Oct 30, 2014

I don't think Arista sells that device anymore. Having said that, the Arista device (7124FX?) only had the FPGA on 8 out of 24 ports and the latency through the transceivers was pretty terrible. Very few people took it up.

The Solarflare device had (has?) the FPGA in a strange place, behind the NIC controller. AFAIK, the Exablaze NIC is pure FPGA which again saves on latency.

tw04 · on Oct 30, 2014

That's table stakes at this point. Arista has had a programmable FPGA for years.

deadgrey19 · on Oct 30, 2014

It's not just an FPGA, the 2x application bays support each support 48ports of connectivity to anything, at this point it's "just" a 48port FPGA implementing the worlds fastest layer 2+ switch, but it could be anything in there. The arista device only supported 8 ports and the transceiver latencies were terrible.

deadgrey19 · on Oct 30, 2014

Indeed! It looks to me like there's two of them, so the potential is pretty incredible.

otherdude438 · on Oct 30, 2014

deadgrey19... familiar name. Aren't you Matt Grosvenor who used to or still does work with Exablaze?

exablaze might not even be able to keep their ip as it remains subject to dispute: http://meanderful.blogspot.com/p/exablaze-and-zomojo.html

The bio you submitted to STAC says you worked at Exablaze: "He is an experienced software and hardware developer and has worked for a collection of start-ups including Exablaze,Zomojo..."

https://stacresearch.com/system/files/summit/files/stac_summ...

deadgrey19 · on Oct 30, 2014

Thanks. Yes, I used to work at Zomojo (Zomojo.com), a prop-trading HFT firm which is also Exablaze's parent company. I'm now a doctoral researcher at the university of Cambridge although I do still maintain a good relationship with Exabalze and have been called to consult for them from time to time. At Cambridge we have installed and use many Exablaze devices in our research work. Full details can be found on my profile with links to my current work. (quick link here: http://www.cl.cam.ac.uk/research/srg/netos/camsas/people.htm...)

The Exablaze announcement was made today at the London STAC summit at which I was invited to speak on unrelated work.

crxgames · on Oct 30, 2014

And the high frequency trading guys just busted out the checkbooks.

deadgrey19 · on Oct 30, 2014

Exablaze, today announced at the London STAC Summit that the company has introduced the world’s fastest network switch and application platform, the ExaLINK Fusion. The ExaLINK Fusion performs conventional layer 2 switching at approximately 110 nanoseconds latency and layer 1.5 switching at 100 nanoseconds, significantly faster than any existing switching device. The ExaLINK Fusion preserves the sub-five nanosecond layer 1 switching fabric and related capabilities of its industry leading ExaLINK 50 device, and adds layer 2 switching functionality implemented within a Xilinx Ultrascale FPGA. The layer 1 switching fabric is used as a central connection point for front panel line cards and internal application-specific modules.

nomnombunty · on Oct 30, 2014

I have doubts about how useful the ExaLink Fusion is in practice. Many exchanges require at minimum a layer 3 switch to terminate at the cross connect. In those cases, you cannot directly connect an ExaLink switch to the exchange.

I am quite surprised that no one mentioned the Cisco nexus 3548. Switching at L2 with 110ns latency is not that impressive considering that the Csico nexus 3548 switches packet at (L2/L3) with 50ns (with warp span turned on) http://www.cisco.com/c/en/us/products/switches/nexus-3548-sw...

deadgrey19 · on Oct 30, 2014

I think this is what they mean when they call it a "layer2+" device.

110ns is the device in its capacity as a full layer 2(+) switch. The "warp span" equivalent would be using the layer broadcast groups functionality which runs at 5ns, 10x faster.

patrickg_zill · on Oct 30, 2014

So, can high 10GE cards actually make use of the lower latency? Another comment says that 350ns is the current best and this is faster... but how much slop is there in the 10GE standard? What I mean is, if the latency is less than the standard timing allows, it may not offer a practical benefit.

deadgrey19 · on Oct 30, 2014

Timing in 10GE is sub-nano second. The PCS (physical encoding sublayer) runs at about 11Ghz (<0.1ns), so I think there's still a long way to go.

KaiserPro · on Oct 30, 2014

you can also pipe 10gige over inifniband, which is pretty cheap.

Has that advantage of having quicker rdma than 10gig ethernet. so while the switching may be slower, the processing is faster because data is piped directly into memory.

http://www.mellanox.com/page/performance_infiniband

deadgrey19 · on Oct 30, 2014

Ethernet now also supports RDMA via the RoCE standard. No idea how it compares to Inifiband RDMA, but it has been heavily exploited in this project from Microsoft Research: https://www.usenix.org/system/files/conference/nsdi14/nsdi14...

justincormack · on Oct 30, 2014

Plus they mention it in their opencompute nodes so it must be shipping https://gigaom.com/2014/10/30/microsoft-tweaks-its-server-sp...

electic · on Oct 30, 2014

Pretty impressive. Does anyone know about availability and the pricing for this?

deadgrey19 · on Oct 30, 2014

They said in the presentation (paraphrasing) "Evaluation units available late 2014, retail units available early 2015". My impression is that pricing will be similar to other high-end switches e.g. $20-30K.

sargun · on Oct 30, 2014

Any idea what the forwarding plane is made of in these switches?

otherdude438 · on Oct 30, 2014

vitesse crosspoint asic and Xilinx FPGA

matthurd · on Nov 7, 2014

Yeah, besides the Xilinx FGPA in the module, the PCB interconnect would have to be specifically the Vitesse Crosspoint VSC3144XHR-12 https://www.vitesse.com/products/product/VSC3144

and you can see from the product brief (https://www.vitesse.com/products/download.php?fid=4548&numbe...) many people use it in telco oriented gear for back-planes and the like.

Here is my take on why you shouldn't buy an ExaNIC from Exablaze (http://meanderful.blogspot.com/2014/11/dont-buy-exanic-from-...)

Just remember that I'm very biased...

--Matt.