A 3D torus is a tradeoff between wiring complexity/cost and performance. When node counts get high you can't really have a dedicated link between every pair of nodes, so if you don't use a torus you usually need a stack of switches/routers aggregating traffic. Those mid-level and top-level switches/routers get very expensive (high cross-section bandwidth) and the routing can get a bit painful. A 3D torus has far fewer cables, the routing can be really simple ("hop vertically until you reach your row, then hop horizontally to reach your node"), and the wrap-around connections are nice.
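To make the "hop along one axis at a time" idea concrete, here's a minimal sketch of dimension-ordered routing on a hypothetical k x k x k torus (all names are illustrative, not from any real system). The wraparound links mean the distance along each axis is the shorter of going forward or backward around the ring:

```python
def torus_hops(src, dst, k):
    """Hop count between two (x, y, z) nodes on a k^3 torus.

    Wraparound means each axis contributes min(forward, backward) steps.
    """
    return sum(min((d - s) % k, (s - d) % k) for s, d in zip(src, dst))

def route(src, dst, k):
    """Yield the nodes visited under dimension-ordered routing:
    finish the x axis, then y, then z, always taking the shorter direction."""
    cur = list(src)
    for axis in range(3):
        fwd = (dst[axis] - cur[axis]) % k
        step = 1 if fwd <= k - fwd else -1  # shorter way around the ring
        while cur[axis] != dst[axis]:
            cur[axis] = (cur[axis] + step) % k
            yield tuple(cur)
```

For example, on a 4x4x4 torus, `(0,0,0)` to `(1,2,3)` takes 1 + 2 + 1 = 4 hops, because the z axis is only one wraparound step away.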
That said, the torus approach was a gamble that most workloads would be nearest-neighbor, and allreduce needs extra work to optimize.
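One reason allreduce is workable at all on a torus is that each axis's wraparound links form a ring, and the classic ring all-reduce (reduce-scatter followed by all-gather) maps onto that. A toy simulation of the pattern, with plain Python lists standing in for nodes (not any real collective library):

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce over p nodes, each holding a length-p vector.

    Element j plays the role of "chunk j". Returns each node's final copy,
    which ends up being the elementwise sum of all inputs.
    """
    p = len(vectors)
    data = [list(v) for v in vectors]
    # Reduce-scatter: in step s, node i sends chunk (i - s) % p to node i+1,
    # which adds it to its own copy of that chunk.
    for s in range(p - 1):
        sends = [(i, (i - s) % p, data[i][(i - s) % p]) for i in range(p)]
        for i, c, val in sends:
            data[(i + 1) % p][c] += val
    # Now node i owns the fully reduced chunk (i + 1) % p.
    # All-gather: circulate the completed chunks once around the ring.
    for s in range(p - 1):
        sends = [(i, (i + 1 - s) % p, data[i][(i + 1 - s) % p]) for i in range(p)]
        for i, c, val in sends:
            data[(i + 1) % p][c] = val
    return data
```

Each node sends 2(p-1) chunks total, so bandwidth per link stays constant as p grows; the "extra work" is in overlapping the axes and hiding latency, not the basic algorithm.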
An AI data center tends to have enormous power and cooling capacity, less disk, and a slightly different networking setup. But really it just means "this part of the warehouse has more ML chips than disks."
Thank you very much, that is the piece of the puzzle I was missing. Naively, it still seems (to me) that a 3D torus needs far more hops than a regular multi-level switch fabric when you've got many thousands of nodes, but I can appreciate that the routing could be much simpler. Although I would guess in practice it requires something beyond the simplest routing solution to avoid congestion.
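A quick back-of-envelope check of that intuition, assuming an idealized k x k x k torus versus an idealized fat-tree built from r-port switches (illustrative numbers only, not a model of any real deployment):

```python
import math

def torus_worst_hops(k):
    # Wraparound halves the worst-case distance per axis to k // 2,
    # and there are three axes.
    return 3 * (k // 2)

def fat_tree_worst_hops(n, radix):
    # An r-port fat-tree reaches n hosts in roughly log_{r/2}(n) levels;
    # the worst path goes up to the top and back down: ~2 * levels hops.
    levels = math.ceil(math.log(n, radix // 2))
    return 2 * levels
```

For 4096 nodes, a 16^3 torus has a worst case of 24 hops, while a fat-tree of 64-port switches needs only about 6, so the hop-count intuition holds; the torus wins on cable count, cost, and routing simplicity rather than diameter.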
I think the choice they made, combined with some great software and hardware engineering, allows them to continue to innovate at the highest level of ML research within a reasonable dollar and complexity budget, regardless of the specific interconnect.