Wouldn't a 3d torus network have horrible performance with 9,216 nodes? And really horrible latency? I'd have assumed traditional spine-leaf would do better. But I must be wrong as they're claiming their latency is great here. Of course, they provide zero actual evidence of that.
And I'll echo, what even is an AI data center, because we're still none the wiser.
A 3d torus is a tradeoff between wiring complexity/cost and performance. When node counts get high you can't really have a pair of wires between every pair of nodes, so if you don't use a torus you usually need a stack of switches/routers aggregating traffic. Those mid-level and top-level switches/routers get very expensive (high-bandwidth cross-section) and the routing can get a bit painful. A 3d torus has far fewer cables, the routing can be really simple ("hop vertically until you reach your row, then hop horizontally until you reach your node"), and the wrap-around connections are nice.
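If it helps make "really simple" concrete, here's a minimal sketch of shortest-path, dimension-order routing on a torus. The coordinates and the 16x24x24 shape (16*24*24 = 9,216) are purely illustrative assumptions on my part, not Google's actual routing or topology:

```
# Hypothetical sketch: hop count under dimension-order routing on a 3D torus,
# where each dimension can wrap around.
def torus_hops(src, dst, shape):
    hops = 0
    for s, d, k in zip(src, dst, shape):
        delta = abs(s - d)
        hops += min(delta, k - delta)  # wrap-around link gives the shorter direction
    return hops

# Example: two far-apart nodes in a hypothetical 16x24x24 torus (9,216 nodes)
print(torus_hops((0, 0, 0), (8, 12, 12), (16, 24, 24)))  # -> 8 + 12 + 12 = 32 hops
```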
That said, the torus approach was a gamble that most workloads would be nearest-neighbor, and allreduce needs extra work to optimize.
An AI data center tends to have enormous power and cooling requirements, less disk, and a somewhat different networking setup. But really it just means "this part of the warehouse has more ML chips than disks".
Thank you very much, that is the piece of the puzzle I was missing. Naively, it still seems to me that a 3d torus needs far more hops than a regular multi-level switched fabric when you've got many thousands of nodes, but I can appreciate that the routing could be much simpler. Although I'd guess that in practice it requires something beyond the simplest routing scheme to avoid congestion.
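To put rough numbers on the "far more hops" intuition (purely back-of-envelope, assuming a hypothetical 16x24x24 torus as one factorization of 9,216 nodes):

```
# Average shortest distance around a ring of even size k is k/4; worst case is k/2.
shape = (16, 24, 24)                   # hypothetical factorization of 9,216
avg_hops = sum(k / 4 for k in shape)   # ~16 hops on average
max_hops = sum(k // 2 for k in shape)  # 32 hops worst case
print(avg_hops, max_hops)
```

So yes, tens of hops versus the ~5 switch hops of a 3-tier Clos fabric, but each torus hop is a cheap short neighbor link, and if the workload is mapped so most traffic is nearest-neighbor, the common case is a single hop.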
I think that choice, combined with some great software and hardware engineering, lets them keep innovating at the highest level of ML research within a reasonable dollar and complexity budget, regardless of the specific topology.
It's a data center with much higher power density. We're talking about 100, heading toward 1,000, kW/rack vs 20 kW/rack for a traditional data center, which requires very different cooling and power delivery.
A data center that runs significant AI training or inference loads. Non-AI data centers are fairly commoditized. Google's non-AI efficiency is not much better than Amazon or anyone else. Google is much more efficient at running AI workloads than anyone else.
> Google's non-AI efficiency is not much better than Amazon or anyone else.
I don't think this is true. Google has long been a leader in efficiency. Look at power usage effectiveness (PUE). A decade ago Google announced average PUEs around 1.12 while the industry average was closer to 2.0. From what I can tell they reported a 1.1 fleet-wide average last year. They've been more transparent about this than any of the other big players.
AWS is opaque by comparison, but they report 1.2 on average. So they're close now, but that's after a decade of trying to catch up to Google.
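For anyone unfamiliar with the metric: PUE is total facility energy divided by IT equipment energy, so the overhead fraction is PUE - 1. A quick illustration with a made-up 100 MW IT load (the load figure is hypothetical; the PUE values are the ones cited above):

```
def overhead_mw(it_load_mw, pue):
    # PUE = total facility power / IT power, so non-IT overhead = IT * (PUE - 1)
    return it_load_mw * (pue - 1)

print(round(overhead_mw(100, 1.1), 1))  # 10.0 MW overhead at Google's reported fleet average
print(round(overhead_mw(100, 1.2), 1))  # 20.0 MW overhead at AWS's reported average
print(round(overhead_mw(100, 2.0), 1))  # 100.0 MW overhead at the old industry average
```

In other words, at a PUE of 2.0 you spend as much power on cooling and conversion as on compute; at 1.1 it's about a tenth of that.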
To suggest the rest of the industry is on the same level is not at all accurate.