Wouldn't a 3d torus network have horrible performance with 9,216 nodes? And really horrible latency? I'd have assumed traditional spine-leaf would do better. But I must be wrong as they're claiming their latency is great here. Of course, they provide zero actual evidence of that.
And I'll echo, what even is an AI data center, because we're still none the wiser.
A 3d torus is a tradeoff between wiring complexity/cost and performance. When node counts get high you can't really have a pair of wires between every pair of nodes, so if you don't use a torus you usually need a stack of switches/routers aggregating traffic. Those mid-level and top-level switches/routers get very expensive (high-bandwidth cross-section) and the routing can get a bit painful. A 3d torus has far fewer cables, the routing can be really simple ("hop vertically until you reach your row, then hop horizontally until you reach your node"), and the wrap-around connections are nice.
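If it helps make "really simple" concrete, here's a minimal sketch of shortest-path, dimension-order routing on a torus. The coordinates and the 16x24x24 shape (16*24*24 = 9,216) are purely illustrative assumptions on my part, not Google's actual routing or topology:

```
# Hypothetical sketch: hop count under dimension-order routing on a 3D torus,
# where each dimension can wrap around.
def torus_hops(src, dst, shape):
    hops = 0
    for s, d, k in zip(src, dst, shape):
        delta = abs(s - d)
        hops += min(delta, k - delta)  # wrap-around link gives the shorter direction
    return hops

# Example: two far-apart nodes in a hypothetical 16x24x24 torus (9,216 nodes)
print(torus_hops((0, 0, 0), (8, 12, 12), (16, 24, 24)))  # -> 8 + 12 + 12 = 32 hops
```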
That said, the torus approach was a gamble that most workloads would be nearest-neighbor, and allreduce needs extra work to optimize.
An AI data center tends to have enormous power and cooling requirements, less disk, and a somewhat different networking setup. But really it just means "this part of the warehouse has more ML chips than disks".
Thank you very much, that is the piece of the puzzle I was missing. Naively, it still seems to me that a 3d torus needs far more hops than a regular multi-level switched fabric when you've got many thousands of nodes, but I can appreciate that the routing could be much simpler. Although I'd guess that in practice it requires something beyond the simplest routing scheme to avoid congestion.
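To put rough numbers on the "far more hops" intuition (purely back-of-envelope, assuming a hypothetical 16x24x24 torus as one factorization of 9,216 nodes):

```
# Average shortest distance around a ring of even size k is k/4; worst case is k/2.
shape = (16, 24, 24)                   # hypothetical factorization of 9,216
avg_hops = sum(k / 4 for k in shape)   # ~16 hops on average
max_hops = sum(k // 2 for k in shape)  # 32 hops worst case
print(avg_hops, max_hops)
```

So yes, tens of hops versus the ~5 switch hops of a 3-tier Clos fabric, but each torus hop is a cheap short neighbor link, and if the workload is mapped so most traffic is nearest-neighbor, the common case is a single hop.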
I think that choice, combined with some great software and hardware engineering, lets them keep innovating at the highest level of ML research within a reasonable dollar and complexity budget, regardless of the specific topology.
It's a data center with much higher power density. We're talking about 100, heading toward 1,000, kW/rack vs 20 kW/rack for a traditional data center, which requires very different cooling and power delivery.
A data center that runs significant AI training or inference loads. Non-AI data centers are fairly commoditized. Google's non-AI efficiency is not much better than Amazon or anyone else. Google is much more efficient at running AI workloads than anyone else.
> Google's non-AI efficiency is not much better than Amazon or anyone else.
I don't think this is true. Google has long been a leader in efficiency. Look at power usage effectiveness (PUE). A decade ago Google announced average PUEs around 1.12 while the industry average was closer to 2.0. From what I can tell they reported a 1.1 fleet-wide average last year. They've been more transparent about this than any of the other big players.
AWS is opaque by comparison, but they report 1.2 on average. So they're close now, but that's after a decade of trying to catch up to Google.
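For anyone unfamiliar with the metric: PUE is total facility energy divided by IT equipment energy, so the overhead fraction is PUE - 1. A quick illustration with a made-up 100 MW IT load (the load figure is hypothetical; the PUE values are the ones cited above):

```
def overhead_mw(it_load_mw, pue):
    # PUE = total facility power / IT power, so non-IT overhead = IT * (PUE - 1)
    return it_load_mw * (pue - 1)

print(round(overhead_mw(100, 1.1), 1))  # 10.0 MW overhead at Google's reported fleet average
print(round(overhead_mw(100, 1.2), 1))  # 20.0 MW overhead at AWS's reported average
print(round(overhead_mw(100, 2.0), 1))  # 100.0 MW overhead at the old industry average
```

In other words, at a PUE of 2.0 you spend as much power on cooling and conversion as on compute; at 1.1 it's about a tenth of that.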
To suggest the rest of the industry is on the same level is not at all accurate.