A 3D torus is a tradeoff between wiring complexity/cost and performance. When node counts get high you can't really have a dedicated link between every pair of nodes, so if you don't use a torus you usually need a stack of switches/routers aggregating traffic. Those mid-level and top-level switches/routers get very expensive (high cross-section bandwidth) and the routing can get a bit painful. A 3D torus has far fewer cables, the routing can be really simple ("hop vertically until you reach your row, then hop horizontally to reach your node"), and the wrap-around connections are nice.
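To make the "hop along one axis at a time" idea concrete, here's a minimal sketch of dimension-ordered routing on a hypothetical k x k x k torus (all names are illustrative, not from any real system). The wraparound links mean the distance along each axis is the shorter of going forward or backward around the ring:

```python
def torus_hops(src, dst, k):
    """Hop count between two (x, y, z) nodes on a k^3 torus.

    Wraparound means each axis contributes min(forward, backward) steps.
    """
    return sum(min((d - s) % k, (s - d) % k) for s, d in zip(src, dst))

def route(src, dst, k):
    """Yield the nodes visited under dimension-ordered routing:
    finish the x axis, then y, then z, always taking the shorter direction."""
    cur = list(src)
    for axis in range(3):
        fwd = (dst[axis] - cur[axis]) % k
        step = 1 if fwd <= k - fwd else -1  # shorter way around the ring
        while cur[axis] != dst[axis]:
            cur[axis] = (cur[axis] + step) % k
            yield tuple(cur)
```

For example, on a 4x4x4 torus, `(0,0,0)` to `(1,2,3)` takes 1 + 2 + 1 = 4 hops, because the z axis is only one wraparound step away.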
That said, the torus approach was a gamble that most workloads would be nearest-neighbor, and allreduce needs extra work to optimize.
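One reason allreduce is workable at all on a torus is that each axis's wraparound links form a ring, and the classic ring all-reduce (reduce-scatter followed by all-gather) maps onto that. A toy simulation of the pattern, with plain Python lists standing in for nodes (not any real collective library):

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce over p nodes, each holding a length-p vector.

    Element j plays the role of "chunk j". Returns each node's final copy,
    which ends up being the elementwise sum of all inputs.
    """
    p = len(vectors)
    data = [list(v) for v in vectors]
    # Reduce-scatter: in step s, node i sends chunk (i - s) % p to node i+1,
    # which adds it to its own copy of that chunk.
    for s in range(p - 1):
        sends = [(i, (i - s) % p, data[i][(i - s) % p]) for i in range(p)]
        for i, c, val in sends:
            data[(i + 1) % p][c] += val
    # Now node i owns the fully reduced chunk (i + 1) % p.
    # All-gather: circulate the completed chunks once around the ring.
    for s in range(p - 1):
        sends = [(i, (i + 1 - s) % p, data[i][(i + 1 - s) % p]) for i in range(p)]
        for i, c, val in sends:
            data[(i + 1) % p][c] = val
    return data
```

Each node sends 2(p-1) chunks total, so bandwidth per link stays constant as p grows; the "extra work" is in overlapping the axes and hiding latency, not the basic algorithm.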
An AI data center tends to have enormous power and cooling capacity, less disk, and a slightly different networking setup. But really it just means "this part of the warehouse has more ML chips than disks."
Thank you very much, that is the piece of the puzzle I was missing. Naively, it still seems (to me) that a 3D torus needs far more hops than a regular multi-level switch fabric when you've got many thousands of nodes, but I can appreciate that the routing could be much simpler. Although I would guess in practice it requires something beyond the simplest routing solution to avoid congestion.
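A quick back-of-envelope check of that intuition, assuming an idealized k x k x k torus versus an idealized fat-tree built from r-port switches (illustrative numbers only, not a model of any real deployment):

```python
import math

def torus_worst_hops(k):
    # Wraparound halves the worst-case distance per axis to k // 2,
    # and there are three axes.
    return 3 * (k // 2)

def fat_tree_worst_hops(n, radix):
    # An r-port fat-tree reaches n hosts in roughly log_{r/2}(n) levels;
    # the worst path goes up to the top and back down: ~2 * levels hops.
    levels = math.ceil(math.log(n, radix // 2))
    return 2 * levels
```

For 4096 nodes, a 16^3 torus has a worst case of 24 hops, while a fat-tree of 64-port switches needs only about 6, so the hop-count intuition holds; the torus wins on cable count, cost, and routing simplicity rather than diameter.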
I think the choice they made, combined with some great software and hardware engineering, allows them to continue to innovate at the highest level of ML research within a reasonable dollar and complexity budget, regardless of the specific interconnect.