Really? Seems like scaling is pretty tolerant of latency, but very bandwidth intensive. Thus the move from IB to various flavors of ethernet (for AMD's GPUs, Tenstorrent, various others). Not to mention Broadcom pushing 50 and 100 Tbit ethernet switching chips for AI.
Even 25 Gbit is pretty affordable for home these days; if it scaled 5x better than 5 Gbit, that might be enough to make larger models MUCH more practical.
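Rough napkin math (all numbers assumed for illustration, not measured): if every layer has to exchange activations between two nodes, both link speed and per-message latency feed into the per-token cost, something like:

```python
# Per-token network cost if every layer exchanges activations between two
# nodes (tensor-parallel-style split). All figures are assumptions.

hidden_size   = 8192      # assumed model width
num_layers    = 80        # assumed layer count
bytes_per_val = 2         # fp16 activations
latency_s     = 20e-6     # assumed ~20 us per message over ethernet

activation_bytes = hidden_size * bytes_per_val   # ~16 KB per layer
total_bytes = activation_bytes * num_layers      # ~1.3 MB per token

for gbit in (5, 25):
    bandwidth_Bps = gbit * 1e9 / 8               # bytes per second
    transfer_s = total_bytes / bandwidth_Bps
    latency_total_s = latency_s * num_layers     # one hop per layer
    per_token_ms = (transfer_s + latency_total_s) * 1e3
    print(f"{gbit:>2} Gbit/s: ~{per_token_ms:.1f} ms of network time per token")
```

With these made-up numbers the 5 Gbit link spends a few ms per token just moving activations, and the faster link cuts the transfer part roughly 5x while the fixed per-hop latency stays put.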
It heavily depends on the workload. If one node frequently has to touch memory on another node, e.g. computing the output of weights that live on the other node for the LLM, it's going to be dog slow because a remote access can take 100x as long as a local one. If you can batch the work into chunks that are mostly processed on one node and then handed off to the next, it parallelizes easily.
E.g. if the individual layers of your model fit on one node and the outputs can be pipelined, work keeps cascading through the nodes and it does well. But because each token an LLM generates depends on the one before it, a single sequence can't keep that pipeline full. You can see it in this [0] image from the attached blog post where he was testing llama.cpp: each node processes a batch of work, passes it off to the next node, then sits idle.
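To make the idle pattern concrete, here's a toy timeline simulation (not the author's llama.cpp setup, all timings invented) of single-sequence decoding with layers pipelined across four nodes:

```python
# Toy timeline: layers split across NUM_NODES nodes, decoding one sequence.
# Token t+1 can't start until token t has left the last stage, so only one
# node is busy at a time, mirroring the idle gaps in the blog post's trace.

NUM_NODES = 4
TOKENS = 3
STAGE_MS = 10      # assumed compute time per node per token
HOP_MS = 1         # assumed hand-off time between nodes

busy = [0.0] * NUM_NODES
clock = 0.0
for tok in range(TOKENS):
    # next token depends on the previous one, so stages run strictly in order
    for node in range(NUM_NODES):
        clock += STAGE_MS
        busy[node] += STAGE_MS
        if node < NUM_NODES - 1:
            clock += HOP_MS   # pass activations to the next node

for node, b in enumerate(busy):
    print(f"node {node}: busy {b / clock:.0%} of the time")
```

With four nodes each one ends up busy roughly a quarter of the time; batching several independent sequences is what fills those gaps back in.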