Running an LM on two GPUs in a single system already comes with roughly a 10x speed penalty. Shipping layers across a network will generally be even slower. They talk about 1 token per second; with images it will be even worse, due to the larger number of sequential steps.
It can be useful... if it's even possible. But the set of plausible use cases is quite slim.
Generation will be slower, so why bother? For high-throughput batch workloads? Maybe. But why use it when we already have Swarm by db0?
Training could theoretically be worth it, but something like Kickstarter plus GPU rental would likely be both more cost-effective and quicker.
Speculative sampling to the rescue: you decode locally with a smaller draft LLM, and every few tokens the large model verifies the whole draft in a single batched pass, accepting or rejecting each token. With the standard accept/reject scheme this provably matches the large model's output distribution exactly, with a big speedup, since you don't have to run the large model serially for every individual token.
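To make the accept/reject loop concrete, here's a minimal toy sketch of speculative sampling. The "models" here are hypothetical stand-ins (fixed probability tables over a three-token vocabulary), not real LLMs; the point is the control flow: draft k tokens cheaply, then accept each with probability min(1, p_big/p_small), and on the first rejection resample from the residual distribution.

```python
import random

random.seed(0)

VOCAB = ["a", "b", "c"]

def draft_probs(ctx):
    # Toy stand-in for a small draft model (ignores context for simplicity).
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def target_probs(ctx):
    # Toy stand-in for the large target model.
    return {"a": 0.5, "b": 0.4, "c": 0.1}

def sample(probs):
    # Sample a token from a {token: probability} dict.
    r = random.random()
    acc = 0.0
    for tok, p in probs.items():
        acc += p
        if r < acc:
            return tok
    return tok  # float-rounding fallback

def speculative_step(ctx, k=4):
    # 1) Draft k tokens with the cheap model.
    drafts, c = [], list(ctx)
    for _ in range(k):
        tok = sample(draft_probs(c))
        drafts.append(tok)
        c.append(tok)
    # 2) Verify all drafts (in a real system: one batched large-model pass).
    accepted, c = [], list(ctx)
    for tok in drafts:
        q = draft_probs(c)[tok]
        p = target_probs(c)[tok]
        if random.random() < min(1.0, p / q):
            accepted.append(tok)   # accepted: token is "free" large-model quality
            c.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), normalized.
            # This correction is what makes the output distribution exact.
            tp, dp = target_probs(c), draft_probs(c)
            resid = {t: max(0.0, tp[t] - dp[t]) for t in VOCAB}
            z = sum(resid.values()) or 1.0
            accepted.append(sample({t: v / z for t, v in resid.items()}))
            break
    return accepted

out = []
while len(out) < 16:
    out.extend(speculative_step(out))
print(out[:16])
```

Each step yields between 1 and k tokens for a single large-model verification, which is where the speedup comes from: the expensive model runs once per batch of drafts instead of once per token.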