I don’t really see how this works though. Isn’t it the case that longer “compute” times are more expensive? Hogging a GPU overnight is going to be more expensive than hogging it for an hour.
Nah, it’d take all night because it would only be using the GPU for a fraction of that time, sharing it with other customers’ tokens and letting higher-priority workloads preempt it.
If you buy enough GPUs to do 1000 customers’ requests in a minute, you could run 60 requests for each of these customers in an hour, or you could run a single request each for 60,000 customers in that same hour. The latter can be much cheaper per customer if people are willing to wait. (In reality it’s a big N x M scheduling problem, and there are tons of ways to offer tiered pricing where cost and time are the main tradeoffs.)
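
To put rough numbers on that, here is a quick Python sketch of the arithmetic. Only the 1000-requests-per-minute capacity comes from the example above; the $600/hour cluster cost, the even cost split, and the function names are made up for illustration.

```python
# Back-of-the-envelope sketch. The hourly cost is a hypothetical figure,
# and splitting it evenly across customers is an assumption, not how any
# real provider necessarily prices things.
REQUESTS_PER_MINUTE = 1000      # cluster capacity from the example above
CLUSTER_COST_PER_HOUR = 600.0   # assumed dollar cost, purely illustrative


def hourly_capacity(requests_per_minute: int) -> int:
    """Total requests the cluster can finish in one hour."""
    return requests_per_minute * 60


def cost_per_customer(cluster_cost_per_hour: float,
                      requests_per_minute: int,
                      requests_per_customer: int) -> float:
    """Each customer's even share of one hour of cluster cost."""
    capacity = hourly_capacity(requests_per_minute)
    customers_served = capacity // requests_per_customer
    return cluster_cost_per_hour / customers_served


# Fast tier: 60 requests per customer per hour, so 1000 customers share the bill.
fast = cost_per_customer(CLUSTER_COST_PER_HOUR, REQUESTS_PER_MINUTE, 60)
# Slow tier: 1 request per customer per hour, so 60,000 customers share it.
slow = cost_per_customer(CLUSTER_COST_PER_HOUR, REQUESTS_PER_MINUTE, 1)

print(f"fast tier: ${fast:.2f} per customer-hour")
print(f"slow tier: ${slow:.2f} per customer-hour")
```

The hardware cost per request is the same in both cases. What changes is how many customers split the bill, which is where the cheaper "wait longer" tier comes from.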