> IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input.
The computations are not O(n^2) in terms of model weights (parameters), but linear. If it were quadratic, the number would be ludicrously large. Like, "it'll take thousands of years to process a single token" large.
(Classic transformers are quadratic in the context length, but that's a much smaller number. And it seems pretty obvious from the increases in context lengths that this is no longer the case in frontier models.)
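A back-of-envelope sketch of where the compute actually goes (the parameter count, layer count, and model width below are my own illustrative assumptions for a ~40 GB fp16 model, not any particular model):

    # Back-of-envelope FLOP counts for running a prompt of length L through a
    # decoder-only transformer. All numbers are illustrative assumptions.
    params = 20e9               # ~20B parameters, ~40 GB at fp16
    layers, d_model = 44, 6144  # hypothetical GPT-NeoX-20B-like shape

    def prompt_flops(L):
        matmul = 2 * params * L                  # linear in params and in L
        attention = 4 * layers * d_model * L**2  # classic attention: quadratic in L
        return matmul, attention

    for L in (1_000, 8_000, 64_000):
        m, a = prompt_flops(L)
        print(f"L={L:>6}: matmuls {m:.1e} FLOPs, attention {a:.1e} FLOPs")

    # If compute were quadratic in the parameter count instead of linear,
    # each token would cost roughly params/2 (~1e10) times more.
    print(f"quadratic-in-params penalty: ~{params / 2:.0e}x")

At typical context lengths the quadratic attention term is a small fraction of the total; it only starts to dominate at very long contexts, which is presumably part of why frontier models have moved to cheaper attention variants.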
> These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user
The parameters are static; they aren't mutated during a query, so that memory can be shared across all concurrent users. The non-shared per-query memory (mostly the attention KV cache) is vastly smaller.
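A rough sketch of the per-user memory, using the same made-up model shape as above (the context length and cache precision are also assumptions):

    # Per-user memory is dominated by the KV cache, not by the (shared) weights.
    weight_bytes = 40e9          # ~40 GB of weights, loaded once and shared
    layers, d_model = 44, 6144   # same hypothetical model shape as above
    bytes_per_elem = 2           # fp16 cache

    def kv_cache_bytes(context_len):
        # Per token, per layer: one key and one value vector of width d_model.
        return context_len * layers * 2 * d_model * bytes_per_elem

    ctx = 2048
    per_user = kv_cache_bytes(ctx)
    print(f"KV cache for one {ctx}-token conversation: {per_user / 1e9:.1f} GB")
    print(f"concurrent users per additional 40 GB: ~{int(weight_bytes / per_user)}")
    # Multi-query / grouped-query attention and quantized caches shrink this further.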
> How much would I have to charge for this?
Empirically, as little as 0.00001 cents per token.
For context, the Bing search API costs 2.5 cents per query.
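To put those two numbers side by side (the tokens-per-answer figure is just an assumption for illustration):

    # Cost comparison, assuming a ~500-token answer per query (my own guess).
    cost_per_token_cents = 0.00001
    tokens_per_answer = 500
    llm_cents = cost_per_token_cents * tokens_per_answer   # 0.005 cents
    bing_cents = 2.5
    print(f"LLM answer: {llm_cents} cents vs Bing query: {bing_cents} cents "
          f"(~{bing_cents / llm_cents:.0f}x)")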