Hopefully they’re doing plenty of batching - you don’t even need to roll your own as you’re describing. Inference servers like Triton will dynamically batch requests with SLA params for max response time (for example).
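For reference, turning that on in Triton is just a few lines in the model’s config.pbtxt - the values here are purely illustrative, not tuned for any real deployment:

    # config.pbtxt (illustrative values, not a recommendation)
    max_batch_size: 8
    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 5000  # the SLA knob: longest the server waits to fill a batch
    }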
That said, I don’t think anyone outside of OpenAI knows what’s going on operationally. Same goes for VRAM usage, potential batch sizes, etc. This is all wild speculation. Same goes for whatever terms OpenAI is getting out of MS/Azure.
What isn’t wild speculation is that even with three-year reserved pricing, a last-gen A100x8 instance (H100 is already shipping) will set you back roughly $100k/yr - plus all of the usual cloud bandwidth, etc. fees, which would likely increase that by at least 10-20%.
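Back of the envelope (assuming an effective ~$11.50/hr three-year reserved rate for an 8x A100 node - the real rate depends on provider and terms):

    # rough yearly cost for one 8x A100 node, 3-year reserved
    hourly = 11.50                   # assumed effective reserved rate, USD/hr
    base = hourly * 24 * 365         # ~ $100,740/yr for the instance alone
    with_overhead = base * 1.15      # +15% assumed for bandwidth, storage, etc.
    print(round(base), round(with_overhead))  # ~100740 ~115851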
We’re talking about their pricing and costs here. This gives a general idea of what anyone trying to self-host this would be up against - even if they could get the model.
Yes, and a devops engineer to manage even a moderately complex cloud deployment averages an extra $150k/yr. I don't know where this "cloud labor (skill, knowledge, experience, and time) is free" thinking comes from.
8, 80k, or 800k GPUs depending on requirements and load - the point remains the same.