
My bet is it's just brute force.

I don't understand how they did 10M though. That isn't in the brute-force-with-nice-optimizations-on-the-systems-side-might-do-it ballpark. But they aren't going to release this to the public anyway, so who knows; maybe they don't, and it actually takes a day to finish a 10M prompt.




10 million tokens means a forward pass is ~100 trillion vector-vector products. A single A6000 can do 38 trillion float-float products a second. I think their vectors are ~4,000 elements long?

So the question is: would the Google you know devote ~12,000 GPUs for one second to help a blogger find a line about Jewish softball, in the hopes that it would boost PR?
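Rough back-of-envelope of the above in Python (assuming ~4k-dim vectors, ~38 TFLOPS per A6000, and counting only the attention-score dot products of a single pass; layers, heads and the MLP are ignored):

    # Brute-force attention scores over a 10M-token prompt.
    # All numbers are rough assumptions, not measured values.
    context_len = 10_000_000      # tokens in the prompt
    dim = 4_000                   # assumed vector length
    a6000_flops = 38e12           # ~38 TFLOPS for one A6000 (rough)

    dot_products = context_len ** 2       # ~1e14 query-key pairs
    ops = dot_products * dim              # one multiply-add per element
    gpu_seconds = ops / a6000_flops       # counting an FMA as two FLOPs roughly doubles this

    print(f"{dot_products:.0e} vector-vector products")
    print(f"~{gpu_seconds:,.0f} A6000-seconds, i.e. ~{gpu_seconds:,.0f} GPUs for one second")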

My guess is yes tbh


idk

For longer contexts, the problem with brute force is more on the memory side than the compute side, both bandwidth and capacity. We actually have more than enough compute for N^2. The initial prompt processing is dense but still largely bound by memory bandwidth, and output is entirely bound by memory bandwidth since you can't make your cores go brrr with only GEMV. On top of that you need the capacity to keep the KV "cache" [0] for the session.

A single TPU v5e pod has only 4TB of HBM, and assuming pipeline parallelism across multiple TPU pods isn't going to fly, I suspect (I haven't run the numbers, but see the sizing sketch below) you get batch=1/batch=2 inference at best, which is prohibitively expensive. But again, who knows: Groq demonstrated a token-wise more expensive inference technique and got people wowed by pure speed. Maybe Google's similar move is long context. They have the additional advantage of exclusive access to TPUs, so before the H200 ships they may be the only ones who can serve a 1M-token LLM to the public without breaking the bank.

[0] "Cache" is a really poor name. If you don't do this you get O(n^3), which is not going to work at all. IMO it's wrong to name your intermediate state a "cache" if removing it changes the asymptotic complexity.
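To put a rough number on the capacity side, a minimal KV-cache sizing sketch. The model shape (80 layers, 64 KV heads, head dim 128, fp16) is a made-up dense-transformer assumption, not anything known about Gemini; the point is only the order of magnitude:

    # Rough KV-cache size for ONE sequence at long context.
    # The model shape below is an assumed dense transformer, not a real config.
    layers = 80
    kv_heads = 64        # no GQA/MQA savings assumed
    head_dim = 128
    bytes_per_value = 2  # fp16
    context_len = 1_000_000

    # K and V tensors per layer, each context_len x (kv_heads * head_dim)
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    cache_bytes = kv_bytes_per_token * context_len

    print(f"{kv_bytes_per_token / 2**20:.1f} MiB of KV state per token")
    print(f"{cache_bytes / 1e12:.2f} TB per sequence at {context_len:,} tokens")
    print(f"{int(4e12 // cache_bytes)} such sequence(s) fit in a 4 TB pod, before weights")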


Sorry, read up on FlashAttention. There is no storage bottleneck.


I'm not talking about the so-called quadratic memory requirement of the attention step; there NEVER WAS ONE.

I'm talking about a simple fact: to run LLM inference efficiently (cost-wise) you have to keep a KV "cache", and its size grows (linearly) with your expected batch size and your context window length. With a large context window it becomes even bigger than the model weights.

I don't want to be mean, but sorry:

Sorry, read up on PagedAttention. You clearly don't know what you are talking about; please be better.


I'm not sure you're actually doing the math for these long contexts. A naked transformer generating 1k tokens from a 1k prompt spends all its time doing a bunch of forward passes to generate each token; that's what's driven your intuition. A naked transformer generating 1k tokens from a 1M prompt spends all its time generating the embeddings for the prompt (filling the KV cache), and the iterative generation at the end is a tiny fraction of the compute even if you have to run it 1k times.
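A minimal sketch of that split, assuming the common ~2 x params FLOPs-per-token estimate for a dense transformer and a made-up 100B-parameter model (the quadratic attention-score term is ignored, which only understates the prefill share at 1M tokens):

    # Compute split between prompt processing (prefill) and token-by-token decode.
    params = 100e9  # assumed parameter count, not Gemini's

    def prefill_share(prompt_tokens: int, generated_tokens: int) -> float:
        prefill_flops = 2 * params * prompt_tokens      # one batched pass over the prompt
        decode_flops = 2 * params * generated_tokens    # one pass per generated token
        return prefill_flops / (prefill_flops + decode_flops)

    # 1k prompt, 1k generated: prefill is ~50% of FLOPs, but decode dominates
    # wall-clock time because it runs one memory-bound pass per token.
    print(f"{prefill_share(1_000, 1_000):.1%}")
    # 1M prompt, 1k generated: prefill is ~99.9% of the FLOPs.
    print(f"{prefill_share(1_000_000, 1_000):.1%}")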



