> For contexts and models with d_model > n_ctx/12, the context-dependent computational cost per token is a relatively small fraction of the total compute.
For GPT-3, n_ctx is 2048 and d_model is 12288, which is far larger than n_ctx/12 ~ 171, so the condition clearly holds.
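For what it's worth, here is a quick sketch of where the n_ctx/12 threshold seems to come from, assuming the standard shapes d_attn = d_model and d_ff = 4*d_model (so N ~ 12 * n_layer * d_model^2) together with GPT-3's published dimensions:

```python
# Rough check of the d_model > n_ctx/12 condition, assuming GPT-3's published
# shape (n_layer=96, d_model=12288, n_ctx=2048) and the usual d_attn = d_model,
# d_ff = 4*d_model, so N ~ 12 * n_layer * d_model**2.
n_layer, d_model, n_ctx = 96, 12288, 2048

N_approx = 12 * n_layer * d_model ** 2        # ~1.74e11, close to the quoted 175B
# Context-dependent FLOPs per token relative to 2N:
#   (2 * n_layer * n_ctx * d_model) / (2 * 12 * n_layer * d_model**2)
#   = n_ctx / (12 * d_model)
context_share = n_ctx / (12 * d_model)

print(f"N ~ {N_approx:.2e}")                          # ~1.74e+11
print(f"context share of 2N ~ {context_share:.2%}")   # ~1.39%
```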
From eq 2.2, the additive context-dependent term is usually only a few tens of millions of FLOPs per token (and about 5 billion even for GPT-3), so for N > 1B the approximation C_forward ≈ 2N should be good. But it doesn't seem to work: GPT-3's inference FLOPs are reportedly 3.4E+18, so the ratio comes out around 19,000, not 2.
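To make the arithmetic concrete, here is a minimal sketch of the eq 2.2 terms, assuming N = 175e9 and the same GPT-3 dimensions as above; the 3.4E+18 number is just the inference figure quoted in the question, whose source and units (per token, per request, or total) I don't know:

```python
# Eq 2.2 per-token forward-pass FLOPs: C_forward ~ 2*N + 2*n_layer*n_ctx*d_model
# (taking d_attn = d_model). Assumes GPT-3's published shape.
n_layer, d_model, n_ctx = 96, 12288, 2048
N = 175e9

model_term   = 2 * N                            # ~3.5e11 FLOPs per token
context_term = 2 * n_layer * n_ctx * d_model    # ~4.8e9 FLOPs per token
c_forward    = model_term + context_term

print(f"2N            = {model_term:.2e}")
print(f"context term  = {context_term:.2e}")
print(f"C_forward / N = {c_forward / N:.2f}")   # ~2.03, i.e. the expected ~2

# Note: eq 2.2's C_forward is per token; an aggregate figure like the quoted
# 3.4E+18 covers some (unknown to me) number of tokens, so the two aren't
# directly comparable without dividing by that token count.
```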