> For contexts and models with d_model > n_ctx/12, the context-dependent computational cost per token is a relatively small fraction of the total compute.
For GPT-3, n_ctx is 2048 and d_model is 12288, which is far larger than n_ctx/12 ~ 171, so the condition clearly holds.
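For what it's worth, here is a quick sketch of where the n_ctx/12 threshold seems to come from, assuming the standard shapes d_attn = d_model and d_ff = 4*d_model (so N ~ 12 * n_layer * d_model^2) together with GPT-3's published dimensions:

```python
# Rough check of the d_model > n_ctx/12 condition, assuming GPT-3's published
# shape (n_layer=96, d_model=12288, n_ctx=2048) and the usual d_attn = d_model,
# d_ff = 4*d_model, so N ~ 12 * n_layer * d_model**2.
n_layer, d_model, n_ctx = 96, 12288, 2048

N_approx = 12 * n_layer * d_model ** 2        # ~1.74e11, close to the quoted 175B
# Context-dependent FLOPs per token relative to 2N:
#   (2 * n_layer * n_ctx * d_model) / (2 * 12 * n_layer * d_model**2)
#   = n_ctx / (12 * d_model)
context_share = n_ctx / (12 * d_model)

print(f"N ~ {N_approx:.2e}")                          # ~1.74e+11
print(f"context share of 2N ~ {context_share:.2%}")   # ~1.39%
```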
From eq 2.2, the additive context-dependent term is usually only a few tens of millions of FLOPs per token (and about 5 billion even for GPT-3), so for N > 1B the approximation C_forward ≈ 2N should be good. But it doesn't seem to work: GPT-3's inference FLOPs are reportedly 3.4E+18, so the ratio comes out around 19,000, not 2.
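To make the arithmetic concrete, here is a minimal sketch of the eq 2.2 terms, assuming N = 175e9 and the same GPT-3 dimensions as above; the 3.4E+18 number is just the inference figure quoted in the question, whose source and units (per token, per request, or total) I don't know:

```python
# Eq 2.2 per-token forward-pass FLOPs: C_forward ~ 2*N + 2*n_layer*n_ctx*d_model
# (taking d_attn = d_model). Assumes GPT-3's published shape.
n_layer, d_model, n_ctx = 96, 12288, 2048
N = 175e9

model_term   = 2 * N                            # ~3.5e11 FLOPs per token
context_term = 2 * n_layer * n_ctx * d_model    # ~4.8e9 FLOPs per token
c_forward    = model_term + context_term

print(f"2N            = {model_term:.2e}")
print(f"context term  = {context_term:.2e}")
print(f"C_forward / N = {c_forward / N:.2f}")   # ~2.03, i.e. the expected ~2

# Note: eq 2.2's C_forward is per token; an aggregate figure like the quoted
# 3.4E+18 covers some (unknown to me) number of tokens, so the two aren't
# directly comparable without dividing by that token count.
```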