
https://arxiv.org/pdf/2001.08361.pdf. See the C_forward formula approximation.


Thank you. Though it isn't quite clear to me whether the additive part is negligible?


From the paper

> For contexts and models with d_model > n_ctx/12, the context-dependent computational cost per token is a relatively small fraction of the total compute.

For GPT-3, n_ctx is 2048 and d_model is 12288 >> 2048/12 ≈ 171.
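To make the comparison concrete, here is a quick sketch of eq. 2.2 (C_forward ≈ 2N + 2·n_layer·n_ctx·d_model) plugged with GPT-3's published config (n_layer = 96, d_model = 12288, n_ctx = 2048); the N ≈ 12·n_layer·d_model² approximation for the non-embedding parameter count is from the same paper:

```python
# Eq. 2.2 of Kaplan et al. (2020): per-token forward-pass compute is
#   C_forward ≈ 2*N + 2*n_layer*n_ctx*d_model
# where the second term is the context-dependent attention cost.
n_layer, d_model, n_ctx = 96, 12288, 2048  # GPT-3 175B config

# Non-embedding parameter count, approximated as N ≈ 12 * n_layer * d_model^2
N = 12 * n_layer * d_model ** 2  # ~174B, close to GPT-3's 175B

dense_term = 2 * N                             # parameter-dependent FLOPs/token
context_term = 2 * n_layer * n_ctx * d_model   # context-dependent FLOPs/token

print(f"2N           = {dense_term:.2e}")      # 3.48e+11
print(f"context term = {context_term:.2e}")    # 4.83e+09
print(f"ratio        = {dense_term / context_term:.0f}x")  # 72x
```

Since d_model/(n_ctx/12) = 72, the context-dependent term is ~1/72 of 2N, which is why the paper treats it as a small fraction of the total.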


From eq. 2.2, the additive (context-dependent) part is usually only a few tens of millions of FLOPs, so for N > 1B the approximation should be good. But it doesn't seem to hold: GPT-3's actual inference FLOPs are reportedly 3.4E+18, so the ratio comes out to 19,000, not ~2.



