
Try searching for SuperHOT and RoPE together: 8k-32k context lengths on regular old Llama models that were originally trained for only a 2k context.
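For anyone curious what that trick looks like mechanically, here is a minimal sketch of the linear position-interpolation idea behind SuperHOT-style RoPE scaling: position indices are divided by a scale factor so an 8k sequence maps back into the 0-2k range the model was trained on. The function names, shapes, and numpy implementation are illustrative, not taken from any particular codebase.

    import numpy as np

    def rope_angles(positions, dim, base=10000.0, scale=1.0):
        # Standard RoPE: channel pair i rotates at frequency base**(-2i/dim).
        # Position interpolation simply divides the position index by `scale`,
        # so e.g. positions 0..8191 are squeezed into the 0..2047 range
        # the model saw during training.
        inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
        return (positions[:, None] / scale) * inv_freq[None, :]  # (seq, dim/2)

    def apply_rope(x, angles):
        # x: (seq, dim) query or key vectors; rotate each (even, odd) channel pair.
        cos, sin = np.cos(angles), np.sin(angles)
        x_even, x_odd = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x_even * cos - x_odd * sin
        out[:, 1::2] = x_even * sin + x_odd * cos
        return out

    # Stretch a 2k-trained model to 8k by interpolating positions: scale = 4.
    seq_len, head_dim = 8192, 128
    q = np.random.randn(seq_len, head_dim).astype(np.float32)
    angles = rope_angles(np.arange(seq_len), head_dim, scale=8192 / 2048)
    q_rot = apply_rope(q, angles)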


Any trick that avoids full quadratic attention cripples a model's ability to reason "in the middle" even more than it already is. Good long-context models are currently a mirage. This is why no one is seriously using GPT-4-32k or Claude-100k in production right now.

Edit: even if it's doing full attention, as the other commenter says, it turns out that's not good enough! ("Lost in the Middle": https://arxiv.org/abs/2307.03172)


This is still doing full quadratic attention.
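In other words, interpolating the RoPE positions doesn't change the attention computation itself: every query still scores against every key, so the cost is still an n-by-n score matrix. A rough sketch of what "full quadratic attention" means, with made-up shapes:

    import numpy as np

    def full_attention(q, k, v):
        # Every query attends to every key, so the score matrix is (seq, seq):
        # compute and memory grow quadratically with sequence length,
        # no matter how the position indices were produced.
        scores = q @ k.T / np.sqrt(q.shape[-1])        # (seq, seq)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v                             # (seq, head_dim)

    seq_len, head_dim = 4096, 64
    q, k, v = (np.random.randn(seq_len, head_dim).astype(np.float32) for _ in range(3))
    out = full_attention(q, k, v)  # pays for a 4096 x 4096 score matrix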



