
Is it just me, or does every approach basically boil down to not wanting to pay the full quadratic cost of attention over the context (usually by selecting which tokens to attend to, or by using some computationally cheaper substitute for full attention)?

I feel like all these approaches are roughly equivalent to running a fully dense attention matrix over a smaller context while carefully curating what goes into that context — also known to us humans as summarizing each bit of text, or (perhaps less efficiently) going through a textbook with a highlighter.

My intuition is that the winning approach will be a small(ish) context, let's say 8k tokens, paired with an efficient summarization and dynamic information retrieval scheme.
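To make the idea concrete, here's a toy sketch of that retrieval step: rank candidate chunks against the query and greedily pack the best ones into a fixed token budget, so dense attention only ever runs over the curated window. Everything here is illustrative — the bag-of-words cosine similarity is a stand-in for a real embedding or summarization model, and the whitespace "token" count is a crude proxy for a real tokenizer.

```python
import math
from collections import Counter

def score(query_tokens, chunk_tokens):
    # Cosine similarity between bag-of-words vectors; a real system
    # would use learned embeddings instead.
    q, c = Counter(query_tokens), Counter(chunk_tokens)
    dot = sum(q[t] * c[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in c.values())))
    return dot / norm if norm else 0.0

def build_context(query, chunks, budget=8192):
    # Greedily pack the highest-scoring chunks into a fixed token
    # budget; the model then pays quadratic attention only over this
    # small curated window instead of the whole corpus.
    query_tokens = query.lower().split()
    ranked = sorted(chunks,
                    key=lambda ch: score(query_tokens, ch.lower().split()),
                    reverse=True)
    context, used = [], 0
    for ch in ranked:
        n = len(ch.split())  # crude token count
        if used + n <= budget:
            context.append(ch)
            used += n
    return "\n".join(context)

chunks = [
    "Attention cost grows quadratically with sequence length.",
    "The weather in Paris is mild in spring.",
    "Sparse attention selects a subset of tokens to attend to.",
]
print(build_context("quadratic attention cost", chunks, budget=20))
```

With a 20-token budget, the two attention-related chunks fit and the irrelevant one is dropped — the "highlighter" doing its job before the expensive model ever sees the text.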


