koayon's comments | Hacker News

This is a very fair point! If we had infinite compute then it's undeniable that transformers (i.e. full attention) would be better (exactly as you characterise it)

But that's the efficiency-effectiveness tradeoff that we have to make: given that compute is limited, would we prefer attention over shorter sequences or SSMs over longer sequences? The answer is probably "well, it depends on your use case" - I can definitely see reasons for both!

A fairly compelling direction for me is hybrid architectures (Jamba is a recent example). Here you can imagine having perfect recall over recent tokens and lossy recall over distant tokens. E.g. if the AI is generating a feature-length film, you "could imagine having Attention look at the most recent frames for short-term fluidity and an SSM for long-term narrative consistency" (quote from the OP).
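
Very roughly, the shape of that hybrid is something like the sketch below: exact attention over a recent window plus a cheap recurrent summary of everything older. This is just a toy illustration (the module and parameter names are made up, and real hybrids like Jamba interleave full Mamba and attention layers rather than bolting the two paths together like this):

    # Toy hybrid block: exact attention over the last `window` tokens,
    # plus a per-channel linear recurrence that (lossily) summarises
    # the whole history in O(n). Names are illustrative only.
    import torch
    import torch.nn as nn

    class HybridBlock(nn.Module):
        def __init__(self, d_model: int, n_heads: int = 4, window: int = 64):
            super().__init__()
            self.window = window
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.log_a = nn.Parameter(torch.zeros(d_model))   # recurrence decay
            self.b = nn.Parameter(torch.ones(d_model))        # input gain
            self.out = nn.Linear(2 * d_model, d_model)

        def forward(self, x):                  # x: (batch, seq, d_model)
            # Short-term path: full attention, but only over recent tokens.
            recent = x[:, -self.window:, :]
            attn_out, _ = self.attn(recent, recent, recent)
            # Long-term path: h_t = a * h_{t-1} + b * x_t, one state per channel.
            a = torch.sigmoid(self.log_a)      # keep the decay in (0, 1)
            h = torch.zeros(x.shape[0], x.shape[2], device=x.device)
            for t in range(x.shape[1]):
                h = a * h + self.b * x[:, t, :]
            # Recent tokens see both views; older context only via the state h.
            h_tiled = h.unsqueeze(1).expand(-1, attn_out.shape[1], -1)
            return self.out(torch.cat([attn_out, h_tiled], dim=-1))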


If I remember right, the Big Bird LLM had something like this: for a particular word it would attend strongly to its closer neighbours but only weakly to words far from it. Look up "sparse attention" - I think that's the relevant terminology. Not sure if it matches exactly what you described.
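
For a concrete (if very simplified) picture, the sliding-window component of Big Bird-style sparse attention boils down to a banded mask: token i may only attend to tokens within some fixed distance. A toy sketch (function name and sizes are just for illustration):

    import torch

    def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
        # mask[i, j] is True where token i may attend to token j,
        # i.e. only when j is within `window` positions of i.
        idx = torch.arange(seq_len)
        return (idx[None, :] - idx[:, None]).abs() <= window

    print(local_attention_mask(seq_len=8, window=2).int())
    # Banded matrix: 1s near the diagonal, 0s for distant pairs. Big Bird adds
    # a few global and random tokens on top of this local band.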


And given that attention compute is O(n^2) in the context length, it's a very real tradeoff, at least in the short term.
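
Back-of-the-envelope (counting only the pairwise attention scores and ignoring everything else):

    # The score matrix alone is n x n, so doubling the context quadruples it.
    for n in (1_000, 2_000, 4_000):
        print(f"{n} tokens -> {n * n:,} pairwise scores")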


Hey, OP here! Great question - h' in Equation 1a refers to the derivative of h with respect to time (t). This is a differential equation which we could solve mathematically, once we have x, to get a closed-form solution for h. We would then plug that h (the hidden state) into Equation 1b.

In our case, we don't actually solve for a closed form but instead compute the discrete representation (Equation 2).
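
If it helps to see it numerically, here's a minimal sketch of that discretisation step (zero-order hold, the discretisation the Mamba paper uses) and the resulting recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t. This isn't the paper's code - it assumes a single 1-D input signal and an invertible A, purely for illustration:

    import numpy as np
    from scipy.linalg import expm

    def discretize(A, B, dt):
        # Zero-order hold: A_bar = exp(dt A), B_bar = (dt A)^-1 (exp(dt A) - I) dt B
        A_bar = expm(dt * A)
        B_bar = np.linalg.solve(dt * A, A_bar - np.eye(A.shape[0])) @ (dt * B)
        return A_bar, B_bar

    def ssm_scan(A_bar, B_bar, C, x):
        # Discrete recurrence: h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t
        h = np.zeros(A_bar.shape[0])
        ys = []
        for x_t in x:                                # x: 1-D input signal
            h = A_bar @ h + (B_bar * x_t).ravel()
            ys.append((C @ h).item())
        return np.array(ys)

    # Tiny usage example with random (stable-ish) matrices:
    N = 4
    A = -np.eye(N) + 0.1 * np.random.randn(N, N)
    B, C = np.random.randn(N, 1), np.random.randn(1, N)
    y = ssm_scan(*discretize(A, B, dt=0.1), C, np.sin(np.linspace(0, 10, 100)))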

Hope that helps!


Another interesting one is that the hardware isn't really optimised for Mamba yet either - ideally we'd want more fast SRAM so that we can hold larger hidden states efficiently.


Definitely agree that a lot of work on hyperparameter tuning and maturing the ecosystem will be key here!

I see the Mamba paper as the `Attention Is All You Need` of SSMs - it might take a little while before we get everything optimised to the point of a GPT-4 (it took 6 years for transformers, but it should be faster than that now with all the attention on ML).

