Mamba Explained (thegradient.pub)
201 points by andreyk on March 30, 2024 | 44 comments


> But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions.

Lately I've been wondering... is this a problem, or a strength?

It might be a fallacy to compare how LLMs "think" with how humans think. But humor me for a second. When you are speaking, each time you emit a word you are not attending to every previous word in your sentence (as a transformer does); rather, you have a state in your mind that represents the grammar and concepts, and that state is continuously updated as you speak (more like an SSM).

Similarly, when you read a book, every time you read a word you are not attending to every previous word in the book. Your model of "the book" is rather a fuzzy/approximate state that is updated with new information every time a new word appears. Right? (I'm sorry, I know this is very handwavy and pseudoscientific, but bear with me.)

Ok, so if (big if) you feel like the above is true, then to match human-type language modelling, SSMs seem more human-like than transformers.

BUT... then aren't transformers strictly better in terms of accuracy? A transformer never "forgets" information, as long as it is within the context window, because it revisits that information every time it emits a new token.
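
To make the contrast concrete, here's a toy sketch of the two update rules (names and shapes are mine, purely illustrative, not any real model's code):

    import numpy as np

    def attention_step(history, query):
        # Transformer-style: the new token looks back over *every* stored token.
        # history: (t, d) -- all previous token vectors, kept for the whole context.
        scores = history @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ history        # weighted mix over the entire past

    def ssm_step(state, token, A, B):
        # SSM/RNN-style: the past is squashed into a fixed-size state that is
        # updated in place; earlier tokens are never revisited directly.
        return A @ state + B @ token

The first function has to keep (and rescan) everything; the second only ever carries a fixed-size state forward.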

So let's say we can remove the "quadratic attention" problem of transformers with SSMs. That's a nice training/inference performance boost. But... look at where we got with "naive" attention: GPT-4, Claude 3. It's not like we're hitting a wall with quadratic attention. It's absurdly more expensive than SSMs, but GPUs certainly aren't getting slower. If all AI work stopped now and only hardware improved, it wouldn't be long before GPT-4 could run on local hardware, right, provided Moore's law holds?

/end rant, not really sure what my point was. I'm not against SSMs (they're cool), but I'm wondering if the SOTA will ever be an SSM when attention is so damn good.


> Lately I've been wondering... is this a problem, or a strength?

It probably depends. But an idea I've been playing with: because transformers have such a strong ability for recall during inference, they might be introducing a strong inductive bias for memorization as opposed to generalization. Why bother to build a complete world model when you can just attend to the answer? The global minimum in loss (at least on the training dataset) would favor memorizing and interpolating circuits over circuits that generalize well. This seems consistent with LLMs as they exist today: superhuman at recall, very mediocre at reasoning. Though, for what it's worth, existing SSMs haven't yet shown they can outperform (or even match) transformers when it comes to reasoning.

If this hypothesis were true, you might expect to see grokking in state space models more quickly than in transformer models.

(Even if it's hard to train transformers to generalize, superhuman recall is still incredibly valuable, and likely a hybrid system would offer the best of both worlds.)


Yes, transformers are obviously more capable than humans in my opinion. Claude can ingest dozens of pages in seconds and -- in a single shot -- write a summary bringing in relevant passages.

The innovation is not the speed, but the lack of recursion or iteration. Humans, even accomplished ones, have to reread sections and really 'internalize' ideas before being able to summarize, and very few humans can -- in a single attempt -- generate perfect speech. Most of us speak and unknowingly revise our own speech as we go along. Unlike transformers, which speak confidently, we start a sentence and then decide halfway through that it's not going where we like. Then we start it over again, and by the powers of human attention, no one seems to really notice.

Transformers are just insanely complicated and expensive to train.


I view transformers as being like the language center of the brain. When we write or speak, especially when it's critical to get things right, we have this ability to think "that doesn't make sense" and start over. I view this recursion as more of a strength than a weakness. You can get an LLM to generate an answer, and when asked about the validity of that answer it will acknowledge that it got it wrong. This raises the question: if it had perfect recall and understanding, why did it give the wrong answer in the first place?

I don't know how the reasoning part comes to us, but if we could implant that capability into a transformer model, it would end up pretty good.


I agree. Also, when I'm writing, I am working towards a hierarchy of goals at the level of the sentence, the paragraph and beyond, and I'm also wondering whether what I have written and plan to write could be confusing or misunderstood.

I think it's fair to ask whether these are essential techniques for improving precision and clarity, or just a way to compensate for not being able to see the whole picture all at once - but if the latter is the case, there's still room for improvement in LLMs (and in me, for that matter). I notice that experts on a topic are often able to pick out what matters most without any apparent hesitation.


> I view this recursion as more of a strength than weakness

Sure, it's a strength given that transformers are currently limited by compute budget, but theoretically, if we had a way to overcome this, it seems obvious to me that transformers' 'one-shot' ability makes them better.

That being said, the recursive aspect you're referencing can be built into a transformer as well. That is a sampling and training problem.


> we start making a sentence and then decide halfway through its not going where we like

I'll just add the observation that when we do this it's largely based on feedback received from the recipient (well, so long as you're talking-with as opposed to talking-at) - we're paying attention to whether the audience is paying attention, to any small facial tics that might betray skepticism or agreement, and so on. I'm looking forward to interacting with an LLM that pairs an emotion vector with each token it has previously produced.

hume.ai goes a long way in analyzing audio; it's just a matter of time before they're ingesting realtime facial cues as well, to incorporate the audience's reaction into the choice of what to say next.


This is a very fair point! If we had infinite compute, then it's undeniable that transformers (i.e. full attention) would be better (exactly as you characterise it).

But that's the efficiency-effectiveness tradeoff that we have to make: given that compute is limited, would we prefer attention over shorter sequences or SSMs over longer sequences? The answer is probably "well, it depends on your use case" - I can definitely see reasons for both!

A fairly compelling thought for me is hybrid architectures (Jamba is a recent one). Here you can imagine having perfect recall over recent tokens and lossy recall over distant tokens. E.g. if the AI is generating a feature-length film, you "could imagine having Attention look at the most recent frames for short-term fluidity and an SSM for long-term narrative consistency" (quote from the OP)
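
To sketch how I picture one such hybrid step (purely hypothetical layout and naming, not how Jamba actually interleaves its layers):

    import numpy as np

    def hybrid_step(window, ssm_state, x_t, A, B):
        # Exact attention over a short recent window of token vectors, plus a
        # fixed-size SSM state carrying a lossy summary of everything older.
        scores = window @ x_t
        w = np.exp(scores - scores.max())
        w /= w.sum()
        short_range = w @ window              # perfect recall, but only within the window
        ssm_state = A @ ssm_state + B @ x_t   # constant-size long-range memory
        return short_range, ssm_state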


If I remember right, the Big Bird model had something like this: for a particular word it would attend strongly to its closer neighbours but only weakly to words far from it. Look up "sparse attention"; I think that's the relevant terminology. Not sure if it matches exactly what you described.


And given that compute is O(n^2) in the context window, it's a very real tradeoff, at least in the short term.
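
Back-of-the-envelope, counting only attention-score entries per layer and ignoring heads, layers and constant factors:

    for n in (4_000, 32_000, 128_000, 1_000_000):
        print(f"{n:>9,} tokens -> {n * n:>22,} attention scores per layer")

A 32x longer context means roughly 1000x more scores to compute and move around.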


>> But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions.

> Lately I've been wondering... is this a problem, or a strength?

Exactly. There are a lot of use cases where perfect recall is important. And earlier data may be more or less incompressible, for example if an LLM is working on a large table of data.

Maybe we'll end up with different architectures being used for different applications. E.g. simple chat may be OK with an RNN-type architecture.

I've also seen people combine Mamba and Transformer layers. Maybe that's a good tradeoff for some other applications.


It depends on the task, I imagine. Writing a novel was mentioned: keeping important story lines in your memory for a long time will be necessary, or at least certainly more important than remembering what the characters were eating for lunch on page 10. But if you need to find that one loophole in a contract, you will probably benefit from the perfect recall.


>Lately I've been wondering... is this a problem, or a strength?

It's a strength; fundamentally it's impossible to achieve the same degree of accuracy with a sub-quadratic attention mechanism: https://arxiv.org/abs/2209.04881 (unless the Strong Exponential Time Hypothesis is false, which is considered about as unlikely as P=NP).


>> Is this a problem or a strength?

I was wondering the same thing. I understand why the initial developers of this method presented it as a strength. Still, I think it's a problem too:

If the Transformer reads this sentence:

A equals B

It understands that B comes after A, and therefore that A equals B. But how does it learn the reverse direction, that B equals A?

I am referring to the logical problems that most (all?) modern language models suffer from.


I see many people get confused by this due to the widely spread (and false) "stochastic parrot" theme. But these models are much more than mere sentence-repeaters. In a way, the model is not learning that after A comes B. I mean, it could. With a lack of additional training data it probably would, too. But with enough data, this kind of sentence completion based purely on existing sentences no longer works, because it would saturate the parameters. So to retain and improve accuracy during training, the model has to come up with a compression that essentially forms a model of the real world, or at least of the world that the training corpus describes [1]. In that sense, it no longer "knows" that B comes after A (except within the input context), but it will have learned that there is a special relation between A and B. It can then also apply this kind of learned logic to new concepts that first appear in the context during inference. With all that happening internally, it only has to morph this state back into natural-language output. And with billions of parameters and many layers, there is more than enough computational room for this to happen. In fact, recent work has shown that even small models can get pretty good at logic if you get the training data right.

[1] https://arxiv.org/abs/2210.13382


We're running out of the ability to make transistors smaller and closer together, so barring some major breakthrough I wouldn't expect Moore's law to continue nearly long enough to get to the point of running GPT-4 on consumer hardware in the short term.


Well, consumer hardware can run something on the order of ~50B parameters quantized at a "reasonable" price today; we'd need about 5 or 6 doublings to run something GPT-4-tier at 1T+ parameters. So it would need to continue for roughly a decade at least?
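
Rough arithmetic behind that guess (the 50B and 1T figures are just the assumptions above, and the ~2-years-per-doubling cadence is itself optimistic these days):

    import math

    local_now = 50e9   # rough guess at what consumer hardware handles today (quantized)
    target    = 1e12   # assumed GPT-4-tier parameter count
    doublings = math.ceil(math.log2(target / local_now))
    print(doublings, "doublings; ~", 2 * doublings, "years at one doubling every ~2 years")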

Current models are horrendously inefficient though, so with architectural improvements we'll have something of that capability far sooner on weaker hardware.


Ah, but we've just begun stacking transistors in the third dimension.


That doesn't solve the problem, it just pushes it down the road a bit. The exponential growth is merely offset by a constant factor, once. Unless we figure out how to push transistors into the 5th, 6th, etc. dimension with every new generation.


It was never a solution. Moore's law has more than one dimension as well: not just density but heat dissipation. You can't cool down a transistor that's surrounded by transistors on all sides.


> It's not like we're hitting a wall with quadratic attention. It's absurdly more expensive than SSMs, but GPUs certainly aren't getting slower.

We are not hitting a wall, but a slope. Hardware improvements will not make up for it indefinitely. Software will have to make up for it, but the problem is that it costs millions of dollars to hit compile.


>> When you are speaking, each time you emit a word, you are not attending to every previous word in your sentence

I was doing exactly this until late in my youth, until I learnt that people do it sequentially. But it is doable to create the connections and pick the sensible case. Not the most relaxing thing.


It's a tradeoff to be managed depending on the application, rather than a problem.


Very good point, and the sooner we can accept this difference (we access hyperdimensional entities that we discover through language and math, via fast and slow access, and vocalize them through the alphabets we learned to read), the more "intelligence" we can unlock from AI.


What's an SSM?

For the uninitiated (like me): apparently it stands for State Space Model.


I don't think it's weird or broken to think about and compare what LLMs do vs what our brains do.

It shows, more often than not, that we are also parrots.


I find it difficult to understand certain math and science papers/articles due to ambiguous use of language.

For example "all previous tokens can be passed to the current token." That seems like a poorly constructed sentence. A token is not a function and it's not an algorithm either... How can you pass tokens to a token? This type of ambiguous language in academic papers makes it hard to read... Maybe the phrase 'every token has an association with every other previously encountered token' would be better? Or every token is used to compute the token vector for each token... I don't know, all I can do is guess the meaning of the word 'passed'. They want us to infer and fill in the gaps with our own assumptions. It assumes that we are primed to think in a certain highly constrained way...

For some reason, a lot of the academic writing around AI is littered with such imprecise language. They choose to use niche concepts and repurposed wording that their own small community invented, rather than using words and ideas that are more widely understood but would convey the same information.

Rational people who aren't directly involved in those fields, and who generally resist jumping to conclusions, will struggle to understand what is meant, because a lot of those words and ideas have different interpretations in their own fields.

I studied machine learning at university and wrote ANNs from scratch and trained them and even I find the language and concepts around LLMs too ambiguous. I'd rather just ask ChatGPT.

One thing that bothers me is that the community has moved away from relating concepts to neurons, interconnections, input layers, hidden layers and output layers. Instead, they jump straight into vectors and matrices... Pretending as though there is only one way to map those calculations to neurons and weights. But in fact, this abstraction has many possible interpretations. You could have fully connected layers or partially connected layers... Maybe you need a transformer only in front of the input layer or between every layer... So many possibilities.

The entire article means little if considered in isolation outside of the context of current configurations of various popular frameworks and tools.


I agree, although I've always interpreted it as a combination of the difficulty of explaining complex architectures and of not really understanding why things work the way they do. A lot of modern AI sits in this kind of quasi-empirical realm just above (in an emergent-properties sense) analytic math and statistics, and it seems like there's not a very good integrative account or understanding of what's going on, or a way of deriving what direction to go in. So you end up with poor explanations, in part because the authors of the structures themselves don't quite understand why things are working as they are.


That's not what it says in the article. It actually says "information from all previous tokens can be passed to the current token".

That statement is meaningfully different from "all previous tokens can be passed to the current token", and both really make sense if you understand attention mechanisms.


Sorry for the misquote, but it's a distraction from my issue, which was with the usage of the word 'passed'.

Do you pass information from other tokens to a token in the sense that each token processes information from other tokens? A token isn't a processing unit AFAIK, it's just a word part. The processing is not the responsibility of the token itself. My understanding is that tokens may be associated with each other via an external structure but not passed to each other. Or maybe they meant a token vector? And the token vector contains information from related tokens? It's unclear.

To me, 'passed' means data passed to a function or algorithm for processing. It's confusing unless a token is a function or algorithm.

My point is that this language only makes sense if you are already up to date in that field.


Well, they gave the equations, so follow closely where the token representations end up and how they're acted upon.
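
Roughly, 'passed' cashes out as a weighted sum: the vector computed at the current position is rebuilt out of the (projected) vectors at earlier positions. A stripped-down single-head sketch (my own toy code, not the paper's notation):

    import numpy as np

    def attend(X, Wq, Wk, Wv):
        # X: (t, d) -- one row per token seen so far; Wq/Wk/Wv are learned projections.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        # Causal mask: position i may only look at positions 0..i.
        scores = np.where(np.triu(np.ones_like(scores, dtype=bool), k=1), -1e9, scores)
        W = np.exp(scores - scores.max(axis=-1, keepdims=True))
        W /= W.sum(axis=-1, keepdims=True)
        # Row i of the result is a weighted average of V's rows 0..i. That mixing is
        # what gets described as information from earlier tokens being "passed" to
        # token i; the token itself isn't a function, the layer does the work.
        return W @ V

So nothing is handed to a token; the attention layer mixes earlier positions' value vectors into the current position's output, and that mixing is the "passing".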


Anyone else keep seeing articles about Mamba and thinking it's about Python/Conda? It's annoying when the new cool thing picks the same name as something else you like that deserves attention.


Sounds like you need a language model to help you categorize Mamba articles into Python and non-Python articles?


> attention

I see what you did there


Links to more about Mamba (selective state space models) on HN yesterday:

https://news.ycombinator.com/item?id=39853958#39855430


This submission has the same content as the link here (submitted to HN about a month ago):

https://news.ycombinator.com/item?id=39501982 https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html


Yes, and it's the same author, this time published on The Gradient (the earlier link was to the personal blog). The Gradient, by the way, are amazing curators of AI news in general and have one of the better podcasts I am aware of for interviewing developers in the trenches.

Adding: this resurgence of interest in Mamba is also due to some actual SOTA progress with SSMs, like the new AI21 Labs release this week [1], and we're likely to see others merging different architecture layers (this one is a 52B MoE with 12B params active during inference, blending both Mamba and transformer layers).

>As the first production-grade model based on Mamba architecture, Jamba achieves an unprecedented 3X throughput and fits 140K context on a single GPU.

[1] https://www.ai21.com/jamba


I just have to say it: that image shows gunpla, i.e. Mobile Suit Gundam, not Transformers!


An official request has been made to ICANN to rescind the OP's nerd card.


This is more human-like, and people will complain that it doesn't have a photographic memory. That is, it's not superhuman in that regard. But there are many tasks where superhuman recall is not required. We know this because those tasks are currently performed by humans.


So in an effective Mamba query, the question goes at the end, after the input data? I thought the question should go at the beginning, so the model can decide which information in the data is relevant.


I could be wrong, as I haven't used Mamba, but it seems to remain similar to transformers in that it doesn't "decide" anything and streams tokens to follow the existing ones; attention isn't a thing in the same way, but recency does still have impact. To that end, putting context after the question makes it more likely to follow the context, not the question.


This is the best explanation I have seen for Mamba.


TLDR: Friendship ended with transformers. Now Mamba is my best friend.



