Paving the way to efficient architectures: StripedHyena-7B

kcorbitt · on Dec 8, 2023

For short context tasks looks like it's slightly stronger than Llama 7B and slightly weaker than Mistral 7B. Really impressive showing for a completely new architecture. I've also heard that it was trained on far fewer tokens than Mistral, so likely still room to grow.

Overall incredibly impressive work from the team at Together!

tempusalaria · on Dec 8, 2023

Did they disclose the training compute/token count?

stellaathena · on Dec 9, 2023

Nope :(

icyfox · on Dec 8, 2023

Common wisdom in most industries is to release bad PR announcements on a Friday and good ones towards the start of the week. It's interesting how the advent of twitter communication has shifted the ML ecosystem to publishing work whenever it's ready versus trying to find an optimal weekday. Or maybe they're optimizing for the weekend hackers that will take improvements released late in the week and put them into practice on Saturdays and Sundays.

alsodumb · on Dec 8, 2023

NeurIPS is also starting this Sunday, I'm sure some of these releases are meant to get the word going during the conference.

refulgentis · on Dec 8, 2023

It is interesting, been thinking about that myself a bit: some random associated thoughts: NeurIPS is next week, & there was OpenAI drama news that went under the radar due to Friday rule.

bratao · on Dec 8, 2023

And this uses Hyena, that can be considered a "previous generation" of Mamba. I think that this anwsers the question about the scalability of SSM and the transformer finally found an opponent.

ttul · on Dec 9, 2023

Well, since Together did hire the author of Mamba…

algo_trader · on Dec 8, 2023

> H 7B is always faster than optimized Transformers > (>10%, >20% and >50% end-to-end faster than FlashAttention

50% improvement on very large sequences is great. But it is not yet transformative (pun intended..)

Presumably FlashAttention already does lots of the less-than-quadratic improvement

goalonetwo · on Dec 8, 2023

There seems to be a new model every single day. How do people have time to keep track with everything going on in AI?

hinkley · on Dec 8, 2023

From decades of observing at a distance and observing observers at a distance, I think it's safe to say that, like fusion, there are walls that AI run into, not unlike the risers on a staircase, and when we collectively hit one, there's a lot of scuttling back forth. A lot of movement, but no real progress. If that plateau goes on too long, excitement (and funding) dry up and things die down.

Then someone figures out how to get past the current plateau, and the whole process repeats. That could be new tech, a new architecture, or it could be old tech that was infeasible and had to wait for Moore's Law.

Right now we are on the vertical part of the sawtooth pattern. Everyone hopes this will be the time that takes us to infinity, but the old people are just waiting for people to crash into the new wall.

fvv · on Dec 8, 2023

Why things should dry up when contrary to fusion ai is already usable by millions daily ? Even if prpgress should stall a bit the product or fine-tunes or normal progress will still be super supeful , the "too soon" point has been surpassed

blagie · on Dec 9, 2023

A lot of previous plateaus in AI are usable and used by billions daily, for example, giving good navigation routes on your phone, managing NPCs in a video game, showing ads, or recommending movies.

It's not that they don't have value -- they do, and in the trillions of dollars -- but once understood, they move from "AI" to "algorithms" and stop being exciting.

The current progress feels different to me, though. The current step in capability is much higher than previous ones, as is the potential disruption.

Al-Khwarizmi · on Dec 9, 2023

I think what makes the current iteration of AI different is that we don't understand how the emerging abilities work.

A map navigation algorithm: we understand it, we know where the limit is (basically it cannot do anything that isn't map navigation), so it stops being exciting.

GPT: we don't understand it, we don't know where the limit is... And it doesn't seem it will stop being exciting until we do.

kybernetikos · on Dec 9, 2023

People say this a lot - that we don't understand this or that, but I'm not really sure what they mean. We know exactly how these algorithms work. We know every calculation - the maths is not particularly difficult, we understand how the training process leads to information being stored in the weights, we know how inference works. What more would you want to understand before you would agree we understood it?

blagie · on Dec 9, 2023

> We know exactly how these algorithms work.

We have no idea how these algorithms work.

> We know every calculation - the maths is not particularly difficult,

We do know that.

> we understand how the training process leads to information being stored in the weights, we know how inference works.

We do not know that.

> What more would you want to understand before you would agree we understood it?

Let me give an analogy:

We have an almost perfect understanding of transistors. If you hand me a Qualcomm mobile chipset in a black box, I'll have little or no understanding of how that allows me to make phone calls. Back in the day, I understood the x86 instruction set very well. However, if you gave me the binary of a video game, I'd have no idea how it worked. Neuroscientists understand the mathematics of how neurons work, imperfectly (but pretty well). For the sake of argument, we can pretend the models are perfect. We understand the neural wiring of simple organisms perfectly. We still have very little idea of how the human brain works.

The algorithms in deep learning are evolved and have billions of parameters. We understand the general topology, and the math of individual neurons, but we have absolutely no idea how the things work as a system. Anyone who tells you they do is lying (very likely with no ill intent; they're probably deluding themselves as well).

The people doing deep learning are, by and large, not brilliant mathematicians, of the type who did earlier AI. The math is simple compared to most of the convex optimization algorithms which came before (and could probably be made much better if those were applied). Even at the human level, a lot of work in deep learning is

- randomly tweaking parameters, topologies, and algorithms

- developing intuition (NOT theory) for which ones work better, and

- bullshitting explanations for why that might be (which would, at best, pass for hypothesis in any scientific process)

It's hard for me to emphasize how little we know about how or why these things work. We just set up a general framing which evolves well, and evolved it. An analogy would be if we set up a random number generator to write a piece of code, ran it 10^10^10 times, and picked the result which made the best wavelet transforms. We'd have no clue how it works. The only difference is (1) we have algorithms which are more tractable than randomly picking algorithms (2) we set up neural networks which evolve better than code, largely by virtue of being continuous rather than discrete.

kybernetikos · on Dec 9, 2023

I'm still struggling to understand what counts as 'knowing how it works' for you.

In my view, if you randomly generated a piece of code to do a task, you know how it works - you can see the algorithm right there in front of you. If you randomly generated it and checked each instance until you found a good one, you even know how you got it and why it's good at the task (because you checked it and threw away the ones that weren't). Obviously if you start not knowing what the code is, you have a simple, surface level lack of understanding which you can resolve by tracing it. Once you've done that, you understand everything there is to understand about it. The fact that it produces nice wavelet transforms is a simple product of how it was found.

What more do you want to understand about it? What more is there to understand about it?

blagie · on Dec 9, 2023

I have plenty of pieces of code I've written, decades ago, where I know what they do, but have NO idea how they work.

If I want to understand a piece of code, I need to read it and understand it. Modern models have billions of parameters and are trained over >10^20 computations. That's more than I can ever hope to read.

> because you checked it and threw away the ones that weren't

I know what it does under specific circumstances. I don't know what it does elsewhere. We have a pretty good understanding of how GPT-4 works on training data, but we have a very poor understanding of what it does for the countless other uses we see. Code I write, I analyze carefully for corner cases.

If we develop an AI which has a corner case of "exterminate humanity" which wasn't in the training set, that's, well, very possible.

I've trained plenty of neural networks (even once coding in machine code straight to custom hardware, back in the day). I can't say I understand how very many of them work, though.

> you can resolve by tracing it

You can't trace through 200B parameters or through 10^20 computations. That's beyond human capacity. We have no idea how it works, and we have a very poor understanding of emergent behaviors.

Evolution "trains" biological organisms to survive to have as many babies as possible. Vengeance? Love? Loyalty? Pain? Hate? Emergent behaviours.

kybernetikos · on Dec 9, 2023

> I know what it does under specific circumstances. I don't know what it does elsewhere.

No, but if you get given a circumstance, you know how to work out what it does in that circumstance. Are you saying that you need to keep every input -> output mapping in your head to feel you understand a piece of code? I feel like I understand multiplication pretty well, but there are many multiplication calculations you could give me where I wouldn't know the answer without a lot of thinking. There's some I couldn't work out by myself in my lifetime. That doesn't stop me feeling like I understand multiplication pretty well.

> Vengeance? Love? Loyalty? Pain? Hate? Emergent behaviours.

Sure, and arguably at the level we're talking about, those are descriptive rather than explanatory. 'Vengeance' isn't something a neuron knows about, nor is it a biological mechanism in our cells, it's how we describe high level behavior resulting from the interactions of the cells. It's an abstraction. If you had the accurate model you were talking about earlier, you'd be able to work out that given the right input, a particular behavior is output. That others might call that behavior 'vengeance', makes not a single iota of difference to your ability to predict the behavior of the system.

Are you saying that you need to have developed high level descriptions of the behavior of a system in order to feel you understand it? What if there are no high level descriptions? In the hypothetical scenario where we hit on an algorithm randomly, there's no requirement that it translates to any specific high level concepts, there's just input, output and the algorithm, all of which we can understand.

Or perhaps you mean that you already have a set of categories for output behavior and to truly understand something you need to be able to categorise the inputs and know which broad input categories result in which output categories? I could probably accept that as a broad definition of understanding, but there's a lot of flex there in terms of exactly what level of granularity you're requiring.

blagie · on Dec 10, 2023

For a multiplication, I can work out any problem, and I have a sense of what it will do under any circumstance.

> Are you saying that you need to have developed high level descriptions of the behavior of a system in order to feel you understand it?

Yes. That's almost the definition of "understanding."

> What if there are no high level descriptions?

There are things we don't or can't understand. That's approximately Godel's Theorem. That likely includes some phenomena in fluid mechanics and in quantum mechanics. It may or may not include large-scale deep learning models.

It's okay to admit we don't, or can't, understand something.

> Or perhaps you mean that you already have a set of categories for output behavior and to truly understand something you need to be able to categorise the inputs and know which broad input categories result in which output categories?

There are different levels of understanding. However, with LLMs, I don't have a clear sense of under what conditions one might decide to, for example, eradicate humanity. I'd say that suggests I have a very limited understanding of them. I don't think there are many people with a better understanding than mine, and no one with a good understanding.

I feel like I understand a multiplication algorithm well enough to know it won't do that ever, however. If I multiply two numbers, I won't get a humanity-ending answer out.

I don't know if deep learning models have some analogue to emotions. I do know multiplication doesn't.

And so on.

og_kalu · on Dec 9, 2023

>you know how it works - you can see the algorithm right there in front of you.

Seeing the algorithm in front of you doesn't mean you know how it works. It's gibberish code. If you'd never learnt C or any other programming language in your life, i could show you the C Code for a popular application. You could inspect it all you like. You will still understand nothing. The best you can do is, "this code is running this application".

In the real world, you can just pick up a C book and start learning. In this instance, no one on earth has learnt C and there are no books on it.

and neural network calculations are not just one unvarying "algorithm"

kybernetikos · on Dec 9, 2023

Knowing the algorithm literally means that you could theoretically reproduce it yourself, step by step (absent practical worries about longevity). Each of the steps is simple and well understood. What to do at each step is simple and well understood. What more is there to understand?

og_kalu · on Dec 9, 2023

By that definition, i know COBOL too since i can look up programs coded in it. I can reproduce the program too by hand-typing it and running it elsewhere. Banks should hire me asap /s.

I genuinely don't get what is so hard to understand here. You don't know the algorithm. You can see it. That's all. You don't suddenly understand information just because you can see and copy it. Would certainly be nice though.

kybernetikos · on Dec 9, 2023

What I'm trying to work out is what you mean by 'understand'. When it comes to an algorithm, what is it that you need to know beyond how to execute it in order to believe you understand it?

og_kalu · on Dec 9, 2023

Being able to make predictable changes directly to the algorithm would be a starter.

TheDudeMan · on Dec 9, 2023

Yes. The thing that makes the current generation of AI different is that the architectures scale. Another $10 million in training effort WILL yield improvement. And Moore’s law pairs nicely with scaling behavior. In other words, there is currently no end in sight. Plus, algo advancements like this make things happen ever faster. Plus, increased VC money means more money to throw at hardware and more folks trying new things in software. Soon we’ll be replaced :(

actionfromafar · on Dec 8, 2023

Depends on what you are looking for. I have this hesitation too. What we have and are on track for is useful and cool, but how far will we come in this spurt until we are back at slight incremental gains?

Implementation wise in business, we are very early though. It feels like email in 1995, we have barely scratched the surface of what LLMs can mean for business and everyday life.

ReptileMan · on Dec 9, 2023

Because suddenly the tech moves from world transformative to world enhancing. The potential profits from trillions to mere billions. From immortality to slightly longer lifespan.

goalonetwo · on Dec 8, 2023

Thanks for putting this so eloquently. That's exactly how I feel as well.

YetAnotherNick · on Dec 9, 2023

We know for a fact that there is no wall till GPT 4 and open source still has long way to reach there.

holoduke · on Dec 8, 2023

I know. The new reddit look sucks big. But the sub reddits still give you good insights in the latest developments. I am playing arround a lot with image related stuff arround stable diffusion. Comfyui subreddit gives me daily. Now after a few weeks i think i have a fair good understanding in what is hot. Checkpoints, ipadapters, facemodels etc. Just playing arround and you will get a grasp of it. I guess its similar with text generation.

tavavex · on Dec 9, 2023

FYI: if you want a faster experience, https://old.reddit.com/ still works perfectly

benreesman · on Dec 8, 2023

I sorta keep huggingface open next to HN now. Follow like “TheBloke” and a few others and you’ll know what’s up.

ekianjo · on Dec 9, 2023

Thebloke just quantizises existing models what is the point of following him?

all2 · on Dec 9, 2023

Because he quantizes up and coming models. You pay attention to him because he's a proxy for what's currently hot.

m3kw9 · on Dec 9, 2023

You look at the comments here to see if it’s any different or getting rave reviews. Most model are crap, the new mistral one from yesterday seem to be pretty good. But most models are not very practical or useful for anything other than amusement right now. I imagine in a year we’d get close to gpt4 models locally and with low spec requirements where a 2 series NVidia card can run.

taneq · on Dec 9, 2023

Is almost like everything’s accelerating exponentially.

refulgentis · on Dec 8, 2023

I honestly don't think I could if I still had a job not in it

skerit · on Dec 8, 2023

7B models are so exciting. So much is happening with those smaller models.

Metricon · on Dec 9, 2023

BTW, for anyone who might not be aware of it, this model trained by Intel based on the Mistral architecture is probably the single best general 7B model available currently:

https://huggingface.co/Intel/neural-chat-7b-v3-2 (also see https://huggingface.co/Intel/neural-chat-7b-v3-1 from the previous version for more details)

It's licensed Apache 2.0 and unaligned (uncensored).

gardnr · on Dec 9, 2023

How is it better than the model from the team that made the dataset? https://huggingface.co/Open-Orca/Mistral-7B-SlimOrca

anon373839 · on Dec 9, 2023

The Intel one had supervised fine-tuning with the SlimOrca dataset, and then DPO alignment on top of that using a preference dataset.

The technique for generating the preference data is what’s so interesting about that one. Instead of having human labelers choose a preferred response, they generated a response from a small model and a large model, and then always selected the large one’s as the preferred response.

Metricon · on Dec 9, 2023

I haven't personally tried that one, but on the HuggingFace LLM Leaderboard:

Open-Orca/Mistral-7B-SlimOrca - AVG: 60.37, ARC: 62.54, HellaSwag: 83.86, MMLU: 62.77, TruthfulQA: 54.23, Winogrande: 77.43, GSM8k: 21.38

Intel/neural-chat-7b-v3-2 - AVG: 68.29, ARC: 67.49, HellaSwag: 83.92, MMLU: 63.55, TruthfulQA: 59.68, Winogrande: 79.95, GSM8k: 55.12

minimaxir · on Dec 8, 2023

I wish they were a smidge smaller since 7B LLMs just barely run on a 16GB VRAM GPU (like a T4 server GPU) without quantization shenanigans.

Fortunately the learnings from finding better 7B models will trickle down, or more will be done with distillation (e.g. Gemini Nano)

3abiton · on Dec 8, 2023

What's wrong with quantization?

minimaxir · on Dec 8, 2023

There's still a (subjective) generative quality loss, even with recent tricks to minimize it.

leetharris · on Dec 8, 2023

Q8 is generally less than 1% degradation, Q5KM is around 3%. After that is when it starts to really degrade.

triyambakam · on Dec 9, 2023

Can you explain for a noob why?

orbital-decay · on Dec 9, 2023

Easier to train, easier to experiment with. Most research and prototyping happens on the scale that is just barely out of the "toy" category.

ttul · on Dec 9, 2023

Quantization means reducing the number of bits used to encode each floating point number constituting a parameter in the model So instead of having billions of possible values per weight, you might have just 255. The model has to have its weights crammed into a much smaller number of possible values, which reduces its ability to produce good outputs.

triyambakam · on Dec 9, 2023

Sorry, my question is, why are the 7B models so exciting?

aseipp · on Dec 9, 2023

They don't require really expensive and power-hungry components to run, i.e. a mid-range GPU can run a (4-or-5-bit quantized) 7B model at +50 tokens/second, so it's completely feasible to run on a small budget. They are easier to fine-tune, because they are smaller, and you can even just do CPU inference if you really want. There are good OSS implementations like llama.cpp and exllama. And there is a lot of belief that 7B models are not yet tapped out in terms of efficacy, so they will keep improving.

regularfry · on Dec 9, 2023

A 7b quantised model is also about the biggest you can run on an M1 MacBook too. It's nowhere near that speed but it does work.

kaoD · on Dec 9, 2023

To add some numbers to sibling's comment, if a parameter is originally fp16 (a half precision float, I think this is what LLaMA was trained on) you need 16bit*7*10^9 ~= 13GiB of RAM to fit a whole 7B model in memory. Current high-end consumer GPUs (4090) top at 24GB, so these small models fit in GPUs you can have at home.

For comparison, the next largest size is usually 13B which at fp16 already takes ~24GiB (some of which you'll be using for your regular applications like your browser, the OS, etc.)

7B also faster since the critical path of the signal flow is smaller.

Training requires even more RAM (and the more RAM you have the faster you can train).

You could quantize 13B to make it fit in consumer cards without large losses (see e.g. charts for k-quants LLaMA inference[0]) but training on quantized models impacts more than inference (couldn't find charts here, I'm on mobile). But this means you could also quantize 7B models to run them on even less powerful GPUs like low-end consumer GPUs or even eventually mobile phones (which are also power-sensitive due to running on batteries).

[0] https://github.com/ggerganov/llama.cpp/pull/1684

firejake308 · on Dec 8, 2023

Darn, I was hoping the RWKV people had finally obtained reportable results. This is still interesting, though. Maybe we will see more alternatives to transformers soon

stellaathena · on Dec 9, 2023

RWKV had a paper accepted at EMNLP and released models which match the performance of equivalent transformers.

What else are you looking for?

senseiV · on Dec 9, 2023

They Do, the latest rwkv v5, matches mamba at 3b scale, and from the benchmarks I see, its similar to hyena

refulgentis · on Dec 8, 2023

There are RWKV numbers in there, so you sort of got your wish sideways. :)

Has RWKV not released anything until now? I thought it was an open project that was in use, if sort of by a hipster 1%

stellaathena · on Dec 9, 2023

Paper: https://arxiv.org/abs/2305.13048

Models (v4 are the ones from the paper): https://huggingface.co/RWKV

GitHub: https://github.com/BlinkDL/RWKV-LM

senseiV · on Dec 9, 2023

V5 7b is out, close to hyena, gets 1400 t/s on a 3090, while an h100 llama 7b 8bit is 1200 t/s

mmaunder · on Dec 8, 2023

Is the model available or is this just an API/app?

lelag · on Dec 8, 2023

Weights seem available at https://huggingface.co/togethercomputer/StripedHyena-Nous-7B....

SparkyMcUnicorn · on Dec 8, 2023

And here's the base model: https://huggingface.co/togethercomputer/StripedHyena-Hessian...

And GH repo: https://github.com/togethercomputer/stripedhyena

mkesper · on Dec 9, 2023

Please stop using non-zero-based y axes. Every comparison chart in this post uses it, overemphasizing the difference between models.

anon373839 · on Dec 8, 2023

This is a seriously impressive model.