> We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32 tensor parallelism, and 4 sequence parallelism (ring dimension).
Each TPU v4 chip has 32 GiB of HBM, so that's about 4 TiB (128 x 32 GiB) of memory total in fp16, without quantization.
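Sanity-checking the arithmetic (the per-chip HBM figure is the published TPU v4 spec; the 32 x 4 mesh split is from the quoted passage):

```python
# Back-of-the-envelope memory math for the quoted v4-128 configuration.
tensor_parallel = 32       # tensor-parallel shards (from the quoted mesh)
ring_parallel = 4          # sequence-parallel / ring dimension
chips = tensor_parallel * ring_parallel    # 128 TPU v4 chips
hbm_per_chip_gib = 32                      # HBM per TPU v4 chip

total_hbm_gib = chips * hbm_per_chip_gib
print(f"{chips} chips x {hbm_per_chip_gib} GiB = {total_hbm_gib} GiB "
      f"(~{total_hbm_gib / 1024:.0f} TiB of HBM)")
# -> 128 chips x 32 GiB = 4096 GiB (~4 TiB of HBM)
```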
So, uh, they thought: "we have this datacenter full of TPUs lying around, you know what would be best? If we made a model that needs the entire thing just for inference."
Brute-forcing a quadratic-complexity problem seems like wastefulness at its worst.
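To put rough numbers on the quadratic term, assuming 7B-class dimensions (32 layers, d_model = 4096; these are assumptions, not the paper's exact config):

```python
# Rough cost of full self-attention at 1M tokens, under assumed 7B-class dims.
n = 1_000_000        # sequence length
d_model = 4096       # assumed model width
layers = 32          # assumed layer count

# The n x n score matrix, if it were ever materialized:
score_tb_fp16 = n * n * 2 / 1e12
print(f"score matrix: ~{score_tb_fp16:.0f} TB in fp16, per head, per layer")

# FLOPs still scale with n^2 even when the matrix is never materialized:
flops_per_layer = 4 * n * n * d_model      # QK^T plus attention-weighted V
total_flops = flops_per_layer * layers
print(f"~{total_flops:.1e} FLOPs of attention per 1M-token forward pass")
```

RingAttention never materializes that matrix (it computes attention blockwise while passing KV chunks around the ring of devices), but the FLOP count still grows with n^2.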
> Scaling Inference: We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32 tensor parallelism, and 4 sequence parallelism (ring dimension). We perform inference in pure single precision, where additional improvements can be made through techniques in scalability such as quantization.
There are variants with shorter context lengths that probably need far less VRAM. The model itself is only about 32 GB, so in a Q5_K_M quant it wouldn't be that bad, and the shorter-context versions might actually be workable on higher-end workstations with llama.cpp-style cleverness. A back-of-the-napkin guess puts the 32K-context version at maybe around 60 GB.
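For a rough sanity check, here's the kind of napkin math I'd do, assuming LLaMA-2-7B-style dimensions (32 layers, 32 KV heads of size 128) and an fp16 KV cache; these dims are assumptions, not confirmed numbers for this model:

```python
# Rough VRAM estimate for a 7B-class model at 32K context (assumed dims).
layers = 32
kv_heads = 32            # assuming full multi-head KV, no GQA
head_dim = 128
seq_len = 32_768
bytes_per_value = 2      # fp16 KV cache

kv_cache_gb = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e9
weights_q5_gb = 7e9 * 5.5 / 8 / 1e9    # ~5.5 bits/weight for a Q5_K_M-style quant
print(f"KV cache: ~{kv_cache_gb:.0f} GB, quantized weights: ~{weights_q5_gb:.0f} GB")
# -> KV cache: ~17 GB, quantized weights: ~5 GB
```

That ignores activations and any overhead from the paper's pure single-precision inference setup, so treat it as a lower bound.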
I would be curious to know if anyone has tried a hybrid approach where a Mamba-like architecture handles longer-term recall, combined with a transformer for short-term memory.
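Just to make the question concrete, here's a toy sketch of one block that wires a gated linear recurrence (a stand-in for a Mamba-style long-range path, not the real selective scan) together with windowed attention for short-range memory. All names and dimensions are invented for illustration:

```python
import torch
import torch.nn as nn

class ToyLinearRecurrence(nn.Module):
    """Per-channel gated linear recurrence: h_t = a_t * h_{t-1} + b_t * x_t."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 2 * dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        a, b = self.gate(x).chunk(2, dim=-1)
        a, b = torch.sigmoid(a), torch.sigmoid(b)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.shape[1]):              # sequential scan (slow but clear)
            h = a[:, t] * h + b[:, t] * x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

class ToyHybridBlock(nn.Module):
    """Long-range recurrent path plus short-range local attention, summed."""
    def __init__(self, dim, heads=4, window=64):
        super().__init__()
        self.window = window
        self.long_path = ToyLinearRecurrence(dim)
        self.short_path = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        seq = x.shape[1]
        # Causal mask restricted to a local window for the attention path.
        idx = torch.arange(seq)
        dist = idx[None, :] - idx[:, None]       # j - i
        mask = (dist > 0) | (dist < -self.window)  # block future and far past
        attn_out, _ = self.short_path(x, x, x, attn_mask=mask)
        return self.norm(x + self.long_path(x) + attn_out)

x = torch.randn(2, 256, 128)                     # (batch, seq, dim)
print(ToyHybridBlock(128)(x).shape)              # torch.Size([2, 256, 128])
```

A real version would swap the toy recurrence for an actual selective-scan SSM layer; this just shows how the two paths could be combined in one block.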
Can you feed PDFs into this? It seems to handle images like a champ, and those needle-in-a-haystack benchmarks are wild. And it's open?! Wow, very very impressive!!
World model on million-length video and language with RingAttention - https://news.ycombinator.com/item?id=39367141 - Feb 2024 (58 comments)