LWM – Open LLM with 1M Tokens Context Window (github.com/largeworldmodel)
161 points by amrrs on Feb 16, 2024 | 36 comments


Recent and related:

World model on million-length video and language with RingAttention - https://news.ycombinator.com/item?id=39367141 - Feb 2024 (58 comments)


We've seen this before. Anyone know what the VRAM requirements are, whether it's been quantized, and if/when llama.cpp might support it?


From the paper[1]:

> We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32 tensor parallelism, and 4 sequence parallelism (ring dimension).

Each TPU v4 chip has 32 GiB of HBM, so that's about 4 TiB (128 × 32 GiB) of memory in total, without quantization.

[1]: https://arxiv.org/abs/2402.08268



Indeed. Almost all of the inference memory overhead comes from attention matrices and the KV cache.
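
For a rough sense of scale, here's a napkin estimate of just the KV cache for a single 1M-token sequence. All the dimensions are assumptions (LLaMA-2-7B-ish, which I believe the LWM base is), not numbers from the paper:

    # Assumed LLaMA-2-7B-style dimensions -- not figures from the LWM paper.
    layers, heads, head_dim = 32, 32, 128
    tokens = 1_000_000
    bytes_per_val = 2                       # fp16; double it for the paper's fp32
    kv_bytes = 2 * layers * heads * head_dim * tokens * bytes_per_val   # K and V
    print(kv_bytes / 2**30)                 # ~488 GiB, per sequence, before weights

So the cache alone for one sequence is hundreds of GiB, which is why a single box doesn't come close.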


So uh, they thought "we have this datacenter full of GPUs lying around; you know what would be best? If we made a model that needs the entire thing just for inference."

Brute-forcing a quadratic-complexity problem seems like wastefulness at its worst.


Aren't most LLMs just brute-force probability engines?


Sounds like they're loading the internet into all that VRAM.


Probably never gonna run on normal boxes; it uses RingAttention, which only makes sense when you have many GPUs. It's a datacenter-only model.


I'd still like to know the requirements. You can rent A100s, or have access through work, etc.


> Scaling Inference We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32 tensor parallelism, and 4 sequence parallelism (ring dimension). We perform inference in pure single precision, where additional improvements can be made through techniques in scalability such as quantization.


It might even fit on a maxed-out MacBook Pro.


Maybe in a few years, with 4-bit quantisation (around 1 TiB is needed for that).


There are variants with shorter context lengths that probably need far less VRAM. The model itself is only about 32 GB, so in a Q5_K_M quant it wouldn't be that bad, and the shorter-context versions might actually be workable on higher-end workstations with llama.cpp-style cleverness. Back-of-the-napkin guess: maybe around 60 GB for the 32k-context version.


From https://arxiv.org/pdf/2402.08268.pdf

> We trained our models using TPUv4-1024, which is approximately equivalent to 450 A100s

> Inference for such long sequences requires a minimum of v4-128

So you'll need roughly 60 A100s for inference.
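
Naively pro-rating the paper's equivalence by chip count (ignoring per-chip memory differences):

    a100_per_v4_1024 = 450       # paper: TPUv4-1024 is roughly 450 A100s (training)
    inference_chips = 128        # paper: minimum v4-128 slice for 1M-token inference
    print(inference_chips / 1024 * a100_per_v4_1024)   # ~56 A100s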


They're using RingAttention, which scales the self-attention computation linearly with the number of devices by passing key-value blocks around a ring:

https://arxiv.org/abs/2310.01889 (Submitted on 3 Oct 2023)
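
If it helps, here's a toy single-process NumPy sketch of the idea (nothing to do with their actual JAX code): each "device" keeps its query block fixed while the KV blocks hop around the ring, and results are folded in with an online softmax so the full seq x seq attention matrix is never materialized. No causal mask, for simplicity:

    import numpy as np

    def ring_attention(q_blocks, k_blocks, v_blocks):
        # Toy simulation: "device" i holds q_blocks[i] and starts with KV block i.
        # KV blocks rotate one hop per step; each device folds every incoming
        # block into a streaming-softmax accumulator.
        n_dev = len(q_blocks)
        scale = 1.0 / np.sqrt(q_blocks[0].shape[-1])
        num = [np.zeros_like(q) for q in q_blocks]            # running weighted V sums
        den = [np.zeros(len(q)) for q in q_blocks]            # running softmax denominators
        mx = [np.full(len(q), -np.inf) for q in q_blocks]     # running row maxima
        kv = list(zip(k_blocks, v_blocks))
        for _ in range(n_dev):                                # n_dev hops around the ring
            for i in range(n_dev):
                k, v = kv[i]
                s = q_blocks[i] @ k.T * scale                 # scores for this block only
                m_new = np.maximum(mx[i], s.max(axis=-1))
                fix = np.exp(mx[i] - m_new)                   # rescale old accumulators
                p = np.exp(s - m_new[:, None])
                num[i] = num[i] * fix[:, None] + p @ v
                den[i] = den[i] * fix + p.sum(axis=-1)
                mx[i] = m_new
            kv = kv[-1:] + kv[:-1]                            # pass KV blocks to the next device
        return [n / d[:, None] for n, d in zip(num, den)]

    # Sanity check against plain full attention over the concatenated sequence.
    rng = np.random.default_rng(0)
    qb, kb, vb = ([rng.normal(size=(8, 16)) for _ in range(4)] for _ in range(3))
    out = np.concatenate(ring_attention(qb, kb, vb))
    Q, K, V = (np.concatenate(b) for b in (qb, kb, vb))
    s = Q @ K.T / np.sqrt(16)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    assert np.allclose(out, w @ V / w.sum(axis=-1, keepdims=True))

Memory per device stays proportional to its own block, and the only communication is the KV block you hand to your neighbour each step, which is the whole trick.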


So what approach do these large-context models take here: do they cache the processed context, or re-compute it each time?


That's exciting!

Any views on the license? GitHub says Apache 2 for the weights... but Hugging Face says LLaMA license.


If you don't have a GPU/TPU farm this isn't as exciting, as it seems to need around 2 TiB even with 8-bit quantization.


Was it only yesterday that Gemini 1.5 Pro announced the 1M token context? Jesus, stuff is moving fast.


This paper was released Feb 13th.


10M by end of week, 1 billion by end of month.


I'd be curious to know if anyone has tried a hybrid approach where a Mamba-like architecture handles longer-term recall, combined with a transformer for short-term memory.


Yep, https://arxiv.org/abs/2402.04248 tried a "Mambaformer", which seemed to perform well.
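
Not the architecture from that paper, but to illustrate the general shape people mean by "hybrid", here's a toy sketch: a cheap linear recurrence stands in for the Mamba/SSM part (state carried across the whole sequence), feeding a sliding-window attention layer for precise short-range mixing. Everything here is made up for illustration:

    import numpy as np

    def linear_recurrence(x, decay=0.9):
        # Stand-in for the SSM / Mamba-like part: an exponential moving average
        # whose hidden state spans the whole sequence (O(n) time and memory).
        h = np.zeros(x.shape[-1])
        out = np.empty_like(x)
        for t, xt in enumerate(x):
            h = decay * h + (1.0 - decay) * xt
            out[t] = h
        return out

    def sliding_window_attention(x, window=64):
        # Stand-in for the transformer part: each token attends only to the
        # last `window` tokens, so cost stays linear in sequence length.
        d = x.shape[-1]
        out = np.empty_like(x)
        for t in range(len(x)):
            ctx = x[max(0, t - window + 1): t + 1]
            s = ctx @ x[t] / np.sqrt(d)
            w = np.exp(s - s.max())
            out[t] = (w / w.sum()) @ ctx
        return out

    def hybrid_block(x):
        # Long-range recall from the recurrent layer, short-range precision
        # from local attention, with residual connections around both.
        x = x + linear_recurrence(x)
        x = x + sliding_window_attention(x)
        return x

    x = np.random.default_rng(0).normal(size=(256, 32))
    print(hybrid_block(x).shape)   # (256, 32)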


Maybe a fun Karpathy video here...


I'm starting to wonder if the needle-in-a-haystack benchmark is maybe too easy.


Incredible achievement. Awkward timing with Gemini announcing the same.


Can you feed PDFs into this? Seems like it handles images like a champ, and those needle-in-a-haystack benchmarks are wild. And it's open?! Wow, very, very impressive!


Why JAX over PyTorch for a video/text model?


Easier to scale to a distributed TPU cluster, I think?


Surprised this beat a GPT-4 equivalent when Google had already announced. Maybe they were too busy with Sora to care about 1M context right now.


Presumably they had Sora almost ready to go, and the LLM team didn't have anything they could publish with a few hours' notice.


How is the needle-in-a-haystack test on this thing?



Wow, damn, that's crazy impressive.


[flagged]


This is an ad.



