> We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32 tensor parallelism, and 4 sequence parallelism (ring dimension).
Each TPU v4 chip has 32 GiB of HBM, so that's about 4 TiB (128 x 32 GiB) of memory total in fp16, without quantization.
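Sanity-checking the arithmetic (the per-chip HBM figure is the published TPU v4 spec; the 32 x 4 mesh split is from the quoted passage):

```python
# Back-of-the-envelope memory math for the quoted v4-128 configuration.
tensor_parallel = 32       # tensor-parallel shards (from the quoted mesh)
ring_parallel = 4          # sequence-parallel / ring dimension
chips = tensor_parallel * ring_parallel    # 128 TPU v4 chips
hbm_per_chip_gib = 32                      # HBM per TPU v4 chip

total_hbm_gib = chips * hbm_per_chip_gib
print(f"{chips} chips x {hbm_per_chip_gib} GiB = {total_hbm_gib} GiB "
      f"(~{total_hbm_gib / 1024:.0f} TiB of HBM)")
# -> 128 chips x 32 GiB = 4096 GiB (~4 TiB of HBM)
```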
So, uh, they thought: "we have this datacenter full of TPUs lying around, you know what would be best? If we made a model that needs the entire thing just for inference."
Brute-forcing a quadratic-complexity problem seems like wastefulness at its worst.
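To put rough numbers on the quadratic term, assuming 7B-class dimensions (32 layers, d_model = 4096; these are assumptions, not the paper's exact config):

```python
# Rough cost of full self-attention at 1M tokens, under assumed 7B-class dims.
n = 1_000_000        # sequence length
d_model = 4096       # assumed model width
layers = 32          # assumed layer count

# The n x n score matrix, if it were ever materialized:
score_tb_fp16 = n * n * 2 / 1e12
print(f"score matrix: ~{score_tb_fp16:.0f} TB in fp16, per head, per layer")

# FLOPs still scale with n^2 even when the matrix is never materialized:
flops_per_layer = 4 * n * n * d_model      # QK^T plus attention-weighted V
total_flops = flops_per_layer * layers
print(f"~{total_flops:.1e} FLOPs of attention per 1M-token forward pass")
```

RingAttention never materializes that matrix (it computes attention blockwise while passing KV chunks around the ring of devices), but the FLOP count still grows with n^2.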
> Scaling Inference: We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32 tensor parallelism, and 4 sequence parallelism (ring dimension). We perform inference in pure single precision, where additional improvements can be made through techniques in scalability such as quantization.
There are variants with shorter context lengths that probably need far less VRAM. The model itself is only about 32 GB, so in a Q5_K_M quant it wouldn't be that bad, and the shorter-context versions might actually be workable on higher-end workstations with llama.cpp-style cleverness. A back-of-the-napkin guess puts the 32K-context version at maybe around 60 GB.
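For a rough sanity check, here's the kind of napkin math I'd do, assuming LLaMA-2-7B-style dimensions (32 layers, 32 KV heads of size 128) and an fp16 KV cache; these dims are assumptions, not confirmed numbers for this model:

```python
# Rough VRAM estimate for a 7B-class model at 32K context (assumed dims).
layers = 32
kv_heads = 32            # assuming full multi-head KV, no GQA
head_dim = 128
seq_len = 32_768
bytes_per_value = 2      # fp16 KV cache

kv_cache_gb = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e9
weights_q5_gb = 7e9 * 5.5 / 8 / 1e9    # ~5.5 bits/weight for a Q5_K_M-style quant
print(f"KV cache: ~{kv_cache_gb:.0f} GB, quantized weights: ~{weights_q5_gb:.0f} GB")
# -> KV cache: ~17 GB, quantized weights: ~5 GB
```

That ignores activations and any overhead from the paper's pure single-precision inference setup, so treat it as a lower bound.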
I would be curious to know if anyone has tried a hybrid approach where a Mamba-like architecture handles longer-term recall, combined with a transformer for short-term memory.
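Just to make the question concrete, here's a toy sketch of one block that wires a gated linear recurrence (a stand-in for a Mamba-style long-range path, not the real selective scan) together with windowed attention for short-range memory. All names and dimensions are invented for illustration:

```python
import torch
import torch.nn as nn

class ToyLinearRecurrence(nn.Module):
    """Per-channel gated linear recurrence: h_t = a_t * h_{t-1} + b_t * x_t."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 2 * dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        a, b = self.gate(x).chunk(2, dim=-1)
        a, b = torch.sigmoid(a), torch.sigmoid(b)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.shape[1]):              # sequential scan (slow but clear)
            h = a[:, t] * h + b[:, t] * x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

class ToyHybridBlock(nn.Module):
    """Long-range recurrent path plus short-range local attention, summed."""
    def __init__(self, dim, heads=4, window=64):
        super().__init__()
        self.window = window
        self.long_path = ToyLinearRecurrence(dim)
        self.short_path = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        seq = x.shape[1]
        # Causal mask restricted to a local window for the attention path.
        idx = torch.arange(seq)
        dist = idx[None, :] - idx[:, None]       # j - i
        mask = (dist > 0) | (dist < -self.window)  # block future and far past
        attn_out, _ = self.short_path(x, x, x, attn_mask=mask)
        return self.norm(x + self.long_path(x) + attn_out)

x = torch.randn(2, 256, 128)                     # (batch, seq, dim)
print(ToyHybridBlock(128)(x).shape)              # torch.Size([2, 256, 128])
```

A real version would swap the toy recurrence for an actual selective-scan SSM layer; this just shows how the two paths could be combined in one block.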
Can you feed PDFs into this? It seems to handle images like a champ, and those needle-in-a-haystack benchmarks are wild. And it's open?! Wow, very very impressive!!
World model on million-length video and language with RingAttention - https://news.ycombinator.com/item?id=39367141 - Feb 2024 (58 comments)