
Probably never gonna run on normal boxes; it uses RingAttention, which only makes sense when you have many GPUs. It's a datacenter-only model.
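For anyone curious what RingAttention actually does: each device keeps its own query block while the key/value blocks get passed around a ring of devices, and the partial attention results are merged with a streaming softmax, so no single device ever holds the full sequence's KV. A toy single-process illustration in numpy (just a sketch of the idea, not the paper's JAX/TPU implementation; block counts and shapes are made up):

    import numpy as np

    def ring_attention_sim(q, k, v, n_blocks):
        """Simulate the ring: each "device" keeps one query block, the K/V
        blocks rotate around the ring, and partial results are merged with a
        streaming (online) softmax so no device ever sees the whole KV."""
        seq, d = q.shape
        qs = np.split(q, n_blocks)
        ks = np.split(k, n_blocks)
        vs = np.split(v, n_blocks)
        outs = []
        for i in range(n_blocks):                    # "device" i
            acc = np.zeros_like(qs[i])               # running weighted sum of V
            row_max = np.full(len(qs[i]), -np.inf)   # running max, for stability
            denom = np.zeros(len(qs[i]))             # running softmax denominator
            for step in range(n_blocks):             # KV block currently at device i
                j = (i + step) % n_blocks
                scores = qs[i] @ ks[j].T / np.sqrt(d)
                new_max = np.maximum(row_max, scores.max(axis=-1))
                rescale = np.exp(row_max - new_max)  # rescale earlier partials
                p = np.exp(scores - new_max[:, None])
                denom = denom * rescale + p.sum(axis=-1)
                acc = acc * rescale[:, None] + p @ vs[j]
                row_max = new_max
            outs.append(acc / denom[:, None])
        return np.concatenate(outs)

    # Sanity check against ordinary full (unmasked) attention.
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(64, 16)) for _ in range(3))
    scores = q @ k.T / np.sqrt(16)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    ref = (w / w.sum(axis=-1, keepdims=True)) @ v
    assert np.allclose(ring_attention_sim(q, k, v, n_blocks=4), ref)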


I’d still like to know the requirements. You can rent A100s or have access through work, etc.


> Scaling Inference: We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32 tensor parallelism, and 4 sequence parallelism (ring dimension). We perform inference in pure single precision, where additional improvements can be made through techniques in scalability such as quantization.
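For a sense of scale, that sharding multiplies out to the whole slice (assuming the 128 in "v4-128" is the number of devices the mesh is laid out over):

    # Quick check of the quoted mesh (assumption: "v4-128" ~ a 128-way device mesh).
    tensor_parallel = 32   # weights sharded 32 ways
    ring_dim = 4           # sequence sharded 4 ways for RingAttention decoding
    print(tensor_parallel * ring_dim)   # -> 128, i.e. the stated minimum slice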


It might even fit in a max-sized MacBook Pro.


Maybe in a few years with 4-bit quantisation (around 1 TiB is needed for that).
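That figure is at least in the right ballpark if you assume it's dominated by the KV cache: the paper says inference runs in pure single precision, and at fp32 a LLaMA-2-7B-shaped model (32 layers, 4096 hidden — an assumption about LWM on my part) needs about 1 MiB of cache per token, so a million-token context is on the order of a terabyte before you even count the weights:

    # Back-of-envelope KV-cache size; dims assume LLaMA-2-7B, not verified
    # against the released LWM checkpoints.
    n_layers, hidden = 32, 4096
    fp32 = 4                                  # bytes, per the paper's single precision
    per_token = 2 * n_layers * hidden * fp32  # K and V for every layer
    print(per_token)                          # 1,048,576 bytes = 1 MiB per token
    print(per_token * 1_000_000 / 2**40)      # ~0.95 TiB for a 1M-token context

4-bit weights for a 7B model are only ~3.5 GB, so at this context length it's the cache, not the weights, that hurts.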


There are variants with shorter context lengths that are probably much less demanding on VRAM. The model itself is only about 32 GB, so in a Q5_K_M quant it wouldn’t be that bad, and the shorter-context versions might actually be workable on higher-end workstations with llama.cpp-style cleverness. A back-of-the-napkin guess puts the 32K-context version at maybe around 60 GB.
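One way to land near that number, keeping everything in single precision like the paper does (the dimensions below are assumed to be LLaMA-2-7B-like, so this is napkin math, not a measurement):

    # Napkin math for a 32K-context variant, all fp32; ~7B params and
    # LLaMA-2-7B dims are assumptions.
    weights_gb = 7e9 * 4 / 1e9                     # ~28 GB of fp32 weights
    n_layers, hidden, ctx = 32, 4096, 32_768
    kv_gb = 2 * n_layers * hidden * 4 * ctx / 1e9  # ~34 GB of fp32 KV cache
    print(weights_gb + kv_gb)                      # ~62 GB total

With Q5_K_M weights (~5 GB) and an fp16 cache it would come in well under that, so the final number depends mostly on how much of this you're willing to quantize.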


From https://arxiv.org/pdf/2402.08268.pdf

> We trained our models using TPUv4-1024, which is approximately equivalent to 450 A100s

> Inference for such long sequences requires a minimum of v4-128

So you'll need ~60 A100s for inference.
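Scaling the 450-A100 figure down proportionally with slice size (a rough assumption, but it's the only number the paper gives):

    # v4-1024 ~ 450 A100s per the paper, so a v4-128 slice is roughly:
    print(450 * 128 / 1024)   # ~56 A100-equivalents, i.e. ~60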



