> **Scaling Inference.** We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32 tensor parallelism, and 4 sequence parallelism (ring dimension). We perform inference in pure single precision, where additional improvements can be made through techniques in scalability such as quantization.
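
To picture the mesh the quote describes, here is a minimal JAX sketch of a 4 x 32 device layout (4-way sequence parallelism on the ring dimension, 32-way tensor parallelism). The axis names `sp`/`tp` and the activation sharding spec are my own assumptions for illustration, not taken from the paper's code.

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Arrange the available devices into the 4 x 32 layout described above:
# 4-way sequence parallelism (the ring dimension) x 32-way tensor parallelism.
devices = np.array(jax.devices()).reshape(4, 32)
mesh = Mesh(devices, axis_names=("sp", "tp"))

# Shard (batch, seq, hidden) activations: the sequence axis is split around
# the ring, the hidden axis across tensor-parallel ranks.
activation_sharding = NamedSharding(mesh, PartitionSpec(None, "sp", "tp"))
```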
There are variants with shorter attention windows that probably have much smaller VRAM requirements. The model itself is only about 32 GB, so in a Q5_K_M quant it wouldn't be that bad, and the smaller-context versions might actually be workable on higher-end workstations with llama.cpp-style cleverness. A back-of-the-napkin guess puts it at maybe around 60 GB for the 32K-context version.
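
To make the napkin math explicit, here is a small Python sketch of the two terms that dominate memory: weight storage and the KV cache. The LLaMA-2-7B-style shapes (32 layers, 32 KV heads, head dim 128) and the ~5.5 bits/weight figure for a Q5_K_M-style quant are assumptions for illustration, not numbers from the paper.

```python
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=4):
    """Keys + values: one vector per layer, per KV head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

def weights_gib(n_params=7e9, bits_per_weight=32):
    """Raw parameter storage at a given precision."""
    return n_params * bits_per_weight / 8 / 2**30

seq = 32 * 1024
# fp32 weights + fp32 cache at 32K context: ~26 GiB + 32 GiB
print(weights_gib(), kv_cache_gib(seq))
# Q5_K_M-style weights (~5.5 bits) + fp16 cache: ~4.5 GiB + 16 GiB
print(weights_gib(bits_per_weight=5.5), kv_cache_gib(seq, bytes_per_elem=2))
```

Under these assumptions, full single precision lands in the same ~60 GB ballpark as the guess above, while quantized weights plus a half-precision cache would bring the 32K-context case down to roughly 20 GiB.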