> **Scaling Inference.** We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32 tensor parallelism, and 4 sequence parallelism (ring dimension). We perform inference in pure single precision, where additional improvements can be made through techniques in scalability such as quantization.
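
To picture the mesh the quote describes, here is a minimal JAX sketch of a 4 x 32 device layout (4-way sequence parallelism on the ring dimension, 32-way tensor parallelism). The axis names `sp`/`tp` and the activation sharding spec are my own assumptions for illustration, not taken from the paper's code.

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Arrange the available devices into the 4 x 32 layout described above:
# 4-way sequence parallelism (the ring dimension) x 32-way tensor parallelism.
devices = np.array(jax.devices()).reshape(4, 32)
mesh = Mesh(devices, axis_names=("sp", "tp"))

# Shard (batch, seq, hidden) activations: the sequence axis is split around
# the ring, the hidden axis across tensor-parallel ranks.
activation_sharding = NamedSharding(mesh, PartitionSpec(None, "sp", "tp"))
```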
There are variants with shorter attention windows that probably have much smaller VRAM requirements. The model itself is only about 32 GB, so in a Q5_K_M quant it wouldn't be that bad, and the smaller-context versions might actually be workable on higher-end workstations with llama.cpp-style cleverness. A back-of-the-napkin guess puts it at maybe around 60 GB for the 32K-context version.
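
To make the napkin math explicit, here is a small Python sketch of the two terms that dominate memory: weight storage and the KV cache. The LLaMA-2-7B-style shapes (32 layers, 32 KV heads, head dim 128) and the ~5.5 bits/weight figure for a Q5_K_M-style quant are assumptions for illustration, not numbers from the paper.

```python
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=4):
    """Keys + values: one vector per layer, per KV head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

def weights_gib(n_params=7e9, bits_per_weight=32):
    """Raw parameter storage at a given precision."""
    return n_params * bits_per_weight / 8 / 2**30

seq = 32 * 1024
# fp32 weights + fp32 cache at 32K context: ~26 GiB + 32 GiB
print(weights_gib(), kv_cache_gib(seq))
# Q5_K_M-style weights (~5.5 bits) + fp16 cache: ~4.5 GiB + 16 GiB
print(weights_gib(bits_per_weight=5.5), kv_cache_gib(seq, bytes_per_elem=2))
```

Under these assumptions, full single precision lands in the same ~60 GB ballpark as the guess above, while quantized weights plus a half-precision cache would bring the 32K-context case down to roughly 20 GiB.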