> Scaling Inference: We additionally scale our inference code to support million-length sequences by implementing RingAttention for decoding. Inference for such long sequences requires a minimum of a v4-128 TPU slice, with the mesh sharded 32-way for tensor parallelism and 4-way for sequence parallelism (the ring dimension). We perform inference in pure single precision; additional improvements are possible through scalability techniques such as quantization.
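
For context, here is a minimal JAX sketch (not the authors' code) of the mesh layout the quote describes, assuming 128 addressable devices arranged as 32 tensor-parallel x 4 ring (sequence-parallel) shards; the axis names and the KV-cache partition spec are illustrative assumptions:

    # Minimal sketch of the 32x4 device mesh described above
    # (assumption: 128 addressable JAX devices on the v4-128 slice).
    import numpy as np
    import jax
    from jax.sharding import Mesh, NamedSharding, PartitionSpec

    devices = np.array(jax.devices()).reshape(32, 4)
    mesh = Mesh(devices, axis_names=("tensor", "ring"))

    # Weights shard along "tensor"; the long sequence splits into 4
    # blocks along "ring", which RingAttention rotates between
    # neighboring devices during decoding. The (seq, batch, heads)
    # layout here is a hypothetical KV-cache shape, for illustration.
    kv_sharding = NamedSharding(mesh, PartitionSpec("ring", None, "tensor"))

The ring dimension stays small (4) because each ring step overlaps communication with compute; tensor parallelism absorbs the rest of the slice.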

