They are using RingAttention that scales self attention computation linearly by ...

		xiphias2 on Feb 16, 2024 \| parent \| context \| favorite \| on: LWM – Open LLM with 1M Tokens Context Window They are using RingAttention that scales self attention computation linearly by number of devices by passing KV result blocks in a ring: https://arxiv.org/abs/2310.01889 (Submitted on 3 Oct 2023)