
Indeed. Beyond the model weights themselves, almost all of the inference memory overhead comes from the attention matrices and the KV cache, which grows linearly with context length and batch size.
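A quick back-of-the-envelope sketch of why the KV cache dominates, assuming Llama-2-7B-style dimensions (32 layers, 32 heads, head_dim 128, fp16) as an illustrative example:

    def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
        # 2x for the K and V tensors, cached at every layer for every token.
        return 2 * n_layers * n_heads * head_dim * dtype_bytes * seq_len

    print(kv_cache_bytes(1))      # 524,288 bytes ~= 0.5 MiB per token
    print(kv_cache_bytes(4096))   # ~2 GiB for a single 4096-token sequence

So a single full-context sequence at these dimensions costs roughly 2 GiB on top of the weights, and that cost multiplies with batch size.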

