It's my understanding that in regular, non-sliding-window models the LLM is able to pay attention to any part of the input when generating the output. Each attention head can essentially jump back and forth to any point in the context window. This is what differentiates the attention mechanism from models that rely on token proximity as a proxy for relevance.
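
To make the contrast concrete, here's a minimal sketch (in PyTorch, not taken from any particular model's implementation) of the two masking schemes: a full causal mask lets every query position see every earlier token, while a sliding-window mask of size `window` only exposes the most recent tokens.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Full causal attention: every query can attend to any earlier token,
    # no matter how far back it is in the context window.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Sliding-window attention: each query only sees the last `window` tokens.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    return (j <= i) & (i - j < window)

def masked_attention(q, k, v, mask):
    # Scaled dot-product attention; positions outside `mask` are excluded
    # by setting their scores to -inf before the softmax.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# With the full causal mask, token 7 can still attend to token 0;
# with a window of 3, it only sees tokens 5, 6, and 7.
seq_len, dim = 8, 16
q = k = v = torch.randn(seq_len, dim)
full_out = masked_attention(q, k, v, causal_mask(seq_len))
windowed_out = masked_attention(q, k, v, sliding_window_mask(seq_len, window=3))
print(sliding_window_mask(seq_len, 3).int())
```

In the full-attention case the relevance of a distant token is decided entirely by its learned query-key score, not by how far away it sits, which is the point being made above.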