You literally just shift the window over by one token once you hit the maximum number of tokens you want in the context window. That maximum is NOT tied to the length you trained on (it's only limited by memory now).
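A minimal sketch of that sliding window, assuming tokens are plain integers in a Python list. The name `slide_window` and the parameter `max_context` are made up for illustration and don't come from any particular library.

```python
def slide_window(tokens, new_token, max_context):
    """Append new_token, then keep only the most recent max_context tokens.

    Hypothetical helper: max_context must be a positive integer.
    """
    if max_context <= 0:
        raise ValueError("max_context must be positive")
    tokens = tokens + [new_token]
    # Anything older than the last max_context tokens is no longer visible
    # to the model on the next step.
    return tokens[-max_context:]


window = []
for token in range(6):
    window = slide_window(window, token, max_context=4)
```

After six steps with a window of four, the two oldest tokens have been dropped, which is exactly the information loss the next comment complains about.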
This has obvious issues, since you're losing information from the tokens that have slid out of view, and that becomes significant if your context window is small compared to the question/answer you're looking at. That's why companies try to offer stupidly large context windows. The catch is that they're not training on the large context window, they're training on something smaller (2048 and above). Because of how attention is set up, you can train on a small context and, in principle, extrapolate to far more tokens, since they use RoPE, which encodes position through the offset between a token and its neighbors rather than through its absolute index. That lets us effectively x2, x3, x10, x100 the number of tokens we generate versus what we trained with, with some degree of consistency, BUT it still causes a lot of consistency problems: the model ends up in more of a "this was trained on snippets, not the entire thing" situation, where it has a notion of the context but not fundamentally of the entire combined context.
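To make the "offset to the neighboring words" point concrete, here is a rough pure-Python sketch of the rotary idea, assuming the usual formulation where consecutive pairs of dimensions are rotated by an angle proportional to the position. The function names, the toy vectors, and the base of 10000 are illustrative assumptions, not taken from any specific model.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle that grows with pos."""
    dim = len(vec)  # assumed to be even
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]   # toy query vector
k = [1.1, 0.4, -0.6, 0.9]   # toy key vector

# Same offset of 3 between query and key, at very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 5005), rope(k, 5002))
```

`near` and `far` agree up to floating-point error, because the attention score only sees the offset between the two positions. That is the property that makes running past the training length possible at all, though it says nothing about how well the model handles offsets it never saw during training.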
That's a very basic way to keep the LLM inferring past the context window size (there are better, smarter ways), but that's not at all what the question was, which is how they train a 2M-token window. My understanding, at a basic level, is that you need training documents that are >2M tokens long, and that's where the problem comes in: there's only so much long-form content, and it's swamped by all the shorter stuff. I think there are probably tricks now, but I suspect it's still largely an open problem.
AFAIK nobody does that. They train on much, much shorter text but use tricks in the position-encoding step that let the LLM extrapolate to longer inputs. Like RoPE and YaRN etc.
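One of the simpler tricks in this family is plain linear position interpolation: squeeze the longer range of positions back into the range seen during training. The sketch below shows only that simple variant, and it is not YaRN, which as far as I understand treats different frequency bands differently. The function name and the lengths 2048 and 8192 are hypothetical.

```python
def interpolate_position(pos, train_len, target_len):
    """Map a position in [0, target_len) into the trained range [0, train_len)."""
    if target_len <= train_len:
        return float(pos)  # already inside the trained range, leave it alone
    return pos * train_len / target_len


train_len = 2048    # hypothetical training context length
target_len = 8192   # hypothetical inference context length
scaled = [interpolate_position(p, train_len, target_len) for p in (0, 4096, 8191)]
```

The scaled (now fractional) position would then be used in place of the raw index when computing the rotary angles, so the model never sees an angle outside the range it was trained on. The trade-off is that neighboring tokens end up closer together in position space than they were during training.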
AFAIK (which is not much), it definitely helps to train on longer sequences even with RoPE/YaRN, and it's needed if you care about long-context performance (and not just long-context capability).