Is this the reason for the limited token windows?

Yes, kinda. The transformer doesn't have a mechanism for dynamically adjusting its input size, so you need to strike a balance between a window that's big enough for practical purposes and one that's small enough that you can still train the network.
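To make the tradeoff concrete, here's a minimal NumPy sketch of scaled dot-product attention (the sizes are made up, and this is a bare-bones illustration rather than any particular library's implementation): the score matrix is seq_len x seq_len, so widening the window makes the attention cost grow quadratically, at every layer and every head.

```python
# Minimal sketch of scaled dot-product attention, to show where the
# window-size cost comes from: the score matrix is seq_len x seq_len,
# so memory and compute grow quadratically with the context window.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # shape: (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V

seq_len, d_model = 512, 64                               # toy sizes, chosen arbitrarily
Q = K = V = np.random.randn(seq_len, d_model)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)          # (512, 64)
print(seq_len * seq_len)  # 262144 attention scores for a single head in a single layer
```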
Earlier networks built on RNNs could in theory take inputs of arbitrary length, but in practice their performance dropped as the input got longer, because they "forgot" the earlier input as they went on. The paper "Neural Machine Translation by Jointly Learning to Align and Translate" solved the forgetting problem by, you guessed it, adding attention to the model.
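Roughly: instead of cramming the whole source sentence into one fixed-size vector, the decoder scores every encoder hidden state at each step and takes a weighted sum, so early tokens stay reachable. A toy sketch of that additive (Bahdanau-style) scoring, with made-up dimensions and random weights rather than the paper's exact parameterization:

```python
# Toy sketch of additive (Bahdanau-style) attention: the decoder scores
# every encoder hidden state against its own state, so early inputs stay
# reachable instead of fading from a single compressed vector.
# Dimensions and weight matrices here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
src_len, hidden = 10, 32
encoder_states = rng.normal(size=(src_len, hidden))  # one vector per source token
decoder_state = rng.normal(size=(hidden,))           # current decoder hidden state

W_enc = rng.normal(size=(hidden, hidden))
W_dec = rng.normal(size=(hidden, hidden))
v = rng.normal(size=(hidden,))

# Score each encoder state: v^T tanh(W_enc h_j + W_dec s_t)
scores = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec) @ v
weights = np.exp(scores - scores.max())
weights /= weights.sum()                             # softmax over source positions

context = weights @ encoder_states                   # weighted sum fed to the decoder
print(weights.round(2), context.shape)
```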
Eventually people realized that attention was all you needed (ha!), removed the RNN, and here we are.