Crossformer seems to produce a very large number of tokens, since it treats each patch as a token. The author of this paper argues that one token per variable is sufficient, and that attention is then a natural way to describe the overall relationships among these individual entities.
https://openreview.net/forum?id=vSVLM2j9eie
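To make the token-count difference concrete, here is a rough sketch comparing the two tokenization schemes. The function names and the example sizes (7 variables, 96 time steps, patch length 12) are my own assumptions for illustration, not values from either paper. It only counts tokens and does not model how either architecture actually computes attention.

```python
import math


def patch_token_count(num_variables: int, seq_len: int, patch_len: int) -> int:
    """Tokens when each variable's series is cut into patches of patch_len.

    Assumes a partial final patch still counts as one token.
    """
    return num_variables * math.ceil(seq_len / patch_len)


def variate_token_count(num_variables: int) -> int:
    """Tokens when each variable's whole series becomes a single token."""
    return num_variables


# Hypothetical sizes, chosen only for illustration.
num_variables, seq_len, patch_len = 7, 96, 12

patch_tokens = patch_token_count(num_variables, seq_len, patch_len)
variate_tokens = variate_token_count(num_variables)

print(patch_tokens, variate_tokens)  # prints: 56 7
```

With one token per variable, the token count no longer depends on the sequence length or the patch length, only on the number of variables.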