It's not. It uses locality-sensitive hashing to reduce attention complexity from O(n^2) to O(n log n), which lets it match in 16GB the performance of the best model that would otherwise need about 100GB. Nobody scaled it up to 1000 GPUs because its purpose was the opposite: doing the same work on less hardware.
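
To make the mechanism concrete, here's a rough numpy sketch of the idea (not the actual implementation; the function and parameter names are mine): tokens are hashed into buckets with random projections, and softmax attention runs only within each bucket, which is where the sub-quadratic cost comes from. For simplicity it reuses the same tensor for queries and keys.

```python
import numpy as np

def lsh_attention(q, k, v, n_hashes=4, seed=0):
    """Toy LSH attention: bucket tokens by random-projection hashes,
    then run full attention only within each bucket, so the cost scales
    with the bucket sizes rather than with n^2."""
    rng = np.random.default_rng(seed)
    n, d = q.shape
    # Random hyperplanes assign each token an integer bucket id (angular LSH).
    planes = rng.normal(size=(d, n_hashes))
    buckets = (q @ planes > 0) @ (2 ** np.arange(n_hashes))
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        # Ordinary softmax attention, restricted to tokens sharing this bucket.
        scores = (q[idx] @ k[idx].T) / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

# 256 tokens, 32-dim heads; memory grows with bucket sizes, not 256^2.
x = np.random.randn(256, 32)
print(lsh_attention(x, x, x).shape)  # (256, 32)
```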