It's not. It uses locality-sensitive hashing to reduce attention complexity from O(n^2) to O(n log n), which lets it match in 16GB the performance of the best model that would otherwise need about 100GB. Nobody scaled it up to 1000 GPUs because its purpose was the opposite: doing the same work on less hardware.
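
To make the mechanism concrete, here's a rough numpy sketch of the idea (not the actual implementation; the function and parameter names are mine): tokens are hashed into buckets with random projections, and softmax attention runs only within each bucket, which is where the sub-quadratic cost comes from. For simplicity it reuses the same tensor for queries and keys.

```python
import numpy as np

def lsh_attention(q, k, v, n_hashes=4, seed=0):
    """Toy LSH attention: bucket tokens by random-projection hashes,
    then run full attention only within each bucket, so the cost scales
    with the bucket sizes rather than with n^2."""
    rng = np.random.default_rng(seed)
    n, d = q.shape
    # Random hyperplanes assign each token an integer bucket id (angular LSH).
    planes = rng.normal(size=(d, n_hashes))
    buckets = (q @ planes > 0) @ (2 ** np.arange(n_hashes))
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        # Ordinary softmax attention, restricted to tokens sharing this bucket.
        scores = (q[idx] @ k[idx].T) / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

# 256 tokens, 32-dim heads; memory grows with bucket sizes, not 256^2.
x = np.random.randn(256, 32)
print(lsh_attention(x, x, x).shape)  # (256, 32)
```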