One of the most frustrating things about all the documentation on Transformers is its near-exclusive focus on NLP.
In particular, one of the most interesting parts of the Transformer architecture to me is the attention mechanism, which is inherently permutation invariant (positional embeddings exist precisely to counteract this property). The ability to arbitrarily mask individual nodes in the graph -- or even individual edges -- also gives the whole thing enormous flexibility for encoding domain knowledge into your architecture, as the sketch below illustrates.
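To make that concrete, here is a minimal sketch in PyTorch of single-head, unbatched scaled dot-product attention; the `attention` helper and the boolean mask convention are my own illustrative choices, not any particular library's API. (Strictly speaking, the operation is permutation *equivariant* over the token axis: permuting the inputs permutes the outputs the same way.)

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # Plain scaled dot-product attention. `mask` is an (n, n) boolean
    # matrix where mask[i, j] = True means query i may attend to key j,
    # i.e. the edge i -> j exists in the attention graph.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

n, d = 5, 8
x = torch.randn(n, d)

# No positional embeddings: permuting the tokens just permutes the
# outputs the same way, so the layer itself has no notion of order.
perm = torch.randperm(n)
out = attention(x, x, x)
out_perm = attention(x[perm], x[perm], x[perm])
assert torch.allclose(out[perm], out_perm, atol=1e-5)

# Edge-level masking: delete the single edge from query 0 to key 3
# while leaving every other connection intact.
mask = torch.ones(n, n, dtype=torch.bool)
mask[0, 3] = False
out_masked = attention(x, x, x, mask)
```

Viewed this way, the causal mask of a language model is just one special case, `torch.tril(torch.ones(n, n, dtype=torch.bool))`; nothing stops you from wiring up any DAG or bipartite structure your domain actually calls for.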
Positional embeddings may still be required in many cases, but you can be clever about them and move beyond the overly restrictive view of attention-layer inputs as purely one-dimensional sequences.
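For instance, if your tokens are patches of an image, nothing forces you to number them along a flattened raster order; you can give each patch an embedding built from its row and column instead. Here is a minimal sketch of one such scheme -- fixed 2D sinusoidal embeddings that split the channels between the two axes. The function name and the exact split are illustrative choices of mine, similar in spirit to what some vision Transformers do:

```python
import math
import torch

def sinusoidal_positions_2d(h, w, d):
    # Fixed 2D sinusoidal embeddings for an h x w grid of patches:
    # half the channels encode the row index, half the column index.
    assert d % 4 == 0, "need an even number of sin/cos pairs per axis"

    def encode(pos, dims):
        # Classic sinusoidal encoding along one axis.
        freqs = torch.exp(-torch.arange(0, dims, 2).float()
                          * (math.log(10000.0) / dims))
        angles = pos[:, None] * freqs[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    rows = encode(torch.arange(h).float(), d // 2)   # (h, d/2)
    cols = encode(torch.arange(w).float(), d // 2)   # (w, d/2)
    grid = torch.cat([rows[:, None, :].expand(h, w, d // 2),
                      cols[None, :, :].expand(h, w, d // 2)], dim=-1)
    return grid.reshape(h * w, d)                    # one vector per patch

pos = sinusoidal_positions_2d(4, 4, 32)  # 16 patches, model dim 32
# Added to the patch embeddings, this tells attention *where* each patch
# sits on the 2D grid without pretending the image is a 1D sequence.
```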