I always thought the query/key/value analogy was confusing and unnecessary. That tired analogy is why I don’t think Attention is All You Need is a particularly good paper. The BERT paper is much more readable.
If you actually look at what a self-attention head does, it's much easier to understand and really not that complicated.
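Concretely, here's a minimal sketch of one self-attention head in plain NumPy (the variable names, the explicit scaling, and the lack of masking are my own simplifications, not anyone's reference implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(X, W_q, W_k, W_v):
    """One self-attention head over a sequence X of shape (seq_len, d_model).

    W_q, W_k, W_v are three learned linear projections of the same input,
    each mapping d_model -> d_head.
    """
    Q = X @ W_q                           # (seq_len, d_head)
    K = X @ W_k                           # (seq_len, d_head)
    V = X @ W_v                           # (seq_len, d_head)
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)    # (seq_len, seq_len) pairwise similarities
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # each position gets a weighted mix of V rows
```

That's the whole thing: three linear projections of the same sequence, a scaled dot product, a softmax, and a weighted sum.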
Once you get self-attention, multi-head attention is just doing that N times in parallel over the same sequence.
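As a rough sketch (reusing the `self_attention_head` function above; the parameter layout here is just one way to wire it up), the heads run independently and their outputs get concatenated and projected back:

```python
def multi_head_attention(X, heads, W_o):
    """Run N independent heads over the same sequence X and mix the results.

    `heads` is a list of (W_q, W_k, W_v) triples, one per head; W_o projects the
    concatenated head outputs (n_heads * d_head) back to d_model.
    """
    outputs = [self_attention_head(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_o    # (seq_len, d_model)

# Tiny usage example with random weights, just to show the shapes line up.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, heads, W_o)            # shape (5, 16)
```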