
That's not what multi-head attention means. Multi-head attention uses learned projection operators to perform attention within multiple lower-dimensional subspaces of the network's embedding space, rather than a single attention operation in the full embedding space. E.g. projecting 10 512-D vectors into 8 sets of 10 64-D vectors (80 projections in total), attending separately within each set, then concatenating the per-head results to reform 10 512-D output vectors.
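A rough NumPy sketch of that shape bookkeeping (names and random weights are mine, not from any library; real implementations fuse the per-head projections into single large matrix multiplies and usually follow the concatenation with an output projection):

    import numpy as np

    seq_len, d_model, num_heads = 10, 512, 8
    head_dim = d_model // num_heads                      # 64

    x = np.random.randn(seq_len, d_model)                # 10 input vectors, 512-D each

    # Learned projections (random stand-ins here), one Q/K/V triple per head.
    Wq = np.random.randn(num_heads, d_model, head_dim) / np.sqrt(d_model)
    Wk = np.random.randn(num_heads, d_model, head_dim) / np.sqrt(d_model)
    Wv = np.random.randn(num_heads, d_model, head_dim) / np.sqrt(d_model)

    q = np.einsum('td,hdk->htk', x, Wq)                  # (8, 10, 64): 80 64-D vectors
    k = np.einsum('td,hdk->htk', x, Wk)
    v = np.einsum('td,hdk->htk', x, Wv)

    # Attend separately within each 64-D subspace.
    scores = np.einsum('htk,hsk->hts', q, k) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)            # softmax over the keys
    heads = np.einsum('hts,hsk->htk', weights, v)        # (8, 10, 64)

    # Concatenate the per-head results back into 10 512-D outputs.
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model)   # (10, 512)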

In fact the projection operations are the only learned part of a Transformer's self-attention function -- the rest of self-attention is just a weighted sum of the (projected) value vectors, where the weights come from a softmax over the scaled query-key dot products.
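In equation form (the standard scaled dot-product attention from the Transformer paper), for each head:

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where Q = X W_q, K = X W_k, V = X W_v are the learned projections of the inputs X, and the softmax term is the weighting described above.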



How is that different from what I said?



