Not quite ELI5, and there are already a few partially overlapping answers around, but here goes.
The key part is the attention mechanism, as the title of the paper may have spoiled. It works more or less like this:
- Start with an input sequence X1, X2 ... Xn. These are all vectors.
- Map the input sequence X into 3 new sequences of vectors: query (Q), key (K), and value (V), all of the same length as the input X. This is done with a learnable mapping for each sequence (one for X->Q, another for X->K, and one for X->V).
- Compare the similarity of every query with every key (in the paper this is a scaled dot product, run through a softmax so the weights for each query sum to 1). This gives you a weight for each query/key pair. Call them W(Q1,K2) and so forth.
- Compute the output Z as the sum of every _value_ weighted by the weight for the respective query/key pair (so Z1 = V1*W(Q1,K1) + V2*W(Q1,K2) + ... + Vn*W(Q1,Kn), Z2 = V1*W(Q2,K1) + V2*W(Q2,K2) + ... + Vn*W(Q2,Kn), and so on)
- and that's about it! (There's a rough numpy sketch of this just below.)
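To make those steps concrete, here's a minimal numpy sketch of a single attention head. The function and variable names and the sizes are just my own illustrative choices; the scaling and softmax are the "scaled dot-product attention" from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """One attention head. X is (n, d_model); Wq/Wk/Wv are learned (d_model, d_k) matrices."""
    Q = X @ Wq   # queries
    K = X @ Wk   # keys
    V = X @ Wv   # values
    # similarity of every query with every key, scaled and softmaxed as in the paper
    W = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (n, n) weights
    # each output Z_i is a weighted sum of all the values
    return W @ V                                  # (n, d_k)

# toy usage: 5 input vectors of size 16, head size 8 (sizes picked arbitrarily)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
Z = attention(X, Wq, Wk, Wv)                      # (5, 8)
```

In a real model the W matrices are learned by backpropagation, of course; here they're random just so the snippet runs end to end.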
As throwawaymaths mentions, this is quite similar to a learnable hash table, with the notable difference that the fetched value is itself learned, so it doesn't fetch "the input at index i" but "whatever is important at index i".
Now a few implementation details on top of this:
- The description is for a single "attention head". Normally several heads, each with its own Q/K/V mappings, are used so the transformer can look at different "things" simultaneously. 8 attention heads is a common choice (the paper uses 8); see the sketch after this list.
- The description doesn't take position in the sequence into account (W(Q1,K1) and W(Q1,Kn) are treated exactly the same). To account for ordering, a "positional encoding" is normally used. In the paper this is just adding sine/cosine waves of different frequencies to the input. Works surprisingly well (also sketched after this list).
- The transformer architecture stacks a number of these "attention layers" one after the other, in 2 separate stacks (encoder and decoder). The paper is about machine translation, so the encoder handles the input text and the decoder the output text. Attention layers work just fine in other configurations as well.
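Here's a rough sketch of the first two details, continuing from the snippet above (so numpy and attention() are already defined). Again, the names, head count, and dimensions are just example values I picked:

```python
def positional_encoding(n, d_model):
    """Sinusoidal positional encoding: sine/cosine waves of different frequencies."""
    pos = np.arange(n)[:, None]              # (n, 1) positions
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2) even dimension indices
    angles = pos / (10000 ** (i / d_model))  # lower frequencies for higher dimensions
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def multi_head_attention(X, head_weights, Wo):
    """head_weights is a list of (Wq, Wk, Wv) triples, one per head;
    the head outputs are concatenated and mixed by a learned Wo."""
    Z = np.concatenate([attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in head_weights], axis=-1)
    return Z @ Wo                            # back to (n, d_model)

# toy usage: 8 heads of size 8 on a model dimension of 64 (sizes again arbitrary)
n, d_model, n_heads = 5, 64, 8
d_k = d_model // n_heads
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_model))
X = X + positional_encoding(n, d_model)      # inject ordering information into the input
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_k, d_model))
out = multi_head_attention(X, heads, Wo)     # (n, d_model)
```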
The rest of the architecture is fairly standard stuff: feed-forward layers, residual connections, and layer normalization.