Attention Is All You Need
Citation
Vaswani et al., “Attention Is All You Need”, NeurIPS 2017.
Key ideas
- Replaces recurrence and convolutions entirely with self-attention.
- Multi-head attention lets the model jointly attend to information from different representation subspaces.
- Positional encodings (sinusoidal) inject sequence order since there is no recurrence (see the sketch after this list).
- Encoder-decoder architecture with cross-attention between the two.
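
A minimal NumPy sketch of the sinusoidal positional encodings, assuming an even `d_model`; the function name and the idea of precomputing the whole matrix up front are mine, not from the paper:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix where
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

# Added to the token embeddings before the first layer, e.g.
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```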
What I found interesting
The scaling factor \(\frac{1}{\sqrt{d_k}}\) in the attention formula prevents dot products from growing too large in high dimensions, which would push softmax into low-gradient regions.
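
A small NumPy sketch of scaled dot-product attention to make the scaling concrete; batching, masking, and the multi-head projections are omitted, and the variable names are my own:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # scale keeps score variance ~1
    scores -= scores.max(axis=-1, keepdims=True)       # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V
```

Without the `1/np.sqrt(d_k)` factor, the variance of each dot product grows with `d_k`, so the softmax saturates and gradients through it become tiny.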
Open questions
- How sensitive is performance to the specific positional encoding scheme?
- What breaks first as sequence length grows?