Attention Is All You Need

Citation

Vaswani et al., “Attention Is All You Need”, Advances in Neural Information Processing Systems (NeurIPS), 2017.

Key ideas

  • Replaces recurrence and convolutions entirely with self-attention.
  • Multi-head attention lets the model jointly attend to information from different representation subspaces.
  • Positional encodings (sinusoidal) inject sequence order, since there is no recurrence to provide it (a minimal sketch follows this list).
  • Encoder-decoder architecture, with the decoder attending to the encoder's output through cross-attention.
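
Here is a minimal NumPy sketch of the sinusoidal positional encoding from the paper; the function name, array shapes, and the example values (sequence length 50, d_model = 512) are my own illustration, not the authors' reference code.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix with
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)).
    Assumes d_model is even."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model // 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

# Example: encodings for a 50-token sequence at the paper's d_model = 512,
# added to the token embeddings before the first layer.
pe = sinusoidal_positional_encoding(50, 512)
print(pe.shape)  # (50, 512)
```

Each dimension is a sinusoid of a different wavelength; the paper notes that for any fixed offset k, PE(pos+k) is a linear function of PE(pos), which they hypothesize makes it easy to attend by relative position.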

What I found interesting

The scaling factor \(\frac{1}{\sqrt{d_k}}\) in the attention formula prevents dot products from growing too large in high dimensions, which would push softmax into low-gradient regions.
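
To make the role of the scaling concrete, here is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V; the function name and the toy shapes in the example are my own, not the paper's.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k); without the scaling, magnitudes grow with d_k
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over the keys
    return weights @ V                              # (n_q, d_v)

# With zero-mean, unit-variance i.i.d. entries, each dot product q·k has variance d_k,
# so dividing by sqrt(d_k) keeps the logits O(1) and the softmax away from saturation.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((4, 64)), rng.standard_normal((6, 64)), rng.standard_normal((6, 32))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 32)
```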

Open questions

  • How sensitive is performance to the specific positional encoding scheme?
  • What breaks first as sequence length grows?


