# Transformers — Architecture Overview
## Core architecture
A transformer consists of stacked encoder and/or decoder blocks. Each block has two core sub-layers (decoder blocks in encoder-decoder models add a third, cross-attention over the encoder output):
- Multi-head self-attention — lets each token attend to every other position (causally masked in decoder blocks, so a token cannot attend to later positions).
- Feed-forward network — applied independently to each position.
Residual connections and layer normalization wrap each sub-layer.
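As a concrete illustration, here is a minimal sketch of one block in PyTorch. It assumes a pre-norm arrangement (the original paper used post-norm) and reuses PyTorch's built-in `nn.MultiheadAttention`; the dimensions, activation, and dropout rate are illustrative defaults, not values from any particular model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm transformer block: self-attention + feed-forward,
    each wrapped in a residual connection and layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),               # common modern choice; the original used ReLU
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Sub-layer 1: multi-head self-attention with residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + self.dropout(attn_out)
        # Sub-layer 2: position-wise feed-forward network with residual connection.
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x
```

Stacking several such blocks, together with token and positional embeddings and an output head, gives the full model.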
## Key variants
| Model | Type | Notes |
|---|---|---|
| BERT | Encoder-only | Pre-trained with masked language modelling (MLM) and next-sentence prediction (NSP) |
| GPT | Decoder-only | Autoregressive language modelling |
| T5 | Encoder-decoder | Text-to-text framing for all tasks |
| ViT | Encoder-only | Patches as tokens for vision |
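One concrete way these variants differ is the attention mask: encoder-only models such as BERT and ViT attend bidirectionally over the whole input, while decoder-only models such as GPT apply a causal mask so each position only sees earlier ones. A small sketch of that mask (PyTorch, illustrative only):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True marks positions that must NOT be attended to (future tokens).
    # Decoder-only models (e.g. GPT) use this; encoder-only models
    # (e.g. BERT, ViT) attend over all positions and use no such mask.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Usable directly as `attn_mask` in the block sketched above:
# block(x, attn_mask=causal_mask(x.size(1)))
```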
## Things still to understand
- Efficient attention variants (Linformer, Performer, FlashAttention).
- How positional encodings differ across models (learned vs. sinusoidal vs. RoPE).
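As a starting point on the positional-encoding question, the fixed sinusoidal scheme from the original transformer paper can be sketched as below; learned embeddings train such a table instead of computing it, and RoPE rotates the query/key vectors rather than adding anything to the embeddings. The function name and the use of PyTorch are illustrative choices here.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # Fixed sinusoidal encodings from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # added to the token embeddings before the first block
```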