Transformers Explained
The Revolution of 2017
In 2017, Google published “Attention Is All You Need” — the paper that introduced the Transformer architecture. It replaced RNNs and revolutionized all of NLP (and later vision, audio, and more).
Today, BERT, GPT-4, LLaMA, Claude, and virtually every powerful language model is based on the Transformer.
The Problem Transformers Solve
Old approach (RNNs/LSTMs): process text word by word, left to right.
- Slow (sequential, can’t parallelize)
- “Forgets” long-range dependencies
"The cat that sat in the corner of the room was [HUNGRY]"
By the time we reach "hungry", the model has "forgotten" about "cat"
Transformers: process all words simultaneously, with each word attending to every other word.
The Attention Mechanism
The core insight: to understand a word, look at all other words and decide which ones are most relevant.
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query matrix (what am I looking for?)
    K: Key matrix (what do I contain?)
    V: Value matrix (what do I actually output?)
    """
    d_k = Q.size(-1)
    # Attention scores: how much each position should attend to others
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Softmax → probabilities
    attention_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
# Example: sentence with 5 words, d_model=64
seq_len, d_model = 5, 64
Q = torch.rand(1, seq_len, d_model)
K = torch.rand(1, seq_len, d_model)
V = torch.rand(1, seq_len, d_model)
output, attn = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}") # (1, 5, 64)
print(f"Attention map: {attn.shape}") # (1, 5, 5) — each word attends to all 5
Visualizing Attention
        The   cat   sat   on    mat
The   [0.60  0.05  0.10  0.15  0.10]
cat   [0.10  0.70  0.05  0.05  0.10]
sat   [0.05  0.20  0.50  0.15  0.10]
on    [0.10  0.05  0.20  0.50  0.15]
mat   [0.10  0.10  0.10  0.20  0.50]
Each row shows where a word “looks” when building its representation.
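You can draw the same kind of map for the toy example above (a quick sketch using matplotlib; since Q, K, and V were random, the weights won't be as interpretable as the idealized matrix):

import matplotlib.pyplot as plt

words = ["The", "cat", "sat", "on", "mat"]
plt.imshow(attn[0].detach(), cmap="viridis")
plt.xticks(range(len(words)), words)
plt.yticks(range(len(words)), words)
plt.title("Attention map for the toy example")
plt.colorbar()
plt.show()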
Multi-Head Attention
Instead of one attention, use multiple “heads” in parallel — each learning different relationships:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_o = torch.nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch, seq_len, d_model = x.shape
        x = x.view(batch, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)  # (batch, heads, seq_len, d_k)

    def forward(self, Q, K, V, mask=None):
        Q, K, V = self.W_q(Q), self.W_k(K), self.W_v(V)
        Q, K, V = self.split_heads(Q), self.split_heads(K), self.split_heads(V)
        out, attn = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate the heads back into d_model, then mix them with the output projection
        batch, heads, seq, d_k = out.shape
        out = out.transpose(1, 2).contiguous().view(batch, seq, heads * d_k)
        return self.W_o(out)
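A quick sanity check with the same toy dimensions as before (the input x is made up for illustration; in self-attention, Q, K, and V are all the same tensor):

x = torch.rand(1, 5, 64)  # (batch, seq_len, d_model)
mha = MultiHeadAttention(d_model=64, num_heads=8)
out = mha(x, x, x)        # self-attention: Q = K = V = x
print(out.shape)          # torch.Size([1, 5, 64])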
The Full Transformer Block
class TransformerBlock(torch.nn.Module):
    def __init__(self, d_model, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d_model, ff_dim),
            torch.nn.GELU(),
            torch.nn.Linear(ff_dim, d_model),
        )
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual connection
        attn_out = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))  # "Add & Norm"
        # Feed-forward + residual connection
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))  # "Add & Norm"
        return x
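A full Transformer is essentially a stack of these blocks (a minimal sketch; real models also add token embeddings, positional encodings, and a task-specific head):

encoder = torch.nn.Sequential(
    *[TransformerBlock(d_model=64, num_heads=8, ff_dim=256) for _ in range(6)]
)
x = torch.rand(1, 5, 64)
print(encoder(x).shape)  # torch.Size([1, 5, 64]); the shape is preserved through the stack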
BERT vs. GPT
|  | BERT | GPT |
|---|---|---|
| Architecture | Encoder only | Decoder only |
| Training | Masked Language Modeling | Next token prediction |
| Direction | Bidirectional (sees all context) | Autoregressive (left-to-right) |
| Best for | Understanding (classification, QA) | Generation (text, code) |
| Examples | BERT, RoBERTa, DistilBERT | GPT-4, LLaMA, Claude |
BERT: Masked Language Modeling
Input: "The [MASK] sat on the mat"
Output: Predict masked token: "cat" (with 94% confidence)
BERT reads the full sentence in both directions — making it excellent for understanding tasks.
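You can try this yourself with the Hugging Face transformers library (a sketch assuming the package is installed; bert-base-uncased is one standard checkpoint):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The [MASK] sat on the mat."):
    print(f"{pred['token_str']}: {pred['score']:.2%}")  # top candidate tokens with their scores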
GPT: Causal Language Modeling
Input: "The cat sat on the"
Output: Predict next token: "mat"
GPT can only see previous tokens — making it perfect for generation.
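This "only look backwards" rule is exactly what the mask argument in scaled_dot_product_attention is for. A causal mask is a lower-triangular matrix: position i may attend to positions 0..i and nothing after (a sketch reusing Q, K, V, and seq_len from the first example):

causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = may attend, 0 = blocked
out, attn = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(attn[0])  # row i is zero after column i: each token sees only itself and earlier tokens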
Positional Encoding
Transformers don’t inherently know word order (unlike RNNs). We add positional encodings to tell the model where each word is:
import numpy as np
import matplotlib.pyplot as plt

def get_positional_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, np.newaxis]
    dims = np.arange(0, d_model, 2)
    # Even dimensions get sine waves, odd dimensions get cosines,
    # at geometrically decreasing frequencies
    pe[:, 0::2] = np.sin(positions / (10000 ** (dims / d_model)))
    pe[:, 1::2] = np.cos(positions / (10000 ** (dims / d_model)))
    return pe
pe = get_positional_encoding(50, 128)
plt.figure(figsize=(12, 4))
plt.imshow(pe.T, cmap="RdBu", aspect="auto")
plt.title("Positional Encoding Visualization")
plt.xlabel("Position in sequence")
plt.ylabel("Embedding dimension")
plt.colorbar()
plt.show()
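In a real model, these encodings are simply added to the token embeddings before the first Transformer block (a minimal sketch; the token_embeddings tensor here is made up, where real models would use a learned embedding table):

token_embeddings = torch.rand(1, 50, 128)                  # (batch, seq_len, d_model)
pe_t = torch.tensor(pe, dtype=torch.float32).unsqueeze(0)  # (1, seq_len, d_model)
x = token_embeddings + pe_t  # each position now carries both meaning and order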
What is the key advantage of the Transformer's self-attention over RNNs for long sequences?