Transformers Explained
The Revolution of 2017
In 2017, Google published “Attention Is All You Need” — the paper that introduced the Transformer architecture. It replaced RNNs and revolutionized all of NLP (and later vision, audio, and more).
Today, BERT, GPT-4, LLaMA, Claude, and virtually every powerful language model is based on the Transformer.
The Problem Transformers Solve
Old approach (RNNs/LSTMs): process text word by word, left to right.
- Slow (sequential, can’t parallelize)
- “Forgets” long-range dependencies
"The cat that sat in the corner of the room was [HUNGRY]"
By the time we reach "hungry", the model has "forgotten" about "cat"
Transformers: process all words simultaneously, with each word attending to every other word.
The Attention Mechanism
The core insight: to understand a word, look at all other words and decide which ones are most relevant.
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query matrix (what am I looking for?)
    K: Key matrix (what do I contain?)
    V: Value matrix (what do I actually output?)
    """
    d_k = Q.size(-1)
    # Attention scores: how much each position should attend to others
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Softmax → probabilities
    attention_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
# Example: sentence with 5 words, d_model=64
seq_len, d_model = 5, 64
Q = torch.rand(1, seq_len, d_model)
K = torch.rand(1, seq_len, d_model)
V = torch.rand(1, seq_len, d_model)
output, attn = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}") # (1, 5, 64)
print(f"Attention map: {attn.shape}") # (1, 5, 5) — each word attends to all 5
Visualizing Attention
        The   cat   sat   on    mat
The   [0.60  0.05  0.10  0.15  0.10]
cat   [0.10  0.70  0.05  0.05  0.10]
sat   [0.05  0.20  0.50  0.15  0.10]
on    [0.10  0.05  0.20  0.50  0.15]
mat   [0.10  0.10  0.10  0.20  0.50]
Each row shows where a word “looks” when building its representation.
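You can draw the same kind of map for the toy example above (a quick sketch using matplotlib; since Q, K, and V were random, the weights won't be as interpretable as the idealized matrix):

import matplotlib.pyplot as plt

words = ["The", "cat", "sat", "on", "mat"]
plt.imshow(attn[0].detach(), cmap="viridis")
plt.xticks(range(len(words)), words)
plt.yticks(range(len(words)), words)
plt.title("Attention map for the toy example")
plt.colorbar()
plt.show()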
Multi-Head Attention
Instead of one attention, use multiple “heads” in parallel — each learning different relationships:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_o = torch.nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch, seq_len, d_model = x.shape
        x = x.view(batch, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)  # (batch, heads, seq_len, d_k)

    def forward(self, Q, K, V, mask=None):
        Q, K, V = self.W_q(Q), self.W_k(K), self.W_v(V)
        Q, K, V = self.split_heads(Q), self.split_heads(K), self.split_heads(V)
        out, attn = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate the heads back into d_model, then mix them with the output projection
        batch, heads, seq, d_k = out.shape
        out = out.transpose(1, 2).contiguous().view(batch, seq, heads * d_k)
        return self.W_o(out)
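A quick sanity check with the same toy dimensions as before (the input x is made up for illustration; in self-attention, Q, K, and V are all the same tensor):

x = torch.rand(1, 5, 64)  # (batch, seq_len, d_model)
mha = MultiHeadAttention(d_model=64, num_heads=8)
out = mha(x, x, x)        # self-attention: Q = K = V = x
print(out.shape)          # torch.Size([1, 5, 64])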
The Full Transformer Block
class TransformerBlock(torch.nn.Module):
    def __init__(self, d_model, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d_model, ff_dim),
            torch.nn.GELU(),
            torch.nn.Linear(ff_dim, d_model),
        )
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual connection
        attn_out = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))  # "Add & Norm"
        # Feed-forward + residual connection
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))  # "Add & Norm"
        return x
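A full Transformer is essentially a stack of these blocks (a minimal sketch; real models also add token embeddings, positional encodings, and a task-specific head):

encoder = torch.nn.Sequential(
    *[TransformerBlock(d_model=64, num_heads=8, ff_dim=256) for _ in range(6)]
)
x = torch.rand(1, 5, 64)
print(encoder(x).shape)  # torch.Size([1, 5, 64]); the shape is preserved through the stack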
BERT vs. GPT
|  | BERT | GPT |
|---|---|---|
| Architecture | Encoder only | Decoder only |
| Training | Masked Language Modeling | Next token prediction |
| Direction | Bidirectional (sees all context) | Autoregressive (left-to-right) |
| Best for | Understanding (classification, QA) | Generation (text, code) |
| Examples | BERT, RoBERTa, DistilBERT | GPT-4, LLaMA, Claude |
BERT: Masked Language Modeling
Input: "The [MASK] sat on the mat"
Output: Predict masked token: "cat" (with 94% confidence)
BERT reads the full sentence in both directions — making it excellent for understanding tasks.
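You can try this yourself with the Hugging Face transformers library (a sketch assuming the package is installed; bert-base-uncased is one standard checkpoint):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The [MASK] sat on the mat."):
    print(f"{pred['token_str']}: {pred['score']:.2%}")  # top candidate tokens with their scores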
GPT: Causal Language Modeling
Input: "The cat sat on the"
Output: Predict next token: "mat"
GPT can only see previous tokens — making it perfect for generation.
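This "only look backwards" rule is exactly what the mask argument in scaled_dot_product_attention is for. A causal mask is a lower-triangular matrix: position i may attend to positions 0..i and nothing after (a sketch reusing Q, K, V, and seq_len from the first example):

causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = may attend, 0 = blocked
out, attn = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(attn[0])  # row i is zero after column i: each token sees only itself and earlier tokens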
Positional Encoding
Transformers don’t inherently know word order (unlike RNNs). We add positional encodings to tell the model where each word is:
import numpy as np
import matplotlib.pyplot as plt

def get_positional_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, np.newaxis]
    dims = np.arange(0, d_model, 2)
    # Even dimensions get sine waves, odd dimensions get cosines,
    # at geometrically decreasing frequencies
    pe[:, 0::2] = np.sin(positions / (10000 ** (dims / d_model)))
    pe[:, 1::2] = np.cos(positions / (10000 ** (dims / d_model)))
    return pe
pe = get_positional_encoding(50, 128)
plt.figure(figsize=(12, 4))
plt.imshow(pe.T, cmap="RdBu", aspect="auto")
plt.title("Positional Encoding Visualization")
plt.xlabel("Position in sequence")
plt.ylabel("Embedding dimension")
plt.colorbar()
plt.show()
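In a real model, these encodings are simply added to the token embeddings before the first Transformer block (a minimal sketch; the token_embeddings tensor here is made up, where real models would use a learned embedding table):

token_embeddings = torch.rand(1, 50, 128)                  # (batch, seq_len, d_model)
pe_t = torch.tensor(pe, dtype=torch.float32).unsqueeze(0)  # (1, seq_len, d_model)
x = token_embeddings + pe_t  # each position now carries both meaning and order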
What is the key advantage of the Transformer's self-attention over RNNs for long sequences?