# Forward Pass & Backpropagation
## The Training Loop in Three Steps
Every training step in a neural network is:
1. FORWARD PASS → compute predictions from inputs
2. LOSS → measure how wrong the predictions are
3. BACKWARD PASS → compute gradients and update weights
Let’s understand each step.
## Step 1: The Forward Pass
Data flows forward through the network, layer by layer:
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

# A tiny 2-layer network for the XOR problem
# Input: [x1, x2], Output: probability of class 1
def forward_pass(x, W1, b1, W2, b2):
    # Layer 1: hidden
    z1 = x @ W1 + b1   # (n, 2) @ (2, 3) + (3,) = (n, 3)
    a1 = relu(z1)      # apply activation
    # Layer 2: output
    z2 = a1 @ W2 + b2  # (n, 3) @ (3, 1) + (1,) = (n, 1)
    a2 = sigmoid(z2)   # probability
    return a2, (z1, a1, z2, a2)  # output + cache for backprop

# Initialize weights randomly
np.random.seed(42)
W1 = np.random.randn(2, 3) * 0.1  # input → hidden
b1 = np.zeros(3)
W2 = np.random.randn(3, 1) * 0.1  # hidden → output
b2 = np.zeros(1)

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])  # XOR outputs

output, cache = forward_pass(X, W1, b1, W2, b2)
print("Initial predictions:")
print(output.round(4))
```
## Step 2: The Loss Function
The loss measures how wrong the predictions are. We want to minimize this during training.
### Binary Cross-Entropy (for classification)
```python
def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

loss = binary_cross_entropy(y, output)
print(f"Initial loss: {loss:.4f}")  # high — random weights are bad
```
### Mean Squared Error (for regression)
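For regression targets, the standard choice is mean squared error. A minimal sketch in the same NumPy style (this snippet is an illustrative addition, not part of the original code above):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of squared differences; penalizes large errors quadratically
    return np.mean((y_true - y_pred) ** 2)

print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 3.0])))  # 0.5
```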
## Step 3: Backpropagation
Backpropagation uses the chain rule of calculus to compute how much each weight contributes to the loss.
The key question: “If I increase a weight by a tiny amount, how much does the loss change?”
This is the gradient, written ∂L/∂w:

```
Forward:  input → [W1, b1] → hidden → [W2, b2] → output → loss
Backward: loss → dL/dW2 → dL/dW1   (using the chain rule)
```
### Gradient Descent Update Rule

w ← w − η · ∂L/∂w

where η is the learning rate.
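The update rule is easiest to see in isolation on a one-dimensional toy problem. A minimal sketch (the function f(w) = (w − 3)² and all the values here are illustrative, not from the XOR network):

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent.
# Analytic gradient: df/dw = 2 * (w - 3)
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad  # the update rule: w ← w − lr · df/dw

print(round(w, 4))  # converges to the minimum at w = 3
```

Each step moves `w` a fraction `lr` of the way down the slope; the same rule, applied per weight matrix, is what `backward_pass` below performs.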
```python
def backward_pass(X, y, W1, b1, W2, b2, cache, lr=0.1):
    z1, a1, z2, a2 = cache
    n = X.shape[0]
    # Output layer gradients
    dz2 = a2 - y            # d_loss/d_z2 (sigmoid + BCE combined)
    dW2 = (a1.T @ dz2) / n  # d_loss/d_W2
    db2 = dz2.mean(axis=0)  # d_loss/d_b2
    # Hidden layer gradients
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0)    # ReLU gradient: 1 if z > 0 else 0
    dW1 = (X.T @ dz1) / n
    db1 = dz1.mean(axis=0)
    # Update weights (gradient descent)
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1
    return W1, b1, W2, b2
```
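The shortcut `dz2 = a2 - y` packs two chain-rule steps (the sigmoid derivative and the cross-entropy derivative) into one expression. It can be spot-checked against a finite-difference estimate; a self-contained sketch (the logits `z` here are arbitrary test values, unrelated to the XOR run):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def bce_from_logits(y, z):
    p = np.clip(sigmoid(z), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([[0.3], [-1.2], [2.0], [0.0]])  # arbitrary logits
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Analytic gradient of the mean BCE w.r.t. z: (sigmoid(z) - y) / n
analytic = (sigmoid(z) - y) / z.shape[0]

# Central finite-difference estimate, one element at a time
h = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numeric[i] = (bce_from_logits(y, zp) - bce_from_logits(y, zm)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-5))  # True if the math checks out
```

Gradient checking like this is the standard way to debug a hand-written backward pass.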
```python
# Training loop
losses = []
for epoch in range(1000):
    output, cache = forward_pass(X, W1, b1, W2, b2)
    loss = binary_cross_entropy(y, output)
    losses.append(loss)
    W1, b1, W2, b2 = backward_pass(X, y, W1, b1, W2, b2, cache, lr=0.5)
    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch+1:4d}: loss = {loss:.4f}")

# Final predictions
output, _ = forward_pass(X, W1, b1, W2, b2)
print("\nFinal predictions (should be ~0, 1, 1, 0):")
print(output.round(4))
```
## Visualizing Learning
```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(losses, color="royalblue", linewidth=2)
plt.title("Training Loss Curve")
plt.xlabel("Epoch")
plt.ylabel("Binary Cross-Entropy Loss")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
A properly training model shows a decreasing loss curve: a steep drop early on, then gradual flattening as the model converges.
## Key Concepts Summary
| Concept | Role |
|---|---|
| Forward pass | Compute predictions |
| Loss function | Measure error |
| Backpropagation | Compute gradients via chain rule |
| Gradient descent | Update weights using gradients |
| Learning rate | Step size for weight updates |
| Epoch | One full pass through the training data |
## Why We Don’t Implement This Manually
Implementing backprop by hand is educational but:
- Error-prone for deep/complex networks
- Slow (not GPU-optimized)
That’s why we use PyTorch or TensorFlow — they compute gradients automatically (autograd). You just define the forward pass!
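To make “automatic gradients” concrete, here is a toy scalar autograd engine in plain Python. This is a deliberately minimal sketch of the idea (real frameworks are far more sophisticated): each value remembers how it was produced, and `backward()` replays the chain rule in reverse.

```python
class Value:
    """A scalar that remembers how it was computed, for reverse-mode autodiff."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents    # Values this one was computed from
        self._grad_fns = grad_fns  # local chain-rule steps, one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a+b)/da = 1, d(a+b)/db = 1 → pass the gradient through unchanged
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (lambda g, o=other: g * o.data,
                      lambda g, s=self: g * s.data))

    def backward(self):
        # Build a topological order, then apply the chain rule output → inputs
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for p, fn in zip(v._parents, v._grad_fns):
                p.grad += fn(v.grad)

# f = w*x + w*x = 2wx, so df/dw = 2x and df/dx = 2w
w = Value(3.0)
x = Value(2.0)
f = w * x + w * x
f.backward()
print(w.grad, x.grad)  # 4.0 6.0
```

The forward pass builds the graph as a side effect of ordinary arithmetic; `backward()` then fills in every `.grad`. PyTorch’s autograd follows the same principle, just over tensors instead of scalars.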
In gradient descent, the learning rate controls the size of each weight update: too small and training crawls; too large and the loss can oscillate or diverge.