# Forward Pass & Backpropagation
## The Training Loop in Three Steps
Every training step in a neural network is:
1. FORWARD PASS → compute predictions from inputs
2. LOSS → measure how wrong the predictions are
3. BACKWARD PASS → compute gradients and update weights
Let’s understand each step.
## Step 1: The Forward Pass
Data flows forward through the network, layer by layer:
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

# A tiny 2-layer network for the XOR problem
# Input: [x1, x2], Output: probability of class 1
def forward_pass(x, W1, b1, W2, b2):
    # Layer 1: hidden
    z1 = x @ W1 + b1   # (n, 2) @ (2, 3) + (3,) = (n, 3)
    a1 = relu(z1)      # apply activation
    # Layer 2: output
    z2 = a1 @ W2 + b2  # (n, 3) @ (3, 1) + (1,) = (n, 1)
    a2 = sigmoid(z2)   # probability
    return a2, (z1, a1, z2, a2)  # output + cache for backprop

# Initialize weights randomly
np.random.seed(42)
W1 = np.random.randn(2, 3) * 0.1  # input → hidden
b1 = np.zeros(3)
W2 = np.random.randn(3, 1) * 0.1  # hidden → output
b2 = np.zeros(1)

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])  # XOR outputs

output, cache = forward_pass(X, W1, b1, W2, b2)
print("Initial predictions:")
print(output.round(4))
```
## Step 2: The Loss Function
The loss measures how wrong the predictions are. We want to minimize this during training.
### Binary Cross-Entropy (for classification)
```python
def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

loss = binary_cross_entropy(y, output)
print(f"Initial loss: {loss:.4f}")  # high — random weights are bad
```
### Mean Squared Error (for regression)
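For regression targets, the standard choice is mean squared error. A minimal sketch in the same NumPy style (this snippet is an illustrative addition, not part of the original code above):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of squared differences; penalizes large errors quadratically
    return np.mean((y_true - y_pred) ** 2)

print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 3.0])))  # 0.5
```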
## Step 3: Backpropagation
Backpropagation uses the chain rule of calculus to compute how much each weight contributes to the loss.
The key question: “If I increase a weight by a tiny amount, how much does the loss change?”
This is the gradient, written ∂L/∂w:

```
Forward:  input → [W1, b1] → hidden → [W2, b2] → output → loss
Backward: loss → dL/dW2 → dL/dW1   (using the chain rule)
```
### Gradient Descent Update Rule

w ← w − η · ∂L/∂w

where η is the learning rate.
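The update rule is easiest to see in isolation on a one-dimensional toy problem. A minimal sketch (the function f(w) = (w − 3)² and all the values here are illustrative, not from the XOR network):

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent.
# Analytic gradient: df/dw = 2 * (w - 3)
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad  # the update rule: w ← w − lr · df/dw

print(round(w, 4))  # converges to the minimum at w = 3
```

Each step moves `w` a fraction `lr` of the way down the slope; the same rule, applied per weight matrix, is what `backward_pass` below performs.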
```python
def backward_pass(X, y, W1, b1, W2, b2, cache, lr=0.1):
    z1, a1, z2, a2 = cache
    n = X.shape[0]
    # Output layer gradients
    dz2 = a2 - y            # d_loss/d_z2 (sigmoid + BCE combined)
    dW2 = (a1.T @ dz2) / n  # d_loss/d_W2
    db2 = dz2.mean(axis=0)  # d_loss/d_b2
    # Hidden layer gradients
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0)    # ReLU gradient: 1 if z > 0 else 0
    dW1 = (X.T @ dz1) / n
    db1 = dz1.mean(axis=0)
    # Update weights (gradient descent)
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1
    return W1, b1, W2, b2
```
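The shortcut `dz2 = a2 - y` packs two chain-rule steps (the sigmoid derivative and the cross-entropy derivative) into one expression. It can be spot-checked against a finite-difference estimate; a self-contained sketch (the logits `z` here are arbitrary test values, unrelated to the XOR run):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def bce_from_logits(y, z):
    p = np.clip(sigmoid(z), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([[0.3], [-1.2], [2.0], [0.0]])  # arbitrary logits
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Analytic gradient of the mean BCE w.r.t. z: (sigmoid(z) - y) / n
analytic = (sigmoid(z) - y) / z.shape[0]

# Central finite-difference estimate, one element at a time
h = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numeric[i] = (bce_from_logits(y, zp) - bce_from_logits(y, zm)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-5))  # True if the math checks out
```

Gradient checking like this is the standard way to debug a hand-written backward pass.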
```python
# Training loop
losses = []
for epoch in range(1000):
    output, cache = forward_pass(X, W1, b1, W2, b2)
    loss = binary_cross_entropy(y, output)
    losses.append(loss)
    W1, b1, W2, b2 = backward_pass(X, y, W1, b1, W2, b2, cache, lr=0.5)
    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch+1:4d}: loss = {loss:.4f}")

# Final predictions
output, _ = forward_pass(X, W1, b1, W2, b2)
print("\nFinal predictions (should be ~0, 1, 1, 0):")
print(output.round(4))
```
## Visualizing Learning
```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(losses, color="royalblue", linewidth=2)
plt.title("Training Loss Curve")
plt.xlabel("Epoch")
plt.ylabel("Binary Cross-Entropy Loss")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
A properly training model shows a decreasing loss curve: a steep drop early on, then gradual flattening as the model converges.
## Key Concepts Summary
| Concept | Role |
|---|---|
| Forward pass | Compute predictions |
| Loss function | Measure error |
| Backpropagation | Compute gradients via chain rule |
| Gradient descent | Update weights using gradients |
| Learning rate | Step size for weight updates |
| Epoch | One full pass through the training data |
## Why We Don’t Implement This Manually
Implementing backprop by hand is educational but:
- Error-prone for deep/complex networks
- Slow (not GPU-optimized)
That’s why we use PyTorch or TensorFlow — they compute gradients automatically (autograd). You just define the forward pass!
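To make “automatic gradients” concrete, here is a toy scalar autograd engine in plain Python. This is a deliberately minimal sketch of the idea (real frameworks are far more sophisticated): each value remembers how it was produced, and `backward()` replays the chain rule in reverse.

```python
class Value:
    """A scalar that remembers how it was computed, for reverse-mode autodiff."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents    # Values this one was computed from
        self._grad_fns = grad_fns  # local chain-rule steps, one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a+b)/da = 1, d(a+b)/db = 1 → pass the gradient through unchanged
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (lambda g, o=other: g * o.data,
                      lambda g, s=self: g * s.data))

    def backward(self):
        # Build a topological order, then apply the chain rule output → inputs
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for p, fn in zip(v._parents, v._grad_fns):
                p.grad += fn(v.grad)

# f = w*x + w*x = 2wx, so df/dw = 2x and df/dx = 2w
w = Value(3.0)
x = Value(2.0)
f = w * x + w * x
f.backward()
print(w.grad, x.grad)  # 4.0 6.0
```

The forward pass builds the graph as a side effect of ordinary arithmetic; `backward()` then fills in every `.grad`. PyTorch’s autograd follows the same principle, just over tensors instead of scalars.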
In gradient descent, the learning rate controls the size of each weight update: too small and training crawls; too large and the loss can oscillate or diverge.