Module 4 — Classical Machine Learning · Intermediate · 30 min

Linear Regression

The Simplest ML Model

Linear regression answers one question: given some features, what number should I predict?

Examples:

  • Given house size → predict price
  • Given hours studied → predict exam score
  • Given temperature and humidity → predict energy consumption

The Idea: Fit a Line

For a single feature, linear regression finds the best-fit line through your data:

y = w·x + b

Where:

  • y is the prediction (e.g., house price)
  • x is the input feature (e.g., house size)
  • w is the weight / slope (learned)
  • b is the bias / intercept (learned)

For multiple features (multiple linear regression): y = w₁·x₁ + w₂·x₂ + … + wₙ·xₙ + b
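Plugged into code, a prediction is just this weighted sum. A minimal sketch — the values of w and b here are made up for illustration, not learned from data:

```python
# Hypothetical parameters — in practice these are learned from data
w, b = 200.0, 50_000.0          # $/sqft slope, base price

def predict(area_sqft: float) -> float:
    """Single-feature linear regression prediction: y = w*x + b."""
    return w * area_sqft + b

print(predict(1500))   # → 350000.0
```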


How Does It “Learn”? — Least Squares

The model learns by minimizing the Mean Squared Error (MSE):

MSE = (1/n) · Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Where yᵢ is the true value and ŷᵢ is the predicted value.

It finds the values of w and b that make this error as small as possible.
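For a single feature, the minimizing w and b even have a closed form. A small NumPy sketch on made-up data (the sample points are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.0, 6.9, 9.2])   # roughly y = 2x + 1, plus noise

# Closed-form least-squares solution for y = w*x + b
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

print(f"w = {w:.2f}, b = {b:.2f}")   # → w = 2.02, b = 1.00
```

This is exactly what `LinearRegression().fit()` computes for you (in the multi-feature case, via linear algebra rather than this two-variable formula).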


Implementation with scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# ── 1. Create / load data ──────────────────────────────────────
np.random.seed(42)
n = 200
area = np.random.normal(1500, 400, n)                   # house area (sqft)
price = 200 * area + np.random.normal(0, 60000, n)      # house price ($)

df = pd.DataFrame({"area": area, "price": price})

# ── 2. Split into train / test ────────────────────────────────
X = df[["area"]]   # features must be 2D
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

# ── 3. Create and train the model ─────────────────────────────
model = LinearRegression()
model.fit(X_train, y_train)      # ← learning happens here

# ── 4. Inspect learned parameters ────────────────────────────
print(f"Weight (w): {model.coef_[0]:.2f}")     # slope
print(f"Bias   (b): {model.intercept_:.2f}")   # intercept
# Weight: ~200 (we generated with 200)
# Bias:   ~0

# ── 5. Make predictions ───────────────────────────────────────
y_pred = model.predict(X_test)

# Predict for a new house
new_house = pd.DataFrame({"area": [2000]})
predicted_price = model.predict(new_house)
print(f"Predicted price for 2000sqft: ${predicted_price[0]:,.0f}")

# ── 6. Evaluate the model ─────────────────────────────────────
mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y_test, y_pred)

print(f"RMSE:  ${rmse:,.0f}")
print(f"R²:    {r2:.4f}")   # 1.0 = perfect, 0 = no better than predicting the mean

Understanding the Metrics

RMSE (Root Mean Squared Error)

RMSE = sqrt( mean of (predicted - actual)² )
  • Same unit as the target (dollars, degrees, etc.)
  • Lower is better
  • Easy to interpret: roughly, “a typical prediction is off by about $X” (large errors are weighted more heavily than in a plain average)
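Computing RMSE by hand on three toy predictions (made-up numbers) makes the definition concrete:

```python
import numpy as np

actual    = np.array([300_000, 450_000, 250_000])
predicted = np.array([320_000, 430_000, 260_000])

# RMSE = sqrt( mean of squared errors )
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(f"RMSE: ${rmse:,.0f}")   # → RMSE: $17,321
```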

R² (R-Squared)

  • Ranges from 0 to 1 (can be negative for very bad models)
  • R² = 0.95 means the model explains 95% of the variance in the target
  • Higher is better
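The definition behind those bullets is R² = 1 − SS_res / SS_tot, i.e. one minus the fraction of variance the model leaves unexplained. Computed directly on toy numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat  = np.array([2.8, 5.1, 7.3, 8.8])

ss_res = np.sum((y_true - y_hat) ** 2)           # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
r2 = 1 - ss_res / ss_tot
print(f"R²: {r2:.3f}")   # → R²: 0.991
```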
# R² Interpretation
if r2 >= 0.9:
    print("Excellent fit!")
elif r2 >= 0.7:
    print("Good fit.")
elif r2 >= 0.5:
    print("Moderate fit.")
else:
    print("Poor fit — try different features or a more complex model.")

Multiple Linear Regression

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load dataset (housing features → median house value)
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name="price")   # median house value, in $100,000s

print(X.head())
print(f"\nData: {X.shape}")

# Scale features — doesn't change the fit for plain linear regression,
# but makes coefficient magnitudes comparable across features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"R²:   {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

# Feature importance (coefficients)
coeff_df = pd.DataFrame({
    "Feature": housing.feature_names,
    "Coefficient": model.coef_,
}).sort_values("Coefficient", key=abs, ascending=False)
print(coeff_df)

When to Use Linear Regression

✅ Use it when:

  • The relationship between features and target is roughly linear
  • You want an interpretable model
  • You want a fast baseline before trying complex models

❌ Don’t use it when:

  • The relationship is highly non-linear
  • You’re doing classification (use logistic regression)
  • Features have complex interactions
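If the relationship is only mildly non-linear, a common workaround is to feed polynomial features into the same linear model. A sketch on synthetic quadratic data (not from this module's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.2, 200)   # quadratic, not linear

linear = LinearRegression().fit(X, y)
poly   = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(f"Plain linear R²:      {linear.score(X, y):.3f}")   # poor fit
print(f"Degree-2 features R²: {poly.score(X, y):.3f}")     # near 1
```

The model is still linear in its parameters — it just sees x² as an extra input column — so everything above about least squares and the metrics still applies.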
Knowledge Check

R² = 0.85 means: