Module 4 — Classical Machine Learning
# Linear Regression

*The Simplest ML Model*
Linear regression answers one question: given some features, what number should I predict?
Examples:
- Given house size → predict price
- Given hours studied → predict exam score
- Given temperature and humidity → predict energy consumption
## The Idea: Fit a Line

For a single feature, linear regression finds the best-fit line through your data:

ŷ = w·x + b

Where:
- ŷ is the prediction (e.g., house price)
- x is the input feature (e.g., house size)
- w is the weight / slope (learned)
- b is the bias / intercept (learned)

For multiple features (multiple linear regression):

ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b
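To make the formula concrete, here is a quick hand computation. The weights below are made up for illustration, not learned from data:

```python
import numpy as np

# Hypothetical learned parameters for two features (illustrative values)
w = np.array([120.0, -5.0])   # weights: w1 per sqft, w2 per year of age
b = 10_000.0                  # intercept

x = np.array([1500.0, 30.0])  # one sample: area (sqft), age (years)
y_hat = float(w @ x + b)      # y_hat = w1*x1 + w2*x2 + b
print(y_hat)                  # 120*1500 - 5*30 + 10000 = 189850.0
```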
## How Does It “Learn”? — Least Squares

The model learns by minimizing the Mean Squared Error (MSE):

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

Where yᵢ is the true value and ŷᵢ is the predicted value. It finds the values of w and b that make this error as small as possible.
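For a single feature, the MSE-minimizing w and b have a closed-form solution. A minimal sketch on toy data (a known line, so we can verify the answer):

```python
import numpy as np

# Toy data on the line y = 2x + 1 (no noise, so the fit should be exact)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# Closed-form least squares for one feature:
#   w = cov(x, y) / var(x),   b = mean(y) - w * mean(x)
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

mse = np.mean((y - (w * x + b)) ** 2)
print(w, b, mse)  # 2.0 1.0 0.0
```

This is exactly what `LinearRegression.fit` computes for you (generalized to many features).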
## Implementation with scikit-learn

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# ── 1. Create / load data ──────────────────────────────────────
np.random.seed(42)
n = 200
area = np.random.normal(1500, 400, n)               # house area (sqft)
price = 200 * area + np.random.normal(0, 60000, n)  # house price ($)
df = pd.DataFrame({"area": area, "price": price})

# ── 2. Split into train / test ─────────────────────────────────
X = df[["area"]]  # features must be 2D
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

# ── 3. Create and train the model ──────────────────────────────
model = LinearRegression()
model.fit(X_train, y_train)  # ← learning happens here

# ── 4. Inspect learned parameters ──────────────────────────────
print(f"Weight (w): {model.coef_[0]:.2f}")   # slope: ~200 (we generated with 200)
print(f"Bias (b): {model.intercept_:.2f}")   # intercept: ~0

# ── 5. Make predictions ────────────────────────────────────────
y_pred = model.predict(X_test)

# Predict for a new house
new_house = pd.DataFrame({"area": [2000]})
predicted_price = model.predict(new_house)
print(f"Predicted price for 2000 sqft: ${predicted_price[0]:,.0f}")

# ── 6. Evaluate the model ──────────────────────────────────────
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: ${rmse:,.0f}")
print(f"R²: {r2:.4f}")  # 1.0 = perfect, near 0 = terrible
```
## Understanding the Metrics

### RMSE (Root Mean Squared Error)

RMSE = sqrt( mean of (predicted − actual)² )

- Same unit as the target (dollars, degrees, etc.)
- Lower is better
- Easy to interpret: “on average, my prediction is off by $X”
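The definition above can be checked by hand in a few lines of NumPy (toy numbers):

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])  # errors: +10, -10, +30

# Square the errors, average them, take the root
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(round(rmse, 2))  # sqrt((100 + 100 + 900) / 3) ≈ 19.15
```

Squaring before averaging means large errors dominate: one miss of 30 pulls RMSE well above the typical error of 10.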
### R² (R-Squared)

- Typically ranges from 0 to 1 (it can be negative for models that fit worse than always predicting the mean)
- R² = 0.95 means the model explains 95% of the variance in the target
- Higher is better
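Under the hood, R² compares the model's squared error to the error of a baseline that always predicts the mean. A small hand computation (toy numbers):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # model's squared error
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # mean-baseline's squared error
r2 = 1 - ss_res / ss_tot
print(r2)  # 1 - 0.5/20 = 0.975
```

This matches what `sklearn.metrics.r2_score` returns, and it shows why R² can go negative: if the model's error exceeds the baseline's, the ratio is greater than 1.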
```python
# R² interpretation (rules of thumb; thresholds vary by domain)
if r2 >= 0.9:
    print("Excellent fit!")
elif r2 >= 0.7:
    print("Good fit.")
elif r2 >= 0.5:
    print("Moderate fit.")
else:
    print("Poor fit — try different features or a more complex model.")
```
## Multiple Linear Regression

```python
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load dataset (housing features → median house value)
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name="price")
print(X.head())
print(f"\nData: {X.shape}")

# Scale features (important for comparing coefficients; optional for fitting)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"R²: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

# Feature importance (coefficients are comparable because features are scaled)
coeff_df = pd.DataFrame({
    "Feature": housing.feature_names,
    "Coefficient": model.coef_,
}).sort_values("Coefficient", key=abs, ascending=False)
print(coeff_df)
```
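One pitfall with a scaled pipeline like the one above: any new sample must be transformed with the *same fitted scaler* before prediction, otherwise the model sees values on the wrong scale. A minimal sketch on synthetic stand-in data (the features and coefficients are made up, not the California housing values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Tiny synthetic stand-in: two features with very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * [2.0, 50.0] + [5.0, 100.0]
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=100)

scaler = StandardScaler()
model = LinearRegression().fit(scaler.fit_transform(X), y)

# New sample: scale with the SAME scaler, then predict
new_x = np.array([[6.0, 120.0]])
pred = model.predict(scaler.transform(new_x))  # true value ≈ 3*6 + 0.1*120 = 30
print(pred[0])
```

Wrapping scaler and model in a `sklearn.pipeline.Pipeline` avoids this class of bug entirely, since `predict` then applies the transform automatically.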
## When to Use Linear Regression
✅ Use it when:
- The relationship between features and target is roughly linear
- You want an interpretable model
- You want a fast baseline before trying complex models
❌ Don’t use it when:
- The relationship is highly non-linear
- You’re doing classification (use logistic regression)
- Features have complex interactions
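When the relationship is non-linear, one cheap escalation before switching model families is to add polynomial features and keep linear regression underneath. A sketch on synthetic data (`PolynomialFeatures` and `make_pipeline` are standard scikit-learn utilities; the data here is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# A clearly non-linear relationship: y ≈ x²
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(200, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

linear = LinearRegression().fit(x, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print(f"linear R²: {r2_score(y, linear.predict(x)):.3f}")  # near 0: a line can't follow a parabola
print(f"poly R²:   {r2_score(y, poly.predict(x)):.3f}")    # near 1
```

The model is still linear *in the transformed features* (x and x²), so it stays fast and interpretable.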
R² = 0.85 means the model explains 85% of the variance in the target.