Module 4 — Classical Machine Learning · intermediate · 25 min
# Logistic Regression

## Classification vs. Regression
| Type | Output | Example |
|---|---|---|
| Regression | Number | $450,000 house price |
| Classification | Category | Spam / Not spam |
Logistic regression is used for classification tasks, despite the "regression" in its name: it predicts the probability that an input belongs to a class.
## The Sigmoid Function

Instead of a straight line, logistic regression passes the linear combination z through the sigmoid function, σ(z) = 1 / (1 + e⁻ᶻ), which squashes any real number into the range (0, 1):
```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-8, 8, 200)
sigmoid = 1 / (1 + np.exp(-z))

plt.figure(figsize=(8, 4))
plt.plot(z, sigmoid, color="royalblue", linewidth=2)
plt.axhline(0.5, color="gray", linestyle="--", alpha=0.5, label="threshold = 0.5")
plt.axvline(0, color="gray", linestyle="--", alpha=0.5)
plt.fill_between(z, sigmoid, 0.5, where=(z >= 0), alpha=0.1, color="green",
                 label="predict class 1")
plt.fill_between(z, sigmoid, 0.5, where=(z < 0), alpha=0.1, color="red",
                 label="predict class 0")
plt.xlabel("z (linear combination of features)")
plt.ylabel("probability")
plt.title("Sigmoid Function")
plt.legend()
plt.show()
```
The model outputs a probability (e.g., 0.87 = 87% chance of spam).
- probability ≥ 0.5 → predict class 1
- probability < 0.5 → predict class 0
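To make the sigmoid-plus-threshold rule concrete, here is a minimal sketch; the z values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # squash any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

# z = w·x + b would come from the model's linear part (values made up here)
z_values = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(z_values)
preds = (probs >= 0.5).astype(int)  # apply the 0.5 threshold

print(probs.round(3))  # [0.119 0.5   0.953]
print(preds)           # [0 1 1]
```

Note that z = 0 maps exactly to probability 0.5, which is why the decision boundary sits where the linear part crosses zero.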
## Binary Classification with scikit-learn
```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, ConfusionMatrixDisplay)
import matplotlib.pyplot as plt

# ── 1. Load data ──────────────────────────────────────────────
data = load_breast_cancer()
X, y = data.data, data.target
# y: 0 = malignant, 1 = benign

print(f"Features: {X.shape[1]}")
print(f"Samples:  {X.shape[0]}")

# ── 2. Preprocess ─────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on train only
X_test = scaler.transform(X_test)        # transform test with train stats

# ── 3. Train ──────────────────────────────────────────────────
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# ── 4. Predict ────────────────────────────────────────────────
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)  # probabilities for each class

print("Probabilities for first 5 samples:")
print(y_prob[:5].round(3))
# [[0.01 0.99]   ← 99% chance benign
#  [0.92 0.08]   ← 92% chance malignant
#  ...]

# ── 5. Evaluate ───────────────────────────────────────────────
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred,
                            target_names=data.target_names))
```
## Understanding the Classification Report

```
              precision    recall  f1-score   support

   malignant       0.97      0.93      0.95        42
      benign       0.96      0.99      0.97        72

    accuracy                           0.96       114
```
| Metric | Meaning |
|---|---|
| Accuracy | Overall correct predictions |
| Precision | Of all predicted positive, how many are actually positive |
| Recall | Of all actual positives, how many did we catch |
| F1-score | Harmonic mean of precision and recall |
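These metrics can be computed by hand from the raw counts. Using the benign class from this run (71 true positives, 3 false positives, 1 false negative, as in the confusion matrix below):

```python
# counts for the benign (positive) class from the run above
tp, fp, fn = 71, 3, 1

precision = tp / (tp + fp)  # of predicted benign, how many really are
recall = tp / (tp + fn)     # of actual benign, how many we caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
# precision=0.96  recall=0.99  f1=0.97
```

The results match the benign row of the report.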
## Confusion Matrix

```python
cm = confusion_matrix(y_test, y_pred)

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=data.target_names)
disp.plot(cmap="Blues")
plt.title("Confusion Matrix")
plt.show()
```
```
                  Predicted Malignant | Predicted Benign
Actual Malignant         39           |        3   ← 3 missed cancers!
Actual Benign             1           |       71
```

Because class 1 = benign, "positive" here means benign:

- True Positives (TP): predicted benign, is benign ✅
- True Negatives (TN): predicted malignant, is malignant ✅
- False Positives (FP): predicted benign, actually malignant ❌ (dangerous — a missed cancer!)
- False Negatives (FN): predicted malignant, actually benign ❌ (a false alarm)
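For binary labels, scikit-learn orders the 2×2 matrix as [[TN, FP], [FN, TP]], so `.ravel()` unpacks the four cells directly. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat  = np.array([0, 1, 1, 1, 0, 0])

# rows = actual class, columns = predicted class, labels sorted (0 then 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=2 FP=1 FN=1 TP=2
```

This idiom avoids mixing up which cell is which when reading the matrix by eye.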
## Multiclass Classification

Logistic regression also handles more than two classes:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target  # 3 classes: setosa, versicolor, virginica

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# multi_class is handled automatically in modern sklearn
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
# Usually ~97% on iris
```
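In the multiclass case, `predict_proba` returns one probability per class and each row still sums to 1. A quick sketch (fitting on the full dataset just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# one column per class; each row is a probability distribution over the 3 classes
probs = model.predict_proba(X[:3])
print(probs.shape)        # (3, 3)
print(probs.sum(axis=1))  # each row sums to 1
```

The predicted class is simply the column with the highest probability.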
## Adjusting the Decision Threshold

The default threshold is 0.5, but you can change it (continuing from the breast-cancer example above):

```python
from sklearn.metrics import recall_score

# Lower threshold → catch more positives (higher recall)
# Higher threshold → more confident positives (higher precision)
threshold = 0.3  # more aggressive detection
y_pred_custom = (y_prob[:, 1] >= threshold).astype(int)

print(f"Recall at threshold 0.3: {recall_score(y_test, y_pred_custom):.2%}")
print(f"Recall at threshold 0.5: {recall_score(y_test, y_pred):.2%}")
```
In medical diagnosis, you’d lower the threshold to catch more true cases (higher recall), even at the cost of some false positives.
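To see the trade-off directly, you can sweep several thresholds on the same breast-cancer setup; this sketch retrains the model so it runs standalone:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
prob_pos = model.predict_proba(scaler.transform(X_te))[:, 1]

# raising the threshold shrinks the set of predicted positives,
# so recall can only stay the same or fall
for t in [0.1, 0.3, 0.5, 0.7, 0.9]:
    pred = (prob_pos >= t).astype(int)
    print(f"t={t:.1f}  precision={precision_score(y_te, pred):.3f}"
          f"  recall={recall_score(y_te, pred):.3f}")
```

Exact numbers depend on the split, but recall is highest at the lowest threshold, while precision tends to improve as the threshold rises.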
Logistic regression outputs a probability of 0.73 for a sample. With the default threshold of 0.5, which class is predicted?