Module 4 — Classical Machine Learning · intermediate · 25 min
# Logistic Regression

## Classification vs. Regression
| Type | Output | Example |
|---|---|---|
| Regression | Number | $450,000 house price |
| Classification | Category | Spam / Not spam |
Logistic regression is used for classification tasks, despite the "regression" in its name: it predicts the probability that an input belongs to a class.
## The Sigmoid Function

Instead of a straight line, logistic regression passes the linear combination z through the sigmoid function, σ(z) = 1 / (1 + e⁻ᶻ), which squashes any real number into the range (0, 1):
```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-8, 8, 200)
sigmoid = 1 / (1 + np.exp(-z))

plt.figure(figsize=(8, 4))
plt.plot(z, sigmoid, color="royalblue", linewidth=2)
plt.axhline(0.5, color="gray", linestyle="--", alpha=0.5, label="threshold = 0.5")
plt.axvline(0, color="gray", linestyle="--", alpha=0.5)
plt.fill_between(z, sigmoid, 0.5, where=(z >= 0), alpha=0.1, color="green",
                 label="predict class 1")
plt.fill_between(z, sigmoid, 0.5, where=(z < 0), alpha=0.1, color="red",
                 label="predict class 0")
plt.xlabel("z (linear combination of features)")
plt.ylabel("probability")
plt.title("Sigmoid Function")
plt.legend()
plt.show()
```
The model outputs a probability (e.g., 0.87 = 87% chance of spam).
- probability ≥ 0.5 → predict class 1
- probability < 0.5 → predict class 0
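To make the sigmoid-plus-threshold rule concrete, here is a minimal sketch; the z values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # squash any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

# z = w·x + b would come from the model's linear part (values made up here)
z_values = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(z_values)
preds = (probs >= 0.5).astype(int)  # apply the 0.5 threshold

print(probs.round(3))  # [0.119 0.5   0.953]
print(preds)           # [0 1 1]
```

Note that z = 0 maps exactly to probability 0.5, which is why the decision boundary sits where the linear part crosses zero.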
## Binary Classification with scikit-learn
```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, ConfusionMatrixDisplay)
import matplotlib.pyplot as plt

# ── 1. Load data ──────────────────────────────────────────────
data = load_breast_cancer()
X, y = data.data, data.target
# y: 0 = malignant, 1 = benign

print(f"Features: {X.shape[1]}")
print(f"Samples:  {X.shape[0]}")

# ── 2. Preprocess ─────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on train only
X_test = scaler.transform(X_test)        # transform test with train stats

# ── 3. Train ──────────────────────────────────────────────────
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# ── 4. Predict ────────────────────────────────────────────────
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)  # probabilities for each class

print("Probabilities for first 5 samples:")
print(y_prob[:5].round(3))
# [[0.01 0.99]   ← 99% chance benign
#  [0.92 0.08]   ← 92% chance malignant
#  ...]

# ── 5. Evaluate ───────────────────────────────────────────────
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred,
                            target_names=data.target_names))
```
## Understanding the Classification Report

```
              precision    recall  f1-score   support

   malignant       0.97      0.93      0.95        42
      benign       0.96      0.99      0.97        72

    accuracy                           0.96       114
```
| Metric | Meaning |
|---|---|
| Accuracy | Overall correct predictions |
| Precision | Of all predicted positive, how many are actually positive |
| Recall | Of all actual positives, how many did we catch |
| F1-score | Harmonic mean of precision and recall |
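These metrics can be computed by hand from the raw counts. Using the benign class from this run (71 true positives, 3 false positives, 1 false negative, as in the confusion matrix below):

```python
# counts for the benign (positive) class from the run above
tp, fp, fn = 71, 3, 1

precision = tp / (tp + fp)  # of predicted benign, how many really are
recall = tp / (tp + fn)     # of actual benign, how many we caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
# precision=0.96  recall=0.99  f1=0.97
```

The results match the benign row of the report.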
## Confusion Matrix

```python
cm = confusion_matrix(y_test, y_pred)

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=data.target_names)
disp.plot(cmap="Blues")
plt.title("Confusion Matrix")
plt.show()
```
```
                  Predicted Malignant | Predicted Benign
Actual Malignant         39           |        3   ← 3 missed cancers!
Actual Benign             1           |       71
```

Because class 1 = benign, "positive" here means benign:

- True Positives (TP): predicted benign, is benign ✅
- True Negatives (TN): predicted malignant, is malignant ✅
- False Positives (FP): predicted benign, actually malignant ❌ (dangerous — a missed cancer!)
- False Negatives (FN): predicted malignant, actually benign ❌ (a false alarm)
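For binary labels, scikit-learn orders the 2×2 matrix as [[TN, FP], [FN, TP]], so `.ravel()` unpacks the four cells directly. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat  = np.array([0, 1, 1, 1, 0, 0])

# rows = actual class, columns = predicted class, labels sorted (0 then 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=2 FP=1 FN=1 TP=2
```

This idiom avoids mixing up which cell is which when reading the matrix by eye.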
## Multiclass Classification

Logistic regression also handles more than two classes:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target  # 3 classes: setosa, versicolor, virginica

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# multi_class is handled automatically in modern sklearn
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
# Usually ~97% on iris
```
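In the multiclass case, `predict_proba` returns one probability per class and each row still sums to 1. A quick sketch (fitting on the full dataset just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# one column per class; each row is a probability distribution over the 3 classes
probs = model.predict_proba(X[:3])
print(probs.shape)        # (3, 3)
print(probs.sum(axis=1))  # each row sums to 1
```

The predicted class is simply the column with the highest probability.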
## Adjusting the Decision Threshold

The default threshold is 0.5, but you can change it (continuing from the breast-cancer example above):

```python
from sklearn.metrics import recall_score

# Lower threshold → catch more positives (higher recall)
# Higher threshold → more confident positives (higher precision)
threshold = 0.3  # more aggressive detection
y_pred_custom = (y_prob[:, 1] >= threshold).astype(int)

print(f"Recall at threshold 0.3: {recall_score(y_test, y_pred_custom):.2%}")
print(f"Recall at threshold 0.5: {recall_score(y_test, y_pred):.2%}")
```
In medical diagnosis, you’d lower the threshold to catch more true cases (higher recall), even at the cost of some false positives.
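To see the trade-off directly, you can sweep several thresholds on the same breast-cancer setup; this sketch retrains the model so it runs standalone:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
prob_pos = model.predict_proba(scaler.transform(X_te))[:, 1]

# raising the threshold shrinks the set of predicted positives,
# so recall can only stay the same or fall
for t in [0.1, 0.3, 0.5, 0.7, 0.9]:
    pred = (prob_pos >= t).astype(int)
    print(f"t={t:.1f}  precision={precision_score(y_te, pred):.3f}"
          f"  recall={recall_score(y_te, pred):.3f}")
```

Exact numbers depend on the split, but recall is highest at the lowest threshold, while precision tends to improve as the threshold rises.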
Logistic regression outputs a probability of 0.73 for a sample. With the default threshold of 0.5, which class is predicted?