Module 4 — Classical Machine Learning · Intermediate · 25 min
Support Vector Machines
The Core Idea: Maximum Margin
Logistic regression draws a line between classes. SVM draws the best possible line — the one that maximizes the margin (distance) between the two classes.
Class A              Class B
  ×  ×     |     ○  ○
  ×  ×     |← margin
           |     ○  ○
           ↑
  Decision boundary (hyperplane)
  — maximally equidistant from both classes
The data points closest to the boundary are called support vectors — they define the margin.
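To make this concrete, here is a minimal sketch on a hand-made 2D toy set (the points and labels below are illustrative, not part of this lesson's dataset). After fitting, `support_vectors_` holds only the boundary-defining points, and `n_support_` counts them per class:

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny, clearly separated clusters (illustrative points)
X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # very large C ≈ hard margin
clf.fit(X, y)

# Only the points nearest the boundary become support vectors
print(clf.support_vectors_)
print(clf.n_support_)  # support-vector count per class
```

Note that most points play no role at all: moving a non-support point (without crossing the margin) leaves the boundary unchanged.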
Linear SVM
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
# ── Data ──────────────────────────────────────────────────────
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# ⚠️ IMPORTANT: Always scale for SVM (distance-based algorithm)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# ── Linear SVM ────────────────────────────────────────────────
svm_linear = SVC(kernel="linear", C=1.0, probability=True)
svm_linear.fit(X_train_s, y_train)
y_pred = svm_linear.predict(X_test_s)
print(f"Linear SVM Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(f"Support vectors per class: {svm_linear.n_support_}")
The C Parameter (Soft Margin)
Real data is rarely perfectly separable. The C parameter controls the trade-off:
- High C → small margin, few misclassifications (risk of overfitting)
- Low C → large margin, more misclassifications allowed (better generalization)
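One way to see this trade-off directly is to count support vectors: a lower C widens the margin, so more points fall inside it and become support vectors. A quick sketch (the C values here are just illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

counts = {}
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, data.target)
    counts[C] = int(svm.n_support_.sum())  # total support vectors
    print(f"C={C}: {counts[C]} support vectors")
```

You should see the support-vector count shrink as C grows and the margin tightens.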
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
cv_means = []
for C in C_values:
    # Scale inside a pipeline so each CV fold is scaled
    # on its own training split (no leakage from the test fold)
    svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    scores = cross_val_score(svm, data.data, data.target, cv=5)
    cv_means.append(scores.mean())
best_C = C_values[np.argmax(cv_means)]
print(f"Best C = {best_C}, CV accuracy = {max(cv_means):.4f}")
The Kernel Trick
What if the data isn’t linearly separable? The kernel trick maps data to a higher-dimensional space where it becomes separable — without explicitly computing those higher dimensions.
Original 2D space           Kernel-transformed space
   ○  ×  ×  ○                  ○  ○  |  ×  ×
   ×  ×  ×  ×        →         ○  ○  |  ×  ×
   ○  ×  ×  ○                  linear boundary!
(not linearly separable)    (linearly separable!)
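A quick way to see the kernel trick pay off is a dataset that no straight line can separate, such as scikit-learn's two-moons toy data (used here purely for illustration). The linear kernel plateaus while the RBF kernel handles the curved boundary:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons — not linearly separable
X, y = make_moons(n_samples=400, noise=0.15, random_state=42)

scores = {}
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{kernel}: {scores[kernel]:.2%}")
```

The RBF kernel should clearly outperform the linear one here, because the decision boundary between the moons is curved.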
Common Kernels
# Linear — works for linearly separable data
svm_linear = SVC(kernel="linear", C=1.0)
# RBF (Radial Basis Function) — most popular, works for non-linear data
svm_rbf = SVC(kernel="rbf", C=10, gamma="scale")
# Polynomial — for polynomial relationships
svm_poly = SVC(kernel="poly", degree=3, C=1.0)
RBF Kernel Parameters
- C — margin softness (as before)
- gamma — how far each training point's influence reaches
  - High gamma → each point influences a small region (risk of overfitting)
  - Low gamma → wider influence (smoother boundary)
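The overfitting risk of a high gamma shows up as a gap between training and test accuracy. A small sketch on noisy two-moons data (the gamma values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-moons data (illustrative, not the lesson's dataset)
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

accs = {}
for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    accs[gamma] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"gamma={gamma}: train {accs[gamma][0]:.2%}, test {accs[gamma][1]:.2%}")
```

At gamma=100 each point only influences a tiny neighborhood, so the model nearly memorizes the training set while test accuracy lags behind.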
from sklearn.model_selection import GridSearchCV
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1],
}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train_s, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.4f}")
best_svm = grid.best_estimator_
print(f"Test accuracy: {accuracy_score(y_test, best_svm.predict(X_test_s)):.2%}")
SVR — Support Vector Regression
SVM can also do regression:
from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
from sklearn.metrics import r2_score

# Note: SVR training scales roughly quadratically with sample size,
# so fitting all ~16.5k training rows may take a minute or two
svr = SVR(kernel="rbf", C=10, epsilon=0.1)
svr.fit(X_train_s, y_train)
print(f"SVR R²: {r2_score(y_test, svr.predict(X_test_s)):.4f}")
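The epsilon parameter defines a "tube" around the prediction: training points whose error stays inside the tube incur no penalty and do not become support vectors. A small sketch on synthetic data (the sine curve and epsilon values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve (illustrative data)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

n_sv = {}
for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel="rbf", C=10, epsilon=eps).fit(X, y)
    n_sv[eps] = len(svr.support_)  # points outside (or on) the epsilon-tube
    print(f"epsilon={eps}: {n_sv[eps]} support vectors")
```

A wider tube tolerates more error for free, so the support-vector count drops sharply as epsilon grows — a useful knob for trading accuracy against model sparsity.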
SVM vs. Other Algorithms
| | SVM | Logistic Regression | Random Forest |
|---|---|---|---|
| Non-linear | ✅ (with kernel) | ❌ | ✅ |
| Scalability | ❌ slow on large data | ✅ | ✅ |
| Probability output | ✅ (with probability=True) | ✅ | ✅ |
| Interpretability | ❌ | ✅ | ✅ (feature importances) |
| Needs scaling | ✅ yes | ✅ recommended | ❌ no |
When to Use SVM
✅ Best for:
- High-dimensional data (text, genes)
- Small-to-medium datasets (< 10k samples)
- Clear margin of separation exists
- Non-linear boundaries (with RBF kernel)
❌ Avoid when:
- Millions of samples (too slow)
- Need fast training
- Need probability calibration without extra cost
Why is feature scaling (StandardScaler) especially important for SVM?