Module 4 — Classical Machine Learning · Intermediate · 25 min
Support Vector Machines
The Core Idea: Maximum Margin
Logistic regression draws a line between classes. SVM draws the best possible line — the one that maximizes the margin (distance) between the two classes.
Class A              Class B
  ×  ×     |     ○  ○
  ×  ×     |← margin
           |     ○  ○
           ↑
  Decision boundary (hyperplane)
  — maximally equidistant from both classes
The data points closest to the boundary are called support vectors — they define the margin.
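To make this concrete, here is a minimal sketch on a hand-made 2D toy set (the points and labels below are illustrative, not part of this lesson's dataset). After fitting, `support_vectors_` holds only the boundary-defining points, and `n_support_` counts them per class:

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny, clearly separated clusters (illustrative points)
X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # very large C ≈ hard margin
clf.fit(X, y)

# Only the points nearest the boundary become support vectors
print(clf.support_vectors_)
print(clf.n_support_)  # support-vector count per class
```

Note that most points play no role at all: moving a non-support point (without crossing the margin) leaves the boundary unchanged.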
Linear SVM
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
# ── Data ──────────────────────────────────────────────────────
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# ⚠️ IMPORTANT: Always scale for SVM (distance-based algorithm)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# ── Linear SVM ────────────────────────────────────────────────
svm_linear = SVC(kernel="linear", C=1.0, probability=True)
svm_linear.fit(X_train_s, y_train)
y_pred = svm_linear.predict(X_test_s)
print(f"Linear SVM Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(f"Support vectors per class: {svm_linear.n_support_}")
The C Parameter (Soft Margin)
Real data is rarely perfectly separable. The C parameter controls the trade-off:
- High C → small margin, few misclassifications (risk of overfitting)
- Low C → large margin, more misclassifications allowed (better generalization)
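One way to see this trade-off directly is to count support vectors: a lower C widens the margin, so more points fall inside it and become support vectors. A quick sketch (the C values here are just illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

counts = {}
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, data.target)
    counts[C] = int(svm.n_support_.sum())  # total support vectors
    print(f"C={C}: {counts[C]} support vectors")
```

You should see the support-vector count shrink as C grows and the margin tightens.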
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
cv_means = []
for C in C_values:
    # Scale inside a pipeline so each CV fold is scaled
    # on its own training split (no leakage from the test fold)
    svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    scores = cross_val_score(svm, data.data, data.target, cv=5)
    cv_means.append(scores.mean())
best_C = C_values[np.argmax(cv_means)]
print(f"Best C = {best_C}, CV accuracy = {max(cv_means):.4f}")
The Kernel Trick
What if the data isn’t linearly separable? The kernel trick maps data to a higher-dimensional space where it becomes separable — without explicitly computing those higher dimensions.
Original 2D space           Kernel-transformed space
   ○  ×  ×  ○                  ○  ○  |  ×  ×
   ×  ×  ×  ×        →         ○  ○  |  ×  ×
   ○  ×  ×  ○                  linear boundary!
(not linearly separable)    (linearly separable!)
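A quick way to see the kernel trick pay off is a dataset that no straight line can separate, such as scikit-learn's two-moons toy data (used here purely for illustration). The linear kernel plateaus while the RBF kernel handles the curved boundary:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons — not linearly separable
X, y = make_moons(n_samples=400, noise=0.15, random_state=42)

scores = {}
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{kernel}: {scores[kernel]:.2%}")
```

The RBF kernel should clearly outperform the linear one here, because the decision boundary between the moons is curved.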
Common Kernels
# Linear — works for linearly separable data
svm_linear = SVC(kernel="linear", C=1.0)
# RBF (Radial Basis Function) — most popular, works for non-linear data
svm_rbf = SVC(kernel="rbf", C=10, gamma="scale")
# Polynomial — for polynomial relationships
svm_poly = SVC(kernel="poly", degree=3, C=1.0)
RBF Kernel Parameters
- C — margin softness (as before)
- gamma — how far each training point's influence reaches
  - High gamma → each point influences a small region (risk of overfitting)
  - Low gamma → wider influence (smoother boundary)
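The overfitting risk of a high gamma shows up as a gap between training and test accuracy. A small sketch on noisy two-moons data (the gamma values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-moons data (illustrative, not the lesson's dataset)
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

accs = {}
for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    accs[gamma] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"gamma={gamma}: train {accs[gamma][0]:.2%}, test {accs[gamma][1]:.2%}")
```

At gamma=100 each point only influences a tiny neighborhood, so the model nearly memorizes the training set while test accuracy lags behind.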
from sklearn.model_selection import GridSearchCV
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1],
}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train_s, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.4f}")
best_svm = grid.best_estimator_
print(f"Test accuracy: {accuracy_score(y_test, best_svm.predict(X_test_s)):.2%}")
SVR — Support Vector Regression
SVM can also do regression:
from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
from sklearn.metrics import r2_score

# Note: SVR training scales roughly quadratically with sample size,
# so fitting all ~16.5k training rows may take a minute or two
svr = SVR(kernel="rbf", C=10, epsilon=0.1)
svr.fit(X_train_s, y_train)
print(f"SVR R²: {r2_score(y_test, svr.predict(X_test_s)):.4f}")
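The epsilon parameter defines a "tube" around the prediction: training points whose error stays inside the tube incur no penalty and do not become support vectors. A small sketch on synthetic data (the sine curve and epsilon values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve (illustrative data)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

n_sv = {}
for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel="rbf", C=10, epsilon=eps).fit(X, y)
    n_sv[eps] = len(svr.support_)  # points outside (or on) the epsilon-tube
    print(f"epsilon={eps}: {n_sv[eps]} support vectors")
```

A wider tube tolerates more error for free, so the support-vector count drops sharply as epsilon grows — a useful knob for trading accuracy against model sparsity.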
SVM vs. Other Algorithms
| | SVM | Logistic Regression | Random Forest |
|---|---|---|---|
| Non-linear | ✅ (with kernel) | ❌ | ✅ |
| Scalability | ❌ slow on large data | ✅ | ✅ |
| Probability output | ✅ (with probability=True) | ✅ | ✅ |
| Interpretability | ❌ | ✅ | ✅ (feature importances) |
| Needs scaling | ✅ yes | ✅ recommended | ❌ no |
When to Use SVM
✅ Best for:
- High-dimensional data (text, genes)
- Small-to-medium datasets (< 10k samples)
- Clear margin of separation exists
- Non-linear boundaries (with RBF kernel)
❌ Avoid when:
- Millions of samples (too slow)
- Need fast training
- Need probability calibration without extra cost
Why is feature scaling (StandardScaler) especially important for SVM?