Random Forests
The Wisdom of Crowds
Imagine asking 500 experts the same question and taking the majority vote. As long as each expert is better than chance and their mistakes are not all correlated, the vote usually beats any single expert, even a very good one.
That’s exactly what a Random Forest does with decision trees.
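The effect is easy to see in a toy simulation (the numbers below are illustrative, not from a real dataset): assume 500 voters who are each right 70% of the time and vote independently.

```python
import numpy as np

# Toy sketch of the "wisdom of crowds" effect.
# 500 independent "experts", each correct with probability 0.7.
rng = np.random.default_rng(0)
n_experts, n_questions, p_correct = 500, 10_000, 0.7

# Each row: one question; each column: one expert's answer (True = correct).
answers = rng.random((n_questions, n_experts)) < p_correct

single_expert_acc = answers[:, 0].mean()                          # ~0.70
majority_vote_acc = (answers.sum(axis=1) > n_experts / 2).mean()  # ~1.00

print(f"Single expert: {single_expert_acc:.3f}")
print(f"Majority vote: {majority_vote_acc:.3f}")
```

Real trees are correlated (they see overlapping data), so the gain is smaller in practice; that is exactly why Random Forest works so hard to decorrelate its trees.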
How Random Forest Works
A Random Forest builds many decision trees, each trained on a slightly different version of the data, and combines their predictions:
```
        Training data (1000 samples)
                    │
      ┌─────────────┼─────────────┐        (random bootstrap samples)
      ↓             ↓             ↓
   Tree 1        Tree 2        Tree 3   ...   Tree 500
      ↓             ↓             ↓               ↓
   Class 1       Class 0       Class 1  ...   Class 1
      └─────────────┴─────────────┘
                    ↓
        Majority vote → Class 1 ✅
```
Two Sources of Randomness
- Bootstrap sampling: Each tree is trained on a random sample (with replacement) of the training data
- Feature subsampling: At each split, only a random subset of features is considered
This randomness makes the trees different from each other — and diverse trees working together outperform identical trees.
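Both steps can be sketched by hand with NumPy (the sample and feature counts below are illustrative):

```python
import numpy as np

# Sketch of the two randomization steps a Random Forest uses.
rng = np.random.default_rng(42)
n_samples, n_features = 1000, 30

# 1) Bootstrap sampling: draw n_samples indices WITH replacement.
boot_idx = rng.integers(0, n_samples, size=n_samples)
unique_frac = np.unique(boot_idx).size / n_samples
print(f"Unique samples in one bootstrap: {unique_frac:.1%}")  # ~63%

# 2) Feature subsampling: at each split, consider only sqrt(n_features) features.
k = int(np.sqrt(n_features))  # sqrt(30) -> 5
split_features = rng.choice(n_features, size=k, replace=False)
print(f"Features considered at this split: {sorted(split_features)}")
```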
Random Forest in scikit-learn
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# ── Data ──────────────────────────────────────────────────────
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# ── Train ─────────────────────────────────────────────────────
rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=None,        # trees grow until pure; averaging keeps variance in check
    min_samples_split=2,
    max_features="sqrt",   # sqrt(n_features) at each split (default)
    n_jobs=-1,             # use all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)

# ── Evaluate ──────────────────────────────────────────────────
y_pred = rf.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Cross-validation for a more robust estimate (fits fresh clones of rf)
cv_scores = cross_val_score(rf, data.data, data.target, cv=5, scoring="accuracy")
print(f"\n5-fold CV: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```
Feature Importance
One of Random Forest’s best features: it ranks how important each input feature is.
```python
import matplotlib.pyplot as plt

# Top 10 features by impurity-based importance
importance_df = pd.DataFrame({
    "feature": data.feature_names,
    "importance": rf.feature_importances_,
}).sort_values("importance", ascending=False).head(10)

plt.figure(figsize=(10, 6))
plt.barh(importance_df["feature"][::-1], importance_df["importance"][::-1],
         color="forestgreen", edgecolor="white")
plt.xlabel("Feature Importance (mean decrease in impurity)")
plt.title("Top 10 Most Important Features")
plt.tight_layout()
plt.show()
```
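One caveat: impurity-based importances are computed from the training data and can favor high-cardinality features. As a cross-check, scikit-learn's `permutation_importance` measures how much the held-out score drops when a single feature's values are shuffled. A minimal sketch (re-fitting the same model so the snippet stands alone):

```python
# Cross-check impurity-based importances with permutation importance,
# which shuffles one feature at a time and records the accuracy drop.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out set.
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} {result.importances_mean[i]:.4f}")
```

Features that rank high under both measures are the ones you can trust most.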
OOB (Out-of-Bag) Evaluation
Since each tree only sees ~63% of the data (bootstrap), the remaining 37% can be used for validation — for free!
```python
rf_oob = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # enable out-of-bag evaluation
    n_jobs=-1,
    random_state=42,
)
rf_oob.fit(data.data, data.target)
print(f"OOB accuracy: {rf_oob.oob_score_:.2%}")  # no test split needed!
```
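Where does the ~63% figure come from? A given sample is missed by one bootstrap draw with probability 1 − 1/n, so it is missed by all n draws with probability (1 − 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A quick check:

```python
import math

# P(a given sample appears in a bootstrap of size n drawn from n points)
#   = 1 - (1 - 1/n)**n  ->  1 - 1/e ≈ 0.632 as n grows.
for n in (10, 100, 1000, 100_000):
    p_in = 1 - (1 - 1 / n) ** n
    print(f"n={n:>6}:  P(sample in bootstrap) = {p_in:.4f}")
print(f"Limit: 1 - 1/e = {1 - 1 / math.e:.4f}")
```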
Random Forest vs. Decision Tree
```python
from sklearn.tree import DecisionTreeClassifier

# Single tree
tree = DecisionTreeClassifier(random_state=42)
tree_cv = cross_val_score(tree, data.data, data.target, cv=10)

# Random forest
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf_cv = cross_val_score(rf, data.data, data.target, cv=10)

print(f"Decision Tree: {tree_cv.mean():.4f} ± {tree_cv.std():.4f}")
print(f"Random Forest: {rf_cv.mean():.4f} ± {rf_cv.std():.4f}")

# Decision Tree: 0.9228 ± 0.0277
# Random Forest: 0.9631 ± 0.0175  ← better AND more stable!
```
Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2", 0.5],
}

grid_search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
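Grid search cost multiplies across parameters (3 × 3 × 3 = 27 candidates per CV fold above). For larger search spaces, `RandomizedSearchCV` tries only a fixed budget of sampled combinations instead. A sketch with illustrative ranges (self-contained, re-creating the same split):

```python
# RandomizedSearchCV: sample a fixed number of parameter combinations
# instead of exhaustively trying every cell of the grid.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions={
        "n_estimators": [100, 200, 300, 500],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["sqrt", "log2", 0.3, 0.5],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=10,   # only 10 of the 4*5*4*3 = 240 combinations are tried
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

With 10 candidates instead of 240, this usually lands close to the grid-search optimum at a fraction of the cost.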
When to Use Random Forest
✅ Great for:
- Tabular / structured data
- Mixed feature types (no need to scale)
- Getting feature importances
- Robust baseline model
- Handling missing values (with some implementations)
⚡ Limitations:
- Slower to train than a single tree
- Less interpretable than a single tree
- Can be outperformed by Gradient Boosting (XGBoost, LightGBM)
What is the key advantage of Random Forest over a single Decision Tree?