Module 4 — Classical Machine Learning · Intermediate · 25 min

Random Forests

The Wisdom of Crowds

Imagine asking 500 experts the same question and taking the majority vote. As long as the experts are individually better than chance and make independent mistakes, the crowd usually beats any single expert — even a very good one.

That’s exactly what a Random Forest does with decision trees.


How Random Forest Works

A Random Forest builds many decision trees, each trained on a slightly different version of the data, and combines their predictions:

Training data (1000 samples)
        │
   ┌────┼────┬──────────┐
   ↓    ↓    ↓          ↓        (random bootstrap samples)
Tree 1 Tree 2 Tree 3 ... Tree 500
   ↓    ↓    ↓          ↓
Class1 Class0 Class1 ... Class1
   └────┴────┴──────────┘
        ↓
   Majority vote → Class 1 ✅
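The vote above can be rolled by hand in a few lines: train each tree on its own bootstrap sample, then take the majority vote across trees. This is a minimal sketch of the bagging idea (in practice `RandomForestClassifier` does all of this for you, plus feature subsampling):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree trains on a bootstrap sample: n draws WITH replacement
preds = []
for _ in range(25):
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    preds.append(tree.predict(X_test))

# Majority vote: mean of 0/1 predictions > 0.5 → class 1
vote = (np.mean(preds, axis=0) > 0.5).astype(int)
print(f"Hand-rolled ensemble accuracy: {(vote == y_test).mean():.2%}")
```

Even this crude 25-tree ensemble typically beats most of its individual trees, which is the whole point of the diagram.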

Two Sources of Randomness

  1. Bootstrap sampling: Each tree is trained on a random sample (with replacement) of the training data
  2. Feature subsampling: At each split, only a random subset of features is considered

This randomness makes the trees different from each other — and diverse trees working together outperform identical trees.
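Both randomness sources can be made concrete with plain NumPy (a sketch; the sizes mirror the diagram above, and `sqrt(n_features)` matches the classifier's default `max_features`):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 30

# 1. Bootstrap sampling: two trees draw different samples with replacement,
#    so their training sets overlap only partially
boot_a = rng.integers(0, n_samples, size=n_samples)
boot_b = rng.integers(0, n_samples, size=n_samples)
overlap = len(np.intersect1d(boot_a, boot_b)) / n_samples
print(f"Samples shared by two bootstrap draws: {overlap:.1%}")

# 2. Feature subsampling: only sqrt(n_features) candidate features per split
k = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=k, replace=False)
print(f"Features considered at this split: {sorted(split_features)}")
```

Two bootstrap draws typically share well under half of the original samples, and each split sees only 5 of the 30 features — so no two trees end up making the same mistakes.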


Random Forest in scikit-learn

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np

# ── Data ──────────────────────────────────────────────────────
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# ── Train ─────────────────────────────────────────────────────
rf = RandomForestClassifier(
    n_estimators=200,        # number of trees
    max_depth=None,          # grow trees fully; averaging curbs overfitting
    min_samples_split=2,
    max_features="sqrt",     # sqrt(n_features) at each split (default)
    n_jobs=-1,               # use all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)

# ── Evaluate ──────────────────────────────────────────────────
y_pred = rf.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Cross-validation for robust estimates
cv_scores = cross_val_score(rf, data.data, data.target, cv=5, scoring="accuracy")
print(f"\n5-fold CV: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

Feature Importance

One of Random Forest’s best features: it ranks how important each input feature is.

# Feature importances
importance_df = pd.DataFrame({
    "feature": data.feature_names,
    "importance": rf.feature_importances_,
}).sort_values("importance", ascending=False).head(10)

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.barh(importance_df["feature"][::-1], importance_df["importance"][::-1],
         color="forestgreen", edgecolor="white")
plt.xlabel("Feature Importance (mean decrease in impurity)")
plt.title("Top 10 Most Important Features")
plt.tight_layout()
plt.show()
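One caveat: impurity-based importances (as plotted above) can be biased toward high-cardinality features. A common cross-check is permutation importance on held-out data — shuffle one feature and measure how much the score drops. A self-contained sketch using scikit-learn's `permutation_importance`:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42
)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# Permutation importance: score drop when one feature is shuffled on test data
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.4f}")
```

If the two rankings roughly agree, you can trust the impurity-based plot; large disagreements suggest correlated or high-cardinality features are inflating it.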

OOB (Out-of-Bag) Evaluation

Since each tree only sees ~63% of the data (bootstrap), the remaining 37% can be used for validation — for free!
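The ~63% figure is easy to verify: a bootstrap sample of size n contains each original sample with probability 1 − (1 − 1/n)ⁿ, which approaches 1 − 1/e ≈ 63.2%. A quick check:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# One bootstrap sample: n draws with replacement from n samples
sample = rng.integers(0, n, size=n)
unique_frac = len(np.unique(sample)) / n
print(f"Unique samples in one bootstrap draw: {unique_frac:.1%}")
# Limit: 1 - (1 - 1/n)**n → 1 - 1/e ≈ 0.632
```

The ~37% that a given tree never saw are its "out-of-bag" samples, which is what `oob_score=True` uses below.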

rf_oob = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # enable out-of-bag evaluation
    n_jobs=-1,
    random_state=42,
)
rf_oob.fit(data.data, data.target)

print(f"OOB accuracy: {rf_oob.oob_score_:.2%}")  # no test split needed!

Random Forest vs. Decision Tree

from sklearn.tree import DecisionTreeClassifier

# Single tree
tree = DecisionTreeClassifier(random_state=42)
tree_cv = cross_val_score(tree, data.data, data.target, cv=10)

# Random forest
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf_cv = cross_val_score(rf, data.data, data.target, cv=10)

print(f"Decision Tree: {tree_cv.mean():.4f} ± {tree_cv.std():.4f}")
print(f"Random Forest: {rf_cv.mean():.4f} ± {rf_cv.std():.4f}")
# Decision Tree: 0.9228 ± 0.0277
# Random Forest: 0.9631 ± 0.0175  ← better AND more stable!

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2", 0.5],
}

grid_search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
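The grid above has only 27 combinations, but grids grow multiplicatively. For larger searches, `RandomizedSearchCV` samples a fixed budget of combinations instead of trying them all — a sketch with the same search space (the `randint` range is an illustrative choice):

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Sample 10 random combinations rather than exhausting the grid
random_search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions={
        "n_estimators": randint(100, 500),   # any integer in [100, 500)
        "max_depth": [None, 10, 20],
        "max_features": ["sqrt", "log2", 0.5],
    },
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)

print(f"Best params: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.4f}")
```

With a fixed `n_iter`, search time stays constant no matter how many hyperparameters you add.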

When to Use Random Forest

✅ Great for:

  • Tabular / structured data
  • Mixed feature types (no need to scale)
  • Getting feature importances
  • Robust baseline model
  • Handling missing values (with some implementations)

⚡ Limitations:

  • Slower to train than a single tree
  • Less interpretable than a single tree
  • Can be outperformed by Gradient Boosting (XGBoost, LightGBM)

Knowledge Check

What is the key advantage of Random Forest over a single Decision Tree?