Random Forests
The Wisdom of Crowds
Imagine asking 500 experts the same question and taking the majority vote. As long as each expert is better than chance and their mistakes are not all correlated, the vote usually beats any single expert, even a very good one.
That’s exactly what a Random Forest does with decision trees.
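The effect is easy to see in a toy simulation (the numbers below are illustrative, not from a real dataset): assume 500 voters who are each right 70% of the time and vote independently.

```python
import numpy as np

# Toy sketch of the "wisdom of crowds" effect.
# 500 independent "experts", each correct with probability 0.7.
rng = np.random.default_rng(0)
n_experts, n_questions, p_correct = 500, 10_000, 0.7

# Each row: one question; each column: one expert's answer (True = correct).
answers = rng.random((n_questions, n_experts)) < p_correct

single_expert_acc = answers[:, 0].mean()                          # ~0.70
majority_vote_acc = (answers.sum(axis=1) > n_experts / 2).mean()  # ~1.00

print(f"Single expert: {single_expert_acc:.3f}")
print(f"Majority vote: {majority_vote_acc:.3f}")
```

Real trees are correlated (they see overlapping data), so the gain is smaller in practice; that is exactly why Random Forest works so hard to decorrelate its trees.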
How Random Forest Works
A Random Forest builds many decision trees, each trained on a slightly different version of the data, and combines their predictions:
```
        Training data (1000 samples)
                    │
      ┌─────────────┼─────────────┐        (random bootstrap samples)
      ↓             ↓             ↓
   Tree 1        Tree 2        Tree 3   ...   Tree 500
      ↓             ↓             ↓               ↓
   Class 1       Class 0       Class 1  ...   Class 1
      └─────────────┴─────────────┘
                    ↓
        Majority vote → Class 1 ✅
```
Two Sources of Randomness
- Bootstrap sampling: Each tree is trained on a random sample (with replacement) of the training data
- Feature subsampling: At each split, only a random subset of features is considered
This randomness makes the trees different from each other — and diverse trees working together outperform identical trees.
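Both steps can be sketched by hand with NumPy (the sample and feature counts below are illustrative):

```python
import numpy as np

# Sketch of the two randomization steps a Random Forest uses.
rng = np.random.default_rng(42)
n_samples, n_features = 1000, 30

# 1) Bootstrap sampling: draw n_samples indices WITH replacement.
boot_idx = rng.integers(0, n_samples, size=n_samples)
unique_frac = np.unique(boot_idx).size / n_samples
print(f"Unique samples in one bootstrap: {unique_frac:.1%}")  # ~63%

# 2) Feature subsampling: at each split, consider only sqrt(n_features) features.
k = int(np.sqrt(n_features))  # sqrt(30) -> 5
split_features = rng.choice(n_features, size=k, replace=False)
print(f"Features considered at this split: {sorted(split_features)}")
```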
Random Forest in scikit-learn
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# ── Data ──────────────────────────────────────────────────────
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# ── Train ─────────────────────────────────────────────────────
rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=None,        # trees grow until pure; averaging keeps variance in check
    min_samples_split=2,
    max_features="sqrt",   # sqrt(n_features) at each split (default)
    n_jobs=-1,             # use all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)

# ── Evaluate ──────────────────────────────────────────────────
y_pred = rf.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Cross-validation for a more robust estimate (fits fresh clones of rf)
cv_scores = cross_val_score(rf, data.data, data.target, cv=5, scoring="accuracy")
print(f"\n5-fold CV: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```
Feature Importance
One of Random Forest’s best features: it ranks how important each input feature is.
```python
import matplotlib.pyplot as plt

# Top 10 features by impurity-based importance
importance_df = pd.DataFrame({
    "feature": data.feature_names,
    "importance": rf.feature_importances_,
}).sort_values("importance", ascending=False).head(10)

plt.figure(figsize=(10, 6))
plt.barh(importance_df["feature"][::-1], importance_df["importance"][::-1],
         color="forestgreen", edgecolor="white")
plt.xlabel("Feature Importance (mean decrease in impurity)")
plt.title("Top 10 Most Important Features")
plt.tight_layout()
plt.show()
```
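One caveat: impurity-based importances are computed from the training data and can favor high-cardinality features. As a cross-check, scikit-learn's `permutation_importance` measures how much the held-out score drops when a single feature's values are shuffled. A minimal sketch (re-fitting the same model so the snippet stands alone):

```python
# Cross-check impurity-based importances with permutation importance,
# which shuffles one feature at a time and records the accuracy drop.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out set.
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} {result.importances_mean[i]:.4f}")
```

Features that rank high under both measures are the ones you can trust most.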
OOB (Out-of-Bag) Evaluation
Since each tree only sees ~63% of the data (bootstrap), the remaining 37% can be used for validation — for free!
```python
rf_oob = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # enable out-of-bag evaluation
    n_jobs=-1,
    random_state=42,
)
rf_oob.fit(data.data, data.target)
print(f"OOB accuracy: {rf_oob.oob_score_:.2%}")  # no test split needed!
```
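Where does the ~63% figure come from? A given sample is missed by one bootstrap draw with probability 1 − 1/n, so it is missed by all n draws with probability (1 − 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A quick check:

```python
import math

# P(a given sample appears in a bootstrap of size n drawn from n points)
#   = 1 - (1 - 1/n)**n  ->  1 - 1/e ≈ 0.632 as n grows.
for n in (10, 100, 1000, 100_000):
    p_in = 1 - (1 - 1 / n) ** n
    print(f"n={n:>6}:  P(sample in bootstrap) = {p_in:.4f}")
print(f"Limit: 1 - 1/e = {1 - 1 / math.e:.4f}")
```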
Random Forest vs. Decision Tree
```python
from sklearn.tree import DecisionTreeClassifier

# Single tree
tree = DecisionTreeClassifier(random_state=42)
tree_cv = cross_val_score(tree, data.data, data.target, cv=10)

# Random forest
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf_cv = cross_val_score(rf, data.data, data.target, cv=10)

print(f"Decision Tree: {tree_cv.mean():.4f} ± {tree_cv.std():.4f}")
print(f"Random Forest: {rf_cv.mean():.4f} ± {rf_cv.std():.4f}")

# Decision Tree: 0.9228 ± 0.0277
# Random Forest: 0.9631 ± 0.0175  ← better AND more stable!
```
Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2", 0.5],
}

grid_search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
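Grid search cost multiplies across parameters (3 × 3 × 3 = 27 candidates per CV fold above). For larger search spaces, `RandomizedSearchCV` tries only a fixed budget of sampled combinations instead. A sketch with illustrative ranges (self-contained, re-creating the same split):

```python
# RandomizedSearchCV: sample a fixed number of parameter combinations
# instead of exhaustively trying every cell of the grid.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions={
        "n_estimators": [100, 200, 300, 500],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["sqrt", "log2", 0.3, 0.5],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=10,   # only 10 of the 4*5*4*3 = 240 combinations are tried
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

With 10 candidates instead of 240, this usually lands close to the grid-search optimum at a fraction of the cost.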
When to Use Random Forest
✅ Great for:
- Tabular / structured data
- Mixed feature types (no need to scale)
- Getting feature importances
- Robust baseline model
- Handling missing values (with some implementations)
⚡ Limitations:
- Slower to train than a single tree
- Less interpretable than a single tree
- Can be outperformed by Gradient Boosting (XGBoost, LightGBM)
What is the key advantage of Random Forest over a single Decision Tree?