Module 3 — Data Exploration & Visualization (beginner, 22 min)

Data Cleaning

Why Data Cleaning Matters

“Data scientists spend 60–80% of their time cleaning data.” — Every data scientist ever.

Garbage in = garbage out. Even the best ML model fails if trained on dirty data. Data cleaning means:

  • Handling missing values
  • Removing duplicates
  • Fixing wrong data types
  • Detecting and handling outliers
  • Fixing inconsistent formatting
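
To make these problems concrete, here is a tiny made-up DataFrame (names and values are purely illustrative) that exhibits several of the issues at once — a missing value, a duplicate row, a numeric column stored as strings, an outlier, and inconsistent text formatting:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data exhibiting the problems listed above
df = pd.DataFrame({
    "name":   ["Ana", "Bob", "Bob", "Carol", "Dan"],
    "age":    ["34", "28", "28", np.nan, "41"],         # numbers stored as strings + a missing value
    "salary": [52000, 48000, 48000, 61000, 9_000_000],  # one extreme outlier
    "city":   ["berlin", "Berlin ", "Berlin ", "BERLIN", "Munich"],  # inconsistent formatting
})

print(df.duplicated().sum())  # the second "Bob" row is an exact duplicate
print(df.isnull().sum())      # "age" has one missing value
print(df.dtypes)              # "age" is object (string), not numeric
```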

Step 1: Explore Your Data First

Always understand what you’re working with:

import pandas as pd
import numpy as np

df = pd.read_csv("data/customers.csv")

# Overview
print(df.shape)         # rows × columns
print(df.dtypes)        # data type of each column
print(df.isnull().sum())  # count of missing values per column

# Percentage missing
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct[missing_pct > 0])  # only show columns with missing values

# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

Step 2: Handle Missing Values

Missing values show up as NaN (Not a Number) in pandas.
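
A quick sketch of how this looks in practice — both `np.nan` and Python's `None` are treated as missing, and `isna()` detects them:

```python
import pandas as pd
import numpy as np

# Both np.nan and None count as missing in pandas
s = pd.Series([25, np.nan, None, 31])

print(s.isna())        # True for the two missing entries
print(s.isna().sum())  # → 2
```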

Strategy 1: Drop

# Drop rows where ANY column is NaN
df_clean = df.dropna()

# Drop rows where a SPECIFIC column is NaN
df_clean = df.dropna(subset=["age", "salary"])

# Drop columns that are more than 50% NaN
# (thresh = minimum number of non-NaN values a column needs to be kept)
threshold = int(len(df) * 0.5)
df_clean = df.dropna(axis=1, thresh=threshold)

Strategy 2: Fill (Imputation)

# Fill numeric with mean or median
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Fill categorical with mode (most common value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Fill with a constant
df["notes"] = df["notes"].fillna("unknown")

# Forward fill (use previous row's value — good for time series)
df["temperature"] = df["temperature"].ffill()

# Backward fill
df["temperature"] = df["temperature"].bfill()

When to Use Which?

Strategy            When to use
----------------    ------------------------------------------
Drop row            Few rows affected, you have enough data
Drop column         Column is mostly empty (>70% missing)
Fill with median    Numeric data, especially with outliers
Fill with mean      Numeric data, roughly normal distribution
Fill with mode      Categorical data
Fill with model     Advanced: predict the missing values
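
The last strategy, model-based imputation, can be sketched with scikit-learn's `KNNImputer`, which predicts each missing value from the most similar rows (assuming scikit-learn is installed; the columns here are illustrative):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data with gaps
df = pd.DataFrame({
    "age":    [25, 30, None, 40, 35],
    "salary": [40000, 50000, 55000, None, 52000],
})

# Each missing value is predicted from the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed.isnull().sum().sum())  # → 0: no missing values remain
```

Note that KNN distances are sensitive to feature scale, so in practice you would usually scale the columns before imputing.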

Step 3: Remove Duplicates

# Find duplicates
print(df.duplicated().sum())  # count of duplicate rows

# View them
print(df[df.duplicated()])

# Remove duplicates (keeps first occurrence)
df = df.drop_duplicates()

# Drop duplicates based on specific columns
df = df.drop_duplicates(subset=["email"], keep="first")

Step 4: Fix Data Types

# Check types
print(df.dtypes)
# age         object   ← should be int!
# date        object   ← should be datetime!
# price       object   ← should be float!

# Fix numeric stored as string
df["age"] = pd.to_numeric(df["age"], errors="coerce")
# errors="coerce" turns invalid values into NaN instead of crashing

# Fix currency strings → float
# Fix currency strings → float
# regex=False treats "$" as a literal character, not a regex anchor
df["price"] = df["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
df["price"] = df["price"].astype(float)

# Parse dates
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek  # 0=Monday

Step 5: Fix Inconsistent Values

# Check unique values in categorical column
print(df["gender"].unique())
# ['Male' 'male' 'M' 'Female' 'FEMALE' 'F' nan]  ← MESS!

# Standardize
df["gender"] = df["gender"].str.lower().str.strip()
print(df["gender"].unique())
# ['male' 'female' nan]

# Map to consistent values
gender_map = {
    "male": "M", "m": "M",
    "female": "F", "f": "F"
}
df["gender"] = df["gender"].map(gender_map)  # values not in the map become NaN

Step 6: Detect and Handle Outliers

Outliers are extreme values that can skew your model.

Method 1: IQR (Interquartile Range)

Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Find outliers
outliers = df[(df["salary"] < lower) | (df["salary"] > upper)]
print(f"Outliers: {len(outliers)} rows")

# Remove outliers
df_clean = df[(df["salary"] >= lower) & (df["salary"] <= upper)]

Method 2: Z-Score

from scipy import stats

# Note: handle NaNs first — the z-score of NaN is NaN, and
# abs(NaN) < 3 is False, so those rows would be silently dropped
z_scores = stats.zscore(df["salary"])
df_clean = df[abs(z_scores) < 3]  # keep rows within 3 standard deviations

Method 3: Cap Instead of Remove

# Cap at the 1st and 99th percentile (Winsorizing)
lower = df["salary"].quantile(0.01)
upper = df["salary"].quantile(0.99)
df["salary"] = df["salary"].clip(lower, upper)

Full Cleaning Pipeline

def clean_dataset(df):
    """Full cleaning pipeline."""
    df = df.copy()  # work on a copy so the caller's DataFrame isn't mutated
    print(f"Before: {df.shape}")
    
    # 1. Remove duplicates
    df = df.drop_duplicates()
    
    # 2. Fix types
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
    
    # 3. Handle missing values
    df["age"] = df["age"].fillna(df["age"].median())
    df["salary"] = df["salary"].fillna(df["salary"].median())
    df["department"] = df["department"].fillna("Unknown")
    
    # 4. Standardize text
    df["department"] = df["department"].str.strip().str.title()
    
    # 5. Remove extreme outliers in salary
    Q1, Q3 = df["salary"].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    df = df[(df["salary"] >= Q1 - 3*IQR) & (df["salary"] <= Q3 + 3*IQR)]
    
    print(f"After:  {df.shape}")
    return df

df = clean_dataset(df)

Knowledge Check

For a numeric column with outliers, which imputation strategy is usually better: mean or median?
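
To check your answer, compare the two on toy data with a single extreme value (numbers invented for illustration):

```python
import numpy as np

salaries = np.array([30_000, 32_000, 35_000, 40_000, 2_000_000])  # one outlier

print(np.mean(salaries))    # → 427400.0 — dragged far upward by the outlier
print(np.median(salaries))  # → 35000.0 — barely affected
```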