Module 3 — Data Exploration & Visualization beginner 22 min
Data Cleaning
Why Data Cleaning Matters
“Data scientists spend 60–80% of their time cleaning data.” — Every data scientist ever.
Garbage in = garbage out. Even the best ML model fails if trained on dirty data. Data cleaning means:
- Handling missing values
- Removing duplicates
- Fixing wrong data types
- Detecting and handling outliers
- Fixing inconsistent formatting
Step 1: Explore Your Data First
Always understand what you’re working with:
import pandas as pd
import numpy as np
df = pd.read_csv("data/customers.csv")
# Overview
print(df.shape) # rows × columns
print(df.dtypes) # data type of each column
print(df.isnull().sum()) # count of missing values per column
# Percentage missing
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct[missing_pct > 0]) # only show columns with missing values
# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
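If you don't have a `customers.csv` handy, a small hypothetical frame like this reproduces the same checks (the column names and values here are made up for illustration):

```python
import pandas as pd
import numpy as np

# Made-up stand-in for customers.csv: one NaN per column, rows 0 and 3 identical
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 25, 62],
    "salary": [50000.0, 62000.0, np.nan, 50000.0, 58000.0],
    "city":   ["Berlin", "berlin ", None, "Berlin", "Munich"],
})

print(df.shape)                # (5, 3)
print(df.isnull().sum())       # one missing value in each column
print(df.duplicated().sum())   # 1 — rows 0 and 3 are exact copies
```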
Step 2: Handle Missing Values
Missing values show up as NaN (Not a Number) in pandas.
Strategy 1: Drop
# Drop rows where ANY column is NaN
df_clean = df.dropna()
# Drop rows where a SPECIFIC column is NaN
df_clean = df.dropna(subset=["age", "salary"])
# Drop columns with too many NaNs (>50% missing)
threshold = int(len(df) * 0.5)  # thresh must be an integer: minimum non-NaN values to keep a column
df_clean = df.dropna(axis=1, thresh=threshold)
Strategy 2: Fill (Imputation)
# Fill numeric with mean or median
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())
# Fill categorical with mode (most common value)
df["city"] = df["city"].fillna(df["city"].mode()[0])
# Fill with a constant
df["notes"] = df["notes"].fillna("unknown")
# Forward fill (use previous row's value — good for time series)
df["temperature"] = df["temperature"].ffill()
# Backward fill
df["temperature"] = df["temperature"].bfill()
When to Use Which?
| Strategy | When to use |
|---|---|
| Drop row | Few rows affected, you have enough data |
| Drop column | Column is mostly empty (>70% missing) |
| Fill with median | Numeric data, especially with outliers |
| Fill with mean | Numeric data, roughly normal distribution |
| Fill with mode | Categorical data |
| Fill with model | Advanced: predict the missing values |
Step 3: Remove Duplicates
# Find duplicates
print(df.duplicated().sum()) # count of duplicate rows
# View them
print(df[df.duplicated()])
# Remove duplicates (keeps first occurrence)
df = df.drop_duplicates()
# Drop duplicates based on specific columns
df = df.drop_duplicates(subset=["email"], keep="first")
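One detail worth knowing: `duplicated()` marks only the *later* copies by default. With `keep=False` every member of a duplicate group is marked, which is handy for inspecting them before deciding which copy to keep. A hypothetical contact list:

```python
import pandas as pd

# Made-up contact list with a repeated email
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "name":  ["Ann", "Bob", "Ann B."],
})

# keep=False marks ALL copies, so you can review the whole duplicate group
print(df[df.duplicated(subset=["email"], keep=False)])

deduped = df.drop_duplicates(subset=["email"], keep="first")
print(len(deduped))  # 2
```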
Step 4: Fix Data Types
# Check types
print(df.dtypes)
# age object ← should be int!
# date object ← should be datetime!
# price object ← should be float!
# Fix numeric stored as string
df["age"] = pd.to_numeric(df["age"], errors="coerce")
# errors="coerce" turns invalid values into NaN instead of crashing
# Fix currency strings → float
df["price"] = df["price"].str.replace("$", "").str.replace(",", "")
df["price"] = df["price"].astype(float)
# Parse dates
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek # 0=Monday
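`pd.to_datetime` also accepts `errors="coerce"`, just like `pd.to_numeric` above, which is useful when a few date strings are malformed. A small illustrative example:

```python
import pandas as pd

# Date strings as they often arrive — one entry is impossible (illustrative)
s = pd.Series(["2024-01-15", "2024-02-30", "2024-03-01"])

# errors="coerce" turns invalid dates into NaT instead of raising
dates = pd.to_datetime(s, errors="coerce")
print(dates.isna().sum())     # 1 — Feb 30 does not exist
print(dates.dt.dayofweek[0])  # 0 — 2024-01-15 was a Monday
```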
Step 5: Fix Inconsistent Values
# Check unique values in categorical column
print(df["gender"].unique())
# ['Male' 'male' 'M' 'Female' 'FEMALE' 'F' nan] ← MESS!
# Standardize
df["gender"] = df["gender"].str.lower().str.strip()
print(df["gender"].unique())
# ['male' 'female' nan]
# Map to consistent values
gender_map = {
"male": "M", "m": "M",
"female": "F", "f": "F"
}
df["gender"] = df["gender"].map(gender_map)
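Careful with `.map()`: any value *not* in the dictionary silently becomes NaN. Check coverage before mapping, or use `.replace()` if unmapped values should pass through unchanged. An illustrative example with one unmapped category:

```python
import pandas as pd

s = pd.Series(["Male", "male", "F", "diverse", None])
s = s.str.lower().str.strip()

gender_map = {"male": "M", "m": "M", "female": "F", "f": "F"}

# .map() returns NaN for anything missing from the dict ("diverse" here)
mapped = s.map(gender_map)
print(mapped.isna().sum())  # 2 — the unmapped "diverse" plus the original None
```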
Step 6: Detect and Handle Outliers
Outliers are extreme values that can skew your model.
Method 1: IQR (Interquartile Range)
Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Find outliers
outliers = df[(df["salary"] < lower) | (df["salary"] > upper)]
print(f"Outliers: {len(outliers)} rows")
# Remove outliers
df_clean = df[(df["salary"] >= lower) & (df["salary"] <= upper)]
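Worked through on a tiny invented series, the IQR rule looks like this:

```python
import pandas as pd

# Salaries (in thousands) with one obvious outlier — illustrative data
s = pd.Series([40, 42, 44, 46, 48, 500])

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)   # 42.5 and 47.5
IQR = Q3 - Q1                                 # 5.0
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR # 35.0 and 55.0

clean = s[(s >= lower) & (s <= upper)]
print(len(clean))  # 5 — the 500 falls outside [35, 55] and is removed
```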
Method 2: Z-Score
from scipy import stats
# nan_policy="omit" ignores NaNs when computing the z-scores;
# the default ("propagate") would make EVERY z-score NaN if any salary is missing
z_scores = stats.zscore(df["salary"], nan_policy="omit")
df_clean = df[abs(z_scores) < 3]  # keep rows within 3 standard deviations
# Note: rows with NaN salaries are also dropped here — impute first if you want to keep them
Method 3: Cap Instead of Remove
# Cap at the 1st and 99th percentile (Winsorizing)
lower = df["salary"].quantile(0.01)
upper = df["salary"].quantile(0.99)
df["salary"] = df["salary"].clip(lower, upper)
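Capping keeps every row: the extreme value is pulled in to the percentile boundary instead of being deleted. A quick illustrative check:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

lower, upper = s.quantile(0.01), s.quantile(0.99)
capped = s.clip(lower, upper)

# Same number of rows as before; only the extreme value has moved
print(len(capped))   # 10
print(capped.max())  # just under 100 — the 99th-percentile cap
```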
Full Cleaning Pipeline
def clean_dataset(df):
    """Full cleaning pipeline."""
    print(f"Before: {df.shape}")

    # 1. Remove duplicates
    df = df.drop_duplicates()

    # 2. Fix types
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df["salary"] = pd.to_numeric(df["salary"], errors="coerce")

    # 3. Handle missing values
    df["age"] = df["age"].fillna(df["age"].median())
    df["salary"] = df["salary"].fillna(df["salary"].median())
    df["department"] = df["department"].fillna("Unknown")

    # 4. Standardize text
    df["department"] = df["department"].str.strip().str.title()

    # 5. Remove extreme outliers in salary
    Q1, Q3 = df["salary"].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    df = df[(df["salary"] >= Q1 - 3 * IQR) & (df["salary"] <= Q3 + 3 * IQR)]

    print(f"After: {df.shape}")
    return df

df = clean_dataset(df)
For a numeric column with outliers, which imputation strategy is usually better: mean or median?