Module 3 — Data Exploration & Visualization
Pandas DataFrames
What is Pandas?
Pandas is the go-to Python library for working with tabular data — think spreadsheets, CSVs, databases. It provides two core structures:
- Series — a single column of data
- DataFrame — a table with rows and columns
import pandas as pd # convention: always alias as pd
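A minimal sketch of the two structures side by side (the column names here are illustrative):

```python
import pandas as pd

# A Series: a single labeled column of values
s = pd.Series([25, 30, 27], name="age")

# A DataFrame: a table; each of its columns is a Series
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# Pulling one column out of a DataFrame gives you a Series back
print(type(df["age"]))  # <class 'pandas.core.series.Series'>
```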
Creating DataFrames
import pandas as pd
# From a dictionary
data = {
"name": ["Alice", "Bob", "Carol", "David"],
"age": [25, 30, 27, 35],
"salary": [70000, 85000, 62000, 90000],
"dept": ["Engineering", "Marketing", "Engineering", "HR"],
}
df = pd.DataFrame(data)
print(df)
# name age salary dept
# 0 Alice 25 70000 Engineering
# 1 Bob 30 85000 Marketing
# 2 Carol 27 62000 Engineering
# 3 David 35 90000 HR
Loading from CSV (Most Common)
# Load a CSV file
df = pd.read_csv("data/housing.csv")
# Load with options
df = pd.read_csv(
"data/housing.csv",
sep=",", # delimiter
header=0, # row 0 is the header
index_col=None, # don't use any column as index
)
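read_csv also accepts any file-like object, not just a path, which is handy for quick experiments when you don't have a file on disk (the inline CSV text below is made up):

```python
import io
import pandas as pd

csv_text = "name,age\nAlice,25\nBob,30\n"

# A StringIO wrapper makes the string behave like an open file
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```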
Inspecting Your Data
These are always the first things you run on a new dataset:
df = pd.read_csv("data/housing.csv")
# Shape
print(df.shape) # (20640, 10) — rows × columns
# Column names
print(df.columns.tolist())
# ['longitude', 'latitude', 'housing_median_age', 'total_rooms', ...]
# First 5 rows
print(df.head())
# Last 3 rows (tail() defaults to 5)
print(df.tail(3))
# Summary stats
print(df.describe())
# longitude latitude housing_median_age ...
# count 20640.00 20640.00 20640.00 ...
# mean -119.57 35.63 28.64 ...
# std 2.00 2.14 12.59 ...
# min -124.35 32.54 1.00 ...
# Data types
print(df.dtypes)
# Missing values count
print(df.isnull().sum())
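Another call worth running early is df.info(), which combines the row count, per-column non-null counts, and dtypes in one compact report (toy data below):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, None],     # one missing value
    "b": ["x", "y", "z"],
})

# Prints row count, non-null count and dtype per column, and memory usage
df.info()
```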
Selecting Data
Select Columns
# Single column → returns a Series
ages = df["age"] # or df.age
# Multiple columns → returns a DataFrame
subset = df[["name", "salary"]]
Select Rows with .loc and .iloc
# .loc — label-based (uses row index)
row = df.loc[0] # first row
rows = df.loc[0:2] # rows 0, 1, 2 (inclusive)
# .iloc — position-based
row = df.iloc[0] # first row (same as loc[0] if default index)
last = df.iloc[-1] # last row
block = df.iloc[1:4, 0:3] # rows 1-3, cols 0-2
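The difference between .loc and .iloc only becomes visible once the index is not the default 0..n-1. A small sketch with a string index (the labels are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"salary": [70000, 85000, 62000]},
    index=["alice", "bob", "carol"],  # custom labels instead of 0, 1, 2
)

by_label = df.loc["bob"]    # label-based: looks up the index value "bob"
by_position = df.iloc[1]    # position-based: second row, whatever its label

print(by_label["salary"], by_position["salary"])  # 85000 85000
```

With this index, df.loc[1] would raise a KeyError; the positional and label views only coincide when the index is the default RangeIndex.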
Filter Rows (Boolean Indexing)
# Engineers only
engineers = df[df["dept"] == "Engineering"]
# High earners
high_pay = df[df["salary"] >= 80000]
# Combine conditions (use & for AND, | for OR)
senior_engineers = df[
(df["dept"] == "Engineering") & (df["age"] > 26)
]
print(senior_engineers)
# name age salary dept
# 2 Carol 27 62000 Engineering ← the only engineer older than 26
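When a filter checks membership in a set of values, .isin() is cleaner than chaining several == checks with | (the data below mirrors the example table):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "David"],
    "dept": ["Engineering", "Marketing", "Engineering", "HR"],
})

# Equivalent to (dept == "Engineering") | (dept == "HR"), but shorter
tech_or_hr = df[df["dept"].isin(["Engineering", "HR"])]
print(tech_or_hr["name"].tolist())  # ['Alice', 'Carol', 'David']
```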
Modifying DataFrames
# Add a new column
df["salary_k"] = df["salary"] / 1000
df["is_senior"] = df["age"] >= 30
# Rename columns
df = df.rename(columns={"salary": "annual_salary"})
# Drop columns
df = df.drop(columns=["salary_k"])
# Apply a function to a column
df["age_group"] = df["age"].apply(lambda x: "under 30" if x < 30 else "30+")
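apply with a Python lambda runs element by element; for a simple two-way split like the age_group column, a vectorized alternative is numpy.where, which is typically faster on large frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 27, 35]})

# Vectorized version of the apply/lambda pattern: condition, value-if-true, value-if-false
df["age_group"] = np.where(df["age"] < 30, "under 30", "30+")
print(df["age_group"].tolist())  # ['under 30', '30+', 'under 30', '30+']
```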
Grouping & Aggregation
# Average salary by department
print(df.groupby("dept")["salary"].mean())
# dept
# Engineering 66000
# HR 90000
# Marketing 85000
# Multiple aggregations
print(df.groupby("dept").agg({
"salary": ["mean", "min", "max"],
"age": "mean",
}))
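The result of a groupby is itself a Series or DataFrame, so it chains directly with sorting and further methods (same toy data as the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["Engineering", "Marketing", "Engineering", "HR"],
    "salary": [70000, 85000, 62000, 90000],
})

# Mean salary per department, highest first
avg = df.groupby("dept")["salary"].mean().sort_values(ascending=False)
print(avg.index.tolist())  # ['HR', 'Marketing', 'Engineering']
```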
Handling Missing Values
df = pd.read_csv("data/housing.csv")
# Count missing values
print(df.isnull().sum())
# total_bedrooms 207 ← 207 missing values!
# Drop rows with any missing values
df_clean = df.dropna()
# Fill missing with median (common ML practice)
median_val = df["total_bedrooms"].median()
df["total_bedrooms"] = df["total_bedrooms"].fillna(median_val)
# Fill with mean — same pattern; "price" here stands in for any numeric column
df["price"] = df["price"].fillna(df["price"].mean())
Preparing Data for ML
# A typical pre-ML pipeline
df = pd.read_csv("data/housing.csv")
# 1. Drop rows with any missing values
df = df.dropna()
# 2. Convert categorical to numeric
df["ocean_proximity"] = df["ocean_proximity"].map({
"NEAR BAY": 0, "<1H OCEAN": 1, "INLAND": 2,
"NEAR OCEAN": 3, "ISLAND": 4
})
# 3. Separate features (X) and target (y)
X = df.drop(columns=["median_house_value"])
y = df["median_house_value"]
print(f"Features: {X.shape}") # (20433, 9) — 20640 rows minus 207 dropped
print(f"Target: {y.shape}") # (20433,)
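The integer map above works, but it imposes an arbitrary ordering on the categories (INLAND is not "twice" NEAR BAY). A common alternative is one-hot encoding with pd.get_dummies, sketched here on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

# One binary column per category — no implied ordering between them
encoded = pd.get_dummies(df, columns=["ocean_proximity"])
print(encoded.columns.tolist())
# ['ocean_proximity_INLAND', 'ocean_proximity_NEAR BAY']
```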
Summary
| Task | Code |
|---|---|
| Load CSV | pd.read_csv("file.csv") |
| Inspect | df.head(), df.describe() |
| Filter rows | df[df["col"] > value] |
| Select columns | df[["col1", "col2"]] |
| Group | df.groupby("col").mean() |
| Fill NaN | df["col"].fillna(value) |