You're offline — showing cached content
Module 3 — Data Exploration & Visualization (Beginner · 25 min)

Pandas DataFrames

What is Pandas?

Pandas is the go-to Python library for working with tabular data — think spreadsheets, CSVs, databases. It provides two core structures:

  • Series — a single column of data
  • DataFrame — a table with rows and columns
import pandas as pd  # convention: always alias as pd

Creating DataFrames

import pandas as pd

# From a dictionary
data = {
    "name":   ["Alice", "Bob", "Carol", "David"],
    "age":    [25, 30, 27, 35],
    "salary": [70000, 85000, 62000, 90000],
    "dept":   ["Engineering", "Marketing", "Engineering", "HR"],
}

df = pd.DataFrame(data)
print(df)
#     name  age  salary         dept
# 0  Alice   25   70000  Engineering
# 1    Bob   30   85000    Marketing
# 2  Carol   27   62000  Engineering
# 3  David   35   90000           HR

Loading from CSV (Most Common)

# Load a CSV file
df = pd.read_csv("data/housing.csv")

# Load with options
df = pd.read_csv(
    "data/housing.csv",
    sep=",",          # delimiter
    header=0,         # row 0 is the header
    index_col=None,   # don't use any column as index
)

Inspecting Your Data

These are always the first things you run on a new dataset:

df = pd.read_csv("data/housing.csv")

# Shape
print(df.shape)          # (20640, 10) — rows × columns

# Column names
print(df.columns.tolist())
# ['longitude', 'latitude', 'housing_median_age', 'total_rooms', ...]

# First 5 rows
print(df.head())

# Last 3 rows
print(df.tail(3))

# Summary stats
print(df.describe())
#        longitude    latitude  housing_median_age  ...
# count  20640.00   20640.00           20640.00  ...
# mean    -119.57      35.63              28.64  ...
# std        2.00       2.14              12.59  ...
# min     -124.35      32.54               1.00  ...

# Data types
print(df.dtypes)

# Missing values count
print(df.isnull().sum())

Selecting Data

Select Columns

# Single column → returns a Series
ages = df["age"]             # or df.age

# Multiple columns → returns a DataFrame
subset = df[["name", "salary"]]

Select Rows with .loc and .iloc

# .loc — label-based (uses row index)
row = df.loc[0]         # first row
rows = df.loc[0:2]      # rows 0, 1, 2 (inclusive)

# .iloc — position-based
row = df.iloc[0]        # first row (same as loc[0] if default index)
last = df.iloc[-1]      # last row
block = df.iloc[1:4, 0:3]  # rows 1-3, cols 0-2

Filter Rows (Boolean Indexing)

# Engineers only
engineers = df[df["dept"] == "Engineering"]

# High earners
high_pay = df[df["salary"] >= 80000]

# Combine conditions (use & for AND, | for OR)
senior_engineers = df[
    (df["dept"] == "Engineering") & (df["age"] > 26)
]
print(senior_engineers)
#     name  age  salary         dept
# 2  Carol   27   62000  Engineering  ← (age 27 > 26)

Modifying DataFrames

# Add a new column
df["salary_k"] = df["salary"] / 1000
df["is_senior"] = df["age"] >= 30

# Rename columns
df = df.rename(columns={"salary": "annual_salary"})

# Drop columns
df = df.drop(columns=["salary_k"])

# Apply a function to a column
df["age_group"] = df["age"].apply(lambda x: "under 30" if x < 30 else "30+")

Grouping & Aggregation

# Average salary by department
print(df.groupby("dept")["salary"].mean())
# dept
# Engineering    66000
# HR             90000
# Marketing      85000

# Multiple aggregations
print(df.groupby("dept").agg({
    "salary": ["mean", "min", "max"],
    "age":    "mean",
}))

Handling Missing Values

df = pd.read_csv("data/housing.csv")

# Count missing values
print(df.isnull().sum())
# total_bedrooms    207   ← 207 missing values!

# Drop rows with any missing values
df_clean = df.dropna()

# Fill missing with median (common ML practice)
median_val = df["total_bedrooms"].median()
df["total_bedrooms"] = df["total_bedrooms"].fillna(median_val)

# Or fill with the mean instead of the median
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].mean())

Preparing Data for ML

# A typical pre-ML pipeline
df = pd.read_csv("data/housing.csv")

# 1. Drop rows with too many missing values
df = df.dropna()

# 2. Convert categorical to numeric
df["ocean_proximity"] = df["ocean_proximity"].map({
    "NEAR BAY": 0, "<1H OCEAN": 1, "INLAND": 2,
    "NEAR OCEAN": 3, "ISLAND": 4
})

# 3. Separate features (X) and target (y)
X = df.drop(columns=["median_house_value"])
y = df["median_house_value"]

print(f"Features: {X.shape}")  # (20000, 9)
print(f"Target:   {y.shape}")  # (20000,)

Summary

Task            → Code
Load CSV        → pd.read_csv("file.csv")
Inspect         → df.head(), df.describe()
Filter rows     → df[df["col"] > value]
Select columns  → df[["col1", "col2"]]
Group           → df.groupby("col").mean()
Fill NaN        → df["col"].fillna(value)

Knowledge Check

You want to select all rows where the column 'score' is greater than 80 AND the column 'status' equals 'active'. Which code is correct?