Python for Tidyverse Users

A complete, side-by-side course translating your R/tidyverse fluency into Python/pandas mastery — from first line to production ETL.


Chapter 0 — Why Learn Python When You Know R?

You already think in tidy data. Now add a second superpower.

As a tidyverse user you already understand the hardest concepts: tidy data principles, piped transformations, grouped operations, and the grammar of graphics. Python won't replace R — it will complement it. Here's why it's worth adding:

The Good News
About 80% of tidyverse operations map directly to pandas. Your mental model transfers — you just need new syntax.

Chapter 1 — Setup & Environment

Getting your Python workspace running — the equivalent of installing R + RStudio + tidyverse.

1.1 The R → Python Rosetta Stone (Tools)

R World → Python Equivalent (notes in parentheses)
R → python (the language itself)
RStudio → VS Code or JupyterLab (VS Code + Python extension is closest to RStudio)
CRAN → PyPI (package repository)
install.packages() → pip install (package manager)
library() → import (load a package)
renv → venv / conda (virtual environments)
.Rmd / .qmd → .ipynb (Jupyter) / .qmd (Quarto supports both!)
tidyverse → pandas + numpy (core data stack)
ggplot2 → matplotlib + seaborn / plotly (visualisation)

1.2 Installation

TERMINAL
# Install Python (Mac/Linux usually have it; Windows → python.org)
# Then create a virtual environment (like renv):
python -m venv myproject_env
source myproject_env/bin/activate   # Mac/Linux
myproject_env\Scripts\activate      # Windows

# Install the "tidyverse" of Python:
pip install pandas numpy matplotlib seaborn scipy scikit-learn
pip install openpyxl xlsxwriter     # Excel support
pip install sqlalchemy              # SQL support
pip install requests beautifulsoup4 # APIs & scraping
pip install jupyterlab              # Notebook interface

1.3 Your First Session

R
library(tidyverse)

# Everything is loaded — dplyr,
# ggplot2, tidyr, readr, etc.

# Interactive: RStudio console
# Scripts: .R files
PYTHON
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Convention: short aliases
# pd, np, plt — you'll see these everywhere

# Interactive: Jupyter notebook
# Scripts: .py files
Alias Convention
import pandas as pd is universal — every tutorial, every StackOverflow answer uses pd. Same for np (numpy), plt (matplotlib), sns (seaborn). Don't fight it!

Chapter 2 — Python Syntax Primer

The biggest "gotchas" when switching from R to Python.

2.1 Indentation Matters (No Curly Braces)

R
# Braces define blocks
if (x > 10) {
  print("big")
} else {
  print("small")
}

for (i in 1:5) {
  print(i)
}
PYTHON
# Indentation defines blocks (4 spaces)
if x > 10:
    print("big")
else:
    print("small")

for i in range(1, 6):  # 1 to 5
    print(i)

2.2 Zero-Indexed (The Big Difference)

R
x <- c(10, 20, 30)
x[1]    # 10  (1-indexed)
x[2:3]  # 20, 30 (inclusive)
PYTHON
x = [10, 20, 30]
x[0]    # 10  (0-indexed!)
x[1:3]  # [20, 30] (end exclusive)
Index Traps
Python uses 0-based indexing and slices are start-inclusive, end-exclusive: x[1:3] returns items at index 1 and 2 (not 3). This is the #1 source of bugs for R users switching to Python.
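
One more trap hiding in the same spot: negative indices. In R, x[-1] drops the first element; in Python, a negative index counts from the end. A quick illustration:

PYTHON
x = [10, 20, 30]
x[-1]    # 30, the LAST element (in R, x[-1] drops the first)
x[:2]    # [10, 20], everything before index 2
x[1:]    # [20, 30], index 1 to the end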

2.3 Key Syntax Differences

Concept: R → Python
Assignment: x <- 5 or x = 5 → x = 5 only
Print: print(x) or just x → print(x) required in scripts
Boolean: TRUE / FALSE → True / False (capitalised)
NULL / None: NULL → None
NA / NaN: NA → np.nan or pd.NA
Not equal: != → != (same)
And / Or: & / | → and / or (scalar), & / | (vectorised); see the example below
Pipe: %>% or |> → . method chaining
String: "hi" or 'hi' → "hi" or 'hi' (same)
f-string: glue("Hello {name}") → f"Hello {name}"
Comment: # comment → # comment (same)
Negation: !x → not x
Membership: x %in% c(1,2,3) → x in [1, 2, 3] (scalar; use .isin() on a Series)
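
The And / Or row is the one that bites: the scalar operators raise an error on pandas Series, and the vectorised ones need parentheses around each condition.

PYTHON
import pandas as pd

a = pd.Series([1, 5, 10])
# a > 2 and a < 8        # ValueError: truth value of a Series is ambiguous
(a > 2) & (a < 8)        # element-wise AND; the parentheses are required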

2.4 The Pipe → Method Chaining

This is the conceptual leap. In R you pipe data through functions. In Python you chain methods on the object:

R (pipe)
df %>%
  filter(age > 30) %>%
  mutate(decade = age %/% 10) %>%
  group_by(decade) %>%
  summarise(avg_income = mean(income)) %>%
  arrange(desc(avg_income))
PYTHON (chain)
(df
  .query("age > 30")
  .assign(decade=lambda x: x["age"] // 10)
  .groupby("decade")
  .agg(avg_income=("income", "mean"))
  .sort_values("avg_income", ascending=False)
)
Chaining Trick
Wrap the entire chain in parentheses ( ) to break it across lines without backslashes. This is what keeps a method chain as readable as a dplyr pipe.
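
pandas also has a literal pipe: .pipe() passes the whole DataFrame to an arbitrary function mid-chain, covering cases where no method exists. A minimal sketch with a hypothetical helper:

PYTHON
def add_decade(df):
    # hypothetical helper, not defined in the examples above
    return df.assign(decade=df["age"] // 10)

(df
  .query("age > 30")
  .pipe(add_decade)    # like df %>% add_decade() in R
  .groupby("decade")
  .agg(avg_income=("income", "mean"))
)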

Chapter 3 — Data Types & Structures

Lists, dictionaries, tuples — building blocks with no exact equivalents in R.

3.1 Core Data Types

R Type → Python Type (example / notes)
numeric → int / float (42, 3.14)
character → str ("hello")
logical → bool (True, False)
c(1, 2, 3) vector → [1, 2, 3] list (ordered, mutable)
list(a = 1, b = 2) named list → {"a": 1, "b": 2} dict (key-value pairs)
no equivalent → (1, 2, 3) tuple (ordered, immutable)
c(1, 2, 3) vector → np.array([1, 2, 3]) (vectorised operations)
data.frame / tibble → pd.DataFrame (the workhorse)

3.2 The Big Insight: Python Lists ≠ R Vectors

R — vectorised by default
x <- c(1, 2, 3, 4, 5)
x * 2          # 2, 4, 6, 8, 10 ✓
x[x > 3]      # 4, 5           ✓
mean(x)        # 3              ✓
PYTHON — use numpy for vectorised ops
x = [1, 2, 3, 4, 5]
# x * 2  → [1,2,3,4,5,1,2,3,4,5] 😱

# Use numpy instead:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
x * 2          # array([2,4,6,8,10]) ✓
x[x > 3]      # array([4, 5])       ✓
x.mean()       # 3.0                 ✓

3.3 Dictionaries (Your New Best Friend)

PYTHON
# Dicts are like R named lists but much more central to Python
person = {
    "name": "Meron",
    "role": "Analyst",
    "skills": ["R", "Python", "SQL"]
}

person["name"]               # "Meron"
person["skills"][0]          # "R" (0-indexed!)
person.get("age", "unknown") # safe access with default

# Dicts → DataFrames (very common pattern)
data = {
    "name": ["Alice", "Bob", "Carol"],
    "age": [32, 45, 28],
    "dept": ["Research", "Engineering", "Research"]
}
df = pd.DataFrame(data)      # instant DataFrame!

3.4 List Comprehensions (R's map() on steroids)

R
# purrr::map
map_dbl(1:5, ~ .x^2)
# [1]  1  4  9 16 25

# Keep evens
keep(1:10, ~ .x %% 2 == 0)
PYTHON
# List comprehension
[x**2 for x in range(1, 6)]
# [1, 4, 9, 16, 25]

# With condition
[x for x in range(1, 11) if x % 2 == 0]
# [2, 4, 6, 8, 10]

Chapter 4 — Pandas = dplyr + tidyr

The single library that replaces most of the tidyverse.

4.1 Creating DataFrames

R
df <- tibble(
  name = c("Alice", "Bob", "Carol"),
  age = c(32, 45, 28),
  score = c(88, 92, 85)
)

glimpse(df)
summary(df)
PYTHON
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [32, 45, 28],
    "score": [88, 92, 85]
})

df.info()     # like glimpse()
df.describe() # like summary()

4.2 Exploring Data

R / tidyverse → Python / pandas (purpose)
glimpse(df) → df.info() (column types & non-nulls)
head(df) → df.head() (first rows)
tail(df) → df.tail() (last rows)
nrow(df) → len(df) (row count)
ncol(df) → df.shape[1] (column count)
dim(df) → df.shape ((rows, cols) tuple)
names(df) → df.columns.tolist() (column names)
summary(df) → df.describe() (numeric summary)
count(df, col) → df["col"].value_counts() (frequency table)
n_distinct(df$col) → df["col"].nunique() (unique count)
is.na(df) %>% colSums() → df.isna().sum() (missing per column)
df %>% distinct() → df.drop_duplicates() (remove duplicates)

Chapter 5 — Selecting & Filtering

The dplyr verbs select() and filter() in pandas.

5.1 Selecting Columns

R
# Single column
df %>% select(name)

# Multiple columns
df %>% select(name, age)

# By pattern
df %>% select(starts_with("score"))
df %>% select(where(is.numeric))

# Drop columns
df %>% select(-age)

# Rename
df %>% rename(full_name = name)
PYTHON
# Single column (returns Series)
df["name"]

# Multiple columns (returns DataFrame)
df[["name", "age"]]

# By pattern
df.filter(regex="^score")
df.select_dtypes(include="number")

# Drop columns
df.drop(columns=["age"])

# Rename
df.rename(columns={"name": "full_name"})

5.2 Filtering Rows

R
# Simple filter
df %>% filter(age > 30)

# Multiple conditions
df %>% filter(age > 30, score > 85)

# OR condition
df %>% filter(age > 30 | score > 90)

# %in% check
df %>% filter(name %in% c("Alice","Bob"))

# Missing values
df %>% filter(!is.na(score))

# Slice
df %>% slice(1:3)
df %>% slice_max(score, n = 2)
PYTHON
# Simple filter
df[df["age"] > 30]            # boolean mask
df.query("age > 30")          # query string

# Multiple conditions
df.query("age > 30 and score > 85")

# OR condition
df.query("age > 30 or score > 90")

# isin check
df[df["name"].isin(["Alice", "Bob"])]

# Missing values
df[df["score"].notna()]        # or .dropna()

# Slice
df.iloc[0:3]                   # by position
df.nlargest(2, "score")        # top N
query() vs boolean masking
df.query("age > 30") is cleaner for simple filters (like dplyr). Boolean masking df[df["age"] > 30] is more flexible for complex logic. Use .query() for readability in chains.

Chapter 6 — Mutating & Summarising

mutate(), summarise(), and group_by() translated to pandas.

6.1 Creating New Columns (mutate)

R
df %>%
  mutate(
    score_pct = score / 100,
    age_group = case_when(
      age < 30 ~ "Young",
      age < 40 ~ "Mid",
      TRUE     ~ "Senior"
    ),
    rank = row_number()
  )
PYTHON
(df
  .assign(
    score_pct=lambda x: x["score"] / 100,
    age_group=lambda x: np.select(
      [x["age"] < 30, x["age"] < 40],
      ["Young", "Mid"],
      default="Senior"
    ),
    rank=lambda x: range(1, len(x) + 1)
  )
)
# Or directly: df["score_pct"] = df["score"]/100

6.2 case_when → np.select / pd.cut

R
# case_when
df %>% mutate(
  grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    score >= 70 ~ "C",
    TRUE        ~ "F"
  )
)

# cut for bins
df %>% mutate(
  age_bin = cut(age, breaks = c(0,30,40,50,Inf))
)
PYTHON
# np.select (order matters!)
conditions = [
    df["score"] >= 90,
    df["score"] >= 80,
    df["score"] >= 70,
]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, "F")

# pd.cut for bins
df["age_bin"] = pd.cut(
    df["age"], bins=[0, 30, 40, 50, np.inf]
)

6.3 Summarising & Grouping

R
# Ungrouped summary
df %>% summarise(
  avg_score = mean(score),
  n = n()
)

# Grouped summary
df %>%
  group_by(dept) %>%
  summarise(
    avg_score = mean(score),
    max_age = max(age),
    n = n()
  ) %>%
  ungroup()

# Window function in mutate
df %>%
  group_by(dept) %>%
  mutate(
    dept_avg = mean(score),
    rank = rank(-score)
  )
PYTHON
# Ungrouped summary
df.agg(
    avg_score=("score", "mean"),
    n=("score", "size")
)

# Grouped summary
(df
  .groupby("dept")
  .agg(
    avg_score=("score", "mean"),
    max_age=("age", "max"),
    n=("name", "size")
  )
  .reset_index()    # like ungroup()
)

# Window function (transform)
df["dept_avg"] = (
    df.groupby("dept")["score"].transform("mean")
)
df["rank"] = (
    df.groupby("dept")["score"].rank(ascending=False)
)
.transform() = grouped mutate
.agg() collapses rows (like summarise). .transform() keeps all rows (like mutate inside group_by). This is the key distinction.

Chapter 7 — Joins & Reshaping

dplyr joins and tidyr pivots, translated.

7.1 Joins

R
left_join(df1, df2, by = "id")
inner_join(df1, df2, by = "id")
full_join(df1, df2, by = "id")
anti_join(df1, df2, by = "id")

# Different column names
left_join(df1, df2,
  by = c("emp_id" = "employee_id")
)
PYTHON
df1.merge(df2, on="id", how="left")
df1.merge(df2, on="id", how="inner")
df1.merge(df2, on="id", how="outer")
df1.merge(df2, on="id", how="left",
          indicator=True)  # then filter

# Different column names
df1.merge(df2,
    left_on="emp_id",
    right_on="employee_id"
)

7.2 Anti-join Pattern in Pandas

PYTHON
# Anti-join: rows in df1 NOT in df2
merged = df1.merge(df2, on="id", how="left", indicator=True)
anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

7.3 Reshaping: Pivot Longer & Wider

R
# Long → same as gather
df %>% pivot_longer(
  cols = c(q1, q2, q3),
  names_to = "quarter",
  values_to = "revenue"
)

# Wide → same as spread
df %>% pivot_wider(
  names_from = quarter,
  values_from = revenue
)
PYTHON
# Melt = pivot_longer
df.melt(
    id_vars=["name"],
    value_vars=["q1", "q2", "q3"],
    var_name="quarter",
    value_name="revenue"
)

# Pivot = pivot_wider
df.pivot_table(
    index="name",
    columns="quarter",
    values="revenue"
).reset_index()
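
One subtlety worth knowing: pivot_table() aggregates duplicate index/column pairs (mean by default), which is a silent summarise. When each cell should hold exactly one value, the stricter pivot() is closer to pivot_wider():

PYTHON
df.pivot(
    index="name",
    columns="quarter",
    values="revenue"
).reset_index()    # raises ValueError if (name, quarter) pairs repeat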

Chapter 8 — Strings, Dates & Factors

stringr, lubridate, forcats → pandas str/dt accessors.

8.1 String Operations

R (stringr)
str_to_lower(x)
str_to_upper(x)
str_trim(x)
str_detect(x, "pattern")
str_replace(x, "old", "new")
str_sub(x, 1, 3)
str_c("a", "b", sep = "-")
str_extract(x, "\\d+")
PYTHON (pandas .str accessor)
df["col"].str.lower()
df["col"].str.upper()
df["col"].str.strip()
df["col"].str.contains("pattern")
df["col"].str.replace("old", "new")
df["col"].str[:3]
"a" + "-" + "b"  # or "-".join(...)
df["col"].str.extract(r"(\d+)")

8.2 Dates & Times

R (lubridate)
ymd("2024-03-15")
today()
x <- ymd("2024-03-15")
year(x); month(x); day(x)
x + days(30)
floor_date(x, "month")
interval(start, end) / days(1)
PYTHON (pandas)
pd.to_datetime("2024-03-15")
pd.Timestamp.today()
x = pd.to_datetime("2024-03-15")
x.year; x.month; x.day
x + pd.Timedelta(days=30)
x.to_period("M").to_timestamp()
(end - start).days
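
The pandas column above operates on a single Timestamp. On a whole datetime column you need the .dt accessor, the date-time sibling of .str (assuming a column named date):

PYTHON
df["date"] = pd.to_datetime(df["date"])
df["date"].dt.year                   # like year(df$date)
df["date"].dt.quarter
df["date"] + pd.Timedelta(days=30)   # shift every row by 30 days
df["date"].dt.to_period("M")         # month bucket, like floor_date(x, "month")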

8.3 Factors → Categoricals

R (forcats)
factor(x, levels = c("Low","Med","High"))
fct_relevel(x, "High", "Med", "Low")
fct_reorder(x, value, .fun = median)
fct_lump_n(x, n = 5)
PYTHON
pd.Categorical(x, categories=["Low","Med","High"],
               ordered=True)
df["col"] = df["col"].cat.reorder_categories(
    ["High","Med","Low"])
# Reorder by value — set after sorting
# Lump — use value_counts + threshold
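
Those last two comments deserve real code. One hedged way to do each, assuming a column col and a numeric column value:

PYTHON
# fct_reorder: order categories by the median of another column
order = df.groupby("col")["value"].median().sort_values().index
df["col"] = pd.Categorical(df["col"], categories=order, ordered=True)

# fct_lump_n: keep the 5 most frequent levels, lump the rest
# (do this on the plain string column, before converting to Categorical)
top5 = df["col"].value_counts().nlargest(5).index
df["col"] = df["col"].where(df["col"].isin(top5), other="Other")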

Chapter 9 — Reading & Writing Data

readr, readxl, haven → pandas read_* functions.

Format: R → Python / pandas
CSV: read_csv("f.csv") → pd.read_csv("f.csv")
TSV: read_tsv("f.tsv") → pd.read_csv("f.tsv", sep="\t")
Excel: readxl::read_excel("f.xlsx") → pd.read_excel("f.xlsx")
JSON: jsonlite::read_json("f.json") → pd.read_json("f.json")
Parquet: arrow::read_parquet("f.parquet") → pd.read_parquet("f.parquet")
SPSS / Stata: haven::read_sav() → pd.read_spss() / pd.read_stata()
SQL: DBI::dbGetQuery() → pd.read_sql(query, conn)
Clipboard: readr::clipboard() → pd.read_clipboard()
PYTHON — Common read_csv options
df = pd.read_csv(
    "data.csv",
    usecols=["name", "age", "score"],  # select columns on read
    dtype={"age": int},                 # force types
    parse_dates=["date_col"],           # auto-parse dates
    na_values=["", "NA", "N/A", "-"],   # custom NA strings
    nrows=1000,                         # read only first 1000
    encoding="utf-8",                   # encoding
)

# Writing
df.to_csv("output.csv", index=False)     # index=False prevents row numbers
df.to_excel("output.xlsx", index=False)
df.to_parquet("output.parquet")           # fast, compressed — prefer this!
Parquet: Use It
Parquet is 5–10× smaller than CSV, 10–50× faster to read, and preserves data types. If you're doing ETL, always store intermediate files as Parquet.

Chapter 10 — ETL Pipelines

Extract, Transform, Load — building real data workflows in Python.

10.1 ETL Pattern

PYTHON — Full ETL Pipeline
import pandas as pd
import numpy as np
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ── EXTRACT ──────────────────────────────────────────
def extract(source_dir: str) -> pd.DataFrame:
    """Read all CSVs from a directory and combine."""
    files = list(Path(source_dir).glob("*.csv"))
    logger.info(f"Found {len(files)} files")

    dfs = []
    for f in files:
        df = pd.read_csv(f, parse_dates=["date"])
        df["source_file"] = f.name         # track provenance
        dfs.append(df)

    raw = pd.concat(dfs, ignore_index=True)
    logger.info(f"Extracted {len(raw):,} rows")
    return raw


# ── TRANSFORM ────────────────────────────────────────
def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, validate, and enrich the data."""
    return (df
        # 1. Standardise column names
        .rename(columns=str.lower)
        .rename(columns=lambda c: c.strip().replace(" ", "_"))

        # 2. Remove duplicates
        .drop_duplicates(subset=["id", "date"])

        # 3. Handle missing values
        .dropna(subset=["id"])               # must have ID
        .assign(
            amount=lambda x: x["amount"].fillna(0),
            category=lambda x: x["category"].fillna("Unknown"),
        )

        # 4. Type enforcement
        .astype({"category": "category"})

        # 5. Business logic
        .assign(
            year=lambda x: x["date"].dt.year,
            month=lambda x: x["date"].dt.month,
            quarter=lambda x: x["date"].dt.quarter,
            amount_gbp=lambda x: x["amount"] * 0.79,  # USD→GBP
            size_bucket=lambda x: pd.cut(
                x["amount"],
                bins=[0, 100, 1000, 10000, np.inf],
                labels=["Small", "Medium", "Large", "Enterprise"]
            ),
        )

        # 6. Filter invalid records
        .query("amount >= 0")

        # 7. Sort
        .sort_values(["date", "id"])
        .reset_index(drop=True)
    )


# ── LOAD ─────────────────────────────────────────────
def load(df: pd.DataFrame, output_path: str):
    """Save clean data to Parquet."""
    df.to_parquet(output_path, index=False)
    logger.info(f"Loaded {len(df):,} rows → {output_path}")


# ── ORCHESTRATE ──────────────────────────────────────
if __name__ == "__main__":
    raw = extract("data/raw/")
    clean = transform(raw)
    load(clean, "data/processed/clean_data.parquet")

10.2 Validation Layer (Quality Checks)

PYTHON
def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Run data quality checks. Raise on critical failures."""
    checks = {
        "no_nulls_in_id":    df["id"].notna().all(),
        "positive_amounts":  (df["amount"] >= 0).all(),
        "valid_dates":       df["date"].between("2020-01-01", "2026-12-31").all(),
        "no_duplicates":     not df.duplicated(subset=["id", "date"]).any(),
        "row_count_min":     len(df) >= 100,
    }

    for check_name, passed in checks.items():
        status = "✓ PASS" if passed else "✗ FAIL"
        logger.info(f"  {status}: {check_name}")

    failures = [k for k, v in checks.items() if not v]
    if failures:
        raise ValueError(f"Validation failed: {failures}")

    return df

# Insert into pipeline:
# raw = extract(...)
# clean = transform(raw)
# validated = validate(clean)     ← new
# load(validated, ...)

10.3 The R Comparison

R — Typical pipeline
library(tidyverse)
library(fs)
library(arrow)   # for write_parquet()

raw <- dir_ls("data/raw", glob = "*.csv") %>%
  map_dfr(read_csv)

clean <- raw %>%
  janitor::clean_names() %>%
  distinct(id, date, .keep_all = TRUE) %>%
  filter(!is.na(id), amount >= 0) %>%
  mutate(
    year = year(date),
    quarter = quarter(date),
    size = cut(amount,
      breaks = c(0, 100, 1000, Inf),
      labels = c("S", "M", "L"))
  )

write_parquet(clean, "data/clean.parquet")
PYTHON — Same pipeline
import pandas as pd
import numpy as np   # for np.inf below
from pathlib import Path

files = Path("data/raw").glob("*.csv")
raw = pd.concat(
    [pd.read_csv(f) for f in files],
    ignore_index=True
)

clean = (raw
    .rename(columns=str.lower)
    .drop_duplicates(subset=["id", "date"])
    .dropna(subset=["id"])
    .query("amount >= 0")
    .assign(
        year=lambda x: x["date"].dt.year,
        quarter=lambda x: x["date"].dt.quarter,
        size=lambda x: pd.cut(x["amount"],
            bins=[0,100,1000,np.inf],
            labels=["S","M","L"]),
    )
)

clean.to_parquet("data/clean.parquet")

Chapter 11 — APIs & Web Scraping

httr/rvest → requests/BeautifulSoup.

11.1 Calling APIs

R
library(httr)
library(jsonlite)

resp <- GET(
  "https://api.example.com/data",
  add_headers(Authorization = "Bearer xyz"),
  query = list(limit = 100)
)

data <- content(resp, as = "parsed")
df <- as_tibble(data$results)
PYTHON
import requests
import pandas as pd

resp = requests.get(
    "https://api.example.com/data",
    headers={"Authorization": "Bearer xyz"},
    params={"limit": 100}
)

data = resp.json()
df = pd.DataFrame(data["results"])
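
Real payloads are rarely flat. pd.json_normalize() unpacks nested objects into separate columns; the field names below are illustrative:

PYTHON
# e.g. {"results": [{"id": 1, "user": {"name": "Ada", "city": "London"}}, ...]}
df = pd.json_normalize(data["results"], sep="_")
# columns: id, user_name, user_city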

11.2 Web Scraping

PYTHON
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, "html.parser")

# Find all tables
tables = pd.read_html(resp.text)   # pandas can parse HTML tables!
gdp_df = tables[0]                 # first table on the page

# Or parse manually for more control
rows = []
for tr in soup.select("table.wikitable tr")[1:]:
    cells = [td.text.strip() for td in tr.find_all(["td", "th"])]
    rows.append(cells)

df = pd.DataFrame(rows, columns=["Rank", "Country", "GDP"])

11.3 Paginating Through an API

PYTHON — Real-world pagination pattern
def fetch_all_pages(base_url: str, api_key: str) -> pd.DataFrame:
    """Paginate through a REST API, collecting all results."""
    all_data = []
    page = 1

    while True:
        resp = requests.get(
            base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            params={"page": page, "per_page": 100}
        )
        resp.raise_for_status()  # raise exception on HTTP errors

        data = resp.json()
        results = data.get("results", [])

        if not results:
            break

        all_data.extend(results)
        page += 1

        if page > data.get("total_pages", 1):
            break

    return pd.DataFrame(all_data)

Chapter 12 — SQL from Python

DBI/dbplyr → SQLAlchemy/pandas.

R
library(DBI)

con <- dbConnect(
  RPostgres::Postgres(),
  host = "localhost",
  dbname = "mydb",
  user = "user",
  password = "pass"
)

df <- dbGetQuery(con, "
  SELECT dept, AVG(salary)
  FROM employees
  WHERE active = true
  GROUP BY dept
")

dbDisconnect(con)
PYTHON
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    "postgresql://user:pass@localhost/mydb"
)

df = pd.read_sql("""
    SELECT dept, AVG(salary)
    FROM employees
    WHERE active = true
    GROUP BY dept
""", engine)

engine.dispose()
PYTHON — Writing DataFrames to SQL
# Write a DataFrame to a database table
df.to_sql(
    "clean_employees",      # table name
    engine,
    if_exists="replace",    # "append", "replace", or "fail"
    index=False,
    method="multi",         # faster batch insert
    chunksize=5000
)

# Supabase / PostgreSQL with connection pooling
engine = create_engine(
    "postgresql://user:pass@db.xyz.supabase.co:5432/postgres",
    pool_size=5,
    max_overflow=10
)
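
When queries take runtime input, bind parameters instead of pasting strings into SQL (the same hygiene DBI's dbBind gives you). A sketch using SQLAlchemy's named parameters; the filter values are illustrative:

PYTHON
from sqlalchemy import text

df = pd.read_sql(
    text("""
        SELECT * FROM employees
        WHERE dept = :dept AND salary >= :min_salary
    """),
    engine,
    params={"dept": "Research", "min_salary": 50000},
)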

Chapter 13 — Matplotlib & Seaborn

ggplot2 → matplotlib + seaborn (the grammar of graphics, translated).

13.1 The Mental Model

ggplot2 Concept → matplotlib / seaborn
ggplot(df, aes(x, y)) → fig, ax = plt.subplots()
geom_point() → ax.scatter(x, y) or sns.scatterplot()
geom_line() → ax.plot(x, y) or sns.lineplot()
geom_bar() → ax.bar(x, y) or sns.barplot()
geom_histogram() → ax.hist(x) or sns.histplot()
geom_boxplot() → sns.boxplot()
facet_wrap(~var) → sns.FacetGrid() or fig, axes = plt.subplots(1, 3)
labs(title, x, y) → ax.set(title=, xlabel=, ylabel=)
theme_minimal() → plt.style.use('seaborn-v0_8-whitegrid')
scale_fill_brewer() → palette="Set2" in seaborn
ggsave("plot.png") → plt.savefig("plot.png", dpi=150, bbox_inches="tight")

13.2 Side-by-Side Example

R (ggplot2)
ggplot(df, aes(x = age, y = score,
               colour = dept)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm") +
  labs(
    title = "Score by Age",
    x = "Age", y = "Score"
  ) +
  theme_minimal() +
  scale_colour_brewer(palette = "Set2")
PYTHON (seaborn)
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(
    data=df, x="age", y="score",
    hue="dept", alpha=0.7, s=60,
    palette="Set2", ax=ax
)
sns.regplot(
    data=df, x="age", y="score",
    scatter=False, ax=ax, color="grey"
)
ax.set(title="Score by Age",
       xlabel="Age", ylabel="Score")
sns.despine()
plt.tight_layout()
plt.savefig("scatter.png", dpi=150)

13.3 Common Chart Recipes

PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')

# ── Bar Chart ──
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=df, x="dept", y="score", palette="viridis", ax=ax)
ax.set_title("Average Score by Department", fontweight="bold")
ax.bar_label(ax.containers[0], fmt="%.1f")  # value labels on bars

# ── Histogram ──
fig, ax = plt.subplots()
sns.histplot(df["score"], bins=20, kde=True, ax=ax)

# ── Faceted chart (like facet_wrap) ──
g = sns.FacetGrid(df, col="dept", col_wrap=3, height=4)
g.map_dataframe(sns.scatterplot, x="age", y="score")
g.set_titles("{col_name}")

# ── Heatmap (correlation) ──
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df.select_dtypes("number").corr(),
            annot=True, fmt=".2f", cmap="RdBu_r",
            center=0, ax=ax)

Chapter 14 — Plotly Interactive

plotly in R → plotly in Python (nearly identical!).

R
library(plotly)

plot_ly(df,
  x = ~date, y = ~value,
  color = ~group,
  type = "scatter",
  mode = "lines"
) %>%
  layout(title = "Trend")
PYTHON
import plotly.express as px

fig = px.line(
    df,
    x="date", y="value",
    color="group",
    title="Trend"
)
fig.show()
fig.write_html("trend.html")
Plotly Express
plotly.express (imported as px) is the high-level API — one function per chart type, very similar to ggplot. Use plotly.graph_objects for full customisation.
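
For a taste of the lower-level API, here is the same trend chart rebuilt with graph_objects, one explicit trace per group:

PYTHON
import plotly.graph_objects as go

fig = go.Figure()
for name, g in df.groupby("group"):
    fig.add_trace(go.Scatter(
        x=g["date"], y=g["value"],
        mode="lines", name=str(name)
    ))
fig.update_layout(title="Trend")
fig.show()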

Chapter 15 — Statistics & Hypothesis Tests

stats::t.test, chisq.test → scipy.stats.

R
# t-test
t.test(score ~ group, data = df)

# Chi-squared
chisq.test(table(df$dept, df$grade))

# Correlation
cor.test(df$age, df$score)

# Shapiro-Wilk normality
shapiro.test(df$score)

# Mann-Whitney U
wilcox.test(score ~ group, data = df)

# Linear regression
model <- lm(score ~ age + dept, data = df)
summary(model)
PYTHON
from scipy import stats

# t-test (scipy wants the two samples, not a formula;
# assuming group takes values "A" and "B")
group_a = df.loc[df["group"] == "A", "score"]
group_b = df.loc[df["group"] == "B", "score"]
stats.ttest_ind(group_a, group_b)

# Chi-squared
ct = pd.crosstab(df["dept"], df["grade"])
stats.chi2_contingency(ct)

# Correlation
stats.pearsonr(df["age"], df["score"])

# Shapiro-Wilk normality
stats.shapiro(df["score"])

# Mann-Whitney U
stats.mannwhitneyu(group_a, group_b)

# Linear regression
import statsmodels.formula.api as smf
model = smf.ols("score ~ age + dept", data=df)
results = model.fit()
print(results.summary())

15.1 Effect Sizes (Don't Just Report p-values)

PYTHON
# Cohen's d (effect size for t-test)
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    pooled_std = np.sqrt(((n1-1)*group1.std()**2 + (n2-1)*group2.std()**2) / (n1+n2-2))
    return (group1.mean() - group2.mean()) / pooled_std

d = cohens_d(group_a, group_b)
# |d| < 0.2 = small, 0.5 = medium, 0.8 = large

# Confidence intervals (bootstrap)
from scipy.stats import bootstrap
rng = np.random.default_rng(42)
result = bootstrap((df["score"].values,), np.mean,
                    confidence_level=0.95, random_state=rng)
print(f"95% CI: ({result.confidence_interval.low:.2f}, "
      f"{result.confidence_interval.high:.2f})")

Chapter 16 — Regression & Machine Learning

caret/tidymodels → scikit-learn.

R (tidymodels)
library(tidymodels)

# Split
split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)

# Recipe + model
rec <- recipe(score ~ ., data = train) %>%
  step_dummy(all_nominal()) %>%
  step_normalize(all_numeric_predictors())

model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(score ~ ., data = juice(prep(rec)))

# Predict
preds <- predict(model, bake(prep(rec), test))
PYTHON (scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error

# Split
X = df.drop(columns=["score"])
y = df["score"]
X = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Pipeline (recipe + model)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])
pipe.fit(X_train, y_train)

# Predict
preds = pipe.predict(X_test)
print(f"R² = {r2_score(y_test, preds):.3f}")

Chapter 17 — Functions & Scripts

Writing reusable code — functions, error handling, and CLI scripts.

17.1 Functions

R
clean_names <- function(df,
                        case = "snake") {
  df %>%
    janitor::clean_names(case = case)
}

# Default arguments
calc_rate <- function(n, total,
                      digits = 2) {
  round(n / total * 100, digits)
}
PYTHON
def clean_names(df: pd.DataFrame,
                case: str = "snake") -> pd.DataFrame:
    """Standardise column names."""
    cols = df.columns.str.strip().str.lower()
    cols = cols.str.replace(" ", "_")
    return df.set_axis(cols, axis=1)

# Default arguments + type hints
def calc_rate(n: int, total: int,
              digits: int = 2) -> float:
    return round(n / total * 100, digits)

17.2 Error Handling

R
result <- tryCatch(
  {
    risky_operation()
  },
  error = function(e) {
    message("Error: ", e$message)
    NA
  },
  warning = function(w) {
    message("Warning: ", w$message)
  }
)
PYTHON
try:
    result = risky_operation()
except FileNotFoundError as e:
    print(f"File missing: {e}")
    result = None
except ValueError as e:
    print(f"Bad value: {e}")
    result = None
except Exception as e:
    print(f"Unexpected: {e}")
    raise  # re-raise unexpected errors
finally:
    cleanup()  # always runs

17.3 Command-Line Scripts

PYTHON — CLI script with argparse
#!/usr/bin/env python3
"""ETL script that can be run from the command line."""
import argparse
import pandas as pd

def main():
    parser = argparse.ArgumentParser(description="Run ETL pipeline")
    parser.add_argument("input", help="Path to input CSV")
    parser.add_argument("-o", "--output", default="output.parquet")
    parser.add_argument("--year", type=int, help="Filter to year")
    args = parser.parse_args()

    df = pd.read_csv(args.input)
    if args.year:
        df = df.query("year == @args.year")
    df.to_parquet(args.output, index=False)
    print(f"✓ Saved {len(df):,} rows → {args.output}")

if __name__ == "__main__":
    main()

# Usage: python etl.py data.csv -o clean.parquet --year 2024

Chapter 18 — Packaging & Environments

renv → venv/pip, R packages → Python packages.

18.1 Virtual Environments

R (renv)
renv::init()        # create lockfile
renv::install("dplyr")
renv::snapshot()    # save state
renv::restore()     # reproduce
PYTHON (venv + pip)
python -m venv .venv
source .venv/bin/activate
pip install pandas
pip freeze > requirements.txt  # save
pip install -r requirements.txt # reproduce

18.2 Project Structure

PYTHON — Standard project layout
my_project/
├── data/
│   ├── raw/              # never modify raw data
│   └── processed/        # ETL output goes here
├── notebooks/
│   └── exploration.ipynb # Jupyter notebooks
├── src/
│   ├── __init__.py       # makes it a package
│   ├── extract.py
│   ├── transform.py
│   └── load.py
├── tests/
│   └── test_transform.py
├── requirements.txt      # like renv.lock
├── .gitignore
└── README.md
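
The tests/ folder pairs with pytest (pip install pytest, then run pytest from the project root). A minimal sketch of what test_transform.py might contain, assuming the Chapter 10 transform() lives in src/transform.py:

PYTHON
# tests/test_transform.py
import pandas as pd
from src.transform import transform

def test_transform_drops_missing_ids_and_negatives():
    raw = pd.DataFrame({
        "id": [1, None, 3],
        "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
        "amount": [100.0, 50.0, -5.0],
        "category": ["A", None, "B"],
    })
    clean = transform(raw)
    assert clean["id"].notna().all()        # rows without an ID are dropped
    assert (clean["amount"] >= 0).all()     # negative amounts are filtered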

Chapter 19 — Complete Cheat Sheet

The full R → Python translation table. Bookmark this.

Operation: R / tidyverse → Python / pandas
Read CSV: read_csv("f.csv") → pd.read_csv("f.csv")
Write CSV: write_csv(df, "f.csv") → df.to_csv("f.csv", index=False)
View structure: glimpse(df) → df.info()
Summary stats: summary(df) → df.describe()
Select cols: select(name, age) → df[["name", "age"]]
Drop cols: select(-age) → df.drop(columns=["age"])
Rename: rename(new = old) → df.rename(columns={"old": "new"})
Filter rows: filter(age > 30) → df.query("age > 30")
Sort: arrange(desc(age)) → df.sort_values("age", ascending=False)
New column: mutate(x = a + b) → df.assign(x=lambda d: d["a"] + d["b"])
case_when: case_when(...) → np.select(conditions, choices)
Group + summarise: group_by(g) %>% summarise(m = mean(x)) → df.groupby("g").agg(m=("x", "mean"))
Window (grouped mutate): group_by(g) %>% mutate(m = mean(x)) → df.groupby("g")["x"].transform("mean")
Ungroup: ungroup() → .reset_index()
Count: count(col) → df["col"].value_counts()
Distinct: distinct() → df.drop_duplicates()
Left join: left_join(a, b, by = "id") → a.merge(b, on="id", how="left")
Pivot longer: pivot_longer() → df.melt()
Pivot wider: pivot_wider() → df.pivot_table()
Missing count: sum(is.na(x)) → df["x"].isna().sum()
Fill NA: replace_na(x, 0) → df["x"].fillna(0)
String lower: str_to_lower(x) → df["x"].str.lower()
Date parse: ymd("2024-01-01") → pd.to_datetime("2024-01-01")
Year extract: year(date) → df["date"].dt.year
Apply func: map_dbl(x, fn) → df["x"].apply(fn)
Row-wise: rowwise() %>% mutate(...) → df.apply(fn, axis=1)
Bind rows: bind_rows(a, b) → pd.concat([a, b])
Bind cols: bind_cols(a, b) → pd.concat([a, b], axis=1)
You've Got This
You already have the hardest part — the analytical thinking. Python is just new syntax on familiar concepts. Start by rewriting one R script in Python, then keep going. Every tidyverse verb has a pandas equivalent, and you now have the map.

Built for Meron · R → Python · 2026