Python for Tidyverse Users
A complete, side-by-side course translating your R/tidyverse fluency into Python/pandas mastery — from first line to production ETL.
Chapter 0 — Why Learn Python When You Know R?
You already think in tidy data. Now add a second superpower.
As a tidyverse user you already understand the hardest concepts: tidy data principles, piped transformations, grouped operations, and the grammar of graphics. Python won't replace R — it will complement it. Here's why it's worth adding:
- Engineering integration — Most production data systems, APIs, and cloud services have Python as a first-class citizen.
- ML ecosystem — scikit-learn, PyTorch, TensorFlow, and Hugging Face are Python-native.
- Web apps & SaaS — Flask, FastAPI, Streamlit, and Django let you ship products.
- Job market — Python consistently tops hiring requirements for data & analytics roles in the UK.
Chapter 1 — Setup & Environment
Getting your Python workspace running — the equivalent of installing R + RStudio + tidyverse.
1.1 The R → Python Rosetta Stone (Tools)
| R World | Python Equivalent | Notes |
|---|---|---|
| R | python | The language itself |
| RStudio | VS Code or JupyterLab | VS Code + Python extension is closest to RStudio |
| CRAN | PyPI | Package repository |
| install.packages() | pip install | Package manager |
| library() | import | Load a package |
| renv | venv / conda | Virtual environments |
| .Rmd / .qmd | .ipynb (Jupyter) / .qmd | Quarto supports both! |
| tidyverse | pandas + numpy | Core data stack |
| ggplot2 | matplotlib + seaborn / plotly | Visualisation |
1.2 Installation
# Install Python (Mac/Linux usually have it; Windows → python.org)
# Then create a virtual environment (like renv):
python -m venv myproject_env
source myproject_env/bin/activate # Mac/Linux
myproject_env\Scripts\activate # Windows
# Install the "tidyverse" of Python:
pip install pandas numpy matplotlib seaborn scipy scikit-learn
pip install openpyxl xlsxwriter # Excel support
pip install sqlalchemy # SQL support
pip install requests beautifulsoup4 # APIs & scraping
pip install jupyterlab # Notebook interface
1.3 Your First Session
library(tidyverse)
# Everything is loaded — dplyr,
# ggplot2, tidyr, readr, etc.
# Interactive: RStudio console
# Scripts: .R files
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Convention: short aliases
# pd, np, plt — you'll see these everywhere
# Interactive: Jupyter notebook
# Scripts: .py files
import pandas as pd is universal — every tutorial, every StackOverflow answer uses pd. Same for np (numpy), plt (matplotlib), sns (seaborn). Don't fight it!
Chapter 2 — Python Syntax Primer
The biggest "gotchas" when switching from R to Python.
2.1 Indentation Matters (No Curly Braces)
# Braces define blocks
if (x > 10) {
print("big")
} else {
print("small")
}
for (i in 1:5) {
print(i)
}
# Indentation defines blocks (4 spaces)
if x > 10:
    print("big")
else:
    print("small")
for i in range(1, 6):  # 1 to 5
    print(i)
2.2 Zero-Indexed (The Big Difference)
x <- c(10, 20, 30)
x[1] # 10 (1-indexed)
x[2:3] # 20, 30 (inclusive)
x = [10, 20, 30]
x[0] # 10 (0-indexed!)
x[1:3] # [20, 30] (end exclusive)
x[1:3] returns items at index 1 and 2 (not 3). This is the #1 source of bugs for R users switching to Python.
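A few more translations, as a quick sketch (R equivalents in the comments), help the new rule stick:
x = [10, 20, 30, 40, 50]
x[0]       # first element                    (R: x[1])
x[-1]      # last element                     (R: x[length(x)])
x[1:4]     # indices 1, 2, 3 → [20, 30, 40]   (R: x[2:4])
x[:3]      # first three → [10, 20, 30]       (R: head(x, 3))
x[2:]      # index 2 to end → [30, 40, 50]    (R: x[3:length(x)])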
2.3 Key Syntax Differences
| Concept | R | Python |
|---|---|---|
| Assignment | x <- 5 or x = 5 | x = 5 only |
| Print | print(x) or just x | print(x) required in scripts |
| Boolean | TRUE / FALSE | True / False (capitalised) |
| NULL / None | NULL | None |
| NA / NaN | NA | np.nan or pd.NA |
| Not equal | != | != (same) |
| And / Or | & / \| | and / or (scalar), & / \| (vectorised) |
| Pipe | %>% or \|> | . method chaining |
| String | "hi" or 'hi' | "hi" or 'hi' (same) |
| f-string | glue("Hello {name}") | f"Hello {name}" |
| Comment | # comment | # comment (same) |
| Negation | !x | not x |
| Membership | x %in% c(1,2,3) | x in [1, 2, 3] |
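A small illustrative snippet touching several rows of the table above (the variable names are made up for the example):
import numpy as np

name = "Meron"
greeting = f"Hello {name}"              # f-string, like glue()
is_valid = True                         # booleans are capitalised
result = None                           # None plays the role of NULL
missing = np.nan                        # np.nan for numeric missing values

x = 15
if x > 10 and x < 20:                   # scalar and / or
    print(greeting)

scores = np.array([85, 92, 61])
mask = (scores > 60) & ~(scores > 90)   # vectorised & and ~ (negation)
print(3 in [1, 2, 3])                   # membership, like %in%; prints True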
2.4 The Pipe → Method Chaining
This is the conceptual leap. In R you pipe data through functions. In Python you chain methods on the object:
df %>%
filter(age > 30) %>%
mutate(decade = age %/% 10) %>%
group_by(decade) %>%
summarise(avg_income = mean(income)) %>%
arrange(desc(avg_income))
(df
.query("age > 30")
.assign(decade=lambda x: x["age"] // 10)
.groupby("decade")
.agg(avg_income=("income", "mean"))
.sort_values("avg_income", ascending=False)
)
Wrap the whole chain in parentheses ( ) to break it across lines without backslashes. This is the Python equivalent of the pipe's fluency.
Chapter 3 — Data Types & Structures
Lists, dictionaries, tuples — core building blocks with no direct tidyverse counterpart.
3.1 Core Data Types
| R Type | Python Type | Example / Notes |
|---|---|---|
| numeric | int / float | 42, 3.14 |
| character | str | "hello" |
| logical | bool | True, False |
| c(1, 2, 3) (vector) | [1, 2, 3] (list) | Ordered, mutable |
| list(a=1, b=2) (named list) | {"a": 1, "b": 2} (dict) | Key-value pairs |
| no equivalent | (1, 2, 3) (tuple) | Ordered, immutable |
| c(1, 2, 3) (vector) | np.array([1,2,3]) | Vectorised operations |
| data.frame / tibble | pd.DataFrame | The workhorse |
3.2 The Big Insight: Python Lists ≠ R Vectors
x <- c(1, 2, 3, 4, 5)
x * 2 # 2, 4, 6, 8, 10 ✓
x[x > 3] # 4, 5 ✓
mean(x) # 3 ✓
x = [1, 2, 3, 4, 5]
# x * 2 → [1,2,3,4,5,1,2,3,4,5] 😱
# Use numpy instead:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
x * 2 # array([2,4,6,8,10]) ✓
x[x > 3] # array([4, 5]) ✓
x.mean() # 3.0 ✓
3.3 Dictionaries (Your New Best Friend)
# Dicts are like R named lists but much more central to Python
person = {
"name": "Meron",
"role": "Analyst",
"skills": ["R", "Python", "SQL"]
}
person["name"] # "Meron"
person["skills"][0] # "R" (0-indexed!)
person.get("age", "unknown") # safe access with default
# Dicts → DataFrames (very common pattern)
data = {
"name": ["Alice", "Bob", "Carol"],
"age": [32, 45, 28],
"dept": ["Research", "Engineering", "Research"]
}
df = pd.DataFrame(data) # instant DataFrame!
3.4 List Comprehensions (R's map() on steroids)
# purrr::map
map_dbl(1:5, ~ .x^2)
# [1] 1 4 9 16 25
# Keep evens
keep(1:10, ~ .x %% 2 == 0)
# List comprehension
[x**2 for x in range(1, 6)]
# [1, 4, 9, 16, 25]
# With condition
[x for x in range(1, 11) if x % 2 == 0]
# [2, 4, 6, 8, 10]
Chapter 4 — Pandas = dplyr + tidyr
The single library that replaces most of the tidyverse.
4.1 Creating DataFrames
df <- tibble(
name = c("Alice", "Bob", "Carol"),
age = c(32, 45, 28),
score = c(88, 92, 85)
)
glimpse(df)
summary(df)
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol"],
"age": [32, 45, 28],
"score": [88, 92, 85]
})
df.info() # like glimpse()
df.describe() # like summary()
4.2 Exploring Data
| R / tidyverse | Python / pandas | Purpose |
|---|---|---|
| glimpse(df) | df.info() | Column types & non-nulls |
| head(df) | df.head() | First rows |
| tail(df) | df.tail() | Last rows |
| nrow(df) | len(df) | Row count |
| ncol(df) | df.shape[1] | Column count |
| dim(df) | df.shape | (rows, cols) tuple |
| names(df) | df.columns.tolist() | Column names |
| summary(df) | df.describe() | Numeric summary |
| count(df, col) | df["col"].value_counts() | Frequency table |
| n_distinct(df$col) | df["col"].nunique() | Unique count |
| is.na(df) %>% colSums() | df.isna().sum() | Missing per column |
| df %>% distinct() | df.drop_duplicates() | Remove duplicates |
Chapter 5 — Selecting & Filtering
The dplyr verbs select() and filter() in pandas.
5.1 Selecting Columns
# Single column
df %>% select(name)
# Multiple columns
df %>% select(name, age)
# By pattern
df %>% select(starts_with("score"))
df %>% select(where(is.numeric))
# Drop columns
df %>% select(-age)
# Rename
df %>% rename(full_name = name)
# Single column (returns Series)
df["name"]
# Multiple columns (returns DataFrame)
df[["name", "age"]]
# By pattern
df.filter(regex="^score")
df.select_dtypes(include="number")
# Drop columns
df.drop(columns=["age"])
# Rename
df.rename(columns={"name": "full_name"})
5.2 Filtering Rows
# Simple filter
df %>% filter(age > 30)
# Multiple conditions
df %>% filter(age > 30, score > 85)
# OR condition
df %>% filter(age > 30 | score > 90)
# %in% check
df %>% filter(name %in% c("Alice","Bob"))
# Missing values
df %>% filter(!is.na(score))
# Slice
df %>% slice(1:3)
df %>% slice_max(score, n = 2)
# Simple filter
df[df["age"] > 30] # boolean mask
df.query("age > 30") # query string
# Multiple conditions
df.query("age > 30 and score > 85")
# OR condition
df.query("age > 30 or score > 90")
# isin check
df[df["name"].isin(["Alice", "Bob"])]
# Missing values
df[df["score"].notna()] # or .dropna()
# Slice
df.iloc[0:3] # by position
df.nlargest(2, "score") # top N
df.query("age > 30") is cleaner for simple filters (like dplyr). Boolean masking df[df["age"] > 30] is more flexible for complex logic. Use .query() for readability in chains.
Chapter 6 — Mutating & Summarising
mutate(), summarise(), and group_by() translated to pandas.
6.1 Creating New Columns (mutate)
df %>%
mutate(
score_pct = score / 100,
age_group = case_when(
age < 30 ~ "Young",
age < 40 ~ "Mid",
TRUE ~ "Senior"
),
rank = row_number()
)
(df
.assign(
score_pct=lambda x: x["score"] / 100,
age_group=lambda x: np.select(
[x["age"] < 30, x["age"] < 40],
["Young", "Mid"],
default="Senior"
),
rank=lambda x: range(1, len(x) + 1)
)
)
# Or directly: df["score_pct"] = df["score"]/100
6.2 case_when → np.select / pd.cut
# case_when
df %>% mutate(
grade = case_when(
score >= 90 ~ "A",
score >= 80 ~ "B",
score >= 70 ~ "C",
TRUE ~ "F"
)
)
# cut for bins
df %>% mutate(
age_bin = cut(age, breaks = c(0,30,40,50,Inf))
)
# np.select (order matters!)
conditions = [
df["score"] >= 90,
df["score"] >= 80,
df["score"] >= 70,
]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, "F")
# pd.cut for bins
df["age_bin"] = pd.cut(
df["age"], bins=[0, 30, 40, 50, np.inf]
)
6.3 Summarising & Grouping
# Ungrouped summary
df %>% summarise(
avg_score = mean(score),
n = n()
)
# Grouped summary
df %>%
group_by(dept) %>%
summarise(
avg_score = mean(score),
max_age = max(age),
n = n()
) %>%
ungroup()
# Window function in mutate
df %>%
group_by(dept) %>%
mutate(
dept_avg = mean(score),
rank = rank(-score)
)
# Ungrouped summary
df.agg(
avg_score=("score", "mean"),
n=("score", "size")
)
# Grouped summary
(df
.groupby("dept")
.agg(
avg_score=("score", "mean"),
max_age=("age", "max"),
n=("name", "size")
)
.reset_index() # like ungroup()
)
# Window function (transform)
df["dept_avg"] = (
df.groupby("dept")["score"].transform("mean")
)
df["rank"] = (
df.groupby("dept")["score"].rank(ascending=False)
)
.agg() collapses rows (like summarise). .transform() keeps all rows (like mutate inside group_by). This is the key distinction.
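A quick way to see the difference, assuming the df with dept and score columns used above:
collapsed = df.groupby("dept").agg(avg_score=("score", "mean"))
print(collapsed.shape)               # one row per department (rows collapsed)

broadcast = df.groupby("dept")["score"].transform("mean")
print(len(broadcast) == len(df))     # True: one value per original row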
Chapter 7 — Joins & Reshaping
dplyr joins and tidyr pivots, translated.
7.1 Joins
left_join(df1, df2, by = "id")
inner_join(df1, df2, by = "id")
full_join(df1, df2, by = "id")
anti_join(df1, df2, by = "id")
# Different column names
left_join(df1, df2,
by = c("emp_id" = "employee_id")
)
df1.merge(df2, on="id", how="left")
df1.merge(df2, on="id", how="inner")
df1.merge(df2, on="id", how="outer")
df1.merge(df2, on="id", how="left",
indicator=True) # then filter
# Different column names
df1.merge(df2,
left_on="emp_id",
right_on="employee_id"
)
7.2 Anti-join Pattern in Pandas
# Anti-join: rows in df1 NOT in df2
merged = df1.merge(df2, on="id", how="left", indicator=True)
anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
7.3 Reshaping: Pivot Longer & Wider
# Long → same as gather
df %>% pivot_longer(
cols = c(q1, q2, q3),
names_to = "quarter",
values_to = "revenue"
)
# Wide → same as spread
df %>% pivot_wider(
names_from = quarter,
values_from = revenue
)
# Melt = pivot_longer
df.melt(
id_vars=["name"],
value_vars=["q1", "q2", "q3"],
var_name="quarter",
value_name="revenue"
)
# Pivot = pivot_wider
df.pivot_table(
index="name",
columns="quarter",
values="revenue"
).reset_index()
Chapter 8 — Strings, Dates & Factors
stringr, lubridate, forcats → pandas str/dt accessors.
8.1 String Operations
str_to_lower(x)
str_to_upper(x)
str_trim(x)
str_detect(x, "pattern")
str_replace(x, "old", "new")
str_sub(x, 1, 3)
str_c("a", "b", sep = "-")
str_extract(x, "\\d+")
df["col"].str.lower()
df["col"].str.upper()
df["col"].str.strip()
df["col"].str.contains("pattern")
df["col"].str.replace("old", "new")
df["col"].str[:3]
"a" + "-" + "b" # or "-".join(...)
df["col"].str.extract(r"(\d+)")
8.2 Dates & Times
ymd("2024-03-15")
today()
x <- ymd("2024-03-15")
year(x); month(x); day(x)
x + days(30)
floor_date(x, "month")
interval(start, end) / days(1)
pd.to_datetime("2024-03-15")
pd.Timestamp.today()
x = pd.to_datetime("2024-03-15")
x.year; x.month; x.day
x + pd.Timedelta(days=30)
x.to_period("M").to_timestamp()
(end - start).days
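The examples above operate on a single Timestamp; on a whole DataFrame column the same operations go through the .dt accessor. A sketch, assuming a column called date:
df["date"] = pd.to_datetime(df["date"])            # parse the whole column
df["year"] = df["date"].dt.year                    # like year(date)
df["month_name"] = df["date"].dt.month_name()      # like month(date, label = TRUE)
df["month_start"] = df["date"].dt.to_period("M").dt.to_timestamp()  # like floor_date(date, "month")
df["days_elapsed"] = (pd.Timestamp.today() - df["date"]).dt.days    # like today() - date, in days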
8.3 Factors → Categoricals
factor(x, levels = c("Low","Med","High"))
fct_relevel(x, "High", "Med", "Low")
fct_reorder(x, value, .fun = median)
fct_lump_n(x, n = 5)
pd.Categorical(x, categories=["Low","Med","High"],
ordered=True)
df["col"] = df["col"].cat.reorder_categories(
["High","Med","Low"])
# Reorder by value — set after sorting
# Lump — use value_counts + threshold
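There is no built-in fct_lump_n or fct_reorder, but both take only a couple of lines. A sketch, assuming a column col and a numeric column value:
# fct_lump_n(x, n = 5) equivalent: keep the 5 most common levels, lump the rest
top5 = df["col"].value_counts().nlargest(5).index
df["col"] = df["col"].astype(str).where(df["col"].isin(top5), "Other")

# fct_reorder(x, value, .fun = median) equivalent: order levels by another column
order = df.groupby("col")["value"].median().sort_values().index
df["col"] = pd.Categorical(df["col"], categories=order, ordered=True)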
Chapter 9 — Reading & Writing Data
readr, readxl, haven → pandas read_* functions.
| Format | R | Python / pandas |
|---|---|---|
| CSV | read_csv("f.csv") | pd.read_csv("f.csv") |
| TSV | read_tsv("f.tsv") | pd.read_csv("f.tsv", sep="\t") |
| Excel | readxl::read_excel("f.xlsx") | pd.read_excel("f.xlsx") |
| JSON | jsonlite::read_json("f.json") | pd.read_json("f.json") |
| Parquet | arrow::read_parquet("f.parquet") | pd.read_parquet("f.parquet") |
| SPSS / Stata | haven::read_sav() | pd.read_spss() / pd.read_stata() |
| SQL | DBI::dbGetQuery() | pd.read_sql(query, conn) |
| Clipboard | readr::clipboard() | pd.read_clipboard() |
df = pd.read_csv(
"data.csv",
usecols=["name", "age", "score"], # select columns on read
dtype={"age": int}, # force types
parse_dates=["date_col"], # auto-parse dates
na_values=["", "NA", "N/A", "-"], # custom NA strings
nrows=1000, # read only first 1000
encoding="utf-8", # encoding
)
# Writing
df.to_csv("output.csv", index=False) # index=False prevents row numbers
df.to_excel("output.xlsx", index=False)
df.to_parquet("output.parquet") # fast, compressed — prefer this!
Chapter 10 — ETL Pipelines
Extract, Transform, Load — building real data workflows in Python.
10.1 ETL Pattern
import pandas as pd
import numpy as np
from pathlib import Path
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# ── EXTRACT ──────────────────────────────────────────
def extract(source_dir: str) -> pd.DataFrame:
    """Read all CSVs from a directory and combine."""
    files = list(Path(source_dir).glob("*.csv"))
    logger.info(f"Found {len(files)} files")
    dfs = []
    for f in files:
        df = pd.read_csv(f, parse_dates=["date"])
        df["source_file"] = f.name  # track provenance
        dfs.append(df)
    raw = pd.concat(dfs, ignore_index=True)
    logger.info(f"Extracted {len(raw):,} rows")
    return raw
# ── TRANSFORM ────────────────────────────────────────
def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, validate, and enrich the data."""
    return (df
        # 1. Standardise column names
        .rename(columns=str.lower)
        .rename(columns=lambda c: c.strip().replace(" ", "_"))
        # 2. Remove duplicates
        .drop_duplicates(subset=["id", "date"])
        # 3. Handle missing values
        .dropna(subset=["id"])  # must have ID
        .assign(
            amount=lambda x: x["amount"].fillna(0),
            category=lambda x: x["category"].fillna("Unknown"),
        )
        # 4. Type enforcement
        .astype({"category": "category"})
        # 5. Business logic
        .assign(
            year=lambda x: x["date"].dt.year,
            month=lambda x: x["date"].dt.month,
            quarter=lambda x: x["date"].dt.quarter,
            amount_gbp=lambda x: x["amount"] * 0.79,  # USD→GBP
            size_bucket=lambda x: pd.cut(
                x["amount"],
                bins=[0, 100, 1000, 10000, np.inf],
                labels=["Small", "Medium", "Large", "Enterprise"]
            ),
        )
        # 6. Filter invalid records
        .query("amount >= 0")
        # 7. Sort
        .sort_values(["date", "id"])
        .reset_index(drop=True)
    )
# ── LOAD ─────────────────────────────────────────────
def load(df: pd.DataFrame, output_path: str):
    """Save clean data to Parquet."""
    df.to_parquet(output_path, index=False)
    logger.info(f"Loaded {len(df):,} rows → {output_path}")
# ── ORCHESTRATE ──────────────────────────────────────
if __name__ == "__main__":
    raw = extract("data/raw/")
    clean = transform(raw)
    load(clean, "data/processed/clean_data.parquet")
10.2 Validation Layer (Quality Checks)
def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Run data quality checks. Raise on critical failures."""
    checks = {
        "no_nulls_in_id": df["id"].notna().all(),
        "positive_amounts": (df["amount"] >= 0).all(),
        "valid_dates": df["date"].between("2020-01-01", "2026-12-31").all(),
        "no_duplicates": not df.duplicated(subset=["id", "date"]).any(),
        "row_count_min": len(df) >= 100,
    }
    for check_name, passed in checks.items():
        status = "✓ PASS" if passed else "✗ FAIL"
        logger.info(f" {status}: {check_name}")
    failures = [k for k, v in checks.items() if not v]
    if failures:
        raise ValueError(f"Validation failed: {failures}")
    return df
# Insert into pipeline:
# raw = extract(...)
# clean = transform(raw)
# validated = validate(clean) ← new
# load(validated, ...)
10.3 The R Comparison
library(tidyverse)
library(fs)
raw <- dir_ls("data/raw", glob = "*.csv") %>%
map_dfr(read_csv)
clean <- raw %>%
janitor::clean_names() %>%
distinct(id, date, .keep_all = TRUE) %>%
filter(!is.na(id), amount >= 0) %>%
mutate(
year = year(date),
quarter = quarter(date),
size = cut(amount,
breaks = c(0, 100, 1000, Inf),
labels = c("S", "M", "L"))
)
arrow::write_parquet(clean, "data/clean.parquet")
import pandas as pd
import numpy as np
from pathlib import Path
files = Path("data/raw").glob("*.csv")
raw = pd.concat(
    [pd.read_csv(f, parse_dates=["date"]) for f in files],
    ignore_index=True
)
clean = (raw
.rename(columns=str.lower)
.drop_duplicates(subset=["id", "date"])
.dropna(subset=["id"])
.query("amount >= 0")
.assign(
year=lambda x: x["date"].dt.year,
quarter=lambda x: x["date"].dt.quarter,
size=lambda x: pd.cut(x["amount"],
bins=[0,100,1000,np.inf],
labels=["S","M","L"]),
)
)
clean.to_parquet("data/clean.parquet")
Chapter 11 — APIs & Web Scraping
httr/rvest → requests/BeautifulSoup.
11.1 Calling APIs
library(httr)
library(jsonlite)
resp <- GET(
"https://api.example.com/data",
add_headers(Authorization = "Bearer xyz"),
query = list(limit = 100)
)
data <- content(resp, as = "parsed")
df <- as_tibble(data$results)
import requests
import pandas as pd
resp = requests.get(
"https://api.example.com/data",
headers={"Authorization": "Bearer xyz"},
params={"limit": 100}
)
data = resp.json()
df = pd.DataFrame(data["results"])
11.2 Web Scraping
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, "html.parser")
# Find all tables
tables = pd.read_html(resp.text) # pandas can parse HTML tables!
gdp_df = tables[0] # first table on the page
# Or parse manually for more control
rows = []
for tr in soup.select("table.wikitable tr")[1:]:
    cells = [td.text.strip() for td in tr.find_all(["td", "th"])]
    if cells:
        rows.append(cells)
df = pd.DataFrame(rows)  # then name the columns to match the table you scraped
11.3 Paginating Through an API
def fetch_all_pages(base_url: str, api_key: str) -> pd.DataFrame:
    """Paginate through a REST API, collecting all results."""
    all_data = []
    page = 1
    while True:
        resp = requests.get(
            base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            params={"page": page, "per_page": 100}
        )
        resp.raise_for_status()  # raise exception on HTTP errors
        data = resp.json()
        results = data.get("results", [])
        if not results:
            break
        all_data.extend(results)
        page += 1
        if page > data.get("total_pages", 1):
            break
    return pd.DataFrame(all_data)
Chapter 12 — SQL from Python
DBI/dbplyr → SQLAlchemy/pandas.
library(DBI)
con <- dbConnect(
RPostgres::Postgres(),
host = "localhost",
dbname = "mydb",
user = "user",
password = "pass"
)
df <- dbGetQuery(con, "
SELECT dept, AVG(salary)
FROM employees
WHERE active = true
GROUP BY dept
")
dbDisconnect(con)
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine(
"postgresql://user:pass@localhost/mydb"
)
df = pd.read_sql("""
SELECT dept, AVG(salary)
FROM employees
WHERE active = true
GROUP BY dept
""", engine)
engine.dispose()
# Write a DataFrame to a database table
df.to_sql(
"clean_employees", # table name
engine,
if_exists="replace", # "append", "replace", or "fail"
index=False,
method="multi", # faster batch insert
chunksize=5000
)
# Supabase / PostgreSQL with connection pooling
engine = create_engine(
"postgresql://user:pass@db.xyz.supabase.co:5432/postgres",
pool_size=5,
max_overflow=10
)
Chapter 13 — Matplotlib & Seaborn
ggplot2 → matplotlib + seaborn (the grammar of graphics, translated).
13.1 The Mental Model
| ggplot2 Concept | matplotlib/seaborn |
|---|---|
| ggplot(df, aes(x, y)) | fig, ax = plt.subplots() |
| geom_point() | ax.scatter(x, y) or sns.scatterplot() |
| geom_line() | ax.plot(x, y) or sns.lineplot() |
| geom_bar() | ax.bar(x, y) or sns.barplot() |
| geom_histogram() | ax.hist(x) or sns.histplot() |
| geom_boxplot() | sns.boxplot() |
| facet_wrap(~var) | sns.FacetGrid() or fig, axes = plt.subplots(1, 3) |
| labs(title, x, y) | ax.set(title=, xlabel=, ylabel=) |
| theme_minimal() | plt.style.use('seaborn-v0_8-whitegrid') |
| scale_fill_brewer() | palette="Set2" in seaborn |
| ggsave("plot.png") | plt.savefig("plot.png", dpi=150, bbox_inches="tight") |
13.2 Side-by-Side Example
ggplot(df, aes(x = age, y = score,
colour = dept)) +
geom_point(alpha = 0.7, size = 3) +
geom_smooth(method = "lm") +
labs(
title = "Score by Age",
x = "Age", y = "Score"
) +
theme_minimal() +
scale_colour_brewer(palette = "Set2")
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(
data=df, x="age", y="score",
hue="dept", alpha=0.7, s=60,
palette="Set2", ax=ax
)
sns.regplot(
data=df, x="age", y="score",
scatter=False, ax=ax, color="grey"
)
ax.set(title="Score by Age",
xlabel="Age", ylabel="Score")
sns.despine()
plt.tight_layout()
plt.savefig("scatter.png", dpi=150)
13.3 Common Chart Recipes
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')
# ── Bar Chart ──
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=df, x="dept", y="score", palette="viridis", ax=ax)
ax.set_title("Average Score by Department", fontweight="bold")
ax.bar_label(ax.containers[0], fmt="%.1f") # value labels on bars
# ── Histogram ──
fig, ax = plt.subplots()
sns.histplot(df["score"], bins=20, kde=True, ax=ax)
# ── Faceted chart (like facet_wrap) ──
g = sns.FacetGrid(df, col="dept", col_wrap=3, height=4)
g.map_dataframe(sns.scatterplot, x="age", y="score")
g.set_titles("{col_name}")
# ── Heatmap (correlation) ──
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df.select_dtypes("number").corr(),
annot=True, fmt=".2f", cmap="RdBu_r",
center=0, ax=ax)
Chapter 14 — Plotly Interactive
plotly in R → plotly in Python (nearly identical!).
library(plotly)
plot_ly(df,
x = ~date, y = ~value,
color = ~group,
type = "scatter",
mode = "lines"
) %>%
layout(title = "Trend")
import plotly.express as px
fig = px.line(
df,
x="date", y="value",
color="group",
title="Trend"
)
fig.show()
fig.write_html("trend.html")
plotly.express (imported as px) is the high-level API — one function per chart type, very similar to ggplot. Use plotly.graph_objects for full customisation.
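When you outgrow the express API, the same figure can be rebuilt trace by trace with plotly.graph_objects. A minimal sketch, assuming the same df with date, value and group columns:
import plotly.graph_objects as go

fig = go.Figure()
for name, grp in df.groupby("group"):
    fig.add_trace(go.Scatter(
        x=grp["date"], y=grp["value"],
        mode="lines", name=str(name)        # one trace per group
    ))
fig.update_layout(title="Trend", xaxis_title="date", yaxis_title="value")
fig.show()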
Chapter 15 — Statistics & Hypothesis Tests
stats::t.test, chisq.test → scipy.stats.
# t-test
t.test(score ~ group, data = df)
# Chi-squared
chisq.test(table(df$dept, df$grade))
# Correlation
cor.test(df$age, df$score)
# Shapiro-Wilk normality
shapiro.test(df$score)
# Mann-Whitney U
wilcox.test(score ~ group, data = df)
# Linear regression
model <- lm(score ~ age + dept, data = df)
summary(model)
from scipy import stats
# t-test
stats.ttest_ind(group_a, group_b)
# Chi-squared
ct = pd.crosstab(df["dept"], df["grade"])
stats.chi2_contingency(ct)
# Correlation
stats.pearsonr(df["age"], df["score"])
# Shapiro-Wilk normality
stats.shapiro(df["score"])
# Mann-Whitney U
stats.mannwhitneyu(group_a, group_b)
# Linear regression
import statsmodels.formula.api as smf
model = smf.ols("score ~ age + dept", data=df)
results = model.fit()
print(results.summary())
15.1 Effect Sizes (Don't Just Report p-values)
import numpy as np

# Cohen's d (effect size for t-test)
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    pooled_std = np.sqrt(
        ((n1 - 1) * group1.std() ** 2 + (n2 - 1) * group2.std() ** 2) / (n1 + n2 - 2)
    )
    return (group1.mean() - group2.mean()) / pooled_std
d = cohens_d(group_a, group_b)
# |d| < 0.2 = small, 0.5 = medium, 0.8 = large
# Confidence intervals (bootstrap)
from scipy.stats import bootstrap
rng = np.random.default_rng(42)
result = bootstrap((df["score"].values,), np.mean,
confidence_level=0.95, random_state=rng)
print(f"95% CI: ({result.confidence_interval.low:.2f}, "
f"{result.confidence_interval.high:.2f})")
Chapter 16 — Regression & Machine Learning
caret/tidymodels → scikit-learn.
library(tidymodels)
# Split
split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)
# Recipe + model
rec <- recipe(score ~ ., data = train) %>%
step_dummy(all_nominal()) %>%
step_normalize(all_numeric_predictors())
model <- linear_reg() %>%
set_engine("lm") %>%
fit(score ~ ., data = juice(prep(rec)))
# Predict
preds <- predict(model, bake(prep(rec), test))
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error
# Split
X = df.drop(columns=["score"])
y = df["score"]
X = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
# Pipeline (recipe + model)
pipe = Pipeline([
("scaler", StandardScaler()),
("model", LinearRegression())
])
pipe.fit(X_train, y_train)
# Predict
preds = pipe.predict(X_test)
print(f"R² = {r2_score(y_test, preds):.3f}")
Chapter 17 — Functions & Scripts
Writing reusable code — functions, error handling, and CLI scripts.
17.1 Functions
clean_names <- function(df,
case = "snake") {
df %>%
janitor::clean_names(case = case)
}
# Default arguments
calc_rate <- function(n, total,
digits = 2) {
round(n / total * 100, digits)
}
def clean_names(df: pd.DataFrame,
                case: str = "snake") -> pd.DataFrame:
    """Standardise column names."""
    cols = df.columns.str.strip().str.lower()
    cols = cols.str.replace(" ", "_")
    return df.set_axis(cols, axis=1)

# Default arguments + type hints
def calc_rate(n: int, total: int,
              digits: int = 2) -> float:
    return round(n / total * 100, digits)
17.2 Error Handling
result <- tryCatch(
{
risky_operation()
},
error = function(e) {
message("Error: ", e$message)
NA
},
warning = function(w) {
message("Warning: ", w$message)
}
)
try:
    result = risky_operation()
except FileNotFoundError as e:
    print(f"File missing: {e}")
    result = None
except ValueError as e:
    print(f"Bad value: {e}")
    result = None
except Exception as e:
    print(f"Unexpected: {e}")
    raise  # re-raise unexpected errors
finally:
    cleanup()  # always runs
17.3 Command-Line Scripts
#!/usr/bin/env python3
"""ETL script that can be run from the command line."""
import argparse
import pandas as pd
def main():
    parser = argparse.ArgumentParser(description="Run ETL pipeline")
    parser.add_argument("input", help="Path to input CSV")
    parser.add_argument("-o", "--output", default="output.parquet")
    parser.add_argument("--year", type=int, help="Filter to year")
    args = parser.parse_args()

    df = pd.read_csv(args.input)
    if args.year:
        df = df.query("year == @args.year")
    df.to_parquet(args.output, index=False)
    print(f"✓ Saved {len(df):,} rows → {args.output}")

if __name__ == "__main__":
    main()
# Usage: python etl.py data.csv -o clean.parquet --year 2024
Chapter 18 — Packaging & Environments
renv → venv/pip, R packages → Python packages.
18.1 Virtual Environments
renv::init() # create lockfile
renv::install("dplyr")
renv::snapshot() # save state
renv::restore() # reproduce
python -m venv .venv
source .venv/bin/activate
pip install pandas
pip freeze > requirements.txt # save
pip install -r requirements.txt # reproduce
18.2 Project Structure
my_project/
├── data/
│ ├── raw/ # never modify raw data
│ └── processed/ # ETL output goes here
├── notebooks/
│ └── exploration.ipynb # Jupyter notebooks
├── src/
│ ├── __init__.py # makes it a package
│ ├── extract.py
│ ├── transform.py
│ └── load.py
├── tests/
│ └── test_transform.py
├── requirements.txt # like renv.lock
├── .gitignore
└── README.md
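The tests/ folder pairs naturally with pytest (pip install pytest, then run pytest from the project root). A minimal sketch of tests/test_transform.py, assuming src/transform.py exposes the transform() function from Chapter 10:
import pandas as pd
from src.transform import transform

def test_transform_drops_negative_amounts():
    raw = pd.DataFrame({
        "id": [1, 2],
        "date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
        "amount": [100.0, -5.0],
        "category": ["A", None],
    })
    clean = transform(raw)
    assert (clean["amount"] >= 0).all()        # negative rows filtered out
    assert clean["category"].notna().all()     # missing categories filled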
Chapter 19 — Complete Cheat Sheet
The full R → Python translation table. Bookmark this.
| Operation | R / tidyverse | Python / pandas |
|---|---|---|
| Read CSV | read_csv("f.csv") | pd.read_csv("f.csv") |
| Write CSV | write_csv(df, "f.csv") | df.to_csv("f.csv", index=False) |
| View structure | glimpse(df) | df.info() |
| Summary stats | summary(df) | df.describe() |
| Select cols | select(name, age) | df[["name", "age"]] |
| Drop cols | select(-age) | df.drop(columns=["age"]) |
| Rename | rename(new = old) | df.rename(columns={"old":"new"}) |
| Filter rows | filter(age > 30) | df.query("age > 30") |
| Sort | arrange(desc(age)) | df.sort_values("age", ascending=False) |
| New column | mutate(x = a + b) | df.assign(x=lambda d: d["a"]+d["b"]) |
| case_when | case_when(...) | np.select(conditions, choices) |
| Group + summarise | group_by(g) %>% summarise(m=mean(x)) | df.groupby("g").agg(m=("x","mean")) |
| Window (grouped mutate) | group_by(g) %>% mutate(m=mean(x)) | df.groupby("g")["x"].transform("mean") |
| Ungroup | ungroup() | .reset_index() |
| Count | count(col) | df["col"].value_counts() |
| Distinct | distinct() | df.drop_duplicates() |
| Left join | left_join(a, b, by="id") | a.merge(b, on="id", how="left") |
| Pivot longer | pivot_longer() | df.melt() |
| Pivot wider | pivot_wider() | df.pivot_table() |
| Missing count | sum(is.na(x)) | df["x"].isna().sum() |
| Fill NA | replace_na(x, 0) | df["x"].fillna(0) |
| String lower | str_to_lower(x) | df["x"].str.lower() |
| Date parse | ymd("2024-01-01") | pd.to_datetime("2024-01-01") |
| Year extract | year(date) | df["date"].dt.year |
| Apply func | map_dbl(x, fn) | df["x"].apply(fn) |
| Row-wise | rowwise() %>% mutate(...) | df.apply(fn, axis=1) |
| Bind rows | bind_rows(a, b) | pd.concat([a, b]) |
| Bind cols | bind_cols(a, b) | pd.concat([a, b], axis=1) |
Built for Meron · R → Python · 2026