Python for Tidyverse Users
A complete, side-by-side course translating your R/tidyverse fluency into Python/pandas mastery — from first line to production ETL.
Chapter 0 — Why Learn Python When You Know R?
You already think in tidy data. Now add a second superpower.
As a tidyverse user you already understand the hardest concepts: tidy data principles, piped transformations, grouped operations, and the grammar of graphics. Python won't replace R — it will complement it. Here's why it's worth adding:
- Engineering integration — Most production data systems, APIs, and cloud services have Python as a first-class citizen.
- ML ecosystem — scikit-learn, PyTorch, TensorFlow, and Hugging Face are Python-native.
- Web apps & SaaS — Flask, FastAPI, Streamlit, and Django let you ship products.
- Job market — Python consistently tops hiring requirements for data & analytics roles in the UK.
Chapter 1 — Setup & Environment
Getting your Python workspace running — the equivalent of installing R + RStudio + tidyverse.
1.1 The R → Python Rosetta Stone (Tools)
| R World | Python Equivalent | Notes |
|---|---|---|
| R | python | The language itself |
| RStudio | VS Code or JupyterLab | VS Code + Python extension is closest to RStudio |
| CRAN | PyPI | Package repository |
| install.packages() | pip install | Package manager |
| library() | import | Load a package |
| renv | venv / conda | Virtual environments |
| .Rmd / .qmd | .ipynb (Jupyter) / .qmd | Quarto supports both! |
| tidyverse | pandas + numpy | Core data stack |
| ggplot2 | matplotlib + seaborn / plotly | Visualisation |
1.2 Installation
# Install Python (Mac/Linux usually have it; Windows → python.org)
# Then create a virtual environment (like renv):
python -m venv myproject_env
source myproject_env/bin/activate # Mac/Linux
myproject_env\Scripts\activate # Windows
# Install the "tidyverse" of Python:
pip install pandas numpy matplotlib seaborn scipy scikit-learn
pip install openpyxl xlsxwriter # Excel support
pip install sqlalchemy # SQL support
pip install requests beautifulsoup4 # APIs & scraping
pip install jupyterlab # Notebook interface
1.3 Your First Session
library(tidyverse)
# Everything is loaded — dplyr,
# ggplot2, tidyr, readr, etc.
# Interactive: RStudio console
# Scripts: .R files
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Convention: short aliases
# pd, np, plt — you'll see these everywhere
# Interactive: Jupyter notebook
# Scripts: .py files
import pandas as pd is universal — every tutorial, every StackOverflow answer uses pd. Same for np (numpy), plt (matplotlib), sns (seaborn). Don't fight it!
Chapter 2 — Python Syntax Primer
The biggest "gotchas" when switching from R to Python.
2.1 Indentation Matters (No Curly Braces)
# Braces define blocks
if (x > 10) {
print("big")
} else {
print("small")
}
for (i in 1:5) {
print(i)
}
# Indentation defines blocks (4 spaces)
if x > 10:
    print("big")
else:
    print("small")
for i in range(1, 6):  # 1 to 5
    print(i)
2.2 Zero-Indexed (The Big Difference)
x <- c(10, 20, 30)
x[1] # 10 (1-indexed)
x[2:3] # 20, 30 (inclusive)
x = [10, 20, 30]
x[0] # 10 (0-indexed!)
x[1:3] # [20, 30] (end exclusive)
x[1:3] returns items at index 1 and 2 (not 3). This is the #1 source of bugs for R users switching to Python.
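A few more translations, as a quick sketch (R equivalents in the comments), help the new rule stick:
x = [10, 20, 30, 40, 50]
x[0]       # first element                    (R: x[1])
x[-1]      # last element                     (R: x[length(x)])
x[1:4]     # indices 1, 2, 3 → [20, 30, 40]   (R: x[2:4])
x[:3]      # first three → [10, 20, 30]       (R: head(x, 3))
x[2:]      # index 2 to end → [30, 40, 50]    (R: x[3:length(x)])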
2.3 Key Syntax Differences
| Concept | R | Python |
|---|---|---|
| Assignment | x <- 5 or x = 5 | x = 5 only |
| Print | print(x) or just x | print(x) required in scripts |
| Boolean | TRUE / FALSE | True / False (capitalised) |
| NULL / None | NULL | None |
| NA / NaN | NA | np.nan or pd.NA |
| Not equal | != | != (same) |
| And / Or | & / \| | and / or (scalar), & / \| (vectorised) |
| Pipe | %>% or \|> | . method chaining |
| String | "hi" or 'hi' | "hi" or 'hi' (same) |
| f-string | glue("Hello {name}") | f"Hello {name}" |
| Comment | # comment | # comment (same) |
| Negation | !x | not x |
| Membership | x %in% c(1,2,3) | x in [1, 2, 3] |
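A small illustrative snippet touching several rows of the table above (the variable names are made up for the example):
import numpy as np

name = "Meron"
greeting = f"Hello {name}"              # f-string, like glue()
is_valid = True                         # booleans are capitalised
result = None                           # None plays the role of NULL
missing = np.nan                        # np.nan for numeric missing values

x = 15
if x > 10 and x < 20:                   # scalar and / or
    print(greeting)

scores = np.array([85, 92, 61])
mask = (scores > 60) & ~(scores > 90)   # vectorised & and ~ (negation)
print(3 in [1, 2, 3])                   # membership, like %in%; prints True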
2.4 The Pipe → Method Chaining
This is the conceptual leap. In R you pipe data through functions. In Python you chain methods on the object:
df %>%
filter(age > 30) %>%
mutate(decade = age %/% 10) %>%
group_by(decade) %>%
summarise(avg_income = mean(income)) %>%
arrange(desc(avg_income))
(df
.query("age > 30")
.assign(decade=lambda x: x["age"] // 10)
.groupby("decade")
.agg(avg_income=("income", "mean"))
.sort_values("avg_income", ascending=False)
)
Wrap the whole chain in parentheses ( ) to break it across lines without backslashes. This is the Python equivalent of the pipe's fluency.
Chapter 3 — Data Types & Structures
Lists, dictionaries, tuples — core building blocks with no direct tidyverse counterpart.
3.1 Core Data Types
| R Type | Python Type | Example / Notes |
|---|---|---|
| numeric | int / float | 42, 3.14 |
| character | str | "hello" |
| logical | bool | True, False |
| c(1, 2, 3) (vector) | [1, 2, 3] (list) | Ordered, mutable |
| list(a=1, b=2) (named list) | {"a": 1, "b": 2} (dict) | Key-value pairs |
| no equivalent | (1, 2, 3) (tuple) | Ordered, immutable |
| c(1, 2, 3) (vector) | np.array([1,2,3]) | Vectorised operations |
| data.frame / tibble | pd.DataFrame | The workhorse |
3.2 The Big Insight: Python Lists ≠ R Vectors
x <- c(1, 2, 3, 4, 5)
x * 2 # 2, 4, 6, 8, 10 ✓
x[x > 3] # 4, 5 ✓
mean(x) # 3 ✓
x = [1, 2, 3, 4, 5]
# x * 2 → [1,2,3,4,5,1,2,3,4,5] 😱
# Use numpy instead:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
x * 2 # array([2,4,6,8,10]) ✓
x[x > 3] # array([4, 5]) ✓
x.mean() # 3.0 ✓
3.3 Dictionaries (Your New Best Friend)
# Dicts are like R named lists but much more central to Python
person = {
"name": "Meron",
"role": "Analyst",
"skills": ["R", "Python", "SQL"]
}
person["name"] # "Meron"
person["skills"][0] # "R" (0-indexed!)
person.get("age", "unknown") # safe access with default
# Dicts → DataFrames (very common pattern)
data = {
"name": ["Alice", "Bob", "Carol"],
"age": [32, 45, 28],
"dept": ["Research", "Engineering", "Research"]
}
df = pd.DataFrame(data) # instant DataFrame!
3.4 List Comprehensions (R's map() on steroids)
# purrr::map
map_dbl(1:5, ~ .x^2)
# [1] 1 4 9 16 25
# Keep evens
keep(1:10, ~ .x %% 2 == 0)
# List comprehension
[x**2 for x in range(1, 6)]
# [1, 4, 9, 16, 25]
# With condition
[x for x in range(1, 11) if x % 2 == 0]
# [2, 4, 6, 8, 10]
Chapter 4 — Pandas = dplyr + tidyr
The single library that replaces most of the tidyverse.
4.1 Creating DataFrames
df <- tibble(
name = c("Alice", "Bob", "Carol"),
age = c(32, 45, 28),
score = c(88, 92, 85)
)
glimpse(df)
summary(df)
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol"],
"age": [32, 45, 28],
"score": [88, 92, 85]
})
df.info() # like glimpse()
df.describe() # like summary()
4.2 Exploring Data
| R / tidyverse | Python / pandas | Purpose |
|---|---|---|
| glimpse(df) | df.info() | Column types & non-nulls |
| head(df) | df.head() | First rows |
| tail(df) | df.tail() | Last rows |
| nrow(df) | len(df) | Row count |
| ncol(df) | df.shape[1] | Column count |
| dim(df) | df.shape | (rows, cols) tuple |
| names(df) | df.columns.tolist() | Column names |
| summary(df) | df.describe() | Numeric summary |
| count(df, col) | df["col"].value_counts() | Frequency table |
| n_distinct(df$col) | df["col"].nunique() | Unique count |
| is.na(df) %>% colSums() | df.isna().sum() | Missing per column |
| df %>% distinct() | df.drop_duplicates() | Remove duplicates |
Chapter 5 — Selecting & Filtering
The dplyr verbs select() and filter() in pandas.
5.1 Selecting Columns
# Single column
df %>% select(name)
# Multiple columns
df %>% select(name, age)
# By pattern
df %>% select(starts_with("score"))
df %>% select(where(is.numeric))
# Drop columns
df %>% select(-age)
# Rename
df %>% rename(full_name = name)
# Single column (returns Series)
df["name"]
# Multiple columns (returns DataFrame)
df[["name", "age"]]
# By pattern
df.filter(regex="^score")
df.select_dtypes(include="number")
# Drop columns
df.drop(columns=["age"])
# Rename
df.rename(columns={"name": "full_name"})
5.2 Filtering Rows
# Simple filter
df %>% filter(age > 30)
# Multiple conditions
df %>% filter(age > 30, score > 85)
# OR condition
df %>% filter(age > 30 | score > 90)
# %in% check
df %>% filter(name %in% c("Alice","Bob"))
# Missing values
df %>% filter(!is.na(score))
# Slice
df %>% slice(1:3)
df %>% slice_max(score, n = 2)
# Simple filter
df[df["age"] > 30] # boolean mask
df.query("age > 30") # query string
# Multiple conditions
df.query("age > 30 and score > 85")
# OR condition
df.query("age > 30 or score > 90")
# isin check
df[df["name"].isin(["Alice", "Bob"])]
# Missing values
df[df["score"].notna()] # or .dropna()
# Slice
df.iloc[0:3] # by position
df.nlargest(2, "score") # top N
df.query("age > 30") is cleaner for simple filters (like dplyr). Boolean masking df[df["age"] > 30] is more flexible for complex logic. Use .query() for readability in chains.
Chapter 6 — Mutating & Summarising
mutate(), summarise(), and group_by() translated to pandas.
6.1 Creating New Columns (mutate)
df %>%
mutate(
score_pct = score / 100,
age_group = case_when(
age < 30 ~ "Young",
age < 40 ~ "Mid",
TRUE ~ "Senior"
),
rank = row_number()
)
(df
.assign(
score_pct=lambda x: x["score"] / 100,
age_group=lambda x: np.select(
[x["age"] < 30, x["age"] < 40],
["Young", "Mid"],
default="Senior"
),
rank=lambda x: range(1, len(x) + 1)
)
)
# Or directly: df["score_pct"] = df["score"]/100
6.2 case_when → np.select / pd.cut
# case_when
df %>% mutate(
grade = case_when(
score >= 90 ~ "A",
score >= 80 ~ "B",
score >= 70 ~ "C",
TRUE ~ "F"
)
)
# cut for bins
df %>% mutate(
age_bin = cut(age, breaks = c(0,30,40,50,Inf))
)
# np.select (order matters!)
conditions = [
df["score"] >= 90,
df["score"] >= 80,
df["score"] >= 70,
]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, "F")
# pd.cut for bins
df["age_bin"] = pd.cut(
df["age"], bins=[0, 30, 40, 50, np.inf]
)
6.3 Summarising & Grouping
# Ungrouped summary
df %>% summarise(
avg_score = mean(score),
n = n()
)
# Grouped summary
df %>%
group_by(dept) %>%
summarise(
avg_score = mean(score),
max_age = max(age),
n = n()
) %>%
ungroup()
# Window function in mutate
df %>%
group_by(dept) %>%
mutate(
dept_avg = mean(score),
rank = rank(-score)
)
# Ungrouped summary
df.agg(
avg_score=("score", "mean"),
n=("score", "size")
)
# Grouped summary
(df
.groupby("dept")
.agg(
avg_score=("score", "mean"),
max_age=("age", "max"),
n=("name", "size")
)
.reset_index() # like ungroup()
)
# Window function (transform)
df["dept_avg"] = (
df.groupby("dept")["score"].transform("mean")
)
df["rank"] = (
df.groupby("dept")["score"].rank(ascending=False)
)
.agg() collapses rows (like summarise). .transform() keeps all rows (like mutate inside group_by). This is the key distinction.
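A quick way to see the difference, assuming the df with dept and score columns used above:
collapsed = df.groupby("dept").agg(avg_score=("score", "mean"))
print(collapsed.shape)               # one row per department (rows collapsed)

broadcast = df.groupby("dept")["score"].transform("mean")
print(len(broadcast) == len(df))     # True: one value per original row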
Chapter 7 — Joins & Reshaping
dplyr joins and tidyr pivots, translated.
7.1 Joins
left_join(df1, df2, by = "id")
inner_join(df1, df2, by = "id")
full_join(df1, df2, by = "id")
anti_join(df1, df2, by = "id")
# Different column names
left_join(df1, df2,
by = c("emp_id" = "employee_id")
)
df1.merge(df2, on="id", how="left")
df1.merge(df2, on="id", how="inner")
df1.merge(df2, on="id", how="outer")
df1.merge(df2, on="id", how="left",
indicator=True) # then filter
# Different column names
df1.merge(df2,
left_on="emp_id",
right_on="employee_id"
)
7.2 Anti-join Pattern in Pandas
# Anti-join: rows in df1 NOT in df2
merged = df1.merge(df2, on="id", how="left", indicator=True)
anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
7.3 Reshaping: Pivot Longer & Wider
# Long → same as gather
df %>% pivot_longer(
cols = c(q1, q2, q3),
names_to = "quarter",
values_to = "revenue"
)
# Wide → same as spread
df %>% pivot_wider(
names_from = quarter,
values_from = revenue
)
# Melt = pivot_longer
df.melt(
id_vars=["name"],
value_vars=["q1", "q2", "q3"],
var_name="quarter",
value_name="revenue"
)
# Pivot = pivot_wider
df.pivot_table(
index="name",
columns="quarter",
values="revenue"
).reset_index()
Chapter 8 — Strings, Dates & Factors
stringr, lubridate, forcats → pandas str/dt accessors.
8.1 String Operations
str_to_lower(x)
str_to_upper(x)
str_trim(x)
str_detect(x, "pattern")
str_replace(x, "old", "new")
str_sub(x, 1, 3)
str_c("a", "b", sep = "-")
str_extract(x, "\\d+")
df["col"].str.lower()
df["col"].str.upper()
df["col"].str.strip()
df["col"].str.contains("pattern")
df["col"].str.replace("old", "new")
df["col"].str[:3]
"a" + "-" + "b" # or "-".join(...)
df["col"].str.extract(r"(\d+)")
8.2 Dates & Times
ymd("2024-03-15")
today()
x <- ymd("2024-03-15")
year(x); month(x); day(x)
x + days(30)
floor_date(x, "month")
interval(start, end) / days(1)
pd.to_datetime("2024-03-15")
pd.Timestamp.today()
x = pd.to_datetime("2024-03-15")
x.year; x.month; x.day
x + pd.Timedelta(days=30)
x.to_period("M").to_timestamp()
(end - start).days
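The examples above operate on a single Timestamp; on a whole DataFrame column the same operations go through the .dt accessor. A sketch, assuming a column called date:
df["date"] = pd.to_datetime(df["date"])            # parse the whole column
df["year"] = df["date"].dt.year                    # like year(date)
df["month_name"] = df["date"].dt.month_name()      # like month(date, label = TRUE)
df["month_start"] = df["date"].dt.to_period("M").dt.to_timestamp()  # like floor_date(date, "month")
df["days_elapsed"] = (pd.Timestamp.today() - df["date"]).dt.days    # like today() - date, in days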
8.3 Factors → Categoricals
factor(x, levels = c("Low","Med","High"))
fct_relevel(x, "High", "Med", "Low")
fct_reorder(x, value, .fun = median)
fct_lump_n(x, n = 5)
pd.Categorical(x, categories=["Low","Med","High"],
ordered=True)
df["col"] = df["col"].cat.reorder_categories(
["High","Med","Low"])
# Reorder by value — set after sorting
# Lump — use value_counts + threshold
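There is no built-in fct_lump_n or fct_reorder, but both take only a couple of lines. A sketch, assuming a column col and a numeric column value:
# fct_lump_n(x, n = 5) equivalent: keep the 5 most common levels, lump the rest
top5 = df["col"].value_counts().nlargest(5).index
df["col"] = df["col"].astype(str).where(df["col"].isin(top5), "Other")

# fct_reorder(x, value, .fun = median) equivalent: order levels by another column
order = df.groupby("col")["value"].median().sort_values().index
df["col"] = pd.Categorical(df["col"], categories=order, ordered=True)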
Chapter 9 — Reading & Writing Data
readr, readxl, haven → pandas read_* functions.
| Format | R | Python / pandas |
|---|---|---|
| CSV | read_csv("f.csv") | pd.read_csv("f.csv") |
| TSV | read_tsv("f.tsv") | pd.read_csv("f.tsv", sep="\t") |
| Excel | readxl::read_excel("f.xlsx") | pd.read_excel("f.xlsx") |
| JSON | jsonlite::read_json("f.json") | pd.read_json("f.json") |
| Parquet | arrow::read_parquet("f.parquet") | pd.read_parquet("f.parquet") |
| SPSS / Stata | haven::read_sav() | pd.read_spss() / pd.read_stata() |
| SQL | DBI::dbGetQuery() | pd.read_sql(query, conn) |
| Clipboard | readr::clipboard() | pd.read_clipboard() |
df = pd.read_csv(
"data.csv",
usecols=["name", "age", "score"], # select columns on read
dtype={"age": int}, # force types
parse_dates=["date_col"], # auto-parse dates
na_values=["", "NA", "N/A", "-"], # custom NA strings
nrows=1000, # read only first 1000
encoding="utf-8", # encoding
)
# Writing
df.to_csv("output.csv", index=False) # index=False prevents row numbers
df.to_excel("output.xlsx", index=False)
df.to_parquet("output.parquet") # fast, compressed — prefer this!
Chapter 10 — ETL Pipelines
Extract, Transform, Load — building real data workflows in Python.
10.1 ETL Pattern
import pandas as pd
import numpy as np
from pathlib import Path
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# ── EXTRACT ──────────────────────────────────────────
def extract(source_dir: str) -> pd.DataFrame:
    """Read all CSVs from a directory and combine."""
    files = list(Path(source_dir).glob("*.csv"))
    logger.info(f"Found {len(files)} files")
    dfs = []
    for f in files:
        df = pd.read_csv(f, parse_dates=["date"])
        df["source_file"] = f.name  # track provenance
        dfs.append(df)
    raw = pd.concat(dfs, ignore_index=True)
    logger.info(f"Extracted {len(raw):,} rows")
    return raw
# ── TRANSFORM ────────────────────────────────────────
def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, validate, and enrich the data."""
    return (df
        # 1. Standardise column names
        .rename(columns=str.lower)
        .rename(columns=lambda c: c.strip().replace(" ", "_"))
        # 2. Remove duplicates
        .drop_duplicates(subset=["id", "date"])
        # 3. Handle missing values
        .dropna(subset=["id"])  # must have ID
        .assign(
            amount=lambda x: x["amount"].fillna(0),
            category=lambda x: x["category"].fillna("Unknown"),
        )
        # 4. Type enforcement
        .astype({"category": "category"})
        # 5. Business logic
        .assign(
            year=lambda x: x["date"].dt.year,
            month=lambda x: x["date"].dt.month,
            quarter=lambda x: x["date"].dt.quarter,
            amount_gbp=lambda x: x["amount"] * 0.79,  # USD→GBP
            size_bucket=lambda x: pd.cut(
                x["amount"],
                bins=[0, 100, 1000, 10000, np.inf],
                labels=["Small", "Medium", "Large", "Enterprise"]
            ),
        )
        # 6. Filter invalid records
        .query("amount >= 0")
        # 7. Sort
        .sort_values(["date", "id"])
        .reset_index(drop=True)
    )
# ── LOAD ─────────────────────────────────────────────
def load(df: pd.DataFrame, output_path: str):
    """Save clean data to Parquet."""
    df.to_parquet(output_path, index=False)
    logger.info(f"Loaded {len(df):,} rows → {output_path}")
# ── ORCHESTRATE ──────────────────────────────────────
if __name__ == "__main__":
    raw = extract("data/raw/")
    clean = transform(raw)
    load(clean, "data/processed/clean_data.parquet")
10.2 Validation Layer (Quality Checks)
def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Run data quality checks. Raise on critical failures."""
    checks = {
        "no_nulls_in_id": df["id"].notna().all(),
        "positive_amounts": (df["amount"] >= 0).all(),
        "valid_dates": df["date"].between("2020-01-01", "2026-12-31").all(),
        "no_duplicates": not df.duplicated(subset=["id", "date"]).any(),
        "row_count_min": len(df) >= 100,
    }
    for check_name, passed in checks.items():
        status = "✓ PASS" if passed else "✗ FAIL"
        logger.info(f" {status}: {check_name}")
    failures = [k for k, v in checks.items() if not v]
    if failures:
        raise ValueError(f"Validation failed: {failures}")
    return df
# Insert into pipeline:
# raw = extract(...)
# clean = transform(raw)
# validated = validate(clean) ← new
# load(validated, ...)
10.3 The R Comparison
library(tidyverse)
library(fs)
raw <- dir_ls("data/raw", glob = "*.csv") %>%
map_dfr(read_csv)
clean <- raw %>%
janitor::clean_names() %>%
distinct(id, date, .keep_all = TRUE) %>%
filter(!is.na(id), amount >= 0) %>%
mutate(
year = year(date),
quarter = quarter(date),
size = cut(amount,
breaks = c(0, 100, 1000, Inf),
labels = c("S", "M", "L"))
)
arrow::write_parquet(clean, "data/clean.parquet")
import pandas as pd
import numpy as np
from pathlib import Path
files = Path("data/raw").glob("*.csv")
raw = pd.concat(
    [pd.read_csv(f, parse_dates=["date"]) for f in files],
    ignore_index=True
)
clean = (raw
.rename(columns=str.lower)
.drop_duplicates(subset=["id", "date"])
.dropna(subset=["id"])
.query("amount >= 0")
.assign(
year=lambda x: x["date"].dt.year,
quarter=lambda x: x["date"].dt.quarter,
size=lambda x: pd.cut(x["amount"],
bins=[0,100,1000,np.inf],
labels=["S","M","L"]),
)
)
clean.to_parquet("data/clean.parquet")
Chapter 11 — APIs & Web Scraping
httr/rvest → requests/BeautifulSoup.
11.1 Calling APIs
library(httr)
library(jsonlite)
resp <- GET(
"https://api.example.com/data",
add_headers(Authorization = "Bearer xyz"),
query = list(limit = 100)
)
data <- content(resp, as = "parsed")
df <- as_tibble(data$results)
import requests
import pandas as pd
resp = requests.get(
"https://api.example.com/data",
headers={"Authorization": "Bearer xyz"},
params={"limit": 100}
)
data = resp.json()
df = pd.DataFrame(data["results"])
11.2 Web Scraping
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, "html.parser")
# Find all tables
tables = pd.read_html(resp.text) # pandas can parse HTML tables!
gdp_df = tables[0] # first table on the page
# Or parse manually for more control
rows = []
for tr in soup.select("table.wikitable tr")[1:]:
    cells = [td.text.strip() for td in tr.find_all(["td", "th"])]
    if cells:
        rows.append(cells)
df = pd.DataFrame(rows)  # then name the columns to match the table you scraped
11.3 Paginating Through an API
def fetch_all_pages(base_url: str, api_key: str) -> pd.DataFrame:
    """Paginate through a REST API, collecting all results."""
    all_data = []
    page = 1
    while True:
        resp = requests.get(
            base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            params={"page": page, "per_page": 100}
        )
        resp.raise_for_status()  # raise exception on HTTP errors
        data = resp.json()
        results = data.get("results", [])
        if not results:
            break
        all_data.extend(results)
        page += 1
        if page > data.get("total_pages", 1):
            break
    return pd.DataFrame(all_data)
Chapter 12 — SQL from Python
DBI/dbplyr → SQLAlchemy/pandas.
library(DBI)
con <- dbConnect(
RPostgres::Postgres(),
host = "localhost",
dbname = "mydb",
user = "user",
password = "pass"
)
df <- dbGetQuery(con, "
SELECT dept, AVG(salary)
FROM employees
WHERE active = true
GROUP BY dept
")
dbDisconnect(con)
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine(
"postgresql://user:pass@localhost/mydb"
)
df = pd.read_sql("""
SELECT dept, AVG(salary)
FROM employees
WHERE active = true
GROUP BY dept
""", engine)
engine.dispose()
# Write a DataFrame to a database table
df.to_sql(
"clean_employees", # table name
engine,
if_exists="replace", # "append", "replace", or "fail"
index=False,
method="multi", # faster batch insert
chunksize=5000
)
# Supabase / PostgreSQL with connection pooling
engine = create_engine(
"postgresql://user:pass@db.xyz.supabase.co:5432/postgres",
pool_size=5,
max_overflow=10
)
Chapter 13 — Matplotlib & Seaborn
ggplot2 → matplotlib + seaborn (the grammar of graphics, translated).
13.1 The Mental Model
| ggplot2 Concept | matplotlib/seaborn |
|---|---|
| ggplot(df, aes(x, y)) | fig, ax = plt.subplots() |
| geom_point() | ax.scatter(x, y) or sns.scatterplot() |
| geom_line() | ax.plot(x, y) or sns.lineplot() |
| geom_bar() | ax.bar(x, y) or sns.barplot() |
| geom_histogram() | ax.hist(x) or sns.histplot() |
| geom_boxplot() | sns.boxplot() |
| facet_wrap(~var) | sns.FacetGrid() or fig, axes = plt.subplots(1, 3) |
| labs(title, x, y) | ax.set(title=, xlabel=, ylabel=) |
| theme_minimal() | plt.style.use('seaborn-v0_8-whitegrid') |
| scale_fill_brewer() | palette="Set2" in seaborn |
| ggsave("plot.png") | plt.savefig("plot.png", dpi=150, bbox_inches="tight") |
13.2 Side-by-Side Example
ggplot(df, aes(x = age, y = score,
colour = dept)) +
geom_point(alpha = 0.7, size = 3) +
geom_smooth(method = "lm") +
labs(
title = "Score by Age",
x = "Age", y = "Score"
) +
theme_minimal() +
scale_colour_brewer(palette = "Set2")
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(
data=df, x="age", y="score",
hue="dept", alpha=0.7, s=60,
palette="Set2", ax=ax
)
sns.regplot(
data=df, x="age", y="score",
scatter=False, ax=ax, color="grey"
)
ax.set(title="Score by Age",
xlabel="Age", ylabel="Score")
sns.despine()
plt.tight_layout()
plt.savefig("scatter.png", dpi=150)
13.3 Common Chart Recipes
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')
# ── Bar Chart ──
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=df, x="dept", y="score", palette="viridis", ax=ax)
ax.set_title("Average Score by Department", fontweight="bold")
ax.bar_label(ax.containers[0], fmt="%.1f") # value labels on bars
# ── Histogram ──
fig, ax = plt.subplots()
sns.histplot(df["score"], bins=20, kde=True, ax=ax)
# ── Faceted chart (like facet_wrap) ──
g = sns.FacetGrid(df, col="dept", col_wrap=3, height=4)
g.map_dataframe(sns.scatterplot, x="age", y="score")
g.set_titles("{col_name}")
# ── Heatmap (correlation) ──
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df.select_dtypes("number").corr(),
annot=True, fmt=".2f", cmap="RdBu_r",
center=0, ax=ax)
Chapter 14 — Plotly Interactive
plotly in R → plotly in Python (nearly identical!).
library(plotly)
plot_ly(df,
x = ~date, y = ~value,
color = ~group,
type = "scatter",
mode = "lines"
) %>%
layout(title = "Trend")
import plotly.express as px
fig = px.line(
df,
x="date", y="value",
color="group",
title="Trend"
)
fig.show()
fig.write_html("trend.html")
plotly.express (imported as px) is the high-level API — one function per chart type, very similar to ggplot. Use plotly.graph_objects for full customisation.
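When you outgrow the express API, the same figure can be rebuilt trace by trace with plotly.graph_objects. A minimal sketch, assuming the same df with date, value and group columns:
import plotly.graph_objects as go

fig = go.Figure()
for name, grp in df.groupby("group"):
    fig.add_trace(go.Scatter(
        x=grp["date"], y=grp["value"],
        mode="lines", name=str(name)        # one trace per group
    ))
fig.update_layout(title="Trend", xaxis_title="date", yaxis_title="value")
fig.show()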
Chapter 15 — Statistics & Hypothesis Tests
stats::t.test, chisq.test → scipy.stats.
# t-test
t.test(score ~ group, data = df)
# Chi-squared
chisq.test(table(df$dept, df$grade))
# Correlation
cor.test(df$age, df$score)
# Shapiro-Wilk normality
shapiro.test(df$score)
# Mann-Whitney U
wilcox.test(score ~ group, data = df)
# Linear regression
model <- lm(score ~ age + dept, data = df)
summary(model)
from scipy import stats
# t-test
stats.ttest_ind(group_a, group_b)
# Chi-squared
ct = pd.crosstab(df["dept"], df["grade"])
stats.chi2_contingency(ct)
# Correlation
stats.pearsonr(df["age"], df["score"])
# Shapiro-Wilk normality
stats.shapiro(df["score"])
# Mann-Whitney U
stats.mannwhitneyu(group_a, group_b)
# Linear regression
import statsmodels.formula.api as smf
model = smf.ols("score ~ age + dept", data=df)
results = model.fit()
print(results.summary())
15.1 Effect Sizes (Don't Just Report p-values)
import numpy as np

# Cohen's d (effect size for t-test)
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    pooled_std = np.sqrt(
        ((n1 - 1) * group1.std() ** 2 + (n2 - 1) * group2.std() ** 2) / (n1 + n2 - 2)
    )
    return (group1.mean() - group2.mean()) / pooled_std
d = cohens_d(group_a, group_b)
# |d| < 0.2 = small, 0.5 = medium, 0.8 = large
# Confidence intervals (bootstrap)
from scipy.stats import bootstrap
rng = np.random.default_rng(42)
result = bootstrap((df["score"].values,), np.mean,
confidence_level=0.95, random_state=rng)
print(f"95% CI: ({result.confidence_interval.low:.2f}, "
f"{result.confidence_interval.high:.2f})")
Chapter 16 — Regression & Machine Learning
caret/tidymodels → scikit-learn.
library(tidymodels)
# Split
split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)
# Recipe + model
rec <- recipe(score ~ ., data = train) %>%
step_dummy(all_nominal()) %>%
step_normalize(all_numeric_predictors())
model <- linear_reg() %>%
set_engine("lm") %>%
fit(score ~ ., data = juice(prep(rec)))
# Predict
preds <- predict(model, bake(prep(rec), test))
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error
# Split
X = df.drop(columns=["score"])
y = df["score"]
X = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
# Pipeline (recipe + model)
pipe = Pipeline([
("scaler", StandardScaler()),
("model", LinearRegression())
])
pipe.fit(X_train, y_train)
# Predict
preds = pipe.predict(X_test)
print(f"R² = {r2_score(y_test, preds):.3f}")
Chapter 17 — Functions & Scripts
Writing reusable code — functions, error handling, and CLI scripts.
17.1 Functions
clean_names <- function(df,
case = "snake") {
df %>%
janitor::clean_names(case = case)
}
# Default arguments
calc_rate <- function(n, total,
digits = 2) {
round(n / total * 100, digits)
}
def clean_names(df: pd.DataFrame,
                case: str = "snake") -> pd.DataFrame:
    """Standardise column names."""
    cols = df.columns.str.strip().str.lower()
    cols = cols.str.replace(" ", "_")
    return df.set_axis(cols, axis=1)

# Default arguments + type hints
def calc_rate(n: int, total: int,
              digits: int = 2) -> float:
    return round(n / total * 100, digits)
17.2 Error Handling
result <- tryCatch(
{
risky_operation()
},
error = function(e) {
message("Error: ", e$message)
NA
},
warning = function(w) {
message("Warning: ", w$message)
}
)
try:
    result = risky_operation()
except FileNotFoundError as e:
    print(f"File missing: {e}")
    result = None
except ValueError as e:
    print(f"Bad value: {e}")
    result = None
except Exception as e:
    print(f"Unexpected: {e}")
    raise  # re-raise unexpected errors
finally:
    cleanup()  # always runs
17.3 Command-Line Scripts
#!/usr/bin/env python3
"""ETL script that can be run from the command line."""
import argparse
import pandas as pd
def main():
    parser = argparse.ArgumentParser(description="Run ETL pipeline")
    parser.add_argument("input", help="Path to input CSV")
    parser.add_argument("-o", "--output", default="output.parquet")
    parser.add_argument("--year", type=int, help="Filter to year")
    args = parser.parse_args()

    df = pd.read_csv(args.input)
    if args.year:
        df = df.query("year == @args.year")
    df.to_parquet(args.output, index=False)
    print(f"✓ Saved {len(df):,} rows → {args.output}")

if __name__ == "__main__":
    main()
# Usage: python etl.py data.csv -o clean.parquet --year 2024
Chapter 18 — Packaging & Environments
renv → venv/pip, R packages → Python packages.
18.1 Virtual Environments
renv::init() # create lockfile
renv::install("dplyr")
renv::snapshot() # save state
renv::restore() # reproduce
python -m venv .venv
source .venv/bin/activate
pip install pandas
pip freeze > requirements.txt # save
pip install -r requirements.txt # reproduce
18.2 Project Structure
my_project/
├── data/
│ ├── raw/ # never modify raw data
│ └── processed/ # ETL output goes here
├── notebooks/
│ └── exploration.ipynb # Jupyter notebooks
├── src/
│ ├── __init__.py # makes it a package
│ ├── extract.py
│ ├── transform.py
│ └── load.py
├── tests/
│ └── test_transform.py
├── requirements.txt # like renv.lock
├── .gitignore
└── README.md
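The tests/ folder pairs naturally with pytest (pip install pytest, then run pytest from the project root). A minimal sketch of tests/test_transform.py, assuming src/transform.py exposes the transform() function from Chapter 10:
import pandas as pd
from src.transform import transform

def test_transform_drops_negative_amounts():
    raw = pd.DataFrame({
        "id": [1, 2],
        "date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
        "amount": [100.0, -5.0],
        "category": ["A", None],
    })
    clean = transform(raw)
    assert (clean["amount"] >= 0).all()        # negative rows filtered out
    assert clean["category"].notna().all()     # missing categories filled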
Chapter 19 — Complete Cheat Sheet
The full R → Python translation table. Bookmark this.
| Operation | R / tidyverse | Python / pandas |
|---|---|---|
| Read CSV | read_csv("f.csv") | pd.read_csv("f.csv") |
| Write CSV | write_csv(df, "f.csv") | df.to_csv("f.csv", index=False) |
| View structure | glimpse(df) | df.info() |
| Summary stats | summary(df) | df.describe() |
| Select cols | select(name, age) | df[["name", "age"]] |
| Drop cols | select(-age) | df.drop(columns=["age"]) |
| Rename | rename(new = old) | df.rename(columns={"old":"new"}) |
| Filter rows | filter(age > 30) | df.query("age > 30") |
| Sort | arrange(desc(age)) | df.sort_values("age", ascending=False) |
| New column | mutate(x = a + b) | df.assign(x=lambda d: d["a"]+d["b"]) |
| case_when | case_when(...) | np.select(conditions, choices) |
| Group + summarise | group_by(g) %>% summarise(m=mean(x)) | df.groupby("g").agg(m=("x","mean")) |
| Window (grouped mutate) | group_by(g) %>% mutate(m=mean(x)) | df.groupby("g")["x"].transform("mean") |
| Ungroup | ungroup() | .reset_index() |
| Count | count(col) | df["col"].value_counts() |
| Distinct | distinct() | df.drop_duplicates() |
| Left join | left_join(a, b, by="id") | a.merge(b, on="id", how="left") |
| Pivot longer | pivot_longer() | df.melt() |
| Pivot wider | pivot_wider() | df.pivot_table() |
| Missing count | sum(is.na(x)) | df["x"].isna().sum() |
| Fill NA | replace_na(x, 0) | df["x"].fillna(0) |
| String lower | str_to_lower(x) | df["x"].str.lower() |
| Date parse | ymd("2024-01-01") | pd.to_datetime("2024-01-01") |
| Year extract | year(date) | df["date"].dt.year |
| Apply func | map_dbl(x, fn) | df["x"].apply(fn) |
| Row-wise | rowwise() %>% mutate(...) | df.apply(fn, axis=1) |
| Bind rows | bind_rows(a, b) | pd.concat([a, b]) |
| Bind cols | bind_cols(a, b) | pd.concat([a, b], axis=1) |
Built for Meron · R → Python · 2026