R for Data Science

Set up a project, then play the whole game: load → tidy → transform → visualise

Postgraduate R Tutorial

2026-06-08

Welcome

Today we do two things:

  1. Set up a project properly — folders, paths, and the idea of tidy data — so your work is organised and reproducible from day one.
  2. Then play the whole game of data science, following data in its natural direction: load it in → tidy it → transform it → visualise it.

The data science cycle

Import → Tidy → Transform ⇄ Visualise → (Model → Communicate) …all sitting inside Program.

We’ll follow it left to right today:

  • Set up — a project structure that keeps everything findable and reproducible.
  • Import — get a data file into R.
  • Tidy — reshape it into a consistent form.
  • Transform — filter, derive variables, summarise by group.
  • Visualise — turn the table into a picture. (We finish here.)

Part 1 · Set up your project

Get organised before you analyse. Future-you will thank present-you.

Always work in an RStudio Project

An RStudio Project (.Rproj) bundles everything for one piece of work.

  • Open the project and your working directory is the project folder — automatically.
  • No more setwd("C:/Users/neil/Desktop/...") that breaks on every other machine.
  • Each project gets its own history, and you can have several open at once.

Create one with File → New Project → New Directory. One project = one paper / thesis chapter / analysis.

A sensible folder structure

my-analysis/
├── my-analysis.Rproj      # the project file — open THIS
├── README.md              # what is this, who, how to run it
├── data-raw/              # original data — READ-ONLY, never edit by hand
├── data/                  # cleaned / derived data you generate from data-raw
├── Code/                  # your analysis scripts (.R) or Quarto docs (.qmd)
├── output/
│   ├── figures/           # plots you export
│   └── tables/            # results tables
└── renv/                  # optional: a reproducible package library

Two principles: raw data is sacred and read-only; everything in output/ is regenerable by re-running your scripts.
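As a rough sketch, the skeleton above can be created from R itself with base dir.create(). The example builds it under tempdir() purely so it is harmless to run anywhere; that location is an assumption for illustration, and in practice project_root would be your own project folder.

```r
# Build the folder skeleton shown above using only base R.
# Assumption: created under tempdir() so the example is safe to run;
# point project_root at your real project folder instead.
project_root <- file.path(tempdir(), "my-analysis")

folders <- c(
  "data-raw",
  "data",
  "Code",
  "output/figures",
  "output/tables"
)

for (folder in folders) {
  # recursive = TRUE also creates missing parents such as output/
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List what now exists, relative to the project root.
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```

Re-running the loop is safe: existing folders are left alone, which suits a set-up script you keep at the top of a project.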

Paths that don’t break

# Bad — absolute, machine-specific:
read_csv("C:/Users/neil/Desktop/project/data-raw/palmerpenguins.csv")

# Good — relative to the project root:
read_csv("data-raw/palmerpenguins.csv")

# Robust — builds the path correctly on any OS / machine:
library(here)
read_csv(here("data-raw", "palmerpenguins.csv"))

Because you’re in a Project, relative paths just work. here::here() builds every path from the project root, so the same line keeps working when a script or .qmd lives in a sub-folder.
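For comparison, here is a base-R sketch of the same idea without the here package: build the relative path with file.path() and check that it resolves before reading. The file name is the one used above; whether it exists depends on your own project, so the example only reports what it finds.

```r
# Build a relative path piece by piece; file.path() inserts the separator.
csv_path <- file.path("data-raw", "palmerpenguins.csv")
print(csv_path)

# A relative path is resolved against the working directory,
# which an RStudio Project sets to the project folder.
resolved <- file.path(getwd(), csv_path)

# Check before reading, so a wrong working directory gives a clear message.
if (file.exists(csv_path)) {
  message("Found: ", resolved)
} else {
  message("Not found (is the project open?): ", resolved)
}
```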

Reproducible habits

  • Write your analysis in a script (.R) or Quarto (.qmd) — not the console — so it’s saved and re-runnable.
  • Run a line / selection with Cmd/Ctrl + Enter; run the whole file with Cmd/Ctrl + Shift + Enter.
  • Restart R often (Session → Restart R) and re-run from the top.

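Stop R carrying objects over between sessions: in Tools → Global Options → General, uncheck “Restore .RData into workspace at startup” and set “Save workspace to .RData on exit” to Never. A script that only works because of leftover objects isn’t reproducible.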

Part 2 · R foundations

The minimum syntax you need before touching data.

Assignment & naming

# Assign with <-   (shortcut: Alt/Option + -)
mean_mass <- 4200

# Names: lowercase, words separated by _ (snake_case)
flipper_summary <- 1   # good
FlipperSummary  <- 1   # avoid
flipper.summary <- 1   # avoid

  • Use <- to assign; reserve = for arguments inside functions.
  • Nothing prints when you assign — type the object’s name to see it.

Calling functions

seq(from = 1, to = 10, by = 2)
#> [1] 1 3 5 7 9

  • Functions take arguments as name = value.
  • First arguments are usually passed by position; name the rest for clarity.
  • ?seq opens the help page; Tab-completion shows argument names.
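A small sketch of the positional versus named distinction, using the same seq() call. All four forms give the same result; the variable names are only for illustration.

```r
# Every argument named: longest, but unambiguous.
by_name <- seq(from = 1, to = 10, by = 2)

# Every argument by position: short, but the reader must know the order.
by_position <- seq(1, 10, 2)

# The usual compromise: first arguments by position, the rest named.
mixed <- seq(1, 10, by = 2)

# Named arguments can be supplied in any order.
reordered <- seq(by = 2, to = 10, from = 1)

identical(by_name, by_position)  # TRUE
identical(by_name, mixed)        # TRUE
identical(by_name, reordered)    # TRUE
```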

Code style

# Spaces around operators and after commas:
mean(x, na.rm = TRUE)      # good
mean(x,na.rm=TRUE)         # cramped

# One statement per line; descriptive snake_case names:
mean_body_mass <- mean(penguins$body_mass_g, na.rm = TRUE)

  • Spaces around <-, ==, +, *; a space after every comma.
  • The {styler} package reformats code automatically; {lintr} flags problems.

Code is read far more often than written — including by you, in six months.

Part 3 · Tidy data

The shape we’re aiming for. Get here and everything downstream is easy.

The three rules of tidy data

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each value is a cell.

For us: one row per specimen, one column per measured trait, with species, site, and sex as their own columns.

Tidy data is what dplyr, ggplot2 and most models expect. Field data rarely arrives this way; the “Tidy in practice” slides later on show how to get it there.

The whole game

Import → Tidy → Transform → Visualise. Now we run real data through it.

Import: read_csv()

library(tidyverse)

penguins <- read_csv("data-raw/palmerpenguins.csv")

readr::read_csv() gives you:

  • a tibble (a tidy-friendly data frame),
  • automatic column-type guessing, printed so you can check it,
  • handy arguments: na = c("", "NA", "-999"), skip =, col_types = to override guesses.

Always eyeball the type report and the first rows: a silently mis-typed column is one of the easiest import bugs to miss.
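The arguments above can be tried without a file on disk. This sketch assumes readr 2.0 or later, where text wrapped in I() is read as literal data; the three-row CSV and its -999 missing-value code are made up for illustration.

```r
library(readr)

# Hypothetical inline CSV. I() tells readr this is data, not a file path.
csv_text <- I("species,body_mass_g,sex
Adelie,3750,male
Gentoo,-999,female
Chinstrap,3800,
")

measurements <- read_csv(
  csv_text,
  na = c("", "NA", "-999"),        # strings to treat as missing
  col_types = cols(
    species     = col_factor(),    # instead of the default character guess
    body_mass_g = col_integer(),   # instead of the default double guess
    sex         = col_character()
  )
)

print(measurements)
spec(measurements)   # the column specification readr actually used
```

Supplying col_types turns the guess into a contract: if a column later contains something unexpected, readr reports parsing problems instead of quietly changing the type.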

Other importers

Format                 Function                              Package
.csv / delimited       read_csv(), read_delim()              readr
Excel .xlsx            read_excel()                          readxl
.rds (one R object)    read_rds()                            readr
SPSS / Stata / SAS     read_sav(), read_dta(), read_sas()    haven

Meet the data

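We’ll use morphometric data: measurements across species and sites. The printouts on these slides show factor (<fct>) and integer (<int>) columns, as in the penguins tibble from the palmerpenguins package; a fresh read_csv() import guesses <chr> and <dbl> instead unless you set col_types. Once loaded: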

penguins
#> # A tibble: 344 × 8
#>   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgersen           39.1          18.7               181        3750
#> 2 Adelie  Torgersen           39.5          17.4               186        3800
#> 3 Adelie  Torgersen           40.3          18                 195        3250
#> 4 Adelie  Torgersen           NA            NA                  NA          NA
#> 5 Adelie  Torgersen           36.7          19.3               193        3450
#> 6 Adelie  Torgersen           39.3          20.6               190        3650
#> # ℹ 338 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

A first look

glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

glimpse() is your first move on any new dataset: every column, its type, and a peek at the values.

Tidy in practice: a messy table

Field data often arrives wide — species as column headers:

counts_wide <- tribble(
  ~site,   ~Adelie, ~Gentoo, ~Chinstrap,
  "North",      12,      30,          5,
  "South",      18,      22,          9
)
counts_wide
#> # A tibble: 2 × 4
#>   site  Adelie Gentoo Chinstrap
#>   <chr>  <dbl>  <dbl>     <dbl>
#> 1 North     12     30         5
#> 2 South     18     22         9

species is a variable trapped in the headers — not tidy.

pivot_longer() to the rescue

counts_long <- counts_wide |>
  pivot_longer(
    cols      = -site,
    names_to  = "species",
    values_to = "count"
  )
counts_long
#> # A tibble: 6 × 3
#>   site  species   count
#>   <chr> <chr>     <dbl>
#> 1 North Adelie       12
#> 2 North Gentoo       30
#> 3 North Chinstrap     5
#> 4 South Adelie       18
#> 5 South Gentoo       22
#> 6 South Chinstrap     9

Now species and count are proper columns — ready to group, plot, model. (pivot_wider() does the reverse.)
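A brief sketch of that reverse step. The long table is rebuilt here so the example runs on its own; the values are the same site counts as above.

```r
library(tidyr)
library(tibble)

counts_long <- tribble(
  ~site,   ~species,    ~count,
  "North", "Adelie",        12,
  "North", "Gentoo",        30,
  "North", "Chinstrap",      5,
  "South", "Adelie",        18,
  "South", "Gentoo",        22,
  "South", "Chinstrap",      9
)

# names_from supplies the new column headers, values_from fills the cells.
counts_wide_again <- counts_long |>
  pivot_wider(names_from = species, values_from = count)

print(counts_wide_again)
```

Wide layouts are still useful for presentation tables; keep the long version as the working copy and widen only at the end.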

Transform: the pipe |>

The pipe passes the result on the left into the first argument on the right.

# Nested (hard to read):
summarise(group_by(filter(penguins, !is.na(sex)), species), n = n())

# Piped (read top to bottom):
penguins |>
  filter(!is.na(sex)) |>
  group_by(species) |>
  summarise(n = n())

Read |> as “and then” — it turns nested calls into a readable recipe.
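The rewrite rule can be checked on a plain vector, with no packages involved. This sketch assumes R 4.1 or later (where |> was introduced); the three masses are illustrative values.

```r
# x |> f(y) is evaluated as f(x, y).
masses_g <- c(3750, 3800, 3250)   # illustrative body masses in grams

# Nested: read from the inside out.
nested <- round(mean(masses_g / 1000), 1)

# Piped: read from top to bottom.
piped <- (masses_g / 1000) |>
  mean() |>
  round(1)

identical(nested, piped)  # TRUE
```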

filter() — keep rows

penguins |>
  filter(species == "Adelie", body_mass_g > 4000)
#> # A tibble: 35 × 8
#>   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgersen           39.2          19.6               195        4675
#> 2 Adelie  Torgersen           42            20.2               190        4250
#> 3 Adelie  Torgersen           34.6          21.1               198        4400
#> 4 Adelie  Torgersen           42.5          20.7               197        4500
#> 5 Adelie  Torgersen           46            21.5               194        4200
#> 6 Adelie  Dream               39.2          21.1               196        4150
#> # ℹ 29 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

Operators: == != < > <= >= & | %in%, plus is.na().
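A short sketch of the less obvious operators on a made-up four-row table (the birds tibble is hypothetical, not the penguins data). It also shows that filter() drops rows where the condition evaluates to NA.

```r
library(dplyr)
library(tibble)

birds <- tribble(
  ~species,    ~island,     ~body_mass_g,
  "Adelie",    "Torgersen",         3750,
  "Adelie",    "Dream",               NA,
  "Gentoo",    "Biscoe",            5000,
  "Chinstrap", "Dream",             3800
)

# %in% matches any value in a set; comma-separated conditions mean AND.
dream_or_biscoe <- birds |>
  filter(island %in% c("Dream", "Biscoe"), !is.na(body_mass_g))

# | means OR. The row with missing mass gives NA here and is dropped.
light_or_gentoo <- birds |>
  filter(body_mass_g < 3800 | species == "Gentoo")

print(dream_or_biscoe)
print(light_or_gentoo)
```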

arrange() & select()

Reorder rows

penguins |>
  arrange(desc(body_mass_g)) |>
  select(species, body_mass_g)
#> # A tibble: 344 × 2
#>   species body_mass_g
#>   <fct>         <int>
#> 1 Gentoo         6300
#> 2 Gentoo         6050
#> 3 Gentoo         6000
#> 4 Gentoo         6000
#> 5 Gentoo         5950
#> 6 Gentoo         5950
#> # ℹ 338 more rows

Pick columns

penguins |>
  select(species, island, sex)
#> # A tibble: 344 × 3
#>   species island    sex   
#>   <fct>   <fct>     <fct> 
#> 1 Adelie  Torgersen male  
#> 2 Adelie  Torgersen female
#> 3 Adelie  Torgersen female
#> 4 Adelie  Torgersen <NA>  
#> 5 Adelie  Torgersen female
#> 6 Adelie  Torgersen male  
#> # ℹ 338 more rows

mutate() — create variables

penguins |>
  mutate(
    mass_kg    = body_mass_g / 1000,
    bill_ratio = bill_length_mm / bill_depth_mm
  ) |>
  select(species, mass_kg, bill_ratio)
#> # A tibble: 344 × 3
#>   species mass_kg bill_ratio
#>   <fct>     <dbl>      <dbl>
#> 1 Adelie     3.75       2.09
#> 2 Adelie     3.8        2.27
#> 3 Adelie     3.25       2.24
#> 4 Adelie    NA         NA   
#> 5 Adelie     3.45       1.90
#> 6 Adelie     3.65       1.91
#> # ℹ 338 more rows

This is where derived morphometrics live: ratios, indices, log-transforms for allometry.
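As an illustration of the log-transform case, here is a sketch on a made-up three-row table. The column names log_mass_kg and heavy, and the 4 kg cut-off, are hypothetical choices for the example. Note that a column created in mutate() can be used by later expressions in the same call.

```r
library(dplyr)
library(tibble)

specimens <- tribble(
  ~species, ~body_mass_g, ~flipper_length_mm,
  "Adelie",         3750,                181,
  "Gentoo",         5000,                210,
  "Adelie",           NA,                190
)

specimens_derived <- specimens |>
  mutate(
    mass_kg     = body_mass_g / 1000,
    log_mass_kg = log10(mass_kg),            # uses mass_kg from the line above
    log_flipper = log10(flipper_length_mm),
    heavy       = mass_kg > 4                # hypothetical cut-off
  )

print(specimens_derived)
```

Missing inputs stay missing in every derived column, so there is no need to filter before transforming.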

group_by() + summarise()

The workhorse for “compute something per species / site / treatment”:

penguins |>
  group_by(species, sex) |>
  summarise(
    n         = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass   = sd(body_mass_g, na.rm = TRUE),
    .groups   = "drop"
  )
#> # A tibble: 8 × 5
#>   species   sex        n mean_mass sd_mass
#>   <fct>     <fct>  <int>     <dbl>   <dbl>
#> 1 Adelie    female    73     3369.    269.
#> 2 Adelie    male      73     4043.    347.
#> 3 Adelie    <NA>       6     3540     477.
#> 4 Chinstrap female    34     3527.    285.
#> 5 Chinstrap male      34     3939.    362.
#> 6 Gentoo    female    58     4680.    282.
#> # ℹ 2 more rows
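One detail worth checking in output like this: n() counts rows, including rows where the measurement is missing, so n can be larger than the number of values behind the mean. A minimal sketch on a made-up table, reporting the missing count alongside (n_missing is a name chosen here, not a dplyr feature):

```r
library(dplyr)
library(tibble)

birds <- tribble(
  ~species, ~body_mass_g,
  "Adelie",         3700,
  "Adelie",         3900,
  "Adelie",           NA,
  "Gentoo",         5000,
  "Gentoo",         5200
)

per_species <- birds |>
  group_by(species) |>
  summarise(
    n         = n(),                          # all rows in the group
    n_missing = sum(is.na(body_mass_g)),      # rows without a mass
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    .groups   = "drop"                        # return an ungrouped tibble
  )

print(per_species)
is_grouped_df(per_species)  # FALSE
```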

Visualise

The payoff. We finish the game by turning data into pictures.

The grammar of graphics

Every ggplot is built from the same pieces:

ggplot(data = <DATA>) +
  geom_<TYPE>(mapping = aes(<MAPPINGS>))

  • data: the data frame.
  • aes(): aesthetic mappings, i.e. which variable maps to x, y, colour, shape…
  • geom_<TYPE>(): the geometric object representing the data (points, lines, bars…).
  • Layers are added with +.

A first plot

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm,
                y = body_mass_g)
) +
  geom_point()

Map a variable to colour

ggplot(
  penguins,
  aes(x = flipper_length_mm,
      y = body_mass_g,
      colour = species)
) +
  geom_point()

Anything that should respond to the data goes inside aes().

Set vs. map

Want one fixed colour? Set it outside aes():

ggplot(penguins, aes(flipper_length_mm, body_mass_g)) +
  geom_point(colour = "steelblue", alpha = 0.7)

Map = a variable controls it (inside aes()). Set = fixed constant (outside aes()).
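A common slip is to put the constant inside aes(). The sketch below, on a small made-up data frame, compares the two with layer_data(), which returns the values ggplot2 computed for a layer.

```r
library(ggplot2)

birds <- data.frame(
  flipper_length_mm = c(181, 195, 210, 220),
  body_mass_g       = c(3750, 3250, 4800, 5400)
)

# Set: a constant outside aes() is used literally.
p_set <- ggplot(birds, aes(flipper_length_mm, body_mass_g)) +
  geom_point(colour = "steelblue")

# Slip: inside aes(), "steelblue" is treated as data (a one-level variable),
# so ggplot2 picks a colour from its default scale and adds a legend.
p_slip <- ggplot(birds, aes(flipper_length_mm, body_mass_g)) +
  geom_point(aes(colour = "steelblue"))

unique(layer_data(p_set, 1)$colour)   # "steelblue"
unique(layer_data(p_slip, 1)$colour)  # a default scale colour, not steelblue
```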

Add a model layer

ggplot(
  penguins,
  aes(x = flipper_length_mm,
      y = body_mass_g,
      colour = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

One trend line per group, because colour is mapped globally.
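To get a single pooled trend line instead, map colour only in the point layer. The sketch below uses a made-up two-species data frame and counts the fitted groups in the smooth layer with layer_data(); formula = y ~ x is spelled out only to silence the informational message.

```r
library(ggplot2)

birds <- data.frame(
  species           = rep(c("Adelie", "Gentoo"), each = 4),
  flipper_length_mm = c(181, 186, 190, 195, 210, 215, 220, 225),
  body_mass_g       = c(3500, 3650, 3700, 3900, 4800, 5000, 5300, 5500)
)

# Colour mapped globally: both layers inherit it, one line per species.
p_global <- ggplot(birds, aes(flipper_length_mm, body_mass_g, colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Colour mapped in the point layer only: the smooth sees one pooled group.
p_local <- ggplot(birds, aes(flipper_length_mm, body_mass_g)) +
  geom_point(aes(colour = species)) +
  geom_smooth(method = "lm", formula = y ~ x)

n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
c(global = n_lines_global, local = n_lines_local)
```

Mappings in ggplot() are inherited by every layer; mappings inside a geom apply to that layer alone.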

Facets — small multiples

ggplot(penguins, aes(flipper_length_mm, body_mass_g, colour = species)) +
  geom_point() +
  facet_wrap(~ island)

facet_wrap(~ var) splits one plot into a panel per level — ideal for sites / treatments.

Putting it together

The whole game in one pipeline, ending in a plot (note the switch from |> to + once ggplot() starts):

penguins |>                                   # import (loaded earlier)
  filter(!is.na(body_mass_g), !is.na(sex)) |> # transform: drop missing
  mutate(mass_kg = body_mass_g / 1000) |>     # transform: derive variable
  group_by(species, sex) |>                   # transform: group
  summarise(mean_kg = mean(mass_kg), .groups = "drop") |>
  ggplot(aes(species, mean_kg, fill = sex)) + # visualise
  geom_col(position = "dodge") +
  labs(x = NULL, y = "Mean mass (kg)", fill = "Sex")
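To close the loop with the project structure, a summary like this can be written to output/tables/ so it is regenerated on every run. This is a sketch under assumptions: the six-row birds table is made up, the file name is hypothetical, and tempdir() stands in for the project root so the example is safe to run. ggsave() from ggplot2 does the analogous job for plots.

```r
library(dplyr)
library(readr)
library(tibble)

birds <- tribble(
  ~species, ~sex,     ~body_mass_g,
  "Adelie", "male",           4000,
  "Adelie", "male",           4200,
  "Adelie", "female",         3400,
  "Gentoo", "female",         4600,
  "Gentoo", "male",             NA,
  "Gentoo", NA,               4800
)

mass_summary <- birds |>
  filter(!is.na(body_mass_g), !is.na(sex)) |>
  mutate(mass_kg = body_mass_g / 1000) |>
  group_by(species, sex) |>
  summarise(mean_kg = mean(mass_kg), .groups = "drop")

# Assumption: tempdir() stands in for the project root.
tables_dir <- file.path(tempdir(), "output", "tables")
dir.create(tables_dir, recursive = TRUE, showWarnings = FALSE)

out_file <- file.path(tables_dir, "mean-mass-by-species-sex.csv")
write_csv(mass_summary, out_file)
```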

Wrap up

What we covered

Set-up & foundations

  • RStudio Projects + folder structure
  • Relative paths / here()
  • Assignment, naming, functions, style
  • Tidy data: the 3 rules

The whole game

  • read_csv() and friends
  • pivot_longer() → tidy
  • dplyr: filter / arrange / select / mutate / group_by + summarise, chained with |>
  • ggplot2: aes, geoms, set vs map, facets

Your turn — exercises

  1. Make a project folder for a small analysis with data-raw/, data/, Code/, and output/figures/.
  2. Read one of your own CSVs with read_csv(), check the column types, and glimpse() it.
  3. Compute the number of specimens and mean body mass per island (group_by + summarise).
  4. Plot bill_length_mm vs bill_depth_mm and add geom_smooth(method = "lm"). Then map colour to species, or facet by species: what happens to the trend? (Simpson’s paradox!)

Bonus: take a wide spreadsheet from your own work and tidy it with pivot_longer().

Getting unstuck

  • ?function — the help page (the Examples section is gold).
  • vignette("dplyr") — narrative, worked guides for a package.
  • Search for the exact error message; ask on Posit Community or Stack Overflow (tags [r] and [tidyverse]).
  • Make a reprex with the {reprex} package before asking — you’ll often solve it yourself.

Where to go next

The whole game is just this loop, run over and over: import → tidy → transform ⇄ visualise. Everything else is depth.

Thank you

Questions?

Slides built with Quarto + reveal.js. Source: whole-game.qmd. Content adapted from R for Data Science (2e) (CC BY-NC-ND).