OMICS workshop I Exercise

tidyverse

omics

workshop

Advanced Exercise: Integrative Tidy Omics Data Wrangling

Authors

Mohammed N Hassan

Samuele Soraggi

Published

February 26, 2026

Goal

In this exercise you will combine core tidy tools from the tutorial in one end-to-end analysis:

left_join() and anti_join()
pivot_longer() and pivot_wider()
mutate() + case_when()
group_by() + summarise()
across()

You will work on a synthetic RNA-like count dataset and produce a compact analysis report.

Setup

library(tidyverse)
set.seed(12345) # for reproducibility

Data generation

# 1) Sample metadata
sample_meta <- tibble(
  sample_id = str_c("S", str_pad(1:30, width = 3, pad = "0")),
  tissue = sample(c("tumor", "normal"), 30, replace = TRUE, prob = c(0.55, 0.45)),
  cohort = sample(c("A", "B", "C"), 30, replace = TRUE),
  batch = sample(c("B1", "B2", "B3"), 30, replace = TRUE)
)

# 2) Clinical table (with intentional unmatched rows for join debugging)
clinical <- tibble(
  sample_id = c(sample_meta$sample_id[1:26], "S900", "S901"),
  patient_age = sample(35:85, 28, replace = TRUE),
  response = sample(c("CR", "PR", "SD", "PD"), 28, replace = TRUE)
)

# 3) Wide expression/count matrix (genes x samples)
genes <- str_c("GENE_", str_pad(1:120, 3, pad = "0"))
count_mat <- matrix(
  rnbinom(length(genes) * nrow(sample_meta), mu = 120, size = 1.3),
  nrow = length(genes),
  ncol = nrow(sample_meta),
  dimnames = list(genes, sample_meta$sample_id)
)

counts_wide <- as_tibble(count_mat, rownames = "gene_id")

sample_meta
clinical
counts_wide

Exercise tasks

Part 1 — Join diagnostics and clean analysis table

Use anti_join() to identify:
- sample IDs in sample_meta that do not exist in clinical
- sample IDs in clinical that do not exist in sample_meta
Create clinical_clean by keeping only rows that can match sample_meta.
Build sample_annot using left_join(sample_meta, clinical_clean, by = "sample_id").
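Add a new age group variable with case_when(), based on patient_age:
- < 50 = "young"
- 50-69 = "middle"
- >= 70 = "older"
Samples without a clinical record have no patient_age, so their age group stays NA unless you add an explicit fallback.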
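The join diagnostics and age binning described above can be sketched on a small toy example. The tables meta_toy and clin_toy and all their values are hypothetical and are not the exercise data, and semi_join() is only one possible way to keep matching rows. Treat this as an illustration of the pattern, not as the solution.

```r
library(tidyverse)

# Hypothetical toy tables (not the exercise data)
meta_toy <- tibble(
  sample_id = c("T1", "T2", "T3"),
  tissue    = c("tumor", "normal", "tumor")
)
clin_toy <- tibble(
  sample_id   = c("T1", "T2", "T9"),
  patient_age = c(45, 70, 60)
)

# anti_join() returns rows of the first table with no match in the second
meta_only <- anti_join(meta_toy, clin_toy, by = "sample_id")
clin_only <- anti_join(clin_toy, meta_toy, by = "sample_id")

# Keep only clinical rows that can match the metadata
clin_toy_clean <- semi_join(clin_toy, meta_toy, by = "sample_id")

# left_join() keeps every metadata row; unmatched samples get NA
annot_toy <- meta_toy |>
  left_join(clin_toy_clean, by = "sample_id") |>
  mutate(age_group = case_when(
    patient_age < 50  ~ "young",
    patient_age < 70  ~ "middle",
    patient_age >= 70 ~ "older"
  ))

annot_toy
```

Because case_when() evaluates its conditions in order, the second branch only needs an upper bound. A missing patient_age matches no branch and therefore yields NA.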

# TODO: write your Part 1 solution here

Part 2 — Reshape wide counts and annotate samples

Convert counts_wide to long format as counts_long with columns: gene_id, sample_id, count.
Join counts_long with sample_annot to create counts_annot.
Create a log-transformed value: log_count = log2(count + 1).
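As a hedged illustration of the reshape-then-annotate pattern, the sketch below runs the three steps on a toy table. The objects wide_toy and annot_toy are hypothetical stand-ins for counts_wide and sample_annot.

```r
library(tidyverse)

# Hypothetical toy wide table: one row per gene, one column per sample
wide_toy <- tibble(
  gene_id = c("g1", "g2"),
  T1      = c(0, 3),
  T2      = c(7, 1)
)
annot_toy <- tibble(
  sample_id = c("T1", "T2"),
  tissue    = c("tumor", "normal")
)

long_toy <- wide_toy |>
  # every column except gene_id holds counts for one sample
  pivot_longer(cols = -gene_id, names_to = "sample_id", values_to = "count") |>
  # attach sample-level annotation to each gene/sample row
  left_join(annot_toy, by = "sample_id") |>
  # the +1 keeps zero counts defined on the log scale
  mutate(log_count = log2(count + 1))

long_toy
```

The long table has one row per gene and sample, so its row count equals the number of genes times the number of sample columns in the wide table.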

# TODO: write your Part 2 solution here

Part 3 — Multi-level summaries using group_by(), summarise(), and across()

Compute per-gene/per-tissue summary as gene_tissue_summary:
- mean_count
- sd_count
- mean_log_count
- n_samples
Compute a second table cohort_summary with one row per cohort and tissue:
- mean and median for all numeric columns in counts_annot using across(where(is.numeric), ...)
- include n() as n_rows
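The two summary styles can be sketched on toy data as follows. The table long_toy, its values, and the result names gene_tissue_toy and tissue_toy are hypothetical. The second summary groups by tissue only to keep the example small, whereas the exercise asks for cohort and tissue.

```r
library(tidyverse)

# Hypothetical toy long table (not the exercise data)
long_toy <- tibble(
  gene_id     = rep(c("g1", "g2"), each = 4),
  tissue      = rep(c("tumor", "tumor", "normal", "normal"), times = 2),
  count       = c(1, 3, 7, 15, 0, 0, 3, 3),
  patient_age = c(40, 60, 50, 70, 40, 60, 50, 70)
) |>
  mutate(log_count = log2(count + 1))

# Named summaries, one row per gene and tissue
gene_tissue_toy <- long_toy |>
  group_by(gene_id, tissue) |>
  summarise(
    mean_count     = mean(count),
    sd_count       = sd(count),
    mean_log_count = mean(log_count),
    n_samples      = n(),
    .groups = "drop"
  )

# Same functions applied to every numeric column via across()
tissue_toy <- long_toy |>
  group_by(tissue) |>
  summarise(
    across(
      where(is.numeric),
      list(mean = ~ mean(.x, na.rm = TRUE), median = ~ median(.x, na.rm = TRUE)),
      .names = "{.col}_{.fn}"
    ),
    n_rows = n(),
    .groups = "drop"
  )

gene_tissue_toy
tissue_toy
```

In this sketch n_rows is computed after across(). If it came first, where(is.numeric) would also select the freshly created n_rows column and summarise it too.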

# TODO: write your Part 3 solution here

Part 4 — Wide result table and differential signal

Pivot gene_tissue_summary wider to get one row per gene with separate tumor/normal means:
- columns like mean_count_tumor, mean_count_normal
Create:
- delta = mean_count_tumor - mean_count_normal
- abs_delta = abs(delta)
Return the top 15 genes by abs_delta as top_genes_delta.
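A minimal sketch of the widen-and-compare step, assuming a toy summary table. The names summary_toy and delta_toy and all values are hypothetical, and the sketch keeps the top 2 rows instead of 15.

```r
library(tidyverse)

# Hypothetical per-gene/per-tissue means (not the exercise data)
summary_toy <- tibble(
  gene_id    = rep(c("g1", "g2", "g3"), each = 2),
  tissue     = rep(c("tumor", "normal"), times = 3),
  mean_count = c(10, 4, 2, 9, 5, 5)
)

delta_toy <- summary_toy |>
  # one row per gene, one mean_count_* column per tissue
  pivot_wider(
    id_cols      = gene_id,
    names_from   = tissue,
    values_from  = mean_count,
    names_prefix = "mean_count_"
  ) |>
  mutate(
    delta     = mean_count_tumor - mean_count_normal,
    abs_delta = abs(delta)
  ) |>
  # keep the rows with the largest absolute difference
  slice_max(abs_delta, n = 2, with_ties = FALSE)

delta_toy
```

Setting id_cols to gene_id matters when the summary table carries other value columns, such as a standard deviation. Left in as identifiers, they would prevent the tumor and normal rows from collapsing into one row per gene.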

# TODO: write your Part 4 solution here

Deliverables

Create and print these objects:

sample_annot
counts_annot
gene_tissue_summary
cohort_summary
top_genes_delta

# TODO: print your final deliverables

Self-check questions

Do your unmatched IDs from anti_join() make sense given the generated data?
Does counts_annot have exactly n_genes * n_samples rows, with clinical columns filled only for matched samples?
Are delta values positive for genes higher in tumor and negative for genes higher in normal?
Can you explain why log2(count + 1) is often used before summarizing count-like data?

Stretch challenge (optional)

Create a compact report table with one row per cohort containing:

top 3 genes by absolute tumor-normal delta within that cohort
mean age of included samples
fraction of response == "CR"

Tips: use grouped summaries, ranking, and string aggregation (str_c() with collapse = ", ").
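The ranking and string aggregation mentioned in the tips can be sketched on toy data. The tables ranked_toy and samples_toy and their values are hypothetical, and the sketch keeps the top 2 genes per cohort rather than 3. It assumes the per-cohort deltas have already been computed.

```r
library(tidyverse)

# Hypothetical per-cohort gene deltas (not the exercise data)
ranked_toy <- tibble(
  cohort    = c("A", "A", "A", "B", "B", "B"),
  gene_id   = c("g1", "g2", "g3", "g1", "g2", "g3"),
  abs_delta = c(5, 9, 1, 2, 3, 8)
)

# Hypothetical sample-level table with missing clinical values
samples_toy <- tibble(
  cohort      = c("A", "A", "B", "B"),
  patient_age = c(40, 60, 55, NA),
  response    = c("CR", "PD", "CR", NA)
)

# Rank genes within each cohort, then collapse the IDs into one string
gene_report <- ranked_toy |>
  group_by(cohort) |>
  slice_max(abs_delta, n = 2, with_ties = FALSE) |>
  ungroup() |>
  arrange(cohort, desc(abs_delta)) |>
  group_by(cohort) |>
  summarise(top_genes = str_c(gene_id, collapse = ", "), .groups = "drop")

# The mean of a logical vector is the fraction of TRUE values
sample_report <- samples_toy |>
  group_by(cohort) |>
  summarise(
    mean_age = mean(patient_age, na.rm = TRUE),
    frac_cr  = mean(response == "CR", na.rm = TRUE),
    .groups = "drop"
  )

report_toy <- left_join(gene_report, sample_report, by = "cohort")
report_toy
```

With na.rm = TRUE, samples lacking a response are dropped from the denominator. Whether that is the desired definition of the fraction is a choice to state explicitly in the report.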

# TODO: stretch solution