OMICS workshop I Exercise
Advanced Exercise: Integrative Tidy Omics Data Wrangling
Goal
In this exercise you will combine core tidy tools from the tutorial in one end-to-end analysis:
- left_join() and anti_join()
- pivot_longer() and pivot_wider()
- mutate() + case_when()
- group_by() + summarise()
- across()
You will work on a synthetic RNA-like count dataset and produce a compact analysis report.
Setup
library(tidyverse)
set.seed(12345) # for reproducibility

Data generation
# 1) Sample metadata
sample_meta <- tibble(
  sample_id = str_c("S", str_pad(1:30, width = 3, pad = "0")),
  tissue = sample(c("tumor", "normal"), 30, replace = TRUE, prob = c(0.55, 0.45)),
  cohort = sample(c("A", "B", "C"), 30, replace = TRUE),
  batch = sample(c("B1", "B2", "B3"), 30, replace = TRUE)
)

# 2) Clinical table (with intentional unmatched rows for join debugging)
clinical <- tibble(
  sample_id = c(sample_meta$sample_id[1:26], "S900", "S901"),
  patient_age = sample(35:85, 28, replace = TRUE),
  response = sample(c("CR", "PR", "SD", "PD"), 28, replace = TRUE)
)

# 3) Wide expression/count matrix (genes x samples)
genes <- str_c("GENE_", str_pad(1:120, 3, pad = "0"))
count_mat <- matrix(
  rnbinom(length(genes) * nrow(sample_meta), mu = 120, size = 1.3),
  nrow = length(genes),
  ncol = nrow(sample_meta),
  dimnames = list(genes, sample_meta$sample_id)
)
counts_wide <- as_tibble(count_mat, rownames = "gene_id")

sample_meta
clinical
counts_wide

Exercise tasks
Part 1 — Join diagnostics and clean analysis table
- Use anti_join() to identify:
  - sample IDs in sample_meta that do not exist in clinical
  - sample IDs in clinical that do not exist in sample_meta
- Create clinical_clean by keeping only rows that can match sample_meta.
- Build sample_annot using left_join(sample_meta, clinical_clean, by = "sample_id").
- Add a new age group variable with case_when():
  - < 50 = "young"
  - 50-69 = "middle"
  - >= 70 = "older"
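
The join diagnostics above can be sketched on a small toy example. This is only an illustration of the verbs, not the exercise solution: the tables `left_tbl` and `right_tbl`, their columns, and the two-level `age_band` labels are all hypothetical, and `semi_join()` is just one possible way to keep matchable rows.

```r
library(dplyr)

# Hypothetical toy tables (not the exercise data)
left_tbl  <- tibble(id = c("a", "b", "c"), x = 1:3)
right_tbl <- tibble(id = c("b", "c", "z"), age = c(42, 55, 71))

# anti_join() returns rows of the first table with no match in the second
only_left  <- anti_join(left_tbl, right_tbl, by = "id")
only_right <- anti_join(right_tbl, left_tbl, by = "id")

# semi_join() keeps rows of the first table that do have a match in the second
right_clean <- semi_join(right_tbl, left_tbl, by = "id")

# left_join() keeps every row of left_tbl; unmatched rows get NA in new columns.
# case_when() returns NA when no condition is TRUE, so a missing age stays NA.
joined <- left_join(left_tbl, right_clean, by = "id") %>%
  mutate(
    age_band = case_when(
      age < 50  ~ "under_50",
      age >= 50 ~ "50_plus"
    )
  )

only_left
only_right
joined
```

Running both directions of `anti_join()` before joining makes it easy to see which IDs would produce `NA` values after a `left_join()`.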
# TODO: write your Part 1 solution here

Part 2 — Reshape wide counts and annotate samples
- Convert counts_wide to long format as counts_long with columns: gene_id, sample_id, count.
- Join counts_long with sample_annot to create counts_annot.
- Create a log-transformed value: log_count = log2(count + 1).
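
As a minimal sketch of the wide-to-long reshape described above, the following uses a hypothetical two-feature, two-sample table. The names `wide`, `long`, `feature`, `sample`, and `value` are made up for illustration and differ from the exercise objects on purpose.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per feature, one column per sample
wide <- tibble(
  feature = c("f1", "f2"),
  s1 = c(0, 7),
  s2 = c(3, 15)
)

# Every column except `feature` is stacked into name/value pairs,
# then a log2(x + 1) transform is added; the +1 keeps zeros finite
long <- wide %>%
  pivot_longer(cols = -feature, names_to = "sample", values_to = "value") %>%
  mutate(log_value = log2(value + 1))

long
```

The long table has one row per feature and sample combination, which is the shape needed before joining sample-level annotations by the sample identifier.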
# TODO: write your Part 2 solution here

Part 3 — Multi-level summaries using group_by(), summarise(), and across()
- Compute per-gene/per-tissue summary as gene_tissue_summary:
  - mean_count
  - sd_count
  - mean_log_count
  - n_samples
- Compute a second table cohort_summary with one row per cohort and tissue:
  - mean and median for all numeric columns in counts_annot using across(where(is.numeric), ...)
  - include n() as n_rows
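
The `across(where(is.numeric), ...)` pattern can be sketched on a hypothetical toy table as below. The table `toy` and its columns are assumptions for illustration only. The `na.rm = TRUE` argument is included because, in the exercise data, samples without a clinical row will carry `NA` in clinical columns after the left join.

```r
library(dplyr)

# Hypothetical toy table with one grouping column and two numeric columns
toy <- tibble(
  grp   = c("g1", "g1", "g2", "g2"),
  value = c(1, 3, 10, 20),
  score = c(2, NA, 6, 8)
)

toy_summary <- toy %>%
  group_by(grp) %>%
  summarise(
    # Apply a named list of functions to every numeric column;
    # .names controls the output column names, e.g. value_mean
    across(
      where(is.numeric),
      list(
        mean   = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    ),
    n_rows = n(),
    .groups = "drop"
  )

toy_summary
```

Grouping by two columns instead of one, for example `group_by(a, b)`, gives one output row per combination of the two.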
# TODO: write your Part 3 solution here

Part 4 — Wide result table and differential signal
- Pivot gene_tissue_summary wider to get one row per gene with separate tumor/normal means:
  - columns like mean_count_tumor, mean_count_normal
- Create:
  - delta = mean_count_tumor - mean_count_normal
  - abs_delta = abs(delta)
- Return top 15 genes by abs_delta.
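
A minimal sketch of the pivot-then-difference pattern follows, using a hypothetical summary table. The names `toy_summary`, `condition`, `mean_value`, and `diff` are illustrative assumptions, not the exercise objects.

```r
library(dplyr)
library(tidyr)

# Hypothetical long summary: one row per feature and condition
toy_summary <- tibble(
  feature    = c("f1", "f1", "f2", "f2", "f3", "f3"),
  condition  = c("case", "ctrl", "case", "ctrl", "case", "ctrl"),
  mean_value = c(10, 4, 3, 8, 5, 5)
)

# Spread conditions into columns, then compute a signed and absolute difference
toy_wide <- toy_summary %>%
  pivot_wider(
    names_from   = condition,
    values_from  = mean_value,
    names_prefix = "mean_value_"
  ) %>%
  mutate(
    diff     = mean_value_case - mean_value_ctrl,
    abs_diff = abs(diff)
  )

# slice_max() keeps the n rows with the largest values, in descending order
top_two <- toy_wide %>% slice_max(abs_diff, n = 2)

top_two
```

Columns not listed in `names_from` or `values_from` are treated as identifiers by `pivot_wider()`, so a summary table with extra per-group columns will not collapse to one row per feature unless those columns are dropped first or also passed to `values_from`.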
# TODO: write your Part 4 solution here

Deliverables
Create and print these objects:
- sample_annot
- counts_annot
- gene_tissue_summary
- cohort_summary
- top_genes_delta
# TODO: print your final deliverables

Self-check questions
- Do your unmatched IDs from anti_join() make sense given the generated data?
- Does counts_annot have exactly n_genes * n_samples rows for matched samples?
- Are delta values positive for genes higher in tumor and negative for genes higher in normal?
- Can you explain why log2(count + 1) is often used before summarizing count-like data?

Stretch challenge (optional)
Create a compact report table with one row per cohort containing:
- top 3 genes by absolute tumor-normal delta within that cohort
- mean age of included samples
- fraction of response == "CR"
Tips: use grouped summaries, ranking, and string aggregation (str_c() with collapse = ", ").
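
The ranking and string-aggregation tip can be sketched as follows on a hypothetical toy table. The names `toy`, `grp`, `feature`, and `score` are assumptions for illustration, and the sketch covers only the "top items per group as one string" part of the challenge.

```r
library(dplyr)
library(stringr)

# Hypothetical toy table: a score per feature within each group
toy <- tibble(
  grp     = c("g1", "g1", "g1", "g2", "g2", "g2"),
  feature = c("f1", "f2", "f3", "f1", "f2", "f3"),
  score   = c(5, 9, 1, 2, 8, 7)
)

top_per_group <- toy %>%
  group_by(grp) %>%
  # Keep the two highest-scoring rows per group, ordered by descending score
  slice_max(score, n = 2, with_ties = FALSE) %>%
  # Collapse the surviving feature names into one comma-separated string
  summarise(top_features = str_c(feature, collapse = ", "), .groups = "drop")

top_per_group
```

A fraction such as the share of rows meeting a condition can be computed in a grouped `summarise()` as the mean of a logical vector, for example `mean(flag == "yes")` for a hypothetical `flag` column.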
# TODO: stretch solution