Workshop goals
By the end of this workshop you should be able to:
- understand tidy data and why it matters for data
analysis
- clean and wrangle data using dplyr (filter, select,
mutate, summarise)
- reshape data using tidyr
(
pivot_longer(), pivot_wider(),
separate())
- combine datasets using joins
(
right_join(), left_join(),
inner_join(), anti_join())
- handle real messy cases: missing values, wrong column types, messy
IDs, duplicate keys
Getting started
Installing R and RStudio
Working with R is primarily text-based. The basic mode of use for R
is that the user types in a command in the R language and presses enter,
and then R computes and displays the result.

RStudio
We will be working in RStudio.
This surrounds the console, where one enters commands and views
the results, with various conveniences. In addition to the console,
RStudio provides panels containing:
- A text editor, where R commands can be recorded for future
reference.
- A history of commands that have been typed on the console.
- An “environment” panel with a list of variables, which
contain values that R has been told to save from previous commands.
- A file manager.
- Help on the functions available in R (Using a question mark before
the function)
- A panel to show plots (graphs).

R script vs RMD (Markdowns and Notebooks)
To start an R script click ctrl + shift + N or
cmd + shift + N on mac You can run lines of code, but the
output won’t be saved
To start an RMD click ctrl + B/N or
cmd + B/N on mac to initialize a notebook
Allows you to run and visualize output in the file, easy to share with
collaborators showing both the code and all results/plots
This is called the main chunk, where is has all the settings that you
would like to apply along the notebook
knitr::opts_chunk$set(echo = T, results = "show")
options(width=80)
require("knitr")
Loading required package: knitr
## To change the directory in a notebook/markdown
# opts_knit$set(root.dir = "path/to/directory/")
To insert a chunk, use ctrl + alt + I, or
command + option + I
R Basics
The console
Open RStudio, click on the “Console” panel, type 3+2 and
press enter. R displays the result of the calculation. In this document,
we will be showing such an interaction with R as below.
3 + 2
[1] 5
+ is called an operator. R has other operators for basic
mathematical calculations:
- for subtraction
* for multiplication
/ forward slash for division
^ hat for “power”
* has higher precedence than +. We can use
brackets ( ), if necessary. Try 1+2*3 and
(1+2)*3
1+2*3
[1] 7
(1+2)*3
[1] 9
Making lines of code according to styler, highlight the lines of
code, click ctrl + shift + A or
cmd + shift + A
Comparisons
We use >, <, >=,
==, <=, != for comparisons.
Note that for comparisons its always a double sign to be used. This
returns a “logical” value of TRUE or
FALSE.
3 * 3 == 9
[1] TRUE
This literally means Is the left hand side equal to the right hand
side? Is 3 times 3 equal to 9?
Variables
A variable is a name that stores a value. We create a variable by
assigining a value to it using <- which points to the
left assigning the value of the right to the one on the left
weight_kg = 65
weight_kg <- 65
To print the value stored in a variable, print the variable
weight_kg
[1] 65
Names of the variables should not contain special characters or
spaces, should not start with a number. Dots are ok, unlike many other
programming languages
To list all variables in an environment
ls()
[1] "weight_kg"
## to remove a variable/object
rm(weight_kg)
## to remove some variable of a certain pattern (starting/ending/including ....)
rm(list = ls(pattern = ""))
Any line of code inside a chunk or in the console that has a
# before, is a comment. Useful to document your code so its
easy to remember why we used this command
To add a # at the beginning of the line for multiple
lines, highlight all lines, and click ctrl + shift + c or
cmd + shift + c
We can do arithemtic on the variable
weight_kg <- 65
2.2 * weight_kg
[1] 143
Vectors
A vector is a collection of elements (numbers, words,.. etc) To be
able to combine elements into a vector, we use the function c() which
stands for combine or concatenate
c(1,2,3,4)
[1] 1 2 3 4
vec <- c(10,20,30,40)
vec + 1 # Here the arithmetic + 1 is applied on every element of the vector
[1] 11 21 31 41
We can also combine vectors
vec + vec
[1] 20 40 60 80
# In this case, the 1st element of vec is added to the 1st element of vec, 2nd with 2nd.. and so on
To check the length of the vector we use the function length()
length(vec)
[1] 4
Data types in R
Most common data types in R are mainly:
* Numeric (could be double which is the default, or integer)
* Character (String)
* Logical (TRUE/FALSE)
To check the type of a vector, we use typeof() function
x <- 5
typeof(x)
[1] "double"
## To create an integer add L to the number
x <- 5L
typeof(x)
[1] "integer"
print(x)
[1] 5
class(x)
[1] "integer"
For Characters, we use single or double quotes
"Hello World"
[1] "Hello World"
plain_txt = 'This is a string'
print(plain_txt)
[1] "This is a string"
typeof(plain_txt)
[1] "character"
class(plain_txt)
[1] "character"
To be able to look for and modify a pattern, the vector should be a
string!
There are also categorical vectors in which elements can be one of
several “categories/levels”
factor(c("mutant", "wildtype", "mutant"),
levels = c("wildtype", "mutant"))
[1] mutant wildtype mutant
Levels: wildtype mutant
Factor is important for arrangement of data in a plot for
example.
Accessing elements of vectors (Indexing)
To access elements, we use [] with a number (index)
vec[1]
[1] 10
vec[-1]
[1] 20 30 40
vec[c(1,3)]
[1] 10 30
vec[1:4]
[1] 10 20 30 40
vec[1] <- 6
We can use a vector instead of numbers to index a vector
myindex <- c(4,2,1,3)
vec[myindex]
[1] 40 20 6 30
## which is the same as
vec[c(4,2,1,3)]
[1] 40 20 6 30
Matrix
2-D tabular data structure in which all elements are of the same type
(typically numeric matrices, but could also be character or logical)
# To create a matrix
matrix()
[,1]
[1,] NA
# Example:
mat <- matrix(1:12, nrow = 2, ncol = 6, byrow = F)
mat
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 3 5 7 9 11
[2,] 2 4 6 8 10 12
typeof(mat)
[1] "integer"
class(mat)
[1] "matrix" "array"
# To get the dimensions of a matrix
dim(mat)
[1] 2 6
## get only how many rows
nrow(mat)
[1] 2
## get only how many columns
ncol(mat)
[1] 6
## get column names
colnames(mat)
NULL
## Accessing elements of a matrix
mat[1,3] # row number, # col number]
[1] 5
# this retuns the element at row 1, column 3
Data frame
Similar to a matrix, the difference is that the columns of a
dataframe could be of different types (could combine numeric values,
strings, logical) The index of each row could be from 1 to
length(dataframe) or could have a unique name (can be accessed/modified
using rownames function)
Functions
a function is followed by parentheses () mean(), sd(), print(),
rep()
# To create a function
name_of_function <- function(){ # inside parentheses what the function takes
# what the function should do
}
## then call the function
name_of_function()
The function takes arguments (options), some are mandatory and some
are optional. Arguments can be supplied in order (positional argument),
or by the argument name (better to use). Clicking tab for
autocompletion helps us remember what arguments a function could
take
calc <- function(num1, num2){
sum = num1 + num2
print(sum)
}
calc(num1 = 10, 5)
[1] 15
Lists
Lists can contain different kinds of elements
mylist <- list(num= 42, greeting="Hello, World")
# to access elements: ## double square bracket to access
mylist[[1]]
[1] 42
mylist$greeting
[1] "Hello, World"
If you are not sure about the type of a variable, run class
class(vec)
[1] "numeric"
class(mat)
[1] "matrix" "array"
str(mylist)
List of 2
$ num : num 42
$ greeting: chr "Hello, World"
Applying a function on a list (very practical)
list_nums <- list(
a = c(1, 2, 3),
b = c(10, 20, 30),
c = c(5, 5, 5)
)
lapply(list_nums, mean)
$a
[1] 2
$b
[1] 20
$c
[1] 5
sapply (simple apply)
for a simplified version
sapply(list_nums, function(elem) elem^2) ## returns a simplified form (vector)
a b c
[1,] 1 100 25
[2,] 4 400 25
[3,] 9 900 25
Loops
for (numb in list_nums){
print(numb)
}
[1] 1 2 3
[1] 10 20 30
[1] 5 5 5
If conditional
list_nums <- c(1,2,3,6,7)
for(numb in list_nums){
if(numb > 5){
print(numb)}
else print("This number is smaller than 5, so will not be printed")
}
[1] "This number is smaller than 5, so will not be printed"
[1] "This number is smaller than 5, so will not be printed"
[1] "This number is smaller than 5, so will not be printed"
[1] 6
[1] 7
Tidy data
Tidy data rules: 1. Each variable = one column 2. Each observation =
one row 3. Each type of entity = one table
Why it matters: - easier plotting - easier modelling - fewer bugs -
compatible with tidyverse grammar
Installing packages
To install a package from CRAN, we use the function
install.packages
install.packages("styler")
install.packages("tidyverse")
From Github
Github is a web-based platform for
version control, collaboration, and code sharing, built around the Git
system. It allows users to track changes to files, manage contributions
from multiple people, and publish open-source projects.
GitHub is widely used for:
Sharing R packages before they are published on CRAN
Developing packages collaboratively using pull requests and
issues
Storing analysis projects (scripts, R Markdown, data,
documentation)
Reproducible workflows, since every change is tracked
Installing a package from Github
install_github("author/package")

install.packages("devtools")
## To load a library
library(devtools)
devtools::install_github("githubusername/repository")
install_github("immunogenomics/presto")
## instead of loading a whole package, we can call a function from a package using the format:
package::function()
## function select
AnnotationDbi::select()
dplyr::select()
## Example
devtools::install_github("immunogenomics/presto")
From Bioconductor
Bioconductor
Bioconductor is an open-source project that provides a large
collection of R packages for biological and biomedical data analysis,
especially high-throughput sequencing, genomics, transcriptomics, and
proteomics.
Curated, peer-reviewed packages: Every package goes through a
structured review for code quality, documentation, and
reproducibility.
Focus on life sciences: Tools for RNA-seq, ChIP-seq, single-cell
RNA-seq, epigenomics, variant analysis, annotation, and more.
Reproducible workflows using standardized data structures (e.g.,
SummarizedExperiment, SingleCellExperiment).
Frequent release cycle (twice yearly), synchronized with R
versions.
Bioconductor has become a central ecosystem for computational biology
because it promotes reproducibility, integrates with modern R workflows,
and provides consistent interfaces for complex biological datasets.


## First, to install Bioconductor
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.22")
## Installing packages from Bioconductor
BiocManager::install("DESeq2")
install.packages("tidyverse")
install.packages("janitor")
install.packages("lubridate")
Loading libraries
library(tidyverse)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)
library(janitor)
library(lubridate)
# To load multiple libraries at once, we can use pacman package, p_load function
pacman::p_load(tidyverse, styler, dplyr, stringr, janitor, lubridate)
Working with directories
## To check the current directory
getwd()
## To change the working directory
setwd("Path/to/directory/")
## Remember to change the directory for the whole notebook, we use the main chunk
Importing data
## Importing an R data structure object (.RDS)
myfile <- readRDS(file = "path/to/file.RDS")
## CSV file
myfile <- utils::read.csv()
## More general function
myfile <- utils::read.delim(file = "/path/to/file.extenstion",
header = T, sep = "\t") ## the file has a header with column names, and each observation is separated by a tab
myfile <- readxl::read_xlsx("/path/to/file/file.xlsx")
Checking the content of a file
head(myfile) ## shows the first 6 observations, can change n argument
tail(myfile) ## shows the last 6 observations, can change n argument
str(myfile) ## checks the structure of the file as a variable
dim(myfile) ## checks the dimensions of a file (returns number of rows, number of columns)
nrow(myfile) # checks the number of rows
ncol(myfile)# checks the number of columns
## checks the column names
colnames(myfile)
names(myfile)
Data maniuplation
Adding columns to a dataframe
myfile$newcol <- myfile$col1 * myfile$col2
Access columns by name
# one column
myfile$columnname
## more than a column
myfile[, c("col1", "col4", "col2")]
myfile[, c(1,4,2)]
Access a column by index
myfile[,4] # the row is empty, which means show all rows for column 4
Access rows
df[4, ] ## gets the 4th row for all columns
df[4:10] ## from 4th to 10th
Select rows by a condition (exact)
myfile_subset <- subset(myfile, given_column == "something") ## subset rows that follow this condition
Select rows by a condition (more than one category)
myfile_subset <- subset(myfile,
given_column %in% c("something", "something else")) ## subset rows that follow this condition
Select rows by a condition (exclusion of exact)
myfile_subset <- subset(myfile, given_column != "something") ## subset rows that follow this condition
Select rows by a condition (exclusion of more than one
category)
myfile_subset <- subset(myfile, given_column != "something" &
given_column != "something else" &
given_column != "another thing")
Using the ! to negate
myfile_subset <- subset(myfile,
!(given_column %in% c("something", "something else")))
There is no equivalent operator for %in%, but you can
create it!
"%out%" <- Negate("%in%")
myfile_subset <- subset(myfile,
given_column %out% c("something", "something else"))
## subset rows that do not follow this condition
# clean column names
# df <- janitor::clean_names(df)
# install.packages("tidytuesdayR")
# tuesdata <- tidytuesdayR::tt_load('2023-10-24')
# df <- tuesdata$patient_risk_profiles
df <- readxl::read_xlsx("tuesdata_df.xlsx")
head(df)
df$`predicted risk of Sudden Vision Loss, with no eye pathology causes`
[1] 1.115083e-04 1.607285e-03 1.458917e-04 1.527737e-04 3.258655e-04 4.667266e-04
[7] 2.140686e-04 1.600603e-04 2.184795e-04 6.122239e-04 1.036383e-04 6.725240e-04
[13] 3.901209e-05 6.773092e-05 2.323113e-04 9.834800e-05 4.658432e-05 1.349731e-04
[19] 3.089568e-04 3.136560e-04 6.938989e-05 1.141138e-04 1.059116e-04 3.656749e-04
[25] 4.289676e-04 1.280323e-04 6.686286e-05 1.063679e-04 2.121934e-04 7.246477e-04
[31] 9.408668e-05 4.846494e-04 2.092650e-04 3.848392e-04 2.493011e-04 6.565204e-04
[37] 2.682293e-04 5.494188e-04 1.903225e-04 7.163179e-05 9.823006e-05 2.013026e-04
[43] 2.274620e-04 1.126852e-04 9.863359e-05 6.779899e-04 3.971666e-05 4.995190e-05
[49] 5.199037e-05 8.074773e-05 1.597246e-04 5.455884e-04 1.431606e-04 4.426006e-04
[55] 1.203483e-04 9.213029e-04 1.043975e-04 1.553771e-04 2.758162e-04 2.382385e-04
[61] 9.461499e-05 1.672590e-04 2.414275e-04 1.585947e-04 3.395669e-04 2.180608e-05
[67] 1.913146e-04 2.044669e-04 2.813584e-04 9.473806e-05 7.132445e-05 1.358532e-04
[73] 1.018202e-04 2.385006e-05 4.251972e-05 6.189295e-04 5.478834e-04 9.701396e-04
[79] 3.324462e-04 2.345282e-04 4.010593e-04 1.223014e-04 1.369716e-04 3.559392e-05
[85] 2.381193e-05 2.206086e-04 6.609310e-04 6.652940e-05 1.531101e-04 6.950099e-05
[91] 6.898172e-05 7.661815e-05 4.003384e-04 7.201001e-04 8.395539e-04 1.982071e-04
[97] 4.538425e-05 5.966995e-04 4.032301e-04 3.110642e-04
df <- janitor::clean_names(df)
head(df)
Using tidyverse
Used for data wrangling and maniuplation
When we load tidyverse, by default, dplyr, readr, stringr, tibble,
tidyr, purrr, and ggplot2 are loaded
library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4 ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
We will work with care
state dataset from tidy tuesday
care_state <- readr::read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-08/care_state.csv"
)
Rows: 1232 Columns: 8── Column specification ──────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): state, condition, measure_id, measure_name, footnote
dbl (1): score
date (2): start_date, end_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
care_state %>% glimpse()
Rows: 1,232
Columns: 8
$ state <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "…
$ condition <chr> "Healthcare Personnel Vaccination", "Healthcare Personnel Vaccinati…
$ measure_id <chr> "HCP_COVID_19", "IMM_3", "OP_18b", "OP_18b_HIGH_MIN", "OP_18b_LOW_M…
$ measure_name <chr> "Percentage of healthcare personnel who are up to date with COVID-1…
$ score <dbl> 7.3, 80.0, 140.0, 157.0, 136.0, 136.0, NA, 196.0, 230.0, 182.0, 200…
$ footnote <chr> NA, NA, "25, 26", "25, 26", "25, 26", "25, 26", "25, 26", "25", "25…
$ start_date <date> 2024-01-01, 2023-10-01, 2023-04-01, 2023-04-01, 2023-04-01, 2023-0…
$ end_date <date> 2024-03-31, 2024-03-31, 2024-03-31, 2024-03-31, 2024-03-31, 2024-0…
- state >> The two-letter code for the state (or
territory, etc) where the hospital is located.
- condition >> The condition for which the patient was
admitted. Six categories of conditions are included in the data.
- measure_id >> The ID of the thing being measured.
Note that there are 22 unique IDs but only 21 unique names.
- measure_name >> The name of the thing being measured.
Note that there are 22 unique IDs but only 21 unique names.
- score >> The score of the measure.
- footnote >> Footnotes that apply to this measure: – 5
= “Results are not available for this reporting period.”, – 25 = “State
and national averages include Veterans Health Administration (VHA)
hospital data.”, – 26 = “State and national averages include Department
of Defense (DoD) hospital data.”.
- start_date >> date The date on which measurement
began for this measure.
- end_date >> date The date on which measurement ended
for this measure.
Q1: How many columns are there in the dataset? What columns exist?
Q2: How many observations(rows)? Q3: How many different states exist in
the dataset ? Q4: What categories are there in the condition column? Q5:
What is being measured in the “measure_name” column? Q6: How many
observations with footnote 5, 25, and 26? Q7: Which years are there in
the start and end date columns?
starwars <- dplyr::starwars
head(starwars)
typeof(starwars)
[1] "list"
class(starwars)
[1] "tbl_df" "tbl" "data.frame"
## Get the name of all columns
colnames(starwars)
[1] "name" "height" "mass" "hair_color" "skin_color" "eye_color"
[7] "birth_year" "sex" "gender" "homeworld" "species" "films"
[13] "vehicles" "starships"
# Or
names(starwars)
[1] "name" "height" "mass" "hair_color" "skin_color" "eye_color"
[7] "birth_year" "sex" "gender" "homeworld" "species" "films"
[13] "vehicles" "starships"
Functions on rows
1. Filtering
filter(dataframe, condition) like
filter(dataframe, column == "something")
filter(dataframe, condition1, condition2) like
filter(dataframe, column == "something", column == "something")
Why it matters in healthcare/bioinformatics: you
almost always filter: - QC failures - non-target tissue / condition -
outliers - incomplete cases
human_starwars <- filter(starwars,
species == "Human")
head(human_starwars)
filter(starwars,
species == "Human", height > 180)
Combining a lot of functions
df <- nycflights23::flights
mean(pull(filter(filter(df, origin == "JFK"), carrier == "AA", !is.na(arr_delay)), arr_delay))
[1] 0.5470231
Using a pipe %>%. Shortcut is
ctrl + shift + M or command + shift + M
df %>%
filter(origin == "JFK") %>%
filter(carrier == "AA") %>%
filter(!is.na(arr_delay)) %>%
pull(arr_delay) %>%
mean()
[1] 0.5470231
Even nicer
df %>%
filter(origin == "JFK",
carrier == "AA",
!is.na(arr_delay)) %>%
summarise(avg_delay = mean(arr_delay)) %>%
pull(avg_delay)
[1] 0.5470231
NOTE
The function pull is a function that turns a column into a
vector For example if you have a column called “genes”, using pull turns
the column into a vector with all these genes
Common patterns
starwars %>% filter(is.na(mass)) # missing values
starwars %>% filter(!is.na(height)) # non-missing
starwars %>% filter(between(height, 150, 200)) # within a range
2. arrange() — sort rows
starwars %>%
arrange(desc(height))
3. distinct() — unique rows
starwars %>%
distinct(species, homeworld)
4. slice_max, slice_min, slice_head, and slice_tail
DEGs <- readxl::read_xlsx("DEGs.xlsx")
DEGs %>%
mutate(pct_diff = pct.1 - pct.2) %>%
relocate(pct_diff, .after = pct.2) %>%
filter(p_val_adj < 0.05) %>%
group_by(cluster) %>%
slice_max(n = 10, order_by = avg_log2FC)
slice_max for each group, get the top n
observations slice_min for each group, get the
bottom n observations slice_head for each group,
get the first n observations slice_tail for each
group, get the last n observations
If not grouped
slice_max() → get the top n rows overall based on a
variable. slice_min() → get the bottom n rows overall based
on a variable. slice_head() → get the first n rows in the
dataset. slice_tail() → get the last n rows in the
dataset.
Functions on columns
1. select(), rename(), relocate()
starwars %>%
select(name, height, species, mass) %>%
rename("weight_kg" = "mass") %>%
relocate(weight_kg, .after = height)
rename(dataframe, newcolname = oldname) renames an
existing column name to another Some R versions requires “” for column
renaming
relocate function allows the new column is added in a
specific position, default (last column)
2. Removing a column
The function select can be used to positively or
negatively select columns (in any order)
df %>%
select(day, month, dep_time)
We can use select(df, -col1) for one column or multiple
columns like select(df, -col1, -col2, -col3) or even better
use (any_of)
cols_to_drop <- c("year", "carrier")
df %>%
select(-any_of(cols_to_drop))
df %>%
select(starts_with("dep"))
df %>%
select(ends_with("time"))
df %>%
select(contains("del"))
df %>%
select(matches("(delay|time)$"))
NA
df %>%
select(where(is.numeric))
df %>%
select(where(is.character))
3. Turning a column into multiple
df_time_hour <- df %>% select(time_hour)
head(df_time_hour)
The function
separate(dataframe, col = "column_to_split", into = c("piece1", "piece2"), sep = "separator"))
df_time_hour <- df_time_hour %>% separate(sep = " ",
col = time_hour,
into = c("yearmonthday", "time"),
remove = T)
head(df_time_hour)
df_time_hour <- df_time_hour %>%
separate(sep = "-",
col = yearmonthday,
into = c(NA, "month", "day"),
remove = T)
head(df_time_hour)
The same function can be applied on the time column, into hour,
minute, seconds
df_time_hour <- df_time_hour %>%
separate(sep = ":",
col = time,
into = c("hr", "mn", "sec"),
remove = T)
head(df_time_hour)
4. uniting columns with unite()
df_time_hour %>%
unite("newcol", month, day, hr, mn, sec, sep = "-", remove = FALSE)
5. mutate() — create new variables
Adding a column follows this structure
mutate(df, "newcolumn" = df$existingcolumn) or
mutate(df, newcolumn = df$existingcolumn)
starwars %>%
mutate(height_m = height / 100,
bmi = mass / (height_m^2)) %>%
select(name, height, mass, height_m, bmi)
new_df <- nycflights23::flights %>%
mutate("dag" = day)
new_df <- new_df %>%
mutate("Day_Month_Year" = str_c(new_df$day,
new_df$month,
new_df$year, sep = "_")) %>%
relocate(Day_Month_Year, .before = year)
head(new_df)
case_when() — categorical variables
mutate with case_when can be used to add a
column based on a condition of another column that follows a
pattern.
mutate(NewCol = case_when(grepl(“pattern”, ExistingColumn) ~
“AddedText”, TRUE ~ “AddedText2”))
which means Assign “AddedText” if the pattern exists in the
ExistingColumn, if not, assign “AddedText2”
starwars %>%
mutate(height_group = case_when(
height < 150 ~ "short",
height < 190 ~ "medium",
TRUE ~ "tall")
) %>%
count(height_group)
6.1 Grouping
group_by is used to group the dataframe by a certain
column, where another function could be applied on each group. For
instance, here we calculate the average arrival delay for each
carrier
df %>%
group_by(carrier) %>%
summarise(
avg_arr_delay = mean(arr_delay, na.rm = TRUE),
n_flights = n()
)
6.2 group_by() + summarise()
starwars %>%
group_by(species) %>%
summarise(
n = n(),
mean_height = mean(height, na.rm = TRUE)
) %>%
arrange(desc(n))
7. across() — multi-column operations
starwars %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Merging/Joining dataframes
set.seed(12345)
patient_table <- tibble(
patient_id = paste0("P", str_pad(1:12, 3, pad = "0")),
diagnosis = sample(c("control", "sepsis", "cancer"), 12, replace = TRUE),
age = sample(30:90, 12, replace = TRUE),
sex = sample(c("F", "M"), 12, replace = TRUE)
)
site_table <- tibble(
patient_id = paste0("P", str_pad(sample(1:12, 10), 3, pad = "0")),
hospital = sample(c("SiteA", "SiteB", "SiteC"), 10, replace = TRUE),
country = sample(c("US", "DE", "FR"), 10, replace = TRUE)
)
patient_table
site_table
NA
left_join()
patients_full <- patient_table %>%
left_join(site_table, by = "patient_id")
patients_full
Debug: who did NOT match?
patient_table %>%
anti_join(site_table, by = "patient_id")
airlines_df <- nycflights23::airlines
head(airlines_df)
flights_df <- nycflights23::flights
merged_df <- left_join(flights_df, airlines_df, by = "carrier")
merged_df %>%
dplyr::rename(airlines_nm = name) %>%
relocate(airlines_nm, .after = carrier)
left_join(x, y) All from x
right_join(x, y) All from y
inner_join(x, y) Only matching rows
full_join(x, y) All rows from both tables, filling the
missing with NAs
Reshaping data (Wide vs Long data shape)
set.seed(12345)
metadata <- tibble(
sample_id = paste0("S", str_pad(1:24, 3, pad = "0")),
patient_id = paste0("P", str_pad(sample(1:12, 24, replace = TRUE), 3, pad = "0")),
tissue = sample(c("tumor", "normal"), 24, replace = TRUE),
treatment = sample(c("drugA", "drugB", "placebo"), 24, replace = TRUE),
sex = sample(c("F", "M"), 24, replace = TRUE),
age = sample(30:85, 24, replace = TRUE),
batch = sample(c("batch1", "batch2"), 24, replace = TRUE)
)
metadata
genes <- paste0("Gene",
str_pad(1:50, 4,
pad = "0"))
counts <- matrix(
rnbinom(50 * 24, mu = 80, size = 1),
nrow = 50,
ncol = 24,
dimnames = list(genes, metadata$sample_id)
)
counts_df <- as_tibble(counts, rownames = "gene_id")
counts_df
Problem: This is “wide”. Many analyses need “long”.
1. pivot_longer(): wide -> long
counts_long <- counts_df %>%
pivot_longer(
cols = starts_with("S"),
names_to = "sample_id",
values_to = "count"
)
counts_long
Now each row is: gene × sample.
Now we join counts with metadata
head(metadata)
counts_annot <- counts_long %>%
left_join(metadata, by = "sample_id")
counts_annot %>% glimpse()
Here we summarise expression by group
Example: mean counts by tissue per gene
gene_summary <- counts_annot %>%
group_by(gene_id, tissue) %>%
summarise(mean_count = mean(count), .groups = "drop")
gene_summary
2. pivot_wider(): long -> wide
Create a gene × tissue table (2 columns: tumor/normal)
gene_wide <- gene_summary %>%
pivot_wider(
names_from = tissue,
values_from = mean_count
)
gene_wide
3. Binding rows and columns
visit_a <- tibble(
id = c("P01","P02","P03"),
visit = "baseline",
crp = c(2.1, 5.4, 1.9)
)
visit_b <- tibble(
id = c("P04","P05"),
visit = "week4",
crp = c(3.3, 2.8),
wbc = c(6.1, 5.7) # extra column not in visit_a
)
3.1 rbind vs bind_rows (stacking rows)
rbind(visit_a, visit_b)
bind_rows(visit_a, visit_b)
here bind_rows does not give an error, making sure rows
are stacked, and missing columns are created and filled with NA (wbc is
NA for baseline rows)
3.2 bind_cols() vs cbind() (gluing columns)
patients <- tibble(
id = c("P01","P02","P03"),
sex = c("F","M","F")
)
labs <- tibble(
crp = c(2.1, 5.4, 1.9),
wbc = c(5.8, 6.2, 5.1)
)
cbind(patients, labs)
bind_cols(patients, labs)
labs_short <- tibble(crp = c(2.1, 5.4))
cbind(patients, labs_short)
bind_cols(patients, labs_short)
Recap
1. Column-wise functions
(modify or choose columns):
select()
rename()
relocate()
Identify columns using tidyselect helpers:
- starts_with(), ends_with(), contains(), matches(), where()
2. Row-wise functions (filter, slice, row
operations)
Keep or remove rows
filter()
slice(), slice_head(), slice_tail(), slice_min(),
slice_max()
3. Group-wise functions
(operate within groups)
Grouping changes how other functions behave:
group_by(), ungroup()
4. Data-frame–wise functions
Reshape data:
pivot_longer() and pivot_wider()
5. Summary functions
(usually inside summarise())
mean(), median(), sum(), n(), n_distinct(), sd(), var()
Good-to-know symbols names
() parentheses
[] square brackets
{} curly braces
* asterisk
& ampersand
^ hat
/ forward slash
\ backslash
’’ single quotes
“” double quotes
- hyphen (dash)
_ underscore
~ tilda
` backtick
! exclamation mark
. period
, comma
: colon
; semi-colon
$ dollar sign
# sharp
% percentage
---
title: "R + tidyr (Data Wrangling)"
date: "`r format(Sys.time(),  '%d %B %Y')`"
author: "Mohamed Hassan"
output:
  html_notebook:
    theme: cerulean
    toc: true
    toc_depth: 3
editor_options:
  markdown:
    wrap: 72
---

# Workshop goals

By the end of this workshop you should be able to:

- understand **tidy data** and why it matters for data analysis
- clean and wrangle data using **dplyr** (filter, select, mutate, summarise)
- reshape data using **tidyr** (`pivot_longer()`, `pivot_wider()`, `separate()`)
- combine datasets using **joins** (`right_join()`, `left_join()`, `inner_join()`, `anti_join()`)
- handle real messy cases: missing values, wrong column types, messy IDs, duplicate keys


# Getting started

## Installing R and RStudio

- [R:](https://cran.rstudio.com/) https://cran.rstudio.com/
- [RStudio:](https://www.rstudio.com/products/rstudio/download/) https://www.rstudio.com/products/rstudio/download/


Working with R is primarily text-based. The basic mode of use for R is that the user types in a command in the R language and presses enter, and then R computes and displays the result.

![](r_gui.jpg)

### RStudio

We will be working in [RStudio](https://www.rstudio.com/products/rstudio/download/). This surrounds the *console*, where one enters commands and views the results, with various conveniences. In addition to the console, RStudio provides panels containing:

* A *text editor*, where R commands can be recorded for future reference.
* A history of commands that have been typed on the console.
* An "environment" panel with a list of *variables*, which contain values that R has been told to save from previous commands.
* A file manager.
* Help on the functions available in R (Using a question mark before the function)
* A panel to show plots (graphs).


![](Rstudio_layout.png)





## R script vs RMD (Markdowns and Notebooks)

* To start an R script click `ctrl + shift + N` or `cmd + shift + N` on mac
You can run lines of code, but the output won't be saved    

* To start an RMD click `ctrl + B/N` or `cmd + B/N` on mac to initialize a notebook  
Allows you to run and visualize output in the file, easy to share with collaborators showing both the code and all results/plots  



This is called the main chunk, where is has all the settings that you would like to apply along the notebook
```{r setup}
knitr::opts_chunk$set(echo = T, results = "show")
options(width=80)
require("knitr")

## To change the directory in a notebook/markdown
# opts_knit$set(root.dir = "path/to/directory/")
```


To insert a chunk, use `ctrl + alt + I`, or `command + option + I`

# R Basics

## The console
Open RStudio, click on the "Console" panel, type `3+2` and press enter. R displays the result of the calculation. In this document, we will be showing such an interaction with R as below.
```{r}
3 + 2
```

`+` is called an operator. R has other operators for basic mathematical calculations:  
`-` for subtraction  
`*` for multiplication  
`/` forward slash for division  
`^` hat for "power"

`*` has higher precedence than `+`.  We can use brackets  `( )`, if necessary. Try `1+2*3` and `(1+2)*3`

```{r}
1+2*3
```

```{r}
(1+2)*3
```


Making lines of code according to styler, highlight the lines of code, click `ctrl + shift + A` or `cmd + shift + A`


## Comparisons

We use `>`, `<`, `>=`, `==`, `<=`, `!=` for comparisons. Note that for comparisons its always a double sign to be used. This returns a "logical" value of `TRUE` or `FALSE`.

```{r}
3 * 3 == 9
```

This literally means Is the left hand side equal to the right hand side? Is 3 times 3 equal to 9?  


## Variables

A variable is a name that stores a value. We create a variable by assigining a value to it using `<-` which points to the left assigning the value of the right to the one on the left

```{r}
weight_kg = 65
weight_kg <- 65
```


To print the value stored in a variable, print the variable
```{r}
weight_kg
```

Names of the variables should not contain special characters or spaces, should not start with a number. Dots are ok, unlike many other programming languages


To list all variables in an environment 
```{r}
ls()
## to remove a variable/object
rm(weight_kg)

## to remove some variable of a certain pattern (starting/ending/including ....)
rm(list = ls(pattern = ""))
```




Any line of code inside a chunk or in the console that has a `#` before, is a comment. Useful to document your code so its easy to remember why we used this command

To add a `#` at the beginning of the line for multiple lines, highlight all lines, and click `ctrl + shift + c` or `cmd + shift + c`

We can do arithemtic on the variable
```{r}
weight_kg <- 65
2.2 * weight_kg
```


## Vectors
A vector is a collection of elements (numbers, words,.. etc)
To be able to combine elements into a vector, we use the function c() which stands for combine or concatenate

```{r}
c(1,2,3,4)
```


```{r}
vec <- c(10,20,30,40)
vec + 1 # Here the arithmetic + 1 is applied on every element of the vector
```


We can also combine vectors
```{r}
vec + vec 
# In this case, the 1st element of vec is added to the 1st element of vec, 2nd with 2nd.. and so on
```

To check the length of the vector we use the function length()
```{r}
length(vec)
```

## Data types in R

Most common data types in R are mainly:  
* Numeric (could be double which is the default, or integer)  
* Character (String)  
* Logical (TRUE/FALSE)  

To check the type of a vector, we use typeof() function
```{r}
x <- 5
typeof(x)

## To create an integer add L to the number
x <- 5L
typeof(x)
print(x)
class(x)
```

For Characters, we use single or double quotes 

```{r}
"Hello World"
plain_txt = 'This is a string'
print(plain_txt)
typeof(plain_txt)
class(plain_txt)
```

To be able to look for and modify a pattern, the vector should be a string!

There are also categorical vectors in which elements can be one of several "categories/levels"
```{r}
factor(c("mutant", "wildtype", "mutant"), 
       levels = c("wildtype", "mutant"))
```

Factor is important for arrangement of data in a plot for example.



## Accessing elements of vectors (Indexing)
To access elements, we use [] with a number (index)
```{r}
vec[1]
vec[-1]
vec[c(1,3)]
vec[1:4]
vec[1] <- 6
```

We can use a vector instead of numbers to index a vector 
```{r}
myindex <- c(4,2,1,3)
vec[myindex]

## which is the same as
vec[c(4,2,1,3)]
```

## Matrix
2-D tabular data structure in which all elements are of the same type (typically numeric matrices, but could also be character or logical)
```{r}
# To create a matrix
matrix()
# Example:
mat <- matrix(1:12, nrow = 2, ncol = 6, byrow = F)
mat
typeof(mat)
class(mat)
```

```{r}
# To get the dimensions of a matrix
dim(mat)
## get only how many rows
nrow(mat)
## get only how many columns
ncol(mat)
## get column names
colnames(mat)
## Accessing elements of a matrix
mat[1,3] # row number, # col number]
# this retuns the element at row 1, column 3
```

## Data frame
Similar to a matrix, the difference is that the columns of a dataframe could be of different types (could combine numeric values, strings, logical)
The index of each row could be from 1 to length(dataframe) or could have a unique name (can be accessed/modified using rownames function)

## Functions
a function is followed by parentheses ()
mean(), sd(), print(), rep()
```{r}
# To create a function
name_of_function <- function(){ # inside parentheses what the function takes
  # what the function should do
}

## then call the function
name_of_function()
```


The function takes arguments (options), some are mandatory and some are optional.
Arguments can be supplied in order (positional argument), or by the argument name (better to use).
Clicking `tab` for autocompletion helps us remember what arguments a function could take

```{r}
calc <- function(num1, num2){
  sum = num1 + num2
  print(sum)
}

calc(num1 = 10, 5)
```



## Lists

Lists can contain different kinds of elements
```{r}
mylist <- list(num= 42, greeting="Hello, World")

# to access elements: ## double square bracket to access
mylist[[1]]

mylist$greeting

```

If you are not sure about the type of a variable, run class
```{r}
class(vec)
class(mat)
str(mylist)
```

### Applying a function on a list (very practical)
```{r}
list_nums <- list(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  c = c(5, 5, 5)
)

lapply(list_nums, mean)
```

### Applying an anonymous function (when we want to perform many steps over the list)
```{r}
list_nums_modified <- lapply(list_nums, function(element){
  ## first do this on each element
  element + 4
  ## then do this
  return(element)
})
```


```{r}
## Applying the mean function over a list
lapply(list_nums, function(elem) elem^2) ## returns a list
```


### sapply (simple apply)
for a simplified version
```{r}
sapply(list_nums, function(elem) elem^2) ## returns a simplified form (vector)
```



## Loops

```{r}
for (numb in list_nums){
  print(numb)
}
```

## If conditional
```{r}
list_nums <- c(1,2,3,6,7)
for(numb in list_nums){
  if(numb > 5){
    print(numb)}
    else print("This number is smaller than 5, so will not be printed")
}
```



# Tidy data


Tidy data rules:
1. Each variable = one column
2. Each observation = one row
3. Each type of entity = one table

Why it matters:
- easier plotting
- easier modelling
- fewer bugs
- compatible with tidyverse grammar



## Installing packages

To install a package from [CRAN](https://cran.r-project.org/), we use the function install.packages 
```{r}
install.packages("styler")
install.packages("tidyverse")
```

### From Github

[Github](https://github.com) is a web-based platform for version control, collaboration, and code sharing, built around the Git system. It allows users to track changes to files, manage contributions from multiple people, and publish open-source projects.  

GitHub is widely used for:

- Sharing R packages before they are published on CRAN

- Developing packages collaboratively using pull requests and issues

- Storing analysis projects (scripts, R Markdown, data, documentation)

- Reproducible workflows, since every change is tracked


Installing a package from Github `install_github("author/package")`


![](Github_screenshot_1.png)

```{r}
install.packages("devtools")

## To load a library
library(devtools)
devtools::install_github("githubusername/repository")
install_github("immunogenomics/presto")

## instead of loading a whole package, we can call a function from a package using the format:
package::function()
  
## function select
AnnotationDbi::select()
dplyr::select()
  
## Example
devtools::install_github("immunogenomics/presto")
```


### From Bioconductor
[Bioconductor](https://www.bioconductor.org/)


Bioconductor is an open-source project that provides a large collection of R packages for biological and biomedical data analysis, especially high-throughput sequencing, genomics, transcriptomics, and proteomics.

- Curated, peer-reviewed packages: Every package goes through a structured review for code quality, documentation, and reproducibility.

- Focus on life sciences: Tools for RNA-seq, ChIP-seq, single-cell RNA-seq, epigenomics, variant analysis, annotation, and more.

- Reproducible workflows using standardized data structures (e.g., SummarizedExperiment, SingleCellExperiment).

- Frequent release cycle (twice yearly), synchronized with R versions.

Bioconductor has become a central ecosystem for computational biology because it promotes reproducibility, integrates with modern R workflows, and provides consistent interfaces for complex biological datasets.



![](Bioconductor_screen1.png)

![](Bioconductor_screen2.png)  

```{r}
## First, to install Bioconductor
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.22")

## Installing packages from Bioconductor
BiocManager::install("DESeq2")
```

```{r}
install.packages("tidyverse")
install.packages("janitor")
install.packages("lubridate")
```

## Loading libraries
```{r}
library(tidyverse)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)
library(janitor)
library(lubridate)

# To load multiple libraries at once, we can use pacman package, p_load function
pacman::p_load(tidyverse, styler, dplyr, stringr, janitor, lubridate)

```



## Working with directories
```{r}
## To check the current directory
getwd()
## To change the working directory
setwd("Path/to/directory/")

## Remember to change the directory for the whole notebook, we use the main chunk
```

## Importing data
```{r}
## Importing an R data structure object (.RDS)
myfile <- readRDS(file = "path/to/file.RDS")

## CSV file
myfile <- utils::read.csv()

## More general function
myfile <- utils::read.delim(file = "/path/to/file.extenstion", 
                  header = T, sep = "\t") ## the file has a header with column names, and each observation is separated by a tab
myfile <- readxl::read_xlsx("/path/to/file/file.xlsx")
```


## Checking the content of a file
```{r}
head(myfile) ## shows the first 6 observations, can change n argument
tail(myfile) ## shows the last 6 observations, can change n argument
str(myfile) ## checks the structure of the file as a variable
dim(myfile) ## checks the dimensions of a file (returns number of rows, number of columns)
nrow(myfile) # checks the number of rows
ncol(myfile)# checks the number of columns

## checks the column names
colnames(myfile)
names(myfile)
```

## Data maniuplation

### Adding columns to a dataframe
```{r}
myfile$newcol <- myfile$col1 * myfile$col2
```

### Access columns by name
```{r}
# one column
myfile$columnname
## more than a column
myfile[, c("col1", "col4", "col2")]
myfile[, c(1,4,2)]
```

### Access a column by index
```{r}
myfile[,4] # the row is empty, which means show all rows for column 4
```

### Access rows
```{r}
df[4, ] ## gets the 4th row for all columns
df[4:10] ## from 4th to 10th
```

### Select rows by a condition (exact)
```{r}
myfile_subset <- subset(myfile, given_column == "something") ## subset rows that follow this condition
```


### Select rows by a condition (more than one category)
```{r}
myfile_subset <- subset(myfile, 
                        given_column %in% c("something", "something else")) ## subset rows that follow this condition
```


### Select rows by a condition (exclusion of exact)
```{r}
myfile_subset <- subset(myfile, given_column != "something") ## subset rows that follow this condition
```

### Select rows by a condition (exclusion of more than one category)

```{r}
myfile_subset <- subset(myfile, given_column != "something" & 
         given_column != "something else" &
         given_column != "another thing")
```

Using the `!` to negate 
```{r}
myfile_subset <- subset(myfile,
                        !(given_column %in% c("something", "something else")))
```



There is no equivalent operator for `%in%`, but you can create it!  

```{r}
"%out%" <- Negate("%in%")
```


```{r}
myfile_subset <- subset(myfile, 
                        given_column %out% c("something", "something else")) 
## subset rows that do not follow this condition
```


```{r}
# clean column names
# df <- janitor::clean_names(df)

# install.packages("tidytuesdayR")
# tuesdata <- tidytuesdayR::tt_load('2023-10-24')
# df <- tuesdata$patient_risk_profiles

df <- readxl::read_xlsx("tuesdata_df.xlsx")
head(df)
```

```{r}
df$`predicted risk of Sudden Vision Loss, with no eye pathology causes`
```


```{r}
df <- janitor::clean_names(df)
head(df)
```



# Using tidyverse
Used for data wrangling and maniuplation

When we load tidyverse, by default, dplyr, readr, stringr, tibble, tidyr, purrr, and ggplot2 are loaded
```{r}
library(tidyverse)
```

We will work with [care state](https://github.com/rfordatascience/tidytuesday/tree/main/data/2025/2025-04-08) dataset from tidy tuesday
```{r}
care_state <- readr::read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-08/care_state.csv"
)

care_state %>% glimpse()

```


- *state* >> The two-letter code for the state (or territory, etc) where the hospital is located.  
- *condition* >>	The condition for which the patient was admitted. Six categories of conditions are included in the data.  
- *measure_id* >>	The ID of the thing being measured. Note that there are 22 unique IDs but only 21 unique names.  
- *measure_name* >>	The name of the thing being measured. Note that there are 22 unique IDs but only 21 unique names.  
- *score* >>	The score of the measure.  
- *footnote* 	>> 	Footnotes that apply to this measure: 
  -- 5 = "Results are not available for this reporting period.", 
  -- 25 = "State and national averages include Veterans Health Administration (VHA) hospital data.", 
  -- 26 = "State and national averages include Department of Defense (DoD) hospital data.".  
- *start_date* >>	date 	The date on which measurement began for this measure.  
- *end_date* >>	date 	The date on which measurement ended for this measure.  

Q1: How many columns are there in the dataset? What columns exist? 
Q2: How many observations(rows)? 
Q3: How many different states exist in the dataset ?
Q4: What categories are there in the condition column?
Q5: What is being measured in the "measure_name" column?
Q6: How many observations with footnote 5, 25, and 26?
Q7: Which years are there in the start and end date columns?



```{r}
starwars <- dplyr::starwars
```

```{r}
head(starwars)
```


```{r}
typeof(starwars)
class(starwars)
```

```{r}
## Get the name of all columns
colnames(starwars)
# Or 
names(starwars)
```
# Functions on rows
## 1. Filtering

`filter(dataframe, condition)` like `filter(dataframe, column == "something")`
`filter(dataframe, condition1, condition2)` like `filter(dataframe, column == "something", column == "something")`


**Why it matters in healthcare/bioinformatics:** you almost always filter:
- QC failures
- non-target tissue / condition
- outliers
- incomplete cases


```{r}
human_starwars <- filter(starwars, 
                         species == "Human")
head(human_starwars)
```


```{r}
filter(starwars,
       species == "Human", height > 180)
```

## Combining a lot of functions
```{r}
df <- nycflights23::flights
mean(pull(filter(filter(df, origin == "JFK"), carrier == "AA", !is.na(arr_delay)), arr_delay))
```

Using a pipe ` %>% `. Shortcut is `ctrl + shift + M` or `command + shift + M`
```{r}
df %>%
  filter(origin == "JFK") %>%
  filter(carrier == "AA") %>%
  filter(!is.na(arr_delay)) %>%
  pull(arr_delay) %>%
  mean()
```

Even nicer
```{r}
df %>%
  filter(origin == "JFK",
         carrier == "AA",
         !is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay)) %>% 
  pull(avg_delay)
```

*NOTE*  
The function `pull` is a function that turns a column into a vector
For example if you have a column called "genes", using pull turns the column into a vector with all these genes


## Common patterns
```{r}
starwars %>% filter(is.na(mass))                # missing values
starwars %>% filter(!is.na(height))             # non-missing
starwars %>% filter(between(height, 150, 200))  # within a range
```


## 2. arrange() — sort rows

```{r}
starwars %>% 
  arrange(desc(height))
```



## 3. distinct() — unique rows

```{r}
starwars %>% 
  distinct(species, homeworld)
```

## 4. slice_max, slice_min, slice_head, and slice_tail

```{r}
DEGs <- readxl::read_xlsx("DEGs.xlsx")

DEGs %>% 
  mutate(pct_diff = pct.1 - pct.2) %>% 
  relocate(pct_diff, .after = pct.2) %>% 
  filter(p_val_adj < 0.05) %>% 
  group_by(cluster) %>% 
  slice_max(n = 10, order_by = avg_log2FC)
```


`slice_max` for each group, get the *top* n observations
`slice_min` for each group, get the *bottom* n observations 
`slice_head`  for each group, get the *first* n observations
`slice_tail`  for each group, get the *last* n observations

If not grouped 

`slice_max()` → get the top n rows overall based on a variable.
`slice_min()` → get the bottom n rows overall based on a variable.
`slice_head()` → get the first n rows in the dataset.
`slice_tail()` → get the last n rows in the dataset.


################################################################################
################################################################################

# Functions on columns

## 1. select(), rename(), relocate()

```{r}
starwars %>%
  select(name, height, species, mass) %>%
  rename("weight_kg" = "mass") %>%
  relocate(weight_kg, .after = height)
```

`rename(dataframe, newcolname = oldname)` renames an existing column name to another
Some R versions requires "" for column renaming


`relocate` function allows the new column is added in a specific position, default (last column)


## 2. Removing a column

The function `select` can be used to positively or negatively select columns (in any order)


```{r}
df %>% 
  select(day, month, dep_time)
```


We can use `select(df, -col1)` for one column or multiple columns like `select(df, -col1, -col2, -col3)` or even better use (any_of)

```{r}
cols_to_drop <- c("year", "carrier")

df %>% 
  select(-any_of(cols_to_drop))
```


```{r}
df %>% 
  select(starts_with("dep"))

df %>% 
  select(ends_with("time"))

df %>% 
  select(contains("del"))

df %>% 
  select(matches("(delay|time)$"))

```

```{r}
df %>% 
  select(where(is.numeric))

df %>% 
  select(where(is.character))
```


## 3. Turning a column into multiple
```{r}
df_time_hour <- df %>% select(time_hour)
head(df_time_hour)
```


The function `separate(dataframe, col = "column_to_split", into = c("piece1", "piece2"), sep = "separator"))`

```{r}
df_time_hour <-  df_time_hour %>% separate(sep = " ", 
                col = time_hour, 
                into = c("yearmonthday", "time"), 
                remove = T)
head(df_time_hour)
```

```{r}
df_time_hour <- df_time_hour %>% 
  separate(sep = "-", 
                col = yearmonthday, 
                into = c(NA, "month", "day"), 
                remove = T)
head(df_time_hour)
```
The same function can be applied on the time column, into hour, minute, seconds

```{r}
df_time_hour <- df_time_hour %>% 
  separate(sep = ":", 
                col = time, 
                into = c("hr", "mn", "sec"), 
                remove = T)
head(df_time_hour)
```
## 4. uniting columns with unite()

```{r}
df_time_hour %>%
  unite("newcol", month, day, hr, mn, sec, sep = "-", remove = FALSE)
```

## 5. mutate() — create new variables

Adding a column follows this structure `mutate(df, "newcolumn" = df$existingcolumn)` or  `mutate(df, newcolumn = df$existingcolumn)`

```{r}
starwars %>%
  mutate(height_m = height / 100, 
         bmi = mass / (height_m^2)) %>%
  select(name, height, mass, height_m, bmi)
```

```{r}
new_df <- nycflights23::flights %>% 
  mutate("dag" = day)

new_df <- new_df %>% 
  mutate("Day_Month_Year" = str_c(new_df$day, 
                                  new_df$month, 
                                  new_df$year, sep = "_")) %>%
  relocate(Day_Month_Year, .before = year)
head(new_df)
```

### case_when() — categorical variables

`mutate` with `case_when` can be used to add a column based on a condition of another column that follows a pattern.

mutate(NewCol = case_when(grepl("pattern", ExistingColumn) ~ 
"AddedText", 
TRUE ~ "AddedText2"))  

which means Assign "AddedText" if the pattern exists in the ExistingColumn, if not, assign "AddedText2"

```{r}
starwars %>%
  mutate(height_group = case_when(
    height < 150 ~ "short",
    height < 190 ~ "medium",
    TRUE ~ "tall")
    ) %>%
  count(height_group)
```

## 6.1 Grouping

`group_by` is used to group the dataframe by a certain column, where another function could be applied on each group. For instance, here we calculate the average arrival delay for each carrier

```{r}
df %>%
  group_by(carrier) %>%
  summarise(
    avg_arr_delay = mean(arr_delay, na.rm = TRUE),
    n_flights = n()
  )
```


## 6.2 group_by() + summarise()

```{r}
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mean_height = mean(height, na.rm = TRUE)
  ) %>%
  arrange(desc(n))
```

## 7. across() — multi-column operations

```{r}
starwars %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```


# Merging/Joining dataframes

```{r}
set.seed(12345)
patient_table <- tibble(
  patient_id = paste0("P", str_pad(1:12, 3, pad = "0")),
  diagnosis = sample(c("control", "sepsis", "cancer"), 12, replace = TRUE),
  age = sample(30:90, 12, replace = TRUE),
  sex = sample(c("F", "M"), 12, replace = TRUE)
)

site_table <- tibble(
  patient_id = paste0("P", str_pad(sample(1:12, 10), 3, pad = "0")),
  hospital = sample(c("SiteA", "SiteB", "SiteC"), 10, replace = TRUE),
  country = sample(c("US", "DE", "FR"), 10, replace = TRUE)
)

patient_table
site_table

```

## left_join()

```{r}
patients_full <- patient_table %>%
  left_join(site_table, by = "patient_id")

patients_full
```

## Debug: who did NOT match?

```{r}
patient_table %>%
  anti_join(site_table, by = "patient_id")
```

```{r}
airlines_df <- nycflights23::airlines
head(airlines_df)
```


```{r}
flights_df <- nycflights23::flights
merged_df <- left_join(flights_df, airlines_df, by = "carrier") 

merged_df %>%  
  dplyr::rename(airlines_nm = name) %>% 
  relocate(airlines_nm, .after = carrier)
```

`left_join(x, y)`	All from x  
`right_join(x, y)`	All from y  
`inner_join(x, y)`	Only matching rows  
`full_join(x, y)`	All rows from both tables, filling the missing with NAs  




# Reshaping data (Wide vs Long data shape)
```{r}
set.seed(12345)

metadata <- tibble(
  sample_id = paste0("S", str_pad(1:24, 3, pad = "0")),
  patient_id = paste0("P", str_pad(sample(1:12, 24, replace = TRUE), 3, pad = "0")),
  tissue = sample(c("tumor", "normal"), 24, replace = TRUE),
  treatment = sample(c("drugA", "drugB", "placebo"), 24, replace = TRUE),
  sex = sample(c("F", "M"), 24, replace = TRUE),
  age = sample(30:85, 24, replace = TRUE),
  batch = sample(c("batch1", "batch2"), 24, replace = TRUE)
)

metadata

genes <- paste0("Gene", 
                str_pad(1:50, 4, 
                        pad = "0"))

counts <- matrix(
  rnbinom(50 * 24, mu = 80, size = 1),
  nrow = 50,
  ncol = 24,
  dimnames = list(genes, metadata$sample_id)
)

counts_df <- as_tibble(counts, rownames = "gene_id")
counts_df
```

Problem: This is "wide". Many analyses need "long".


## 1. pivot_longer(): wide -> long

```{r}
counts_long <- counts_df %>%
  pivot_longer(
    cols = starts_with("S"),
    names_to = "sample_id",
    values_to = "count"
  )

counts_long
```

Now each row is: gene × sample.


Now we join counts with metadata

```{r}
head(metadata)
```


```{r}
counts_annot <- counts_long %>%
  left_join(metadata, by = "sample_id")

counts_annot %>% glimpse()
```



Here we summarise expression by group

Example: mean counts by tissue per gene

```{r}
gene_summary <- counts_annot %>%
  group_by(gene_id, tissue) %>%
  summarise(mean_count = mean(count), .groups = "drop")

gene_summary
```

## 2. pivot_wider(): long -> wide

Create a gene × tissue table (2 columns: tumor/normal)

```{r}
gene_wide <- gene_summary %>%
  pivot_wider(
    names_from = tissue,
    values_from = mean_count
  )

gene_wide
```


## 3. Binding rows and columns

```{r}
visit_a <- tibble(
  id = c("P01","P02","P03"),
  visit = "baseline",
  crp = c(2.1, 5.4, 1.9)
)

visit_b <- tibble(
  id = c("P04","P05"),
  visit = "week4",
  crp = c(3.3, 2.8),
  wbc = c(6.1, 5.7)   # extra column not in visit_a
)

```


### 3.1 rbind vs bind_rows (stacking rows)
```{r}
rbind(visit_a, visit_b)
```


```{r}
bind_rows(visit_a, visit_b)
```
here `bind_rows` does not give an error, making sure rows are stacked, and missing columns are created and filled with NA (wbc is NA for baseline rows)


### 3.2 bind_cols() vs cbind() (gluing columns)

```{r}
patients <- tibble(
  id = c("P01","P02","P03"),
  sex = c("F","M","F")
)

labs <- tibble(
  crp = c(2.1, 5.4, 1.9),
  wbc = c(5.8, 6.2, 5.1)
)

```

```{r}
cbind(patients, labs)
```

```{r}
bind_cols(patients, labs)
```

```{r}
labs_short <- tibble(crp = c(2.1, 5.4))

cbind(patients, labs_short)
```

```{r}
bind_cols(patients, labs_short)
```

# Recap 

## 1. **Column-wise** functions 
### (modify or choose columns): 

- select()  

- rename()  

- relocate()  

### Create or modify columns  

- mutate()  

### Identify columns using tidyselect helpers:  
- starts_with(), ends_with(), contains(), matches(), where()


## 2. **Row-wise** functions (filter, slice, row operations)
### Keep or remove rows

- filter()

- slice(), slice_head(), slice_tail(), slice_min(), slice_max()

### Order rows

- arrange()


## 3. **Group-wise** functions 
### (operate within groups)

### Grouping changes how other functions behave:

group_by(), ungroup()

## 4. **Data-frame–wise** functions 
### (operate on whole table)

- bind_rows()

- bind_cols()

- left_join(), inner_join(), full_join(), anti_join()


### Reshape data:

pivot_longer() and pivot_wider()


## 5. **Summary** functions 
### (usually inside summarise())

mean(), median(), sum(), n(), n_distinct(), sd(), var()



# Good-to-know symbols names
()  parentheses  
[]  square brackets  
{}  curly braces  
*  asterisk  
&  ampersand  
^  hat  
/  forward slash  
\\  backslash  
''  single quotes  
""  double quotes  
-  hyphen (dash)  
_  underscore  
~  tilda  
`  backtick  
!  exclamation mark  
.  period  
,  comma  
:  colon  
;  semi-colon  
$  dollar sign  
#  sharp  
%  percentage  
  
