Workshop goals

By the end of this workshop you should be able to:

Getting started

Installing R and RStudio

Working with R is primarily text-based. The basic mode of use for R is that the user types in a command in the R language and presses enter, and then R computes and displays the result.

RStudio

We will be working in RStudio. This surrounds the console, where one enters commands and views the results, with various conveniences. In addition to the console, RStudio provides panels containing:

  • A text editor, where R commands can be recorded for future reference.
  • A history of commands that have been typed on the console.
  • An “environment” panel with a list of variables, which contain values that R has been told to save from previous commands.
  • A file manager.
  • Help on the functions available in R (Using a question mark before the function)
  • A panel to show plots (graphs).

R script vs RMD (Markdowns and Notebooks)

  • To start an R script click ctrl + shift + N or cmd + shift + N on mac You can run lines of code, but the output won’t be saved

  • To start an RMD click ctrl + B/N or cmd + B/N on mac to initialize a notebook
    Allows you to run and visualize output in the file, easy to share with collaborators showing both the code and all results/plots

This is called the main chunk, where is has all the settings that you would like to apply along the notebook

knitr::opts_chunk$set(echo = T, results = "show")
options(width=80)
require("knitr")
Loading required package: knitr
## To change the directory in a notebook/markdown
# opts_knit$set(root.dir = "path/to/directory/")

To insert a chunk, use ctrl + alt + I, or command + option + I

R Basics

The console

Open RStudio, click on the “Console” panel, type 3+2 and press enter. R displays the result of the calculation. In this document, we will be showing such an interaction with R as below.

3 + 2
[1] 5

+ is called an operator. R has other operators for basic mathematical calculations:
- for subtraction
* for multiplication
/ forward slash for division
^ hat for “power”

* has higher precedence than +. We can use brackets ( ), if necessary. Try 1+2*3 and (1+2)*3

1+2*3
[1] 7
(1+2)*3
[1] 9

Making lines of code according to styler, highlight the lines of code, click ctrl + shift + A or cmd + shift + A

Comparisons

We use >, <, >=, ==, <=, != for comparisons. Note that for comparisons its always a double sign to be used. This returns a “logical” value of TRUE or FALSE.

3 * 3 == 9
[1] TRUE

This literally means Is the left hand side equal to the right hand side? Is 3 times 3 equal to 9?

Variables

A variable is a name that stores a value. We create a variable by assigining a value to it using <- which points to the left assigning the value of the right to the one on the left

weight_kg = 65
weight_kg <- 65

To print the value stored in a variable, print the variable

weight_kg
[1] 65

Names of the variables should not contain special characters or spaces, should not start with a number. Dots are ok, unlike many other programming languages

To list all variables in an environment

ls()
[1] "weight_kg"
## to remove a variable/object
rm(weight_kg)

## to remove some variable of a certain pattern (starting/ending/including ....)
rm(list = ls(pattern = ""))

Any line of code inside a chunk or in the console that has a # before, is a comment. Useful to document your code so its easy to remember why we used this command

To add a # at the beginning of the line for multiple lines, highlight all lines, and click ctrl + shift + c or cmd + shift + c

We can do arithemtic on the variable

weight_kg <- 65
2.2 * weight_kg
[1] 143

Vectors

A vector is a collection of elements (numbers, words,.. etc) To be able to combine elements into a vector, we use the function c() which stands for combine or concatenate

c(1,2,3,4)
[1] 1 2 3 4
vec <- c(10,20,30,40)
vec + 1 # Here the arithmetic + 1 is applied on every element of the vector
[1] 11 21 31 41

We can also combine vectors

vec + vec 
[1] 20 40 60 80
# In this case, the 1st element of vec is added to the 1st element of vec, 2nd with 2nd.. and so on

To check the length of the vector we use the function length()

length(vec)
[1] 4

Data types in R

Most common data types in R are mainly:
* Numeric (could be double which is the default, or integer)
* Character (String)
* Logical (TRUE/FALSE)

To check the type of a vector, we use typeof() function

x <- 5
typeof(x)
[1] "double"
## To create an integer add L to the number
x <- 5L
typeof(x)
[1] "integer"
print(x)
[1] 5
class(x)
[1] "integer"

For Characters, we use single or double quotes

"Hello World"
[1] "Hello World"
plain_txt = 'This is a string'
print(plain_txt)
[1] "This is a string"
typeof(plain_txt)
[1] "character"
class(plain_txt)
[1] "character"

To be able to look for and modify a pattern, the vector should be a string!

There are also categorical vectors in which elements can be one of several “categories/levels”

factor(c("mutant", "wildtype", "mutant"), 
       levels = c("wildtype", "mutant"))
[1] mutant   wildtype mutant  
Levels: wildtype mutant

Factor is important for arrangement of data in a plot for example.

Accessing elements of vectors (Indexing)

To access elements, we use [] with a number (index)

vec[1]
[1] 10
vec[-1]
[1] 20 30 40
vec[c(1,3)]
[1] 10 30
vec[1:4]
[1] 10 20 30 40
vec[1] <- 6

We can use a vector instead of numbers to index a vector

myindex <- c(4,2,1,3)
vec[myindex]
[1] 40 20  6 30
## which is the same as
vec[c(4,2,1,3)]
[1] 40 20  6 30

Matrix

2-D tabular data structure in which all elements are of the same type (typically numeric matrices, but could also be character or logical)

# To create a matrix
matrix()
     [,1]
[1,]   NA
# Example:
mat <- matrix(1:12, nrow = 2, ncol = 6, byrow = F)
mat
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    3    5    7    9   11
[2,]    2    4    6    8   10   12
typeof(mat)
[1] "integer"
class(mat)
[1] "matrix" "array" 
# To get the dimensions of a matrix
dim(mat)
[1] 2 6
## get only how many rows
nrow(mat)
[1] 2
## get only how many columns
ncol(mat)
[1] 6
## get column names
colnames(mat)
NULL
## Accessing elements of a matrix
mat[1,3] # row number, # col number]
[1] 5
# this retuns the element at row 1, column 3

Data frame

Similar to a matrix, the difference is that the columns of a dataframe could be of different types (could combine numeric values, strings, logical) The index of each row could be from 1 to length(dataframe) or could have a unique name (can be accessed/modified using rownames function)

Functions

a function is followed by parentheses () mean(), sd(), print(), rep()

# To create a function
name_of_function <- function(){ # inside parentheses what the function takes
  # what the function should do
}

## then call the function
name_of_function()

The function takes arguments (options), some are mandatory and some are optional. Arguments can be supplied in order (positional argument), or by the argument name (better to use). Clicking tab for autocompletion helps us remember what arguments a function could take

calc <- function(num1, num2){
  sum = num1 + num2
  print(sum)
}

calc(num1 = 10, 5)
[1] 15

Lists

Lists can contain different kinds of elements

mylist <- list(num= 42, greeting="Hello, World")

# to access elements: ## double square bracket to access
mylist[[1]]
[1] 42
mylist$greeting
[1] "Hello, World"

If you are not sure about the type of a variable, run class

class(vec)
[1] "numeric"
class(mat)
[1] "matrix" "array" 
str(mylist)
List of 2
 $ num     : num 42
 $ greeting: chr "Hello, World"

Applying a function on a list (very practical)

list_nums <- list(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  c = c(5, 5, 5)
)

lapply(list_nums, mean)
$a
[1] 2

$b
[1] 20

$c
[1] 5

Applying an anonymous function (when we want to perform many steps over the list)

list_nums_modified <- lapply(list_nums, function(element){
  ## first do this on each element
  element + 4
  ## then do this
  return(element)
})
## Applying the mean function over a list
lapply(list_nums, function(elem) elem^2) ## returns a list
$a
[1] 1 4 9

$b
[1] 100 400 900

$c
[1] 25 25 25

sapply (simple apply)

for a simplified version

sapply(list_nums, function(elem) elem^2) ## returns a simplified form (vector)
     a   b  c
[1,] 1 100 25
[2,] 4 400 25
[3,] 9 900 25

Loops

for (numb in list_nums){
  print(numb)
}
[1] 1 2 3
[1] 10 20 30
[1] 5 5 5

If conditional

list_nums <- c(1,2,3,6,7)
for(numb in list_nums){
  if(numb > 5){
    print(numb)}
    else print("This number is smaller than 5, so will not be printed")
}
[1] "This number is smaller than 5, so will not be printed"
[1] "This number is smaller than 5, so will not be printed"
[1] "This number is smaller than 5, so will not be printed"
[1] 6
[1] 7

Tidy data

Tidy data rules: 1. Each variable = one column 2. Each observation = one row 3. Each type of entity = one table

Why it matters: - easier plotting - easier modelling - fewer bugs - compatible with tidyverse grammar

Installing packages

To install a package from CRAN, we use the function install.packages

install.packages("styler")
install.packages("tidyverse")

From Github

Github is a web-based platform for version control, collaboration, and code sharing, built around the Git system. It allows users to track changes to files, manage contributions from multiple people, and publish open-source projects.

GitHub is widely used for:

  • Sharing R packages before they are published on CRAN

  • Developing packages collaboratively using pull requests and issues

  • Storing analysis projects (scripts, R Markdown, data, documentation)

  • Reproducible workflows, since every change is tracked

Installing a package from Github install_github("author/package")

install.packages("devtools")

## To load a library
library(devtools)
devtools::install_github("githubusername/repository")
install_github("immunogenomics/presto")

## instead of loading a whole package, we can call a function from a package using the format:
package::function()
  
## function select
AnnotationDbi::select()
dplyr::select()
  
## Example
devtools::install_github("immunogenomics/presto")

From Bioconductor

Bioconductor

Bioconductor is an open-source project that provides a large collection of R packages for biological and biomedical data analysis, especially high-throughput sequencing, genomics, transcriptomics, and proteomics.

  • Curated, peer-reviewed packages: Every package goes through a structured review for code quality, documentation, and reproducibility.

  • Focus on life sciences: Tools for RNA-seq, ChIP-seq, single-cell RNA-seq, epigenomics, variant analysis, annotation, and more.

  • Reproducible workflows using standardized data structures (e.g., SummarizedExperiment, SingleCellExperiment).

  • Frequent release cycle (twice yearly), synchronized with R versions.

Bioconductor has become a central ecosystem for computational biology because it promotes reproducibility, integrates with modern R workflows, and provides consistent interfaces for complex biological datasets.

## First, to install Bioconductor
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.22")

## Installing packages from Bioconductor
BiocManager::install("DESeq2")
install.packages("tidyverse")
install.packages("janitor")
install.packages("lubridate")

Loading libraries

library(tidyverse)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)
library(janitor)
library(lubridate)

# To load multiple libraries at once, we can use pacman package, p_load function
pacman::p_load(tidyverse, styler, dplyr, stringr, janitor, lubridate)

Working with directories

## To check the current directory
getwd()
## To change the working directory
setwd("Path/to/directory/")

## Remember to change the directory for the whole notebook, we use the main chunk

Importing data

## Importing an R data structure object (.RDS)
myfile <- readRDS(file = "path/to/file.RDS")

## CSV file
myfile <- utils::read.csv()

## More general function
myfile <- utils::read.delim(file = "/path/to/file.extenstion", 
                  header = T, sep = "\t") ## the file has a header with column names, and each observation is separated by a tab
myfile <- readxl::read_xlsx("/path/to/file/file.xlsx")

Checking the content of a file

head(myfile) ## shows the first 6 observations, can change n argument
tail(myfile) ## shows the last 6 observations, can change n argument
str(myfile) ## checks the structure of the file as a variable
dim(myfile) ## checks the dimensions of a file (returns number of rows, number of columns)
nrow(myfile) # checks the number of rows
ncol(myfile)# checks the number of columns

## checks the column names
colnames(myfile)
names(myfile)

Data maniuplation

Adding columns to a dataframe

myfile$newcol <- myfile$col1 * myfile$col2

Access columns by name

# one column
myfile$columnname
## more than a column
myfile[, c("col1", "col4", "col2")]
myfile[, c(1,4,2)]

Access a column by index

myfile[,4] # the row is empty, which means show all rows for column 4

Access rows

df[4, ] ## gets the 4th row for all columns
df[4:10] ## from 4th to 10th

Select rows by a condition (exact)

myfile_subset <- subset(myfile, given_column == "something") ## subset rows that follow this condition

Select rows by a condition (more than one category)

myfile_subset <- subset(myfile, 
                        given_column %in% c("something", "something else")) ## subset rows that follow this condition

Select rows by a condition (exclusion of exact)

myfile_subset <- subset(myfile, given_column != "something") ## subset rows that follow this condition

Select rows by a condition (exclusion of more than one category)

myfile_subset <- subset(myfile, given_column != "something" & 
         given_column != "something else" &
         given_column != "another thing")

Using the ! to negate

myfile_subset <- subset(myfile,
                        !(given_column %in% c("something", "something else")))

There is no equivalent operator for %in%, but you can create it!

"%out%" <- Negate("%in%")
myfile_subset <- subset(myfile, 
                        given_column %out% c("something", "something else")) 
## subset rows that do not follow this condition
# clean column names
# df <- janitor::clean_names(df)

# install.packages("tidytuesdayR")
# tuesdata <- tidytuesdayR::tt_load('2023-10-24')
# df <- tuesdata$patient_risk_profiles

df <- readxl::read_xlsx("tuesdata_df.xlsx")
head(df)
df$`predicted risk of Sudden Vision Loss, with no eye pathology causes`
  [1] 1.115083e-04 1.607285e-03 1.458917e-04 1.527737e-04 3.258655e-04 4.667266e-04
  [7] 2.140686e-04 1.600603e-04 2.184795e-04 6.122239e-04 1.036383e-04 6.725240e-04
 [13] 3.901209e-05 6.773092e-05 2.323113e-04 9.834800e-05 4.658432e-05 1.349731e-04
 [19] 3.089568e-04 3.136560e-04 6.938989e-05 1.141138e-04 1.059116e-04 3.656749e-04
 [25] 4.289676e-04 1.280323e-04 6.686286e-05 1.063679e-04 2.121934e-04 7.246477e-04
 [31] 9.408668e-05 4.846494e-04 2.092650e-04 3.848392e-04 2.493011e-04 6.565204e-04
 [37] 2.682293e-04 5.494188e-04 1.903225e-04 7.163179e-05 9.823006e-05 2.013026e-04
 [43] 2.274620e-04 1.126852e-04 9.863359e-05 6.779899e-04 3.971666e-05 4.995190e-05
 [49] 5.199037e-05 8.074773e-05 1.597246e-04 5.455884e-04 1.431606e-04 4.426006e-04
 [55] 1.203483e-04 9.213029e-04 1.043975e-04 1.553771e-04 2.758162e-04 2.382385e-04
 [61] 9.461499e-05 1.672590e-04 2.414275e-04 1.585947e-04 3.395669e-04 2.180608e-05
 [67] 1.913146e-04 2.044669e-04 2.813584e-04 9.473806e-05 7.132445e-05 1.358532e-04
 [73] 1.018202e-04 2.385006e-05 4.251972e-05 6.189295e-04 5.478834e-04 9.701396e-04
 [79] 3.324462e-04 2.345282e-04 4.010593e-04 1.223014e-04 1.369716e-04 3.559392e-05
 [85] 2.381193e-05 2.206086e-04 6.609310e-04 6.652940e-05 1.531101e-04 6.950099e-05
 [91] 6.898172e-05 7.661815e-05 4.003384e-04 7.201001e-04 8.395539e-04 1.982071e-04
 [97] 4.538425e-05 5.966995e-04 4.032301e-04 3.110642e-04
df <- janitor::clean_names(df)
head(df)

Using tidyverse

Used for data wrangling and maniuplation

When we load tidyverse, by default, dplyr, readr, stringr, tibble, tidyr, purrr, and ggplot2 are loaded

library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

We will work with care state dataset from tidy tuesday

care_state <- readr::read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-08/care_state.csv"
)
Rows: 1232 Columns: 8── Column specification ──────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): state, condition, measure_id, measure_name, footnote
dbl  (1): score
date (2): start_date, end_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
care_state %>% glimpse()
Rows: 1,232
Columns: 8
$ state        <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "…
$ condition    <chr> "Healthcare Personnel Vaccination", "Healthcare Personnel Vaccinati…
$ measure_id   <chr> "HCP_COVID_19", "IMM_3", "OP_18b", "OP_18b_HIGH_MIN", "OP_18b_LOW_M…
$ measure_name <chr> "Percentage of healthcare personnel who are up to date with COVID-1…
$ score        <dbl> 7.3, 80.0, 140.0, 157.0, 136.0, 136.0, NA, 196.0, 230.0, 182.0, 200…
$ footnote     <chr> NA, NA, "25, 26", "25, 26", "25, 26", "25, 26", "25, 26", "25", "25…
$ start_date   <date> 2024-01-01, 2023-10-01, 2023-04-01, 2023-04-01, 2023-04-01, 2023-0…
$ end_date     <date> 2024-03-31, 2024-03-31, 2024-03-31, 2024-03-31, 2024-03-31, 2024-0…

Q1: How many columns are there in the dataset? What columns exist? Q2: How many observations(rows)? Q3: How many different states exist in the dataset ? Q4: What categories are there in the condition column? Q5: What is being measured in the “measure_name” column? Q6: How many observations with footnote 5, 25, and 26? Q7: Which years are there in the start and end date columns?

starwars <- dplyr::starwars
head(starwars)
typeof(starwars)
[1] "list"
class(starwars)
[1] "tbl_df"     "tbl"        "data.frame"
## Get the name of all columns
colnames(starwars)
 [1] "name"       "height"     "mass"       "hair_color" "skin_color" "eye_color" 
 [7] "birth_year" "sex"        "gender"     "homeworld"  "species"    "films"     
[13] "vehicles"   "starships" 
# Or 
names(starwars)
 [1] "name"       "height"     "mass"       "hair_color" "skin_color" "eye_color" 
 [7] "birth_year" "sex"        "gender"     "homeworld"  "species"    "films"     
[13] "vehicles"   "starships" 

Functions on rows

1. Filtering

filter(dataframe, condition) like filter(dataframe, column == "something") filter(dataframe, condition1, condition2) like filter(dataframe, column == "something", column == "something")

Why it matters in healthcare/bioinformatics: you almost always filter: - QC failures - non-target tissue / condition - outliers - incomplete cases

human_starwars <- filter(starwars, 
                         species == "Human")
head(human_starwars)
filter(starwars,
       species == "Human", height > 180)

Combining a lot of functions

df <- nycflights23::flights
mean(pull(filter(filter(df, origin == "JFK"), carrier == "AA", !is.na(arr_delay)), arr_delay))
[1] 0.5470231

Using a pipe %>%. Shortcut is ctrl + shift + M or command + shift + M

df %>%
  filter(origin == "JFK") %>%
  filter(carrier == "AA") %>%
  filter(!is.na(arr_delay)) %>%
  pull(arr_delay) %>%
  mean()
[1] 0.5470231

Even nicer

df %>%
  filter(origin == "JFK",
         carrier == "AA",
         !is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay)) %>% 
  pull(avg_delay)
[1] 0.5470231

NOTE
The function pull is a function that turns a column into a vector For example if you have a column called “genes”, using pull turns the column into a vector with all these genes

Common patterns

starwars %>% filter(is.na(mass))                # missing values
starwars %>% filter(!is.na(height))             # non-missing
starwars %>% filter(between(height, 150, 200))  # within a range

2. arrange() — sort rows

starwars %>% 
  arrange(desc(height))

3. distinct() — unique rows

starwars %>% 
  distinct(species, homeworld)

4. slice_max, slice_min, slice_head, and slice_tail

DEGs <- readxl::read_xlsx("DEGs.xlsx")

DEGs %>% 
  mutate(pct_diff = pct.1 - pct.2) %>% 
  relocate(pct_diff, .after = pct.2) %>% 
  filter(p_val_adj < 0.05) %>% 
  group_by(cluster) %>% 
  slice_max(n = 10, order_by = avg_log2FC)

slice_max for each group, get the top n observations slice_min for each group, get the bottom n observations slice_head for each group, get the first n observations slice_tail for each group, get the last n observations

If not grouped

slice_max() → get the top n rows overall based on a variable. slice_min() → get the bottom n rows overall based on a variable. slice_head() → get the first n rows in the dataset. slice_tail() → get the last n rows in the dataset.

Functions on columns

1. select(), rename(), relocate()

starwars %>%
  select(name, height, species, mass) %>%
  rename("weight_kg" = "mass") %>%
  relocate(weight_kg, .after = height)

rename(dataframe, newcolname = oldname) renames an existing column name to another Some R versions requires “” for column renaming

relocate function allows the new column is added in a specific position, default (last column)

2. Removing a column

The function select can be used to positively or negatively select columns (in any order)

df %>% 
  select(day, month, dep_time)

We can use select(df, -col1) for one column or multiple columns like select(df, -col1, -col2, -col3) or even better use (any_of)

cols_to_drop <- c("year", "carrier")

df %>% 
  select(-any_of(cols_to_drop))
df %>% 
  select(starts_with("dep"))

df %>% 
  select(ends_with("time"))

df %>% 
  select(contains("del"))

df %>% 
  select(matches("(delay|time)$"))
NA
df %>% 
  select(where(is.numeric))

df %>% 
  select(where(is.character))

3. Turning a column into multiple

df_time_hour <- df %>% select(time_hour)
head(df_time_hour)

The function separate(dataframe, col = "column_to_split", into = c("piece1", "piece2"), sep = "separator"))

df_time_hour <-  df_time_hour %>% separate(sep = " ", 
                col = time_hour, 
                into = c("yearmonthday", "time"), 
                remove = T)
head(df_time_hour)
df_time_hour <- df_time_hour %>% 
  separate(sep = "-", 
                col = yearmonthday, 
                into = c(NA, "month", "day"), 
                remove = T)
head(df_time_hour)

The same function can be applied on the time column, into hour, minute, seconds

df_time_hour <- df_time_hour %>% 
  separate(sep = ":", 
                col = time, 
                into = c("hr", "mn", "sec"), 
                remove = T)
head(df_time_hour)

4. uniting columns with unite()

df_time_hour %>%
  unite("newcol", month, day, hr, mn, sec, sep = "-", remove = FALSE)

5. mutate() — create new variables

Adding a column follows this structure mutate(df, "newcolumn" = df$existingcolumn) or mutate(df, newcolumn = df$existingcolumn)

starwars %>%
  mutate(height_m = height / 100, 
         bmi = mass / (height_m^2)) %>%
  select(name, height, mass, height_m, bmi)
new_df <- nycflights23::flights %>% 
  mutate("dag" = day)

new_df <- new_df %>% 
  mutate("Day_Month_Year" = str_c(new_df$day, 
                                  new_df$month, 
                                  new_df$year, sep = "_")) %>%
  relocate(Day_Month_Year, .before = year)
head(new_df)

case_when() — categorical variables

mutate with case_when can be used to add a column based on a condition of another column that follows a pattern.

mutate(NewCol = case_when(grepl(“pattern”, ExistingColumn) ~ “AddedText”, TRUE ~ “AddedText2”))

which means Assign “AddedText” if the pattern exists in the ExistingColumn, if not, assign “AddedText2”

starwars %>%
  mutate(height_group = case_when(
    height < 150 ~ "short",
    height < 190 ~ "medium",
    TRUE ~ "tall")
    ) %>%
  count(height_group)

6.1 Grouping

group_by is used to group the dataframe by a certain column, where another function could be applied on each group. For instance, here we calculate the average arrival delay for each carrier

df %>%
  group_by(carrier) %>%
  summarise(
    avg_arr_delay = mean(arr_delay, na.rm = TRUE),
    n_flights = n()
  )

6.2 group_by() + summarise()

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mean_height = mean(height, na.rm = TRUE)
  ) %>%
  arrange(desc(n))

7. across() — multi-column operations

starwars %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

Merging/Joining dataframes

set.seed(12345)
patient_table <- tibble(
  patient_id = paste0("P", str_pad(1:12, 3, pad = "0")),
  diagnosis = sample(c("control", "sepsis", "cancer"), 12, replace = TRUE),
  age = sample(30:90, 12, replace = TRUE),
  sex = sample(c("F", "M"), 12, replace = TRUE)
)

site_table <- tibble(
  patient_id = paste0("P", str_pad(sample(1:12, 10), 3, pad = "0")),
  hospital = sample(c("SiteA", "SiteB", "SiteC"), 10, replace = TRUE),
  country = sample(c("US", "DE", "FR"), 10, replace = TRUE)
)

patient_table
site_table
NA

left_join()

patients_full <- patient_table %>%
  left_join(site_table, by = "patient_id")

patients_full

Debug: who did NOT match?

patient_table %>%
  anti_join(site_table, by = "patient_id")
airlines_df <- nycflights23::airlines
head(airlines_df)
flights_df <- nycflights23::flights
merged_df <- left_join(flights_df, airlines_df, by = "carrier") 

merged_df %>%  
  dplyr::rename(airlines_nm = name) %>% 
  relocate(airlines_nm, .after = carrier)

left_join(x, y) All from x
right_join(x, y) All from y
inner_join(x, y) Only matching rows
full_join(x, y) All rows from both tables, filling the missing with NAs

Reshaping data (Wide vs Long data shape)

set.seed(12345)

metadata <- tibble(
  sample_id = paste0("S", str_pad(1:24, 3, pad = "0")),
  patient_id = paste0("P", str_pad(sample(1:12, 24, replace = TRUE), 3, pad = "0")),
  tissue = sample(c("tumor", "normal"), 24, replace = TRUE),
  treatment = sample(c("drugA", "drugB", "placebo"), 24, replace = TRUE),
  sex = sample(c("F", "M"), 24, replace = TRUE),
  age = sample(30:85, 24, replace = TRUE),
  batch = sample(c("batch1", "batch2"), 24, replace = TRUE)
)

metadata

genes <- paste0("Gene", 
                str_pad(1:50, 4, 
                        pad = "0"))

counts <- matrix(
  rnbinom(50 * 24, mu = 80, size = 1),
  nrow = 50,
  ncol = 24,
  dimnames = list(genes, metadata$sample_id)
)

counts_df <- as_tibble(counts, rownames = "gene_id")
counts_df

Problem: This is “wide”. Many analyses need “long”.

1. pivot_longer(): wide -> long

counts_long <- counts_df %>%
  pivot_longer(
    cols = starts_with("S"),
    names_to = "sample_id",
    values_to = "count"
  )

counts_long

Now each row is: gene × sample.

Now we join counts with metadata

head(metadata)
counts_annot <- counts_long %>%
  left_join(metadata, by = "sample_id")

counts_annot %>% glimpse()

Here we summarise expression by group

Example: mean counts by tissue per gene

gene_summary <- counts_annot %>%
  group_by(gene_id, tissue) %>%
  summarise(mean_count = mean(count), .groups = "drop")

gene_summary

2. pivot_wider(): long -> wide

Create a gene × tissue table (2 columns: tumor/normal)

gene_wide <- gene_summary %>%
  pivot_wider(
    names_from = tissue,
    values_from = mean_count
  )

gene_wide

3. Binding rows and columns

visit_a <- tibble(
  id = c("P01","P02","P03"),
  visit = "baseline",
  crp = c(2.1, 5.4, 1.9)
)

visit_b <- tibble(
  id = c("P04","P05"),
  visit = "week4",
  crp = c(3.3, 2.8),
  wbc = c(6.1, 5.7)   # extra column not in visit_a
)

3.1 rbind vs bind_rows (stacking rows)

rbind(visit_a, visit_b)
bind_rows(visit_a, visit_b)

here bind_rows does not give an error, making sure rows are stacked, and missing columns are created and filled with NA (wbc is NA for baseline rows)

3.2 bind_cols() vs cbind() (gluing columns)

patients <- tibble(
  id = c("P01","P02","P03"),
  sex = c("F","M","F")
)

labs <- tibble(
  crp = c(2.1, 5.4, 1.9),
  wbc = c(5.8, 6.2, 5.1)
)
cbind(patients, labs)
bind_cols(patients, labs)
labs_short <- tibble(crp = c(2.1, 5.4))

cbind(patients, labs_short)
bind_cols(patients, labs_short)

Recap

1. Column-wise functions

(modify or choose columns):

  • select()

  • rename()

  • relocate()

Create or modify columns

  • mutate()

Identify columns using tidyselect helpers:

  • starts_with(), ends_with(), contains(), matches(), where()

2. Row-wise functions (filter, slice, row operations)

Keep or remove rows

  • filter()

  • slice(), slice_head(), slice_tail(), slice_min(), slice_max()

Order rows

  • arrange()

3. Group-wise functions

(operate within groups)

Grouping changes how other functions behave:

group_by(), ungroup()

4. Data-frame–wise functions

(operate on whole table)

  • bind_rows()

  • bind_cols()

  • left_join(), inner_join(), full_join(), anti_join()

Reshape data:

pivot_longer() and pivot_wider()

5. Summary functions

(usually inside summarise())

mean(), median(), sum(), n(), n_distinct(), sd(), var()

Good-to-know symbols names

() parentheses
[] square brackets
{} curly braces
* asterisk
& ampersand
^ hat
/ forward slash
\ backslash
’’ single quotes
“” double quotes
- hyphen (dash)
_ underscore
~ tilda
` backtick
! exclamation mark
. period
, comma
: colon
; semi-colon
$ dollar sign
# sharp
% percentage

---
title: "R + tidyr (Data Wrangling)"
date: "`r format(Sys.time(),  '%d %B %Y')`"
author: "Mohamed Hassan"
output:
  html_notebook:
    theme: cerulean
    toc: true
    toc_depth: 3
editor_options:
  markdown:
    wrap: 72
---

# Workshop goals

By the end of this workshop you should be able to:

- understand **tidy data** and why it matters for data analysis
- clean and wrangle data using **dplyr** (filter, select, mutate, summarise)
- reshape data using **tidyr** (`pivot_longer()`, `pivot_wider()`, `separate()`)
- combine datasets using **joins** (`right_join()`, `left_join()`, `inner_join()`, `anti_join()`)
- handle real messy cases: missing values, wrong column types, messy IDs, duplicate keys


# Getting started

## Installing R and RStudio

- [R:](https://cran.rstudio.com/) https://cran.rstudio.com/
- [RStudio:](https://www.rstudio.com/products/rstudio/download/) https://www.rstudio.com/products/rstudio/download/


Working with R is primarily text-based. The basic mode of use for R is that the user types in a command in the R language and presses enter, and then R computes and displays the result.

![](r_gui.jpg)

### RStudio

We will be working in [RStudio](https://www.rstudio.com/products/rstudio/download/). This surrounds the *console*, where one enters commands and views the results, with various conveniences. In addition to the console, RStudio provides panels containing:

* A *text editor*, where R commands can be recorded for future reference.
* A history of commands that have been typed on the console.
* An "environment" panel with a list of *variables*, which contain values that R has been told to save from previous commands.
* A file manager.
* Help on the functions available in R (Using a question mark before the function)
* A panel to show plots (graphs).


![](Rstudio_layout.png)





## R script vs RMD (Markdowns and Notebooks)

* To start an R script click `ctrl + shift + N` or `cmd + shift + N` on mac
You can run lines of code, but the output won't be saved    

* To start an RMD click `ctrl + B/N` or `cmd + B/N` on mac to initialize a notebook  
Allows you to run and visualize output in the file, easy to share with collaborators showing both the code and all results/plots  



This is called the main chunk, where is has all the settings that you would like to apply along the notebook
```{r setup}
knitr::opts_chunk$set(echo = T, results = "show")
options(width=80)
require("knitr")

## To change the directory in a notebook/markdown
# opts_knit$set(root.dir = "path/to/directory/")
```


To insert a chunk, use `ctrl + alt + I`, or `command + option + I`

# R Basics

## The console
Open RStudio, click on the "Console" panel, type `3+2` and press enter. R displays the result of the calculation. In this document, we will be showing such an interaction with R as below.
```{r}
3 + 2
```

`+` is called an operator. R has other operators for basic mathematical calculations:  
`-` for subtraction  
`*` for multiplication  
`/` forward slash for division  
`^` hat for "power"

`*` has higher precedence than `+`.  We can use brackets  `( )`, if necessary. Try `1+2*3` and `(1+2)*3`

```{r}
1+2*3
```

```{r}
(1+2)*3
```


Making lines of code according to styler, highlight the lines of code, click `ctrl + shift + A` or `cmd + shift + A`


## Comparisons

We use `>`, `<`, `>=`, `==`, `<=`, `!=` for comparisons. Note that for comparisons its always a double sign to be used. This returns a "logical" value of `TRUE` or `FALSE`.

```{r}
3 * 3 == 9
```

This literally means Is the left hand side equal to the right hand side? Is 3 times 3 equal to 9?  


## Variables

A variable is a name that stores a value. We create a variable by assigining a value to it using `<-` which points to the left assigning the value of the right to the one on the left

```{r}
weight_kg = 65
weight_kg <- 65
```


To print the value stored in a variable, print the variable
```{r}
weight_kg
```

Names of the variables should not contain special characters or spaces, should not start with a number. Dots are ok, unlike many other programming languages


To list all variables in an environment 
```{r}
ls()
## to remove a variable/object
rm(weight_kg)

## to remove some variable of a certain pattern (starting/ending/including ....)
rm(list = ls(pattern = ""))
```




Any line of code inside a chunk or in the console that has a `#` before, is a comment. Useful to document your code so its easy to remember why we used this command

To add a `#` at the beginning of the line for multiple lines, highlight all lines, and click `ctrl + shift + c` or `cmd + shift + c`

We can do arithemtic on the variable
```{r}
weight_kg <- 65
2.2 * weight_kg
```


## Vectors
A vector is a collection of elements (numbers, words,.. etc)
To be able to combine elements into a vector, we use the function c() which stands for combine or concatenate

```{r}
c(1,2,3,4)
```


```{r}
vec <- c(10,20,30,40)
vec + 1 # Here the arithmetic + 1 is applied on every element of the vector
```


We can also combine vectors
```{r}
vec + vec 
# In this case, the 1st element of vec is added to the 1st element of vec, 2nd with 2nd.. and so on
```

To check the length of the vector we use the function length()
```{r}
length(vec)
```

## Data types in R

Most common data types in R are mainly:  
* Numeric (could be double which is the default, or integer)  
* Character (String)  
* Logical (TRUE/FALSE)  

To check the type of a vector, we use typeof() function
```{r}
x <- 5
typeof(x)

## To create an integer add L to the number
x <- 5L
typeof(x)
print(x)
class(x)
```

For Characters, we use single or double quotes 

```{r}
"Hello World"
plain_txt = 'This is a string'
print(plain_txt)
typeof(plain_txt)
class(plain_txt)
```

To be able to look for and modify a pattern, the vector should be a string!

There are also categorical vectors in which elements can be one of several "categories/levels"
```{r}
factor(c("mutant", "wildtype", "mutant"), 
       levels = c("wildtype", "mutant"))
```

Factor is important for arrangement of data in a plot for example.



## Accessing elements of vectors (Indexing)
To access elements, we use [] with a number (index)
```{r}
vec[1]
vec[-1]
vec[c(1,3)]
vec[1:4]
vec[1] <- 6
```

We can use a vector instead of numbers to index a vector 
```{r}
myindex <- c(4,2,1,3)
vec[myindex]

## which is the same as
vec[c(4,2,1,3)]
```

## Matrix
2-D tabular data structure in which all elements are of the same type (typically numeric matrices, but could also be character or logical)
```{r}
# To create a matrix
matrix()
# Example:
mat <- matrix(1:12, nrow = 2, ncol = 6, byrow = F)
mat
typeof(mat)
class(mat)
```

```{r}
# To get the dimensions of a matrix
dim(mat)
## get only how many rows
nrow(mat)
## get only how many columns
ncol(mat)
## get column names
colnames(mat)
## Accessing elements of a matrix
mat[1,3] # row number, # col number]
# this retuns the element at row 1, column 3
```

## Data frame
Similar to a matrix, the difference is that the columns of a dataframe could be of different types (could combine numeric values, strings, logical)
The index of each row could be from 1 to length(dataframe) or could have a unique name (can be accessed/modified using rownames function)

## Functions
a function is followed by parentheses ()
mean(), sd(), print(), rep()
```{r}
# To create a function
name_of_function <- function(){ # inside parentheses what the function takes
  # what the function should do
}

## then call the function
name_of_function()
```


The function takes arguments (options), some are mandatory and some are optional.
Arguments can be supplied in order (positional argument), or by the argument name (better to use).
Clicking `tab` for autocompletion helps us remember what arguments a function could take

```{r}
calc <- function(num1, num2){
  sum = num1 + num2
  print(sum)
}

calc(num1 = 10, 5)
```



## Lists

Lists can contain different kinds of elements
```{r}
mylist <- list(num= 42, greeting="Hello, World")

# to access elements: ## double square bracket to access
mylist[[1]]

mylist$greeting

```

If you are not sure about the type of a variable, run class
```{r}
class(vec)
class(mat)
str(mylist)
```

### Applying a function on a list (very practical)
```{r}
list_nums <- list(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  c = c(5, 5, 5)
)

lapply(list_nums, mean)
```

### Applying an anonymous function (when we want to perform many steps over the list)
```{r}
list_nums_modified <- lapply(list_nums, function(element){
  ## first do this on each element
  element + 4
  ## then do this
  return(element)
})
```


```{r}
## Applying the mean function over a list
lapply(list_nums, function(elem) elem^2) ## returns a list
```


### sapply (simple apply)
for a simplified version
```{r}
sapply(list_nums, function(elem) elem^2) ## returns a simplified form (vector)
```



## Loops

```{r}
for (numb in list_nums){
  print(numb)
}
```

## If conditional
```{r}
list_nums <- c(1,2,3,6,7)
for(numb in list_nums){
  if(numb > 5){
    print(numb)}
    else print("This number is smaller than 5, so will not be printed")
}
```



# Tidy data


Tidy data rules:
1. Each variable = one column
2. Each observation = one row
3. Each type of entity = one table

Why it matters:
- easier plotting
- easier modelling
- fewer bugs
- compatible with tidyverse grammar



## Installing packages

To install a package from [CRAN](https://cran.r-project.org/), we use the function install.packages 
```{r}
install.packages("styler")
install.packages("tidyverse")
```

### From Github

[Github](https://github.com) is a web-based platform for version control, collaboration, and code sharing, built around the Git system. It allows users to track changes to files, manage contributions from multiple people, and publish open-source projects.  

GitHub is widely used for:

- Sharing R packages before they are published on CRAN

- Developing packages collaboratively using pull requests and issues

- Storing analysis projects (scripts, R Markdown, data, documentation)

- Reproducible workflows, since every change is tracked


Installing a package from Github `install_github("author/package")`


![](Github_screenshot_1.png)

```{r}
install.packages("devtools")

## To load a library
library(devtools)
devtools::install_github("githubusername/repository")
install_github("immunogenomics/presto")

## instead of loading a whole package, we can call a function from a package using the format:
package::function()
  
## function select
AnnotationDbi::select()
dplyr::select()
  
## Example
devtools::install_github("immunogenomics/presto")
```


### From Bioconductor
[Bioconductor](https://www.bioconductor.org/)


Bioconductor is an open-source project that provides a large collection of R packages for biological and biomedical data analysis, especially high-throughput sequencing, genomics, transcriptomics, and proteomics.

- Curated, peer-reviewed packages: Every package goes through a structured review for code quality, documentation, and reproducibility.

- Focus on life sciences: Tools for RNA-seq, ChIP-seq, single-cell RNA-seq, epigenomics, variant analysis, annotation, and more.

- Reproducible workflows using standardized data structures (e.g., SummarizedExperiment, SingleCellExperiment).

- Frequent release cycle (twice yearly), synchronized with R versions.

Bioconductor has become a central ecosystem for computational biology because it promotes reproducibility, integrates with modern R workflows, and provides consistent interfaces for complex biological datasets.



![](Bioconductor_screen1.png)

![](Bioconductor_screen2.png)  

```{r}
## First, to install Bioconductor
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.22")

## Installing packages from Bioconductor
BiocManager::install("DESeq2")
```

```{r}
install.packages("tidyverse")
install.packages("janitor")
install.packages("lubridate")
```

## Loading libraries
```{r}
library(tidyverse)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)
library(janitor)
library(lubridate)

# To load multiple libraries at once, we can use pacman package, p_load function
pacman::p_load(tidyverse, styler, dplyr, stringr, janitor, lubridate)

```



## Working with directories
```{r}
## To check the current directory
getwd()
## To change the working directory
setwd("Path/to/directory/")

## Remember to change the directory for the whole notebook, we use the main chunk
```

## Importing data
```{r}
## Importing an R data structure object (.RDS)
myfile <- readRDS(file = "path/to/file.RDS")

## CSV file
myfile <- utils::read.csv()

## More general function
myfile <- utils::read.delim(file = "/path/to/file.extenstion", 
                  header = T, sep = "\t") ## the file has a header with column names, and each observation is separated by a tab
myfile <- readxl::read_xlsx("/path/to/file/file.xlsx")
```


## Checking the content of a file
```{r}
head(myfile) ## shows the first 6 observations, can change n argument
tail(myfile) ## shows the last 6 observations, can change n argument
str(myfile) ## checks the structure of the file as a variable
dim(myfile) ## checks the dimensions of a file (returns number of rows, number of columns)
nrow(myfile) # checks the number of rows
ncol(myfile)# checks the number of columns

## checks the column names
colnames(myfile)
names(myfile)
```

## Data maniuplation

### Adding columns to a dataframe
```{r}
myfile$newcol <- myfile$col1 * myfile$col2
```

### Access columns by name
```{r}
# one column
myfile$columnname
## more than a column
myfile[, c("col1", "col4", "col2")]
myfile[, c(1,4,2)]
```

### Access a column by index
```{r}
myfile[,4] # the row is empty, which means show all rows for column 4
```

### Access rows
```{r}
df[4, ] ## gets the 4th row for all columns
df[4:10] ## from 4th to 10th
```

### Select rows by a condition (exact)
```{r}
myfile_subset <- subset(myfile, given_column == "something") ## subset rows that follow this condition
```


### Select rows by a condition (more than one category)
```{r}
myfile_subset <- subset(myfile, 
                        given_column %in% c("something", "something else")) ## subset rows that follow this condition
```


### Select rows by a condition (exclusion of exact)
```{r}
myfile_subset <- subset(myfile, given_column != "something") ## subset rows that follow this condition
```

### Select rows by a condition (exclusion of more than one category)

```{r}
myfile_subset <- subset(myfile, given_column != "something" & 
         given_column != "something else" &
         given_column != "another thing")
```

Using the `!` to negate 
```{r}
myfile_subset <- subset(myfile,
                        !(given_column %in% c("something", "something else")))
```



There is no equivalent operator for `%in%`, but you can create it!  

```{r}
"%out%" <- Negate("%in%")
```


```{r}
myfile_subset <- subset(myfile, 
                        given_column %out% c("something", "something else")) 
## subset rows that do not follow this condition
```


```{r}
# clean column names
# df <- janitor::clean_names(df)

# install.packages("tidytuesdayR")
# tuesdata <- tidytuesdayR::tt_load('2023-10-24')
# df <- tuesdata$patient_risk_profiles

df <- readxl::read_xlsx("tuesdata_df.xlsx")
head(df)
```

```{r}
df$`predicted risk of Sudden Vision Loss, with no eye pathology causes`
```


```{r}
df <- janitor::clean_names(df)
head(df)
```



# Using tidyverse
Used for data wrangling and maniuplation

When we load tidyverse, by default, dplyr, readr, stringr, tibble, tidyr, purrr, and ggplot2 are loaded
```{r}
library(tidyverse)
```

We will work with [care state](https://github.com/rfordatascience/tidytuesday/tree/main/data/2025/2025-04-08) dataset from tidy tuesday
```{r}
care_state <- readr::read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-08/care_state.csv"
)

care_state %>% glimpse()

```


- *state* >> The two-letter code for the state (or territory, etc) where the hospital is located.  
- *condition* >>	The condition for which the patient was admitted. Six categories of conditions are included in the data.  
- *measure_id* >>	The ID of the thing being measured. Note that there are 22 unique IDs but only 21 unique names.  
- *measure_name* >>	The name of the thing being measured. Note that there are 22 unique IDs but only 21 unique names.  
- *score* >>	The score of the measure.  
- *footnote* 	>> 	Footnotes that apply to this measure: 
  -- 5 = "Results are not available for this reporting period.", 
  -- 25 = "State and national averages include Veterans Health Administration (VHA) hospital data.", 
  -- 26 = "State and national averages include Department of Defense (DoD) hospital data.".  
- *start_date* >>	date 	The date on which measurement began for this measure.  
- *end_date* >>	date 	The date on which measurement ended for this measure.  

Q1: How many columns are there in the dataset? What columns exist? 
Q2: How many observations(rows)? 
Q3: How many different states exist in the dataset ?
Q4: What categories are there in the condition column?
Q5: What is being measured in the "measure_name" column?
Q6: How many observations with footnote 5, 25, and 26?
Q7: Which years are there in the start and end date columns?



```{r}
starwars <- dplyr::starwars
```

```{r}
head(starwars)
```


```{r}
typeof(starwars)
class(starwars)
```

```{r}
## Get the name of all columns
colnames(starwars)
# Or 
names(starwars)
```
# Functions on rows
## 1. Filtering

`filter(dataframe, condition)` like `filter(dataframe, column == "something")`
`filter(dataframe, condition1, condition2)` like `filter(dataframe, column == "something", column == "something")`


**Why it matters in healthcare/bioinformatics:** you almost always filter:
- QC failures
- non-target tissue / condition
- outliers
- incomplete cases


```{r}
human_starwars <- filter(starwars, 
                         species == "Human")
head(human_starwars)
```


```{r}
filter(starwars,
       species == "Human", height > 180)
```

## Combining a lot of functions
```{r}
df <- nycflights23::flights
mean(pull(filter(filter(df, origin == "JFK"), carrier == "AA", !is.na(arr_delay)), arr_delay))
```

Using a pipe ` %>% `. Shortcut is `ctrl + shift + M` or `command + shift + M`
```{r}
df %>%
  filter(origin == "JFK") %>%
  filter(carrier == "AA") %>%
  filter(!is.na(arr_delay)) %>%
  pull(arr_delay) %>%
  mean()
```

Even nicer
```{r}
df %>%
  filter(origin == "JFK",
         carrier == "AA",
         !is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay)) %>% 
  pull(avg_delay)
```

*NOTE*  
The function `pull` is a function that turns a column into a vector
For example if you have a column called "genes", using pull turns the column into a vector with all these genes


## Common patterns
```{r}
starwars %>% filter(is.na(mass))                # missing values
starwars %>% filter(!is.na(height))             # non-missing
starwars %>% filter(between(height, 150, 200))  # within a range
```


## 2. arrange() — sort rows

```{r}
starwars %>% 
  arrange(desc(height))
```



## 3. distinct() — unique rows

```{r}
starwars %>% 
  distinct(species, homeworld)
```

## 4. slice_max, slice_min, slice_head, and slice_tail

```{r}
DEGs <- readxl::read_xlsx("DEGs.xlsx")

DEGs %>% 
  mutate(pct_diff = pct.1 - pct.2) %>% 
  relocate(pct_diff, .after = pct.2) %>% 
  filter(p_val_adj < 0.05) %>% 
  group_by(cluster) %>% 
  slice_max(n = 10, order_by = avg_log2FC)
```


`slice_max` for each group, get the *top* n observations
`slice_min` for each group, get the *bottom* n observations 
`slice_head`  for each group, get the *first* n observations
`slice_tail`  for each group, get the *last* n observations

If not grouped 

`slice_max()` → get the top n rows overall based on a variable.
`slice_min()` → get the bottom n rows overall based on a variable.
`slice_head()` → get the first n rows in the dataset.
`slice_tail()` → get the last n rows in the dataset.


################################################################################
################################################################################

# Functions on columns

## 1. select(), rename(), relocate()

```{r}
starwars %>%
  select(name, height, species, mass) %>%
  rename("weight_kg" = "mass") %>%
  relocate(weight_kg, .after = height)
```

`rename(dataframe, newcolname = oldname)` renames an existing column name to another
Some R versions requires "" for column renaming


`relocate` function allows the new column is added in a specific position, default (last column)


## 2. Removing a column

The function `select` can be used to positively or negatively select columns (in any order)


```{r}
df %>% 
  select(day, month, dep_time)
```


We can use `select(df, -col1)` for one column or multiple columns like `select(df, -col1, -col2, -col3)` or even better use (any_of)

```{r}
cols_to_drop <- c("year", "carrier")

df %>% 
  select(-any_of(cols_to_drop))
```


```{r}
df %>% 
  select(starts_with("dep"))

df %>% 
  select(ends_with("time"))

df %>% 
  select(contains("del"))

df %>% 
  select(matches("(delay|time)$"))

```

```{r}
df %>% 
  select(where(is.numeric))

df %>% 
  select(where(is.character))
```


## 3. Turning a column into multiple
```{r}
df_time_hour <- df %>% select(time_hour)
head(df_time_hour)
```


The function `separate(dataframe, col = "column_to_split", into = c("piece1", "piece2"), sep = "separator"))`

```{r}
df_time_hour <-  df_time_hour %>% separate(sep = " ", 
                col = time_hour, 
                into = c("yearmonthday", "time"), 
                remove = T)
head(df_time_hour)
```

```{r}
df_time_hour <- df_time_hour %>% 
  separate(sep = "-", 
                col = yearmonthday, 
                into = c(NA, "month", "day"), 
                remove = T)
head(df_time_hour)
```
The same function can be applied on the time column, into hour, minute, seconds

```{r}
df_time_hour <- df_time_hour %>% 
  separate(sep = ":", 
                col = time, 
                into = c("hr", "mn", "sec"), 
                remove = T)
head(df_time_hour)
```
## 4. uniting columns with unite()

```{r}
df_time_hour %>%
  unite("newcol", month, day, hr, mn, sec, sep = "-", remove = FALSE)
```

## 5. mutate() — create new variables

Adding a column follows this structure `mutate(df, "newcolumn" = df$existingcolumn)` or  `mutate(df, newcolumn = df$existingcolumn)`

```{r}
starwars %>%
  mutate(height_m = height / 100, 
         bmi = mass / (height_m^2)) %>%
  select(name, height, mass, height_m, bmi)
```

```{r}
new_df <- nycflights23::flights %>% 
  mutate("dag" = day)

new_df <- new_df %>% 
  mutate("Day_Month_Year" = str_c(new_df$day, 
                                  new_df$month, 
                                  new_df$year, sep = "_")) %>%
  relocate(Day_Month_Year, .before = year)
head(new_df)
```

### case_when() — categorical variables

`mutate` with `case_when` can be used to add a column based on a condition of another column that follows a pattern.

mutate(NewCol = case_when(grepl("pattern", ExistingColumn) ~ 
"AddedText", 
TRUE ~ "AddedText2"))  

which means Assign "AddedText" if the pattern exists in the ExistingColumn, if not, assign "AddedText2"

```{r}
starwars %>%
  mutate(height_group = case_when(
    height < 150 ~ "short",
    height < 190 ~ "medium",
    TRUE ~ "tall")
    ) %>%
  count(height_group)
```

## 6.1 Grouping

`group_by` is used to group the dataframe by a certain column, where another function could be applied on each group. For instance, here we calculate the average arrival delay for each carrier

```{r}
df %>%
  group_by(carrier) %>%
  summarise(
    avg_arr_delay = mean(arr_delay, na.rm = TRUE),
    n_flights = n()
  )
```


## 6.2 group_by() + summarise()

```{r}
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mean_height = mean(height, na.rm = TRUE)
  ) %>%
  arrange(desc(n))
```

## 7. across() — multi-column operations

```{r}
starwars %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```


# Merging/Joining dataframes

```{r}
set.seed(12345)
patient_table <- tibble(
  patient_id = paste0("P", str_pad(1:12, 3, pad = "0")),
  diagnosis = sample(c("control", "sepsis", "cancer"), 12, replace = TRUE),
  age = sample(30:90, 12, replace = TRUE),
  sex = sample(c("F", "M"), 12, replace = TRUE)
)

site_table <- tibble(
  patient_id = paste0("P", str_pad(sample(1:12, 10), 3, pad = "0")),
  hospital = sample(c("SiteA", "SiteB", "SiteC"), 10, replace = TRUE),
  country = sample(c("US", "DE", "FR"), 10, replace = TRUE)
)

patient_table
site_table

```

## left_join()

```{r}
patients_full <- patient_table %>%
  left_join(site_table, by = "patient_id")

patients_full
```

## Debug: who did NOT match?

```{r}
patient_table %>%
  anti_join(site_table, by = "patient_id")
```

```{r}
airlines_df <- nycflights23::airlines
head(airlines_df)
```


```{r}
flights_df <- nycflights23::flights
merged_df <- left_join(flights_df, airlines_df, by = "carrier") 

merged_df %>%  
  dplyr::rename(airlines_nm = name) %>% 
  relocate(airlines_nm, .after = carrier)
```

`left_join(x, y)`	All from x  
`right_join(x, y)`	All from y  
`inner_join(x, y)`	Only matching rows  
`full_join(x, y)`	All rows from both tables, filling the missing with NAs  




# Reshaping data (Wide vs Long data shape)
```{r}
set.seed(12345)

metadata <- tibble(
  sample_id = paste0("S", str_pad(1:24, 3, pad = "0")),
  patient_id = paste0("P", str_pad(sample(1:12, 24, replace = TRUE), 3, pad = "0")),
  tissue = sample(c("tumor", "normal"), 24, replace = TRUE),
  treatment = sample(c("drugA", "drugB", "placebo"), 24, replace = TRUE),
  sex = sample(c("F", "M"), 24, replace = TRUE),
  age = sample(30:85, 24, replace = TRUE),
  batch = sample(c("batch1", "batch2"), 24, replace = TRUE)
)

metadata

genes <- paste0("Gene", 
                str_pad(1:50, 4, 
                        pad = "0"))

counts <- matrix(
  rnbinom(50 * 24, mu = 80, size = 1),
  nrow = 50,
  ncol = 24,
  dimnames = list(genes, metadata$sample_id)
)

counts_df <- as_tibble(counts, rownames = "gene_id")
counts_df
```

Problem: This is "wide". Many analyses need "long".


## 1. pivot_longer(): wide -> long

```{r}
counts_long <- counts_df %>%
  pivot_longer(
    cols = starts_with("S"),
    names_to = "sample_id",
    values_to = "count"
  )

counts_long
```

Now each row is: gene × sample.


Now we join counts with metadata

```{r}
head(metadata)
```


```{r}
counts_annot <- counts_long %>%
  left_join(metadata, by = "sample_id")

counts_annot %>% glimpse()
```



Here we summarise expression by group

Example: mean counts by tissue per gene

```{r}
gene_summary <- counts_annot %>%
  group_by(gene_id, tissue) %>%
  summarise(mean_count = mean(count), .groups = "drop")

gene_summary
```

## 2. pivot_wider(): long -> wide

Create a gene × tissue table (2 columns: tumor/normal)

```{r}
gene_wide <- gene_summary %>%
  pivot_wider(
    names_from = tissue,
    values_from = mean_count
  )

gene_wide
```


## 3. Binding rows and columns

```{r}
visit_a <- tibble(
  id = c("P01","P02","P03"),
  visit = "baseline",
  crp = c(2.1, 5.4, 1.9)
)

visit_b <- tibble(
  id = c("P04","P05"),
  visit = "week4",
  crp = c(3.3, 2.8),
  wbc = c(6.1, 5.7)   # extra column not in visit_a
)

```


### 3.1 rbind vs bind_rows (stacking rows)
```{r}
rbind(visit_a, visit_b)
```


```{r}
bind_rows(visit_a, visit_b)
```
here `bind_rows` does not give an error, making sure rows are stacked, and missing columns are created and filled with NA (wbc is NA for baseline rows)


### 3.2 bind_cols() vs cbind() (gluing columns)

```{r}
patients <- tibble(
  id = c("P01","P02","P03"),
  sex = c("F","M","F")
)

labs <- tibble(
  crp = c(2.1, 5.4, 1.9),
  wbc = c(5.8, 6.2, 5.1)
)

```

```{r}
cbind(patients, labs)
```

```{r}
bind_cols(patients, labs)
```

```{r}
labs_short <- tibble(crp = c(2.1, 5.4))

cbind(patients, labs_short)
```

```{r}
bind_cols(patients, labs_short)
```

# Recap 

## 1. **Column-wise** functions 
### (modify or choose columns): 

- select()  

- rename()  

- relocate()  

### Create or modify columns  

- mutate()  

### Identify columns using tidyselect helpers:  
- starts_with(), ends_with(), contains(), matches(), where()


## 2. **Row-wise** functions (filter, slice, row operations)
### Keep or remove rows

- filter()

- slice(), slice_head(), slice_tail(), slice_min(), slice_max()

### Order rows

- arrange()


## 3. **Group-wise** functions 
### (operate within groups)

### Grouping changes how other functions behave:

group_by(), ungroup()

## 4. **Data-frame–wise** functions 
### (operate on whole table)

- bind_rows()

- bind_cols()

- left_join(), inner_join(), full_join(), anti_join()


### Reshape data:

pivot_longer() and pivot_wider()


## 5. **Summary** functions 
### (usually inside summarise())

mean(), median(), sum(), n(), n_distinct(), sd(), var()



# Good-to-know symbols names
()  parentheses  
[]  square brackets  
{}  curly braces  
*  asterisk  
&  ampersand  
^  hat  
/  forward slash  
\\  backslash  
''  single quotes  
""  double quotes  
-  hyphen (dash)  
_  underscore  
~  tilda  
`  backtick  
!  exclamation mark  
.  period  
,  comma  
:  colon  
;  semi-colon  
$  dollar sign  
#  sharp  
%  percentage  
  
