---
title: "Importing Tabular Data in R"
author: "Samuele Soraggi"
date: "`r Sys.Date()`"
output:
  html_notebook:
    theme: cerulean
    toc: true
    toc_depth: 3
editor_options:
  markdown:
    wrap: 72
---

## Goals

In this tutorial, you will learn how to:

1. Import tabular data from CSV files.
2. Import tabular data from Excel files.
3. Clean and manipulate tables with the tidyverse.

## Setup

```{r setup, message=FALSE, warning=FALSE}
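# Load all packages used in this notebook (pacman installs any that are missing)
pacman::p_load(tidyverse, readxl, readr, janitor, ggrepel, gridExtra, ggthemes, scales, writexl)

# Load the publication themes and colour scales used later (needs internet access)
source("https://github.com/koundy/ggplot_theme_Publication/raw/refs/heads/master/ggplot_theme_Publication-2.R")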
```

## Import CSV files

Use `readr::read_csv()` for fast and friendly CSV import. `read_csv()` assumes that columns are separated by commas; it does not detect the separator. For files with another delimiter, use `read_delim()` and set `delim` explicitly.

```{r read-single-csv}
DGE <- read_csv("DGE.csv")

head(DGE)
```

For example, to read a semicolon-separated file:

```{r read-delim, eval=FALSE}
# Not run here: set delim explicitly to read a semicolon-separated file
# DGE <- read_delim("DGE.csv", delim = ";")
```
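
As a minimal sketch of how the delimiter affects parsing, the chunk below reads a small semicolon-separated table from an inline string instead of a file. The gene names and values are invented for illustration, and `I()` tells readr to treat the string as literal data rather than a file path.

```{r sketch-delim}
library(readr)

# Invented example data: two genes, columns separated by semicolons
semicolon_text <- "gene;logFC;FDR\nGENE_A;1.5;0.01\nGENE_B;-2.0;0.20"

# read_delim() with an explicit delimiter splits the text into three columns
toy_dge <- read_delim(I(semicolon_text), delim = ";", show_col_types = FALSE)
toy_dge

# read_csv() expects commas, so the same text ends up in a single column
wrong_dge <- read_csv(I(semicolon_text), show_col_types = FALSE)
ncol(wrong_dge)
```

If a table you import has only one column with separators visible inside the values, a wrong delimiter is the likely cause.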

Some files start with a few lines of comments or metadata before the actual table. Use the `skip` argument to ignore them; here the first three lines are skipped:

```{r read-skip}
DGE_skip <- read_csv("weirdDGE.csv", skip = 3)
```
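
To see what `skip` does without needing `weirdDGE.csv`, here is a small sketch on an inline string. The three leading lines and the gene values are invented; only the structure (free text before the header) mirrors the case described above.

```{r sketch-skip}
library(readr)

# Invented file content: three free-text lines before the real header
messy_text <- paste(
  "Exported from an analysis pipeline",
  "Contrast: case vs control",
  "Some other note",
  "gene,logFC,FDR",
  "GENE_A,1.5,0.01",
  "GENE_B,-2.0,0.20",
  sep = "\n"
)

# skip = 3 drops the first three lines, so the fourth line becomes the header
toy_skip <- read_csv(I(messy_text), skip = 3, show_col_types = FALSE)
toy_skip
```

Open the file in a text editor first to count how many lines come before the header.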

## Import Excel files

Use `readxl::read_excel()` to import Excel files. You can specify the sheet name or number, and optionally a range of cells.

```{r read-excel-one}
DGE_CaseControl <- read_excel("DGEtables.xlsx", sheet = "CaseControl")

head(DGE_CaseControl)
```

### Read a specific range of cells in Excel

```{r read-excel-range}
DGE_subset <- read_excel("DGEtables.xlsx", 
                        sheet = "CaseControl",
                        range = "A1:F50")

head(DGE_subset)
```

### Read all sheets and combine

A more advanced example: read every sheet of an Excel file, combine the sheets into one table, and store the name of the source sheet in a new column:

```{r read-all-sheets}
# Names of all sheets in the Excel file
sheet_names <- excel_sheets("DGEtables.xlsx")
cat("sheet names are ", paste(sheet_names, collapse = ", "), "\n")

# Run read_excel() once per sheet name. map_dfr() row-binds the
# resulting tables into one, and .id = "sheet" adds a column with
# the sheet name to keep track of the source of each row.
DGE_all <- sheet_names %>%
  set_names() %>%
  map_dfr(
    ~ read_excel("DGEtables.xlsx", sheet = .x),
    .id = "sheet"
  )

summary(DGE_all)
```
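
In purrr 1.0.0 and later, `map_dfr()` is marked as superseded and the suggested pattern is `map()` followed by `list_rbind()`. The sketch below shows that pattern on a small two-sheet workbook written to a temporary file, so it does not depend on `DGEtables.xlsx`. The sheet contents are invented; the sheet names are borrowed from the example above.

```{r sketch-list-rbind}
library(readxl)
library(writexl)
library(purrr)

# Build a small two-sheet workbook in a temporary file (invented data)
toy_path <- tempfile(fileext = ".xlsx")
write_xlsx(
  list(
    CaseControl = data.frame(gene = c("GENE_A", "GENE_B"), logFC = c(1.5, -2.0)),
    CaseExposed = data.frame(gene = c("GENE_A", "GENE_C"), logFC = c(0.3, 2.2))
  ),
  toy_path
)

# Read every sheet into a named list, then stack the list into one tibble;
# names_to stores the list names (the sheet names) in a new column
toy_all <- excel_sheets(toy_path) %>%
  set_names() %>%
  map(~ read_excel(toy_path, sheet = .x)) %>%
  list_rbind(names_to = "sheet")

toy_all
```

Both approaches should give the same kind of result; `map_dfr()` still works, so switching is optional.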


## Core tidyverse table manipulation and piping

The functions below are the core tidyverse verbs for manipulating tables. You can chain them together with the pipe operator (`%>%`). The pipe passes the result on its left as the first argument of the function on its right, so `x %>% f(y)` is the same as `f(x, y)`.

In the chunk below, `significant_DGE` is defined as a sequence of piped operations. They run in order from first to last, and the tidyverse function names describe clearly what each operation does. The output of each operation is piped into the next. Here we

1. use the `DGE` tibble (the tidyverse version of a table),
2. THEN rename the first column to `gene_name`,
3. THEN clean up column names with `clean_names()`,
4. THEN mutate the `gene_type` column into a factor (category),
5. THEN select only the columns of interest,
6. THEN filter the rows by FDR and gene type,
7. THEN arrange the rows by descending FDR.

`glimpse()` gives an overview of the tibble's structure and the first few values.

```{r core-verbs-1}
significant_DGE <- DGE %>%
  rename(gene_name = 1) %>% # rename column 1 to gene_name
  janitor::clean_names() %>% # clean up column names
  mutate(gene_type = factor(gene_type)) %>% # turn gene type into a factor
  select(gene_name, log_fc, fdr, gene_type) %>% # keep only the columns of interest
  filter(fdr < .05 & gene_type == "protein-coding") %>% # filter by FDR and gene type
  arrange(desc(fdr)) # sort by descending FDR

glimpse(significant_DGE)
```
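
To make the role of the pipe concrete, this small sketch writes the same two steps once as nested function calls and once as a pipeline. The toy table is invented; the point is only that both forms do the same work, while the piped form reads from top to bottom.

```{r sketch-pipe}
library(dplyr)

# Invented toy table
toy <- tibble(
  gene = c("GENE_A", "GENE_B", "GENE_C"),
  fdr  = c(0.20, 0.01, 0.03)
)

# Nested calls: read from the inside out
nested <- arrange(filter(toy, fdr < 0.05), desc(fdr))

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(fdr < 0.05) %>%
  arrange(desc(fdr))

# Compare the two results
all.equal(nested, piped)
```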

### Create new columns with mutate

You can also create a new column with `mutate()`. Say we want a DGE tibble like the one above, but without filtering. Instead, we add a category with the values SIGN(.001), SIGN(.01), SIGN(.05) or NOT.SIGN, depending on the FDR. A gene only counts as significant if its absolute log fold change is also greater than 1, that is, above 1 or below -1.

Use `mutate()` together with `case_when()` to create the new column from those criteria:

```{r mutate-cols}
extended_DGE <- DGE %>%
  rename(gene_name = 1) %>% # rename column 1 to gene_name
  janitor::clean_names() %>% # clean up column names
  mutate(gene_type = factor(gene_type)) %>% # turn gene type into a factor
  select(gene_name, log_fc, fdr, gene_type) %>% # keep only the columns of interest
  mutate(significance_level = case_when(
    fdr < .001 & abs(log_fc) > 1 ~ "SIGN(.001)",
    fdr < .01 & abs(log_fc) > 1 ~ "SIGN(.01)",
    fdr < .05 & abs(log_fc) > 1 ~ "SIGN(.05)",
    TRUE ~ "NOT.SIGN"
  )) %>% # create a new column with the significance level
  mutate(significance_level = factor(significance_level)) %>%
  arrange(desc(fdr)) # sort by descending FDR

glimpse(extended_DGE)
```
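
The order of the conditions in `case_when()` matters: each row receives the label of the first condition it satisfies. The sketch below uses four invented FDR values (and leaves out the log fold change criterion for brevity) to compare the order used above with the reverse order.

```{r sketch-case-when}
library(dplyr)

# Invented FDR values, one per threshold band
toy_levels <- tibble(fdr = c(0.0005, 0.005, 0.03, 0.5)) %>%
  mutate(
    # Strictest threshold first: each row gets the most specific label
    strict_first = case_when(
      fdr < .001 ~ "SIGN(.001)",
      fdr < .01  ~ "SIGN(.01)",
      fdr < .05  ~ "SIGN(.05)",
      TRUE       ~ "NOT.SIGN"
    ),
    # Loosest threshold first: the first rule catches every significant row
    loose_first = case_when(
      fdr < .05  ~ "SIGN(.05)",
      fdr < .01  ~ "SIGN(.01)",
      fdr < .001 ~ "SIGN(.001)",
      TRUE       ~ "NOT.SIGN"
    )
  )

toy_levels
```

With the loosest threshold first, the stricter labels are never assigned, which is why the chunk above starts with `fdr < .001`.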

### Summarize by group

You can summarize by group to get group statistics, for example the number of genes in each gene type and their median log fold change and FDR.

```{r summarize-group}
summary_DGE <- extended_DGE %>%
	group_by(gene_type) %>%
	summarise(
		n = n(),
		median_lfc = median(log_fc, na.rm = TRUE),
		median_fdr = median(fdr, na.rm = TRUE),
		.groups = "drop"
	)

summary_DGE
```
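
When only the number of rows per group is needed, `count()` is a shortcut for `group_by()` followed by `summarise(n = n())`. A minimal sketch on an invented toy table:

```{r sketch-count}
library(dplyr)

# Invented toy table with two gene types
toy_types <- tibble(
  gene_type = c("protein-coding", "protein-coding", "ncRNA"),
  log_fc    = c(1.2, -0.4, 0.1)
)

# Long form: group, then count the rows in each group
by_summarise <- toy_types %>%
  group_by(gene_type) %>%
  summarise(n = n(), .groups = "drop")

# Short form: count() groups and counts in one step
by_count <- toy_types %>%
  count(gene_type)

by_count
```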

## Plotting with ggplot2

You can also plot your data with ggplot2, which is part of the tidyverse. For example, a scatter plot of log fold change against -log10(FDR), colored by gene type:

```{r ggplot}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
	geom_point(size=3) +
	theme_minimal() +
	labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```

`ggplot()` takes a data frame or tibble as input, and each layer of the plot is added with the `+` symbol.
There are many ways to customize a plot, and many other plot types to choose from.
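
Because a plot is built by adding layers, it can be stored in a variable and extended step by step. The following sketch uses an invented three-row table; the object names are only for illustration.

```{r sketch-layers}
library(ggplot2)

# Invented toy data
toy_points <- data.frame(
  log_fc = c(-2, 0.1, 3),
  fdr    = c(0.01, 0.8, 0.001)
)

# Step 1: data and aesthetics only, nothing is drawn yet
base_plot <- ggplot(toy_points, aes(x = log_fc, y = -log10(fdr)))

# Step 2: add a layer of points
with_points <- base_plot + geom_point()

# Step 3: add axis labels on top
with_labels <- with_points + labs(x = "Log Fold Change", y = "-Log10(FDR)")

with_labels
```

Printing `base_plot` alone shows empty axes, which is a quick way to check the aesthetics before adding any geoms.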

We can add a column with the names of the genes that are differentially expressed, using `mutate()` and `case_when()`, and then add a text layer to the plot. Genes without a label get `NA`, so `geom_text_repel()` warns that it removed rows containing missing values; this is expected.

```{r label-name}
extended_DGE <- extended_DGE %>%
  mutate(plot_label = case_when(
    significance_level == "NOT.SIGN" ~ NA_character_, # no label for non-significant genes
    TRUE ~ gene_name
  ))

glimpse(extended_DGE)
```

```{r plot-with-labels}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + # add labels to significant genes
  theme_minimal() +
  geom_hline(yintercept = -log10(0.05)) + # horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + # vertical lines at log fold change = +/-1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```

### Themes

You can change the theme to adapt the appearance of a plot. All standard themes are listed here: https://ggplot2.tidyverse.org/reference/ggtheme.html. For example, `theme_gray()` is the default ggplot2 theme:

```{r theme-ggplot }
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + # add labels to significant genes
  theme_gray() +
  geom_hline(yintercept = -log10(0.05)) + # horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + # vertical lines at log fold change = +/-1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```

You can get nicer themes from https://github.com/koundy/ggplot_theme_Publication, where the author has developed their own themes. We imported them at the beginning of this notebook. For example, `theme_Publication()` is designed for publications, with nicer text sizes and fonts and a nicer legend.

```{r theme-publication}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + # add labels to significant genes
  theme_Publication() +
  geom_hline(yintercept = -log10(0.05)) + # horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + # vertical lines at log fold change = +/-1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```

Or combine one of these themes (here the dark blue one) with the same author's color palette for discrete variables such as gene type. See the web page of the themes for more examples.

```{r theme-publication-palette}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + # add labels to significant genes
  scale_colour_Publication() + # points use the colour aesthetic, so a colour (not fill) scale is needed
  theme_dark_blue() +
  geom_hline(yintercept = -log10(0.05)) + # horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + # vertical lines at log fold change = +/-1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```

Still not enough? The ggthemes package offers even more color palettes and themes. Have a look at https://github.com/jrnold/ggthemes, where the author shows many examples, and try one of them!

### Saving ggplots

You can save a plot with `ggsave()`. By default it saves the last plot that was displayed; assigning the plot to a variable and passing it to the `plot` argument makes explicit which plot is saved. For example:

```{r ggsave}
volcano_plot <- ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + # add labels to significant genes
  theme_gray() +
  geom_hline(yintercept = -log10(0.05)) + # horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + # vertical lines at log fold change = +/-1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")

# The file extension (.png) determines the output format
ggsave(filename = "./volcano_plot.png", plot = volcano_plot, width = 12, height = 10, dpi = 300, units = "cm")
```
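
`ggsave()` picks the output format from the file extension, so the same plot can be written as PNG and PDF by changing only the file name. The sketch below saves an invented toy plot into the session's temporary directory so that it does not clutter the project folder; the file names are hypothetical.

```{r sketch-ggsave}
library(ggplot2)

# Invented toy plot
toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 4, 8)), aes(x, y)) +
  geom_point()

# Hypothetical output paths in the temporary directory
png_path <- file.path(tempdir(), "toy_plot.png")
pdf_path <- file.path(tempdir(), "toy_plot.pdf")

# Same plot, two formats; dpi only matters for raster formats such as PNG
ggsave(png_path, plot = toy_plot, width = 12, height = 10, units = "cm", dpi = 300)
ggsave(pdf_path, plot = toy_plot, width = 12, height = 10, units = "cm")

file.exists(c(png_path, pdf_path))
```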

### Some other plots: barplots, boxplots, density plots, ...

There is a wide range of plot types available in ggplot2, and the right choice depends on your data and on what you want to visualize. For example, here is a bar plot of the number of genes in each gene type. Note that the bars show `log(n)`, the natural logarithm of the counts:

```{r barplot}
ggplot(summary_DGE, aes(x = gene_type, y = log(n), fill = gene_type)) +
  geom_bar(stat = "identity") + # bar height is the precomputed value in y
  scale_fill_Publication() +
  theme_dark_blue()
```

Note that above we needed only one value per gene type to plot each bar. With several values per gene type, use `stat = "summary"` and specify a summary function (for example `fun = "mean"` or `fun = "median"`). For example, to plot the median log fold change for each gene type:

```{r barplot-median}
ggplot(extended_DGE, aes(x = gene_type, y = log_fc, fill = gene_type)) +
  geom_bar(stat = "summary", fun = "median") + # bar height is the median of y per gene type
  scale_fill_Publication() +
  theme_dark_blue() +
  # rotate the x axis labels by 45 degrees
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
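
A bar of medians hides the spread of the values. A boxplot shows the median together with the quartiles and outliers of each group. The sketch below uses randomly generated toy data (not the DGE table) and the default theme, so it runs on its own.

```{r sketch-boxplot}
library(ggplot2)

# Invented toy data: 20 random log fold changes per gene type
set.seed(1)
toy_lfc <- data.frame(
  gene_type = rep(c("protein-coding", "ncRNA"), each = 20),
  log_fc    = c(rnorm(20, mean = 0.5), rnorm(20, mean = -0.5))
)

# One box per gene type
toy_box <- ggplot(toy_lfc, aes(x = gene_type, y = log_fc, fill = gene_type)) +
  geom_boxplot() +
  labs(x = "Gene type", y = "Log Fold Change")

toy_box
```

The same call should work with `extended_DGE` in place of the toy data, since it has the same `gene_type` and `log_fc` columns.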

### Faceting

Faceting is a powerful way to create multiple plots based on a categorical variable. For example, we can facet the scatter plot by gene type:

```{r facet-ggplot}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label)) + # add labels to significant genes
  theme_minimal() +
  geom_hline(yintercept = -log10(0.05)) + # horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + # vertical lines at log fold change = +/-1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)") +
  facet_wrap(~ gene_type) # one panel per gene type
```
  
  
Another useful plot type is a density plot, here showing the distribution of log fold change for each gene type:

```{r density-ggplot}
ggplot(extended_DGE, aes(x = log_fc, fill = gene_type)) +
  geom_density(alpha = 0.5) + # semi-transparent so overlapping curves stay visible
  scale_fill_Publication() +
  theme_excel_new() +
  labs(title = "Density Plot of Log Fold Change\nby Gene Type", x = "Log Fold Change", y = "Density")
```

## Export cleaned results

```{r export, eval=FALSE}
# Export to CSV
write_csv(extended_DGE, "final_DGE.csv")

# Export to Excel (needs writexl package)
writexl::write_xlsx(extended_DGE, "final_DGE.xlsx")
```
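
One thing to keep in mind when exporting to CSV: a CSV file stores only text, so factor columns such as `gene_type` come back as character columns when the file is read again. The sketch below shows this round trip on an invented toy table written to a temporary file.

```{r sketch-roundtrip}
library(readr)

# Invented toy table with a factor column
toy_export <- data.frame(
  gene_name = c("GENE_A", "GENE_B"),
  log_fc    = c(1.5, -2.0),
  gene_type = factor(c("protein-coding", "ncRNA"))
)

# Write to a temporary CSV file and read it back
csv_path <- tempfile(fileext = ".csv")
write_csv(toy_export, csv_path)
toy_back <- read_csv(csv_path, show_col_types = FALSE)

# The values survive, but the factor is now a character column
class(toy_export$gene_type)
class(toy_back$gene_type)
```

If the factor levels matter, re-apply `factor()` after importing.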

## Quick practice tasks

1. Import one CSV or one Excel file from your own project.
2. Use `clean_names()` and `mutate()` to standardize and clean up the columns.
3. Add some labels of interest, as in the significance example.
4. Make a plot you would like to see, for example a scatter plot or a bar plot, and customize it with themes and labels.
5. Export the final cleaned table.

## Troubleshooting tips

- If import fails due to separators, try `read_delim()` with the correct `delim`.
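- If text appears garbled, the file encoding may differ from UTF-8, which readr assumes by default. Pass the correct encoding through the `locale` argument, for example `locale = locale(encoding = "latin1")`.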
- If column types are wrong, fix with `mutate()` and `as.numeric()`, `as.factor()`, etc.
- If joins add many missing values, verify the key columns (for example `sample_id`) match exactly in both tables.
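
As a sketch of the column type tip, the chunk below reads an invented table in which a missing value is written as `n/a`. By default that entry turns the whole column into text. Declaring the missing-value strings with `na` and the expected types with `col_types` at import time is an alternative to fixing the column afterwards with `mutate()`.

```{r sketch-col-types}
library(readr)

# Invented data: the second logFC value is a non-standard missing-value code
na_text <- "gene,logFC\nGENE_A,1.5\nGENE_B,n/a"

# Default import: "n/a" is not a number, so logFC is read as character
toy_raw <- read_csv(I(na_text), show_col_types = FALSE)
class(toy_raw$logFC)

# Treat "n/a" as missing and state the expected column types explicitly
toy_fixed <- read_csv(
  I(na_text),
  na = c("", "NA", "n/a"),
  col_types = cols(gene = col_character(), logFC = col_double())
)
class(toy_fixed$logFC)
```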

## Some resources

The internet is a wide ocean, so here are just a few selected resources to focus on:

- The Software Carpentry R novice gapminder lesson: https://swcarpentry.github.io/r-novice-gapminder/index.html
- A slightly more in-depth R and tidyr workshop we recently held at the coding cafe: https://abc.au.dk/documentation/2026-02-26-tidyR-omics1.html
- The tidyr cheatsheet from Posit: https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf
- The Help pane of RStudio, where you can find the documentation of every function.
- The Tab key for autocompleting functions and arguments in RStudio. It is the starting point for an easier life.
