---
title: "Importing Tabular Data in R"
author: "Samuele Soraggi"
date: "`r Sys.Date()`"
output:
  html_notebook:
    theme: cerulean
    toc: true
    toc_depth: 3
editor_options:
  markdown:
    wrap: 72
---

## Goals

In this tutorial, you will learn how to:

1. Import tabular data from CSV files.
2. Import tabular data from Excel files.
3. Clean and manipulate tables with the tidyverse.

## Setup

```{r setup, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, readxl, readr, janitor, ggrepel, gridExtra, ggthemes, scales, writexl)

source("https://github.com/koundy/ggplot_theme_Publication/raw/refs/heads/master/ggplot_theme_Publication-2.R")
```

## Import CSV files

Use `readr::read_csv()` for fast and friendly CSV import. `read_csv()` assumes that the columns are separated by commas. For files that use another separator, use `read_delim()` and set the `delim` argument.

```{r read-single-csv}
DGE <- read_csv("DGE.csv")

head(DGE)
```

Example of how to specify a different delimiter (semicolon):

```{r read-delim, eval=FALSE}
# not evaluated: template for a semicolon-separated file
DGE <- read_delim("DGE.csv", delim = ";")
```

A file can also start with a few comment lines that you want to skip. Use the `skip` argument to ignore those lines:

```{r read-skip}
DGE_skip <- read_csv("weirdDGE.csv", skip = 3)
```
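
As a self-contained illustration of `delim` and `skip`, the sketch below writes a small semicolon-separated file with two comment lines to a temporary location and reads it back. The file contents and the names `tmp_csv` and `toy_dge` are made up for this example; they are not part of the course data.

```{r sketch-delim-skip}
library(readr)

# Hypothetical toy file: two comment lines, then a semicolon-separated table
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "# exported by some tool",
  "# comparison: case vs control",
  "gene;logFC;FDR",
  "GENE_A;2.5;0.001",
  "GENE_B;-0.3;0.8"
), tmp_csv)

# skip = 2 drops the two comment lines; delim = ";" sets the column separator
toy_dge <- read_delim(tmp_csv, delim = ";", skip = 2, show_col_types = FALSE)

toy_dge
```

When all comment lines share a prefix, `comment = "#"` is an alternative to counting the lines for `skip`.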

## Import Excel files

Use `readxl::read_excel()` to import Excel files. You can specify the sheet name or number, and optionally a range of cells.

```{r read-excel-one}
DGE_CaseControl <- read_excel("DGEtables.xlsx", sheet = "CaseControl")

head(DGE_CaseControl)
```

### Read a specific range of cells in Excel

```{r read-excel-range}
DGE_subset <- read_excel("DGEtables.xlsx", 
                        sheet = "CaseControl",
                        range = "A1:F50")

head(DGE_subset)
```

### Read all sheets and combine

A more advanced example merges all sheets of an Excel file into one table and records the source sheet in a new column:

```{r read-all-sheets}
# names of the sheets in the Excel file
sheet_names <- excel_sheets("DGEtables.xlsx")
cat("sheet names are ", paste(sheet_names, collapse = ", "), "\n")

# take the sheet names and run read_excel() on each sheet;
# map_dfr() binds all the tables into one by row,
# and .id adds a column with the sheet name to keep
# track of the source of each row
DGE_all <- sheet_names %>%
  set_names() %>%
  map_dfr(
    ~ read_excel("DGEtables.xlsx", sheet = .x),
    .id = "sheet"
  )

summary(DGE_all)
```
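
In purrr 1.0.0 and later, `map_dfr()` is marked as superseded, and the suggested replacement is `map()` followed by `list_rbind()`. The sketch below shows that pattern on a small workbook created on the fly with `writexl`, so it does not depend on `DGEtables.xlsx`. The sheet names, the values, and the names `tmp_xlsx` and `toy_all` are invented for illustration.

```{r sketch-list-rbind}
library(readxl)
library(purrr)
library(writexl)

# Hypothetical two-sheet workbook written to a temporary file
tmp_xlsx <- tempfile(fileext = ".xlsx")
write_xlsx(
  list(
    SheetA = data.frame(gene = c("GENE_A", "GENE_B"), logFC = c(1.2, -0.4)),
    SheetB = data.frame(gene = "GENE_C", logFC = 3.1)
  ),
  tmp_xlsx
)

# Read every sheet into a named list, then bind the rows;
# names_to stores the list names (the sheet names) in a new column
toy_all <- excel_sheets(tmp_xlsx) %>%
  set_names() %>%
  map(function(s) read_excel(tmp_xlsx, sheet = s)) %>%
  list_rbind(names_to = "sheet")

toy_all
```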


## Core tidyverse table manipulation and piping

These are the core operations for manipulating tables. You can chain them together with the pipe operator (`%>%`). What does the pipe do?

Look below: we define `significant_DGE` as a sequence of piped operations. They run in order from first to last, and the tidyverse function names say clearly what each operation does. The output of each operation is piped into the next. Here we

1. use the `DGE` tibble (tidyverse table)
2. THEN rename the first column to `gene_name`
3. THEN clean up column names with `clean_names()`
4. THEN mutate the `gene_type` column into a factor (category)
5. THEN select only the columns of interest
6. THEN filter the rows by FDR and gene type
7. THEN arrange the rows by descending FDR
 
`glimpse` will give an overview of the tibble's structure and the first few values.

```{r core-verbs-1}
significant_DGE <- DGE %>%
  rename(gene_name = 1) %>% #rename column 1 to gene_name
  janitor::clean_names() %>% #clean up column names
  mutate(gene_type = factor(gene_type)) %>% #factorize categories of gene types
  select(gene_name, log_fc, fdr, gene_type) %>% #select only some columns of interest
  filter(fdr < .05 & gene_type == "protein-coding") %>% #filter by fdr and gene type
  arrange(desc(fdr)) #arrange by descending FDR

glimpse(significant_DGE)
```
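
To answer the question of what the pipe does: `x %>% f(y)` is evaluated as `f(x, y)`, so the left-hand side becomes the first argument of the next function. The sketch below writes the same two-step operation once as nested calls and once as a pipeline. The tibble `toy_tbl` is a made-up example.

```{r sketch-pipe}
library(dplyr)

# Hypothetical toy table
toy_tbl <- tibble(gene = c("GENE_A", "GENE_B", "GENE_C"),
                  fdr  = c(0.20, 0.01, 0.04))

# Nested calls: read from the inside out
nested <- arrange(filter(toy_tbl, fdr < .05), desc(fdr))

# Piped calls: read from top to bottom
piped <- toy_tbl %>%
  filter(fdr < .05) %>%
  arrange(desc(fdr))

# Both versions give the same table
identical(nested, piped)
```

Since R 4.1, the native pipe `|>` behaves the same way in simple chains like this one.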

### Create new columns with mutate

You can also create a new column with `mutate()`. Let's say we want to define a DGE tibble like the one above, but without filtering. Instead we create a category with the values SIGN(.001), SIGN(.01), SIGN(.05) or NOT.SIGN, depending on the FDR, and we also require the absolute log fold change to be greater than 1 (that is, above 1 or below -1).

You can use `mutate()` together with `case_when()` to create a new column based on those criteria:

```{r mutate-cols}
extended_DGE <- DGE %>%
  rename(gene_name = 1) %>% #rename column 1 to gene_name
  janitor::clean_names() %>% #clean up column names
  mutate(gene_type = factor(gene_type)) %>% #factorize categories of gene types
  select(gene_name, log_fc, fdr, gene_type) %>% #select only some columns of interest
  mutate(significance_level = case_when(
    fdr < .001 & abs(log_fc) > 1 ~ "SIGN(.001)",
    fdr < .01 & abs(log_fc) > 1 ~ "SIGN(.01)",
    fdr < .05 & abs(log_fc) > 1 ~ "SIGN(.05)",
    TRUE ~ "NOT.SIGN"
  )) %>% #create a new column with the significance level
  mutate(significance_level = factor(significance_level)) %>%
  arrange(desc(fdr)) #arrange by descending FDR

glimpse(extended_DGE)
```
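
`case_when()` checks its conditions from top to bottom and uses the first one that is true for each row, which is why the strictest threshold is listed first above. The sketch below demonstrates this on made-up values; `toy_fdr` and its columns are hypothetical.

```{r sketch-case-when-order}
library(dplyr)

# Hypothetical FDR values
toy_fdr <- tibble(fdr = c(0.0005, 0.02, 0.5))

toy_fdr <- toy_fdr %>%
  mutate(
    # strictest condition first: each row gets the most specific label
    strict_first = case_when(
      fdr < .001 ~ "SIGN(.001)",
      fdr < .05  ~ "SIGN(.05)",
      TRUE       ~ "NOT.SIGN"
    ),
    # loosest condition first: the fdr < .001 branch is never reached
    loose_first = case_when(
      fdr < .05  ~ "SIGN(.05)",
      fdr < .001 ~ "SIGN(.001)",
      TRUE       ~ "NOT.SIGN"
    )
  )

toy_fdr
```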

### Summarize by group

You can summarize by group to get group statistics, for example the number of genes in each gene type together with their median FDR and median log fold change.

```{r summarize-group}
summary_DGE <- extended_DGE %>%
	group_by(gene_type) %>%
	summarise(
		n = n(),
		median_lfc = median(log_fc, na.rm = TRUE),
		median_fdr = median(fdr, na.rm = TRUE),
		.groups = "drop"
	)

summary_DGE
```

## Plotting with ggplot2

You can also plot your data with ggplot2, which is part of the tidyverse. For example, a scatter plot of log fold change vs fdr, colored by gene type:

```{r ggplot}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
	geom_point(size=3) +
	theme_minimal() +
	labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```
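
The y axis uses `-log10(fdr)` so that small FDR values, which are the most significant ones, end up at the top of the plot. A quick check of the transformation on a few illustrative values (not taken from the data):

```{r sketch-neg-log10}
# Illustrative FDR values
fdr_values <- c(1, 0.05, 0.01, 0.001)

# -log10 maps an FDR of 1 to 0 and grows as the FDR shrinks
neg_log10_fdr <- -log10(fdr_values)

round(neg_log10_fdr, 2)
```

An FDR threshold of 0.05 therefore sits at about 1.3 on this axis.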

`ggplot()` takes a data frame or tibble as input, and layers are added to the plot with the `+` symbol.
There are many ways to customize the plot or to make other types of plots.

We can add a column containing the names of the genes that are differentially expressed. We can do this with `mutate()` and `case_when()`. Then we add a text layer to the plot.

```{r label-name}
extended_DGE <- extended_DGE %>%
  mutate(plot_label = case_when(
    significance_level == "NOT.SIGN" ~ NA_character_, #no label for non-significant genes
    TRUE ~ gene_name
  ))

glimpse(extended_DGE)
```

```{r plot-with-labels}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + #add labels to significant genes
  theme_minimal() +
  geom_hline(yintercept = -log10(0.05)) + #horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + #vertical lines at log fold change 1 and -1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```

### Themes

You can change the theme to adapt the appearance of the plot. All standard themes are listed here: https://ggplot2.tidyverse.org/reference/ggtheme.html. For example, the gray theme, which is the ggplot2 default, is popular:

```{r theme-ggplot }
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + #add labels to significant genes
  theme_gray() +
  geom_hline(yintercept = -log10(0.05)) + #horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + #vertical lines at log fold change 1 and -1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```

You can get nicer themes from https://github.com/koundy/ggplot_theme_Publication, where the author has developed their own themes. We imported them with `source()` at the beginning of this notebook. For example, you can use a theme designed for publications (better text sizes and fonts, and a nicer legend).

```{r theme-publication}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + #add labels to significant genes
  theme_Publication() +
  geom_hline(yintercept = -log10(0.05)) + #horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + #vertical lines at log fold change 1 and -1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```

Or combine a theme with a color palette for discrete variables (gene type) from the same author. You can look at the web page for the themes to see more examples.

```{r theme-publication-palette}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + #add labels to significant genes
  scale_colour_Publication() + #colour scale, because the points are mapped to color and not to fill
  theme_dark_blue() +
  geom_hline(yintercept = -log10(0.05)) + #horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + #vertical lines at log fold change 1 and -1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")
```

Still not enough? The package ggthemes has even more ways of combining color palettes and themes. Have a look at https://github.com/jrnold/ggthemes, where the author shows a lot of examples, and try one of them!

### Saving ggplots

You can save a ggplot with `ggsave()`. By default it saves the last plot that was displayed, so it is safer to assign the plot to a variable and pass it explicitly. For example:

```{r ggsave}
volcano_plot <- ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label), size = 3) + #add labels to significant genes
  theme_gray() +
  geom_hline(yintercept = -log10(0.05)) + #horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + #vertical lines at log fold change 1 and -1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)")

#save the plot as a 12 x 10 cm PNG at 300 dpi
ggsave(filename = "./volcano_plot.png", plot = volcano_plot, width = 12, height = 10, dpi = 300, units = "cm")
```
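
`ggsave()` chooses the output format from the file extension, so the same call can write a PNG or a PDF. A minimal sketch with a toy plot saved to a temporary file; `toy_plot` and `out_file` are hypothetical names used only here, and the values are made up.

```{r sketch-ggsave}
library(ggplot2)

# Toy plot on made-up values
toy_plot <- ggplot(data.frame(log_fc = c(-2, 0, 3), fdr = c(0.01, 0.9, 0.001)),
                   aes(x = log_fc, y = -log10(fdr))) +
  geom_point()

# The .pdf extension selects a vector format; width and height are in cm here
out_file <- file.path(tempdir(), "toy_volcano.pdf")
ggsave(filename = out_file, plot = toy_plot, width = 12, height = 10, units = "cm")

# Confirm that the file was written
file.exists(out_file)
```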

### Some other plots: barplots, boxplots, density plots, ...

There is a wide range of plot types you can make with ggplot, and the right one depends on your data and on what you want to visualize. For example, you can make a bar plot of the number of genes in each gene type (log-transformed here):

```{r barplot}
ggplot(summary_DGE, aes(x = gene_type, y = log(n), fill = gene_type)) +
  geom_bar(stat = "identity") +
  scale_fill_Publication() +
  theme_dark_blue()
```

Note that above we needed only one value per gene type to plot the bar. If we had more values per gene type, we would need to use `stat = "summary"` and specify a summary function (for example `fun = "mean"` or `fun = "median"`). For example, if we want to plot the median log fold change for each gene type:

```{r barplot-median}
ggplot(extended_DGE, aes(x = gene_type, y = log_fc, fill = gene_type)) +
  geom_bar(stat = "summary", fun = "median") +
  scale_fill_Publication() +
  theme_dark_blue() +
  #rotate the x axis labels by 45 degrees
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
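
An equivalent approach, sketched below on made-up values, is to compute the medians first with `summarise()` and then draw them with `geom_col()`, which plots the values as given. The names `toy_expr` and `toy_medians` are hypothetical.

```{r sketch-precomputed-medians}
library(dplyr)
library(ggplot2)

# Hypothetical log fold changes for two gene types
toy_expr <- tibble(
  gene_type = c("ncRNA", "ncRNA", "ncRNA", "protein-coding", "protein-coding"),
  log_fc    = c(-1, 0, 4, 2, 6)
)

# One row per gene type, holding the median log fold change
toy_medians <- toy_expr %>%
  group_by(gene_type) %>%
  summarise(median_lfc = median(log_fc), .groups = "drop")

# geom_col() draws the precomputed values directly
ggplot(toy_medians, aes(x = gene_type, y = median_lfc, fill = gene_type)) +
  geom_col()
```

Precomputing keeps the summary table available for inspection, while `stat = "summary"` is shorter when only the plot is needed.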

### Faceting

Faceting is a powerful way to create multiple plots based on a categorical variable. For example, we can facet the scatter plot by gene type:

```{r facet-ggplot}
ggplot(extended_DGE, aes(x = log_fc, y = -log10(fdr), color = gene_type)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = plot_label)) +
  theme_minimal() +
  geom_hline(yintercept = -log10(0.05)) + #horizontal line at FDR = 0.05
  geom_vline(xintercept = c(1, -1)) + #vertical lines at log fold change 1 and -1
  labs(title = "DGE Scatter Plot", x = "Log Fold Change", y = "-Log10(FDR)") +
  facet_wrap(~ gene_type)
```
  
  
A density plot of log fold change by gene type:

```{r density-ggplot}
ggplot(extended_DGE, aes(x = log_fc, fill = gene_type)) +
  geom_density(alpha = 0.5) +
  scale_fill_Publication() +
  theme_excel_new() +
  labs(title = "Density Plot of Log Fold Change\nby Gene Type", x = "Log Fold Change", y = "Density")
```

## Export cleaned results

```{r export, eval=FALSE}
# Export to CSV
write_csv(extended_DGE, "final_DGE.csv")

# Export to Excel (needs writexl package)
writexl::write_xlsx(extended_DGE, "final_DGE.xlsx")
```
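
A quick way to confirm that an export worked is to read the file back and compare it with the original. The sketch below does this round trip for a toy tibble in a temporary CSV file; `toy_out`, `tmp_out`, and `toy_back` are made-up names.

```{r sketch-export-roundtrip}
library(readr)
library(tibble)

# Hypothetical table to export
toy_out <- tibble(gene_name = c("GENE_A", "GENE_B"),
                  log_fc    = c(2.5, -0.3))

tmp_out <- tempfile(fileext = ".csv")
write_csv(toy_out, tmp_out)

# Read the file back; the values should match what was written
toy_back <- read_csv(tmp_out, show_col_types = FALSE)

identical(toy_out$gene_name, toy_back$gene_name)
all.equal(toy_out$log_fc, toy_back$log_fc)
```

Factor columns such as `gene_type` are written as plain text, so they come back as character and need `factor()` again after import.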

## Quick practice tasks

1. Import one CSV or one Excel file from your own project.
2. Use `clean_names()` and `mutate()` to standardize and clean up the columns.
3. Add some labels of interest, as in the significance example.
4. Make a plot you would like to see, for example a scatter plot or a bar plot, and customize it with themes and labels.
5. Export the final cleaned table.

## Troubleshooting tips

- If import fails due to separators, try `read_delim()` with the correct `delim`.
- If text appears broken, set the encoding at import with the `locale` argument, for example `locale = locale(encoding = "UTF-8")`.
- If column types are wrong, fix with `mutate()` and `as.numeric()`, `as.factor()`, etc.
- If joins add many missing values, verify the key columns (for example `sample_id`) match exactly in both tables.
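
Two of the tips above, sketched on made-up data: declaring extra missing-value strings at import so that a numeric column is not read as text, and using `anti_join()` to list keys that have no match before a join. All table names, column names, and values here are hypothetical.

```{r sketch-troubleshooting}
library(readr)
library(dplyr)

# A literal CSV string; the "n/a" entry turns the fdr column into character
raw_csv <- I("gene,fdr\nGENE_A,0.01\nGENE_B,n/a\n")

as_text   <- read_csv(raw_csv, show_col_types = FALSE)
as_number <- read_csv(raw_csv, na = c("", "NA", "n/a"), show_col_types = FALSE)

class(as_text$fdr)    # character
class(as_number$fdr)  # numeric

# Keys present in one table but not in the other explain NAs after a join
samples  <- tibble(sample_id = c("S1", "S2", "S3"))
metadata <- tibble(sample_id = c("S1", "s2"), group = c("case", "control"))

# Rows of samples without a matching sample_id in metadata
anti_join(samples, metadata, by = "sample_id")
```

Here `S2` fails to match because of the lower-case `s2`, and `S3` is missing from the metadata altogether.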

## Some resources

I will list just a few resources: the internet is a wide ocean, and you might want to focus on a few selected things.

- Software Carpentry's R novice gapminder lesson: https://swcarpentry.github.io/r-novice-gapminder/index.html
- A slightly more in-depth R and tidyr workshop we just had at the coding cafe: https://abc.au.dk/documentation/2026-02-26-tidyR-omics1.html
- The tidyr cheatsheet from Posit: https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf
- The help pane of RStudio, where you can find the documentation of all functions.
- The Tab key for autocompleting functions and arguments in RStudio. It makes life much easier.
