library(clusterProfiler)
library(airway)
library(ggplot2)
library(dplyr)
library(DESeq2)
library(org.Hs.eg.db)
library(enrichplot)
ABC.12: Pixi and Desktop on genomeDK, R-clusterProfiler analysis
Slides
Today’s slides with news about GenomeDK and Pixi, and topic presentation about clusterProfiler and pathway analysis.
Tutorial
In this tutorial, we use clusterProfiler for pathway enrichment analysis in R. Pathway enrichment analysis is a method that identifies if a list of genes is over-represented in a pathway. There two widely popular methods for this, Over Reprentation Analysis(ORA) and Gene Set Enrichment Analysis. Both of these can be done by using clusterProfiler and both methods require a list of genes that are the output of a differential expression analysis. We will follow the examples presented by clusterProfiler here: https://yulab-smu.top/biomedical-knowledge-mining-book/enrichment-overview.html
We cover
- Basic theory: What are the issues with the test for enrichment, and what does the test means
- Analysis: How to use associated R packages to extrapolate pathway names connected to genes
- Result and plots: Viewing and reading the results correctly with beautiful plots
You should be able to run the tutorial using the packages loaded in the R code below. Many of them are Bioconductor packages.
Over Representation Analysis (ORA)
ORA answers the question: “Which pathways are overrepresented in my list of significant genes?”
To answer this, we need a list of background genes (also referred to as the universe). This is important because overrepresentation is always relative, i.e. overrepresented compared to what? Therefore, we essentially create a contingency table to determine the significance of the pathway’s representation in the gene list.e. For example:
<- data.frame(not_interesting_genes=c(2613, 15310), interesting_genes=c(28, 29))
mytable row.names(mytable) <- c("In pathway", "not in pathway")
mytable
not_interesting_genes | interesting_genes | |
---|---|---|
<dbl> | <dbl> | |
In pathway | 2613 | 28 |
not in pathway | 15310 | 29 |
Statistical Test
To determine over-representation, we use Fisher’s Exact Test, which is a statistical test commonly applied in contingency table analysis. The test is represented by the following formula:
\[ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}} \]
Where: - ( M ) is the total number of genes in the pathway, - ( N ) is the total number of genes in the universe, - ( n ) is the number of significant genes (those of interest), - ( k ) is the number of significant genes in the pathway.
fisher.test(mytable, alternative = "greater")
Fisher's Exact Test for Count Data
data: mytable
p-value = 1
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.110242 Inf
sample estimates:
odds ratio
0.1767937
Since we are testing multiple pathways we should correct for multiple testing.
<- data.frame(
gene_data Gene_Name = paste0("Gene", 1:10),
LogFC = c(2.5, -1.2, 0.8, -2.3, 3.1, -0.5, 1.8, -1.0, 2.0, -3.0),
P_value = c(0.001, 0.04, 0.2, 0.005, 0.0001, 0.3, 0.02, 0.06, 0.0005, 0.00001),
Regulation = c("UP", "DOWN", "UP", "DOWN", "UP", "DOWN", "UP", "DOWN", "UP", "DOWN")
)
print(gene_data)
Gene_Name LogFC P_value Regulation
1 Gene1 2.5 1e-03 UP
2 Gene2 -1.2 4e-02 DOWN
3 Gene3 0.8 2e-01 UP
4 Gene4 -2.3 5e-03 DOWN
5 Gene5 3.1 1e-04 UP
6 Gene6 -0.5 3e-01 DOWN
7 Gene7 1.8 2e-02 UP
8 Gene8 -1.0 6e-02 DOWN
9 Gene9 2.0 5e-04 UP
10 Gene10 -3.0 1e-05 DOWN
ORA has some significant drawbacks.
- Mainly it is highly dependent on what we (subjectively) will deem to be interesting genes. Usually that would be genes that are differentially expressed.
- Additionally, it is highly dependent on the background genes. If there are lot of non annotated genes in the pathways then the power reduces substantially and also creates very biased results depending on which pathway databases you use.
- It does not take into account the combination of the p value and the logfc. What if a gene has a fold change of 5 but a p value of 0.051?
Gene Set Enrichment Analysis
These issues are addressed by Gene Set Enrichment analysis. GSEA answers the question “these changes I see between treatment and control are similar to which known list of changes”. The big difference now is that instead of choosing a list of interesting genes we consider all the genes in the output of our degs analysis and we rank them!
The core idea of Gene Set Enrichment Analysis (GSEA) is to assess whether the genes in a predefined gene set (e.g., a pathway) are enriched at the top or bottom of a ranked gene list.
Gene Ranking: Genes are ranked by a statistic, such as the log-fold change or p-value, representing their differential expression.
Enrichment Score (ES): The algorithm traverses the ranked gene list and computes a cumulative running score. When encountering a gene in the pathway, the score is increased proportionally to its rank. For genes not in the pathway, the score is decreased. The enrichment score (ES) is the maximum deviation from zero observed during this traversal.
Significance Testing: The observed ES is compared to a null distribution of ES values generated by permuting the gene labels. The significance is quantified as a p-value, and adjustments for multiple testing are applied.
Normalization: The ES is normalized (NES) to account for variations in gene set size, making results comparable across pathways.
Before we go any further lets quickly get a degs list using the airway package and deseq2
data(airway)
<- assay(airway) # The counts matrix
cts <- as.data.frame(colData(airway)) # Our metadata
metadata $condition <- factor(metadata$dex) # factor the dex col in condition
metadata<- DESeqDataSetFromMatrix(
dds countData = cts,
colData = metadata,
design = ~ condition
)<- DESeq(dds)
dds <- results(dds)
res <- as.data.frame(res)
resdf head(res)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
log2 fold change (MLE): condition untrt vs trt
Wald test p-value: condition untrt vs trt
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 708.602170 0.3788470 0.173141 2.188082 0.0286636
ENSG00000000005 0.000000 NA NA NA NA
ENSG00000000419 520.297901 -0.2037604 0.100599 -2.025478 0.0428183
ENSG00000000457 237.163037 -0.0340428 0.126279 -0.269584 0.7874802
ENSG00000000460 57.932633 0.1171786 0.301237 0.388992 0.6972820
ENSG00000000938 0.318098 1.7245505 3.493633 0.493627 0.6215698
padj
<numeric>
ENSG00000000003 0.139308
ENSG00000000005 NA
ENSG00000000419 0.183359
ENSG00000000457 0.930572
ENSG00000000460 0.895441
ENSG00000000938 NA
Now we can rank the list. There is not on a correct way to rank the gene lists but we would like the ranking to represent both the significance and the fold change. One suggested formula is:
\[ \text{Ranking Score} = \log_{2}(\text{Fold Change}) \times -\log_{10}(\text{P-value}) \]
Additionally we can order them by the stat column of the deseq2 output which is the wald statistic. This creates a list where over expressed and significant genes are at the top, under expressed and significant genes in the bottom and combinations of not significant genes in the middle.
<- res[order(res$stat),] resOrdered
Cluster Profiler examples
clusterProfiler makes it easy to perform both ORA and GSEA analyses. It uses two main functions in the background: * enricher()
(for ORA) * GSEA()
(for GSEA)
These two allow you to submit your own pathways via the TERM2GENE parameter. For example
<- data.frame(
TERM2GENE TERM = c("GO:0005737", "GO:0008150", "KEGG:04110"),
GENE = c("gene1, gene2, gene3", "gene4, gene5, gene6", "gene7, gene8")
)# Example
#enricher(gene_list, TERM2GENE = TERM2GENE)
Curated Pathway Lists
ClusterProfiler also has built-in support for curated lists of pathways, such as KEGG and GO. These are already implemented in the following functions:
enrichKEGG
: Enrichment analysis for KEGG pathways.enrichGO
: Enrichment analysis for Gene Ontology (GO) terms.gseGO()
: GSEA for Gene Ontology (GO) terms.gseKEGG
: GSEA for KEGG pathways.
Now lets see some examples for GO. For ORA we will use the examples from the guide
data(geneList, package="DOSE")
<- names(geneList)[abs(geneList) > 2] # Create our list of significant genes
gene <- enrichGO(gene = gene, # Our genelist
ego universe = names(geneList), # The background universe we talked about
OrgDb = org.Hs.eg.db, # Our database
ont = "CC", # The Gene Ontology
pAdjustMethod = "BH", # Correction for multiple testint
pvalueCutoff = 0.01, # Our cutoffs
qvalueCutoff = 0.05,
readable = TRUE) # SetReadable converts the genes to SYMBOL automatically. Very handy.
head(ego)
ID | Description | GeneRatio | BgRatio | RichFactor | FoldEnrichment | zScore | pvalue | p.adjust | qvalue | geneID | Count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <int> | |
GO:0000775 | GO:0000775 | chromosome, centromeric region | 21/201 | 199/11919 | 0.10552764 | 6.257631 | 9.795483 | 1.849874e-11 | 5.272141e-09 | 4.829145e-09 | CDCA8/CDC20/CENPE/NDC80/TOP2A/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/EZH2/TTK/NCAPG/AURKB/AURKA/CCNB1 | 21 |
GO:0000779 | GO:0000779 | condensed chromosome, centromeric region | 18/201 | 148/11919 | 0.12162162 | 7.211981 | 9.959264 | 5.132324e-11 | 5.985415e-09 | 5.482485e-09 | CDC20/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1 | 18 |
GO:0072686 | GO:0072686 | mitotic spindle | 18/201 | 152/11919 | 0.11842105 | 7.022192 | 9.786254 | 8.053824e-11 | 5.985415e-09 | 5.482485e-09 | KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/TRAT1/AURKB/PRC1/KIFC1/KIF18B/AURKA | 18 |
GO:0005819 | GO:0005819 | spindle | 26/201 | 338/11919 | 0.07692308 | 4.561424 | 8.699252 | 8.806167e-11 | 5.985415e-09 | 5.482485e-09 | CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TRAT1/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A | 26 |
GO:0000793 | GO:0000793 | condensed chromosome | 21/201 | 219/11919 | 0.09589041 | 5.686158 | 9.166827 | 1.142683e-10 | 5.985415e-09 | 5.482485e-09 | CDC20/CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/CHEK1/AURKA/CCNB1 | 21 |
GO:0098687 | GO:0098687 | chromosomal region | 25/201 | 319/11919 | 0.07836991 | 4.647218 | 8.647710 | 1.406548e-10 | 5.985415e-09 | 5.482485e-09 | CDCA8/CDC20/CENPE/NDC80/TOP2A/HJURP/SKA1/NEK2/CENPM/RAD51AP1/CENPN/CDK1/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/EZH2/TTK/NCAPG/AURKB/CHEK1/AURKA/CCNB1/MCM5 | 25 |
Similarly we can to GSEA. Here we set min and max gene set sizes because gene sets that are too small might not have enough data for reliable statistical analysis, while very large gene sets may not provide useful biological insights because they tend to be too broad. Your call.
<- gseGO(geneList = geneList, # Here we have all the genes
ego3 OrgDb = org.Hs.eg.db, # Our database
ont = "CC",
minGSSize = 100, # This is the minimum gene set size
maxGSSize = 500, # Maximum geneset size
pvalueCutoff = 0.05,
verbose = FALSE)
Warning message in fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize, :
“For some of the pathways the P-values were likely overestimated. For such pathways log2err is set to NA.”
Warning message in fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize, :
“For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.”
Gene Ontology (GO)
Gene Ontology (GO) categorizes genes into three domains:
- Biological Process (BP): The biological objectives or processes in which the gene products are involved (e.g., cell cycle, metabolic processes).
- Molecular Function (MF): The molecular activities performed by gene products (e.g., enzyme activity, receptor binding).
- Cellular Component (CC): The locations within the cell where gene products are found (e.g., nucleus, mitochondria).
KEGG Pathways
KEGG (Kyoto Encyclopedia of Genes and Genomes) provides a comprehensive database of metabolic and signaling pathways that help identify biological functions and interactions.
Other Databases
- MsigDB: A Molecular Signature Database that is particularly useful for performing GSEA.
- Reactome: A curated pathway database for functional annotation of genes, focusing on human pathways.
Plots
One of the great things about clusterProfiler is that you can very easily create plots with the enrichplot
library and they are very well integrated with ggplot. Here we will show some of them as there are many. You can see more here: https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html
Converting Gene Lists
When working with gene lists, it’s often necessary to convert between different types of gene identifiers. For instance, you may need to convert between ENTREZID, ENSEMBL, and SYMBOL identifiers, which are commonly used in gene annotation databases. The bitr() function from the clusterProfiler package is a useful tool for this task.
bitr(gene, fromType = "ENTREZID",
toType = c("ENSEMBL", "SYMBOL"),
OrgDb = org.Hs.eg.db)
'select()' returned 1:many mapping between keys and columns
Warning message in bitr(gene, fromType = "ENTREZID", toType = c("ENSEMBL", "SYMBOL"), :
“0.48% of input gene IDs are fail to map...”
ENTREZID | ENSEMBL | SYMBOL | |
---|---|---|---|
<chr> | <chr> | <chr> | |
1 | 4312 | ENSG00000196611 | MMP1 |
2 | 8318 | ENSG00000093009 | CDC45 |
3 | 10874 | ENSG00000109255 | NMU |
4 | 55143 | ENSG00000134690 | CDCA8 |
5 | 55388 | ENSG00000065328 | MCM10 |
6 | 991 | ENSG00000117399 | CDC20 |
7 | 6280 | ENSG00000163220 | S100A9 |
8 | 2305 | ENSG00000111206 | FOXM1 |
9 | 9493 | ENSG00000137807 | KIF23 |
10 | 1062 | ENSG00000138778 | CENPE |
11 | 3868 | ENSG00000186832 | KRT16 |
12 | 4605 | ENSG00000101057 | MYBL2 |
13 | 9833 | ENSG00000165304 | MELK |
14 | 9133 | ENSG00000157456 | CCNB2 |
15 | 6279 | ENSG00000143546 | S100A8 |
16 | 10403 | ENSG00000080986 | NDC80 |
17 | 8685 | ENSG00000019169 | MARCO |
18 | 597 | ENSG00000140379 | BCL2A1 |
19 | 7153 | ENSG00000131747 | TOP2A |
20 | 23397 | ENSG00000121152 | NCAPH |
21 | 6278 | ENSG00000143556 | S100A7 |
22 | 79733 | ENSG00000129173 | E2F8 |
23 | 259266 | ENSG00000066279 | ASPM |
24 | 1381 | ENSG00000166426 | CRABP1 |
25 | 3627 | ENSG00000169245 | CXCL10 |
26 | 27074 | ENSG00000078081 | LAMP3 |
27 | 6241 | ENSG00000171848 | RRM2 |
28 | 55165 | ENSG00000138180 | CEP55 |
29 | 9787 | ENSG00000126787 | DLGAP5 |
30 | 7368 | ENSG00000174607 | UGT8 |
⋮ | ⋮ | ⋮ | ⋮ |
214 | 64499 | ENSG00000197253 | TPSB2 |
215 | 54829 | ENSG00000106819 | ASPN |
216 | 54829 | ENSG00000292224 | ASPN |
217 | 1524 | ENSG00000168329 | CX3CR1 |
218 | 10234 | ENSG00000128606 | LRRC17 |
219 | 1580 | ENSG00000142973 | CYP4B1 |
220 | 10647 | ENSG00000124935 | SCGB1D2 |
221 | 25893 | ENSG00000162722 | TRIM58 |
222 | 24141 | ENSG00000125869 | LAMP5 |
223 | 10351 | ENSG00000141338 | ABCA8 |
224 | 2330 | ENSG00000131781 | FMO5 |
225 | 5304 | ENSG00000159763 | PIP |
226 | 79846 | ENSG00000105792 | CFAP69 |
227 | 8614 | ENSG00000113739 | STC2 |
228 | 2625 | ENSG00000107485 | GATA3 |
229 | 7021 | ENSG00000008196 | TFAP2B |
230 | 7802 | ENSG00000163879 | DNALI1 |
231 | 79689 | ENSG00000127954 | STEAP4 |
232 | 11122 | ENSG00000196090 | PTPRT |
233 | 55351 | ENSG00000152953 | STK32B |
234 | 9 | ENSG00000171428 | NAT1 |
235 | 4239 | ENSG00000166482 | MFAP4 |
236 | 5241 | ENSG00000082175 | PGR |
237 | 10551 | ENSG00000106541 | AGR2 |
238 | 10974 | ENSG00000148671 | ADIRF |
239 | 79838 | ENSG00000103534 | TMC5 |
240 | 79901 | ENSG00000071967 | CYBRD1 |
241 | 57758 | ENSG00000175356 | SCUBE2 |
242 | 4969 | ENSG00000106809 | OGN |
243 | 4969 | ENSG00000291350 | OGN |
Lets see from our examples:
mutate(ego, qscore = -log(p.adjust, base=10)) %>%
barplot(x="qscore") + ggtitle("Bar plot from the ORA above")
<- setReadable(ego, 'org.Hs.eg.db', 'ENTREZID') # Convert to symbol
edox <- cnetplot(edox, foldChange=geneList, circular = TRUE, colorEdge = TRUE)
p p
A very informative plot is a heatplot which you can generate like this:
<- heatplot(edox, showCategory=5) # Show category dictates show many genesets you see
p1 <- heatplot(edox, foldChange=geneList, showCategory=5)
p2 ::plot_grid(p1, p2, ncol=1, labels=LETTERS[1:2]) cowplot
One of the most useful ones I find to be the upset plot. For over-representation analysis, upsetplot will calculate the overlaps among different gene sets and for GSEA result, it will plot the fold change distributions of different categories.
upsetplot(ego3)
options(repr.plot.height=20)
ridgeplot(ego3)
Picking joint bandwidth of 0.223
Lastly we have: Running score and preranked list are traditional methods for visualizing the results of Gene Set Enrichment Analysis (GSEA). These visualizations help to understand how a particular gene set is enriched along the ordered gene list, as well as how the enrichment score changes as you progress through the gene list.
<- gseaplot(ego3, geneSetID = 1, by = "runningScore", title = ego3$Description[1])
p1 <- gseaplot(ego3, geneSetID = 1, by = "preranked", title = ego3$Description[1]) p2
options(repr.plot.height=10)
::plot_grid(p1, p2, ncol=1, labels=LETTERS[1:3]) cowplot
Running Score Plot Interpretation
The running score plot illustrates how the enrichment score (ES) accumulates as you move through the ranked gene list. The key features to observe are:
- Sharp Peaks: A sharp rise indicates that a cluster of genes from the pathway is highly ranked, contributing strongly to enrichment.
- Peak Position: The x-axis location of the peak corresponds to where the pathway’s genes are most enriched within the ranked list.
- Running Score Return: If the score drops back down, it suggests that the pathway genes are not uniformly distributed across the gene list.