ABC.12: Pixi and Desktop on genomeDK, R-clusterProfiler analysis

GenomeDK
pixi
conda
package-environment
R
clusterProfiler
pathway-analysis
Pathway enrichment analysis with R package clusterProfiler
Author

Dimitrios Maniatis

Published

January 30, 2025

Slides

Today’s slides with news about GenomeDK and Pixi, and topic presentation about clusterProfiler and pathway analysis. 

 

Download Slides

 

Tutorial

In this tutorial, we use clusterProfiler for pathway enrichment analysis in R. Pathway enrichment analysis is a method that identifies if a list of genes is over-represented in a pathway. There two widely popular methods for this, Over Reprentation Analysis(ORA) and Gene Set Enrichment Analysis. Both of these can be done by using clusterProfiler and both methods require a list of genes that are the output of a differential expression analysis. We will follow the examples presented by clusterProfiler here: https://yulab-smu.top/biomedical-knowledge-mining-book/enrichment-overview.html

We cover

  • Basic theory: What are the issues with the test for enrichment, and what does the test means
  • Analysis: How to use associated R packages to extrapolate pathway names connected to genes
  • Result and plots: Viewing and reading the results correctly with beautiful plots
Warning

You should be able to run the tutorial using the packages loaded in the R code below. Many of them are Bioconductor packages.

library(clusterProfiler)
library(airway)
library(ggplot2)
library(dplyr)
library(DESeq2)
library(org.Hs.eg.db)
library(enrichplot)

Over Representation Analysis (ORA)

ORA answers the question: “Which pathways are overrepresented in my list of significant genes?”

To answer this, we need a list of background genes (also referred to as the universe). This is important because overrepresentation is always relative, i.e. overrepresented compared to what? Therefore, we essentially create a contingency table to determine the significance of the pathway’s representation in the gene list.e. For example:

mytable <- data.frame(not_interesting_genes=c(2613, 15310), interesting_genes=c(28, 29))
row.names(mytable) <- c("In pathway", "not in pathway")
mytable
A data.frame: 2 × 2
not_interesting_genes interesting_genes
<dbl> <dbl>
In pathway 2613 28
not in pathway 15310 29

Statistical Test

To determine over-representation, we use Fisher’s Exact Test, which is a statistical test commonly applied in contingency table analysis. The test is represented by the following formula:

\[ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}} \]

Where: - ( M ) is the total number of genes in the pathway, - ( N ) is the total number of genes in the universe, - ( n ) is the number of significant genes (those of interest), - ( k ) is the number of significant genes in the pathway.

fisher.test(mytable, alternative = "greater")

    Fisher's Exact Test for Count Data

data:  mytable
p-value = 1
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.110242      Inf
sample estimates:
odds ratio 
 0.1767937 

Since we are testing multiple pathways we should correct for multiple testing.

gene_data <- data.frame(
  Gene_Name = paste0("Gene", 1:10),
  LogFC = c(2.5, -1.2, 0.8, -2.3, 3.1, -0.5, 1.8, -1.0, 2.0, -3.0),
  P_value = c(0.001, 0.04, 0.2, 0.005, 0.0001, 0.3, 0.02, 0.06, 0.0005, 0.00001),
  Regulation = c("UP", "DOWN", "UP", "DOWN", "UP", "DOWN", "UP", "DOWN", "UP", "DOWN")
)

print(gene_data)
   Gene_Name LogFC P_value Regulation
1      Gene1   2.5   1e-03         UP
2      Gene2  -1.2   4e-02       DOWN
3      Gene3   0.8   2e-01         UP
4      Gene4  -2.3   5e-03       DOWN
5      Gene5   3.1   1e-04         UP
6      Gene6  -0.5   3e-01       DOWN
7      Gene7   1.8   2e-02         UP
8      Gene8  -1.0   6e-02       DOWN
9      Gene9   2.0   5e-04         UP
10    Gene10  -3.0   1e-05       DOWN

ORA has some significant drawbacks.

  • Mainly it is highly dependent on what we (subjectively) will deem to be interesting genes. Usually that would be genes that are differentially expressed.
  • Additionally, it is highly dependent on the background genes. If there are lot of non annotated genes in the pathways then the power reduces substantially and also creates very biased results depending on which pathway databases you use.
  • It does not take into account the combination of the p value and the logfc. What if a gene has a fold change of 5 but a p value of 0.051?

Gene Set Enrichment Analysis

These issues are addressed by Gene Set Enrichment analysis. GSEA answers the question “these changes I see between treatment and control are similar to which known list of changes”. The big difference now is that instead of choosing a list of interesting genes we consider all the genes in the output of our degs analysis and we rank them!

The core idea of Gene Set Enrichment Analysis (GSEA) is to assess whether the genes in a predefined gene set (e.g., a pathway) are enriched at the top or bottom of a ranked gene list.

  1. Gene Ranking: Genes are ranked by a statistic, such as the log-fold change or p-value, representing their differential expression.

  2. Enrichment Score (ES): The algorithm traverses the ranked gene list and computes a cumulative running score. When encountering a gene in the pathway, the score is increased proportionally to its rank. For genes not in the pathway, the score is decreased. The enrichment score (ES) is the maximum deviation from zero observed during this traversal.

  3. Significance Testing: The observed ES is compared to a null distribution of ES values generated by permuting the gene labels. The significance is quantified as a p-value, and adjustments for multiple testing are applied.

  4. Normalization: The ES is normalized (NES) to account for variations in gene set size, making results comparable across pathways.

Before we go any further lets quickly get a degs list using the airway package and deseq2

data(airway)
cts  <- assay(airway) # The counts matrix
metadata <- as.data.frame(colData(airway)) # Our metadata
metadata$condition <- factor(metadata$dex) # factor the dex col in condition
dds <- DESeqDataSetFromMatrix(
  countData = cts,
  colData = metadata,
  design = ~ condition
)
dds <- DESeq(dds)
res <- results(dds)
resdf <- as.data.frame(res)
head(res)
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing
log2 fold change (MLE): condition untrt vs trt 
Wald test p-value: condition untrt vs trt 
DataFrame with 6 rows and 6 columns
                  baseMean log2FoldChange     lfcSE      stat    pvalue
                 <numeric>      <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 708.602170      0.3788470  0.173141  2.188082 0.0286636
ENSG00000000005   0.000000             NA        NA        NA        NA
ENSG00000000419 520.297901     -0.2037604  0.100599 -2.025478 0.0428183
ENSG00000000457 237.163037     -0.0340428  0.126279 -0.269584 0.7874802
ENSG00000000460  57.932633      0.1171786  0.301237  0.388992 0.6972820
ENSG00000000938   0.318098      1.7245505  3.493633  0.493627 0.6215698
                     padj
                <numeric>
ENSG00000000003  0.139308
ENSG00000000005        NA
ENSG00000000419  0.183359
ENSG00000000457  0.930572
ENSG00000000460  0.895441
ENSG00000000938        NA

Now we can rank the list. There is not on a correct way to rank the gene lists but we would like the ranking to represent both the significance and the fold change. One suggested formula is:

\[ \text{Ranking Score} = \log_{2}(\text{Fold Change}) \times -\log_{10}(\text{P-value}) \]

Additionally we can order them by the stat column of the deseq2 output which is the wald statistic. This creates a list where over expressed and significant genes are at the top, under expressed and significant genes in the bottom and combinations of not significant genes in the middle.

resOrdered <- res[order(res$stat),] 

Cluster Profiler examples

clusterProfiler makes it easy to perform both ORA and GSEA analyses. It uses two main functions in the background: * enricher() (for ORA) * GSEA() (for GSEA)

These two allow you to submit your own pathways via the TERM2GENE parameter. For example

TERM2GENE <- data.frame(
  TERM = c("GO:0005737", "GO:0008150", "KEGG:04110"),
  GENE = c("gene1, gene2, gene3", "gene4, gene5, gene6", "gene7, gene8")
)
# Example
#enricher(gene_list, TERM2GENE = TERM2GENE)

Curated Pathway Lists

ClusterProfiler also has built-in support for curated lists of pathways, such as KEGG and GO. These are already implemented in the following functions:

  1. enrichKEGG: Enrichment analysis for KEGG pathways.
  2. enrichGO: Enrichment analysis for Gene Ontology (GO) terms.
  3. gseGO(): GSEA for Gene Ontology (GO) terms.
  4. gseKEGG: GSEA for KEGG pathways.

Now lets see some examples for GO. For ORA we will use the examples from the guide

data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2] # Create our list of significant genes
ego <- enrichGO(gene          = gene, # Our genelist
                universe      = names(geneList), # The background universe we talked about
                OrgDb         = org.Hs.eg.db, # Our database 
                ont           = "CC", # The Gene Ontology
                pAdjustMethod = "BH", # Correction for multiple testint
                pvalueCutoff  = 0.01, # Our cutoffs
                qvalueCutoff  = 0.05,
        readable      = TRUE) # SetReadable converts the genes to SYMBOL automatically. Very handy.
head(ego)
A data.frame: 6 × 12
ID Description GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust qvalue geneID Count
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int>
GO:0000775 GO:0000775 chromosome, centromeric region 21/201 199/11919 0.10552764 6.257631 9.795483 1.849874e-11 5.272141e-09 4.829145e-09 CDCA8/CDC20/CENPE/NDC80/TOP2A/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/EZH2/TTK/NCAPG/AURKB/AURKA/CCNB1 21
GO:0000779 GO:0000779 condensed chromosome, centromeric region 18/201 148/11919 0.12162162 7.211981 9.959264 5.132324e-11 5.985415e-09 5.482485e-09 CDC20/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1 18
GO:0072686 GO:0072686 mitotic spindle 18/201 152/11919 0.11842105 7.022192 9.786254 8.053824e-11 5.985415e-09 5.482485e-09 KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/TRAT1/AURKB/PRC1/KIFC1/KIF18B/AURKA 18
GO:0005819 GO:0005819 spindle 26/201 338/11919 0.07692308 4.561424 8.699252 8.806167e-11 5.985415e-09 5.482485e-09 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TRAT1/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A 26
GO:0000793 GO:0000793 condensed chromosome 21/201 219/11919 0.09589041 5.686158 9.166827 1.142683e-10 5.985415e-09 5.482485e-09 CDC20/CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/CHEK1/AURKA/CCNB1 21
GO:0098687 GO:0098687 chromosomal region 25/201 319/11919 0.07836991 4.647218 8.647710 1.406548e-10 5.985415e-09 5.482485e-09 CDCA8/CDC20/CENPE/NDC80/TOP2A/HJURP/SKA1/NEK2/CENPM/RAD51AP1/CENPN/CDK1/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/EZH2/TTK/NCAPG/AURKB/CHEK1/AURKA/CCNB1/MCM5 25

Similarly we can to GSEA. Here we set min and max gene set sizes because gene sets that are too small might not have enough data for reliable statistical analysis, while very large gene sets may not provide useful biological insights because they tend to be too broad. Your call.

ego3 <- gseGO(geneList     = geneList, # Here we have all the genes
              OrgDb        = org.Hs.eg.db, # Our database
              ont          = "CC",
              minGSSize    = 100, # This is the minimum gene set size
              maxGSSize    = 500, # Maximum geneset size
              pvalueCutoff = 0.05,
              verbose      = FALSE)
Warning message in fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize, :
“For some of the pathways the P-values were likely overestimated. For such pathways log2err is set to NA.”
Warning message in fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize, :
“For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.”

Gene Ontology (GO)

Gene Ontology (GO) categorizes genes into three domains:

  • Biological Process (BP): The biological objectives or processes in which the gene products are involved (e.g., cell cycle, metabolic processes).
  • Molecular Function (MF): The molecular activities performed by gene products (e.g., enzyme activity, receptor binding).
  • Cellular Component (CC): The locations within the cell where gene products are found (e.g., nucleus, mitochondria).

KEGG Pathways

KEGG (Kyoto Encyclopedia of Genes and Genomes) provides a comprehensive database of metabolic and signaling pathways that help identify biological functions and interactions.

Other Databases

  • MsigDB: A Molecular Signature Database that is particularly useful for performing GSEA.
  • Reactome: A curated pathway database for functional annotation of genes, focusing on human pathways.

Plots

One of the great things about clusterProfiler is that you can very easily create plots with the enrichplot library and they are very well integrated with ggplot. Here we will show some of them as there are many. You can see more here: https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html

Converting Gene Lists

When working with gene lists, it’s often necessary to convert between different types of gene identifiers. For instance, you may need to convert between ENTREZID, ENSEMBL, and SYMBOL identifiers, which are commonly used in gene annotation databases. The bitr() function from the clusterProfiler package is a useful tool for this task.

bitr(gene, fromType = "ENTREZID",
        toType = c("ENSEMBL", "SYMBOL"),
        OrgDb = org.Hs.eg.db)
'select()' returned 1:many mapping between keys and columns

Warning message in bitr(gene, fromType = "ENTREZID", toType = c("ENSEMBL", "SYMBOL"), :
“0.48% of input gene IDs are fail to map...”
A data.frame: 242 × 3
ENTREZID ENSEMBL SYMBOL
<chr> <chr> <chr>
1 4312 ENSG00000196611 MMP1
2 8318 ENSG00000093009 CDC45
3 10874 ENSG00000109255 NMU
4 55143 ENSG00000134690 CDCA8
5 55388 ENSG00000065328 MCM10
6 991 ENSG00000117399 CDC20
7 6280 ENSG00000163220 S100A9
8 2305 ENSG00000111206 FOXM1
9 9493 ENSG00000137807 KIF23
10 1062 ENSG00000138778 CENPE
11 3868 ENSG00000186832 KRT16
12 4605 ENSG00000101057 MYBL2
13 9833 ENSG00000165304 MELK
14 9133 ENSG00000157456 CCNB2
15 6279 ENSG00000143546 S100A8
16 10403 ENSG00000080986 NDC80
17 8685 ENSG00000019169 MARCO
18 597 ENSG00000140379 BCL2A1
19 7153 ENSG00000131747 TOP2A
20 23397 ENSG00000121152 NCAPH
21 6278 ENSG00000143556 S100A7
22 79733 ENSG00000129173 E2F8
23 259266 ENSG00000066279 ASPM
24 1381 ENSG00000166426 CRABP1
25 3627 ENSG00000169245 CXCL10
26 27074 ENSG00000078081 LAMP3
27 6241 ENSG00000171848 RRM2
28 55165 ENSG00000138180 CEP55
29 9787 ENSG00000126787 DLGAP5
30 7368 ENSG00000174607 UGT8
214 64499 ENSG00000197253 TPSB2
215 54829 ENSG00000106819 ASPN
216 54829 ENSG00000292224 ASPN
217 1524 ENSG00000168329 CX3CR1
218 10234 ENSG00000128606 LRRC17
219 1580 ENSG00000142973 CYP4B1
220 10647 ENSG00000124935 SCGB1D2
221 25893 ENSG00000162722 TRIM58
222 24141 ENSG00000125869 LAMP5
223 10351 ENSG00000141338 ABCA8
224 2330 ENSG00000131781 FMO5
225 5304 ENSG00000159763 PIP
226 79846 ENSG00000105792 CFAP69
227 8614 ENSG00000113739 STC2
228 2625 ENSG00000107485 GATA3
229 7021 ENSG00000008196 TFAP2B
230 7802 ENSG00000163879 DNALI1
231 79689 ENSG00000127954 STEAP4
232 11122 ENSG00000196090 PTPRT
233 55351 ENSG00000152953 STK32B
234 9 ENSG00000171428 NAT1
235 4239 ENSG00000166482 MFAP4
236 5241 ENSG00000082175 PGR
237 10551 ENSG00000106541 AGR2
238 10974 ENSG00000148671 ADIRF
239 79838 ENSG00000103534 TMC5
240 79901 ENSG00000071967 CYBRD1
241 57758 ENSG00000175356 SCUBE2
242 4969 ENSG00000106809 OGN
243 4969 ENSG00000291350 OGN

Lets see from our examples:

mutate(ego, qscore = -log(p.adjust, base=10)) %>% 
    barplot(x="qscore") + ggtitle("Bar plot from the ORA  above")

edox <- setReadable(ego, 'org.Hs.eg.db', 'ENTREZID') # Convert to symbol
p <- cnetplot(edox, foldChange=geneList, circular = TRUE, colorEdge = TRUE)
p

A very informative plot is a heatplot which you can generate like this:

p1 <- heatplot(edox, showCategory=5) # Show category dictates show many genesets you see
p2 <- heatplot(edox, foldChange=geneList, showCategory=5)
cowplot::plot_grid(p1, p2, ncol=1, labels=LETTERS[1:2])

One of the most useful ones I find to be the upset plot. For over-representation analysis, upsetplot will calculate the overlaps among different gene sets and for GSEA result, it will plot the fold change distributions of different categories.

upsetplot(ego3) 

options(repr.plot.height=20)
ridgeplot(ego3)
Picking joint bandwidth of 0.223

Lastly we have: Running score and preranked list are traditional methods for visualizing the results of Gene Set Enrichment Analysis (GSEA). These visualizations help to understand how a particular gene set is enriched along the ordered gene list, as well as how the enrichment score changes as you progress through the gene list.

p1 <- gseaplot(ego3, geneSetID = 1, by = "runningScore", title = ego3$Description[1])
p2 <- gseaplot(ego3, geneSetID = 1, by = "preranked", title = ego3$Description[1])
options(repr.plot.height=10)
cowplot::plot_grid(p1, p2, ncol=1, labels=LETTERS[1:3])

Running Score Plot Interpretation

The running score plot illustrates how the enrichment score (ES) accumulates as you move through the ranked gene list. The key features to observe are:

  • Sharp Peaks: A sharp rise indicates that a cluster of genes from the pathway is highly ranked, contributing strongly to enrichment.
  • Peak Position: The x-axis location of the peak corresponds to where the pathway’s genes are most enriched within the ranked list.
  • Running Score Return: If the score drops back down, it suggests that the pathway genes are not uniformly distributed across the gene list.