1. Introduction

This module covers pathway analysis and covers the basic differences between pathway analysis and GO. It goes over several of the different pathway databases and resources that are most commonly used. First, we will set the working directory.

knitr::opts_knit$set(root.dir = '/Users/aaron/Dropbox (Personal)/Aaron_denDekker/BFX/Miller_RNAseq/')

2. Background

Pathway analysis is very similar to GO analysis but differs in that contains topology between genes. Thus, the role of a particular gene product is illustrated in the pathway. For example, MDM2 is an inhibitor of p53. p53 can also activate MDM2. These interactions are depicted in the p53 signaling KEGG pathway:

Thus, knowing what pathways are impacted in your experiment can provide even more information than GO or GSEA analysis.

3. Pathway Databases

There are a number of databases devoted to molecular pathways. The ones mentioned here only apply to pathways involving genes and/or gene products. There are databases that incorporate other molecules such drugs that aren’t mentioned. Each of these have their own unique applications. And using them for data analysis is also unique to each. If necessary these may be touched on in a separate module.

3.1 KEGG

There are several databases that can be used to perform pathway analyses. The most common one is the Kyoto Enyclopedia of Genes and Genomes database, commonly referred to as KEGG. This database has been around since 1995 and was developed by Minoru Kanehisa at Kyoto University. It was originally free but now requires a license which is pretty expensive and also requires you to have a pretty good background in computer programming to use directly. This is the most common database used in online “push button” GUI tools you’ll find so it’s accessible even if you don’t have a license for direct access. However, there are several issues with this:

1. Often not current. Even though online tools utilize the KEGG database for performing pathway analysis, the databases are often outdated. The KEGG database is updated regularly and it’s difficult for many free online tools to update their database to reflect updates in the KEGG database. Also, the providers of free tools often do not maintain regular licenses. Thus, the KEGG database that is available may be several years old. For example, the online tool DAVID (Database for Annotation, Visualization and Integrated Discovery) which is maintained by the NIH, did not update their KEGG database at all between 2016 and 2021. Lacking current knowledgebase updates is actually a problem across all freeware tools.

2. KEGG itself isn’t expanding. Although there are are regulat updates to the KEGG database, there aren’t really any new additions to the database. The vast majority of the updates are done to reflect changes in ID status of particular genes and proteins. Although, the base KEGG pathways are informative, there is a lack of novel pathways in KEGG.

3.2 Reactome

Another pathway database that is highly used is the Reactome database. Reactome is an open access, peer-reviewed, and manually curated pathway database that attempts to integrate ALL molecular pathways. The Reactome database is well-maintained. If you are a developer, the database is free to access (at least for academic purposes) via an API. So, you can build your own tools to use it if you wish. However, there are a number of tools that can be used to analyze data using the Reactome database.

3.3 WikiPathways

WikiPathways is an open access database that is contributed to by the scientific community. Although it isn’t technically peer-reviewed in the classic sense, pathways are reviewed by contributors and users. One of the major benfits of WikiPathways is that as more pathways are elucidated and published, they are being added to the WikiPathways database. This allows for identification of novel mechanisms in your data. Many tools now access and analyze data using the WikiPathways database and there are a number of Cytoscape plugins to visualize and curate the results.

3.4 Others

When you are looking at pathways, you want to be able to visualize results using the pathway “map”. There a number of databases in addition to the ones I’ve mentioned above, e.g. Biocarta. However, many of these are not easily accessed. Additionally, these databases aren’t as well maintained or contributed to. In the analyses below, I am more concerned with the output of the databases I’ve mentioned above since these are the most comprehensive and well maintained.

4. Performing pathway analysis

4.1 Statistical tests

Pathway analysis is typically performed using over-enrichment analysis (ORA) similar to GO analysis. Thus, although there is topology on the pathway showing activity relationships between genes in the pathway, this information is not included when performing the analysis. There is a test that does take these relationships into account called impact analysis that was developed by Adi Tarca at Wayne State University. There is a Bioconductor package that uses this algorithm called SPIA (Signaling Pathway Impact Analysis) that was built and is maintained by Adi Tarca’s group. However, there are a couple of caveats to using SPIA:

1. It only applies to KEGG: SPIA was designed for use with KEGG pathways only. So, it can’t be leveraged against Reactome or WikiPathways (at least not right now).

2. It does not contain an up to date version of the KEGG database: The SPIA package implements an old version of the KEGG pathway database that was available when the software was originally written in 2012. Although the SPIA Bioconductor package is regularly maintained to work with updates in R, the KEGG database is not updated. However, it does allow you to load in your own KEGG database if you have access to it. But, as I mentioned above, to do this you have to have a license which is upwards of $5000 annually.

Despite these limitations, the SPIA package can be still be very useful. We will come back to using SPIA later.

4.2 Running a pathway analysis using ORA

There is a huge amount of tools that can be used to run an enrichment analysis for pathway analysis and fall into 2 main categories, R/Bioconductor packages or online tools. R/Biocnductor packages, such as clusterProfiler, are going to be more up to date since the majority of them allow, or require, that you upload the database you want to analyze your data against. I have already mentioned above the limitations imposed on the different databases and although, you have the ability to upload a up-to-date version of a database, that doesn’t mean you actually have an up-to-date version of a database, e.g. KEGG. Additionally, R/Bioconductor packages are more cumbersome because they require a good working knowledge of the particular package and that involves a learning curve. Thus, I am actully going to focus here on using online tools for pathway analysis since they are easy to use. The tool that we will be using is EnrichR because I think they do a better job at keeping the databases updated.

4.3 Using EnrichR

EnrichR is an online tool maintained by the Ma’ayan lab at the Icahn School of Medicine at Mount Sinai in NYC. To access EnrichR, click here. If you’ve gone through the GO module, I’ve already gone over EnrichR. But here, we’re going to go through the output focusing on pathway databases.

1. Once you open to EnrichR homepage, you will be presented with the ability to input just your set of interesting genes. To do this, simply copy and paste your genes from your DE gene file.

2. As always, we want to make sure that we also input our reference gene set. Above the box to input your genes, there is a paragraph describing the operation. Within that, there is the option to adding your reference set: “Also, you can now try adding a background.” When you click this, EnrichR will populate the backgroun box with its default reference gene list. You do not want to use this. Select all the genes in the background box and delete them. Then, copy and paste the genes from your reference gene list into the box. Now hit Submit. It may take a few moments to run.

3. Once the results are generated, select the Pathways tab at the top. The screen will show thumbnails of each database tested and each will be populated the top 5 hits. There are many, many more databases run by EnrichR than I mentioned above, but many of these I find less useful and less accessible.

4. Click on a thumbnail of a database that you are interested in and it will repopulate with the top hits of that database in a bar chart corresponding to the p-value of the hit (default). If you click on the chart, it will change to a different bar chart sorted by a different metric (combined score ranking or rank based ranking.) If you want to get more information about the results, select Table to generate a table of the hits. These will be sorted by the same metrics, but include the p-value, the adjusted p-value after multiple comparisons, the odds ratio, and the combined score. If you want to know more about how these are calculated, check out this paper or you can check the FAQs page.

If you want to plot the output of the data, you will need to download the table using the Export entries to table button at the bottom of the table view. Unfortunately, EnrichR does not include links to external resources, i.e, you can’t click on a hit to jump to a page for the particular pathway tested. To do this, you will need to go out of EnrichR and navigate to the website for the particular database, e.g. Reactome, and input the pathway name or ID. This will allow you to view the structure of the pathway and the genes in it. Although, it doesn’t overlay the DE genes from your dataset on it. There are modules built for Cytoscape that allow you to do this but these are beyond the scope of this module.

4.4 Using SPIA

4.4.1 Impact analysis

As mentioned above, SPIA is a R/Bioconductor package to perform pathway analysis using the impact analysis algorithm. In essence, impact analysis calculates the total accumulate perturbation in a pathway based on the differential expression of your genes. Specifically, a perturbation factor for each gene is calculated based on its expression and the expression of genes upstream from it that act on it:

g represents the gene for which the perturbation factor is calculated. u represents the upstream genes that may act on it. Impact analysis calculates a factor based on the the expression of g and the partial effect from u where the impact acting on g is divided by the number of downstream targets of u.
g represents the gene for which the perturbation factor is calculated. u represents the upstream genes that may act on it. Impact analysis calculates a factor based on the the expression of g and the partial effect from u where the impact acting on g is divided by the number of downstream targets of u.


Then, a p-value is calculated for each pathway using the perturbation of factors for all of genes in the pathway and also a p-value based on standard over-enrichment.

Finally, a combined p-value is calculated from both the ORA p-value (pORA or pNDE) and the perturbation p-value (pPERT).

If this seems complex, don’t worry about it. It’s basically just a way to include the topology of the genes in pathway when performing the analysis. If you want to know more about you can read the original articles that describe impact analysis here and here.

4.4.2 Running SPIA

The steps to run SPIA are as follows.

First, read in the gene matrix we generated. This is the same as what is done for GO analysis.

gene_matrix<-read.table("de_gene_matrix.txt", header=TRUE)
gene_matrix<-na.omit(gene_matrix) ## remove any rows with missing values

head(gene_matrix,10)
##      baseMean log2FoldChange      lfcSE       stat       pvalue         padj
## 1   4039.8363      0.1777647 0.10638128  1.6710151 9.471870e-02 1.715284e-01
## 2   1484.0676      0.3366905 0.07079863  4.7556085 1.978496e-06 1.428035e-05
## 3    802.4782     -0.2426161 0.09796361 -2.4765942 1.326426e-02 3.373410e-02
## 4    188.7049      0.3108578 0.17280967  1.7988450 7.204320e-02 1.375432e-01
## 5  16056.5082     -0.3389572 0.19785169 -1.7131885 8.667787e-02 1.596503e-01
## 6  12011.2178      0.5571671 0.14169897  3.9320475 8.422542e-05 4.158889e-04
## 7   2538.1323     -0.1405250 0.14519350 -0.9678461 3.331212e-01 4.556764e-01
## 8   1947.5685     -0.5692705 0.10832086 -5.2554100 1.476951e-07 1.338705e-06
## 9    910.3267     -0.1299670 0.07601476 -1.7097603 8.731021e-02 1.606176e-01
## 10  4607.9456      0.4016412 0.10992504  3.6537736 2.584142e-04 1.120626e-03
##           ensembl_var         ensembl symbol entrez
## 1  ENSG00000000003.15 ENSG00000000003 TSPAN6   7105
## 2  ENSG00000000419.14 ENSG00000000419   DPM1   8813
## 3  ENSG00000000457.14 ENSG00000000457  SCYL3  57147
## 4  ENSG00000000460.17 ENSG00000000460  FIRRM  55732
## 5  ENSG00000000971.17 ENSG00000000971    CFH   3075
## 6  ENSG00000001036.14 ENSG00000001036  FUCA2   2519
## 7  ENSG00000001084.13 ENSG00000001084   GCLC   2729
## 8  ENSG00000001167.15 ENSG00000001167   NFYA   4800
## 9  ENSG00000001460.18 ENSG00000001460  STPG1  90529
## 10 ENSG00000001461.17 ENSG00000001461 NIPAL3  57185

Now, we subset the matrix based on the differential expression thresholds.

de_matrix<-gene_matrix[gene_matrix$padj<=0.05,]
de_matrix<-de_matrix[de_matrix$log2FoldChange >=1 | de_matrix$log2FoldChange <= -1,]

Now, we want just the LFC values and the Entrez IDs. SPIA will only work with the Entrez ID numbers! This is why included this in the matrix when we performed the DE analysis.

de_list<-de_matrix$log2FoldChange
names(de_list)<-as.vector(de_matrix$entrez)

Make the list of background genes.

bkgd_list<-as.character(gene_matrix$entrez)

Find and remove any duplicates from the DE gene list.

dups<-unique(names(de_list[which(duplicated(names(de_list)))]))
de_list<-de_list[!(names(de_list) %in% dups)] # remove duplicates from sigGenes

Now, load and run SPIA. If you haven’t downloaded SPIA already, you will need to use the Bioconductor installer to install it. The code here checks if SPIA is installed and, if not, uses the Bioconductor installer to install it.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
## Bioconductor version '3.18' is out-of-date; the current release version '3.19'
##   is available with R version '4.4'; see https://bioconductor.org/install
library(BiocManager)

packages <- c("SPIA")
installed<- packages %in% rownames(installed.packages())

if (any(installed == FALSE)) {
  BiocManager::install(!installed)
}

library(SPIA)
## Loading required package: KEGGgraph
## 
## Attaching package: 'KEGGgraph'
## The following object is masked from 'package:graphics':
## 
##     plot
## The following object is masked from 'package:base':
## 
##     plot

Now we have to set up the spia data object that will be used to run SPIA.

res=spia(de=de_list, all=bkgd_list, organism="hsa", nB=2000, plots=TRUE, beta=NULL, combine="fisher", verbose=FALSE, data.dir = system.file("extdata/", package="SPIA"))

head(res,10)
##                                       Name    ID pSize NDE         pNDE
## 1   Cytokine-cytokine receptor interaction 04060   131  49 1.267761e-11
## 2  Neuroactive ligand-receptor interaction 04080   134  43 4.722207e-08
## 3                                  Malaria 05144    31  17 9.038553e-08
## 4                               Amoebiasis 05146    78  29 2.181289e-07
## 5                 ECM-receptor interaction 04512    76  29 1.132017e-07
## 6                    Glutamatergic synapse 04724   103  31 1.473739e-05
## 7                                Apoptosis 04210    75  13 2.322582e-01
## 8      Complement and coagulation cascades 04610    47  16 3.665645e-04
## 9                   Dilated cardiomyopathy 05414    73  25 8.175237e-06
## 10         Staphylococcus aureus infection 05150    32  13 1.742705e-04
##            tA    pPERT           pG        pGFdr       pGFWER    Status
## 1   14.898087 0.108000 3.877086e-11 5.195295e-09 5.195295e-09 Activated
## 2    3.297866 0.028000 2.835358e-08 1.899690e-06 3.799380e-06 Activated
## 3    4.623166 0.116000 2.031241e-07 9.072878e-06 2.721863e-05 Activated
## 4    5.500484 0.398000 1.498387e-06 4.663993e-05 2.007839e-04 Activated
## 5    1.116699 0.899000 1.740296e-06 4.663993e-05 2.331997e-04 Activated
## 6   14.432307 0.058000 1.279797e-05 2.858212e-04 1.714927e-03 Activated
## 7   53.230830 0.000005 1.703147e-05 3.260309e-04 2.282217e-03 Activated
## 8  -67.400635 0.004000 2.116221e-05 3.544669e-04 2.835736e-03 Inhibited
## 9    0.000000 1.000000 1.039432e-04 1.422595e-03 1.392839e-02 Inhibited
## 10 -27.171346 0.048000 1.061638e-04 1.422595e-03 1.422595e-02 Inhibited
##                                                                                                                                                                                                                                                                                                               KEGGLINK
## 1  http://www.genome.jp/dbget-bin/show_pathway?hsa04060+53833+3977+3556+3554+55540+3606+3624+7043+7042+55504+60401+355+10673+8995+970+53832+3815+4254+1435+5618+3575+50615+3601+3566+3563+1438+2056+85480+3600+3597+9180+3590+3570+3592+56477+10344+6354+6347+6370+6376+9547+6387+51330+8718+4982+8743+8797+2919+57007
## 2                                        http://www.genome.jp/dbget-bin/show_pathway?hsa04080+5340+6865+5618+116443+2890+2891+2892+2894+2897+2899+2900+5026+2554+2563+165829+2911+2918+7433+7434+6344+2642+2696+10800+57105+1902+9170+2798+2859+5028+135+5734+2837+64106+5021+6751+4161+2357+1813+1815+147+150+151+152
## 3                                                                                                                                                                             http://www.genome.jp/dbget-bin/show_pathway?hsa05144+6382+2532+2995+948+7097+7099+4615+6347+3592+3606+3689+7057+7058+7060+7042+7043+6403
## 4                                                                                                                 http://www.genome.jp/dbget-bin/show_pathway?hsa05146+4583+22798+284217+3910+3914+2919+3554+7097+7099+3592+929+8503+3689+9630+5578+107+5330+5332+7042+7043+1281+1282+1284+1286+1289+1290+1302+88+5272
## 5                                                                                                             http://www.genome.jp/dbget-bin/show_pathway?hsa04512+22798+284217+3910+3914+1281+1282+1284+1286+1289+1290+1302+7448+7057+7058+7060+3690+3688+948+9899+6382+3693+22801+8515+8516+3675+3672+1101+5649+3673
## 6                                                                                                  http://www.genome.jp/dbget-bin/show_pathway?hsa04724+2918+2897+2899+2900+2890+2891+2892+116443+2911+107+108+111+57030+5330+5332+3709+5578+2784+2786+2791+2792+51764+59345+27165+2744+6512+3760+5321+8399+9229+22941
## 7                                                                                                                                                                                                  http://www.genome.jp/dbget-bin/show_pathway?hsa04210+598+330+79444+8503+3656+4615+3563+3554+3556+8797+355+4803+8743
## 8                                                                                                                                                                                     http://www.genome.jp/dbget-bin/show_pathway?hsa04610+2153+2159+2155+7056+2266+5627+7035+2157+5340+5327+5054+718+725+710+717+5648
## 9                                                                                                                                      http://www.genome.jp/dbget-bin/show_pathway?hsa05414+6444+27092+55799+781+93589+6546+22801+3672+3673+3675+8515+8516+3688+3690+3693+6445+7139+7168+7169+70+107+108+111+7042+7043
## 10                                                                                                                                                                                                 http://www.genome.jp/dbget-bin/show_pathway?hsa05150+718+717+5648+2357+5340+2266+2212+3108+3109+3127+3689+6403+6404

Plot the output of the -log10(pPERT) vs -log10(pNDA). We will first need to remove any rows that contain NAs.

res<-na.omit(res)
plotP(res, threshold=0.05) 

The graph plots each KEGG term by it’s p-value due to the perturbation score and it’s p-value based on overenrichment. The most impactful terms will have the most contribution from both. This is reflected in the table. The red and blue lines designate the thresholds set by Bonferroni multiple comparisons test (red) and false discovery rate (blue).