This module covers pathway analysis and covers the basic differences between pathway analysis and GO. It goes over several of the different pathway databases and resources that are most commonly used. First, we will set the working directory.
knitr::opts_knit$set(root.dir = '/Users/aaron/Dropbox (Personal)/Aaron_denDekker/BFX/Miller_RNAseq/')
Pathway analysis is very similar to GO analysis but differs in that contains topology between genes. Thus, the role of a particular gene product is illustrated in the pathway. For example, MDM2 is an inhibitor of p53. p53 can also activate MDM2. These interactions are depicted in the p53 signaling KEGG pathway:
Thus, knowing what pathways are impacted in your experiment can provide even more information than GO or GSEA analysis.
There are a number of databases devoted to molecular pathways. The ones mentioned here only apply to pathways involving genes and/or gene products. There are databases that incorporate other molecules such drugs that aren’t mentioned. Each of these have their own unique applications. And using them for data analysis is also unique to each. If necessary these may be touched on in a separate module.
There are several databases that can be used to perform pathway analyses. The most common one is the Kyoto Enyclopedia of Genes and Genomes database, commonly referred to as KEGG. This database has been around since 1995 and was developed by Minoru Kanehisa at Kyoto University. It was originally free but now requires a license which is pretty expensive and also requires you to have a pretty good background in computer programming to use directly. This is the most common database used in online “push button” GUI tools you’ll find so it’s accessible even if you don’t have a license for direct access. However, there are several issues with this:
1. Often not current. Even though online tools utilize the KEGG database for performing pathway analysis, the databases are often outdated. The KEGG database is updated regularly and it’s difficult for many free online tools to update their database to reflect updates in the KEGG database. Also, the providers of free tools often do not maintain regular licenses. Thus, the KEGG database that is available may be several years old. For example, the online tool DAVID (Database for Annotation, Visualization and Integrated Discovery) which is maintained by the NIH, did not update their KEGG database at all between 2016 and 2021. Lacking current knowledgebase updates is actually a problem across all freeware tools.
2. KEGG itself isn’t expanding. Although there are are regulat updates to the KEGG database, there aren’t really any new additions to the database. The vast majority of the updates are done to reflect changes in ID status of particular genes and proteins. Although, the base KEGG pathways are informative, there is a lack of novel pathways in KEGG.
Another pathway database that is highly used is the Reactome database. Reactome is an open access, peer-reviewed, and manually curated pathway database that attempts to integrate ALL molecular pathways. The Reactome database is well-maintained. If you are a developer, the database is free to access (at least for academic purposes) via an API. So, you can build your own tools to use it if you wish. However, there are a number of tools that can be used to analyze data using the Reactome database.
WikiPathways is an open access database that is contributed to by the scientific community. Although it isn’t technically peer-reviewed in the classic sense, pathways are reviewed by contributors and users. One of the major benfits of WikiPathways is that as more pathways are elucidated and published, they are being added to the WikiPathways database. This allows for identification of novel mechanisms in your data. Many tools now access and analyze data using the WikiPathways database and there are a number of Cytoscape plugins to visualize and curate the results.
When you are looking at pathways, you want to be able to visualize results using the pathway “map”. There a number of databases in addition to the ones I’ve mentioned above, e.g. Biocarta. However, many of these are not easily accessed. Additionally, these databases aren’t as well maintained or contributed to. In the analyses below, I am more concerned with the output of the databases I’ve mentioned above since these are the most comprehensive and well maintained.
Pathway analysis is typically performed using over-enrichment analysis (ORA) similar to GO analysis. Thus, although there is topology on the pathway showing activity relationships between genes in the pathway, this information is not included when performing the analysis. There is a test that does take these relationships into account called impact analysis that was developed by Adi Tarca at Wayne State University. There is a Bioconductor package that uses this algorithm called SPIA (Signaling Pathway Impact Analysis) that was built and is maintained by Adi Tarca’s group. However, there are a couple of caveats to using SPIA:
1. It only applies to KEGG: SPIA was designed for use with KEGG pathways only. So, it can’t be leveraged against Reactome or WikiPathways (at least not right now).
2. It does not contain an up to date version of the KEGG database: The SPIA package implements an old version of the KEGG pathway database that was available when the software was originally written in 2012. Although the SPIA Bioconductor package is regularly maintained to work with updates in R, the KEGG database is not updated. However, it does allow you to load in your own KEGG database if you have access to it. But, as I mentioned above, to do this you have to have a license which is upwards of $5000 annually.
Despite these limitations, the SPIA package can be still be very useful. We will come back to using SPIA later.
There is a huge amount of tools that can be used to run an enrichment analysis for pathway analysis and fall into 2 main categories, R/Bioconductor packages or online tools. R/Biocnductor packages, such as clusterProfiler, are going to be more up to date since the majority of them allow, or require, that you upload the database you want to analyze your data against. I have already mentioned above the limitations imposed on the different databases and although, you have the ability to upload a up-to-date version of a database, that doesn’t mean you actually have an up-to-date version of a database, e.g. KEGG. Additionally, R/Bioconductor packages are more cumbersome because they require a good working knowledge of the particular package and that involves a learning curve. Thus, I am actully going to focus here on using online tools for pathway analysis since they are easy to use. The tool that we will be using is EnrichR because I think they do a better job at keeping the databases updated.
EnrichR is an online tool maintained by the Ma’ayan lab at the Icahn School of Medicine at Mount Sinai in NYC. To access EnrichR, click here. If you’ve gone through the GO module, I’ve already gone over EnrichR. But here, we’re going to go through the output focusing on pathway databases.
1. Once you open to EnrichR homepage, you will be presented with the ability to input just your set of interesting genes. To do this, simply copy and paste your genes from your DE gene file.
2. As always, we want to make sure that we also input our reference gene set. Above the box to input your genes, there is a paragraph describing the operation. Within that, there is the option to adding your reference set: “Also, you can now try adding a background.” When you click this, EnrichR will populate the backgroun box with its default reference gene list. You do not want to use this. Select all the genes in the background box and delete them. Then, copy and paste the genes from your reference gene list into the box. Now hit Submit. It may take a few moments to run.
3. Once the results are generated, select the Pathways tab at the top. The screen will show thumbnails of each database tested and each will be populated the top 5 hits. There are many, many more databases run by EnrichR than I mentioned above, but many of these I find less useful and less accessible.
4. Click on a thumbnail of a database that you are interested in and it will repopulate with the top hits of that database in a bar chart corresponding to the p-value of the hit (default). If you click on the chart, it will change to a different bar chart sorted by a different metric (combined score ranking or rank based ranking.) If you want to get more information about the results, select Table to generate a table of the hits. These will be sorted by the same metrics, but include the p-value, the adjusted p-value after multiple comparisons, the odds ratio, and the combined score. If you want to know more about how these are calculated, check out this paper or you can check the FAQs page.
If you want to plot the output of the data, you will need to download the table using the Export entries to table button at the bottom of the table view. Unfortunately, EnrichR does not include links to external resources, i.e, you can’t click on a hit to jump to a page for the particular pathway tested. To do this, you will need to go out of EnrichR and navigate to the website for the particular database, e.g. Reactome, and input the pathway name or ID. This will allow you to view the structure of the pathway and the genes in it. Although, it doesn’t overlay the DE genes from your dataset on it. There are modules built for Cytoscape that allow you to do this but these are beyond the scope of this module.
As mentioned above, SPIA is a R/Bioconductor package to perform pathway analysis using the impact analysis algorithm. In essence, impact analysis calculates the total accumulate perturbation in a pathway based on the differential expression of your genes. Specifically, a perturbation factor for each gene is calculated based on its expression and the expression of genes upstream from it that act on it:
Then, a p-value is calculated for each pathway using the perturbation of factors for all of genes in the pathway and also a p-value based on standard over-enrichment.
Finally, a combined p-value is calculated from both the ORA p-value (pORA or pNDE) and the perturbation p-value (pPERT).
If this seems complex, don’t worry about it. It’s basically just a way to include the topology of the genes in pathway when performing the analysis. If you want to know more about you can read the original articles that describe impact analysis here and here.
The steps to run SPIA are as follows.
First, read in the gene matrix we generated. This is the same as what is done for GO analysis.
gene_matrix<-read.table("de_gene_matrix.txt", header=TRUE)
gene_matrix<-na.omit(gene_matrix) ## remove any rows with missing values
head(gene_matrix,10)
## baseMean log2FoldChange lfcSE stat pvalue padj
## 1 4039.8363 0.1777647 0.10638128 1.6710151 9.471870e-02 1.715284e-01
## 2 1484.0676 0.3366905 0.07079863 4.7556085 1.978496e-06 1.428035e-05
## 3 802.4782 -0.2426161 0.09796361 -2.4765942 1.326426e-02 3.373410e-02
## 4 188.7049 0.3108578 0.17280967 1.7988450 7.204320e-02 1.375432e-01
## 5 16056.5082 -0.3389572 0.19785169 -1.7131885 8.667787e-02 1.596503e-01
## 6 12011.2178 0.5571671 0.14169897 3.9320475 8.422542e-05 4.158889e-04
## 7 2538.1323 -0.1405250 0.14519350 -0.9678461 3.331212e-01 4.556764e-01
## 8 1947.5685 -0.5692705 0.10832086 -5.2554100 1.476951e-07 1.338705e-06
## 9 910.3267 -0.1299670 0.07601476 -1.7097603 8.731021e-02 1.606176e-01
## 10 4607.9456 0.4016412 0.10992504 3.6537736 2.584142e-04 1.120626e-03
## ensembl_var ensembl symbol entrez
## 1 ENSG00000000003.15 ENSG00000000003 TSPAN6 7105
## 2 ENSG00000000419.14 ENSG00000000419 DPM1 8813
## 3 ENSG00000000457.14 ENSG00000000457 SCYL3 57147
## 4 ENSG00000000460.17 ENSG00000000460 FIRRM 55732
## 5 ENSG00000000971.17 ENSG00000000971 CFH 3075
## 6 ENSG00000001036.14 ENSG00000001036 FUCA2 2519
## 7 ENSG00000001084.13 ENSG00000001084 GCLC 2729
## 8 ENSG00000001167.15 ENSG00000001167 NFYA 4800
## 9 ENSG00000001460.18 ENSG00000001460 STPG1 90529
## 10 ENSG00000001461.17 ENSG00000001461 NIPAL3 57185
Now, we subset the matrix based on the differential expression thresholds.
de_matrix<-gene_matrix[gene_matrix$padj<=0.05,]
de_matrix<-de_matrix[de_matrix$log2FoldChange >=1 | de_matrix$log2FoldChange <= -1,]
Now, we want just the LFC values and the Entrez IDs. SPIA will only work with the Entrez ID numbers! This is why included this in the matrix when we performed the DE analysis.
de_list<-de_matrix$log2FoldChange
names(de_list)<-as.vector(de_matrix$entrez)
Make the list of background genes.
bkgd_list<-as.character(gene_matrix$entrez)
Find and remove any duplicates from the DE gene list.
dups<-unique(names(de_list[which(duplicated(names(de_list)))]))
de_list<-de_list[!(names(de_list) %in% dups)] # remove duplicates from sigGenes
Now, load and run SPIA. If you haven’t downloaded SPIA already, you will need to use the Bioconductor installer to install it. The code here checks if SPIA is installed and, if not, uses the Bioconductor installer to install it.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
## Bioconductor version '3.18' is out-of-date; the current release version '3.19'
## is available with R version '4.4'; see https://bioconductor.org/install
library(BiocManager)
packages <- c("SPIA")
installed<- packages %in% rownames(installed.packages())
if (any(installed == FALSE)) {
BiocManager::install(!installed)
}
library(SPIA)
## Loading required package: KEGGgraph
##
## Attaching package: 'KEGGgraph'
## The following object is masked from 'package:graphics':
##
## plot
## The following object is masked from 'package:base':
##
## plot
Now we have to set up the spia data object that will be used to run SPIA.
res=spia(de=de_list, all=bkgd_list, organism="hsa", nB=2000, plots=TRUE, beta=NULL, combine="fisher", verbose=FALSE, data.dir = system.file("extdata/", package="SPIA"))
head(res,10)
## Name ID pSize NDE pNDE
## 1 Cytokine-cytokine receptor interaction 04060 131 49 1.267761e-11
## 2 Neuroactive ligand-receptor interaction 04080 134 43 4.722207e-08
## 3 Malaria 05144 31 17 9.038553e-08
## 4 Amoebiasis 05146 78 29 2.181289e-07
## 5 ECM-receptor interaction 04512 76 29 1.132017e-07
## 6 Glutamatergic synapse 04724 103 31 1.473739e-05
## 7 Apoptosis 04210 75 13 2.322582e-01
## 8 Complement and coagulation cascades 04610 47 16 3.665645e-04
## 9 Dilated cardiomyopathy 05414 73 25 8.175237e-06
## 10 Staphylococcus aureus infection 05150 32 13 1.742705e-04
## tA pPERT pG pGFdr pGFWER Status
## 1 14.898087 0.108000 3.877086e-11 5.195295e-09 5.195295e-09 Activated
## 2 3.297866 0.028000 2.835358e-08 1.899690e-06 3.799380e-06 Activated
## 3 4.623166 0.116000 2.031241e-07 9.072878e-06 2.721863e-05 Activated
## 4 5.500484 0.398000 1.498387e-06 4.663993e-05 2.007839e-04 Activated
## 5 1.116699 0.899000 1.740296e-06 4.663993e-05 2.331997e-04 Activated
## 6 14.432307 0.058000 1.279797e-05 2.858212e-04 1.714927e-03 Activated
## 7 53.230830 0.000005 1.703147e-05 3.260309e-04 2.282217e-03 Activated
## 8 -67.400635 0.004000 2.116221e-05 3.544669e-04 2.835736e-03 Inhibited
## 9 0.000000 1.000000 1.039432e-04 1.422595e-03 1.392839e-02 Inhibited
## 10 -27.171346 0.048000 1.061638e-04 1.422595e-03 1.422595e-02 Inhibited
## KEGGLINK
## 1 http://www.genome.jp/dbget-bin/show_pathway?hsa04060+53833+3977+3556+3554+55540+3606+3624+7043+7042+55504+60401+355+10673+8995+970+53832+3815+4254+1435+5618+3575+50615+3601+3566+3563+1438+2056+85480+3600+3597+9180+3590+3570+3592+56477+10344+6354+6347+6370+6376+9547+6387+51330+8718+4982+8743+8797+2919+57007
## 2 http://www.genome.jp/dbget-bin/show_pathway?hsa04080+5340+6865+5618+116443+2890+2891+2892+2894+2897+2899+2900+5026+2554+2563+165829+2911+2918+7433+7434+6344+2642+2696+10800+57105+1902+9170+2798+2859+5028+135+5734+2837+64106+5021+6751+4161+2357+1813+1815+147+150+151+152
## 3 http://www.genome.jp/dbget-bin/show_pathway?hsa05144+6382+2532+2995+948+7097+7099+4615+6347+3592+3606+3689+7057+7058+7060+7042+7043+6403
## 4 http://www.genome.jp/dbget-bin/show_pathway?hsa05146+4583+22798+284217+3910+3914+2919+3554+7097+7099+3592+929+8503+3689+9630+5578+107+5330+5332+7042+7043+1281+1282+1284+1286+1289+1290+1302+88+5272
## 5 http://www.genome.jp/dbget-bin/show_pathway?hsa04512+22798+284217+3910+3914+1281+1282+1284+1286+1289+1290+1302+7448+7057+7058+7060+3690+3688+948+9899+6382+3693+22801+8515+8516+3675+3672+1101+5649+3673
## 6 http://www.genome.jp/dbget-bin/show_pathway?hsa04724+2918+2897+2899+2900+2890+2891+2892+116443+2911+107+108+111+57030+5330+5332+3709+5578+2784+2786+2791+2792+51764+59345+27165+2744+6512+3760+5321+8399+9229+22941
## 7 http://www.genome.jp/dbget-bin/show_pathway?hsa04210+598+330+79444+8503+3656+4615+3563+3554+3556+8797+355+4803+8743
## 8 http://www.genome.jp/dbget-bin/show_pathway?hsa04610+2153+2159+2155+7056+2266+5627+7035+2157+5340+5327+5054+718+725+710+717+5648
## 9 http://www.genome.jp/dbget-bin/show_pathway?hsa05414+6444+27092+55799+781+93589+6546+22801+3672+3673+3675+8515+8516+3688+3690+3693+6445+7139+7168+7169+70+107+108+111+7042+7043
## 10 http://www.genome.jp/dbget-bin/show_pathway?hsa05150+718+717+5648+2357+5340+2266+2212+3108+3109+3127+3689+6403+6404
Plot the output of the -log10(pPERT) vs -log10(pNDA). We will first need to remove any rows that contain NAs.
res<-na.omit(res)
plotP(res, threshold=0.05)
The graph plots each KEGG term by it’s p-value due to the perturbation score and it’s p-value based on overenrichment. The most impactful terms will have the most contribution from both. This is reflected in the table. The red and blue lines designate the thresholds set by Bonferroni multiple comparisons test (red) and false discovery rate (blue).