This module will go over gene ontology (GO) analysis. The first parts go into how GO works and how it is calculated. The latter parts go over using topGO to perform a GO analysis. To start, we need to set our working directory.
One of the first steps in exploring mechanisms in RNA-seq data is to perform a GO analysis. Before doing this, its worth describing the structure of GO terms. These are important for interpreting the output of a GO analysis.
First, GO terms are separated into 3 different groups referred to as namespaces: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Terms in the BP and MF groups often appear to be very similar, e.g. the MF term catalytic activity and the BP term postive regulation of catalytic activity while terms in the CC group are focused on the subcellular location that the translated proteins of the genes of interest would be found, e.g., cytoplasm.
Second, the genes involved with GO terms do not include any topology. In other words, we do not know how the gene products interact in the context of the term. For example, the genes JAK2 and TNF are both associated with the BP term positive regulation of protein kinase B signaling. However, the GO term doesn’t contain information about if or how these 2 gene products work on each other in the process. Nor is there information on the type of action they make in the process. They are only known to be associated with it. When interpreting the results of GO analysis, it’s important to take this into account. Just because an upregulated gene or set of genes is associated with a GO term doesn’t necessarilly mean that the GO term is up-regulated. For all you know, every one of the upregulated genes could be a negative regulator of the process in the GO term. (This is specifi ally for MF and BP terms. The CC term is pretty self-explanatory.)
Third, GO terms are built as a heirarchial tree. This is structured as more generic terms that are associated with a large number of genes (parent terms) giving rise to more specific terms that contain a smaller number of genes that are associated with more discrete processes (child terms). Each term is calculated independently, i.e., if the genes found to be significant with a child term are also significant in a parent term, they are both reported. This can be problematic when you have a large number of genes that fall into more generic terms. The more interesting terms will be the more specific ones and those will end up being ranked lower. However, there are methods to filter out more generic terms and zero in on the more interesting terms. These are described below in the analysis.
A GO analysis is performed using an enrichment analysis, and the vast majority of the time this is done using a *Fisher’s exact. This tests whether a group of genes taken from your data set is over-represented, or enriched, in a pre-determined set of genes associated with a given process. The basis of a (Fisher’s test calculates the likelihood that a given process is affected in your dataset based on a) how many of your differentially expressed (DE) genes are also associated with the given process, (b) how many genes in the process are not in your DE gene set, (c) how many genes in the DE list are not associated with the process, and (d) how many genes are neither DE nor associated with the given process:
For b, these genes are associated with the GO term and are also in the reference list. If a gene was not tested, i.e, not expressed in the experiment, they are not considered. For d, these genes are the genes in the reference set that are not DE and also not associated with the given process. Thus, the genes included in the reference set matter. It is common for online tools to allow you to input your genes of interest, i.e., DE genes, but make entering a reference set optional. You should never perform an enrichment analysis without using the reference set from your experiment!. These tools use a generalized gene set as a reference when you don’t input one. This gene set will include all genes in the genome and can include things such as pseudogenes and untranslated genes, i.e., the default reference set will be much larger than the reference set for your experiment. Thus, all of the calculations will be inflated and which will result in false positives.
TopGO is an R/Bioconductor package developed by Adrian Alexa who developed algorithms for filtering out false positive GO terms which which are discussed below. Although topGO isn’t the most user friendly tool, it offers several things that make it superior to most other tools out there (at least the ones I’ve used):
1. Flexibility topGO performs a Fisher’s exact test by default but includes other statistical tests built in and, if a particular statistic isn’t already available, topGO also allows the user to input their own statistic test as a function. 2. Up to date results TopGO requires that you map genes to GO IDs using a database of your choosing. Since the database isn’t built in to the software, you define what mappings get used. Thus, you don’t have to worry about whether the mappings are up to date. This is a huge problem with online “push button” tools that are often not maintained regularly and have extremely out of date databases. 3. Filtering out false postive results The classical method of GO analysis calculates each GO term independently which almost always yields many generic, uninformative terms. TopGO was designed to filter out false positive GO terms and give the most concise results. There are other tools* that include filtering methods but it isn’t always clear what is being done. TopGO allows you to have full control over the analysis parameters.
*If you’re familar with iPathwayGuide from Advaita, it uses topGO under the hood to perform GO analysis.
As mentioned above, one of the biggest issues with GO analysis is the generation of false positive results. However, there have been several algorithms developed to deal with this issue, of which the elim and weight methods developed by Alexa et al are the most common. The elim algorithm begins at the lowest levels of the GO graph at the most specific terms and calculates their enrichment p value. If the term is significant, the term is reported. Then, when the parent term is analyzed, the genes that were associateed with the significant child term are eliminated from the analysis of the parent and only the remaining genes are considered. The parent term will be reported if it can be calculated significant with only these remaining genes. However, if the child term was not calculated as significant, all of the DE genes associated with that term will be counted for its parent as if the two are independent. This continues until a term is found to be significant. This reduces the number of false postives reported.
With the weight method, on the other hand, if several terms that are found to be significant are child terms of a common parent, the parent term is calculated and reported as significant. But parent terms above this are eliminated. Thus, the significant sibling terms attribute more weight to the parent term. The original manuscript showed that the weight algorithm identified less false positives than the classical method and missed few true positives, while the elim algorithm had even less false positives than weight but missed more true positives. To address this, topGO uses a combination of both the elim and weight methods called weight01 by default. Most of the time, the weight01 method is sufficient but I’ve also found that there are times when one of the other methods yields more useful results.
The basic data object required for topGO analysis is a topGOdata object. To build this we will utilize the gene lists we previously generated. First, load the necessary packages. Install with BiocManager::install function if not already installed:
Next, read in the entire DE gene matrix we generated:
de_matrix<-read.table("de_gene_matrix.txt", header=TRUE)
de_matrix<-na.omit(de_matrix) ## remove any rows with missing values
Now we want to calculate a score for each gene to define whether it’s “interesting”. topGO accepts a function that defines what genes in your reference dataset are interesting and uses those for the GO analysis. However, the list is a numeric vector and, thus, can only contain one argument. So, if you use p values for your metric, then you select all genes with a p value below a threshold regardless of the lof FC in expression. Conversely, if you use logFC as the metric, it disregards p value. To bypass this, we calculate a z score for each gene by taking the absolute value of logFC (since directionality doesn’t matter in this case) and multiplying it by the negative log base 10 of the adjusted p value:
de_matrix$score<-ifelse((abs(de_matrix$log2FoldChange))>=1 & de_matrix$padj<0.05,
((abs(de_matrix$log2FoldChange))*(-log10(de_matrix$padj))),
0.00)
Now create a named vector of just the scores with the gene symbols assigned:
gene_list<-de_matrix$score
names(gene_list)<-de_matrix$symbol
head(gene_list,50)
## TSPAN6 DPM1 SCYL3 FIRRM CFH FUCA2 GCLC NFYA
## 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## STPG1 NIPAL3 LAS1L ENPP4 SEMA3F ANKIB1 CYP51A1 KRIT1
## 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## RAD52 MYH16 BAD LAP3 CD99 HS3ST1 MAD1L1 LASP1
## 0.000000 0.000000 0.000000 0.000000 0.000000 2.468507 0.000000 0.000000
## SNX11 TMEM176A M6PR KLHL13 CYP26B1 ICA1 DBNDD1 ALS2
## 0.000000 7.171751 0.000000 29.820731 0.000000 0.000000 0.000000 0.000000
## CASP10 CFLAR TFPI NDUFAF7 RBM5 MTMR7 SLC7A2 ARF5
## 0.000000 0.000000 3.383944 0.000000 0.000000 9.514232 22.227959 0.000000
## SARM1 POLDIP2 PLXND1 AK2 CD38 FKBP4 KDM1A RBM6
## 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## CAMKK1 RECQL
## 0.000000 0.000000
Now, we’ll build a map of GO terms to the associated genes. We do this because GO terms can be mapped to a variety of identifiers such as microarray probes, gene symbols, entrez IDs, etc. In our case, we are using the official gene symbols which we will get from the org.Hs.eg.db. This step must be done separately for each GO category (BP, MF, CC) so we want to name it accordingly.
GO2genes.BP <- annFUN.org(
whichOnto = 'BP',
feasibleGenes = NULL, ## no filtering of genes. all genes from GO are included.
mapping = 'org.Hs.eg.db',
ID = 'symbol')
GO2genes.MF <- annFUN.org(
whichOnto = 'MF',
feasibleGenes = NULL,
mapping = 'org.Hs.eg.db',
ID = 'symbol')
GO2genes.CC <- annFUN.org(
whichOnto = 'CC',
feasibleGenes = NULL,
mapping = 'org.Hs.eg.db',
ID = 'symbol')
So that we can use the z score to select for our genes of interest, we have to make a function declaring the cutoff:
selection<-function(score){
return(score>=1.3) ## this is the value of a |1| * -log10(0.05)
}
No we apply the mapping to build the topGOdata object and input our selection function. This also has to be done separately for each category.
## visualize the output
GOdata.BP
##
## ------------------------- topGOdata object -------------------------
##
## Description:
## -
##
## Ontology:
## - BP
##
## 16955 available genes (all genes from the array):
## - symbol: TSPAN6 DPM1 SCYL3 FIRRM CFH ...
## - score : 0 0 0 0 0 ...
## - 2348 significant genes.
##
## 13449 feasible genes (genes that can be used in the analysis):
## - symbol: TSPAN6 DPM1 SCYL3 FIRRM CFH ...
## - score : 0 0 0 0 0 ...
## - 1817 significant genes.
##
## GO graph (nodes with at least 10 genes):
## - a graph with directed edges
## - number of nodes = 6079
## - number of edges = 13144
##
## ------------------------- topGOdata object -------------------------
We can see the even though we set no filtering of genes from the GO database, there are 3546 genes considered not feasible for analysis and reduces the number of significant genes from 2348 to 1817. This is because these genes don’t have GO assignments. This isn’t uncommon.
Now we’ll run topGO using a Fisher’s exact test with the several of the filtering algorithms to compare them.
Generate a table of the results:
results.table.bp=GenTable(GOdata.BP, fis.w01=go.bp.w01, fis.elim=go.bp.elim, fis.wt=go.bp.wt, fis.classic=go.bp.classic, orderBy="fis.elim", ranksOf="fis.elim", topNodes=50)
results.table.bp
## GO.ID Term Annotated Significant
## 1 GO:0007155 cell adhesion 1170 282
## 2 GO:0007157 heterophilic cell-cell adhesion via plas... 40 20
## 3 GO:0007601 visual perception 150 43
## 4 GO:0007411 axon guidance 194 51
## 5 GO:0007267 cell-cell signaling 1294 302
## 6 GO:0033627 cell adhesion mediated by integrin 74 26
## 7 GO:0006954 inflammatory response 552 125
## 8 GO:0098703 calcium ion import across plasma membran... 35 16
## 9 GO:0045109 intermediate filament organization 35 16
## 10 GO:0007189 adenylate cyclase-activating G protein-c... 95 30
## 11 GO:0072677 eosinophil migration 13 9
## 12 GO:0007193 adenylate cyclase-inhibiting G protein-c... 48 19
## 13 GO:0034765 regulation of monoatomic ion transmembra... 318 93
## 14 GO:0030198 extracellular matrix organization 264 80
## 15 GO:0007156 homophilic cell adhesion via plasma memb... 142 39
## 16 GO:0050804 modulation of chemical synaptic transmis... 393 99
## 17 GO:0007631 feeding behavior 58 21
## 18 GO:0048333 mesodermal cell differentiation 23 12
## 19 GO:0035725 sodium ion transmembrane transport 125 39
## 20 GO:0042572 retinol metabolic process 35 15
## 21 GO:0002690 positive regulation of leukocyte chemota... 65 22
## 22 GO:0030199 collagen fibril organization 61 21
## 23 GO:0007565 female pregnancy 144 38
## 24 GO:0050829 defense response to Gram-negative bacter... 32 14
## 25 GO:0048662 negative regulation of smooth muscle cel... 40 16
## 26 GO:0015874 norepinephrine transport 15 9
## 27 GO:0007186 G protein-coupled receptor signaling pat... 491 135
## 28 GO:0045666 positive regulation of neuron differenti... 68 22
## 29 GO:0001975 response to amphetamine 26 12
## 30 GO:0098915 membrane repolarization during ventricul... 13 8
## 31 GO:0010811 positive regulation of cell-substrate ad... 104 29
## 32 GO:0008217 regulation of blood pressure 130 34
## 33 GO:0051897 positive regulation of phosphatidylinosi... 126 33
## 34 GO:0070374 positive regulation of ERK1 and ERK2 cas... 137 35
## 35 GO:0009593 detection of chemical stimulus 47 19
## 36 GO:0042476 odontogenesis 98 32
## 37 GO:0007160 cell-matrix adhesion 199 46
## 38 GO:0007612 learning 123 32
## 39 GO:0043410 positive regulation of MAPK cascade 340 86
## 40 GO:0010832 negative regulation of myotube different... 11 7
## 41 GO:0051241 negative regulation of multicellular org... 816 174
## 42 GO:0048714 positive regulation of oligodendrocyte d... 21 10
## 43 GO:0045766 positive regulation of angiogenesis 129 33
## 44 GO:0001525 angiogenesis 419 99
## 45 GO:0098810 neurotransmitter reuptake 25 11
## 46 GO:0008584 male gonad development 114 30
## 47 GO:0090090 negative regulation of canonical Wnt sig... 114 30
## 48 GO:0002052 positive regulation of neuroblast prolif... 29 12
## 49 GO:0051966 regulation of synaptic transmission, glu... 60 19
## 50 GO:0019221 cytokine-mediated signaling pathway 330 75
## Expected fis.w01 fis.elim fis.wt fis.classic
## 1 158.07 2.1e-10 8.0e-09 7.7e-18 8.3e-25
## 2 5.40 3.4e-08 3.4e-08 1.00000 3.5e-08
## 3 20.27 2.0e-06 7.7e-07 0.34317 8.0e-07
## 4 26.21 1.1e-06 1.4e-06 1.5e-06 1.5e-06
## 5 174.82 2.0e-10 1.6e-06 1.1e-06 3.5e-24
## 6 10.00 5.8e-05 2.0e-06 1.00000 2.0e-06
## 7 74.58 6.9e-07 3.3e-06 6.8e-06 1.8e-09
## 8 4.73 3.7e-06 3.7e-06 1.00000 3.7e-06
## 9 4.73 3.7e-06 3.7e-06 3.7e-06 3.7e-06
## 10 12.83 0.00010 4.2e-06 1.00000 4.3e-06
## 11 1.76 6.3e-06 6.3e-06 1.00000 6.4e-06
## 12 6.48 6.3e-06 6.3e-06 1.00000 6.5e-06
## 13 42.96 5.4e-06 6.9e-06 1.00000 9.9e-14
## 14 35.67 3.3e-07 7.2e-06 1.00000 7.1e-13
## 15 19.18 7.7e-06 7.7e-06 1.00000 8.0e-06
## 16 53.10 0.00023 8.5e-06 1.00000 2.5e-10
## 17 7.84 1.5e-06 1.1e-05 1.1e-05 1.1e-05
## 18 3.11 7.4e-05 1.1e-05 1.1e-05 1.1e-05
## 19 16.89 3.2e-05 1.7e-05 0.43664 2.4e-07
## 20 4.73 1.9e-05 1.9e-05 1.00000 2.0e-05
## 21 8.78 0.00129 2.3e-05 1.00000 2.4e-05
## 22 8.24 2.7e-05 2.7e-05 1.00000 2.7e-05
## 23 19.45 4.4e-07 2.7e-05 0.39536 2.8e-05
## 24 4.32 2.8e-05 2.8e-05 1.00000 2.8e-05
## 25 5.40 0.00059 2.9e-05 1.00000 3.0e-05
## 26 2.03 3.4e-05 3.4e-05 3.4e-05 3.4e-05
## 27 66.34 2.3e-05 4.0e-05 2.6e-14 6.3e-17
## 28 9.19 5.1e-05 5.1e-05 5.3e-05 5.3e-05
## 29 3.51 5.5e-05 5.5e-05 1.00000 5.5e-05
## 30 1.76 7.4e-05 7.4e-05 1.00000 7.5e-05
## 31 14.05 0.00060 8.1e-05 1.00000 8.3e-05
## 32 17.56 0.00068 8.5e-05 0.63488 8.8e-05
## 33 17.02 0.00010 0.00010 0.00011 0.00011
## 34 18.51 0.00011 0.00011 1.00000 0.00012
## 35 6.35 7.2e-05 0.00012 4.5e-06 4.5e-06
## 36 13.24 0.00016 0.00012 9.1e-07 9.1e-07
## 37 26.89 0.00054 0.00014 0.00725 0.00015
## 38 16.62 0.00039 0.00015 1.00000 0.00016
## 39 45.94 0.00289 0.00015 3.1e-09 3.1e-09
## 40 1.49 0.00016 0.00016 0.00016 0.00016
## 41 110.24 0.01385 0.00016 0.00594 1.7e-10
## 42 2.84 0.00017 0.00017 0.00017 0.00017
## 43 17.43 0.00038 0.00017 1.00000 0.00017
## 44 56.61 0.00195 0.00019 1.00000 9.5e-09
## 45 3.38 0.00019 0.00019 1.00000 0.00019
## 46 15.40 3.3e-05 0.00019 0.00020 0.00020
## 47 15.40 0.00019 0.00019 1.00000 0.00020
## 48 3.92 0.00020 0.00020 1.00000 0.00020
## 49 8.11 0.00052 0.00022 1.00000 0.00023
## 50 44.58 9.4e-06 0.00022 1.00000 2.9e-06
pdf(file="bp.elim.pdf")
par(cex = 0.5)
showSigOfNodes(GOdata.BP, score(go.bp.elim), firstSigNodes = 10, useInfo = 'all')
dev.off()
pdf(file="bp.w01.pdf")
par(cex = 0.5)
showSigOfNodes(GOdata.BP, score(go.bp.w01), firstSigNodes = 10, useInfo = 'all', )
dev.off()
pdf(file="bp.wt.pdf")
showSigOfNodes(GOdata.BP, score(go.bp.wt), firstSigNodes = 10, useInfo = 'all')
dev.off()
showSigOfNodes(GOdata.BP, score(go.bp.classic), firstSigNodes = 5, useInfo = 'all')
You can also run topGP using a Kolmogorov-Smirnov (KS) statistic test.
However, topGO is finicky when running the KS statistic. Using the KS
statistic works fine when using the weight01 or classic algorithms. But,
the elim algorithm induces an error. I’ve reached out to the community
about this and I’ve attempted to debug it based on feedback but I
haven’t been to make it work.Regardless, we’ll run the KS statistic here
with the weight01 algorithm to see how it compares to the Fisher
results.
Generate a table of the results:
results.table.bp.ks=GenTable(GOdata.BP, ks.w01=go.bp.ks.w01, fis.w01=go.bp.w01, fis.elim=go.bp.elim, orderBy="ks.w01", ranksOf="fis.w01", topNodes=50)
results.table.bp.ks
## GO.ID Term Annotated Significant
## 1 GO:0008150 biological_process 13430 1816
## 2 GO:0015031 protein transport 1414 134
## 3 GO:0006281 DNA repair 565 14
## 4 GO:0051301 cell division 582 36
## 5 GO:2000045 regulation of G1/S transition of mitotic... 150 15
## 6 GO:0008380 RNA splicing 423 11
## 7 GO:0006338 chromatin remodeling 544 25
## 8 GO:0006406 mRNA export from nucleus 65 0
## 9 GO:0006886 intracellular protein transport 808 59
## 10 GO:0006511 ubiquitin-dependent protein catabolic pr... 605 23
## 11 GO:1902969 mitotic DNA replication 16 0
## 12 GO:1990830 cellular response to leukemia inhibitory... 92 11
## 13 GO:0030071 regulation of mitotic metaphase/anaphase... 87 0
## 14 GO:0000281 mitotic cytokinesis 86 2
## 15 GO:0043161 proteasome-mediated ubiquitin-dependent ... 437 16
## 16 GO:0009411 response to UV 146 12
## 17 GO:0090307 mitotic spindle assembly 69 3
## 18 GO:2000819 regulation of nucleotide-excision repair 27 0
## 19 GO:0048026 positive regulation of mRNA splicing, vi... 21 0
## 20 GO:0042168 heme metabolic process 38 0
## 21 GO:0006915 apoptotic process 1567 248
## 22 GO:0060261 positive regulation of transcription ini... 58 0
## 23 GO:0045893 positive regulation of DNA-templated tra... 1360 140
## 24 GO:0006417 regulation of translation 383 21
## 25 GO:0032530 regulation of microvillus organization 11 0
## 26 GO:0006750 glutathione biosynthetic process 14 0
## 27 GO:0035720 intraciliary anterograde transport 18 0
## 28 GO:0042147 retrograde transport, endosome to Golgi 94 2
## 29 GO:1903146 regulation of autophagy of mitochondrion 37 3
## 30 GO:0018105 peptidyl-serine phosphorylation 250 23
## 31 GO:0031122 cytoplasmic microtubule organization 56 3
## 32 GO:0019082 viral protein processing 30 0
## 33 GO:0006893 Golgi to plasma membrane transport 57 3
## 34 GO:0060271 cilium assembly 344 28
## 35 GO:0006298 mismatch repair 29 0
## 36 GO:0032968 positive regulation of transcription elo... 48 1
## 37 GO:0000398 mRNA splicing, via spliceosome 283 4
## 38 GO:2000781 positive regulation of double-strand bre... 83 1
## 39 GO:0006446 regulation of translational initiation 75 3
## 40 GO:0006261 DNA-templated DNA replication 156 4
## 41 GO:0051447 negative regulation of meiotic cell cycl... 10 0
## 42 GO:0036010 protein localization to endosome 26 2
## 43 GO:0051123 RNA polymerase II preinitiation complex ... 55 2
## 44 GO:0016078 tRNA catabolic process 15 0
## 45 GO:0016567 protein ubiquitination 662 36
## 46 GO:0000380 alternative mRNA splicing, via spliceoso... 68 2
## 47 GO:1902459 positive regulation of stem cell populat... 45 3
## 48 GO:0030968 endoplasmic reticulum unfolded protein r... 71 2
## 49 GO:0046827 positive regulation of protein export fr... 17 1
## 50 GO:0006357 regulation of transcription by RNA polym... 2044 210
## Expected Rank in fis.w01 ks.w01 fis.w01 fis.elim
## 1 1814.43 551 5.6e-25 0.036 0.28
## 2 191.04 3931 2.7e-08 1.000 1.00
## 3 76.33 3933 8.6e-08 1.000 1.00
## 4 78.63 3932 1.2e-07 1.000 1.00
## 5 20.27 3788 1.6e-07 0.988 0.92
## 6 57.15 3912 3.2e-06 1.000 1.00
## 7 73.50 3924 4.7e-06 1.000 1.00
## 8 8.78 3935 8.2e-06 1.000 1.00
## 9 109.16 3901 1.1e-05 1.000 1.00
## 10 81.74 3896 1.3e-05 1.000 1.00
## 11 2.16 3936 1.7e-05 1.000 1.00
## 12 12.43 2902 2.8e-05 0.713 0.71
## 13 11.75 3937 3.1e-05 1.000 1.00
## 14 11.62 3905 3.8e-05 1.000 1.00
## 15 59.04 3927 3.8e-05 1.000 1.00
## 16 19.73 3846 5.2e-05 0.997 0.98
## 17 9.32 3864 8.7e-05 0.998 1.00
## 18 3.65 3938 0.00013 1.000 1.00
## 19 2.84 3939 0.00013 1.000 1.00
## 20 5.13 3940 0.00016 1.000 1.00
## 21 211.71 2471 0.00017 0.539 0.51
## 22 7.84 3941 0.00018 1.000 1.00
## 23 183.74 3919 0.00036 1.000 1.00
## 24 51.74 3909 0.00041 1.000 1.00
## 25 1.49 3942 0.00042 1.000 1.00
## 26 1.89 3943 0.00044 1.000 1.00
## 27 2.43 3944 0.00044 1.000 1.00
## 28 12.70 3910 0.00048 1.000 1.00
## 29 5.00 3554 0.00055 0.915 0.89
## 30 33.78 3855 0.00057 0.997 0.99
## 31 7.57 3801 0.00068 0.990 0.99
## 32 4.05 3945 0.00069 1.000 1.00
## 33 7.70 3946 0.00071 1.000 0.99
## 34 46.48 3837 0.00071 0.995 1.00
## 35 3.92 3947 0.00077 1.000 1.00
## 36 6.48 3878 0.00081 0.999 1.00
## 37 38.23 3928 0.00082 1.000 1.00
## 38 11.21 3915 0.00087 1.000 1.00
## 39 10.13 3948 0.00094 1.000 1.00
## 40 21.08 3949 0.00100 1.000 1.00
## 41 1.35 3950 0.00103 1.000 1.00
## 42 3.51 3951 0.00116 1.000 0.88
## 43 7.43 3849 0.00123 0.997 1.00
## 44 2.03 3952 0.00125 1.000 1.00
## 45 89.44 3925 0.00127 1.000 1.00
## 46 9.19 3953 0.00128 1.000 1.00
## 47 6.08 3665 0.00134 0.953 0.95
## 48 9.59 3885 0.00157 1.000 1.00
## 49 2.30 3534 0.00164 0.915 0.92
## 50 276.15 3929 0.00169 1.000 1.00
You can see that using the KS statistic here reports much more uninformative terms than the Fisher’s statistic does. For reasons we won’t get into, the KS test has been documented to be more susceptible to outliers compared to the Fisher’s test. Thus, I typically go with the Fisher’s test.
I laid out why I prefer topGO above. There are, of course, a number of other tools that can be used to perform GO analysis. If you would rather use another tool, I recommend using EnrichR. The group who maintains EnrichR attempts to keep the databases up to date and it’s really simple to use.
EnrichR can be reached here: EnrichR
When you use EnrichR, simply input the list of DE genes into the box. Make sure to select the backgroud option in the description above and input your reference gene list. EnrichR will input a default list here, so make sure you select all of the default genes, delete them, and then add your list. Again, do NOT use EnrichR, or any other tool, without including the reference set of genes that is specific to your experiment.
When interpreting your results, keep in mind that the statistics used by EnrichR are slightly different than what is used in topGO. The Fisher’s test is used but there are nuanced differences. You can read more about the specifics in this article .