1. Introduction

This module will go over gene ontology (GO) analysis. The first parts go into how GO works and how it is calculated. The latter parts go over using topGO to perform a GO analysis. To start, we need to set our working directory.

2. Gene Ontology

2.1 Background

One of the first steps in exploring mechanisms in RNA-seq data is to perform a GO analysis. Before doing this, its worth describing the structure of GO terms. These are important for interpreting the output of a GO analysis.

First, GO terms are separated into 3 different groups referred to as namespaces: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Terms in the BP and MF groups often appear to be very similar, e.g. the MF term catalytic activity and the BP term postive regulation of catalytic activity while terms in the CC group are focused on the subcellular location that the translated proteins of the genes of interest would be found, e.g., cytoplasm.

Second, the genes involved with GO terms do not include any topology. In other words, we do not know how the gene products interact in the context of the term. For example, the genes JAK2 and TNF are both associated with the BP term positive regulation of protein kinase B signaling. However, the GO term doesn’t contain information about if or how these 2 gene products work on each other in the process. Nor is there information on the type of action they make in the process. They are only known to be associated with it. When interpreting the results of GO analysis, it’s important to take this into account. Just because an upregulated gene or set of genes is associated with a GO term doesn’t necessarilly mean that the GO term is up-regulated. For all you know, every one of the upregulated genes could be a negative regulator of the process in the GO term. (This is specifi ally for MF and BP terms. The CC term is pretty self-explanatory.)

Third, GO terms are built as a heirarchial tree. This is structured as more generic terms that are associated with a large number of genes (parent terms) giving rise to more specific terms that contain a smaller number of genes that are associated with more discrete processes (child terms). Each term is calculated independently, i.e., if the genes found to be significant with a child term are also significant in a parent term, they are both reported. This can be problematic when you have a large number of genes that fall into more generic terms. The more interesting terms will be the more specific ones and those will end up being ranked lower. However, there are methods to filter out more generic terms and zero in on the more interesting terms. These are described below in the analysis.

2.2 Fisher’s exact test

A GO analysis is performed using an enrichment analysis, and the vast majority of the time this is done using a *Fisher’s exact. This tests whether a group of genes taken from your data set is over-represented, or enriched, in a pre-determined set of genes associated with a given process. The basis of a (Fisher’s test calculates the likelihood that a given process is affected in your dataset based on a) how many of your differentially expressed (DE) genes are also associated with the given process, (b) how many genes in the process are not in your DE gene set, (c) how many genes in the DE list are not associated with the process, and (d) how many genes are neither DE nor associated with the given process:

For b, these genes are associated with the GO term and are also in the reference list. If a gene was not tested, i.e, not expressed in the experiment, they are not considered. For d, these genes are the genes in the reference set that are not DE and also not associated with the given process. Thus, the genes included in the reference set matter. It is common for online tools to allow you to input your genes of interest, i.e., DE genes, but make entering a reference set optional. You should never perform an enrichment analysis without using the reference set from your experiment!. These tools use a generalized gene set as a reference when you don’t input one. This gene set will include all genes in the genome and can include things such as pseudogenes and untranslated genes, i.e., the default reference set will be much larger than the reference set for your experiment. Thus, all of the calculations will be inflated and which will result in false positives.

3. topGO

3.1 Background

TopGO is an R/Bioconductor package developed by Adrian Alexa who developed algorithms for filtering out false positive GO terms which which are discussed below. Although topGO isn’t the most user friendly tool, it offers several things that make it superior to most other tools out there (at least the ones I’ve used):

1. Flexibility topGO performs a Fisher’s exact test by default but includes other statistical tests built in and, if a particular statistic isn’t already available, topGO also allows the user to input their own statistic test as a function. 2. Up to date results TopGO requires that you map genes to GO IDs using a database of your choosing. Since the database isn’t built in to the software, you define what mappings get used. Thus, you don’t have to worry about whether the mappings are up to date. This is a huge problem with online “push button” tools that are often not maintained regularly and have extremely out of date databases. 3. Filtering out false postive results The classical method of GO analysis calculates each GO term independently which almost always yields many generic, uninformative terms. TopGO was designed to filter out false positive GO terms and give the most concise results. There are other tools* that include filtering methods but it isn’t always clear what is being done. TopGO allows you to have full control over the analysis parameters.

*If you’re familar with iPathwayGuide from Advaita, it uses topGO under the hood to perform GO analysis.

3.2 GO term filtering algorithms

As mentioned above, one of the biggest issues with GO analysis is the generation of false positive results. However, there have been several algorithms developed to deal with this issue, of which the elim and weight methods developed by Alexa et al are the most common. The elim algorithm begins at the lowest levels of the GO graph at the most specific terms and calculates their enrichment p value. If the term is significant, the term is reported. Then, when the parent term is analyzed, the genes that were associateed with the significant child term are eliminated from the analysis of the parent and only the remaining genes are considered. The parent term will be reported if it can be calculated significant with only these remaining genes. However, if the child term was not calculated as significant, all of the DE genes associated with that term will be counted for its parent as if the two are independent. This continues until a term is found to be significant. This reduces the number of false postives reported.

With the weight method, on the other hand, if several terms that are found to be significant are child terms of a common parent, the parent term is calculated and reported as significant. But parent terms above this are eliminated. Thus, the significant sibling terms attribute more weight to the parent term. The original manuscript showed that the weight algorithm identified less false positives than the classical method and missed few true positives, while the elim algorithm had even less false positives than weight but missed more true positives. To address this, topGO uses a combination of both the elim and weight methods called weight01 by default. Most of the time, the weight01 method is sufficient but I’ve also found that there are times when one of the other methods yields more useful results.

4. GO analysis with topGO

4.1 Load the necessary packages, and import and parse the necessary data

The basic data object required for topGO analysis is a topGOdata object. To build this we will utilize the gene lists we previously generated. First, load the necessary packages. Install with BiocManager::install function if not already installed:

Next, read in the entire DE gene matrix we generated:

de_matrix<-read.table("de_gene_matrix.txt", header=TRUE)
de_matrix<-na.omit(de_matrix) ## remove any rows with missing values

Now we want to calculate a score for each gene to define whether it’s “interesting”. topGO accepts a function that defines what genes in your reference dataset are interesting and uses those for the GO analysis. However, the list is a numeric vector and, thus, can only contain one argument. So, if you use p values for your metric, then you select all genes with a p value below a threshold regardless of the lof FC in expression. Conversely, if you use logFC as the metric, it disregards p value. To bypass this, we calculate a z score for each gene by taking the absolute value of logFC (since directionality doesn’t matter in this case) and multiplying it by the negative log base 10 of the adjusted p value:

de_matrix$score<-ifelse((abs(de_matrix$log2FoldChange))>=1 & de_matrix$padj<0.05,
                        ((abs(de_matrix$log2FoldChange))*(-log10(de_matrix$padj))),
                        0.00)

Now create a named vector of just the scores with the gene symbols assigned:

gene_list<-de_matrix$score
names(gene_list)<-de_matrix$symbol
head(gene_list,50)

##    TSPAN6      DPM1     SCYL3     FIRRM       CFH     FUCA2      GCLC      NFYA 
##  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000 
##     STPG1    NIPAL3     LAS1L     ENPP4    SEMA3F    ANKIB1   CYP51A1     KRIT1 
##  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000 
##     RAD52     MYH16       BAD      LAP3      CD99    HS3ST1    MAD1L1     LASP1 
##  0.000000  0.000000  0.000000  0.000000  0.000000  2.468507  0.000000  0.000000 
##     SNX11  TMEM176A      M6PR    KLHL13   CYP26B1      ICA1    DBNDD1      ALS2 
##  0.000000  7.171751  0.000000 29.820731  0.000000  0.000000  0.000000  0.000000 
##    CASP10     CFLAR      TFPI   NDUFAF7      RBM5     MTMR7    SLC7A2      ARF5 
##  0.000000  0.000000  3.383944  0.000000  0.000000  9.514232 22.227959  0.000000 
##     SARM1   POLDIP2    PLXND1       AK2      CD38     FKBP4     KDM1A      RBM6 
##  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000 
##    CAMKK1     RECQL 
##  0.000000  0.000000

4.2 Build the topGOdata object

Now, we’ll build a map of GO terms to the associated genes. We do this because GO terms can be mapped to a variety of identifiers such as microarray probes, gene symbols, entrez IDs, etc. In our case, we are using the official gene symbols which we will get from the org.Hs.eg.db. This step must be done separately for each GO category (BP, MF, CC) so we want to name it accordingly.

GO2genes.BP <- annFUN.org(
      whichOnto = 'BP',
      feasibleGenes = NULL, ## no filtering of genes. all genes from GO are included.
      mapping = 'org.Hs.eg.db',
      ID = 'symbol')

GO2genes.MF <- annFUN.org(
      whichOnto = 'MF',
      feasibleGenes = NULL,
      mapping = 'org.Hs.eg.db',
      ID = 'symbol')

GO2genes.CC <- annFUN.org(
      whichOnto = 'CC',
      feasibleGenes = NULL,
      mapping = 'org.Hs.eg.db',
      ID = 'symbol')

So that we can use the z score to select for our genes of interest, we have to make a function declaring the cutoff:

selection<-function(score){
  return(score>=1.3) ## this is the value of a |1| * -log10(0.05)
}

No we apply the mapping to build the topGOdata object and input our selection function. This also has to be done separately for each category.

## visualize the output
GOdata.BP

## 
## ------------------------- topGOdata object -------------------------
## 
##  Description:
##    -   
## 
##  Ontology:
##    -  BP 
## 
##  16955 available genes (all genes from the array):
##    - symbol:  TSPAN6 DPM1 SCYL3 FIRRM CFH  ...
##    - score :  0 0 0 0 0  ...
##    - 2348  significant genes. 
## 
##  13449 feasible genes (genes that can be used in the analysis):
##    - symbol:  TSPAN6 DPM1 SCYL3 FIRRM CFH  ...
##    - score :  0 0 0 0 0  ...
##    - 1817  significant genes. 
## 
##  GO graph (nodes with at least  10  genes):
##    - a graph with directed edges
##    - number of nodes = 6079 
##    - number of edges = 13144 
## 
## ------------------------- topGOdata object -------------------------

We can see the even though we set no filtering of genes from the GO database, there are 3546 genes considered not feasible for analysis and reduces the number of significant genes from 2348 to 1817. This is because these genes don’t have GO assignments. This isn’t uncommon.

4.3 Run topGO and visualize results

Now we’ll run topGO using a Fisher’s exact test with the several of the filtering algorithms to compare them.

Generate a table of the results:

results.table.bp=GenTable(GOdata.BP, fis.w01=go.bp.w01, fis.elim=go.bp.elim, fis.wt=go.bp.wt, fis.classic=go.bp.classic, orderBy="fis.elim", ranksOf="fis.elim", topNodes=50)
results.table.bp

##         GO.ID                                        Term Annotated Significant
## 1  GO:0007155                               cell adhesion      1170         282
## 2  GO:0007157 heterophilic cell-cell adhesion via plas...        40          20
## 3  GO:0007601                           visual perception       150          43
## 4  GO:0007411                               axon guidance       194          51
## 5  GO:0007267                         cell-cell signaling      1294         302
## 6  GO:0033627          cell adhesion mediated by integrin        74          26
## 7  GO:0006954                       inflammatory response       552         125
## 8  GO:0098703 calcium ion import across plasma membran...        35          16
## 9  GO:0045109          intermediate filament organization        35          16
## 10 GO:0007189 adenylate cyclase-activating G protein-c...        95          30
## 11 GO:0072677                        eosinophil migration        13           9
## 12 GO:0007193 adenylate cyclase-inhibiting G protein-c...        48          19
## 13 GO:0034765 regulation of monoatomic ion transmembra...       318          93
## 14 GO:0030198           extracellular matrix organization       264          80
## 15 GO:0007156 homophilic cell adhesion via plasma memb...       142          39
## 16 GO:0050804 modulation of chemical synaptic transmis...       393          99
## 17 GO:0007631                            feeding behavior        58          21
## 18 GO:0048333             mesodermal cell differentiation        23          12
## 19 GO:0035725          sodium ion transmembrane transport       125          39
## 20 GO:0042572                   retinol metabolic process        35          15
## 21 GO:0002690 positive regulation of leukocyte chemota...        65          22
## 22 GO:0030199                collagen fibril organization        61          21
## 23 GO:0007565                            female pregnancy       144          38
## 24 GO:0050829 defense response to Gram-negative bacter...        32          14
## 25 GO:0048662 negative regulation of smooth muscle cel...        40          16
## 26 GO:0015874                    norepinephrine transport        15           9
## 27 GO:0007186 G protein-coupled receptor signaling pat...       491         135
## 28 GO:0045666 positive regulation of neuron differenti...        68          22
## 29 GO:0001975                     response to amphetamine        26          12
## 30 GO:0098915 membrane repolarization during ventricul...        13           8
## 31 GO:0010811 positive regulation of cell-substrate ad...       104          29
## 32 GO:0008217                regulation of blood pressure       130          34
## 33 GO:0051897 positive regulation of phosphatidylinosi...       126          33
## 34 GO:0070374 positive regulation of ERK1 and ERK2 cas...       137          35
## 35 GO:0009593              detection of chemical stimulus        47          19
## 36 GO:0042476                               odontogenesis        98          32
## 37 GO:0007160                        cell-matrix adhesion       199          46
## 38 GO:0007612                                    learning       123          32
## 39 GO:0043410         positive regulation of MAPK cascade       340          86
## 40 GO:0010832 negative regulation of myotube different...        11           7
## 41 GO:0051241 negative regulation of multicellular org...       816         174
## 42 GO:0048714 positive regulation of oligodendrocyte d...        21          10
## 43 GO:0045766         positive regulation of angiogenesis       129          33
## 44 GO:0001525                                angiogenesis       419          99
## 45 GO:0098810                   neurotransmitter reuptake        25          11
## 46 GO:0008584                      male gonad development       114          30
## 47 GO:0090090 negative regulation of canonical Wnt sig...       114          30
## 48 GO:0002052 positive regulation of neuroblast prolif...        29          12
## 49 GO:0051966 regulation of synaptic transmission, glu...        60          19
## 50 GO:0019221         cytokine-mediated signaling pathway       330          75
##    Expected fis.w01 fis.elim  fis.wt fis.classic
## 1    158.07 2.1e-10  8.0e-09 7.7e-18     8.3e-25
## 2      5.40 3.4e-08  3.4e-08 1.00000     3.5e-08
## 3     20.27 2.0e-06  7.7e-07 0.34317     8.0e-07
## 4     26.21 1.1e-06  1.4e-06 1.5e-06     1.5e-06
## 5    174.82 2.0e-10  1.6e-06 1.1e-06     3.5e-24
## 6     10.00 5.8e-05  2.0e-06 1.00000     2.0e-06
## 7     74.58 6.9e-07  3.3e-06 6.8e-06     1.8e-09
## 8      4.73 3.7e-06  3.7e-06 1.00000     3.7e-06
## 9      4.73 3.7e-06  3.7e-06 3.7e-06     3.7e-06
## 10    12.83 0.00010  4.2e-06 1.00000     4.3e-06
## 11     1.76 6.3e-06  6.3e-06 1.00000     6.4e-06
## 12     6.48 6.3e-06  6.3e-06 1.00000     6.5e-06
## 13    42.96 5.4e-06  6.9e-06 1.00000     9.9e-14
## 14    35.67 3.3e-07  7.2e-06 1.00000     7.1e-13
## 15    19.18 7.7e-06  7.7e-06 1.00000     8.0e-06
## 16    53.10 0.00023  8.5e-06 1.00000     2.5e-10
## 17     7.84 1.5e-06  1.1e-05 1.1e-05     1.1e-05
## 18     3.11 7.4e-05  1.1e-05 1.1e-05     1.1e-05
## 19    16.89 3.2e-05  1.7e-05 0.43664     2.4e-07
## 20     4.73 1.9e-05  1.9e-05 1.00000     2.0e-05
## 21     8.78 0.00129  2.3e-05 1.00000     2.4e-05
## 22     8.24 2.7e-05  2.7e-05 1.00000     2.7e-05
## 23    19.45 4.4e-07  2.7e-05 0.39536     2.8e-05
## 24     4.32 2.8e-05  2.8e-05 1.00000     2.8e-05
## 25     5.40 0.00059  2.9e-05 1.00000     3.0e-05
## 26     2.03 3.4e-05  3.4e-05 3.4e-05     3.4e-05
## 27    66.34 2.3e-05  4.0e-05 2.6e-14     6.3e-17
## 28     9.19 5.1e-05  5.1e-05 5.3e-05     5.3e-05
## 29     3.51 5.5e-05  5.5e-05 1.00000     5.5e-05
## 30     1.76 7.4e-05  7.4e-05 1.00000     7.5e-05
## 31    14.05 0.00060  8.1e-05 1.00000     8.3e-05
## 32    17.56 0.00068  8.5e-05 0.63488     8.8e-05
## 33    17.02 0.00010  0.00010 0.00011     0.00011
## 34    18.51 0.00011  0.00011 1.00000     0.00012
## 35     6.35 7.2e-05  0.00012 4.5e-06     4.5e-06
## 36    13.24 0.00016  0.00012 9.1e-07     9.1e-07
## 37    26.89 0.00054  0.00014 0.00725     0.00015
## 38    16.62 0.00039  0.00015 1.00000     0.00016
## 39    45.94 0.00289  0.00015 3.1e-09     3.1e-09
## 40     1.49 0.00016  0.00016 0.00016     0.00016
## 41   110.24 0.01385  0.00016 0.00594     1.7e-10
## 42     2.84 0.00017  0.00017 0.00017     0.00017
## 43    17.43 0.00038  0.00017 1.00000     0.00017
## 44    56.61 0.00195  0.00019 1.00000     9.5e-09
## 45     3.38 0.00019  0.00019 1.00000     0.00019
## 46    15.40 3.3e-05  0.00019 0.00020     0.00020
## 47    15.40 0.00019  0.00019 1.00000     0.00020
## 48     3.92 0.00020  0.00020 1.00000     0.00020
## 49     8.11 0.00052  0.00022 1.00000     0.00023
## 50    44.58 9.4e-06  0.00022 1.00000     2.9e-06

pdf(file="bp.elim.pdf")
par(cex = 0.5)
showSigOfNodes(GOdata.BP, score(go.bp.elim), firstSigNodes = 10, useInfo = 'all')
dev.off()

pdf(file="bp.w01.pdf")
par(cex = 0.5)
showSigOfNodes(GOdata.BP, score(go.bp.w01), firstSigNodes = 10, useInfo = 'all', )
dev.off()

pdf(file="bp.wt.pdf")
showSigOfNodes(GOdata.BP, score(go.bp.wt), firstSigNodes = 10, useInfo = 'all')
dev.off()

showSigOfNodes(GOdata.BP, score(go.bp.classic), firstSigNodes = 5, useInfo = 'all')

You can also run topGP using a Kolmogorov-Smirnov (KS) statistic test. However, topGO is finicky when running the KS statistic. Using the KS statistic works fine when using the weight01 or classic algorithms. But, the elim algorithm induces an error. I’ve reached out to the community about this and I’ve attempted to debug it based on feedback but I haven’t been to make it work.Regardless, we’ll run the KS statistic here with the weight01 algorithm to see how it compares to the Fisher results.

Generate a table of the results:

results.table.bp.ks=GenTable(GOdata.BP, ks.w01=go.bp.ks.w01, fis.w01=go.bp.w01, fis.elim=go.bp.elim, orderBy="ks.w01", ranksOf="fis.w01", topNodes=50)
results.table.bp.ks

##         GO.ID                                        Term Annotated Significant
## 1  GO:0008150                          biological_process     13430        1816
## 2  GO:0015031                           protein transport      1414         134
## 3  GO:0006281                                  DNA repair       565          14
## 4  GO:0051301                               cell division       582          36
## 5  GO:2000045 regulation of G1/S transition of mitotic...       150          15
## 6  GO:0008380                                RNA splicing       423          11
## 7  GO:0006338                        chromatin remodeling       544          25
## 8  GO:0006406                    mRNA export from nucleus        65           0
## 9  GO:0006886             intracellular protein transport       808          59
## 10 GO:0006511 ubiquitin-dependent protein catabolic pr...       605          23
## 11 GO:1902969                     mitotic DNA replication        16           0
## 12 GO:1990830 cellular response to leukemia inhibitory...        92          11
## 13 GO:0030071 regulation of mitotic metaphase/anaphase...        87           0
## 14 GO:0000281                         mitotic cytokinesis        86           2
## 15 GO:0043161 proteasome-mediated ubiquitin-dependent ...       437          16
## 16 GO:0009411                              response to UV       146          12
## 17 GO:0090307                    mitotic spindle assembly        69           3
## 18 GO:2000819    regulation of nucleotide-excision repair        27           0
## 19 GO:0048026 positive regulation of mRNA splicing, vi...        21           0
## 20 GO:0042168                      heme metabolic process        38           0
## 21 GO:0006915                           apoptotic process      1567         248
## 22 GO:0060261 positive regulation of transcription ini...        58           0
## 23 GO:0045893 positive regulation of DNA-templated tra...      1360         140
## 24 GO:0006417                   regulation of translation       383          21
## 25 GO:0032530      regulation of microvillus organization        11           0
## 26 GO:0006750            glutathione biosynthetic process        14           0
## 27 GO:0035720          intraciliary anterograde transport        18           0
## 28 GO:0042147     retrograde transport, endosome to Golgi        94           2
## 29 GO:1903146    regulation of autophagy of mitochondrion        37           3
## 30 GO:0018105             peptidyl-serine phosphorylation       250          23
## 31 GO:0031122        cytoplasmic microtubule organization        56           3
## 32 GO:0019082                    viral protein processing        30           0
## 33 GO:0006893          Golgi to plasma membrane transport        57           3
## 34 GO:0060271                             cilium assembly       344          28
## 35 GO:0006298                             mismatch repair        29           0
## 36 GO:0032968 positive regulation of transcription elo...        48           1
## 37 GO:0000398              mRNA splicing, via spliceosome       283           4
## 38 GO:2000781 positive regulation of double-strand bre...        83           1
## 39 GO:0006446      regulation of translational initiation        75           3
## 40 GO:0006261               DNA-templated DNA replication       156           4
## 41 GO:0051447 negative regulation of meiotic cell cycl...        10           0
## 42 GO:0036010            protein localization to endosome        26           2
## 43 GO:0051123 RNA polymerase II preinitiation complex ...        55           2
## 44 GO:0016078                      tRNA catabolic process        15           0
## 45 GO:0016567                      protein ubiquitination       662          36
## 46 GO:0000380 alternative mRNA splicing, via spliceoso...        68           2
## 47 GO:1902459 positive regulation of stem cell populat...        45           3
## 48 GO:0030968 endoplasmic reticulum unfolded protein r...        71           2
## 49 GO:0046827 positive regulation of protein export fr...        17           1
## 50 GO:0006357 regulation of transcription by RNA polym...      2044         210
##    Expected Rank in fis.w01  ks.w01 fis.w01 fis.elim
## 1   1814.43             551 5.6e-25   0.036     0.28
## 2    191.04            3931 2.7e-08   1.000     1.00
## 3     76.33            3933 8.6e-08   1.000     1.00
## 4     78.63            3932 1.2e-07   1.000     1.00
## 5     20.27            3788 1.6e-07   0.988     0.92
## 6     57.15            3912 3.2e-06   1.000     1.00
## 7     73.50            3924 4.7e-06   1.000     1.00
## 8      8.78            3935 8.2e-06   1.000     1.00
## 9    109.16            3901 1.1e-05   1.000     1.00
## 10    81.74            3896 1.3e-05   1.000     1.00
## 11     2.16            3936 1.7e-05   1.000     1.00
## 12    12.43            2902 2.8e-05   0.713     0.71
## 13    11.75            3937 3.1e-05   1.000     1.00
## 14    11.62            3905 3.8e-05   1.000     1.00
## 15    59.04            3927 3.8e-05   1.000     1.00
## 16    19.73            3846 5.2e-05   0.997     0.98
## 17     9.32            3864 8.7e-05   0.998     1.00
## 18     3.65            3938 0.00013   1.000     1.00
## 19     2.84            3939 0.00013   1.000     1.00
## 20     5.13            3940 0.00016   1.000     1.00
## 21   211.71            2471 0.00017   0.539     0.51
## 22     7.84            3941 0.00018   1.000     1.00
## 23   183.74            3919 0.00036   1.000     1.00
## 24    51.74            3909 0.00041   1.000     1.00
## 25     1.49            3942 0.00042   1.000     1.00
## 26     1.89            3943 0.00044   1.000     1.00
## 27     2.43            3944 0.00044   1.000     1.00
## 28    12.70            3910 0.00048   1.000     1.00
## 29     5.00            3554 0.00055   0.915     0.89
## 30    33.78            3855 0.00057   0.997     0.99
## 31     7.57            3801 0.00068   0.990     0.99
## 32     4.05            3945 0.00069   1.000     1.00
## 33     7.70            3946 0.00071   1.000     0.99
## 34    46.48            3837 0.00071   0.995     1.00
## 35     3.92            3947 0.00077   1.000     1.00
## 36     6.48            3878 0.00081   0.999     1.00
## 37    38.23            3928 0.00082   1.000     1.00
## 38    11.21            3915 0.00087   1.000     1.00
## 39    10.13            3948 0.00094   1.000     1.00
## 40    21.08            3949 0.00100   1.000     1.00
## 41     1.35            3950 0.00103   1.000     1.00
## 42     3.51            3951 0.00116   1.000     0.88
## 43     7.43            3849 0.00123   0.997     1.00
## 44     2.03            3952 0.00125   1.000     1.00
## 45    89.44            3925 0.00127   1.000     1.00
## 46     9.19            3953 0.00128   1.000     1.00
## 47     6.08            3665 0.00134   0.953     0.95
## 48     9.59            3885 0.00157   1.000     1.00
## 49     2.30            3534 0.00164   0.915     0.92
## 50   276.15            3929 0.00169   1.000     1.00

You can see that using the KS statistic here reports much more uninformative terms than the Fisher’s statistic does. For reasons we won’t get into, the KS test has been documented to be more susceptible to outliers compared to the Fisher’s test. Thus, I typically go with the Fisher’s test.

5. Using online tools for GO analysis

I laid out why I prefer topGO above. There are, of course, a number of other tools that can be used to perform GO analysis. If you would rather use another tool, I recommend using EnrichR. The group who maintains EnrichR attempts to keep the databases up to date and it’s really simple to use.

EnrichR can be reached here: EnrichR

When you use EnrichR, simply input the list of DE genes into the box. Make sure to select the backgroud option in the description above and input your reference gene list. EnrichR will input a default list here, so make sure you select all of the default genes, delete them, and then add your list. Again, do NOT use EnrichR, or any other tool, without including the reference set of genes that is specific to your experiment.

When interpreting your results, keep in mind that the statistics used by EnrichR are slightly different than what is used in topGO. The Fisher’s test is used but there are nuanced differences. You can read more about the specifics in this article .

Bulk RNA-seq pipeline - GO Analysis

2024-05-09