This module will pick up from the previous module where we performed DE analysis with DESeq2. I’ve tried to set the modules up so that they can be used independently if you want. In this module, we’ll go over making volcano plots and heat maps. First, we need to load the packages we’re going to use. Then, we will need to load up the data tables we exported in the previous module as dataframes to work with here.
Similar to part 1, we first want to load all of the software packages that we’ll use and set the working directory to read from. As in part 1, the code here will check whether a package is installed and, if not, install it. We’ll be using functions within the tidyverse to build our volcano plot and heatmap. Note that these are NOT Bioconductor packages and are, thus, loaded using the standard R installer.
Package installation:
packages<-c("tidyverse", "RColorBrewer", "ggrepel")
installed<- packages %in% rownames(installed.packages())
if (any(installed == FALSE)) {
install.packages(!installed)
}
library(tidyverse) # includes ggplot2, for data visualisation. dplyr, for data manipulation.
library(RColorBrewer) # for a colourful plot
library(ggrepel) # for nice annotations
Set working directory:
knitr::opts_knit$set(root.dir = '/Users/aaron/Dropbox (Personal)/Protocols/Bioinformatics/Bioinformatics_Scripts_From_Aaron/Bioinformatics Resources/')
To make heat maps we will need the DE gene table of all the genes in the experiment. We also want the gene list we made with the cutoffs for LFC of |1| and adjusted p value of 0.05. For heat maps, we will need the table of normalized counts.
de_matrix<-read.table("de_gene_matrix.txt", header=TRUE)
res_total_lfc1_padj005<-read.table("de_genes_lfc1_padj005.txt", header = TRUE)
norm_counts<-read.table("normalized_counts.txt", header=TRUE)
First, we’ll add a column to table designating whether the gene is differentially expressed and, if so, if it is up or down-regulated.
de_matrix$diffexpressed <- "NO"
de_matrix$diffexpressed[de_matrix$log2FoldChange >= 1 & de_matrix$padj <= 0.05] <- "UP"
de_matrix$diffexpressed[de_matrix$log2FoldChange <= -1 & de_matrix$padj <= 0.05] <- "DOWN"
For this volcano plot, we want to label points of particular genes of interest. To do this we will read in our list of genes from a file and make new column in the dataframe with them.
The list is in a text file called “interesting_genes.txt” and we will read it in using read_lines since the genes are in rows. If you have a list of genes where they are in a tab- or comma-separated (.txt, .tsv, or .csv) document with each gene in the column, instead use the read.csv function instead and set the appropriate delimiter.
interesting_genes<-read_lines("/Users/aaron/Dropbox/Protocols/Bioinformatics/Bioinformatics_Scripts_From_Aaron/Bioinformatics Resources/interesting_genes.txt")
interesting_genes ## peek and check that your list is in the appropriate format
## [1] "AMTN" "ODAM" "APOE" "TIMP3" "EFEMP1" "CNQTNF5" "UNC13D"
## [8] "ATP2B2" "SCD1" "SLC13A3" "CCN2" "MXRA5" "TF" "LRP8"
Next, we will make an abbreviated data frame of just these genes.
sig_genes <- de_matrix %>%
filter(symbol %in% interesting_genes) ## make a dataframe of just the interesting genes
sig_genes ## check the output to confirm its in the appropriate format
## baseMean log2FoldChange lfcSE stat pvalue padj
## 1 33956.436 -4.3042451 0.3014400 -14.278943 2.960570e-46 1.838950e-43
## 2 12491.669 -3.1746771 0.2354563 -13.483083 1.967119e-41 8.839061e-39
## 3 570000.091 0.9549403 0.1268847 7.526045 5.230024e-14 1.407043e-12
## 4 19225.903 2.7671743 0.4781842 5.786838 7.172388e-09 8.443348e-08
## 5 5458.339 4.4103046 0.5399598 8.167838 3.139652e-16 1.175644e-14
## 6 180481.648 1.7599103 0.1226100 14.353723 1.009675e-46 6.461613e-44
## 7 28819.406 2.3916526 0.2179730 10.972242 5.196747e-28 6.902522e-26
## 8 13252.322 -1.4290210 0.2572974 -5.553965 2.792610e-08 2.905278e-07
## 9 7230.000 -3.1188224 0.2605686 -11.969294 5.146442e-33 1.055220e-30
## 10 7348.127 -2.1412475 0.1863079 -11.493055 1.429645e-30 2.358803e-28
## 11 3954.095 -1.4064260 0.1391262 -10.108998 5.039691e-24 4.502265e-22
## 12 21913.790 6.0685857 0.7729579 7.851121 4.123360e-15 1.316089e-13
## ensembl_var ensembl symbol entrez diffexpressed
## 1 ENSG00000091513.16 ENSG00000091513 TF 7018 DOWN
## 2 ENSG00000092929.12 ENSG00000092929 UNC13D 201294 DOWN
## 3 ENSG00000100234.12 ENSG00000100234 TIMP3 7078 NO
## 4 ENSG00000101825.8 ENSG00000101825 MXRA5 25878 UP
## 5 ENSG00000109205.17 ENSG00000109205 ODAM 54959 UP
## 6 ENSG00000115380.20 ENSG00000115380 EFEMP1 2202 UP
## 7 ENSG00000118523.6 ENSG00000118523 CCN2 1490 UP
## 8 ENSG00000130203.10 ENSG00000130203 APOE 348 DOWN
## 9 ENSG00000157087.20 ENSG00000157087 ATP2B2 491 DOWN
## 10 ENSG00000157193.18 ENSG00000157193 LRP8 7804 DOWN
## 11 ENSG00000158296.14 ENSG00000158296 SLC13A3 64849 DOWN
## 12 ENSG00000187689.10 ENSG00000187689 AMTN 401138 UP
Now we can construct our volcano plot and use the objects we set up in parts 4.1 and 4.2 to label and annotate it.
First, set up a color scheme. This is optional as ggplot has a default scheme. But, setting up your own allows you to have more control over the appearance.
cols <- c("UP" = "red", "DOWN" = "blue", "NO" = "grey") ## The color scheme is set to label points based on the value in the diffexpressed column which we set up in part 4.1.
Now, generate the plot. Each annotation function is described in the comments to the right:
volcano<-ggplot(data = de_matrix,
aes(x = log2FoldChange,
y = -log10(padj))) +
geom_point(aes(colour = diffexpressed), ## the points will be colored different based on the value in the diffexpressed column. This uses the default color scheme but we will override that with the scale_color_manual function below.
alpha = 0.75,
shape = 16,
size = 1) +
geom_point(data = sig_genes, ## this well draw a circle around points that correspond to our genes of interest.
shape = 21,
size = 2,
colour = "black") +
geom_vline(xintercept = c(-1, 1), ## draws the vertical line showing the thresholds for log2FC
col = "black",
linetype = 'dashed') +
geom_hline(yintercept = -log10(0.05), ## draws the horizontal line showing the threshold for adjusted p value
col = "black",
linetype = 'dashed') +
scale_color_manual(values = cols, ## this overrides the color scheme from above and labels the legend
labels = c("Downregulated", "Not significant", "Upregulated")) +
labs(colour = 'Expression', ## labels the axes
x = expression("log"[2]*"FC"),
y = expression("-log"[10]*"p-value")) +
geom_label_repel(data = sig_genes,
## labels the points of genes of interest. The geom_label_repel function attempts to arrange the labels so that they are easily viewed on the plot.,
size = 2.5,
aes(label = symbol),
box.padding = 1,
force = 2,
nudge_y = 3,
show.legend = FALSE)
volcano
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).
The image shown here may vary from the the actual volcano plot that you export. The following code will export as a PDF which can be further manipulated in Illustrator if necessary.
pdf("volcano.pdf", width = 10, height = 13)
plot(volcano)
invisible(dev.off())
Now we’ll make a heat map of all the genes. This is simply just another way to visualize how the genes are expressed between the groups. There are a number of different packages you can use to make heatmaps but R’s heatmap function is simple and really useful for simple visualization.
First, we have to convert the normalized counts object into a numeric matrix. We then use the heatmap function in the base R to make the heatmap.
mat <- data.matrix(norm_counts)
heatmap(mat, scale="row", na.rm=TRUE)
## Here, we've scaled the colorization on the row which is the gene. It's a sort of normalization to account for the distribution of the counts across all of the genes.
pdf("heatmap.pdf", width = 10, height = 13)
heatmap(mat, scale="row", na.rm=TRUE)
invisible(dev.off())
By default, the R heatmap function performs heriarchial clustering (this can be turned off if you want to retain the structure of the matrix in the heatmap.
We can see that the fetal and adult samples cluster together but the clustering for the genes is less obvious. There are definitely areas that are “opposing” but it’s likely that most of the variation falls into a relatively small group of genes.
As mentioned above, there are a number of tools to generate a heatmap. Some of these other tools are more optimal for sequencing data and are really useful when you want to have more control over the parameters. But for simply graphing the entire gene set, the base R heatmap function is fine.
Now that we have some basic renderings of the data, we can start to tease out the interesting parts of the data set. Next, we’ll go over gene ontology and pathway analysis to identify mechanisms involved with our DE genes using both standard enrichment and gene set enrichment analysis (GSEA).