Create a directory named data
in your current working directory.
Loading packages and data.
library(pheatmap)
library(rafalib)
library(DESeq2)
library(pvclust)
library(biomaRt)
library(enrichR)
library(fgsea)
# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
# read data
download_data("data/gene_counts.csv")
data <- read.csv("data/gene_counts.csv",row.names = 1)
dim(data)
# read metadata
download_data("data/metadata_raw.csv")
metadata <- read.csv("data/metadata_raw.csv",row.names = 1,stringsAsFactors=T)
## [1] 10553 6
Here you would select genes based on p-values and logFC you obtained from differential expression.
res <- read.csv("data/dge_results.csv",row.names=1)
res1 <- na.omit(res)
selected_genes <- rownames(res1)[(res1$pvalue < 0.001) & (abs(res1$log2FoldChange) > 1)]
There are many ways of selecting gene lists. One could for example select only the top 500 variable genes. CPM counts are converted to log scale and row-wise variance is computed.
# Gene selection
vars <- apply(data,1,var)
vars <- sort(vars,decreasing=T)
top_var <- names(vars)[1:500]
Now that you understand the concepts of hierarchical clustering both at the sample and at the gene level, we can use a heatmap function to explore the visual consequences of clustering. Here, we can make use of the pheatmap()
function, which by default will do the clustering of the rows and columns.
cl <- pheatmap( data[rownames(data) %in% selected_genes,],scale="row",color=colorRampPalette(c("navy","white","firebrick"))(90),border_color=NA,cluster_cols=F,cutree_rows=2)
gene_clusters <- cutree(cl$tree_row,k=2)
R-packages | Comments |
---|---|
topGO | GO |
goana | GO |
GOseq | GO |
topKEGG | KEGG |
kegga | KEGG |
enrichR | GO, KEGG, many others |
piano | GO, KEGG, GSEA, many others, enrichment consensus |
ClusterProfiler | GO, KEGG, GSEA, many others, nice plots! |
Pathview | Nice visualization for KEGG pathways |
fgsea | GSEA |
You can list all available databases by using the command listEnrichrDbs()
function.
genes_cluster1 <- names(gene_clusters)[gene_clusters == 1]
genes_cluster2 <- names(gene_clusters)[gene_clusters == 2]
head(data[genes_cluster1, ])
head(data[genes_cluster2, ])
go_cluster <- enrichr(genes = genes_cluster2,databases = "GO_Biological_Process_2018")
go_cluster <- go_cluster$GO_Biological_Process_2018
go_cluster <- go_cluster[order(go_cluster$P.value),]
go_cluster[1:5,]
{
mypar(1,1,mar=c(2,25,2,2))
barplot(-log10(go_cluster$P.value[15:1]),horiz=T,border=F,yaxs="i",
names.arg=go_cluster$Term[15:1],las=1,cex.names=0.8)
abline(v=0,lwd=2)
}
Exploratory Questions
kegg_cluster <- enrichr(genes=genes_cluster2,databases="KEGG_2019_Human")
kegg_cluster <- kegg_cluster$KEGG_2019_Human
kegg_cluster <- kegg_cluster[order(kegg_cluster$P.value),]
kegg_cluster[1:5,]
{
mypar(1,1,mar=c(2,25,2,2))
barplot(-log10(kegg_cluster$P.value[15:1]),horiz=T,border=F,yaxs="i",
names.arg=kegg_cluster$Term[15:1],las=1,cex.names=0.8)
abline(v=0,lwd=2)
}
Exploratory Questions
MSigDB is one of the largest curated databases of signatures and pathways created by the BROAD INSTITUTE. The Molecular Signatures Database (MSigDB) is accessible here.
For this exercise, you can download a couple of data sets:
Database | Link to Download |
---|---|
KEGG | c2.cp.kegg.v6.2.symbols.gmt |
GO_BP | c5.bp.v6.2.symbols.gmt |
HALLMARK | h.all.v6.2.symbols.gmt |
As you could already notice, the differences in gene expression between days 0 to 7 are very clear. Here we will illustrate a gene set enrichment to explore which pathways are UP or DOWN regulated.
Here you would use the logFC values you obtained from differential expression. But for simplicity and reduce dependency on other parts of the course, we are using a simple difference in mean expression.
# if folder and files doesn't exist, download it
if(!dir.exists("data/MSigDB_files/")) {
dir.create("data/MSigDB_files")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/c2.cp.kegg.v6.2.symbols.gmt.txt","./data/MSigDB_files/c2.cp.kegg.v6.2.symbols.gmt.txt")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/c3.tft.v6.2.symbols.gmt.txt","./data/MSigDB_files/c3.tft.v6.2.symbols.gmt.txt")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/h.all.v6.2.symbols.gmt.txt","./data/MSigDB_files/h.all.v6.2.symbols.gmt.txt")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/c2.cp.v6.2.symbols.gmt.txt","./data/MSigDB_files/c2.cp.v6.2.symbols.gmt.txt")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/c5.bp.v6.2.symbols.gmt.txt","./data/MSigDB_files/c5.bp.v6.2.symbols.gmt.txt")
}
# Create a gene rank based on the gene expression
gene_rank <- setNames( res$log2FoldChange,casefold(rownames(res),upper=T) )
# Load hallmark pathways
hallmark_pathways <- gmtPathways("data/MSigDB_files/h.all.v6.2.symbols.gmt.txt")
Once our list of genes are sorted (from highest to lowest expression in day 7), we can proceed with the enrichment itself. Here, we will be using the fgsea()
function. This will result in a table containing information for several pathways. We can then sort and filter those pathways to visualize only the top ones. You can select/filter them by either p-value
or normalized enrichemnet score (NES
).
# Perform enrichemnt analysis
fgseaRes <- fgsea( pathways=hallmark_pathways, stats=gene_rank, minSize=15, maxSize=500, nperm=10000)
# Filter the results table to show only the top 10 UP and DOWN regulated processes (optional)
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
# Nice summary table (shown as a plot)
plotGseaTable(hallmark_pathways[topPathways], gene_rank, fgseaRes, gseaParam = 0.5)
Exploratory Questions
-res$log2FoldChange
as gene rank ? What changes and why?Checking for individual enrichment:
topPathways
source("https://raw.githubusercontent.com/czarnewski/niceRplots/master/R/helper_functions.R")
mypar(2,3)
# Nice plot to see enrichment of one specific pathway
p <- list()
for (i in c(topPathwaysUp[1:3],topPathwaysDown[3:1])){
plot_enrich(i,hallmark_pathways,gene_rank)
# p[[i]] <- plotEnrichment(hallmark_pathways[[i]], gene_rank) + ggplot2::ggtitle(i)
}
# p
Exploratory Questions
Exploratory Questions (optional)
Below you can find a list with the most commonly used online enrichment tools. Each one differs a bit on what they can do. The packages are sorted in order, we would like you to work on the workshop.
Package | link | Database | Comment |
---|---|---|---|
Enrichr | http://amp.pharm.mssm.edu/Enrichr/ | GO, KEGG, TF, many others | Extensive libraries |
GOrilla | http://cbl-gorilla.cs.technion.ac.il | GO | Support for REVIGO |
REVIGO | http://revigo.irb.hr | GO | Summarises redundancy |
DAVID | https://david.ncifcrf.gov | GO, KEGG, TF, many others | *Not updated |
KEGG | https://www.genome.jp/kegg/ | KEGG | Shows the pathways |
Reactome | https://www.reactome.org | KEGG-like | Shows the pathways/reactions |
Panther | http://www.pantherdb.org/about.jsp | GO | Evolutionary conserved GO annotation |
In case you want to test the online tools above, you can use the code below to copy the gene vector into memory, and then paste it to the webtools. Should work for both Mac and Windows users.
clip <- pipe("pbcopy","w")
write.table(genes_cluster1, file=clip, sep = '\t', row.names = FALSE)
close(clip)
In the course we will not have much time to work on gene expression networks. However, there are some nice databases that allow you to easily visualize protein-protein interactions. Please paste the list of genes using the command above into those sites to have a visual insight on how the Up- or Down- regulated genes interact with each other.
Database | Link |
---|---|
GeneMANIA | https://genemania.org |
STRING-DB | https://string-db.org |
MIST | https://fgrtools.hms.harvard.edu/MIST/help.jsp |