Create a directory named data
in your current working directory for input and output files.
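A minimal way to do this from within R (a sketch; you can also create the folder manually):
#create the "data" directory if it does not already exist
if( !dir.exists("data") ){ dir.create("data") }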
Load R packages.
library(pheatmap) #plot heatmap
library(DESeq2) #differential gene expression for RNA-seq
library(pvclust) #clustering bootstrapping
library(biomaRt) #gene annotation
library(rafalib) #nice plot arrangement
rafalib::mypar(mar=c(6,2.5,2.5,1)) #sets nice arrangement for the whole document
# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
Load the sequencing data and the metadata with sample information.
# Load data and metadata
download_data("data/counts_tpm.csv")
data <- read.csv("data/counts_tpm.csv",header=TRUE,stringsAsFactors=FALSE,row.names=1)
# if file doesn't exist, download it
download_data("data/metadata_raw.csv")
metadata <- read.csv("data/metadata_raw.csv",header=TRUE,stringsAsFactors=TRUE,row.names=1)
A useful step is to convert ENSEMBL gene/transcript IDs into meaningful gene names. This can be done in many ways, for example using a GTF file or using the biomaRt package in R.
download_data("data/mouse_genes.txt")
mouse <- biomaRt::useMart(biomart="ensembl", dataset="mmusculus_gene_ensembl")
annot <- biomaRt::getBM(attributes = c("ensembl_gene_id","external_gene_name"),mart = mouse,useCache=FALSE)
gene_names <- as.character( annot[match(rownames(data),annot[,"ensembl_gene_id"]),"external_gene_name"] )
gene_names[is.na(gene_names) ] <- ""
Remove missing annotations, fuse duplicated ones, and sum transcript counts into gene counts.
data <- rowsum(data, group = as.character(gene_names) ) #assign gene names and sum duplicates
data <- data[ rownames(data) != "" , ] #remove blank annotation
data <- data[ rownames(data) != "NA" , ] #remove blank annotation
data <- data[ order(rownames(data)),] #order genes alphabetically
dim(data)
## [1] 10573 6
We can now save these counts for later use.
write.csv(data ,"data/gene_tpm_counts.csv",row.names=T)
We repeat the same annotation steps for the filtered counts table.
# if file doesn't exist, download it
download_data("data/counts_filtered.csv")
cf <- read.csv("data/counts_filtered.csv",header=TRUE,stringsAsFactors=FALSE,row.names=1)
cf <- rowsum(cf, group = as.character(gene_names) ) #assign gene names and sum duplicates
cf <- cf[ rownames(cf) != "" , ] #remove blank annotation
cf <- cf[ rownames(cf) != "NA" , ] #remove blank annotation
cf <- cf[ order(rownames(cf)),] #order genes alphabetically
write.csv(cf ,"data/gene_counts.csv",row.names=T)
Performing PCA has many useful applications and interpretations, which depend much on the data used. In the case of life sciences, we want to segregate samples based on gene expression patterns in the data.
Now that the data is prepared, we can proceed with PCA. Since each gene has a different expression level, genes with higher expression values will naturally have higher variation, which will be captured by PCA. This means that we need to somehow give each gene a similar weight before performing PCA (see below). The common practice is to center and scale each gene before performing PCA. This exact scaling is called Z-score normalization, and it is very useful for PCA, clustering and plotting heatmaps.
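Concretely, for each gene \(g\) with mean \(\mu_g\) and standard deviation \(\sigma_g\) across samples, the Z-score of its expression \(x_{gs}\) in sample \(s\) is:

\[z_{gs} = \frac{x_{gs} - \mu_g}{\sigma_g}\]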
Znorm <- t(apply(data,1,function(x) scale(x,center=T,scale=T)))
colnames(Znorm) <- colnames(data)
{
mypar(1,2,mar=c(5,3,2,1))
boxplot(t(data[1:30,]),ylim=c(0,12),las=2,col="grey",main="raw data",cex=.2)
boxplot(t(Znorm[1:30,]),ylim=c(-4,4),las=2,col="grey",main="Z-score data",cex=.2)
abline(h=0,col="red",lty=2)
}
As you can appreciate above, Z-score normalization has set each gene's mean to 0 and brought the variances to a common scale. However, if we normalize all genes to the same scale, even genes that could be considered stable across samples (i.e. that do not vary across your samples) will be used for the separation of samples. This might in turn add ‘noise’ to your data, in the form of variation across samples that is not meaningful. Therefore, we can also sort the genes by variance and use only the top genes to filter/prioritize the data.
#Compute the mean and variance for each gene
gene_stats <- data.frame(row.names=rownames(data))
means <- apply(data,1,mean)
vars <- apply(data,1,var)
#Plot the row means versus row variances
{
mypar(1,2,mar=c(5,3,2,1))
plot(means,vars,cex=.1)
}
#Sort and select the top 100 highly variable genes from the data
vars <- sort(vars,decreasing=T)
top_var <- names(vars)[1:100]
boxplot(t(data[top_var[1:15],]),ylim=c(0,12),las=2,col="grey", main="data (top var genes)",cex=.2)
Next, we can compute the PCA using the prcomp() function. Through its additional parameters, we can center and scale each gene directly inside the function call. This is the same exact Z-score scaling as above, just performed inside the function.
{
mypar(1,2,mar = c(3,3,2,1))
PC <- prcomp( t( Znorm[ top_var, ]) ) #Method 1: PCA on the pre-computed Z-scores
PC <- prcomp( t( data[ top_var, ]), center = TRUE, scale. = TRUE) #Method 2: equivalent, scaling done inside prcomp()
plot(PC$x[,1],PC$x[,2],cex=2,col=factor(metadata$Group),xlab="PC1",ylab="PC2",pch=16,main="PCA",las=1)
text(PC$x[,1],PC$x[,2],cex=.7,labels = paste0(metadata$SampleName),pos=3)
plot(PC$x[,3],PC$x[,4],cex=2,col=factor(metadata$Group),xlab="PC3",ylab="PC4",pch=16,main="PCA",las=1)
text(PC$x[,3],PC$x[,4],cex=.7,labels = paste0(metadata$SampleName),pos=3)
}
Exploratory Questions
The usefulness of PCA is that the principal components do have a meaning: they store the amount of variance in decreasing order, so some PCs are more important than others. Inside the PC$sdev object, we can get the standard deviation stored in each PC.
PC_sd <- setNames(PC$sdev,paste0("PC",1:length(PC$sdev)))
PC_var_expl <- (PC_sd^2)/sum(PC_sd^2)*100
{
mypar(1,1)
barplot(PC_var_expl,las=2,ylab="% variance explained")
abline(h=10,lty=2)
}
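Since the PCs are sorted by the variance they explain, a natural follow-up (a small sketch beyond the original code; the 90% threshold is an arbitrary choice) is to ask how many PCs are needed to reach a given fraction of the total variance:
#cumulative variance explained by the first k PCs
PC_var_cum <- cumsum(PC_var_expl)
#number of PCs needed to explain at least 90% of the total variance
min( which( PC_var_cum >= 90 ) )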
Exploratory Questions
Now that you know that each PC stores a particular structure of the data, we can also explore which genes are most responsible for that separation (a.k.a. leading genes). For that, we can take a look inside the PC object.
leading_genes <- PC$rotation
head(leading_genes)
leading_PC1 <- sort(leading_genes[,1],decreasing=T)
leading_PC2 <- sort(leading_genes[,2],decreasing=T)
{
mypar(1,2,mar=c(3,4,2,2))
barplot(leading_PC1[15:1],las=2,horiz=T,cex.names=.8,cex.axis=0.8,yaxs="i")
abline(v=0,lwd=2)
barplot(leading_PC2[15:1],las=2,horiz=T,cex.names=.8,cex.axis=0.8,yaxs="i")
abline(v=0,lwd=2)
}
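As a quick sanity check (a sketch, not part of the original code), we can plot the Z-scores of the gene with the largest positive PC1 loading across samples; it should follow the sample separation seen along PC1:
#gene with the strongest positive loading on PC1
top_gene <- names(leading_PC1)[1]
{
mypar(1,1)
barplot( Znorm[top_gene,], las=2, col=as.integer(factor(metadata$Group)), main=paste0("Z-score of ",top_gene) )
}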
Exploratory Questions
Hierarchical clustering is a group of clustering methods used to group samples based on a hierarchy. Hierarchical clustering is done in two steps: first, the pairwise distances between samples are computed; then, the samples are joined into a hierarchy based on those distances (linkage).
The base R stats package already contains a function dist() that calculates distances between all pairs of samples. Since we want to compute distances between samples, rather than among genes, we need to transpose the data before passing it to the dist() function. This can be done by simply adding the transpose function t() to the data. When clustering on genes, it is wise to first define the gene selection to reduce the time to compute distances. A sensible choice is to do it on the differentially expressed genes only (including up to ~3000 genes).
The distance methods available in dist() are: euclidean, maximum, manhattan, canberra, binary or minkowski.
d <- dist( t(data) , method="euclidean")
d
Exploratory Questions
As you might have realized, correlation is not a method implemented in the dist() function. However, we can create our own distances and transform them to a distance object.
We can first compute sample correlations using the cor() function. As you might already know, correlations range from -1 to 1, where 1 indicates that two samples are closest, -1 indicates that two samples are the furthest and 0 is somewhat in between. This, however, creates a problem in defining distances, because a distance of 0 indicates that two samples are closest, a distance of 1 indicates that two samples are the furthest, and a distance of -1 is not meaningful. We thus need to transform the correlations to a positive scale (a.k.a. adjacency):
\[adj = -\frac{cor - 1}{2}\]
Once we have transformed the correlations to a 0-1 scale, we can simply convert them to a distance object using the as.dist() function. The transformation does not need to have a maximum of 1, but it is more intuitive to have it at 1, rather than at any other number.
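As a quick numeric check of this transformation, correlations of 1, 0 and -1 should map to distances of 0, 0.5 and 1, respectively:
#sanity check: cor = 1, 0, -1 map to dist = 0, 0.5, 1
-( c(1,0,-1) - 1 )/2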
#Compute sample correlations
sample_cor <- cor( data )
round(sample_cor,4)
pheatmap(sample_cor)
#Transform the scale from correlations
cor_distance <- -(sample_cor-1)/2
round(cor_distance,4)
pheatmap(cor_distance)
#Convert it to a distance object
d2 <- as.dist(cor_distance)
d2
Exploratory Questions
After having calculated the distances between samples, we can now proceed with the hierarchical clustering per se. We will use the function hclust() for this purpose, which we can simply run on the distance objects created above.
The linkage methods available are: ward.D, ward.D2, single, complete, average, mcquitty, median or centroid.
#Clustering using euclidean distance
{
mypar(1,2,mar=c(6,4,2,1))
h <- hclust(d,method="complete")
plot( as.dendrogram(h),las=1,main="d=euclidean\nh=complete")
points(1:ncol(data),rep(0,ncol(data)),pch=16,cex=2,col=metadata$Group[h$order])
}
h2 <- hclust(d2,method="complete")
{
plot( as.dendrogram(h2),las=1, main="d=correlation\nh=complete")
points(1:ncol(data),rep(0,ncol(data)),pch=16,cex=2, col=metadata$Group[h2$order])
}
Exploratory Questions
Try a different linkage method, e.g. ward.D2. How does it affect your results?
Once your dendrogram is created, the next step is to define which samples belong to a particular cluster. However, the sample groups are already known in this example, so clustering them does not add much information for us. What we can do instead is subdivide the genes into clusters. As for the PCA (above), the ideal scenario is to use the Z-score normalized gene expression table, because this way we make sure that we are grouping together expression trends (going up vs. down), rather than expression levels (genes with more counts vs. fewer counts). We can thus simply repeat the steps above using the transpose of the Z-score matrix, compute the correlation distances and cluster using the complete linkage method. Since it is time consuming to cluster on all genes, this step is usually done only on the differentially expressed gene list (say up to ~3000 genes). Here, we can for instance cluster on the genes with the highest variance, top_var.
gene_cor <- cor(t(Znorm[top_var, ]))
gene_dist <- as.dist(-(gene_cor-1)/2)
gene_clus <- hclust(gene_dist,method="complete")
After building the dendrogram for the genes (below), we can now literally cut the tree at a fixed threshold (with cutree()) at different levels to define the clusters.
HEIGHT <- 0.9
gene_clusters <- cutree(gene_clus,h=HEIGHT)
gene_clusters
{
mypar(1,1,mar=c(6,4,2,1))
plot( as.dendrogram(gene_clus),las=1,main="d=correlation\nh=complete")
rect.hclust(gene_clus,h=HEIGHT)
abline(h=HEIGHT,col="red",lty=2)
points(1:length(gene_clusters),rep(0,length(gene_clusters)),pch=16,cex=2, col=factor(gene_clusters)[gene_clus$order])
legend("topright",levels(factor(gene_clusters)),pch=16,col= factor(levels(factor(gene_clusters))))
}
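To inspect the clusters a bit further (an illustrative sketch; cluster 1 is picked arbitrarily), we can count how many genes fall into each cluster and plot the average Z-score profile of one cluster, which shows the expression trend it captures:
#number of genes per cluster
table(gene_clusters)
#average Z-score profile of the genes in cluster 1
cluster1_genes <- names(gene_clusters)[ gene_clusters == 1 ]
{
mypar(1,1)
barplot( colMeans( Znorm[cluster1_genes, , drop=FALSE] ), las=2, col="grey", main="mean Z-score, cluster 1" )
abline(h=0, col="red", lty=2)
}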
Exploratory Questions
Now that you understand the concepts of hierarchical clustering both at the sample and at the gene level, we can use a heatmap function to explore the visual consequences of the clustering. Here, we can make use of the pheatmap function, which by default will do the clustering of the rows and columns.
pheatmap( data[top_var,] , scale="row" , color = colorRampPalette(c("navy","white","firebrick"))(90), border_color=NA, cluster_cols=F)
Exploratory Questions
Check which other parameters are available in the pheatmap function (e.g. by pressing tab inside the function call).
Let’s say that you have 12 samples that all have the exact same expression level across, say, 12000 genes, and to that you add 1 single gene which is more expressed in samples 7-12 than in samples 1-6. If you ran a bootstrapping on this mock example, you would see that the samples appear to cluster very well even though only 1 out of 12000 genes makes that separation possible.
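A minimal sketch of this mock example (all values invented for illustration):
#mock data: 12000 genes with identical expression across 12 samples
mock <- matrix( 5, nrow=12000, ncol=12, dimnames=list(paste0("gene",1:12000), paste0("sample",1:12)) )
#add 1 gene that is higher in samples 7-12 than in samples 1-6
mock <- rbind( mock, special_gene=rep(c(1,10), each=6) )
#the samples still split into two clean clusters, driven by that single gene
plot( hclust( dist( t(mock) ) ) )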
One way to measure clustering robustness/accuracy is by selecting part of the data set (say 90% of the genes), performing the clustering and recording which samples fall together. Then, you repeat (iterate) with another selection of 90% of the genes, up to 1000 times, recording the clustering results. In the end, you compare the results of all 1000 clusterings and check how often the same samples fell into the same group. This procedure is called bootstrapping, and it is measured as the “percentage of times an event can occur”. For more details on this, please read the pvclust webpage, which is full of examples:
pvclust website: http://stat.sys.i.kyoto-u.ac.jp/prog/pvclust/
pvclust paper: https://academic.oup.com/bioinformatics/article/22/12/1540/207339
Note that the bootstrapping procedure is very time consuming and computationally intensive. For this reason, we will run it only on the top 100 most variable genes in the dataset.
# Clustering on our dataset (this step is slow)
pvc <- pvclust( data = data[top_var[1:100],] , method.dist = "correlation", method.hclust = "complete")
{
mypar()
plot(pvc,las=2,hang = -0.5)
pvrect(pvc, alpha = 0.9)
points(1:ncol(data) ,rep(0,ncol(data)), pch= 16, cex=2, col=metadata$Group[pvc$hclust$order])
}
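To extract the clusters with high bootstrap support programmatically (rather than reading them off the plot), pvclust provides the pvpick() function; here with the same alpha as used in pvrect() above:
#clusters with approximately unbiased (AU) p-value >= 0.90
pvpick(pvc, alpha=0.9)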
Exploratory Questions