The files needed for this exercise are provided as URLs. To follow the tutorial exactly as written, save each file as a text file inside a directory labelled data in your working directory. Alternatively, you can pass the URL directly to the read.table() function, like this: read.table('url-to-file.txt',header=TRUE).
Data preprocessing is done in R. First, we read in the count table from this URL: https://raw.githubusercontent.com/NBISweden/course_rnaseq_slu/master/data/counts_raw.txt.
cr <- read.table("./data/counts_raw.txt",header=TRUE)
head(cr)
str(cr)
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6
## ENSG00000000003 321 303 204 492 455 359
## ENSG00000000005 0 0 0 0 0 0
## ENSG00000000419 696 660 472 951 963 689
## ENSG00000000457 59 54 44 109 73 66
## ENSG00000000460 399 405 236 445 454 374
## ENSG00000000938 0 0 0 0 0 1
## Sample_7 Sample_8 Sample_9 Sample_10 Sample_11 Sample_12
## ENSG00000000003 376 523 450 950 760 1436
## ENSG00000000005 0 0 0 0 0 0
## ENSG00000000419 706 1041 796 1036 789 1413
## ENSG00000000457 60 125 74 108 115 174
## ENSG00000000460 316 505 398 141 168 259
## ENSG00000000938 0 0 0 1 0 0
## 'data.frame': 59573 obs. of 12 variables:
## $ Sample_1 : int 321 0 696 59 399 0 0 844 535 493 ...
## $ Sample_2 : int 303 0 660 54 405 0 0 808 458 444 ...
## $ Sample_3 : int 204 0 472 44 236 0 1 495 290 308 ...
## $ Sample_4 : int 492 0 951 109 445 0 2 1065 722 511 ...
## $ Sample_5 : int 455 0 963 73 454 0 0 1037 693 523 ...
## $ Sample_6 : int 359 0 689 66 374 1 0 839 574 350 ...
## $ Sample_7 : int 376 0 706 60 316 0 1 723 406 370 ...
## $ Sample_8 : int 523 0 1041 125 505 0 0 1266 623 579 ...
## $ Sample_9 : int 450 0 796 74 398 0 0 984 545 497 ...
## $ Sample_10: int 950 0 1036 108 141 1 0 1044 487 637 ...
## $ Sample_11: int 760 0 789 115 168 0 1 877 446 548 ...
## $ Sample_12: int 1436 0 1413 174 259 0 1 1428 648 878 ...
The count data shows read counts across samples and genes. The columns denote samples and rows denote genes.
Read in the metadata from this URL: https://raw.githubusercontent.com/NBISweden/course_rnaseq_slu/master/data/metadata_raw.csv. Each row corresponds to a sample. The sample names can be added as row names.
mr <- read.csv2("./data/metadata_raw.csv",header=TRUE,stringsAsFactors=F)
rownames(mr) <- mr$Sample_ID
head(mr)
str(mr)
## Sample_ID Sample_Name Time Replicate Cell
## Sample_1 Sample_1 t0_A t0 A A431
## Sample_2 Sample_2 t0_B t0 B A431
## Sample_3 Sample_3 t0_C t0 C A431
## Sample_4 Sample_4 t2_A t2 A A431
## Sample_5 Sample_5 t2_B t2 B A431
## Sample_6 Sample_6 t2_C t2 C A431
## 'data.frame': 12 obs. of 5 variables:
## $ Sample_ID : chr "Sample_1" "Sample_2" "Sample_3" "Sample_4" ...
## $ Sample_Name: chr "t0_A" "t0_B" "t0_C" "t2_A" ...
## $ Time : chr "t0" "t0" "t0" "t2" ...
## $ Replicate : chr "A" "B" "C" "A" ...
## $ Cell : chr "A431" "A431" "A431" "A431" ...
The metadata columns are the sample ID, sample name, time point, replicate and cell line. It is a good idea to check that the number of columns of the count data matches the number of rows of the metadata. The column names of the count data must also match the row names of the metadata.
all.equal(colnames(cr),rownames(mr))
## [1] TRUE
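If this check had returned FALSE, the counts could be reordered to match the metadata before proceeding. A minimal sketch, assuming all sample names are present in both objects:
# reorder the count columns to follow the metadata row order
cr <- cr[,rownames(mr)]
all.equal(colnames(cr),rownames(mr))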
Let’s create a boxplot to visualise the distribution of counts.
boxplot(log10(as.matrix(cr)+1),ylab=expression('Log'[10]~'Read counts'),las=2,
main="Raw data")
The median count is zero in all samples: a sign that the data set would benefit from low-count filtering.
We can check if any samples need to be discarded based on the number of genes detected. We create a barplot of genes detected across samples.
{barplot(colSums(cr>5),ylab="Number of detected genes",las=2)
abline(h=median(colSums(cr>5)))}
None of the samples look bad enough to be removed.
What does cr>5 do? Why did we use 5? Is it better than using cr>0?
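One way to explore this is to compare the per-sample detection counts at the two cutoffs; a quick sketch:
# number of 'detected' genes per sample at each cutoff
colSums(cr>0)
colSums(cr>5)
# a cutoff of 5 ignores genes supported by only a handful of reads,
# which may be background noise rather than true expression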
And we can create a similar plot for detection rate across genes.
{barplot(rowSums(cr>5),xlab="Genes",ylab="Number of samples",las=2,names.arg="")
abline(h=median(rowSums(cr>5)))}
Some of the genes are not detected across most samples. These genes can be discarded. Below, rather than using zero as the minimum value for detection, we use a minimum of 1 count-per-million (CPM). This ignores noisy background counts. We keep genes with a minimum of 1 CPM in at least 2 samples (since each of the test groups consists of 3 samples).
# remove genes with low counts
keepgenes <- rowSums(edgeR::cpm(cr)>1) >= 2
cf <- cr[keepgenes,]
How would the results change if we used the total number of samples (i.e., 12 for this data set) in the code above?
boxplot(log10(as.matrix(cf)+1),ylab=expression('Log'[10]~'Read counts'),las=2,
main="Filtered data")
The missingness in the data set is reduced: the filtering has removed 44995 genes with low counts.
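The number of removed genes can be verified directly; a quick check:
# genes removed by the CPM filter (both give the same number)
sum(!keepgenes)
nrow(cr)-nrow(cf)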
Since no samples were discarded, the metadata file remains the same. We can check that the labels are in the same order in the counts and the metadata.
all.equal(colnames(cf),rownames(mr))
## [1] TRUE
At this point, we can save the filtered data.
write.table(cf,"./data/counts_filtered.txt",col.names=T,quote=F,sep="\t",dec=".")
The raw count data needs to be corrected for various biases before statistical inference. If the data set is to be used in an R package for differential gene expression such as DESeq2, edgeR or limma, the raw counts must be used directly, because these packages handle the correction and transformation internally.
For analyses other than DGE, the data set must be corrected before use. The most basic correction accounts for sequencing depth, and is achieved by rescaling the counts to counts-per-million.
We read in our filtered count data and metadata. You can use the files you have created, or read from the provided URLs. The filtered counts are here: https://raw.githubusercontent.com/NBISweden/course_rnaseq_slu/master/data/counts_filtered.txt and the metadata is here: https://raw.githubusercontent.com/NBISweden/course_rnaseq_slu/master/data/metadata_raw.csv.
cf <- read.table("./data/counts_filtered.txt",header=TRUE)
mr <- read.csv2("./data/metadata_raw.csv",header=TRUE,stringsAsFactors=F)
rownames(mr) <- mr$Sample_ID
We can use the function cpm() from the R package edgeR to convert counts to counts-per-million.
cc <- edgeR::cpm(cf)
boxplot(log10(as.matrix(cc)+1),ylab=expression('Log'[10]~'Read counts'),las=2,main="CPM")
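CPM is simply each count divided by its sample's total count, scaled to one million. As a sanity check, a manual computation (edgeR::cpm() on a plain matrix uses the column sums as library sizes by default, so the two should agree; cc_manual is just an illustrative name):
# CPM by hand: counts divided by library size, times one million
cc_manual <- t(t(cf)*1e6/colSums(cf))
all.equal(as.matrix(cc_manual),cc)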
But CPM data has some drawbacks. Gene length is not controlled for in this correction, so CPM is not suitable for comparing expression between genes within a sample. The library composition also varies from sample to sample, which makes it harder to compare one experiment to another. RPKM/FPKM normalisations correct for gene length, but they are not recommended because their values are not comparable between samples.
A better correction method that resolves these issues is TPM (transcripts-per-million). The code for computing TPM is simple.
#' @title Compute TPM from a read count matrix
#' @param counts A numeric data.frame of read counts with samples (columns) and genes (rows).
#' @param len A numeric vector of gene CDS lengths, equal in length to the number of rows of counts.
#'
#' https://support.bioconductor.org/p/91218/
#'
tpm <- function(counts,len) {
x <- counts/len
return(t(t(x)*1e6/colSums(x)))
}
We read in the annotation data from this URL: https://raw.githubusercontent.com/NBISweden/course_rnaseq_slu/master/data/human_genes.txt. Then, we remove duplicated Ensembl IDs and compute gene lengths.
g <- read.delim("./data/human_genes.txt",header=T,stringsAsFactors=F)
g <- g[!duplicated(g$ensembl_gene_id),]
g$len <- g$end_position-g$start_position
rownames(g) <- g$ensembl_gene_id
Next, we find shared genes between count data and annotation data and match their order.
igenes <- intersect(rownames(cf),g$ensembl_gene_id)
g1 <- g[igenes,]
cf1 <- cf[igenes,]
all.equal(rownames(cf1),g1$ensembl_gene_id)
## [1] TRUE
ct <- tpm(cf1,g1$len)
boxplot(log10(as.matrix(ct)+1),ylab=expression('Log'[10]~'Read counts'),las=2,main="TPM")
DESeq2 internally corrects counts for sequencing depth and RNA compositional bias using the median-of-ratios method. The details of this method are described further below under DGE size factors. To run this method, we create a DESeq2 object from the count data and metadata.
library(DESeq2)
mr$Time <- factor(mr$Time)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Time)
d <- DESeq2::estimateSizeFactors(d,type="ratio")
cd <- counts(d,normalized=TRUE)
boxplot(log10(as.matrix(cd)+1),ylab=expression('Log'[10]~'Read counts'),las=2,main="DESeq2")
For exploratory analyses such as MDS, PCA and clustering, VST (variance-stabilising transformation) counts are recommended. VST is also run using DESeq2. As in the previous step, a DESeq2 object is created.
library(DESeq2)
mr$Time <- factor(mr$Time)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Time)
d <- DESeq2::estimateSizeFactors(d,type="ratio")
d <- DESeq2::estimateDispersions(d)
cv <- as.data.frame(assay(varianceStabilizingTransformation(d,blind=T)),check.names=F)
# write vst counts to working directory
if(!file.exists("./data/counts_vst.txt")) write.table(cv,"./data/counts_vst.txt",sep="\t",dec=".",quote=FALSE)
# plot
boxplot(log10(as.matrix(cv)+1),ylab=expression('Log'[10]~'Read counts'),las=2,main="VST")
The effect of VST transformation can be clearly seen in a mean vs variance plot.
rowVar <- function(x) apply(x,1,var)
par(mfrow=c(2,2))
plot(log10(rowMeans(cf)+1),log10(rowVar(cf)+1),xlab=expression('Log'[10]~'Mean count'),ylab=expression('Log'[10]~'Variance'),main="Filtered")
plot(log10(rowMeans(ct)+1),log10(rowVar(ct)+1),xlab=expression('Log'[10]~'Mean count'),ylab=expression('Log'[10]~'Variance'),main="TPM")
plot(log10(rowMeans(cd)+1),log10(rowVar(cd)+1),xlab=expression('Log'[10]~'Mean count'),ylab=expression('Log'[10]~'Variance'),main="DESeq2")
plot(rowMeans(cv),rowVar(cv),xlab='Mean count',ylab='Variance',main="VST")
par(mfrow=c(1,1))
For RNA-seq data, the variance increases as the mean count increases: there is a strong, almost linear relationship, as seen in the figures. Statistical methods such as PCA expect similar variance across the range of mean values; if this does not hold, the higher-variance genes contribute more than the lower-variance genes. Such data is said to be heteroscedastic and needs to be corrected. Correcting with a log transformation (plus a pseudocount) instead inflates the contribution of genes with low counts. To obtain similar variance across the whole range of mean values, DESeq2 offers two methods: VST (variance-stabilising transformation) and RLOG (regularised log transformation).
As the name suggests, the variance-stabilising transformation stabilises the variance across the whole range of count values. VST is recommended for clustering or visualisation; it is not intended for differential gene expression. If the size factors vary dramatically between samples, the RLOG transformation is recommended instead.
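For completeness, a sketch of running RLOG on the same DESeq2 object (rlog() is provided by DESeq2; blind=TRUE ignores the design, as for VST above; crl is just an illustrative name):
# regularised log transformation; an alternative to VST
crl <- as.data.frame(assay(rlog(d,blind=TRUE)),check.names=F)
boxplot(as.matrix(crl),ylab='Read counts',las=2,main="RLOG")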
Finally, we can compare all of the various transformations in a single plot.
par(mfrow=c(1,4))
boxplot(log10(as.matrix(cf)+1),ylab=expression('Log'[10]~'Read counts'),las=2,main="Filtered")
boxplot(log10(as.matrix(ct)+1),ylab=expression('Log'[10]~'Read counts'),las=2,main="TPM")
boxplot(log10(as.matrix(cd)+1),ylab=expression('Log'[10]~'Read counts'),las=2,main="DESeq2")
boxplot(as.matrix(cv),ylab='Read counts',las=2,main="VST")
par(mfrow=c(1,1))
Would it be possible to have one perfect normalisation method for all types of analyses? Is there any drawback to using gene length corrected counts in differential gene expression analyses?
In this section, we are not running any quantitative analyses or statistical tests; this is strictly QC of the counts. We compute correlations to check the similarity of gene counts between samples, and run distance-based clustering and PCA to estimate the similarity between samples. This should give us an idea of the variation within the sample groups and help detect possible outliers or mislabelled samples.
We will use the variance-stabilised (VST) counts from the previous step for all exploratory analyses. Use the files you have created, or read from these URLs: for VST counts, use https://raw.githubusercontent.com/NBISweden/course_rnaseq_slu/master/data/counts_vst.txt; for metadata, use https://raw.githubusercontent.com/NBISweden/course_rnaseq_slu/master/data/metadata_raw.csv. Otherwise, the files can also be copied from the data location mentioned at the beginning of this document.
cv <- read.table("./data/counts_vst.txt",header=TRUE)
mr <- read.csv2("./data/metadata_raw.csv",header=TRUE,stringsAsFactors=F)
rownames(mr) <- mr$Sample_ID
It is a good idea to start by checking the correlation between samples. RNA-seq samples generally have very high correlation (R^2 > 0.9). R^2 values below 0.8 may be an indication of an outlier sample. Depending on other QC metrics, such low-correlation samples may be discarded from the analyses.
We use the function cor() to compute sample-to-sample Spearman correlation. Note that the input matrix has genes as rows and samples as columns. This generates a pairwise sample-to-sample correlation matrix, which can be plotted as a heatmap using the pheatmap() function from the pheatmap R package.
dmat <- as.matrix(cor(cv,method="spearman"))
library(pheatmap)
pheatmap(dmat,border_color=NA,annotation_col=mr[,"Time",drop=F],
annotation_row=mr[,"Time",drop=F],annotation_legend=T)
In the matrix, red denotes higher correlation (higher similarity) and blue denotes lower correlation (lower similarity). pheatmap() also hierarchically clusters rows and columns based on the correlation values. The dendrograms show how samples cluster. Annotation colours denote the Time groups. Notice that samples group by Time.
To run PCA, we use the R function prcomp(). It takes a matrix where samples are rows and variables are columns, so we transpose our count matrix using the function t(). If we did not transpose, PCA would be run on the genes rather than the samples. The next chunk of code plots the variance explained by the top PCs.
pcaobj <- prcomp(x=t(cv))
{barplot(round(pcaobj$sdev^2/sum(pcaobj$sdev^2)*100,2),las=2,
names.arg=colnames(pcaobj$x),ylab="% Variance explained",
xlab="PCA principal components")
abline(h=2, lty=2)}
The first two principal components in total explain 85% (75%+10%) of the variance in the data set. This means a scatterplot of PC1 vs PC2 can help to visualise the most important trend in the data. Then we merge the rotated data with the metadata and plot a scatterplot coloured by our variable of interest (Time).
library(ggplot2)
pcamat1 <- as.data.frame(pcaobj$x)
pcamat2 <- merge(pcamat1,mr,by=0)
ggplot(pcamat2,aes(PC1,PC2,colour=Time))+
geom_point()+
theme_bw()
Samples cluster by the Time variable as expected.
Sometimes the first two PCs may not be the ones that will best separate the sample groups, so it is a good idea to look at more PCs.
p1 <- ggplot(pcamat2,aes(PC1,PC3,colour=Time))+
geom_point()+
theme_bw()
p2 <- ggplot(pcamat2,aes(PC2,PC3,colour=Time))+
geom_point()+
theme_bw()
ggpubr::ggarrange(p1,p2,nrow=1,ncol=2)
An alternative way to create a PCA plot is directly from the DESeq2 object, using the DESeq2 function plotPCA().
plotPCA(varianceStabilizingTransformation(d),intgroup="Time")
For clustering, we create a pairwise sample-to-sample distance matrix (here we use euclidean distance). The rows and columns of this matrix are then hierarchically clustered. We use the function dist() to compute the distances. Note that for a sample-to-sample matrix, the rows need to be samples and the columns genes, so we again use t() to transpose the VST-normalised count matrix.
dmat <- as.matrix(dist(t(cv)))
library(pheatmap)
pheatmap(dmat,border_color=NA,annotation_col=mr[,"Time",drop=F],
annotation_row=mr[,"Time",drop=F],annotation_legend=T)
Hierarchically clustered sample-to-sample euclidean distance matrix. Larger distances mean lower similarity and vice-versa. In the matrix, red colour denotes larger distance (lower similarity) and blue denotes small distance (higher similarity). Annotation colours denote Time groups. The dendrogram helps to visualise sample clustering. Notice that samples group by Time.
Try to run the PCA using one of the other normalisation methods, say logCPM and/or DESeq2 normalised counts.
How different are the results? How do samples group? Can these differences be explained?
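A sketch of the PCA on logCPM values, assuming the cc matrix from the normalisation section is still in the workspace (only the input to prcomp() changes):
# PCA on log2 CPM values instead of VST counts
pcaobj_cpm <- prcomp(x=t(log2(cc+1)))
pcamat_cpm <- merge(as.data.frame(pcaobj_cpm$x),mr,by=0)
ggplot(pcamat_cpm,aes(PC1,PC2,colour=Time))+
geom_point()+
theme_bw()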
For differential gene expression, we use the DESeq2 package, with the raw filtered counts (https://raw.githubusercontent.com/NBISweden/course_rnaseq_slu/master/data/counts_filtered.txt) and metadata (https://raw.githubusercontent.com/NBISweden/course_rnaseq_slu/master/data/metadata_raw.csv).
cf <- read.table("./data/counts_filtered.txt",header=TRUE)
mr <- read.csv2("./data/metadata_raw.csv",header=TRUE,stringsAsFactors=F)
rownames(mr) <- mr$Sample_ID
The data is converted to a DESeq2 object. The GLM we use is simple, since we only have one variable of interest: ~Time.
If we had other covariates to control for, we would add them to the model like so: ~var+Time, with the variable of interest at the end. This model means: find differentially expressed genes between groups of the 'Time' variable while controlling for the effect of 'var'. Similarly, batch effects can be controlled for by specifying them in the model: ~batch+Time.
library(DESeq2)
mr$Time <- factor(mr$Time)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Time)
The first step is estimating size factors. The data is normalised for sequencing depth and compositional bias as done for the VST step. DESeq2 uses a method called median-of-ratios for this step.
d <- DESeq2::estimateSizeFactors(d,type="ratio")
Optional: for those interested in the details of the median-of-ratios method, a step-by-step guide to computing the normalisation factors (size factors) follows.
# geometric mean; zeros are excluded from the log sum but counted in the denominator
gm_mean = function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
gmean <- apply(cf,1,gm_mean)
head(gmean)
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460
## 479.11181 820.01611 81.68888 319.29571
## ENSG00000001036 ENSG00000001084
## 920.13961 520.55995
ratio <- cf/gmean
head(ratio)[,1:5]
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
## ENSG00000000003 0.6699898 0.6324202 0.4257879 1.026900 0.9496739
## ENSG00000000419 0.8487638 0.8048622 0.5755984 1.159733 1.1743672
## ENSG00000000457 0.7222525 0.6610447 0.5386290 1.334331 0.8936344
## ENSG00000000460 1.2496253 1.2684166 0.7391267 1.393692 1.4218794
## ENSG00000001036 0.9172521 0.8781276 0.5379618 1.157433 1.1270029
## ENSG00000001084 1.0277395 0.8798218 0.5570924 1.386968 1.3312588
sf <- apply(ratio,2,median)
sf
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
## 0.8975253 0.8407580 0.5086763 1.1263621 1.0931051 0.8123073 0.7558324
## Sample_8 Sample_9 Sample_10 Sample_11 Sample_12
## 1.1745210 1.0193067 1.3720949 1.2387549 1.8679810
We can verify that these values are correct by comparing them with the size factors generated by DESeq2. The values agree closely; the small differences come from details such as how zero counts are handled in the geometric mean.
# deseq2 size factors
sizeFactors(d)
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
## 0.9003753 0.8437393 0.5106445 1.1276451 1.0941383 0.8133849 0.7553903
## Sample_8 Sample_9 Sample_10 Sample_11 Sample_12
## 1.1744008 1.0189325 1.3642797 1.2325485 1.8555904
If we plot the size factors for each sample against the total counts for each sample, we get the plot below. We can see that they correlate very well: size factors are mostly correcting for total counts, i.e., sequencing depth.
plot(sizeFactors(d),colSums(cf),xlab="Size factors",ylab="Total counts")
The raw counts can then be divided by the size factor to yield normalised read counts.
# custom
head(t(t(cf)/sf))[,1:5]
# deseq2
head(counts(d,normalized=TRUE))[,1:5]
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
## ENSG00000000003 357.65009 360.38907 401.04088 436.80445 416.24541
## ENSG00000000419 775.46562 785.00590 927.89851 844.31105 880.97655
## ENSG00000000457 65.73631 64.22776 86.49901 96.77172 66.78223
## ENSG00000000460 444.55572 481.70816 463.94925 395.07720 415.33059
## ENSG00000001036 940.36349 961.03752 973.11390 945.52183 948.67361
## ENSG00000001084 596.08349 544.74652 570.10713 641.00166 633.97378
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
## ENSG00000000003 356.51801 359.11565 399.49511 436.30747 415.85236
## ENSG00000000419 773.01102 782.23211 924.32203 843.35041 880.14467
## ENSG00000000457 65.52823 64.00081 86.16561 96.66161 66.71917
## ENSG00000000460 443.14856 480.00607 462.16101 394.62769 414.93840
## ENSG00000001036 937.38692 957.64173 969.36314 944.44605 947.77780
## ENSG00000001084 594.19669 542.82168 567.90972 640.27234 633.37514
The function estimateSizeFactors() has an option to set a custom locfunc other than the median. Why is this useful? What happens if you change it to “shorth”? Check out the help page for estimateSizeFactors().
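A sketch of swapping in the shorth estimator (shorth() comes from the genefilter package; d2 is just an illustrative name):
# size factors using the 'shortest half' location estimator instead of the median
d2 <- DESeq2::estimateSizeFactors(d,type="ratio",locfunc=genefilter::shorth)
sizeFactors(d2)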
When it comes to comparing values between groups, some measure of variation is needed to estimate the variability of gene counts within groups. Dispersion is such a measure. Variance and standard deviation are not good measures of variability here because they correlate with the mean.
Optional: for some more discussion on dispersion, read on.
We can create a mean counts vs variance plot for all genes in our data set.
dm <- apply(cd,1,mean)
dv <- apply(cd,1,var)
ggplot(data.frame(mean=log10(dm+1),var=log10(dv+1)),
aes(mean,var))+
geom_point(alpha=0.2)+
geom_smooth(method="lm")+
labs(x=expression('Log'[10]~'Mean counts'),y=expression('Log'[10]~'Variance'))+
theme_bw()
We see a mostly linear relationship on the log scale. The blue line denotes a linear fit. Genes that have larger read counts show higher variance. It’s hard to say which genes are more variable based on this alone. Therefore, variance is not a good measure to identify variation in read counts. A measure that controls for this mean-variance relationship is what we need.
One option is the coefficient of variation (CV).
cva <- function(x) sd(x)/mean(x)
dc <- apply(cd,1,cva)
ggplot(data.frame(mean=log10(dm+1),var=dc),
aes(mean,var))+
geom_point(alpha=0.2)+
geom_smooth()+
labs(x=expression('Log'[10]~'Mean counts'),y="Coefficient of variation")+
theme_bw()
Now, we see that genes with lower counts have higher variability and genes with larger counts have lower variability. A measure like CV takes the ratio of variation to the mean: cv = sd(x)/mean(x).
This becomes even more apparent if we compute the CV and mean over replicates within sample groups (Time).
dx1 <- data.frame(t0=apply(cd[,1:3],1,cva),t2=apply(cd[,4:6],1,cva),
t6=apply(cd[,7:9],1,cva),t24=apply(cd[,10:12],1,cva))
dx1$gene <- rownames(dx1)
dx1 <- tidyr::gather(dx1,key=sample,value=cv,-gene)
rownames(dx1) <- paste0(dx1$gene,"-",dx1$sample)
dx2 <- data.frame(t0=apply(cd[,1:3],1,mean),t2=apply(cd[,4:6],1,mean),
t6=apply(cd[,7:9],1,mean),t24=apply(cd[,10:12],1,mean))
dx2$gene <- rownames(dx2)
dx2 <- tidyr::gather(dx2,key=sample,value=mean,-gene)
rownames(dx2) <- paste0(dx2$gene,"-",dx2$sample)
dx3 <- merge(dx1,dx2,by=0)
ggplot(dx3,aes(x=log10(mean+1),y=cv))+
geom_point(alpha=0.2)+
geom_smooth()+
facet_wrap(~sample.x)+
labs(x=expression('Log'[10]~'Mean counts'),y="Coefficient of variation")+
theme_bw()
We find that CV strongly declines with increasing counts. Genes with low counts show higher dispersion. For the sake of completeness, we can also plot the relationship between CV and variance for the same sample groups.
dx1 <- data.frame(t0=apply(cd[,1:3],1,cva),t2=apply(cd[,4:6],1,cva),
t6=apply(cd[,7:9],1,cva),t24=apply(cd[,10:12],1,cva))
dx1$gene <- rownames(dx1)
dx1 <- tidyr::gather(dx1,key=sample,value=cv,-gene)
rownames(dx1) <- paste0(dx1$gene,"-",dx1$sample)
dx2 <- data.frame(t0=apply(cd[,1:3],1,var),t2=apply(cd[,4:6],1,var),
t6=apply(cd[,7:9],1,var),t24=apply(cd[,10:12],1,var))
dx2$gene <- rownames(dx2)
dx2 <- tidyr::gather(dx2,key=sample,value=var,-gene)
rownames(dx2) <- paste0(dx2$gene,"-",dx2$sample)
dx3 <- merge(dx1,dx2,by=0)
ggplot(dx3,aes(x=log10(var+1),y=cv))+
geom_point(alpha=0.2)+
geom_smooth()+
facet_wrap(~sample.x)+
labs(x=expression('Log'[10]~'Variance'),y="Coefficient of variation")+
theme_bw()
DESeq2 computes its own version of dispersion in a more robust manner, taking low count values into account. The DESeq2 dispersion estimates are inversely related to the mean and directly related to the variance, and are a good measure of the variation in gene expression for a given mean value.
However, the dispersion estimate for genes with low counts is unreliable when there are too few replicates. To overcome this, DESeq2 borrows information across genes, assuming that genes with similar expression levels have similar dispersion values. Dispersion estimates are first computed for each gene separately by maximum likelihood. A curve is then fitted to these gene-wise estimates, and the gene-wise estimates are 'shrunk' towards the fitted curve.
The gene-wise dispersions, fitted curve and shrinkage can be visualised using the plotDispEsts() function.
d <- DESeq2::estimateDispersions(d)
plotDispEsts(d)
The black points denote the maximum likelihood dispersion estimate for each gene. The red curve denotes the fitted trend. The blue points denote the new gene-wise dispersion estimates after shrinkage towards the curve. The circled blue points denote estimates that are not shrunk, as they are too far away from the curve. This shrinkage is important for reducing false positives in DGE analyses with few replicates.
It is a good idea to visually check the dispersion shrinkage plot to verify that the modelling works for your data set.
Overdispersion is the reason why RNA-seq data is better modelled with a negative binomial distribution than a Poisson distribution: the Poisson distribution assumes mean = variance, while the negative binomial allows variance > mean. The last step in the DESeq2 workflow is fitting the negative binomial model for each gene and performing differential expression testing. This is based on the log fold change computed on the corrected count estimates between groups:
logFC = log2(corrected counts group A / corrected counts group B)
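As a rough illustration only (DESeq2 estimates fold changes from the fitted GLM, not from simple group means), the sign and approximate magnitude can be previewed from the normalised counts:
# approximate log2 fold change, t2 vs t0, from normalised count means
# (+1 pseudocount avoids division by zero; illustrative only)
lfc_approx <- log2(rowMeans(cd[,4:6])+1)-log2(rowMeans(cd[,1:3])+1)
head(lfc_approx)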
The most commonly used test for comparing two groups in DESeq2 is the Wald test. The null hypothesis is that the groups are not different and logFC=0. The list of contrasts can be seen using resultsNames(). Then we can pick our comparisons of interest.
dg <- nbinomWaldTest(d)
resultsNames(dg)
## [1] "Intercept" "Time_t2_vs_t0" "Time_t24_vs_t0" "Time_t6_vs_t0"
And we can get the result tables for the three different comparisons. The summary of the result object shows the number of genes that are differentially expressed with positive or negative fold-change and outliers.
res1 <- results(dg,name="Time_t2_vs_t0",alpha=0.05)
summary(res1)
res2 <- results(dg,name="Time_t6_vs_t0",alpha=0.05)
summary(res2)
res3 <- results(dg,name="Time_t24_vs_t0",alpha=0.05)
summary(res3)
##
## out of 14578 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 413, 2.8%
## LFC < 0 (down) : 696, 4.8%
## outliers [1] : 0, 0%
## low counts [2] : 2261, 16%
## (mean count < 26)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
##
##
## out of 14578 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 2484, 17%
## LFC < 0 (down) : 2856, 20%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
##
##
## out of 14578 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 5066, 35%
## LFC < 0 (down) : 5093, 35%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
You can also build up a comparison using contrasts. Contrasts need the condition, the level to compare and the reference level (base level). For example: results(dg,contrast=c("Time","t2","t0"),alpha=0.05).
Note that in both cases above, t0 is the reference level and the other levels are compared to it. A fold change of 2 therefore means that a gene is expressed 2-fold higher in the test level compared to t0.
The results() function has many useful arguments. One can set a threshold on the logFC values using lfcThreshold; by default, no filtering is performed based on logFC values. Outliers are detected and their p-values set to NA automatically via cooksCutoff. independentFiltering removes genes with low counts.
head(res1)
## log2 fold change (MLE): Time t2 vs t0
## Wald test p-value: Time t2 vs t0
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000000003 490.017213130775 0.220619809383656 0.112761083788383
## ENSG00000000419 817.780657499214 0.0592719659601406 0.101481284662652
## ENSG00000000457 82.078767893562 0.207748623172909 0.220404896514001
## ENSG00000000460 356.07160020304 -0.129186381676072 0.115139182335103
## ENSG00000001036 919.606750211436 0.0288826917736355 0.0851500691928604
## ENSG00000001084 529.593965301543 0.211964772147193 0.0929810563117334
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000000003 1.95652437854969 0.0504034135093116 0.263505451695327
## ENSG00000000419 0.584067950629267 0.559174596319379 0.830262316260423
## ENSG00000000457 0.942577168015466 0.34589722283592 0.689945926100409
## ENSG00000000460 -1.12200190288033 0.2618616308297 0.612624613846088
## ENSG00000001036 0.33919751384133 0.734460942036292 0.909638554355053
## ENSG00000001084 2.27965545408033 0.0226281312376226 0.159263252830742
The results table contains the mean expression value (baseMean), log2 fold change (log2FoldChange), log2 fold change standard error (lfcSE), Wald test statistic (stat), Wald test p-value (pvalue) and BH-adjusted p-value (padj) for each gene.
It is a good idea to look at the distribution of the unadjusted p-values.
hist(res1$pvalue[res1$baseMean>1],main="res1 Pval distribution",xlab="P-values")
This is the kind of distribution to be expected when the p-values are “well-behaved”. For more explanation on p-value distributions, see here. If this distribution is very different or strange, then it might indicate an underlying problem.
Note that the results object is a DESeqResults class object, not a data.frame. It can be converted to a data.frame using as.data.frame() for exporting to a file.
We can filter the results table as needed.
# convert res to data.frame
df_res1 <- na.omit(as.data.frame(res1))
df_res2 <- na.omit(as.data.frame(res2))
df_res3 <- na.omit(as.data.frame(res3))
# all genes
nrow(df_res1)
# only genes with padj <0.05
nrow(df_res1[df_res1$padj<0.05,])
# only genes with padj <0.05 and an absolute fold change >2
nrow(df_res1[df_res1$padj<0.05 & abs(df_res1$log2FoldChange)>2,])
## [1] 12317
## [1] 1109
## [1] 11
Note that manually filtering the results table by log2FoldChange is not exactly the same as setting the lfcThreshold argument in the results() function.
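A quick sketch of the difference: lfcThreshold tests the null hypothesis that |logFC| <= 1, whereas the manual filter only thresholds the point estimates, so the two counts will generally differ (res1_lfc is just an illustrative name):
# post-hoc filter on the point estimate
sum(df_res1$padj<0.05 & abs(df_res1$log2FoldChange)>1)
# formal test against |logFC| > 1
res1_lfc <- results(dg,name="Time_t2_vs_t0",alpha=0.05,lfcThreshold=1)
sum(res1_lfc$padj<0.05,na.rm=TRUE)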
Finally, we can add additional information to our results to make the interpretation and downstream analyses easier. We read in the gene info downloaded through biomaRt.
# add row names as a new column
df_res1$ensembl_gene_id <- rownames(df_res1)
df_res2$ensembl_gene_id <- rownames(df_res2)
df_res3$ensembl_gene_id <- rownames(df_res3)
# read genes info
hg <- read.delim("./data/human_genes.txt",header=T,sep="\t",stringsAsFactors=F)
hg <- hg[!duplicated(hg$ensembl_gene_id),]
# merge dataframes
df_res1 <- merge(df_res1,hg,by="ensembl_gene_id")
df_res2 <- merge(df_res2,hg,by="ensembl_gene_id")
df_res3 <- merge(df_res3,hg,by="ensembl_gene_id")
head(df_res1)
## ensembl_gene_id baseMean log2FoldChange lfcSE stat
## 1 ENSG00000000003 490.01721 0.22061981 0.11276108 1.9565244
## 2 ENSG00000000419 817.78066 0.05927197 0.10148128 0.5840680
## 3 ENSG00000000457 82.07877 0.20774862 0.22040490 0.9425772
## 4 ENSG00000000460 356.07160 -0.12918638 0.11513918 -1.1220019
## 5 ENSG00000001036 919.60675 0.02888269 0.08515007 0.3391975
## 6 ENSG00000001084 529.59397 0.21196477 0.09298106 2.2796555
## pvalue padj entrezgene external_gene_name chromosome_name
## 1 0.05040341 0.2635055 7105 TSPAN6 X
## 2 0.55917460 0.8302623 8813 DPM1 20
## 3 0.34589722 0.6899459 57147 SCYL3 1
## 4 0.26186163 0.6126246 55732 C1orf112 1
## 5 0.73446094 0.9096386 2519 FUCA2 6
## 6 0.02262813 0.1592633 2729 GCLC 6
## start_position end_position strand gene_biotype
## 1 100627109 100639991 -1 protein_coding
## 2 50934867 50958555 -1 protein_coding
## 3 169849631 169894267 -1 protein_coding
## 4 169662007 169854080 1 protein_coding
## 5 143494811 143511690 -1 protein_coding
## 6 53497341 53616970 -1 protein_coding
## description
## 1 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## 2 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## 3 SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## 4 chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
## 5 alpha-L-fucosidase 2 [Source:HGNC Symbol;Acc:HGNC:4008]
## 6 glutamate-cysteine ligase catalytic subunit [Source:HGNC Symbol;Acc:HGNC:4311]
Now we have annotated our DEG list with information such as gene names, genomic coordinates, biotype etc.
Save the data to text files for use in the functional analyses.
write.table(df_res1,"./data/deg-time_t2_vs_t0.txt",sep="\t",col.names=TRUE,row.names=FALSE,quote=F)
write.table(df_res2,"./data/deg-time_t6_vs_t0.txt",sep="\t",col.names=TRUE,row.names=FALSE,quote=F)
write.table(df_res3,"./data/deg-time_t24_vs_t0.txt",sep="\t",col.names=TRUE,row.names=FALSE,quote=F)
How many DE genes do you find for each test? Is there a trend? How much do the numbers change if you also require an absolute log2 fold change above 1?
nrow(df_res1[df_res1$padj<0.05,])
nrow(df_res2[df_res2$padj<0.05,])
nrow(df_res3[df_res3$padj<0.05,])
nrow(df_res1[df_res1$padj<0.05 & abs(df_res1$log2FoldChange)>1,])
nrow(df_res2[df_res2$padj<0.05 & abs(df_res2$log2FoldChange)>1,])
nrow(df_res3[df_res3$padj<0.05 & abs(df_res3$log2FoldChange)>1,])
## [1] 1086
## [1] 5222
## [1] 9909
## [1] 104
## [1] 788
## [1] 4082
This is an extra step to generate more accurate log2 fold changes: it shrinks noisy estimates, such as those from genes with high dispersion or low counts, towards zero.
lres1 <- lfcShrink(dg,coef="Time_t2_vs_t0",res=res1)
summary(lres1)
##
## out of 14578 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 413, 2.8%
## LFC < 0 (down) : 696, 4.8%
## outliers [1] : 0, 0%
## low counts [2] : 2261, 16%
## (mean count < 26)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
head(lres1)
## log2 fold change (MAP): Time t2 vs t0
## Wald test p-value: Time t2 vs t0
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000000003 490.017213130775 0.166374521328742 0.0903392402749302
## ENSG00000000419 817.780657499214 0.0494916503724324 0.0846342204609171
## ENSG00000000457 82.078767893562 0.0925123104527917 0.114348354878349
## ENSG00000000460 356.07160020304 -0.0895216223688513 0.091747791502412
## ENSG00000001036 919.606750211436 0.0257185248646338 0.0746622716252995
## ENSG00000001084 529.593965301543 0.189029922610361 0.0795917180410762
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000000003 1.95652437854969 0.0504034135093116 0.263505451695327
## ENSG00000000419 0.584067950629267 0.559174596319379 0.830262316260423
## ENSG00000000457 0.942577168015466 0.34589722283592 0.689945926100409
## ENSG00000000460 -1.12200190288033 0.2618616308297 0.612624613846088
## ENSG00000001036 0.33919751384133 0.734460942036292 0.909638554355053
## ENSG00000001084 2.27965545408033 0.0226281312376226 0.159263252830742
We see that the number of up- and down-regulated genes has not changed, but the density plot below shows that the logFC distribution has.
par(mfrow=c(1,2))
plot(density(na.omit(res1$log2FoldChange)),main="Default log2FoldChange")
plot(density(na.omit(lres1$log2FoldChange)),main="lfcShrink log2FoldChange")
par(mfrow=c(1,1))
This step does not change the total number of DE genes. This may be useful in downstream steps where genes need to be filtered down based on fold change or if fold change values are used in functional analyses such as GSEA.
In this section, we can explore some useful visualisation of the differential gene expression output.
The MA plot shows mean expression vs log fold change for all genes. The plotMA() function from DESeq2 takes a results object as input. Differentially expressed genes are marked in red.
DESeq2::plotMA(res1)
How does this plot change if you set a minimum log fold change threshold of 1? How does it change when you use lfcShrink()?
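A sketch using the shrunk results computed earlier:
# MA plot with shrunk log fold changes; noisy LFCs at low counts are pulled in
DESeq2::plotMA(lres1)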
A volcano plot is similar to the MA plot. It plots log fold change vs adjusted p-values.
df_res1_1 <- df_res1[df_res1$padj<0.05,]
ggplot()+
geom_point(data=df_res1,aes(x=log2FoldChange,y=-log10(padj)),col="grey80",alpha=0.5)+
geom_point(data=df_res1_1,aes(x=log2FoldChange,y=-log10(padj)),col="red",alpha=0.7)+
geom_hline(aes(yintercept=-log10(0.05)),alpha=0.5)+
theme_bw()
The x-axis denotes log fold change and the y-axis denotes the -log10 adjusted p-value. The adjusted p-value is transformed so that the smallest p-values appear at the top. The horizontal grey line denotes the significance threshold. All genes above this line (also coloured red) are considered significant.
Why is the y-axis (p-value) on a -log scale?
It can be a good idea to manually verify some of these genes by plotting their actual read count values. We can use the function plotCounts() to visualise the data points for a gene of interest. Below, we see the counts before and after normalisation.
plotCounts(d,gene=rownames(res1)[1],intgroup="Time",normalized=F)
plotCounts(d,gene=rownames(res1)[1],intgroup="Time",normalized=T)
By looking at the count plots, do you agree that the top DE genes differ significantly between the groups compared?
A heatmap is a good way to visualise the clustered relative expression levels of all DEGs.
We first select DE genes from the DGE results table using some criteria: here, adjusted p-value < 0.05 and absolute log fold change > 1. A log fold change of 1 translates to a fold change of 2 (2^1=2). We then use this list of genes to subset the VST table.
df_res3 <- na.omit(df_res3)
deg <- df_res3[df_res3$padj<0.05 & abs(df_res3$log2FoldChange)>1,]
cv1 <- cv[deg$ensembl_gene_id,]
We can use a heatmap function to explore the visual consequences of clustering. Here, we make use of the pheatmap() function, which by default clusters both the rows and the columns.
cl <- pheatmap(cv1, scale="row",color=colorRampPalette(c("navy","white","firebrick"))(90),border_color=NA,cluster_cols=F,cutree_rows=2,show_rownames=FALSE)
We can see that up- and down-regulated genes cluster distinctly. By setting scale="row", the data are converted to z-scores, i.e. scaled row-wise. The colours therefore show fold change up or down relative to the mean expression of each gene, and the rows are clustered based on these z-scores.
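The row scaling can be reproduced manually; a sketch (scale() standardises columns, hence the double transpose; zmat is just an illustrative name):
# z-scores per gene: subtract the row mean, divide by the row standard deviation
zmat <- t(scale(t(cv1)))
head(zmat)[,1:4]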
Why do samples 10, 11 and 12 show the strongest values for up and down?
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats datasets grDevices utils graphics
## [8] methods base
##
## other attached packages:
## [1] pheatmap_1.0.12 edgeR_3.24.3
## [3] limma_3.38.3 DESeq2_1.22.2
## [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [7] BiocParallel_1.16.6 matrixStats_0.54.0
## [9] Biobase_2.42.0 GenomicRanges_1.34.0
## [11] GenomeInfoDb_1.18.2 IRanges_2.16.0
## [13] S4Vectors_0.20.1 BiocGenerics_0.28.0
## [15] biomaRt_2.39.3 ggplot2_3.1.1
## [17] dplyr_0.8.0.1 leaflet_2.0.2
## [19] captioner_2.2.3 bookdown_0.9
## [21] knitr_1.22
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-137 bitops_1.0-6 bit64_0.9-7
## [4] RColorBrewer_1.1-2 progress_1.2.0 httr_1.4.0
## [7] tools_3.5.2 backports_1.1.4 R6_2.4.0
## [10] rpart_4.1-13 mgcv_1.8-26 Hmisc_4.2-0
## [13] DBI_1.0.0 lazyeval_0.2.2 colorspace_1.4-1
## [16] nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5
## [19] gridExtra_2.3 prettyunits_1.0.2 bit_1.1-14
## [22] compiler_3.5.2 htmlTable_1.13.1 labeling_0.3
## [25] scales_1.0.0 checkmate_1.9.1 genefilter_1.64.0
## [28] stringr_1.4.0 digest_0.6.18 foreign_0.8-71
## [31] rmarkdown_1.12 XVector_0.22.0 base64enc_0.1-3
## [34] pkgconfig_2.0.2 htmltools_0.3.6 htmlwidgets_1.3
## [37] rlang_0.3.4 rstudioapi_0.10 RSQLite_2.1.1
## [40] shiny_1.3.2 jsonlite_1.6 crosstalk_1.0.0
## [43] acepack_1.4.1 RCurl_1.95-4.12 magrittr_1.5
## [46] GenomeInfoDbData_1.2.0 Formula_1.2-3 Matrix_1.2-15
## [49] Rcpp_1.0.1 munsell_0.5.0 stringi_1.4.3
## [52] yaml_2.2.0 zlibbioc_1.28.0 plyr_1.8.4
## [55] grid_3.5.2 blob_1.1.1 promises_1.0.1
## [58] crayon_1.3.4 lattice_0.20-38 cowplot_0.9.4
## [61] splines_3.5.2 annotate_1.60.1 hms_0.4.2
## [64] locfit_1.5-9.1 pillar_1.3.1 ggpubr_0.2
## [67] geneplotter_1.60.0 XML_3.98-1.19 glue_1.3.1
## [70] evaluate_0.13 latticeExtra_0.6-28 data.table_1.12.2
## [73] httpuv_1.5.1 tidyr_0.8.3 gtable_0.3.0
## [76] purrr_0.3.2 assertthat_0.2.1 xfun_0.6
## [79] mime_0.6 xtable_1.8-3 later_0.8.0
## [82] survival_2.43-3 tibble_2.1.1 AnnotationDbi_1.44.0
## [85] memoise_1.1.0 cluster_2.0.7-1
End of document