Author: Åsa Björklund
Clustering of data using Pagoda in the SCDE package, following tutorial at: http://hms-dbmi.github.io/scde/pagoda.html
For this exercise you can either run with your own data or with the example data from Pollen paper that they provide with the package. Below is an example with human innate lympoid cells (ILCs) from Bjorklund et al. 2016 where you can get some hints on how to run with your own data.
With a large dataset, it may be a good idea to run all the cpu/memory-intensive steps in the SCDE package on Uppmax, or other HPC center, submitted as an Sbatch job calling an R-script. OBS! The SCDE package uses a lot of memory, so even if you request 16 cpus (a full node) on Uppmax you will have issues with memory overflow if you run the program on all 16 cores. Suggested setting would be to use n.cores = 10 when running on a normal node or specify nodes with more memory.
But the memory consumption could of course depend on the dataset, so you should monitor your cpu and memory usage using the jobstats
command.
If you want to run this example, all data plus some intermediate files for steps that takes long time, is located in the course uppmax folder with subfolder:
scrnaseq_course/data/ILC/
suppressMessages(library(scde))
# read in meta data table and create pheno data
M <- read.table("data/ILC/Metadata_ILC.csv", sep=",",header=T)
# read count values, SCDE require counts and not normalized rpkms.
C <- read.table("data/ILC/ensembl_countvalues_ILC.csv",sep=",",header=T)
Define color scales based on celltype and donor.
celltype2cols <- c("blue", "cyan", "red", "green")[as.integer(M$Celltype)]
donor2cols <- c("black", "orange", "magenta")[as.integer(M$Donor)]
suppressMessages(library(plotrix))
nD <- colSums(C>2)
nDet2cols <- color.scale(nD,c(0,1,1),
c(1,1,0),0)
In the Tutorial for biomaRt, there is an example on how to fetch go-terms for Ensembl genes. In the file GO_BP_annotations.Rdata
this was done for all genes in the ILC dataset and selecting for Biological Process entries in GO.
In the tutorial there is an example where they fetch first 100 go-terms in org.Hs.egGO2ALLEGS, plus some selected ones related to neurogenesis. In their example they translate gene symbol to NCBI gene id to then fetch go-terms. There are several different ways of getting go-terms for genes, but if you are working with Ensembl IDs the most straight forward is probably to use biomaRt.
More examples on how to get GO-terms for gene symbols can be found at: http://hms-dbmi.github.io/scde/genesets.html
# load go-term annotations for all genes
load("data/ILC/GO_BP_annotations.Rdata",verbose=TRUE)
## Loading objects:
## goBP2gene
The clean.counts function will remove genes and cells based on specified cutoffs:
cd <- clean.counts(C, min.lib.size = 1000, min.reads = 10, min.detected = 5)
genes <- rownames(cd)
# now filter out all genes that are not included in cd from the go-annotations
go.env <- lapply(goBP2gene, function(x) x[x %in% genes])
# remove GOs with too few or too many genes
go.env <- clean.gos(go.env,min.size=5,max.size=2000)
# convert to an environment
go.env <- list2env(go.env)
OBS! it takes a while to run on 1 cpu, around an hour with this dataset.
# since this step takes a while, save data to a file so that it does not have to be rerun if you execute the code again.
savefile <- "data/ILC/pagoda_knn.Rdata"
if (file.exists(savefile)){
load(savefile)
}else {
pdf("data/ILC/pagoda.cell.models.pdf")
knn <- knn.error.models(cd, k = ncol(cd)/4, n.cores = 1, min.count.threshold = 2, min.nonfailed = 5, max.model.plots = 10)
dev.off()
save(knn,file=savefile)
}
In this case, we know that we have a clear batch effect from the 3 donors, so we can also include batch information in the normalization step.
This step also takes a while to run.
In addition, we add in a step where the number of detected genes per cell, which may be a techincal artefact.
savefile <- "data/ILC/pagoda_varnorm.Rdata"
if (file.exists(savefile)){
load(savefile)
}else {
pdf("data/ILC/pagoda.varnorm.pdf")
varinfo <- pagoda.varnorm(knn, counts = cd, trim = 3/ncol(cd), max.adj.var = 5, n.cores = 1, plot = TRUE, batch=M$Donor)
dev.off()
# remove effect of detected genes.
varinfo <- pagoda.subtract.aspect(varinfo, colSums(cd[, rownames(knn)]>0))
save(varinfo,file=savefile)
}
In this example we only remove the aspect of gene detection, but there can be other aspects that you want to remove, an example from there tutorial to remove cell cycle effect.
OBS! This was not run on the ICL dataset
# get cell cycle signature and view the top genes
cc.pattern <- pagoda.show.pathways(c("GO:0000280", "GO:0007067"), varinfo, go.env, show.cell.dendrogram = TRUE, cell.clustering = hc, showRowLabels = TRUE)
# subtract the pattern
varinfo.cc <- pagoda.subtract.aspect(varinfo, cc.pattern)
Instead of using the app, that sometimes is very slow, you can also create each plot with different commands, here are some example plots.
Also a step that takes hours to run.
savefile <- "data/ILC/pagoda_pwpca.Rdata"
if (file.exists(savefile)){
load(savefile)
}else {
pwpca <- pagoda.pathway.wPCA(varinfo, go.env, n.components = 2, n.cores = 1)
save(pwpca,file=savefile)
}
OBS! This takes very long time to run, several hours, so we skip this step in this tutorial and only run the subsequent steps with pwpca.
But it can be run with:
clpca <- pagoda.gene.clusters(varinfo, trim = 7.1/ncol(varinfo$mat), n.clusters = 50, n.cores = 1, plot = TRUE)
df <- pagoda.top.aspects(pwpca, return.table = TRUE, plot = TRUE, z.score = 1.96)
head(df)
## name npc n score
## 1069 GO:0009416;response to light stimulus 1 18 4.241465
## 1797 GO:0032870;cellular response to hormone stimulus 1 24 4.177536
## 2823 GO:0051591;response to cAMP 1 27 3.997107
## 2889 GO:0060037;pharyngeal system development 1 10 3.097088
## 3141 GO:0071850;mitotic cell cycle arrest 1 12 2.978951
## 1989 GO:0035994;response to muscle stretch 1 13 2.886317
## z adj.z sh.z adj.sh.z
## 1069 24.45432 24.16538 NA NA
## 1797 25.72343 25.40572 NA NA
## 2823 25.27727 24.98162 NA NA
## 2889 15.26336 14.84385 NA NA
## 3141 15.18589 14.77459 NA NA
## 1989 14.88660 14.47606 NA NA
# get full info on the top aspects
tam <- pagoda.top.aspects(pwpca, n.cells = NULL, z.score = qnorm(0.01/2, lower.tail = FALSE))
# determine overall cell clustering
hc <- pagoda.cluster.cells(tam, varinfo)
# Next, we will reduce redundant aspects in two steps. First we will combine pathways that are driven by the same sets of genes:
tamr <- pagoda.reduce.loading.redundancy(tam, pwpca)
# In the second step we will combine aspects that show similar patterns (i.e. separate the same sets of cells). Here we will plot the cells using the overall cell clustering determined above:
tamr2 <- pagoda.reduce.redundancy(tamr, distance.threshold = 0.9, plot = TRUE, cell.clustering = hc, labRow = NA, labCol = NA, box = TRUE, margins = c(0.5, 0.5), trim = 0)
# define colors for clusters, here we split by 4 clusters
col.cols <- rbind(groups = cutree(hc, 4))
# plot top aspects, include also donor/celltype color vectors
pagoda.view.aspects(tamr2, cell.clustering = hc, box = TRUE, labCol = NA, margins = c(0.5, 20), col.cols = rbind(col.cols,donor2cols,celltype2cols),top=20)
# here top 20 aspects are plotted,
The results can be browsed interactively with the pagoda app using commands:
app <- make.pagoda.app(tamr2, tam, varinfo, go.env, pwpca, col.cols = rbind(col.cols,donor2cols,celltype2cols), cell.clustering = hc, title = "ILCs")
# show app in the browser (port 1468)
show.app(app, "ILCs", browse = TRUE, port = 1468)
Instead of basing tSNE on regular PCA distances, you can also use the distances from the pagoda pwpca object as an input to Rtsne.
suppressMessages(library(Rtsne))
# recalculate clustering distance, we'll need to specify return.details=T
cell.clustering <- pagoda.cluster.cells(tam,varinfo,include.aspects=TRUE,verbose=TRUE,return.details=T)
## clustering cells based on 1005 genes and 587 aspect patterns
# fix the seed to ensure reproducible results
set.seed(0)
tSNE.pagoda <- Rtsne(cell.clustering$distance,is_distance=T,initial_dims=30,perplexity=30,theta=0.1)
par(mfrow=c(2,2), mar = c(2.5,2.5,2.0,0.5), mgp = c(2,0.65,0), cex = 1.0);
plot(tSNE.pagoda$Y,col=adjustcolor(col.cols,alpha=0.5),cex=1,pch=19,xlab="",ylab="",main="Pagoda clusters")
plot(tSNE.pagoda$Y,col=adjustcolor(celltype2cols,alpha=0.5),cex=1,pch=19,xlab="",ylab="",main="Celltypes")
plot(tSNE.pagoda$Y,col=adjustcolor(donor2cols,alpha=0.5),cex=1,pch=19,xlab="",ylab="",main="Donors")
plot(tSNE.pagoda$Y,col=adjustcolor(nDet2cols,alpha=0.5),cex=1,pch=19,xlab="",ylab="",main="Detected genes")
Run tsne based on same dataset (cd).
set.seed(0)
tSNE.counts <- Rtsne(t(log2(cd+1)),initial_dims=30,perplexity=30,theta=0.1)
par(mfrow=c(2,2), mar = c(2.5,2.5,2.0,0.5), mgp = c(2,0.65,0), cex = 1.0);
plot(tSNE.counts$Y,col=adjustcolor(col.cols,alpha=0.5),cex=1,pch=19,xlab="",ylab="",main="Pagoda clusters")
plot(tSNE.counts$Y,col=adjustcolor(celltype2cols,alpha=0.5),cex=1,pch=19,xlab="",ylab="",main="Celltypes")
plot(tSNE.counts$Y,col=adjustcolor(donor2cols,alpha=0.5),cex=1,pch=19,xlab="",ylab="",main="Donors")
plot(tSNE.counts$Y,col=adjustcolor(nDet2cols,alpha=0.5),cex=1,pch=19,xlab="",ylab="",main="Detected genes")
What do you think about the clustering based on Pagoda?
Do you see any clear issues with the tSNE based on counts?
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rtsne_0.13 plotrix_3.7 scde_2.6.0 flexmix_2.3-14
## [5] lattice_0.20-35
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.15 nloptr_1.0.4
## [3] compiler_3.4.1 RColorBrewer_1.1-2
## [5] tools_3.4.1 Lmoments_1.2-3
## [7] lme4_1.1-15 digest_0.6.15
## [9] evaluate_0.10.1 nlme_3.1-131
## [11] mgcv_1.8-23 Matrix_1.2-12
## [13] yaml_2.1.16 parallel_3.4.1
## [15] SparseM_1.77 RcppArmadillo_0.8.300.1.0
## [17] stringr_1.2.0 knitr_1.19
## [19] extRemes_2.0-8 MatrixModels_0.4-1
## [21] locfit_1.5-9.1 stats4_3.4.1
## [23] rprojroot_1.3-2 grid_3.4.1
## [25] nnet_7.3-12 Biobase_2.38.0
## [27] distillery_1.0-4 Rook_1.1-1
## [29] BiocParallel_1.12.0 rmarkdown_1.8
## [31] minqa_1.2.4 limma_3.34.8
## [33] car_2.1-6 edgeR_3.20.8
## [35] magrittr_1.5 splines_3.4.1
## [37] backports_1.1.2 pcaMethods_1.70.0
## [39] htmltools_0.3.6 modeltools_0.2-21
## [41] MASS_7.3-48 BiocGenerics_0.24.0
## [43] pbkrtest_0.4-7 RMTstat_0.3
## [45] brew_1.0-6 quantreg_5.34
## [47] stringi_1.1.6 rjson_0.2.15
## [49] Cairo_1.5-9