In this document we will illustrate the use of several automatic cell type annotation packages. We will use an example data set consisting of 2,700 PBMCs, sequenced using 10x Genomics technology.
We first load the required R packages.
suppressPackageStartupMessages({
library(SummarizedExperiment)
library(SingleCellExperiment)
library(TENxPBMCData)
library(scater)
library(scran)
library(BiocSingular)
library(pheatmap)
library(Seurat)
library(igraph)
library(ggplot2)
library(limma)
library(edgeR)
library(cellassign)
library(SingleR)
library(AUCell)
})
Next, we load the data (from the TENxPBMCData package), calculate QC metrics with scater, normalize and run denoising PCA with r Biocpkg("scran")
and apply tSNE using scater, as we have seen previously in the course. We also apply a graph-based community detection algorithm in order to partition the cells into discrete clusters.
## snapshotDate(): 2019-09-25
## see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
## downloading 0 resources
## loading from cache
## Add column names, and use gene symbol as row names wherever possible
colnames(sce) <- paste0("Cell", seq_len(ncol(sce)))
rownames(sce) <- scater::uniquifyFeatureNames(
ID = rowData(sce)$ENSEMBL_ID,
names = rowData(sce)$Symbol_TENx
)
sce$cell <- colnames(sce)
## Calculate QC metrics and remove cells with more than 5% reads from
## mitochondrial genes
MT <- rownames(sce)[grep("^MT-", rownames(sce))]
df <- scater::perCellQCMetrics(
x = sce,
subsets = list(MT = MT)
)
colData(sce) <- cbind(colData(sce), df)
sce <- sce[, sce$subsets_MT_percent < 5]
## Calculate size factors and normalize
## Note that the pre-clustering step has been excluded to save computational
## time
sce <- scran::computeSumFactors(sce, min.mean = 0.1)
sce <- scater::logNormCounts(sce)
logcounts(sce) <- as.matrix(logcounts(sce))
## Fit variance trend and apply denoising PCA
dec <- scran::modelGeneVar(sce)
set.seed(123)
sce <- scran::denoisePCA(sce, technical = dec, BSPARAM = IrlbaParam())
## Apply tSNE
set.seed(123)
sce <- scater::runTSNE(sce, dimred = "PCA", perplexity = 30)
## Cluster
graph_k10 <- scran::buildSNNGraph(sce, k = 10, use.dimred = "PCA", type = "rank")
clust_k10_louvain <- igraph::cluster_louvain(graph_k10)$membership
sce$cluster_louvain_k10 <- factor(clust_k10_louvain)
scater::plotReducedDim(sce, "TSNE", colour_by = "cluster_louvain_k10")
CellAssign (Zhang et al. 2019) takes as input a single-cell data set (as a SingleCellExperiment
object) and a collection of marker genes for cell types of interest. The method builds on the assumption that a cell from a certain cell type should display a high expression of the marker genes for that cell type. For illustration, here we are using a set of marker genes for cell types in a tumour microenvironment, provided with the CellAssign
package.
## Load the marker genes and convert to binary indicator matrix
data(example_TME_markers)
(marker_gene_mat <- marker_list_to_mat(example_TME_markers$symbol))
## B cells T cells Cytotoxic T cells Monocyte/Macrophage
## ACTA2 0 0 0 0
## CD14 0 0 0 1
## CD19 1 0 0 0
## CD2 0 1 1 0
## CD24 1 0 0 0
## CD28 0 1 1 0
## CD33 0 0 0 1
## CD3D 0 1 1 0
## CD3E 0 1 1 0
## CD3G 0 1 1 0
## CD4 0 1 0 1
## CD79A 1 0 0 0
## CD8A 0 0 1 0
## CDH1 0 0 0 0
## CDH5 0 0 0 0
## CLDN3 0 0 0 0
## CLDN4 0 0 0 0
## CLEC14A 0 0 0 0
## COL1A1 0 0 0 0
## COL3A1 0 0 0 0
## EMCN 0 0 0 0
## EPCAM 0 0 0 0
## FCGR3A 0 0 0 1
## GZMA 0 0 1 0
## ITGAM 0 0 0 1
## ITGAX 0 0 0 1
## LYZ 0 0 0 1
## MCAM 0 0 0 0
## MS4A1 1 0 0 0
## MYH11 0 0 0 0
## MYLK 0 0 0 0
## NKG7 0 0 1 0
## PECAM1 0 0 0 0
## PLN 0 0 0 0
## PRF1 0 0 1 0
## PTPRC 1 1 1 1
## SERPINH1 0 0 0 0
## VIM 1 1 1 1
## VWF 0 0 0 0
## Epithelial cells Myofibroblast Vascular smooth muscle cells
## ACTA2 0 1 1
## CD14 0 0 0
## CD19 0 0 0
## CD2 0 0 0
## CD24 0 0 0
## CD28 0 0 0
## CD33 0 0 0
## CD3D 0 0 0
## CD3E 0 0 0
## CD3G 0 0 0
## CD4 0 0 0
## CD79A 0 0 0
## CD8A 0 0 0
## CDH1 1 0 0
## CDH5 0 0 0
## CLDN3 1 0 0
## CLDN4 1 0 0
## CLEC14A 0 0 0
## COL1A1 0 1 1
## COL3A1 0 1 1
## EMCN 0 0 0
## EPCAM 1 0 0
## FCGR3A 0 0 0
## GZMA 0 0 0
## ITGAM 0 0 0
## ITGAX 0 0 0
## LYZ 0 0 0
## MCAM 0 0 1
## MS4A1 0 0 0
## MYH11 0 0 1
## MYLK 0 0 1
## NKG7 0 0 0
## PECAM1 0 0 0
## PLN 0 0 1
## PRF1 0 0 0
## PTPRC 0 0 0
## SERPINH1 0 1 1
## VIM 0 1 1
## VWF 0 0 0
## Endothelial cells other
## ACTA2 0 0
## CD14 0 0
## CD19 0 0
## CD2 0 0
## CD24 0 0
## CD28 0 0
## CD33 0 0
## CD3D 0 0
## CD3E 0 0
## CD3G 0 0
## CD4 0 0
## CD79A 0 0
## CD8A 0 0
## CDH1 0 0
## CDH5 1 0
## CLDN3 0 0
## CLDN4 0 0
## CLEC14A 1 0
## COL1A1 0 0
## COL3A1 0 0
## EMCN 1 0
## EPCAM 0 0
## FCGR3A 0 0
## GZMA 0 0
## ITGAM 0 0
## ITGAX 0 0
## LYZ 0 0
## MCAM 1 0
## MS4A1 0 0
## MYH11 0 0
## MYLK 0 0
## NKG7 0 0
## PECAM1 1 0
## PLN 0 0
## PRF1 0 0
## PTPRC 0 0
## SERPINH1 1 0
## VIM 1 0
## VWF 1 0
## Find markers shared with the data set
shared <- intersect(rownames(marker_gene_mat), rownames(sce))
s <- sizeFactors(sce)
## Run cellassign. The data set should be subset to the marker genes
fit <- cellassign(exprs_obj = sce[shared, ],
marker_gene_info = marker_gene_mat[shared, ],
s = s,
learning_rate = 1e-2,
shrinkage = TRUE,
verbose = FALSE)
## Warning in cellassign(exprs_obj = sce[shared, ], marker_gene_info =
## marker_gene_mat[shared, : Genes with no mapping counts are present. Make
## sure this is expected -- this can be valid input in some cases (e.g. when
## cell types are overspecified).
## Warning in cellassign(exprs_obj = sce[shared, ], marker_gene_info =
## marker_gene_mat[shared, : Cells with no mapping counts are present. You
## might want to filter these out prior to using cellassign.
## [1] "Cytotoxic T cells" "B cells" "Cytotoxic T cells"
## [4] "Monocyte/Macrophage" "Cytotoxic T cells" "Cytotoxic T cells"
## B cells T cells Cytotoxic T cells Monocyte/Macrophage
## [1,] 5.682052e-06 2.459347e-14 9.999916e-01 2.662367e-06
## [2,] 9.999999e-01 9.120164e-21 3.043269e-08 3.642928e-08
## [3,] 3.546612e-11 8.718447e-13 1.000000e+00 1.156666e-11
## [4,] 1.595091e-06 1.574999e-18 1.262782e-05 9.999855e-01
## [5,] 2.060520e-10 4.386193e-23 1.000000e+00 2.301922e-11
## [6,] 8.537414e-04 4.776439e-13 9.979890e-01 1.152145e-03
## Epithelial cells Myofibroblast Vascular smooth muscle cells
## [1,] 7.628969e-12 1.225058e-11 6.193463e-12
## [2,] 6.350016e-15 1.889068e-13 7.942218e-14
## [3,] 5.034158e-17 6.953427e-17 3.054440e-17
## [4,] 1.972433e-11 5.421507e-12 2.548232e-12
## [5,] 6.984863e-16 1.023776e-16 5.573620e-17
## [6,] 3.983346e-10 2.236500e-08 1.125137e-08
## Endothelial cells other
## [1,] 1.373205e-11 9.651955e-08
## [2,] 2.344012e-13 8.973982e-11
## [3,] 8.437968e-17 6.926728e-13
## [4,] 6.342749e-12 2.605581e-07
## [5,] 1.094129e-16 8.465151e-12
## [6,] 2.514499e-08 5.054143e-06
SingleR (Aran et al. 2019) classifies cells to cell types by comparing them to full reference expression profiles. Thus, rather than a set of marker genes it requires complete expression matrices as reference input. Here we use three different reference immune data sets that are provided with the SingleR
package for illustration.
## snapshotDate(): 2019-09-25
## see ?SingleR and browseVignettes('SingleR') for documentation
## downloading 0 resources
## loading from cache
## see ?SingleR and browseVignettes('SingleR') for documentation
## downloading 0 resources
## loading from cache
## snapshotDate(): 2019-09-25
## see ?SingleR and browseVignettes('SingleR') for documentation
## downloading 0 resources
## loading from cache
## see ?SingleR and browseVignettes('SingleR') for documentation
## downloading 0 resources
## loading from cache
## snapshotDate(): 2019-09-25
## see ?SingleR and browseVignettes('SingleR') for documentation
## downloading 0 resources
## loading from cache
## see ?SingleR and browseVignettes('SingleR') for documentation
## downloading 0 resources
## loading from cache
pred.sce <- SingleR::SingleR(test = sce, ref = ref, labels = ref$label.main)
pred.sce2 <- SingleR::SingleR(test = sce, ref = ref2, labels = ref2$label.main)
pred.sce3 <- SingleR::SingleR(test = sce, ref = ref3, labels = ref3$label.main)
table(pred.sce$pruned.labels, pred.sce2$pruned.labels)
##
## B-cells CD4+ T-cells CD8+ T-cells HSC Monocytes NK cells
## B cells 98 0 0 0 0 0
## Monocytes 0 0 0 0 188 0
## NK cells 0 0 38 0 0 56
## T cells, CD4+ 0 133 132 0 0 0
## T cells, CD8+ 0 14 73 0 0 0
##
## B_cell CMP Monocyte NK_cell Platelets Pre-B_cell_CD34-
## B cells 98 0 0 0 0 0
## Monocytes 0 0 182 0 0 6
## NK cells 0 0 0 53 0 1
## T cells, CD4+ 0 0 0 0 0 0
## T cells, CD8+ 0 0 0 0 0 0
##
## T_cells
## B cells 0
## Monocytes 0
## NK cells 39
## T cells, CD4+ 268
## T cells, CD8+ 89
sce$singleR_celltypes_dice <- pred.sce$pruned.labels
sce$singleR_celltypes_blueprint <- pred.sce2$pruned.labels
sce$singleR_celltypes_hpca <- pred.sce3$pruned.labels
SingleR::plotScoreHeatmap(
pred.sce, show.labels = TRUE,
annotation_col = data.frame(cluster = sce$cluster_louvain_k10,
row.names = rownames(pred.sce))
)
SingleR::plotScoreHeatmap(
pred.sce2, show.labels = TRUE,
annotation_col = data.frame(cluster = sce$cluster_louvain_k10,
row.names = rownames(pred.sce2))
)
AUCell (Aibar et al. 2017) calculates “activation scores” for gene sets (which could, for example, consist of marker genes for specific cell types). An important difference to the previous methods is that a cell can be “assigned” to multiple reference gene sets, and not all cells are “assigned” to any reference gene set. To obtain gene sets for assigning cells to cell types, we perform pairwise t-tests on one of the reference data sets from SingleR
, and use scran to extract only marker genes that are upregulated in one cell type compared to all the others.
## snapshotDate(): 2019-09-25
## see ?SingleR and browseVignettes('SingleR') for documentation
## downloading 0 resources
## loading from cache
## see ?SingleR and browseVignettes('SingleR') for documentation
## downloading 0 resources
## loading from cache
markers <- scran::findMarkers(ref, groups = ref$label.main,
test.type = "t", assay.type = "logcounts",
lfc = 1, pval.type = "all", direction = "up")
markerlist <- lapply(markers, function(w) rownames(w)[w$FDR <= 0.01])
sapply(markerlist, length)
## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+
## 1058 984 680 83 101
## Get rankings of genes in each cell
set.seed(123)
cell_rankings <- AUCell::AUCell_buildRankings(logcounts(sce), nCores = 2)
## Quantiles for the number of genes detected by cell:
## (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
## min 1% 5% 10% 50% 100%
## 246.00 346.98 461.25 542.00 820.00 2815.00
## Using 2 cores.
## Calculate AUC for each gene set and cell
cell_AUC <- AUCell::AUCell_calcAUC(markerlist, cell_rankings,
aucMaxRank = nrow(cell_rankings) * 0.05)
## Genes in the gene sets NOT available in the dataset:
## B cells: 332 (31% of 1058)
## Monocytes: 97 (10% of 984)
## NK cells: 104 (15% of 680)
## T cells, CD4+: 12 (14% of 83)
## T cells, CD8+: 15 (15% of 101)
## Plots for threshold selection
par(mfrow = c(2, 3))
set.seed(123)
cell_assignment <- AUCell::AUCell_exploreThresholds(
cell_AUC, plotHist = TRUE,
nCores = 1, assignCells = TRUE
)
## Warning in .auc_assignmnetThreshold_v6(aucRow = aucMatrix[gSetName, , drop
## = FALSE], : NK cells: Check the AUC histogram. 'minimumDens' was selected
## as the best threshold, but there might be several distributions in the AUC.
## Get assigned cells
cellsAssigned <- lapply(cell_assignment, function(x) x$assignment)
assignmentTable <- reshape2::melt(cellsAssigned, value.name = "cell")
colnames(assignmentTable)[2] <- "geneSet"
head(assignmentTable)
## cell geneSet
## 1 Cell2 B cells
## 2 Cell11 B cells
## 3 Cell21 B cells
## 4 Cell22 B cells
## 5 Cell23 B cells
## 6 Cell26 B cells
assignmentMat <- table(assignmentTable[, "geneSet"], assignmentTable[, "cell"])
assignmentMat[, 1:2]
##
## Cell109 Cell11
## B cells 1 1
## Monocytes 0 0
## NK cells 0 0
## T cells, CD4+ 0 0
## T cells, CD8+ 0 0
set.seed(123)
miniAssigMat <- assignmentMat[, sample(1:ncol(assignmentMat), min(500, ncol(assignmentMat)))]
pheatmap(miniAssigMat, cluster_rows = FALSE, show_colnames = FALSE)
assignments <- assignmentTable %>% dplyr::group_by(cell) %>%
dplyr::summarize(AUCellType = paste(geneSet, collapse = "/")) %>%
dplyr::group_by(AUCellType) %>%
dplyr::mutate(n = length(unique(cell))) %>%
dplyr::filter(n > 5)
sce$AUCell_celltypes <- assignments$AUCellType[match(sce$cell, assignments$cell)]
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] doRNG_1.7.1 rngtools_1.4
## [3] pkgmaker_0.27 registry_0.5-1
## [5] foreach_1.4.7 AUCell_1.6.1
## [7] SingleR_0.99.13 cellassign_0.99.11
## [9] edgeR_3.27.13 limma_3.41.17
## [11] igraph_1.2.4.1 Seurat_3.1.1
## [13] pheatmap_1.0.12 BiocSingular_1.1.7
## [15] scran_1.13.26 scater_1.13.25
## [17] ggplot2_3.2.1 TENxPBMCData_1.3.0
## [19] HDF5Array_1.13.9 rhdf5_2.29.3
## [21] SingleCellExperiment_1.7.11 SummarizedExperiment_1.15.9
## [23] DelayedArray_0.11.8 BiocParallel_1.19.3
## [25] matrixStats_0.55.0 Biobase_2.45.1
## [27] GenomicRanges_1.37.16 GenomeInfoDb_1.21.2
## [29] IRanges_2.19.16 S4Vectors_0.23.25
## [31] BiocGenerics_0.31.6 BiocStyle_2.13.2
##
## loaded via a namespace (and not attached):
## [1] reticulate_1.13 R.utils_2.9.0
## [3] tidyselect_0.2.5 RSQLite_2.1.2
## [5] AnnotationDbi_1.47.1 htmlwidgets_1.5.1
## [7] grid_3.6.1 Rtsne_0.15
## [9] munsell_0.5.0 codetools_0.2-16
## [11] ica_1.0-2 statmod_1.4.32
## [13] future_1.14.0 withr_2.1.2
## [15] colorspace_1.4-1 knitr_1.25
## [17] ROCR_1.0-7 tensorflow_2.0.0
## [19] gbRd_0.4-11 listenv_0.7.0
## [21] labeling_0.3 Rdpack_0.11-0
## [23] GenomeInfoDbData_1.2.1 bit64_0.9-7
## [25] vctrs_0.2.0 xfun_0.10
## [27] BiocFileCache_1.9.1 doParallel_1.0.15
## [29] R6_2.4.0 ggbeeswarm_0.6.0
## [31] rsvd_1.0.2 locfit_1.5-9.1
## [33] bitops_1.0-6 assertthat_0.2.1
## [35] promises_1.1.0 SDMTools_1.1-221.1
## [37] scales_1.0.0 beeswarm_0.2.3
## [39] gtable_0.3.0 beachmat_2.1.2
## [41] npsurv_0.4-0 globals_0.12.4
## [43] rlang_0.4.0 zeallot_0.1.0
## [45] splines_3.6.1 lazyeval_0.2.2
## [47] BiocManager_1.30.7 yaml_2.2.0
## [49] reshape2_1.4.3 backports_1.1.5
## [51] httpuv_1.5.2 tools_3.6.1
## [53] gplots_3.0.1.1 RColorBrewer_1.1-2
## [55] ggridges_0.5.1 Rcpp_1.0.2
## [57] plyr_1.8.4 base64enc_0.1-3
## [59] zlibbioc_1.31.0 purrr_0.3.2
## [61] RCurl_1.95-4.12 pbapply_1.4-2
## [63] viridis_0.5.1 cowplot_1.0.0
## [65] zoo_1.8-6 ggrepel_0.8.1
## [67] cluster_2.1.0 magrittr_1.5
## [69] data.table_1.12.4 lmtest_0.9-37
## [71] RANN_2.6.1 whisker_0.4
## [73] fitdistrplus_1.0-14 lsei_1.2-0
## [75] mime_0.7 evaluate_0.14
## [77] xtable_1.8-4 XML_3.98-1.20
## [79] gridExtra_2.3 tfruns_1.4
## [81] compiler_3.6.1 tibble_2.1.3
## [83] KernSmooth_2.23-15 crayon_1.3.4
## [85] R.oo_1.22.0 htmltools_0.4.0
## [87] segmented_1.0-0 later_1.0.0
## [89] tidyr_1.0.0 RcppParallel_4.4.4
## [91] DBI_1.0.0 ExperimentHub_1.11.6
## [93] dbplyr_1.4.2 MASS_7.3-51.4
## [95] rappdirs_0.3.1 Matrix_1.2-17
## [97] R.methodsS3_1.7.1 gdata_2.18.0
## [99] metap_1.1 pkgconfig_2.0.3
## [101] plotly_4.9.0 annotate_1.63.0
## [103] vipor_0.4.5 dqrng_0.2.1
## [105] XVector_0.25.0 bibtex_0.4.2
## [107] stringr_1.4.0 digest_0.6.21
## [109] sctransform_0.2.0 RcppAnnoy_0.0.13
## [111] tsne_0.1-3 graph_1.63.0
## [113] rmarkdown_1.16 leiden_0.3.1
## [115] uwot_0.1.4 GSEABase_1.47.0
## [117] DelayedMatrixStats_1.7.2 curl_4.2
## [119] shiny_1.4.0 gtools_3.8.1
## [121] lifecycle_0.1.0 nlme_3.1-141
## [123] jsonlite_1.6 Rhdf5lib_1.7.5
## [125] BiocNeighbors_1.3.5 viridisLite_0.3.0
## [127] pillar_1.4.2 lattice_0.20-38
## [129] fastmap_1.0.1 httr_1.4.1
## [131] survival_2.44-1.1 interactiveDisplayBase_1.23.0
## [133] glue_1.3.1 iterators_1.0.12
## [135] png_0.1-7 bit_1.1-14
## [137] mixtools_1.1.0 stringi_1.4.3
## [139] blob_1.2.0 AnnotationHub_2.17.10
## [141] caTools_1.17.1.2 memoise_1.1.0
## [143] dplyr_0.8.3 irlba_2.3.3
## [145] future.apply_1.3.0 ape_5.3
Aibar, Sara, Carmen Bravo González-Blas, Thomas Moerman, Vân Anh Huynh-Thu, Hana Imrichova, Gert Hulselmans, Florian Rambow, et al. 2017. “SCENIC: Single-Cell Regulatory Network Inference and Clustering.” Nat. Methods 14 (11): 1083–6.
Aran, Dvir, Agnieszka P Looney, Leqian Liu, Esther Wu, Valerie Fong, Austin Hsu, Suzanna Chak, et al. 2019. “Reference-Based Analysis of Lung Single-Cell Sequencing Reveals a Transitional Profibrotic Macrophage.” Nat. Immunol. 20 (2): 163–72.
Zhang, Allen W, Ciara O’Flanagan, Elizabeth A Chavez, Jamie L P Lim, Nicholas Ceglia, Andrew McPherson, Matt Wiens, et al. 2019. “Probabilistic Cell-Type Assignment of Single-Cell RNA-seq for Tumor Microenvironment Profiling.” Nat. Methods 16 (10): 1007–15.