SC3 example with ILC data

Author: Åsa Björklund

Clustering of data using SC3 package, following tutorial at: https://bioconductor.org/packages/release/bioc/vignettes/SC3/inst/doc/SC3.html

Run through steps 1-5 of the manual, step 6 is more detail on the different steps of SC3, go through that as well if you find time.

For this exercise you can either run with your own data or with the example data from Treutlein paper that they provide with the package. Below is an example with human innate lympoid cells (ILCs) from Bjorklund et al. 2016.

If you want to run the package with the ILCs, all data, plus some intermediate files for steps that takes long time, can be found in the course uppmax folder with subfolder:

scrnaseq_course/data/ILC/

Load packages

suppressMessages(library(scater))
suppressMessages(library(SC3))

Read data and create a scater SingleCellExperiment (SCE) object

# read in meta data table and create pheno data
M <- read.table("data/ILC/Metadata_ILC.csv", sep=",",header=T)

# read rpkm values
R <- read.table("data/ILC/ensembl_rpkmvalues_ILC.csv",sep=",",header=T)
C <- read.table("data/ILC/ensembl_countvalues_ILC.csv",sep=",",header=T)

# in this case it may be wise to translate ensembl IDs to gene names to make plots with genes more understandable
TR <- read.table("data/ILC/gene_name_translation_biotype.tab",sep="\t")

# find the correct entries in TR and merge ensembl name and gene id.
m <- match(rownames(R),TR$ensembl_gene_id)
newnames <- apply(cbind(as.vector(TR$external_gene_name)[m],
                        rownames(R)),1,paste,collapse=":")
rownames(R)<-newnames
rownames(C)<-newnames

# create the SingleCellExperiement (SCE) object
sce <- SingleCellExperiment(assays = 
                    list(fpkm = as.matrix(R), 
                         counts = as.matrix(C)), colData = M)

# define feature names in feature_symbol column
rowData(sce)$feature_symbol <- rownames(sce)

# remove features with duplicated names
sce <- sce[!duplicated(rowData(sce)$feature_symbol), ]

# add in a logcounts slot, there needs to be a logcounts slot for SC3 to run on.
logcounts(sce) <- log2(as.matrix(R)+1)

# define spike-ins
isSpike(sce, "ERCC") <- grepl("ERCC_", rowData(sce)$feature_symbol)

In this example we fill all slots, fpkm, counts and logcounts, to show how it can be done. However, for running SC3 it is only necessary to have the logcounts slot, since that is what is used.

QC with scater

Use scater package to calculate qc-metrics and plot a PCA

sce <- calculateQCMetrics(sce, exprs_values ="logcounts")
sce <- runPCA(sce, ntop = 1000, ncompnents = 20, exprs_values = "logcounts")
plotPCA(sce, colour_by = "Celltype")

Run SC3

OBS! it takes a while to run (10-30mins depending on data set size and how many cores you use), define number of clusters to test with ks parameter, testing more different k’s will take longer time. You can get a hint on number of clusters you should set by running the sc3_estimate_k function, but it may not always give the biologically relevant clusters.

# since this step takes a while, save data to a file so that it does not have to be rerun if you execute the code again.
# To be used by SC3, the SCE object must contain "counts", "logcounts", and "feature_symbol"

savefile <- "data/ILC/sc3_cdata_ilc_k3-6.Rdata"
if (file.exists(savefile)){
   load(savefile)
}else {
   sce <- sc3(sce, ks = 3:6, biology = TRUE, n_cores = 1)
   save(sce,file=savefile)
}

Now you can explore the data interactively within a shiny app using command:

sc3_interactive(sce)

Plot results

Instead of using the app, that sometimes is very slow, you can also create each plot with different commands, here are some example plots.

# plot PCA for 4 clusters
plotPCA(
    sce, 
    colour_by = "sc3_4_clusters", 
    size_by = "sc3_4_log2_outlier_score"
)

# plot onto tsne embedding, first need to run tSNE for the SCE object
set.seed(1)
sce <- runTSNE(sce, ntop = 1000, exprs_values = "fpkm", n_dimred = 7)
plotTSNE(
    sce, 
    colour_by = "sc3_4_clusters", 
    size_by = "sc3_4_log2_outlier_score"
)

# same plot but with celltype annotation
plotTSNE(sce, colour_by = "Celltype")

# plot how many high auc value genes there are per cluster
plotRowData(
    sce, 
    x = "sc3_4_markers_clusts", 
    y = "sc3_4_markers_auroc", 
    colour_by = "sc3_4_markers_padj"
)

# plot consensus clusters - 4 clusters
sc3_plot_consensus(
    sce, k = 4, 
    show_pdata = c(
        "Celltype", 
        "log10_total_features",
        "sc3_4_clusters", 
        "sc3_4_log2_outlier_score",
    "Donor" 
    )
)

SC3 clearly groups the 4 main celltypes, but within celltypes there is clear separation of the donors.

# same with 6 clusters 
sc3_plot_consensus(
    sce, k = 6, 
    show_pdata = c(
        "Celltype", 
        "log10_total_features",
        "sc3_6_clusters", 
        "sc3_6_log2_outlier_score",
    "Donor" 
    )
)

The next clustering steps clearly separates the ILC3s by donor,

# plot expression of gene clusters
sc3_plot_expression(sce, k = 4,
    show_pdata = c(
        "Celltype", 
        "log10_total_features",
        "sc3_4_clusters", 
        "sc3_4_log2_outlier_score",
    "Donor" 
    )
)

This plots shows cluster of genes and their expression in the different clusters.

DE genes, these are estimated using the non-parametric Kruskal-Wallis test.

# plot DE genes
sc3_plot_de_genes(sce, k = 4,
    show_pdata = c(
        "Celltype", 
        "log10_total_features",
        "sc3_4_clusters", 
        "sc3_4_log2_outlier_score",
    "Donor" 
    )
)

Marker genes - are estimated from AUC values.

# plot marker genes
sc3_plot_markers(sce, k = 4,
    show_pdata = c(
        "Celltype", 
        "log10_total_features",
        "sc3_4_clusters", 
        "sc3_4_log2_outlier_score",
    "Donor" 
    )
)

In this dataset, it is clear that the main separation of the data is by celltype, however, the donor batch effect becomes clear when splitting into more than 4 clusters.

So it would be sensible to first run batch effect correction on the data, and then run SC3 on batch corrected expression values. These could be provided as logcounts when creating the SingleCellExperiment.

# if you have batch normalized data (norm.data) it is normally done on logged counts or rpkms, so the SCE object should be created as:

sce.norm <- SingleCellExperiment(
  assays = list(
          logcounts = norm.data
        ),
        colData = M
)

# or by just adding it to an existing SCE object
logcounts(sce) <- norm.data

Session info

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS: /sw/apps/R/x86_64/3.5.0/rackham/lib64/R/lib/libRblas.so
## LAPACK: /sw/apps/R/x86_64/3.5.0/rackham/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rmarkdown_1.11              SC3_1.10.1                 
##  [3] scater_1.10.1               ggplot2_3.1.0              
##  [5] SingleCellExperiment_1.4.1  SummarizedExperiment_1.12.0
##  [7] DelayedArray_0.8.0          BiocParallel_1.16.5        
##  [9] matrixStats_0.54.0          Biobase_2.42.0             
## [11] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
## [13] IRanges_2.16.0              S4Vectors_0.20.1           
## [15] BiocGenerics_0.28.0        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6             doParallel_1.0.14       
##  [3] RColorBrewer_1.1-2       tools_3.5.0             
##  [5] doRNG_1.7.1              R6_2.3.0                
##  [7] KernSmooth_2.23-15       HDF5Array_1.10.1        
##  [9] vipor_0.4.5              lazyeval_0.2.1          
## [11] colorspace_1.4-0         withr_2.1.2             
## [13] tidyselect_0.2.5         gridExtra_2.3           
## [15] compiler_3.5.0           pkgmaker_0.27           
## [17] labeling_0.3             caTools_1.17.1.1        
## [19] scales_1.0.0             DEoptimR_1.0-8          
## [21] mvtnorm_1.0-8            robustbase_0.93-3       
## [23] stringr_1.3.1            digest_0.6.18           
## [25] XVector_0.22.0           rrcov_1.4-7             
## [27] pkgconfig_2.0.2          htmltools_0.3.6         
## [29] bibtex_0.4.2             WriteXLS_4.0.0          
## [31] rlang_0.3.1              shiny_1.2.0             
## [33] DelayedMatrixStats_1.4.0 bindr_0.1.1             
## [35] gtools_3.8.1             dplyr_0.7.8             
## [37] RCurl_1.95-4.11          magrittr_1.5            
## [39] GenomeInfoDbData_1.2.0   Matrix_1.2-14           
## [41] Rcpp_1.0.0               ggbeeswarm_0.6.0        
## [43] munsell_0.5.0            Rhdf5lib_1.4.2          
## [45] viridis_0.5.1            yaml_2.2.0              
## [47] stringi_1.2.4            zlibbioc_1.28.0         
## [49] Rtsne_0.15               rhdf5_2.26.2            
## [51] gplots_3.0.1.1           plyr_1.8.4              
## [53] grid_3.5.0               promises_1.0.1          
## [55] gdata_2.18.0             crayon_1.3.4            
## [57] lattice_0.20-35          knitr_1.21              
## [59] pillar_1.3.1             rngtools_1.3.1          
## [61] reshape2_1.4.3           codetools_0.2-15        
## [63] glue_1.3.0               evaluate_0.12           
## [65] httpuv_1.4.5.1           foreach_1.4.4           
## [67] gtable_0.2.0             purrr_0.3.0             
## [69] assertthat_0.2.0         xfun_0.4                
## [71] mime_0.6                 xtable_1.8-3            
## [73] e1071_1.7-0.1            later_0.7.5             
## [75] class_7.3-14             pcaPP_1.9-73            
## [77] viridisLite_0.3.0        tibble_2.0.1            
## [79] pheatmap_1.0.12          iterators_1.0.10        
## [81] registry_0.5             beeswarm_0.2.3          
## [83] bindrcpp_0.2.2           cluster_2.0.7-1         
## [85] ROCR_1.0-7