Author: Åsa Björklund
Detailed tutorial of scater package at: https://www.bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/vignette-qc.html
We recommend that you follow steps 1-3 in the tutorial.
Many other packages builds on the SingleCellExperiment class in scater, so it is important that you learn properly how to create an SCE from your data and understand the basics of the scater package.
For this exercise you can either run with your own data or with the example data that they provide with the package. Below is an example with human innate lympoid cells (ILCs) from Bjorklund et al. 2016.
If you want to run the package with the ILCs, all data is available in the course uppmax folder with subfolder: scrnaseq_course/data/ILC/
OBS! As of July 2017, scater has switched from the SCESet class previously defined within the package to the more widely applicable SingleCellExperiment class. From Bioconductor 3.6 (October 2017), the release version of scater will use SingleCellExperiment.
suppressMessages(library(scater))
# read in meta data table and create pheno data
M <- read.table("data/ILC/Metadata_ILC.csv", sep=",",header=T)
# read rpkm values and counts
R <- read.table("data/ILC/ensembl_rpkmvalues_ILC.csv",sep=",",header=T)
C <- read.table("data/ILC/ensembl_countvalues_ILC.csv",sep=",",header=T)
# create an SCESet
example_sce <- SingleCellExperiment(assays = list(counts = as.matrix(C)), colData = M)
# you can also add in expression values from the rpkm matrix 
# instead of using logged counts.
exprs(example_sce) <- log2(as.matrix(R)+1)
assay(example_sce, "exprs") <- exprs(example_sce)
# you can access the rpkm or count matrix with the commands "counts" and "exprs"
counts(example_sce)[10:13,1:5]
##                 T74_P1_A9_ILC1 T74_P1_B4_NK T74_P1_B7_ILC2 T74_P1_B9_NK
## ENSG00000001167              0            0              0            0
## ENSG00000001460              0            0              0            0
## ENSG00000001461              0         1035              1            1
## ENSG00000001497              0            0              0            0
##                 T74_P1_D10_ILC2
## ENSG00000001167               0
## ENSG00000001460               0
## ENSG00000001461               2
## ENSG00000001497               0
exprs(example_sce)[10:13,1:5]
##                 T74_P1_A9_ILC1 T74_P1_B4_NK T74_P1_B7_ILC2 T74_P1_B9_NK
## ENSG00000001167              0     0.000000      0.0000000    0.0000000
## ENSG00000001460              0     0.000000      0.0000000    0.0000000
## ENSG00000001461              0     6.615791      0.2243554    0.2142426
## ENSG00000001497              0     0.000000      0.0000000    0.0000000
##                 T74_P1_D10_ILC2
## ENSG00000001167       0.0000000
## ENSG00000001460       0.0000000
## ENSG00000001461       0.8229705
## ENSG00000001497       0.0000000
We have accessor functions to access elements of the SingleCellExperiment object.
For convenience (and backwards compatibility with SCESet) getters and setters are provided as follows: exprs, tpm, cpm, fpkm and versions of these with the prefix “norm_”)
The closest to rpkms is in this case fpkms, so we use fpkm.
It also has slots for:
Use scater package to calculate qc-metrics
# first check which genes are spike-ins if you have included those
ercc <- grep("ERCC_",rownames(R))
# specify the ercc as feature control genes and calculate all qc-metrics
example_sce <- calculateQCMetrics(example_sce, 
                                  feature_controls = list(ERCC = ercc))
# check what all entries are - 
colnames(colData(example_sce))
##  [1] "Samples"                                       
##  [2] "Plate"                                         
##  [3] "Donor"                                         
##  [4] "Celltype"                                      
##  [5] "is_cell_control"                               
##  [6] "total_features_by_counts"                      
##  [7] "log10_total_features_by_counts"                
##  [8] "total_counts"                                  
##  [9] "log10_total_counts"                            
## [10] "pct_counts_in_top_50_features"                 
## [11] "pct_counts_in_top_100_features"                
## [12] "pct_counts_in_top_200_features"                
## [13] "pct_counts_in_top_500_features"                
## [14] "total_features_by_counts_endogenous"           
## [15] "log10_total_features_by_counts_endogenous"     
## [16] "total_counts_endogenous"                       
## [17] "log10_total_counts_endogenous"                 
## [18] "pct_counts_endogenous"                         
## [19] "pct_counts_in_top_50_features_endogenous"      
## [20] "pct_counts_in_top_100_features_endogenous"     
## [21] "pct_counts_in_top_200_features_endogenous"     
## [22] "pct_counts_in_top_500_features_endogenous"     
## [23] "total_features_by_counts_feature_control"      
## [24] "log10_total_features_by_counts_feature_control"
## [25] "total_counts_feature_control"                  
## [26] "log10_total_counts_feature_control"            
## [27] "pct_counts_feature_control"                    
## [28] "pct_counts_in_top_50_features_feature_control" 
## [29] "pct_counts_in_top_100_features_feature_control"
## [30] "pct_counts_in_top_200_features_feature_control"
## [31] "pct_counts_in_top_500_features_feature_control"
## [32] "total_features_by_counts_ERCC"                 
## [33] "log10_total_features_by_counts_ERCC"           
## [34] "total_counts_ERCC"                             
## [35] "log10_total_counts_ERCC"                       
## [36] "pct_counts_ERCC"                               
## [37] "pct_counts_in_top_50_features_ERCC"            
## [38] "pct_counts_in_top_100_features_ERCC"           
## [39] "pct_counts_in_top_200_features_ERCC"           
## [40] "pct_counts_in_top_500_features_ERCC"
A more detailed description can be found at the tutorial site, or by running: ?calculateQCMetrics
If you have additional qc-metrics that you want to include, like mapping stats, rseqc data etc, you can include all of that in your phenoData.
You can play around with the data interactively with the shiny app they provide. OBS! It takes a while to load and plot, so be patient.
# you can open the interactive gui with:
scater_gui(example_sce)
Different ways of visualizing gene expression per batch/celltype etc.
# plot detected genes at different library depth for different plates and celltypes
plotScater(example_sce, block1 = "Plate", block2 = "Celltype",
     colour_by = "Celltype", nfeatures = 300, exprs_values = "exprs")

# violin plot for gene expression
plotExpression(example_sce, rownames(example_sce)[6:11],
               x = "Celltype", exprs_values = "exprs", 
               colour = "Donor",log=TRUE)

plotExpression(example_sce, rownames(example_sce)[6:11],
               x = "Celltype", exprs_values = "counts", colour = "Donor",
               show_median = TRUE, show_violin = FALSE,  log = TRUE)

You can play around with all the arguments in plotExpression, for example:
And specify different coloring and and batches to plot by that are defined in the CellMetadata (ex-phenoData in the SCESet class).
There are several ways to plot the QC summaries of the cells in the scater package. A few examples are provided below. In this case, cells have already been filtered to remove low quality samples, so no filtering step is performed.
# first remove all features with no/low expression, here set to expression in more than 5 cells with > 1 count
keep_feature <- rowSums(counts(example_sce) > 1) > 5
example_sce <- example_sce[keep_feature,]
## Plot highest expressed genes.
plotHighestExprs(example_sce, colour_cells_by="Celltype")

Plot frequency of expression (number of cells with detection) vs mean normalised expression.
plotExprsFreqVsMean(example_sce)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Plot log10 total count vs number of cells a gene is detected in.
plotRowData(example_sce, x = "n_cells_by_counts", y = "log10_total_counts")

Plot different qc-metrics per batch.
p1 <- plotColData(example_sce, x = "Donor", y = "total_features_by_counts",
                  colour_by = "log10_total_counts")
p2 <- plotColData(example_sce, x = "Celltype", y = "total_features_by_counts",
                  colour_by = "log10_total_counts")
multiplot(p1, p2, rows = 2)

## [1] 2
Plot the percentage of expression accounted for by feature controls against total_features.
plotColData(example_sce,  x = "total_features_by_counts", y = "pct_counts_feature_control", colour_by = "Donor")

Plot the cells in reduced space and define color/shape/size by different qc-metrics or meta-data entries.
It is adviced to first run functions like runPCA, runTSNE etc before hand so that they are stored in the SCE object, so that there is no need to rerun those functions each time that you are plotting.
The reduced dimensions can either be plotted with functions like plotPCA, plotTSNE etc. Or with the fucntion plotReducedDim and specifying use_dimred = "pca" or similar.
# run PCA with 1000 top variable genes
example_sce <- runPCA(example_sce, ntop = 1000, exprs_values = "exprs", ncomponents = 20)
# PCA - with different coloring, first 4 components
# first by Donor
plotPCA(example_sce,ncomponents=4,colour_by="Celltype",shape_by="Donor")

# then by Celltype
plotPCA(example_sce,ncomponents=4,colour_by="Donor",shape_by="Celltype")

# Diffusion map, OBS! Requires installation of package destiny to run!
set.seed(1)
example_sce <- runDiffusionMap(example_sce, ntop = 1000, ncomponents = 4)
plotDiffusionMap(example_sce, colour_by="Celltype",shape_by="Donor",ncomponents=4)

# tSNE - uses Rtsne function to run tsne
set.seed(1)
example_sce <- runTSNE(example_sce, ntop = 1000, ncomponents = 2, perplexity = 30, n_dimred = 10)
plotTSNE(example_sce, colour_by="Celltype",shape_by="Donor")

# UMAP, OBS! Requires installation of package umap to run!
set.seed(1)
example_sce <- runUMAP(example_sce)
plotUMAP(object = example_sce, colour_by="Celltype",shape_by="Donor")

For all of these dimensionality reduction methods, you can specify return_SCE = TRUE and it will return an SCESet object with the slot reducedDimension filled. This can be usefule if PCA/tSNE takes long time to run and you want to plot several different colors etc.
You can later plot the reduced dimension with plotReducedDim.
PCA based on the phenoData can be used to detect outlier cells with qc-measures that deviates from the rest. But be careful with checking how these cells deviate before taking a decision on why to remove them.
OBS! detection of outlier requires that package mvoutlier is installed.
example_sce <- runPCA(example_sce, use_coldata = TRUE,
    detect_outliers = TRUE)
plotReducedDim(example_sce, use_dimred="PCA_coldata")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

# we can use the filter function to remove all outlier cells
filtered_sce <- filter(example_sce, outlier==FALSE)
Median marginal R2 for each variable in pData(example_sceset) when fitting a linear model regressing exprs values against just that variable. Shows how much of the data variation is explained by a single variable.
plotExplanatoryVariables(example_sce)

Identify PCs that correlate strongly to certain QC or Meta-data values
# for total_features
plotExplanatoryPCs(example_sce)

PC1 clearly correlates to total_features, which is a common problem in scRNAseq data. This may be a technical artifact, or a biological features of celltypes with very different sizes.
It is also clear that PC1 separates out the different plates, while PC2 & PC4 separates the celltypes.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS: /sw/apps/R/x86_64/3.5.0/rackham/lib64/R/lib/libRblas.so
## LAPACK: /sw/apps/R/x86_64/3.5.0/rackham/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rmarkdown_1.11              bindrcpp_0.2.2             
##  [3] scater_1.10.1               ggplot2_3.1.0              
##  [5] SingleCellExperiment_1.4.1  SummarizedExperiment_1.12.0
##  [7] DelayedArray_0.8.0          BiocParallel_1.16.5        
##  [9] matrixStats_0.54.0          Biobase_2.42.0             
## [11] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
## [13] IRanges_2.16.0              S4Vectors_0.20.1           
## [15] BiocGenerics_0.28.0        
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15               ggbeeswarm_0.6.0        
##   [3] colorspace_1.4-0         mvoutlier_2.0.9         
##   [5] RcppEigen_0.3.3.5.0      modeltools_0.2-22       
##   [7] class_7.3-14             rio_0.5.16              
##   [9] mclust_5.4.2             XVector_0.22.0          
##  [11] pls_2.7-0                proxy_0.4-22            
##  [13] cvTools_0.3.2            flexmix_2.3-14          
##  [15] RSpectra_0.13-1          mvtnorm_1.0-8           
##  [17] splines_3.5.0            sROC_0.1-2              
##  [19] robustbase_0.93-3        knitr_1.21              
##  [21] jsonlite_1.6             robCompositions_2.0.9   
##  [23] umap_0.2.0.0             kernlab_0.9-27          
##  [25] cluster_2.0.7-1          HDF5Array_1.10.1        
##  [27] BiocManager_1.30.4       rrcov_1.4-7             
##  [29] compiler_3.5.0           assertthat_0.2.0        
##  [31] Matrix_1.2-14            lazyeval_0.2.1          
##  [33] htmltools_0.3.6          tools_3.5.0             
##  [35] igraph_1.2.2             gtable_0.2.0            
##  [37] glue_1.3.0               GenomeInfoDbData_1.2.0  
##  [39] reshape2_1.4.3           dplyr_0.7.8             
##  [41] ggthemes_4.0.1           Rcpp_1.0.0              
##  [43] carData_3.0-2            trimcluster_0.1-2.1     
##  [45] cellranger_1.1.0         zCompositions_1.1.2     
##  [47] sgeostat_1.0-27          fpc_2.1-11.1            
##  [49] DelayedMatrixStats_1.4.0 lmtest_0.9-36           
##  [51] xfun_0.4                 laeken_0.5.0            
##  [53] stringr_1.3.1            openxlsx_4.1.0          
##  [55] DEoptimR_1.0-8           zlibbioc_1.28.0         
##  [57] MASS_7.3-50              zoo_1.8-4               
##  [59] scales_1.0.0             VIM_4.7.0               
##  [61] hms_0.4.2                rhdf5_2.26.2            
##  [63] RColorBrewer_1.1-2       yaml_2.2.0              
##  [65] curl_3.3                 NADA_1.6-1              
##  [67] reticulate_1.10          gridExtra_2.3           
##  [69] reshape_0.8.8            stringi_1.2.4           
##  [71] pcaPP_1.9-73             e1071_1.7-0.1           
##  [73] destiny_2.12.0           TTR_0.23-4              
##  [75] boot_1.3-20              zip_1.0.0               
##  [77] truncnorm_1.0-8          prabclus_2.2-7          
##  [79] rlang_0.3.1              pkgconfig_2.0.2         
##  [81] bitops_1.0-6             evaluate_0.12           
##  [83] lattice_0.20-35          purrr_0.3.0             
##  [85] Rhdf5lib_1.4.2           bindr_0.1.1             
##  [87] labeling_0.3             tidyselect_0.2.5        
##  [89] GGally_1.4.0             plyr_1.8.4              
##  [91] magrittr_1.5             R6_2.3.0                
##  [93] pillar_1.3.1             haven_2.0.0             
##  [95] foreign_0.8-71           withr_2.1.2             
##  [97] xts_0.11-2               survival_2.42-6         
##  [99] scatterplot3d_0.3-41     abind_1.4-5             
## [101] RCurl_1.95-4.11          sp_1.3-1                
## [103] nnet_7.3-12              tibble_2.0.1            
## [105] crayon_1.3.4             car_3.0-2               
## [107] viridis_0.5.1            grid_3.5.0              
## [109] readxl_1.2.0             data.table_1.12.0       
## [111] forcats_0.3.0            diptest_0.75-7          
## [113] vcd_1.4-4                digest_0.6.18           
## [115] munsell_0.5.0            beeswarm_0.2.3          
## [117] viridisLite_0.3.0        smoother_1.1            
## [119] vipor_0.4.5              tcltk_3.5.0