Get data

In this tutorial, we will be using a publicly available dataset from 10X Genomics, available throught the Bioconductor TENxPBMCData package. The package uses AnnotationHub to download the required data and store them on local cache for reuse.

## class: SingleCellExperiment 
## dim: 31232 7040 
## metadata(0):
## assays(1): counts
## rownames(31232): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000198695 ENSG00000198727
## rowData names(3): ENSEMBL_ID Symbol Symbol_TENx
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## spikeNames(0):

The data are stored in a SingleCellExperiment object.

Note that the data are internally stored in a HDF5 file (.h5), which means that they are not loaded in memory, until it is necessary to do so. Many Bioconductor packages, including the ones that we will use in this tutorial, use block processing to ensure that we can work even with datasets larger than the available RAM.

## <31232 x 7040> DelayedMatrix object of type "integer":
##                    [,1]    [,2]    [,3]    [,4] ... [,7037] [,7038]
## ENSG00000243485       0       0       0       0   .       0       0
## ENSG00000237613       0       0       0       0   .       0       0
## ENSG00000186092       0       0       0       0   .       0       0
## ENSG00000238009       0       0       0       0   .       0       0
## ENSG00000239945       0       0       0       0   .       0       0
##             ...       .       .       .       .   .       .       .
## ENSG00000212907       0       0       0       1   .       1       3
## ENSG00000198886      10      33       3       3   .       9      28
## ENSG00000198786       1       1       2       2   .       9      10
## ENSG00000198695       0       0       0       0   .       0       1
## ENSG00000198727       4       8       4       2   .       9      20
##                 [,7039] [,7040]
## ENSG00000243485       0       0
## ENSG00000237613       0       0
## ENSG00000186092       0       0
## ENSG00000238009       0       0
## ENSG00000239945       0       0
##             ...       .       .
## ENSG00000212907       0       1
## ENSG00000198886       7       6
## ENSG00000198786       3       9
## ENSG00000198695       0       0
## ENSG00000198727       3       9
## [[1]]
## An object of class "HDF5ArraySeed"
## Slot "filepath":
## [1] "/Users/davide/Library/Caches/ExperimentHub/4bcb2ca4b42e_1605"
## 
## Slot "name":
## [1] "counts"
## 
## Slot "dim":
## [1] 32738  2700
## 
## Slot "first_val":
## [1] 0
## 
## Slot "chunkdim":
## [1] 631  52
## 
## 
## [[2]]
## An object of class "HDF5ArraySeed"
## Slot "filepath":
## [1] "/Users/davide/Library/Caches/ExperimentHub/db1a1842fd94_1611"
## 
## Slot "name":
## [1] "counts"
## 
## Slot "dim":
## [1] 33694  4340
## 
## Slot "first_val":
## [1] 0
## 
## Slot "chunkdim":
## [1] 512  66
## 9941608 bytes

Note that the sce object already includes several metadata, called “column data”, which can be accessed with the colData function.

## DataFrame with 7040 rows and 11 columns
##           Sample            Barcode         Sequence   Library
##      <character>        <character>      <character> <integer>
## 1         pbmc3k   AAACATACAACCAC-1   AAACATACAACCAC         1
## 2         pbmc3k   AAACATTGAGCTAC-1   AAACATTGAGCTAC         1
## 3         pbmc3k   AAACATTGATCAGC-1   AAACATTGATCAGC         1
## 4         pbmc3k   AAACCGTGCTTCCG-1   AAACCGTGCTTCCG         1
## 5         pbmc3k   AAACCGTGTATGCG-1   AAACCGTGTATGCG         1
## ...          ...                ...              ...       ...
## 7036      pbmc4k TTTGGTTTCGCTAGCG-1 TTTGGTTTCGCTAGCG         1
## 7037      pbmc4k TTTGTCACACTTAACG-1 TTTGTCACACTTAACG         1
## 7038      pbmc4k TTTGTCACAGGTCCAC-1 TTTGTCACAGGTCCAC         1
## 7039      pbmc4k TTTGTCAGTTAAGACA-1 TTTGTCAGTTAAGACA         1
## 7040      pbmc4k TTTGTCATCCCAAGAT-1 TTTGTCATCCCAAGAT         1
##      Cell_ranger_version Tissue_status Barcode_type   Chemistry
##              <character>   <character>  <character> <character>
## 1                 v1.1.0            NA      GemCode Chromium_v1
## 2                 v1.1.0            NA      GemCode Chromium_v1
## 3                 v1.1.0            NA      GemCode Chromium_v1
## 4                 v1.1.0            NA      GemCode Chromium_v1
## 5                 v1.1.0            NA      GemCode Chromium_v1
## ...                  ...           ...          ...         ...
## 7036                v2.1            NA     Chromium Chromium_v2
## 7037                v2.1            NA     Chromium Chromium_v2
## 7038                v2.1            NA     Chromium Chromium_v2
## 7039                v2.1            NA     Chromium Chromium_v2
## 7040                v2.1            NA     Chromium Chromium_v2
##      Sequence_platform    Individual Date_published
##            <character>   <character>    <character>
## 1           NextSeq500 HealthyDonor2     2016-05-26
## 2           NextSeq500 HealthyDonor2     2016-05-26
## 3           NextSeq500 HealthyDonor2     2016-05-26
## 4           NextSeq500 HealthyDonor2     2016-05-26
## 5           NextSeq500 HealthyDonor2     2016-05-26
## ...                ...           ...            ...
## 7036         Hiseq4000 HealthyDonor1     2017-11-08
## 7037         Hiseq4000 HealthyDonor1     2017-11-08
## 7038         Hiseq4000 HealthyDonor1     2017-11-08
## 7039         Hiseq4000 HealthyDonor1     2017-11-08
## 7040         Hiseq4000 HealthyDonor1     2017-11-08

Similarly, the object contains information on the genes, called “row data”, which can be accessed with the rowData function.

## DataFrame with 31232 rows and 3 columns
##                      ENSEMBL_ID       Symbol  Symbol_TENx
##                     <character>  <character>  <character>
## ENSG00000243485 ENSG00000243485           NA RP11-34P13.3
## ENSG00000237613 ENSG00000237613      FAM138A      FAM138A
## ENSG00000186092 ENSG00000186092        OR4F5        OR4F5
## ENSG00000238009 ENSG00000238009 LOC100996442 RP11-34P13.7
## ENSG00000239945 ENSG00000239945           NA RP11-34P13.8
## ...                         ...          ...          ...
## ENSG00000212907 ENSG00000212907         ND4L      MT-ND4L
## ENSG00000198886 ENSG00000198886          ND4       MT-ND4
## ENSG00000198786 ENSG00000198786          ND5       MT-ND5
## ENSG00000198695 ENSG00000198695          ND6       MT-ND6
## ENSG00000198727 ENSG00000198727         CYTB       MT-CYB

With data in place, now we can start loading the libraries we will use in this tutorial.

## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3
## Loading required package: ggplot2
## 
## Attaching package: 'scater'
## The following object is masked from 'package:S4Vectors':
## 
##     rename
## The following object is masked from 'package:stats':
## 
##     filter

Identifying empty droplets

In droplet data (e.g., 10X Genomics, Dropseq) the libraries are made from the droplets, which are not guaranteed to cvontain a cell. Thus, we need to distinguish between cells and empty droplets based on the observed expression profiles. A complication is that empty droplets may contain ambient (i.e., extracellular) RNA that can be captured and sequenced, resulting in non-zero counts for libraries that do not contain any cell.

This step should be carried on starting from the “unfiltered” matrices from CellRanger (or similar pipeline), because these pipelines usually include a heuristic to filter out empty droplets. Here, we show how to use the functions on the pbmc4k data, for illustration purposes. We will use the default filtering employed by CellRanger for the remainder of the tutorial, hence the following chunks do not need to be run for the rest of the tutorial to work.

Testing for empty droplets

emptyDrops() computes Monte Carlo p-values based on a Dirichlet-multinomial model of sampling molecules into droplets.

emptyDrops() assumes that libraries with total UMI counts below a certain threshold (lower=100 by default) correspond to empty droplets. These are used to estimate the ambient expression profile against which the remaining libraries are tested. Under this definition, these low-count libraries cannot be cell-containing droplets and are excluded from the hypothesis testing.

To retain only the detected cells, we would subset our SingleCellExperiment object.

Calculate QC

Having removed the empty droplets, we can start calculating some quality metrics. We can for example calculate the percentage of mitocondrial and ribossomal genes per cell and add to the metadata. This will be helpfull to visualize them across different metadata parameteres (i.e. datasetID and chemistry version). There are several ways of doing this, and here manually calculate the proportion of mitochondrial reads and add to the metadata table.

Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017): “High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.”

Analogously, we can calculate the proportion of gene expression coming from ribosomal proteins.

Given these set of genes, the scater package automatically calculates several per-cell QC metrics.

##  [1] 31220 31221 31222 31223 31224 31225 31226 31227 31228 31229

QC-based filtering

A standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. While simple, using fixed thresholds requires knowledge of the experiment and of the experimental protocol.

An alternative approach is to use adaptive, data-driven thresholds to identify outlying cells, based on the set of QC metrics just calculated.

We identify cells that are outliers for the various QC metrics, based on the median absolute deviation (MAD) from the median value of each metric across all cells. Specifically, a value is considered an outlier if it is more than 3 MADs from the median in the “problematic” direction.

## discard
## FALSE  TRUE 
##  6678   362

Additionaly, Extremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type. In these datasets, there is also a clear difference between the v1 vs v2 10x chemistry with regards to gene detection, so it may not be fair to apply the same cutoffs to all of them. Hence, we do not discard cells that have a high number of detected genes and use specific tools to detect doublets later on.

A similar approach is implemented in the metric_sample_filter function of the scone Bioconductor package.

Filtering

Cell filtering

Until now we only marked low-quality cells. We could decide to keep all the data in the object, including the low-quality cells, and “keep an eye” on the low-quality cells at the interpretation stage.

For simplicity, it is often better to discard the low-quality cells at the QC stage. To do so it is sufficient to subset the sce object.

## class: SingleCellExperiment 
## dim: 31232 6678 
## metadata(0):
## assays(1): counts
## rownames(31232): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000198695 ENSG00000198727
## rowData names(12): ENSEMBL_ID Symbol ... total_counts
##   log10_total_counts
## colnames: NULL
## colData names(57): Sample Barcode ...
##   pct_counts_in_top_500_features_Ribo discard
## reducedDimNames(0):
## spikeNames(0):

Gene filtering

Naturally, we want to exclude genes that are not expressed in our system, as they do not contribute any information to our experiment.

## 
## FALSE  TRUE 
## 19249 11983

In addition, very lowly expressed genes may only contribute noise. Hence, it is often suggested to remove genes that are not expressed in at least a certain proportion of cells in the dataset. Here, we keep those genes that are expressed in at least 5% of the data.

## class: SingleCellExperiment 
## dim: 5407 6678 
## metadata(0):
## assays(1): counts
## rownames(5407): ENSG00000188976 ENSG00000187608 ...
##   ENSG00000198695 ENSG00000198727
## rowData names(12): ENSEMBL_ID Symbol ... total_counts
##   log10_total_counts
## colnames: NULL
## colData names(57): Sample Barcode ...
##   pct_counts_in_top_500_features_Ribo discard
## reducedDimNames(0):
## spikeNames(0):

Note that if we expect one or more rare cell populations we might need to decrease the percentage of cells.

Normalization

Each cell is sequenced at a different sequencing depth. This difference is a combination of different experimental factors, such as cDNA capture and PCR amplification. Normalization aims to remove these differences to ensure that the when we compare expression profiles between cells we are measuring biology and not technical biases.

Here, we focus on “scaling normalization”, which is the simplest and most common normalization strategy. As the name suggests, it simply involves scaling the number of reads by a cell-specific factor, often known as a size factor.

The easiest option is to divide the counts of each cell by the total number of reads. This can be done in the scater.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1986  0.6247  0.8838  1.0000  1.2056  9.4696

While simple and intuitive, this strategy may be problematic when there is an imbalance in differential expression among cells, e.g., when few highly expressed genes are also those that mark each cell type. This is a well-known problem in bulk RNA-seq and many normalization methods have been shown to work better than library size normalization (e.g., TMM or DESeq normalization).

However, single-cell data can be problematic for these methods, due to the large number of zero and low counts. To overcome this, the scran normalization pools together similar cells to compute and then deconvolve size factors. To avoid pooling together very different cells, a quick clustering is performed prior to the pooling and only cells from the same cluster can be pooled.

The scran package has a quickCluster function, but here we use the faster mini-batch k-means algorithm for this initial step.

## 
##    1    2    3    4    5    6    7    8    9   10 
##  578  972  497   13  741 1316  855  131 1402  173

Once we have the appropriate size factors, we can transform the data using the scater package.

## class: SingleCellExperiment 
## dim: 5407 6678 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(5407): ENSG00000188976 ENSG00000187608 ...
##   ENSG00000198695 ENSG00000198727
## rowData names(12): ENSEMBL_ID Symbol ... total_counts
##   log10_total_counts
## colnames: NULL
## colData names(57): Sample Barcode ...
##   pct_counts_in_top_500_features_Ribo discard
## reducedDimNames(0):
## spikeNames(0):
## <5407 x 6678> DelayedMatrix object of type "double":
##                      [,1]      [,2]      [,3] ...   [,6677]   [,6678]
## ENSG00000188976  0.000000  0.000000  0.000000   .  0.000000  0.000000
## ENSG00000187608  0.000000  0.000000  1.170138   .  1.183913  0.000000
## ENSG00000078808  0.000000  0.000000  1.170138   .  0.000000  0.000000
## ENSG00000160087  0.000000  0.000000  0.000000   .  0.000000  0.000000
## ENSG00000127054  0.000000  0.000000  0.000000   .  0.000000  0.000000
##             ...         .         .         .   .         .         .
## ENSG00000212907 0.0000000 0.0000000 0.0000000   . 0.0000000 0.9988313
## ENSG00000198886 3.9016968 4.6010242 2.2482296   . 3.3079307 2.8053509
## ENSG00000198786 1.2597905 0.7698663 1.8076282   . 2.2677643 3.3198238
## ENSG00000198695 0.0000000 0.0000000 0.0000000   . 0.0000000 0.0000000
## ENSG00000198727 2.7177451 2.7313775 2.5852813   . 2.2677643 3.3198238

This function adds a logcounts slot to the object with the normalized data.

Doublet detection

Doublets are artifactual libraries generated from two cells. This happens if two cells are mistakenly captured in the same droplet. Doublets are particularly problematic for two reasons: (a) having twice as much the RNA of a single cell they appear as an extremely good quality sample; (b) they can be mistaken for intermediate populations or transitory states that do not actually exist.

We can infer doublets with computational approaches. Two approaches are implemented in the scran package. One needs cluster information and one does not. Here, we use the clustering results obtained earlier for simplicity, but a proper cluster analysis (seen in later lectures) would be preferable here.

Using cluster information

The doubletCluster() function identifes clusters with expression profiles lying between two other clusters. Considering every possible triplet of clusters, the method uses the number of DE genes, the median library size, and the proporion of cells in the cluster to mark clusters as possible doublets.

## DataFrame with 10 rows and 9 columns
##        source1     source2         N            best               p.value
##    <character> <character> <integer>     <character>             <numeric>
## 8           10           5         0 ENSG00000198763     0.786693566587063
## 10           4           3         0 ENSG00000203747     0.117093138746492
## 5            7           6         1 ENSG00000251562  1.93519641117096e-96
## 6            5           4        45 ENSG00000150045  2.67663777277273e-11
## 4           10           2        65 ENSG00000158062  1.11741567263486e-09
## 9            6           1        66 ENSG00000213402 1.36914320388869e-187
## 7            5           4        70 ENSG00000177600  4.00497986877925e-06
## 3            4           1       189 ENSG00000163221  5.17291015775857e-43
## 2            9           7       242 ENSG00000142541 5.05535470531125e-108
## 1            4           3       293 ENSG00000128383     2.03735403382e-36
##            lib.size1         lib.size2                prop
##            <numeric>         <numeric>           <numeric>
## 8   1.14728492983527 0.452837095790116  0.0196166516921234
## 10  1.85045734950011 0.469261859178898   0.025905959868224
## 5     1.291565615737 0.809350579358663   0.110961365678347
## 6   1.23555851506576  5.79257532878309    0.19706498951782
## 4  0.540406943326819 0.142171513967123 0.00194669062593591
## 9   1.77616794795979  1.72531046717918    0.20994309673555
## 7  0.774254120592531  3.62987690381807   0.128032345013477
## 3   3.94333635539438 0.661264732547597  0.0744234800838574
## 2  0.683646654538104   1.9377400444714     0.1455525606469
## 1   5.96332476435304  1.51225364181662  0.0865528601377658
##                                   all.pairs
##                             <DataFrameList>
## 8       10:5:0:...,10:7:0:...,5:4:3:...,...
## 10       4:3:0:...,4:1:7:...,8:4:52:...,...
## 5       7:6:1:...,8:7:17:...,7:4:35:...,...
## 6      5:4:45:...,8:5:83:...,7:4:90:...,...
## 4   10:2:65:...,10:7:71:...,10:5:77:...,...
## 9     6:1:66:...,4:2:97:...,5:1:102:...,...
## 7     5:4:70:...,5:2:74:...,8:2:117:...,...
## 3  4:1:189:...,8:4:292:...,10:4:316:...,...
## 2   9:7:242:...,5:4:245:...,7:4:246:...,...
## 1   4:3:293:...,9:4:293:...,4:2:312:...,...
## [1] "8"  "10"

These clusters are marked as possible doublet clusters, so care should be taken when interpreting their biological identity.

Using simulations

The other approach uses simulation of in silico doublets, which are then compared to the original cells. If some cells look similar to the simulated doublets, they are likely to be doublets themselves.

The advantage of this approach is that it produces a “doublet score” for each cell, at a cost of more computational burden.

Finally, let’s save the QC-filtered, normalized data for further analysis. Because our data are in HDF5 format, we use a specialized function.

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] mbkmeans_1.0.0              scran_1.12.1               
##  [3] scater_1.12.2               ggplot2_3.2.1              
##  [5] DropletUtils_1.4.3          TENxPBMCData_1.2.0         
##  [7] HDF5Array_1.12.2            rhdf5_2.28.0               
##  [9] SingleCellExperiment_1.6.0  SummarizedExperiment_1.14.1
## [11] DelayedArray_0.10.0         BiocParallel_1.18.1        
## [13] matrixStats_0.55.0          Biobase_2.44.0             
## [15] GenomicRanges_1.36.1        GenomeInfoDb_1.20.0        
## [17] IRanges_2.18.3              S4Vectors_0.22.1           
## [19] BiocGenerics_0.30.0        
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.6.0              colorspace_1.4-1             
##   [3] dynamicTreeCut_1.63-1         benchmarkme_1.0.2            
##   [5] XVector_0.24.0                BiocNeighbors_1.2.0          
##   [7] bit64_0.9-7                   interactiveDisplayBase_1.22.0
##   [9] AnnotationDbi_1.46.1          ClusterR_1.2.0               
##  [11] codetools_0.2-16              splines_3.6.1                
##  [13] R.methodsS3_1.7.1             doParallel_1.0.15            
##  [15] knitr_1.25                    zeallot_0.1.0                
##  [17] ade4_1.7-13                   cluster_2.1.0                
##  [19] dbplyr_1.4.2                  R.oo_1.22.0                  
##  [21] FD_1.0-12                     shiny_1.4.0                  
##  [23] BiocManager_1.30.7            compiler_3.6.1               
##  [25] httr_1.4.1                    dqrng_0.2.1                  
##  [27] backports_1.1.5               assertthat_0.2.1             
##  [29] Matrix_1.2-17                 fastmap_1.0.1                
##  [31] lazyeval_0.2.2                limma_3.40.6                 
##  [33] later_1.0.0                   BiocSingular_1.0.0           
##  [35] htmltools_0.4.0               tools_3.6.1                  
##  [37] gmp_0.5-13.5                  rsvd_1.0.2                   
##  [39] igraph_1.2.4.1                gtable_0.3.0                 
##  [41] glue_1.3.1                    GenomeInfoDbData_1.2.1       
##  [43] dplyr_0.8.3                   rappdirs_0.3.1               
##  [45] Rcpp_1.0.2                    vctrs_0.2.0                  
##  [47] ape_5.3                       nlme_3.1-140                 
##  [49] ExperimentHub_1.10.0          iterators_1.0.12             
##  [51] DelayedMatrixStats_1.6.1      xfun_0.10                    
##  [53] stringr_1.4.0                 beachmat_2.0.0               
##  [55] mime_0.7                      irlba_2.3.3                  
##  [57] gtools_3.8.1                  statmod_1.4.32               
##  [59] AnnotationHub_2.16.1          edgeR_3.26.8                 
##  [61] zlibbioc_1.30.0               MASS_7.3-51.4                
##  [63] scales_1.0.0                  promises_1.1.0               
##  [65] yaml_2.2.0                    curl_4.2                     
##  [67] memoise_1.1.0                 gridExtra_2.3                
##  [69] stringi_1.4.3                 RSQLite_2.1.2                
##  [71] foreach_1.4.7                 permute_0.9-5                
##  [73] benchmarkmeData_1.0.2         geometry_0.4.4               
##  [75] rlang_0.4.0                   pkgconfig_2.0.3              
##  [77] bitops_1.0-6                  evaluate_0.14                
##  [79] lattice_0.20-38               purrr_0.3.2                  
##  [81] Rhdf5lib_1.6.1                labeling_0.3                 
##  [83] cowplot_1.0.0                 bit_1.1-14                   
##  [85] tidyselect_0.2.5              magrittr_1.5                 
##  [87] R6_2.4.0                      DBI_1.0.0                    
##  [89] pillar_1.4.2                  withr_2.1.2                  
##  [91] mgcv_1.8-28                   abind_1.4-5                  
##  [93] RCurl_1.95-4.12               tibble_2.1.3                 
##  [95] crayon_1.3.4                  BiocFileCache_1.8.0          
##  [97] rmarkdown_1.16                viridis_0.5.1                
##  [99] locfit_1.5-9.1                grid_3.6.1                   
## [101] blob_1.2.0                    vegan_2.5-6                  
## [103] digest_0.6.21                 xtable_1.8-4                 
## [105] httpuv_1.5.2                  R.utils_2.9.0                
## [107] munsell_0.5.0                 beeswarm_0.2.3               
## [109] viridisLite_0.3.0             vipor_0.4.5                  
## [111] magic_1.5-9