In this tutorial we will look at different ways of integrating multiple single cell RNA-seq datasets. We will explore two different methods to correct for batch effects across datasets. We will also look at a quantitative measure to assess the quality of the integrated data. Seurat uses the data integration method presented in Comprehensive Integration of Single Cell Data, while Scran and Scanpy use a mutual Nearest neighbour method (MNN). Below you can find a list of the most recent methods for single data integration:

Markdown Language Library Ref
CCA R Seurat Cell
MNN R/Python Scater/Scanpy Nat. Biotech.
Conos R conos Nat. Methods
Scanorama Python scanorama Nat. Biotech.

Let’s first load necessary libraries and the data saved in the previous lab.


alldata <- readRDS("data/results/covid_qc_dr.rds")
## [1] "pca"               "umap"              "tsne"             
## [4] "UMAP10_on_PCA"     "UMAP_on_ScaleData" "UMAP_on_Graph"

We split the combined object into a list, with each dataset as an element. We perform standard preprocessing (log-normalization), and identify variable features individually for each dataset based on a variance stabilizing transformation (“vst”).

alldata.list <- SplitObject(alldata, = "orig.ident")

for (i in 1:length(alldata.list)) {
    alldata.list[[i]] <- NormalizeData(alldata.list[[i]], verbose = FALSE)
    alldata.list[[i]] <- FindVariableFeatures(alldata.list[[i]], selection.method = "vst", 
        nfeatures = 2000, verbose = FALSE)

hvgs_per_dataset <- lapply(alldata.list, function(x) {
venn::venn(hvgs_per_dataset, opacity = 0.4, zcolor = (scales::hue_pal())(3), cexsn = 1, 
    cexil = 1, lwd = 1, col = "white", frame = F, borders = NA)