Data Integration

Bioconductor Toolkit

Combining and harmonizing samples or datasets from different batches such as experiments or conditions to enable meaningful cross-sample comparisons.
Authors

Åsa Björklund

Paulo Czarnewski

Susanne Reinsbach

Roy Francis

Published

09-Feb-2024

Note

Code chunks run R commands unless otherwise specified.

In this tutorial we will look at different ways of integrating multiple single cell RNA-seq datasets. We will explore a few different methods to correct for batch effects across datasets. Seurat uses the data integration method presented in Comprehensive Integration of Single Cell Data, while Scran and Scanpy use a mutual Nearest neighbour method (MNN). Below you can find a list of some methods for single data integration:

Markdown Language Library Ref
CCA R Seurat Cell
MNN R/Python Scater/Scanpy Nat. Biotech.
Conos R conos Nat. Methods
Scanorama Python scanorama Nat. Biotech.

1 Data preparation

Let’s first load necessary libraries and the data saved in the previous lab.

# Activate scanorama Python venv
reticulate::use_virtualenv("/opt/venv/scanorama")
reticulate::py_discover_config()
python:         /opt/venv/scanorama/bin/python
libpython:      /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so
pythonhome:     /opt/venv/scanorama:/opt/venv/scanorama
version:        3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
numpy:          /opt/venv/scanorama/lib/python3.10/site-packages/numpy
numpy_version:  1.26.3

NOTE: Python version was forced by use_python function
suppressPackageStartupMessages({
    library(scater)
    library(scran)
    library(patchwork)
    library(ggplot2)
    library(batchelor)
    library(harmony)
    library(reticulate)
})
# download pre-computed data if missing or long compute
fetch_data <- TRUE

# url for source and intermediate data
path_data <- "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq"
path_file <- "data/covid/results/bioc_covid_qc_dr.rds"
if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)
if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results/bioc_covid_qc_dr.rds"), destfile = path_file)
sce <- readRDS(path_file)
print(reducedDims(sce))
List of length 8
names(8): PCA UMAP tSNE_on_PCA ... UMAP_on_ScaleData KNN UMAP_on_Graph

We split the combined object into a list, with each dataset as an element. We perform standard preprocessing (log-normalization), and identify variable features individually for each dataset based on a variance stabilizing transformation (vst).

sce.list <- lapply(unique(sce$sample), function(x) {
    x <- sce[, sce$sample == x]
})

hvgs_per_dataset <- lapply(sce.list, function(x) {
    x <- computeSumFactors(x, sizes = c(20, 40, 60, 80))
    x <- logNormCounts(x)
    var.out <- modelGeneVar(x, method = "loess")
    hvg.out <- var.out[which(var.out$FDR <= 0.05 & var.out$bio >= 0.2), ]
    hvg.out <- hvg.out[order(hvg.out$bio, decreasing = TRUE), ]
    return(rownames(hvg.out))
})
names(hvgs_per_dataset) <- unique(sce$sample)

# venn::venn(hvgs_per_dataset,opacity = .4,zcolor = scales::hue_pal()(3),cexsn = 1,cexil = 1,lwd=1,col="white",borders = NA)

temp <- unique(unlist(hvgs_per_dataset))
overlap <- sapply(hvgs_per_dataset, function(x) {
    temp %in% x
})
pheatmap::pheatmap(t(overlap * 1), cluster_rows = F, color = c("grey90", "grey20")) ## MNN

The mutual nearest neighbors (MNN) approach within the scran package utilizes a novel approach to adjust for batch effects. The fastMNN() function returns a representation of the data with reduced dimensionality, which can be used in a similar fashion to other lower-dimensional representations such as PCA. In particular, this representation can be used for downstream methods such as clustering. The BNPARAM can be used to specify the specific nearest neighbors method to use from the BiocNeighbors package. Here we make use of the Annoy library via the BiocNeighbors::AnnoyParam() argument. We save the reduced-dimension MNN representation into the reducedDims slot of our sce object.

mnn_out <- batchelor::fastMNN(sce, subset.row = unique(unlist(hvgs_per_dataset)), batch = factor(sce$sample), k = 20, d = 50)
Caution

fastMNN() does not produce a batch-corrected expression matrix.

mnn_out <- t(reducedDim(mnn_out, "corrected"))
colnames(mnn_out) <- unlist(lapply(sce.list, function(x) {
    colnames(x)
}))
mnn_out <- mnn_out[, colnames(sce)]
rownames(mnn_out) <- paste0("dim", 1:50)
reducedDim(sce, "MNN") <- t(mnn_out)

We can observe that a new assay slot is now created under the name MNN.

reducedDims(sce)
List of length 9
names(9): PCA UMAP tSNE_on_PCA UMAP_on_PCA ... KNN UMAP_on_Graph MNN

Thus, the result from fastMNN() should solely be treated as a reduced dimensionality representation, suitable for direct plotting, TSNE/UMAP, clustering, and trajectory analysis that relies on such results.

set.seed(42)
sce <- runTSNE(sce, dimred = "MNN", n_dimred = 50, perplexity = 30, name = "tSNE_on_MNN")
sce <- runUMAP(sce, dimred = "MNN", n_dimred = 50, ncomponents = 2, name = "UMAP_on_MNN")

We can now plot the unintegrated and the integrated space reduced dimensions.

wrap_plots(
    plotReducedDim(sce, dimred = "PCA", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "PCA"),
    plotReducedDim(sce, dimred = "tSNE_on_PCA", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "tSNE_on_PCA"),
    plotReducedDim(sce, dimred = "UMAP_on_PCA", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_PCA"),
    plotReducedDim(sce, dimred = "MNN", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "MNN"),
    plotReducedDim(sce, dimred = "tSNE_on_MNN", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "tSNE_on_MNN"),
    plotReducedDim(sce, dimred = "UMAP_on_MNN", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_MNN"),
    ncol = 3
) + plot_layout(guides = "collect")

Let’s plot some marker genes for different cell types onto the embedding.

Markers Cell Type
CD3E T cells
CD3E CD4 CD4+ T cells
CD3E CD8A CD8+ T cells
GNLY, NKG7 NK cells
MS4A1 B cells
CD14, LYZ, CST3, MS4A7 CD14+ Monocytes
FCGR3A, LYZ, CST3, MS4A7 FCGR3A+ Monocytes
FCER1A, CST3 DCs
plotlist <- list()
for (i in c("CD3E", "CD4", "CD8A", "NKG7", "GNLY", "MS4A1", "CD14", "LYZ", "MS4A7", "FCGR3A", "CST3", "FCER1A")) {
    plotlist[[i]] <- plotReducedDim(sce, dimred = "UMAP_on_MNN", colour_by = i, by_exprs_values = "logcounts", point_size = 0.6) +
        scale_fill_gradientn(colours = colorRampPalette(c("grey90", "orange3", "firebrick", "firebrick", "red", "red"))(10)) +
        ggtitle(label = i) + theme(plot.title = element_text(size = 20))
}
wrap_plots(plotlist = plotlist, ncol = 3)

2 Harmony

An alternative method for integration is Harmony, for more details on the method, please se their paper Nat. Methods. This method runs the integration on a dimensionality reduction, in most applications the PCA. So first, we will rerun scaling and PCA with the same set of genes that were used for the CCA integration.

library(harmony)

reducedDimNames(sce)
 [1] "PCA"               "UMAP"              "tSNE_on_PCA"      
 [4] "UMAP_on_PCA"       "UMAP10_on_PCA"     "UMAP_on_ScaleData"
 [7] "KNN"               "UMAP_on_Graph"     "MNN"              
[10] "tSNE_on_MNN"       "UMAP_on_MNN"      
sce <- RunHarmony(
    sce,
    group.by.vars = "sample",
    reduction.save = "harmony",
    reduction = "PCA",
    dims.use = 1:50
)

# Here we use all PCs computed from Harmony for UMAP calculation
sce <- runUMAP(sce, dimred = "harmony", n_dimred = 50, ncomponents = 2, name = "UMAP_on_Harmony")

3 Scanorama

Important

If you are running locally using Docker and you have a Mac with ARM chip, the Scanorama reticulate module is known to crash. In this case, you might want to skip this section.

Another integration method is Scanorama (see Nat. Biotech.). This method is implemented in python, but we can run it through the Reticulate package.

We will run it with the same set of variable genes, but first we have to create a list of all the objects per sample.

hvgs <- unique(unlist(hvgs_per_dataset))

scelist <- list()
genelist <- list()
for (i in 1:length(sce.list)) {
    scelist[[i]] <- t(as.matrix(logcounts(sce.list[[i]])[hvgs, ]))
    genelist[[i]] <- hvgs
}

lapply(scelist, dim)
[[1]]
[1] 900 694

[[2]]
[1] 599 694

[[3]]
[1] 373 694

[[4]]
[1] 1101  694

[[5]]
[1] 1052  694

[[6]]
[1] 1173  694

[[7]]
[1] 1063  694

[[8]]
[1] 1170  694

Now we can run scanorama:

scanorama <- reticulate::import("scanorama")

integrated.data <- scanorama$integrate(datasets_full = scelist, genes_list = genelist)

intdimred <- do.call(rbind, integrated.data[[1]])
colnames(intdimred) <- paste0("PC_", 1:100)
rownames(intdimred) <- colnames(logcounts(sce))

# Add standard deviations in order to draw Elbow Plots 
stdevs <- apply(intdimred, MARGIN = 2, FUN = sd)
attr(intdimred, "varExplained") <- stdevs

reducedDim(sce, "Scanorama_PCA") <- intdimred

# Here we use all PCs computed from Scanorama for UMAP calculation
sce <- runUMAP(sce, dimred = "Scanorama_PCA", n_dimred = 50, ncomponents = 2, name = "UMAP_on_Scanorama")

plotReducedDim(sce, dimred = "UMAP_on_Scanorama", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_Scanorama")

4 Overview all methods

Now we will plot UMAPS with all three integration methods side by side.

p1 <- plotReducedDim(sce, dimred = "UMAP_on_PCA", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_PCA")
p2 <- plotReducedDim(sce, dimred = "UMAP_on_MNN", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_MNN")
p3 <- plotReducedDim(sce, dimred = "UMAP_on_Harmony", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_Harmony")
p4 <- plotReducedDim(sce, dimred = "UMAP_on_Scanorama", colour_by = "sample", point_size = 0.6) + ggplot2::ggtitle(label = "UMAP_on_Scanorama")

wrap_plots(p1, p2, p3, p4, nrow = 2) +
    plot_layout(guides = "collect")

Let’s save the integrated data for further analysis.

saveRDS(sce, "data/covid/results/bioc_covid_qc_dr_int.rds")

5 Session info

Click here
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] reticulate_1.30             harmony_1.2.0              
 [3] Rcpp_1.0.10                 batchelor_1.18.1           
 [5] patchwork_1.1.2             scran_1.30.0               
 [7] scater_1.30.1               ggplot2_3.4.2              
 [9] scuttle_1.12.0              SingleCellExperiment_1.24.0
[11] SummarizedExperiment_1.32.0 Biobase_2.62.0             
[13] GenomicRanges_1.54.1        GenomeInfoDb_1.38.5        
[15] IRanges_2.36.0              S4Vectors_0.40.2           
[17] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
[19] matrixStats_1.0.0          

loaded via a namespace (and not attached):
 [1] bitops_1.0-7              gridExtra_2.3            
 [3] rlang_1.1.1               magrittr_2.0.3           
 [5] RcppAnnoy_0.0.20          compiler_4.3.0           
 [7] DelayedMatrixStats_1.24.0 png_0.1-8                
 [9] vctrs_0.6.2               pkgconfig_2.0.3          
[11] crayon_1.5.2              fastmap_1.1.1            
[13] XVector_0.42.0            labeling_0.4.2           
[15] utf8_1.2.3                rmarkdown_2.22           
[17] ggbeeswarm_0.7.2          xfun_0.39                
[19] bluster_1.12.0            zlibbioc_1.48.0          
[21] beachmat_2.18.0           jsonlite_1.8.5           
[23] DelayedArray_0.28.0       BiocParallel_1.36.0      
[25] irlba_2.3.5.1             parallel_4.3.0           
[27] cluster_2.1.4             R6_2.5.1                 
[29] RColorBrewer_1.1-3        limma_3.58.1             
[31] knitr_1.43                Matrix_1.5-4             
[33] igraph_1.4.3              tidyselect_1.2.0         
[35] rstudioapi_0.14           abind_1.4-5              
[37] yaml_2.3.7                viridis_0.6.3            
[39] codetools_0.2-19          lattice_0.21-8           
[41] tibble_3.2.1              withr_2.5.0              
[43] evaluate_0.21             Rtsne_0.16               
[45] pillar_1.9.0              generics_0.1.3           
[47] rprojroot_2.0.3           RCurl_1.98-1.12          
[49] sparseMatrixStats_1.14.0  munsell_0.5.0            
[51] scales_1.2.1              RhpcBLASctl_0.23-42      
[53] glue_1.6.2                metapod_1.10.1           
[55] pheatmap_1.0.12           tools_4.3.0              
[57] BiocNeighbors_1.20.2      ScaledMatrix_1.10.0      
[59] locfit_1.5-9.8            cowplot_1.1.1            
[61] grid_4.3.0                edgeR_4.0.7              
[63] colorspace_2.1-0          GenomeInfoDbData_1.2.11  
[65] beeswarm_0.4.0            BiocSingular_1.18.0      
[67] vipor_0.4.5               cli_3.6.1                
[69] rsvd_1.0.5                fansi_1.0.4              
[71] S4Arrays_1.2.0            viridisLite_0.4.2        
[73] dplyr_1.1.2               uwot_0.1.14              
[75] ResidualMatrix_1.12.0     gtable_0.3.3             
[77] digest_0.6.31             SparseArray_1.2.3        
[79] ggrepel_0.9.3             dqrng_0.3.0              
[81] farver_2.1.1              htmlwidgets_1.6.2        
[83] htmltools_0.5.5           lifecycle_1.0.3          
[85] here_1.0.1                statmod_1.5.0