Data Integration

Seurat Toolkit

Combining and harmonizing samples or datasets from different batches such as experiments or conditions to enable meaningful cross-sample comparisons.
Authors

Åsa Björklund

Paulo Czarnewski

Susanne Reinsbach

Roy Francis

Published

09-Feb-2024

Note

Code chunks run R commands unless otherwise specified.

In this tutorial we will look at different ways of integrating multiple single cell RNA-seq datasets. We will explore a few different methods to correct for batch effects across datasets. Seurat uses the data integration method presented in Comprehensive Integration of Single Cell Data, while Scran and Scanpy use a mutual Nearest neighbour method (MNN). Below you can find a list of some methods for single data integration:

Markdown Language Library Ref
CCA R Seurat Cell
MNN R/Python Scater/Scanpy Nat. Biotech.
Conos R conos Nat. Methods
Conos R Harmony Nat. Methods
Scanorama Python scanorama Nat. Biotech.

1 Data preparation

Let’s first load necessary libraries and the data saved in the previous lab.

suppressPackageStartupMessages({
    library(Seurat)
    library(ggplot2)
    library(patchwork)
    library(reticulate)
})

# Activate scanorama Python venv
reticulate::use_virtualenv("/opt/venv/scanorama")
reticulate::py_discover_config()
python:         /opt/venv/scanorama/bin/python
libpython:      /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so
pythonhome:     /opt/venv/scanorama:/opt/venv/scanorama
version:        3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
numpy:          /opt/venv/scanorama/lib/python3.10/site-packages/numpy
numpy_version:  1.26.3

NOTE: Python version was forced by use_python function
# download pre-computed data if missing or long compute
fetch_data <- TRUE

# url for source and intermediate data
path_data <- "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq"
path_file <- "data/covid/results/seurat_covid_qc_dr.rds"
if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)
if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results/seurat_covid_qc_dr.rds"), destfile = path_file)
alldata <- readRDS(path_file)
print(names(alldata@reductions))
[1] "pca"               "umap"              "tsne"             
[4] "UMAP10_on_PCA"     "UMAP_on_ScaleData" "UMAP_on_Graph"    

We split the combined object into a list, with each dataset as an element. We perform standard preprocessing (log-normalization), and identify variable features individually for each dataset based on a variance stabilizing transformation (vst).

alldata.list <- SplitObject(alldata, split.by = "orig.ident")

for (i in 1:length(alldata.list)) {
    alldata.list[[i]] <- NormalizeData(alldata.list[[i]], verbose = FALSE)
    alldata.list[[i]] <- FindVariableFeatures(alldata.list[[i]], selection.method = "vst", nfeatures = 2000,verbose = FALSE)
}

# get the variable genes from all the datasets.
hvgs_per_dataset <- lapply(alldata.list, function(x) { x@assays$RNA@var.features })

# also add in the variable genes that was selected on the whole dataset
hvgs_per_dataset$all = VariableFeatures(alldata)

temp <- unique(unlist(hvgs_per_dataset))
overlap <- sapply( hvgs_per_dataset , function(x) { temp %in% x } )
pheatmap::pheatmap(t(overlap*1),cluster_rows = F ,
                   color = c("grey90","grey20"))

As you can see, there are a lot of genes that are variable in just one dataset. There are also some genes in the gene set that was selected using all the data that are not variable in any of the individual datasets. These are most likely genes driven by batch effects.

A better way to select features for integration is to combine the information on variable genes across the dataset. This can be done with the function SelectIntegrationFeatures that combines the ranks of the variable features in the different datasets.

hvgs_all = SelectIntegrationFeatures(alldata.list)
hvgs_per_dataset$all_ranks = hvgs_all

temp <- unique(unlist(hvgs_per_dataset))
overlap <- sapply( hvgs_per_dataset , function(x) { temp %in% x } )
pheatmap::pheatmap(t(overlap*1),cluster_rows = F ,
                   color = c("grey90","grey20"))

For all downstream integration we will use this set of genes.

2 CCA

We identify anchors using the FindIntegrationAnchors() function, which takes a list of Seurat objects as input.

alldata.anchors <- FindIntegrationAnchors(object.list = alldata.list, dims = 1:30,reduction = "cca", anchor.features = hvgs_all)

We then pass these anchors to the IntegrateData() function, which returns a Seurat object.

alldata.int <- IntegrateData(anchorset = alldata.anchors, dims = 1:30, new.assay.name = "CCA")

We can observe that a new assay slot is now created under the name CCA. If you do not specify the assay name, the default will be integrated.

names(alldata.int@assays)
[1] "RNA" "CCA"
# by default, Seurat now sets the integrated assay as the default assay, so any operation you now perform will be on the integrated data.
alldata.int@active.assay
[1] "CCA"

After running IntegrateData(), the Seurat object will contain a new Assay with the integrated (or batch-corrected) expression matrix. Note that the original (uncorrected values) are still stored in the object in the “RNA” assay, so you can switch back and forth. We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the results with UMAP and TSNE. The integrated datasets cluster by cell type, instead of by technology.

As CCA is the active.assay now, the functions will by default run on the data in that assay. But you could also specify in each of the functions to run them in a specific assay with the parameter assay = "CCA".

#Run Dimensionality reduction on integrated space
alldata.int <- ScaleData(alldata.int, verbose = FALSE)
alldata.int <- RunPCA(alldata.int, npcs = 30, verbose = FALSE)
alldata.int <- RunUMAP(alldata.int, dims = 1:30)
alldata.int <- RunTSNE(alldata.int, dims = 1:30)

We can now plot the unintegrated and the integrated space reduced dimensions.

wrap_plots(
  DimPlot(alldata, reduction = "pca", group.by = "orig.ident")+NoAxes()+ggtitle("PCA raw_data"),
  DimPlot(alldata, reduction = "tsne", group.by = "orig.ident")+NoAxes()+ggtitle("tSNE raw_data"),
  DimPlot(alldata, reduction = "umap", group.by = "orig.ident")+NoAxes()+ggtitle("UMAP raw_data"),
  
  DimPlot(alldata.int, reduction = "pca", group.by = "orig.ident")+NoAxes()+ggtitle("PCA integrated"),
  DimPlot(alldata.int, reduction = "tsne", group.by = "orig.ident")+NoAxes()+ggtitle("tSNE integrated"),
  DimPlot(alldata.int, reduction = "umap", group.by = "orig.ident")+NoAxes()+ggtitle("UMAP integrated"),
  ncol = 3
) + plot_layout(guides = "collect")

2.1 Clean memory

Again we have a lot of large objects in the memory. We have the original data alldata but also the integrated data in alldata.int. We also have the split objects in alldata.list and the anchors in alldata.anchors. In principle we only need the integrated object for now, but we will also keep the list for running Scanorama further down in the tutorial.

We also want to keep the original umap for visualization purposes, so we copy it over to the integrated object.

alldata.int@reductions$umap_raw = alldata@reductions$umap

# remove all objects that will not be used.
rm(alldata,  alldata.anchors)
# run garbage collect to free up memory
gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   3414522  182.4    4989418  266.5   4989418  266.5
Vcells 203287569 1551.0  566105528 4319.1 882313251 6731.6

Let’s plot some marker genes for different cell types onto the embedding.

Markers Cell Type
CD3E T cells
CD3E CD4 CD4+ T cells
CD3E CD8A CD8+ T cells
GNLY, NKG7 NK cells
MS4A1 B cells
CD14, LYZ, CST3, MS4A7 CD14+ Monocytes
FCGR3A, LYZ, CST3, MS4A7 FCGR3A+ Monocytes
FCER1A, CST3 DCs
myfeatures <- c("CD3E", "CD4", "CD8A", "NKG7", "GNLY", "MS4A1", "CD14", "LYZ", "MS4A7", "FCGR3A", "CST3", "FCER1A")
FeaturePlot(alldata.int, reduction = "umap", dims = 1:2, features = myfeatures, ncol = 4, order = T) + NoLegend() + NoAxes() + NoGrid()

3 Harmony

An alternative method for integration is Harmony, for more details on the method, please se their paper Nat. Methods. This method runs the integration on a dimensionality reduction, in most applications the PCA. So first, we will rerun scaling and PCA with the same set of genes that were used for the CCA integration.

OBS! Make sure to revert back to the RNA assay.

alldata.int@active.assay = "RNA"
VariableFeatures(alldata.int) = hvgs_all
alldata.int = ScaleData(alldata.int, vars.to.regress = c("percent_mito", "nFeature_RNA"))
alldata.int = RunPCA(alldata.int, reduction.name = "pca_harmony")

Now we are ready to run Harmony.

library(harmony)

alldata.int <- RunHarmony(
  alldata.int,
  group.by.vars = "orig.ident",
  reduction.use = "pca_harmony",
  dims.use = 1:50,
  assay.use = "RNA")

Harmony will create another reduction slot in your seurat object with the name harmony, so now we can use that reduction instead of PCA to run UMAP.

alldata.int <- RunUMAP(alldata.int, dims = 1:50, reduction = "harmony", reduction.name = "umap_harmony")

DimPlot(alldata.int, reduction = "umap_harmony", group.by = "orig.ident") + NoAxes() + ggtitle("Harmony UMAP")

4 Scanorama

Important

If you are running locally using Docker and you have a Mac with ARM chip, the Scanorama reticulate module is known to crash. In this case, you might want to skip this section.

Another integration method is Scanorama (see Nat. Biotech.). This method is implemented in python, but we can run it through the Reticulate package.

We will run it with the same set of variable genes, but first we have to create a list of all the objects per sample.

assaylist <- list()
genelist <- list()
for(i in 1:length(alldata.list)) {
  assaylist[[i]] <- t(as.matrix(GetAssayData(alldata.list[[i]], "data")[hvgs_all,]))
  genelist[[i]] <- hvgs_all
}

lapply(assaylist,dim)
[[1]]
[1]  873 2000

[[2]]
[1]  557 2000

[[3]]
[1]  357 2000

[[4]]
[1] 1050 2000

[[5]]
[1] 1034 2000

[[6]]
[1] 1125 2000

[[7]]
[1]  999 2000

[[8]]
[1] 1139 2000

Then, we use the scanorama function through reticulate. The integrated data is added back into the Seurat object as a new Reduction.

# Activate scanorama Python venv
scanorama <- reticulate::import("scanorama")

integrated.data <- scanorama$integrate(datasets_full = assaylist,
                                       genes_list = genelist )

# Now we create a new dim reduction object in the format that Seurat uses
intdimred <- do.call(rbind, integrated.data[[1]])
colnames(intdimred) <- paste0("PC_", 1:100)
rownames(intdimred) <- colnames(alldata.int)

# Add standard deviations in order to draw Elbow Plots in Seurat
stdevs <- apply(intdimred, MARGIN = 2, FUN = sd)

# Create a new dim red object.
alldata.int[["scanorama"]] <- CreateDimReducObject(
  embeddings = intdimred,
  stdev      = stdevs,
  key        = "PC_",
  assay      = "RNA")
#Here we use all PCs computed from Scanorama for UMAP calculation
alldata.int <- RunUMAP(alldata.int, dims = 1:100, reduction = "scanorama",reduction.name = "umap_scanorama")

DimPlot(alldata.int, reduction = "umap_scanorama", group.by = "orig.ident") + NoAxes() + ggtitle("Harmony UMAP")

5 Overview all methods

Now we will plot UMAPS with all three integration methods side by side.

p1 <- DimPlot(alldata.int, reduction = "umap_raw", group.by = "orig.ident") + ggtitle("UMAP raw_data")
p2 <- DimPlot(alldata.int, reduction = "umap", group.by = "orig.ident") + ggtitle("UMAP CCA")
p3 <- DimPlot(alldata.int, reduction = "umap_harmony", group.by = "orig.ident") + ggtitle("UMAP Harmony")
p4 <- DimPlot(alldata.int, reduction = "umap_scanorama", group.by = "orig.ident")+ggtitle("UMAP Scanorama")

wrap_plots(p1, p2, p3, p4, nrow = 2) + plot_layout(guides = "collect")

Discuss

Look at the different integration results, which one do you think looks the best? How would you motivate selecting one method over the other? How do you think you could best evaluate if the integration worked well?

Let’s save the integrated data for further analysis.

saveRDS(alldata.int,"data/covid/results/seurat_covid_qc_dr_int.rds")

6 Extra task

You have now done the Seurat integration with CCA which is quite slow. There are other options in the FindIntegrationAnchors() function. Try rerunning the integration with rpca and/or rlsi and create a new UMAP. Compare the results.

7 Session info

Click here
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] harmony_1.2.0      Rcpp_1.0.10        reticulate_1.30    patchwork_1.1.2   
[5] ggplot2_3.4.2      SeuratObject_4.1.3 Seurat_4.3.0      

loaded via a namespace (and not attached):
  [1] deldir_1.0-9           pbapply_1.7-0          gridExtra_2.3         
  [4] rlang_1.1.1            magrittr_2.0.3         RcppAnnoy_0.0.20      
  [7] spatstat.geom_3.2-1    matrixStats_1.0.0      ggridges_0.5.4        
 [10] compiler_4.3.0         png_0.1-8              vctrs_0.6.2           
 [13] reshape2_1.4.4         stringr_1.5.0          pkgconfig_2.0.3       
 [16] fastmap_1.1.1          ellipsis_0.3.2         labeling_0.4.2        
 [19] utf8_1.2.3             promises_1.2.0.1       rmarkdown_2.22        
 [22] purrr_1.0.1            xfun_0.39              jsonlite_1.8.5        
 [25] goftest_1.2-3          later_1.3.1            spatstat.utils_3.0-3  
 [28] irlba_2.3.5.1          parallel_4.3.0         cluster_2.1.4         
 [31] R6_2.5.1               ica_1.0-3              stringi_1.7.12        
 [34] RColorBrewer_1.1-3     spatstat.data_3.0-1    parallelly_1.36.0     
 [37] lmtest_0.9-40          scattermore_1.2        knitr_1.43            
 [40] tensor_1.5             future.apply_1.11.0    zoo_1.8-12            
 [43] sctransform_0.3.5      httpuv_1.6.11          Matrix_1.5-4          
 [46] splines_4.3.0          igraph_1.4.3           tidyselect_1.2.0      
 [49] abind_1.4-5            rstudioapi_0.14        yaml_2.3.7            
 [52] spatstat.random_3.1-5  codetools_0.2-19       miniUI_0.1.1.1        
 [55] spatstat.explore_3.2-1 listenv_0.9.0          lattice_0.21-8        
 [58] tibble_3.2.1           plyr_1.8.8             withr_2.5.0           
 [61] shiny_1.7.4            ROCR_1.0-11            evaluate_0.21         
 [64] Rtsne_0.16             future_1.32.0          survival_3.5-5        
 [67] polyclip_1.10-4        fitdistrplus_1.1-11    pillar_1.9.0          
 [70] KernSmooth_2.23-20     plotly_4.10.2          generics_0.1.3        
 [73] rprojroot_2.0.3        sp_1.6-1               munsell_0.5.0         
 [76] scales_1.2.1           globals_0.16.2         xtable_1.8-4          
 [79] RhpcBLASctl_0.23-42    glue_1.6.2             pheatmap_1.0.12       
 [82] lazyeval_0.2.2         tools_4.3.0            data.table_1.14.8     
 [85] RANN_2.6.1             leiden_0.4.3           cowplot_1.1.1         
 [88] grid_4.3.0             tidyr_1.3.0            colorspace_2.1-0      
 [91] nlme_3.1-162           cli_3.6.1              spatstat.sparse_3.0-1 
 [94] fansi_1.0.4            viridisLite_0.4.2      dplyr_1.1.2           
 [97] uwot_0.1.14            gtable_0.3.3           digest_0.6.31         
[100] progressr_0.13.0       ggrepel_0.9.3          farver_2.1.1          
[103] htmlwidgets_1.6.2      htmltools_0.5.5        lifecycle_1.0.3       
[106] here_1.0.1             httr_1.4.6             mime_0.12             
[109] MASS_7.3-58.4