OMICS Integration and Systems Biology Course: Single Cell Omics Integration

Seurat Single Cell Omics Integration
Seurat for Integrating scRNAseq and scATACseq from PBMC Cells
WNN Integration of CITEseq: scRNAseq and scProteomics (ADT) on PBMC Cells
Session Info

Seurat Single Cell Omics Integration

Seurat is a tool developed by the lab of Rahul Satija to facilitate analysis of Single Cell Omics (scOmics) data. Started as a pipeline tool, i.e. a collection of all known pre-processing and scOmics analysis steps, Seurat developed a few original methods on normalization and batch-effect correction. With the rise of Human Cell Atlas (HCA) consortium, Seurat contributed to ideas on data harmonization across multiple labs and technologies. The ambition of HCA was to develop an Atlas of all human cells from all human tissues, that could serve as a refernce for the future research, i.e. human cells from a particular experiment could have been quickly assigned to a particular cell type without performing the regular single cell analysis steps.

Seurat was for a very long time de-facto a standard tool for single cell data analysis in 2016-2018, especially in North America, before first two Seurat articles got published in 2018 and 2019. Nowadays, there are alternative single cell analysis workflows, such as SCRAN (lab of John Marioni at EBI) and SCANPY (lab of Fabian Thejs in Munich), that can compete with Seurat. In the first paper of A. Butler et al., Integrating single cell transcriptomic data across different conditions, technilogies and species in Nature Biotechnology, 2018, Seurat suggested an interesting modification of the Canonical Correlation Analysis (CCA), that belongs to the same family of algorithms as PLS, OPLS, JIVE and DISCO. The modification was to include an alignment of canonical correlation vectors (PLS-components) with the Dynamic Time Warping (DTW), which typically serves as a trajectory similarity measure in Time Series data analysis.

The idea of data integration by Seurat is that CCA delivers components representing linear combinations of features across data sets that are maximally correlated (capture correlation structures across data sets), but not necessarily aligned. Next, Dynamic Time Warping (DTW) is used to locally compress or stretch the vectors during alignment to correct for changes in population density. As a result, the data sets are represented in a single, integrated low-dimensional space.

The CCA + DTW strategy was successfully used for data harmonization across multiple conditions, species, labs and technologies. However, those examples were a sort of single cell oriented batch-effect correction that has been known as a big problem in data analysis for years. In other words, despite CCA + DTW offer an impressive integrative framework, possibly adjusted for single cell data analysis, this methodology is for integration across samples and not straghtforward to extend to the integration across Omics, which is the main challenge and focus of multi-Omics data integration.

Later, Seurat extended the CCA approach for integrating across Omics in the work Stuart et al. 2019, where they used an interesting "anchor" idea borrowed from the Mutual Nearest Neighbors (MNN) algorithm of Haghverdi et al. 2018. The idea is based on identifying cells across two Omics that are most similar to each other (anchors), i.e. most likely belong to the same cell population, and align all the rest cells accordingly. This was probably the first true integration across Omics for single cell area.

Below, we will demonstrate how to use the "anchors" approach for integrating scRNAseq (9432 cells) and scATACseq (8728 cells) data from PBMC cells. Please note that the two modalities were not measures on the physically same cells but on cells from the same tissue. This was because there was no high-throughput technology available at that time that could produce multiple modalities from the same biological cells. Nowadays, with the advent of 10X Multiome technology, the "anchor" method was replaced by the Weighted Nearest Neighbors (WNN) method.

Seurat for Integrating scRNAseq and scATACseq from PBMC Cells

First, we load in the provided matrix of ATACseq peaks and collapse the peak matrix to a "gene activity matrix". Here, we make the simplifying assumption that a gene's activity can be quantified by simply summing all counts within the gene body + 2kb upstream. Next we build the Seurat object and store the original peaks as "ATAC" assay. As a QC step, we also filter out all cells here with fewer than 10K total counts in the scATAC-seq data.

library("Seurat")

## Attaching SeuratObject

library("ggplot2")

## Want to understand how all the pieces fit together? Read R for Data
## Science: https://r4ds.had.co.nz/

library("Signac")


#### Data download #################################################################################
# You need to download the following file and include it in the same directory as this notebook
#
# https://drive.google.com/uc?export=download&id=1hBeh2L5PC-T87YObCmJv4Qcm59IqkkOf
#
####################################################################################################

#After downloading the file above, download the following:
#download.file('https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_singlecell.csv', method = 'wget', destfile = 'atac_v1_pbmc_10k_singlecell.csv')
#download.file('https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5', method = 'wget', destfile = 'atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5')
#download.file('https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1', method = 'wget', destfile = 'pbmc_10k_v3.rds')

#Gene activity quantification (ATACseq peaks)
peaks <- Read10X_h5("atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
dim(peaks)

## [1] 89796  8728

peaks[1:5,1:5]

## 5 x 5 sparse Matrix of class "dgCMatrix"
##                    AAACGAAAGAGCGAAA-1 AAACGAAAGAGTTTGA-1 AAACGAAAGCGAGCTA-1
## chr1:565107-565550                  .                  .                  .
## chr1:569174-569639                  .                  .                  .
## chr1:713460-714823                  .                  .                  2
## chr1:752422-753038                  .                  .                  .
## chr1:762106-763359                  .                  .                  4
##                    AAACGAAAGGCTTCGC-1 AAACGAAAGTGCTGAG-1
## chr1:565107-565550                  .                  .
## chr1:569174-569639                  .                  .
## chr1:713460-714823                  8                  .
## chr1:752422-753038                  .                  .
## chr1:762106-763359                  2                  .

#Here we summed up all peaks overlapping a gene and computed of ATACseq peaks per gene (here we just load an already pre-computed matrix for the sake of time)
activity.matrix <- read.delim("scatacseq_activity_matrix.txt", header = TRUE, row.names = 1, sep="\t")
colnames(activity.matrix)<-gsub("\\.","-",colnames(activity.matrix))
dim(activity.matrix)

## [1] 18969  8728

activity.matrix[1:5,1:5]

##                AAACGAAAGAGCGAAA-1 AAACGAAAGAGTTTGA-1 AAACGAAAGCGAGCTA-1
## RP5-857K21.4                    0                  0                  0
## RP11-206L10.2                   0                  0                  2
## RP11-206L10.10                  0                  0                  0
## LINC01128                       0                  0                  4
## FAM41C                          0                  0                  0
##                AAACGAAAGGCTTCGC-1 AAACGAAAGTGCTGAG-1
## RP5-857K21.4                    0                  0
## RP11-206L10.2                   8                  0
## RP11-206L10.10                  0                  0
## LINC01128                       3                  0
## FAM41C                          4                  0

#Seurat object setup
pbmc.atac <- CreateSeuratObject(counts = peaks, assay = "ATAC", project = "10x_ATAC")
pbmc.atac@assays$ATAC@counts[1:5,1:5]

## 5 x 5 sparse Matrix of class "dgCMatrix"
##                    AAACGAAAGAGCGAAA-1 AAACGAAAGAGTTTGA-1 AAACGAAAGCGAGCTA-1
## chr1:565107-565550                  .                  .                  .
## chr1:569174-569639                  .                  .                  .
## chr1:713460-714823                  .                  .                  2
## chr1:752422-753038                  .                  .                  .
## chr1:762106-763359                  .                  .                  4
##                    AAACGAAAGGCTTCGC-1 AAACGAAAGTGCTGAG-1
## chr1:565107-565550                  .                  .
## chr1:569174-569639                  .                  .
## chr1:713460-714823                  8                  .
## chr1:752422-753038                  .                  .
## chr1:762106-763359                  2                  .

pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
pbmc.atac

## An object of class Seurat 
## 108765 features across 8728 samples within 2 assays 
## Active assay: ATAC (89796 features, 0 variable features)
##  1 other assay present: ACTIVITY

meta<-read.table("atac_v1_pbmc_10k_singlecell.csv",sep=",",header=TRUE,row.names=1,stringsAsFactors=FALSE)
head(meta)

##                      total duplicate chimeric unmapped lowmapq mitochondrial
## NO_BARCODE         8335327   3922818    71559   746476  634014         22972
## AAACGAAAGAAACGCC-1       3         1        0        1       0             0
## AAACGAAAGAAAGCAG-1      14         1        0        4       2             0
## AAACGAAAGAAAGGGT-1       7         1        0        1       0             0
## AAACGAAAGAAATACC-1    9880      5380       79      106    1120             6
## AAACGAAAGAAATCTG-1       2         0        0        0       0             0
##                    passed_filters cell_id is__cell_barcode TSS_fragments
## NO_BARCODE                2937488    None                0             0
## AAACGAAAGAAACGCC-1              1    None                0             1
## AAACGAAAGAAAGCAG-1              7    None                0             0
## AAACGAAAGAAAGGGT-1              5    None                0             2
## AAACGAAAGAAATACC-1           3189    None                0           386
## AAACGAAAGAAATCTG-1              2    None                0             1
##                    DNase_sensitive_region_fragments enhancer_region_fragments
## NO_BARCODE                                        0                         0
## AAACGAAAGAAACGCC-1                                1                         1
## AAACGAAAGAAAGCAG-1                                2                         0
## AAACGAAAGAAAGGGT-1                                2                         0
## AAACGAAAGAAATACC-1                             1623                       297
## AAACGAAAGAAATCTG-1                                2                         1
##                    promoter_region_fragments on_target_fragments
## NO_BARCODE                                 0                   0
## AAACGAAAGAAACGCC-1                         0                   1
## AAACGAAAGAAAGCAG-1                         0                   2
## AAACGAAAGAAAGGGT-1                         1                   2
## AAACGAAAGAAATACC-1                       107                1758
## AAACGAAAGAAATCTG-1                         1                   2
##                    blacklist_region_fragments peak_region_fragments
## NO_BARCODE                                  0                     0
## AAACGAAAGAAACGCC-1                          0                     1
## AAACGAAAGAAAGCAG-1                          0                     0
## AAACGAAAGAAAGGGT-1                          0                     1
## AAACGAAAGAAATACC-1                         16                   262
## AAACGAAAGAAATCTG-1                          0                     2

meta <- meta[colnames(pbmc.atac), ]
pbmc.atac <- AddMetaData(pbmc.atac, metadata = meta)
pbmc.atac <- subset(pbmc.atac, subset = nCount_ATAC > 10000)
pbmc.atac$tech <- "atac"

Next we filter the individual data sets and visualize the individual scRNAseq and scATACseq data sets. Here we perform Latent Semantic Indexing (LSI)) to reduce the dimensionality of the scATAC-seq data down to 30 dimensions. This procedure learns an 'internal' structure for the scRNA-seq data, and is important when determining the appropriate weights for the anchors when transferring information. We utilize Latent Semantic Indexing (LSI) to learn the structure of ATAC-seq data, as proposed in Cusanovich et al, Science 2015. LSI is implemented here by performing computing the term frequency-inverse document frequency (TF-IDF) followed by SVD. We use all peaks that have at least 1000 reads across all cells. We also include a pre-processed scRNAseq PBMC cells data set that was used in many other Seurat vignettes as a benchmark data set. We exclude the first dimension as this is typically correlated with sequencing depth.

#Preprocessing scATACseq per gene (activity martix), we will use it later in order to find anchors between cells in the scATAC-seq dataset and the scRNA-seq dataset.
DefaultAssay(pbmc.atac) <- "ACTIVITY"
pbmc.atac <- FindVariableFeatures(pbmc.atac)
pbmc.atac <- NormalizeData(pbmc.atac)
pbmc.atac <- ScaleData(pbmc.atac)

## Centering and scaling data matrix

#Preprocessing raw peaks
DefaultAssay(pbmc.atac) <- "ATAC"
VariableFeatures(pbmc.atac) <- names(which(Matrix::rowSums(pbmc.atac) > 1000))
pbmc.atac <- RunTFIDF(pbmc.atac)

## Performing TF-IDF normalization

pbmc.atac <- FindTopFeatures(pbmc.atac, min.cutoff = "q0")
pbmc.atac <- RunSVD(pbmc.atac)

## Running SVD

## Scaling cell embeddings

pbmc.atac <- RunTSNE(pbmc.atac, reduction = "lsi", dims = 2:30)

#Reading scRNAseq data set
pbmc.rna <- readRDS("pbmc_10k_v3.rds")
dim(pbmc.rna@assays$RNA@counts)

## [1] 19089  9432

pbmc.rna@assays$RNA@counts[1:5,1:5]

## 5 x 5 sparse Matrix of class "dgCMatrix"
##            rna_AAACCCAAGCGCCCAT-1 rna_AAACCCACAGAGTTGG-1 rna_AAACCCACAGGTATGG-1
## AL627309.1                      .                      .                      .
## AL669831.5                      .                      .                      .
## FAM87B                          .                      .                      .
## LINC00115                       .                      .                      .
## FAM41C                          .                      .                      .
##            rna_AAACCCACATAGTCAC-1 rna_AAACCCACATCCAATG-1
## AL627309.1                      .                      .
## AL669831.5                      .                      .
## FAM87B                          .                      .
## LINC00115                       .                      .
## FAM41C                          .                      .

pbmc.rna$tech <- "rna"

#Plotting scATACseq and scRNAseq next to each other
p1 <- DimPlot(pbmc.atac, reduction = "tsne") + NoLegend() + ggtitle("scATAC-seq") #here we do not have cell annotation, we will predict it using scRNAseq
p2 <- DimPlot(pbmc.rna, reduction = "tsne", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("scRNA-seq")
p1 + p2

Now, we can identify anchors between the scATAC-seq dataset and the scRNA-seq dataset and use these anchors to transfer the celltype labels we learned from the 10K scRNA-seq data to the scATAC-seq cells, i.e. we will apply the transferring anchors Seurat algorithm and visualize the individual scRNAseq and scATACseq data sets after they have been harmonized.

#Transfer anchors
gc()

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   6285641  335.7   11366262  607.1   9045316  483.1
## Vcells 517957716 3951.8  894531446 6824.8 888189756 6776.4

transfer.anchors <- FindTransferAnchors(reference = pbmc.rna, query = pbmc.atac, features = VariableFeatures(object = pbmc.rna), 
                                        reference.assay = "RNA", query.assay = "ACTIVITY", reduction = "cca")

## Running CCA

## Merging objects

## Finding neighborhoods

## Finding anchors

##  Found 16387 anchors

## Filtering anchors

##  Retained 3939 anchors

celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = pbmc.rna$celltype, weight.reduction = pbmc.atac[["lsi"]], dims = 2:30)

## Finding integration vectors

## Finding integration vector weights

## Predicting cell labels

pbmc.atac <- AddMetaData(pbmc.atac, metadata = celltype.predictions)

#We can then examine the distribution of prediction scores 
#and optionally filter out those cells with low scores. 
#Here, we find that over 95% of the cells receive a score of 0.5 or greater.
hist(pbmc.atac$prediction.score.max)
abline(v = 0.5, col = "red")

table(pbmc.atac$prediction.score.max > 0.5)

## 
## FALSE  TRUE 
##   359  5521

#Visualizing individual scRNAseq and scATACseq after their alignment
pbmc.atac.filtered <- subset(pbmc.atac, subset = prediction.score.max > 0.5)
pbmc.atac.filtered$predicted.id <- factor(pbmc.atac.filtered$predicted.id, levels = levels(pbmc.rna))  # to make the colors match (although it does not really work, seems to be broken in Seurat package)
p1 <- DimPlot(pbmc.atac.filtered, reduction = "tsne", group.by = "predicted.id", label = TRUE, repel = TRUE) + ggtitle("scATAC-seq cells") + NoLegend() + scale_colour_hue(drop = FALSE)
p2 <- DimPlot(pbmc.rna, reduction = "tsne", group.by = "celltype", label = TRUE, repel = TRUE) + ggtitle("scRNA-seq cells") + NoLegend()
p1 + p2

Finally we will perform co-embedding and tSNE visualization of the scRNAseq and scATACseq Omics in their common space after the integration has been done. They demonstrate very encouraging overlapping. Here, we use the same anchors used earlier to transfer cell type labels to impute RNA-seq values for the scATAC-seq cells. We then merge the measured and imputed scRNA-seq data and run a standard tSNE analysis to visualize all the cells together. In order to perform co-embedding, we first ‘impute’ RNA expression into the scATAC-seq cells based on the previously computed anchors, and then merge the datasets.

#Co-embedding
# note that we restrict the imputation to variable genes from scRNA-seq, but could impute the full transcriptome if we wanted to
gc()

##             used   (Mb) gc trigger   (Mb)   max used   (Mb)
## Ncells   6379942  340.8   11366262  607.1   11366262  607.1
## Vcells 748358495 5709.6 1073517735 8190.3 1072504236 8182.6

genes.use <- VariableFeatures(pbmc.rna)
refdata <- GetAssayData(pbmc.rna, assay = "RNA", slot = "data")[genes.use, ]

# refdata (input) contains a scRNA-seq expression matrix for the scRNA-seq cells.  imputation (output) will contain an imputed scRNA-seq matrix for each of the ATAC cells
imputation <- TransferData(anchorset = transfer.anchors, refdata = refdata, weight.reduction = pbmc.atac[["lsi"]], dims = 2:30)

## Finding integration vectors

## Finding integration vector weights

## Transfering 3000 features onto reference data

# this line adds the imputed data matrix to the pbmc.atac object
pbmc.atac[["RNA"]] <- imputation
coembed <- merge(x = pbmc.rna, y = pbmc.atac)

# Finally, we run PCA and UMAP on this combined object, to visualize the co-embedding of both datasets
coembed <- ScaleData(coembed, features = genes.use, do.scale = FALSE)

## Centering data matrix

gc()

##              used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    6380958  340.8   11366262   607.1   11366262   607.1
## Vcells 1119206220 8538.9 1855329845 14155.1 1514408860 11554.1

coembed <- RunPCA(coembed, features = genes.use, verbose = FALSE)
gc()

##              used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    6382390  340.9   11366262   607.1   11366262   607.1
## Vcells 1120143650 8546.1 1855329845 14155.1 1514408860 11554.1

coembed <- RunTSNE(coembed, dims = 1:30)
coembed$celltype <- ifelse(!is.na(coembed$celltype), coembed$celltype, coembed$predicted.id)

p1 <- DimPlot(coembed, group.by = "tech", reduction = "tsne")  + ggtitle("scRNAseq + scATACseq")
p2 <- DimPlot(coembed, group.by="celltype", label=TRUE, repel=TRUE, reduction="tsne") + NoLegend() + ggtitle("Annotated consensus")
p1 + p2

We conclude that scATACseq and scRNAseq demonstrate a very nice overlapping implying that the function of the cell clusters can be explain from both transcriptomic and epigenetic point of view, i.e. those two layers of information are complementary and one can be used instead of the other one to enhance the signal in case of experiemntal technical failure for any group of cells.

WNN Integration of CITEseq: scRNAseq and scProteomics (ADT) on PBMC Cells

Weighted Nearest Neighbor (WNN) approach for single cell data integration across multiple modalities represents a recent development on the Seurat workflow published in Hao et al., Cell in 2021. The WNN method is based on constructing KNN graphs in individual modalities and intersecting the graphs to get a consensus across modalities graph and a UMAP low dimensional representation of the graph.

Here we are going to utilize WNN for the CITEseq technology on PBMC cells that we have previously used to demonstrate principles of unsupervised OMICs integration through UMAP and Autoencoder. We use the CITE-seq dataset from (Stuart and Butler et al., Cell 2019), which consists of 30672 scRNAseq profiles measured alongside a panel of 25 antibodies (we refer to it as ADT or scProteomics) from bone marrow. The workflow consists of three steps:

Independent preprocessing and dimensional reduction of each modality individually
Learning cell-specific modality ‘weights’, and constructing a WNN graph that integrates the modalities
Downstream analysis (i.e. visualization, clustering, etc.) of the WNN graph

Here we will start with loading the CITEseq data into memory:

library("Seurat")
library("SeuratData")

## Registered S3 method overwritten by 'cli':
##   method     from         
##   print.boxx spatstat.geom

## ── Installed datasets ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── SeuratData v0.2.1 ──

## ✔ bmcite 0.3.0

## ──────────────────────────────────────────────────────────────────────────────────────────────────── Key ───────────────────────────────────────────────────────────────────────────────────────────────────

## ✔ Dataset loaded successfully
## ❯ Dataset built with a newer version of Seurat than installed
## ❓ Unknown version of Seurat installed

library("cowplot")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Load the CITEseq data data
gc()

##              used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    6420253  342.9   11366262   607.1   11366262   607.1
## Vcells 1120357930 8547.7 1855329845 14155.1 1514408860 11554.1

InstallData("bmcite")
bm <- LoadData(ds = "bmcite")

The gene expression and protein abundance matrices can be accesed as follows:

bm@assays$RNA[1:5,1:5]

## 5 x 5 sparse Matrix of class "dgCMatrix"
##               a_AAACCTGAGCTTATCG-1 a_AAACCTGAGGTGGGTT-1 a_AAACCTGAGTACATGA-1
## FO538757.2                       .                    .                    .
## AP006222.2                       .                    .                    .
## RP4-669L17.10                    .                    .                    .
## RP11-206L10.9                    .                    .                    .
## LINC00115                        .                    .                    .
##               a_AAACCTGCAAACCTAC-1 a_AAACCTGCAAGGTGTG-1
## FO538757.2                       .                    .
## AP006222.2                       .                    .
## RP4-669L17.10                    .                    .
## RP11-206L10.9                    .                    .
## LINC00115                        .                    .

dim(bm@assays$RNA)

## [1] 17009 30672

bm@assays$ADT[1:5,1:5]

##             a_AAACCTGAGCTTATCG-1 a_AAACCTGAGGTGGGTT-1 a_AAACCTGAGTACATGA-1
## CD11a                  1.4383519            2.9486691           2.14050204
## CD11c                  0.9122864            0.3372199           0.48569784
## CD123                  0.3428960            0.1545246           0.06065369
## CD127-IL7Ra            0.4601536            1.9809534           1.30530876
## CD14                   0.3634188            0.2102157           0.22324237
##             a_AAACCTGCAAACCTAC-1 a_AAACCTGCAAGGTGTG-1
## CD11a                  2.8203458            3.8188377
## CD11c                  0.7069059            2.8303333
## CD123                  0.1154213            0.5005284
## CD127-IL7Ra            1.3386684            0.2812587
## CD14                   0.2188883            2.1864307

dim(bm@assays$ADT)

## [1]    25 30672

Next we will perform data pre-processing, normalization and dimension reduction on each modality independently. For scRNAseq we will use library size normalization followed by log-transform, while total sum scaling (TSS) normalization across cells followed by the CLR normalization is recommended by Seurat for ADT data (as they can be viewed as compositional data after TSS procedure):

gc()

##              used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    6457088  344.9   11366262   607.1   11366262   607.1
## Vcells 1215917836 9276.8 1855329845 14155.1 1514408860 11554.1

DefaultAssay(bm) <- 'RNA'
bm <- NormalizeData(bm) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()

## Centering and scaling data matrix

## PC_ 1 
## Positive:  TRBC1, LAT, CD8B, CCL5, KLRB1, IGKC, S100A12, GZMA, S100A8, S100A9 
##     MS4A1, S100B, GNLY, CST7, TYROBP, KLRD1, RP11-291B21.2, NKG7, VCAN, CD14 
##     IGLC2, CCL4, AC092580.4, FCN1, IGLC3, PRF1, RBP7, SERPINA1, DUSP2, JUN 
## Negative:  KIAA0101, TYMS, KLF1, KCNH2, FAM178B, APOC1, CNRIP1, CENPU, GATA1, BIRC5 
##     CENPF, EPCAM, CKS2, RP11-620J15.3, TUBA1B, TFR2, CA1, HMGA1, STMN1, HIST1H4C 
##     CDT1, AHSP, TOP2A, TK1, GFI1B, TUBB, MKI67, NME4, SMIM1, TMEM56 
## PC_ 2 
## Positive:  RPL3, RPS3, RPS18, RPS5, RPS4X, RPSA, RPS12, RPS23, RPS2, EEF1B2 
##     RPL4, LDHB, NPM1, RPS17, RPLP0, TRBC1, LAT, RPL7A, GYPC, HSPA8 
##     CD8B, KLRB1, CCL5, HNRNPA1, PEBP1, RPL37A, MYC, NUCB2, SOD1, CD79A 
## Negative:  LYZ, FCN1, CST3, TYROBP, S100A9, LST1, S100A8, CSTA, MNDA, VCAN 
##     LGALS1, AIF1, S100A12, CFD, SERPINA1, FCER1G, MS4A6A, FOS, S100A6, CD14 
##     LGALS2, FTH1, GAPDH, ANXA2, CD36, CPVL, RBP7, HLA-DRA, LINC01272, H3F3A 
## PC_ 3 
## Positive:  CD74, HLA-DRA, HLA-DPB1, HLA-DPA1, IGLL1, ITM2C, SMIM24, RPS23, FABP5, HLA-DRB5 
##     TCF4, RPS18, PLD4, HLA-DQA1, RPS4X, SOX4, SPINK2, RPLP1, KIAA0125, HLA-DQB1 
##     HLA-DRB1, RPS2, PRSS57, IRF8, CDCA7, C1QTNF4, STMN1, BCL11A, NREP, RPS24 
## Negative:  GYPA, ALAS2, SLC4A1, AHSP, HBM, GYPB, CA2, HBD, RHAG, CA1 
##     HBA1, TMEM56, SELENBP1, HBB, HBA2, MYL4, HEMGN, HMBS, RHCE, HBQ1 
##     DMTN, RFESD, SPTA1, FECH, KLF1, ANK1, CTSE, SMIM1, TSPO2, SLC25A37 
## PC_ 4 
## Positive:  CD79A, MS4A1, CD79B, CD74, HLA-DRA, HLA-DPB1, HLA-DPA1, IGHD, HLA-DQA1, HLA-DQB1 
##     BANK1, VPREB3, SPIB, TCL1A, HLA-DRB5, FAM129C, LINC00926, IGHM, IGKC, HLA-DRB1 
##     TNFRSF13C, JCHAIN, TSPAN13, IRF8, FCER2, CD24, BLK, CD22, GNG7, FCRLA 
## Negative:  NKG7, CST7, PRSS57, GNLY, GZMA, KLRB1, CCL5, LDHB, KLRD1, CMC1 
##     CYTL1, PRF1, EGFL7, LYAR, TRBC1, KLRF1, GAPDH, S100A6, NGFRAP1, LAT 
##     RPS3, S100B, FGFBP2, GZMH, GATA2, RPS24, NPW, CCL4, SMIM24, CDK6 
## PC_ 5 
## Positive:  RPS2, RPS18, RPS12, RPS23, RPL37A, RPLP1, RPL4, RPS5, RPS4X, CNRIP1 
##     RPS17, APOC1, NPM1, MYC, EEF1B2, FAM178B, EPCAM, KCNH2, KLF1, RPLP0 
##     RPL7A, CYTL1, MS4A1, LDHB, SMIM10, GATA1, APOE, RPS3, TFR2, RPL3 
## Negative:  NKG7, GZMB, CST7, GNLY, GZMA, KLRD1, CMC1, KLRF1, PRF1, CCL4 
##     CCL5, GZMH, CLIC3, FGFBP2, SPON2, C12orf75, TRDC, KLRB1, HOPX, XCL2 
##     CD160, IL2RB, PLAC8, FCGR3A, TRGC2, KLRG1, DUSP2, RHOC, PLEK, LAIR2

gc()

##              used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    6457835  344.9   11366262   607.1   11366262   607.1
## Vcells 1278963131 9757.8 1855329845 14155.1 1855304065 14154.9

DefaultAssay(bm) <- 'ADT'
# we will use all ADT features for dimensional reduction, we set a dimensional reduction name to avoid overwriting
VariableFeatures(bm) <- rownames(bm[["ADT"]])
bm <- NormalizeData(bm, normalization.method = 'CLR', margin = 2) %>% ScaleData() %>% RunPCA(reduction.name = 'apca')

## Normalizing across cells

## Centering and scaling data matrix

## PC_ 1 
## Positive:  CD3, CD28, CD27, CD127-IL7Ra, CD278-ICOS, CD4, CD8a, CD161, CD25, CD45RO 
##     CD45RA, CD197-CCR7 
## Negative:  HLA.DR, CD11c, CD14, CD38, CD69, CD123, CD11a, CD34, CD19, CD79b 
##     CD16, CD56 
## PC_ 2 
## Positive:  CD45RO, CD11a, CD14, CD4, CD11c, CD28, CD3, CD69, CD278-ICOS, CD127-IL7Ra 
##     CD38, CD161 
## Negative:  CD45RA, CD19, CD79b, CD197-CCR7, CD8a, CD57, CD56, CD16, CD27, CD25 
##     HLA.DR, CD34 
## PC_ 3 
## Positive:  CD16, CD56, CD8a, CD11a, CD57, CD161, CD45RA, CD38, CD3, CD127-IL7Ra 
##     CD27, CD11c 
## Negative:  CD19, CD79b, CD4, CD25, HLA.DR, CD197-CCR7, CD69, CD34, CD28, CD45RO 
##     CD14, CD278-ICOS 
## PC_ 4 
## Positive:  CD161, CD25, CD45RO, CD56, CD16, CD4, CD79b, CD19, CD28, CD57 
##     CD127-IL7Ra, HLA.DR 
## Negative:  CD8a, CD27, CD14, CD45RA, CD3, CD11c, CD69, CD38, CD197-CCR7, CD11a 
##     CD34, CD123 
## PC_ 5 
## Positive:  CD4, CD38, CD123, CD16, CD34, CD45RA, CD56, CD278-ICOS, CD28, CD197-CCR7 
##     CD27, CD3 
## Negative:  CD45RO, CD8a, CD69, CD161, CD79b, CD19, CD25, CD57, CD14, CD11a 
##     CD127-IL7Ra, HLA.DR

Now we are going to apply the Mutual Nearest Neighbour (MNN) algorithm for intersecting the two constructed graphs for the individual Omics. For each cell, we calculate its closest neighbors in the dataset based on a weighted combination of RNA and protein similarities. We specify the dimensionality of each modality, which is similar to specifying the number of PCs to include in scRNA-seq clustering.

# Identify multimodal neighbors. These will be stored in the neighbors slot, and can be accessed using bm[['weighted.nn']]
# The WNN graph can be accessed at bm[["wknn"]], and the SNN graph used for clustering at bm[["wsnn"]]
# Cell-specific modality weights can be accessed at bm$RNA.weight

#bm.weight <- FindModalityWeights(object = bm, reduction.list = list("pca", "apca"),dims.list = list(1:30, 1:18))
#bm.weight@first.modality.weight

gc()

##              used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    6461046  345.1   11366262   607.1   11366262   607.1
## Vcells 1280536018 9769.8 1855329845 14155.1 1855304065 14154.9

bm <- FindMultiModalNeighbors(bm, reduction.list = list("pca", "apca"), dims.list = list(1:30, 1:18), modality.weight.name = "RNA.weight")

## Calculating cell-specific modality weights

## Finding 20 nearest neighbors for each modality.

## Calculating kernel bandwidths

## Finding multimodal neighbors

## Constructing multimodal KNN graph

## Constructing multimodal SNN graph

Now we will run UMAP on the two independent modalities as well as on the WNN graph constructed after overlapping both Omics, i.e. on the data based on a weighted combination of RNA and protein data. We can also perform graph-based clustering and visualize these results on the UMAP, alongside a set of cell annotations.

gc()

##              used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    6470891  345.6   11366262   607.1   11366262   607.1
## Vcells 1280557144 9769.9 2226475814 16986.7 1855304065 14154.9

#Individual Omics
bm <- RunUMAP(bm, reduction = 'pca', dims = 1:30, assay = 'RNA', reduction.name = 'rna.umap', reduction.key = 'rnaUMAP_')

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 14:07:21 UMAP embedding parameters a = 0.9922 b = 1.112

## 14:07:21 Read 30672 rows and found 30 numeric columns

## 14:07:21 Using Annoy for neighbor search, n_neighbors = 30

## 14:07:21 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 14:07:25 Writing NN index file to temp file /tmp/Rtmpp13JXB/filea177491292d
## 14:07:25 Searching Annoy index using 1 thread, search_k = 3000
## 14:07:37 Annoy recall = 100%
## 14:07:37 Commencing smooth kNN distance calibration using 1 thread
## 14:07:39 Initializing from normalized Laplacian + noise
## 14:07:42 Commencing optimization for 200 epochs, with 1427826 positive edges
## 14:07:58 Optimization finished

bm <- RunUMAP(bm, reduction = 'apca', dims = 1:18, assay = 'ADT', reduction.name = 'adt.umap', reduction.key = 'adtUMAP_')

## 14:07:58 UMAP embedding parameters a = 0.9922 b = 1.112
## 14:07:58 Read 30672 rows and found 18 numeric columns
## 14:07:58 Using Annoy for neighbor search, n_neighbors = 30
## 14:07:58 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 14:08:01 Writing NN index file to temp file /tmp/Rtmpp13JXB/filea176c5522da
## 14:08:01 Searching Annoy index using 1 thread, search_k = 3000
## 14:08:13 Annoy recall = 100%
## 14:08:13 Commencing smooth kNN distance calibration using 1 thread
## 14:08:15 Initializing from normalized Laplacian + noise
## 14:08:18 Commencing optimization for 200 epochs, with 1341598 positive edges
## 14:08:32 Optimization finished

#Consensus graph
bm <- RunUMAP(bm, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")

## 14:08:32 UMAP embedding parameters a = 0.9922 b = 1.112
## 14:08:32 Commencing smooth kNN distance calibration using 1 thread
## 14:08:34 Initializing from normalized Laplacian + noise
## 14:08:36 Commencing optimization for 200 epochs, with 983534 positive edges
## 14:08:51 Optimization finished

bm <- FindClusters(bm, graph.name = "wsnn", algorithm = 3, resolution = 2, verbose = FALSE)

Finally, we will visualize both Omics and the consunsus UMAP plot:

p1 <- DimPlot(bm, reduction='rna.umap', group.by='celltype.l2', label=TRUE, repel=TRUE, label.size=2.5) + NoLegend()  + ggtitle("scRNAseq")
p2 <- DimPlot(bm, reduction='adt.umap', group.by='celltype.l2', label=TRUE, repel=TRUE, label.size=2.5) + NoLegend()  + ggtitle("scProteomics")
p1 + p2

p3 <- DimPlot(bm, reduction="wnn.umap", group.by="celltype.l2", label=TRUE, label.size=2.5, repel=TRUE) + NoLegend() + ggtitle("WNN")
p3

We can see that the CD8 Naive T-cells that were overlapping with the CD4 naive T-cells using scRNAseq data alone, can now be clearly distinguishable due to the overlapping with the ADT scProteomics data.

Session Info

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=sv_SE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=sv_SE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=sv_SE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dplyr_1.0.7             cowplot_1.1.1           bmcite.SeuratData_0.3.0
##  [4] SeuratData_0.2.1        Signac_1.3.0            ggplot2_3.3.5          
##  [7] SeuratObject_4.0.2      Seurat_4.0.4            knitr_1.33             
## [10] rmarkdown_2.10         
## 
## loaded via a namespace (and not attached):
##   [1] fastmatch_1.1-3        plyr_1.8.6             igraph_1.2.6          
##   [4] lazyeval_0.2.2         splines_4.1.0          BiocParallel_1.26.2   
##   [7] listenv_0.8.0          scattermore_0.7        SnowballC_0.7.0       
##  [10] GenomeInfoDb_1.28.4    digest_0.6.27          htmltools_0.5.2       
##  [13] fansi_0.5.0            magrittr_2.0.1         tensor_1.5            
##  [16] cluster_2.1.2          ROCR_1.0-11            globals_0.14.0        
##  [19] Biostrings_2.60.2      matrixStats_0.60.1     docopt_0.7.1          
##  [22] spatstat.sparse_2.0-0  colorspace_2.0-2       rappdirs_0.3.3        
##  [25] ggrepel_0.9.1          xfun_0.25              sparsesvd_0.2         
##  [28] crayon_1.4.1           RCurl_1.98-1.4         jsonlite_1.7.2        
##  [31] spatstat.data_2.1-0    survival_3.2-11        zoo_1.8-9             
##  [34] glue_1.4.2             polyclip_1.10-0        gtable_0.3.0          
##  [37] zlibbioc_1.38.0        XVector_0.32.0         leiden_0.3.9          
##  [40] future.apply_1.8.1     BiocGenerics_0.38.0    abind_1.4-5           
##  [43] scales_1.1.1           DBI_1.1.1              miniUI_0.1.1.1        
##  [46] Rcpp_1.0.7             viridisLite_0.4.0      xtable_1.8-4          
##  [49] reticulate_1.20        spatstat.core_2.3-0    bit_4.0.4             
##  [52] stats4_4.1.0           htmlwidgets_1.5.3      httr_1.4.2            
##  [55] RColorBrewer_1.1-2     ellipsis_0.3.2         ica_1.0-2             
##  [58] pkgconfig_2.0.3        farver_2.1.0           sass_0.4.0            
##  [61] ggseqlogo_0.1          uwot_0.1.10            deldir_0.2-10         
##  [64] utf8_1.2.2             labeling_0.4.2         tidyselect_1.1.1      
##  [67] rlang_0.4.11           reshape2_1.4.4         later_1.3.0           
##  [70] munsell_0.5.0          tools_4.1.0            cli_3.0.1             
##  [73] generics_0.1.0         ggridges_0.5.3         evaluate_0.14         
##  [76] stringr_1.4.0          fastmap_1.1.0          yaml_2.2.1            
##  [79] goftest_1.2-2          bit64_4.0.5            fitdistrplus_1.1-5    
##  [82] purrr_0.3.4            RANN_2.6.1             pbapply_1.4-3         
##  [85] future_1.22.1          nlme_3.1-152           mime_0.11             
##  [88] slam_0.1-48            RcppRoll_0.3.0         hdf5r_1.3.4           
##  [91] compiler_4.1.0         plotly_4.9.4.1         png_0.1-7             
##  [94] spatstat.utils_2.2-0   tibble_3.1.4           tweenr_1.0.2          
##  [97] bslib_0.3.0            stringi_1.7.4          highr_0.9             
## [100] RSpectra_0.16-0        lattice_0.20-44        Matrix_1.3-4          
## [103] vctrs_0.3.8            pillar_1.6.2           lifecycle_1.0.0       
## [106] spatstat.geom_2.2-2    lmtest_0.9-38          jquerylib_0.1.4       
## [109] RcppAnnoy_0.0.19       data.table_1.14.0      bitops_1.0-7          
## [112] irlba_2.3.3            httpuv_1.6.2           patchwork_1.1.1       
## [115] GenomicRanges_1.44.0   R6_2.5.1               promises_1.2.0.1      
## [118] KernSmooth_2.23-20     gridExtra_2.3          lsa_0.73.2            
## [121] IRanges_2.26.0         parallelly_1.27.0      codetools_0.2-18      
## [124] MASS_7.3-54            assertthat_0.2.1       withr_2.4.2           
## [127] qlcMatrix_0.9.7        sctransform_0.3.2      Rsamtools_2.8.0       
## [130] S4Vectors_0.30.0       GenomeInfoDbData_1.2.6 mgcv_1.8-36           
## [133] parallel_4.1.0         grid_4.1.0             rpart_4.1-15          
## [136] tidyr_1.1.3            Rtsne_0.15             ggforce_0.3.3         
## [139] shiny_1.6.0

OMICS Integration and Systems Biology Course:
Single Cell Omics Integration

Lund, Sweden

Nikolay Oskolkov, SciLifeLab, NBIS Long Term Support, nikolay.oskolkov@scilifelab.se

October 5, 2020

Seurat Single Cell Omics Integration

Seurat for Integrating scRNAseq and scATACseq from PBMC Cells

WNN Integration of CITEseq: scRNAseq and scProteomics (ADT) on PBMC Cells

Session Info