import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import warnings
import os
import urllib.request
='ignore', category=Warning)
warnings.simplefilter(action
# verbosity: errors (0), warnings (1), info (2), hints (3)
= 3
sc.settings.verbosity
=80)
sc.settings.set_figure_params(dpi%matplotlib inline
Code chunks run Python commands unless it starts with %%bash
, in which case, those chunks run shell commands.
In this tutorial we will look at different ways of integrating multiple single cell RNA-seq datasets. We will explore a few different methods to correct for batch effects across datasets. Seurat uses the data integration method presented in Comprehensive Integration of Single Cell Data, while Scran and Scanpy use a mutual Nearest neighbour method (MNN). Below you can find a list of some methods for single data integration:
Markdown | Language | Library | Ref |
---|---|---|---|
CCA | R | Seurat | Cell |
MNN | R/Python | Scater/Scanpy | Nat. Biotech. |
Conos | R | conos | Nat. Methods |
Scanorama | Python | scanorama | Nat. Biotech. |
1 Data preparation
Let’s first load necessary libraries and the data saved in the previous lab.
Create individual adata objects per batch.
# download pre-computed data if missing or long compute
= True
fetch_data
# url for source and intermediate data
= "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq"
path_data
= "data/covid/results"
path_results if not os.path.exists(path_results):
=True)
os.makedirs(path_results, exist_ok
= "data/covid/results/scanpy_covid_qc_dr.h5ad"
path_file if fetch_data and not os.path.exists(path_file):
urllib.request.urlretrieve(os.path.join('covid/results/scanpy_covid_qc_dr.h5ad'), path_file)
path_data,
= sc.read_h5ad(path_file)
adata adata
AnnData object with n_obs × n_vars = 7222 × 2626
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'
obsm: 'X_pca', 'X_tsne', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
print(adata.X.shape)
(7222, 2626)
As the stored AnnData object contains scaled data based on variable genes, we need to make a new object with the logtransformed normalized counts. The new variable gene selection should not be performed on the scaled data matrix.
= adata.raw.to_adata()
adata2
# in some versions of Anndata there is an issue with information on the logtransformation in the slot log1p.base so we set it to None to not get errors.
'log1p']['base']=None
adata2.uns[
# check that the matrix looks like normalized counts
print(adata2.X[1:10,1:10])
<Compressed Sparse Row sparse matrix of dtype 'float64'
with 2 stored elements and shape (9, 9)>
Coords Values
(0, 3) 0.7825693876867097
(7, 6) 1.1311041336746985
2 Detect variable genes
Variable genes can be detected across the full dataset, but then we run the risk of getting many batch-specific genes that will drive a lot of the variation. Or we can select variable genes from each batch separately to get only celltype variation. In the dimensionality reduction exercise, we already selected variable genes, so they are already stored in adata.var.highly_variable
.
= adata.var.highly_variable
var_genes_all
print("Highly variable genes: %d"%sum(var_genes_all))
Highly variable genes: 2626
Detect variable genes in each dataset separately using the batch_key
parameter.
=0.0125, max_mean=3, min_disp=0.5, batch_key = 'sample')
sc.pp.highly_variable_genes(adata2, min_mean
print("Highly variable genes intersection: %d"%sum(adata2.var.highly_variable_intersection))
print("Number of batches where gene is variable:")
print(adata2.var.highly_variable_nbatches.value_counts())
= adata2.var.highly_variable_nbatches > 0 var_genes_batch
extracting highly variable genes
finished (0:00:02)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
Highly variable genes intersection: 122
Number of batches where gene is variable:
0 7876
1 4163
2 3161
3 2025
4 1115
5 559
6 277
7 170
8 122
Name: highly_variable_nbatches, dtype: int64
Compare overlap of variable genes with batches or with all data.
print("Any batch var genes: %d"%sum(var_genes_batch))
print("All data var genes: %d"%sum(var_genes_all))
print("Overlap: %d"%sum(var_genes_batch & var_genes_all))
print("Variable genes in all batches: %d"%sum(adata2.var.highly_variable_nbatches == 6))
print("Overlap batch instersection and all: %d"%sum(var_genes_all & adata2.var.highly_variable_intersection))
Any batch var genes: 11592
All data var genes: 2626
Overlap: 2625
Variable genes in all batches: 277
Overlap batch instersection and all: 122
Select all genes that are variable in at least 2 datasets and use for remaining analysis.
= adata2.var.highly_variable_nbatches > 2
var_select = var_select.index[var_select]
var_genes len(var_genes)
4268
3 BBKNN
First, we will run BBKNN that is implemented in scanpy.
import bbknn
='sample')
bbknn.bbknn(adata2,batch_key
# Before calculating a new umap and tsne, we want to store the old one.
'X_umap_uncorr'] = adata2.obsm['X_umap']
adata2.obsm['X_tsne_uncorr'] = adata2.obsm['X_tsne']
adata2.obsm[
# then run umap on the integrated space
sc.tl.umap(adata2) sc.tl.tsne(adata2)
computing batch balanced neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:02)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm)
'umap', UMAP parameters (adata.uns) (0:00:12)
computing tSNE
using 'X_pca' with n_pcs = 50
using sklearn.manifold.TSNE
finished: added
'X_tsne', tSNE coordinates (adata.obsm)
'tsne', tSNE parameters (adata.uns) (0:00:14)
We can now plot the unintegrated and the integrated space reduced dimensions.
= plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)
fig, axs ="sample", title="BBKNN Corrected tsne", ax=axs[0,0], show=False)
sc.pl.tsne(adata2, color="sample", title="Uncorrected tsne", ax=axs[0,1], show=False)
sc.pl.tsne(adata, color="sample", title="BBKNN Corrected umap", ax=axs[1,0], show=False)
sc.pl.umap(adata2, color="sample", title="Uncorrected umap", ax=axs[1,1], show=False) sc.pl.umap(adata, color
Let’s save the integrated data for further analysis.
# Before calculating a new umap and tsne, we want to store the old one.
'X_umap_bbknn'] = adata2.obsm['X_umap']
adata2.obsm['X_tsne_bbknn'] = adata2.obsm['X_tsne']
adata2.obsm[
= './data/covid/results/scanpy_covid_qc_dr_bbknn.h5ad'
save_file adata2.write_h5ad(save_file)
4 Harmony
An alternative method for integration is Harmony, for more details on the method, please se their paper Nat. Methods. This method runs the integration on a dimensionality reduction, in most applications the PCA.
import scanpy.external as sce
import harmonypy as hm
'sample')
sce.pp.harmony_integrate(adata2,
# Then we calculate a new umap and tsne.
=10, n_pcs=30, use_rep='X_pca_harmony')
sc.pp.neighbors(adata2, n_neighbors
sc.tl.umap(adata2)='X_pca_harmony')
sc.tl.tsne(adata2, use_rep=0.5) sc.tl.leiden(adata2, resolution
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm)
'umap', UMAP parameters (adata.uns) (0:00:09)
computing tSNE
using sklearn.manifold.TSNE
finished: added
'X_tsne', tSNE coordinates (adata.obsm)
'tsne', tSNE parameters (adata.uns) (0:00:15)
running Leiden clustering
finished: found 10 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
= plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)
fig, axs 'X_tsne_bbknn', color="sample", title="BBKNN tsne", ax=axs[0,0], show=False)
sc.pl.embedding(adata2, ="sample", title="Harmony tsne", ax=axs[0,1], show=False)
sc.pl.tsne(adata2, color'X_umap_bbknn', color="sample", title="BBKNN umap", ax=axs[1,0], show=False)
sc.pl.embedding(adata2, ="sample", title="Harmony umap", ax=axs[1,1], show=False) sc.pl.umap(adata2, color
Let’s save the integrated data for further analysis.
# Store this umap and tsne with a new name.
'X_umap_harmony'] = adata2.obsm['X_umap']
adata2.obsm['X_tsne_harmony'] = adata2.obsm['X_tsne']
adata2.obsm[
#save to file
= './data/covid/results/scanpy_covid_qc_dr_harmony.h5ad'
save_file adata2.write_h5ad(save_file)
5 Combat
Batch correction can also be performed with combat. Note that ComBat batch correction requires a dense matrix format as input (which is already the case in this example).
# create a new object with lognormalized counts
= sc.AnnData(X=adata.raw.X, var=adata.raw.var, obs = adata.obs)
adata_combat
# first store the raw data
= adata_combat
adata_combat.raw
# run combat
='sample') sc.pp.combat(adata_combat, key
Standardizing Data across genes.
Found 8 batches
Found 0 numerical variables:
Found 39 genes with zero variance.
Fitting L/S model and finding priors
Finding parametric adjustments
Adjusting data
Then we run the regular steps of dimensionality reduction on the combat corrected data. Variable gene selection, pca and umap with combat data.
sc.pp.highly_variable_genes(adata_combat)print("Highly variable genes: %d"%sum(adata_combat.var.highly_variable))
sc.pl.highly_variable_genes(adata_combat)
=30, use_highly_variable=True, svd_solver='arpack')
sc.pp.pca(adata_combat, n_comps
sc.pp.neighbors(adata_combat)
sc.tl.umap(adata_combat) sc.tl.tsne(adata_combat)
extracting highly variable genes
finished (0:00:01)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
Highly variable genes: 3923
computing PCA
with n_comps=30
finished (0:00:01)
computing neighbors
using 'X_pca' with n_pcs = 30
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm)
'umap', UMAP parameters (adata.uns) (0:00:10)
computing tSNE
using 'X_pca' with n_pcs = 30
using sklearn.manifold.TSNE
finished: added
'X_tsne', tSNE coordinates (adata.obsm)
'tsne', tSNE parameters (adata.uns) (0:00:14)
# compare var_genes
= adata_combat.var.highly_variable
var_genes_combat print("With all data %d"%sum(var_genes_all))
print("With combat %d"%sum(var_genes_combat))
print("Overlap %d"%sum(var_genes_all & var_genes_combat))
print("With 2 batches %d"%sum(var_select))
print("Overlap %d"%sum(var_genes_combat & var_select))
With all data 2626
With combat 3923
Overlap 1984
With 2 batches 4268
Overlap 2729
We can now plot the unintegrated and the integrated space reduced dimensions.
= plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)
fig, axs ="sample", title="Harmony tsne", ax=axs[0,0], show=False)
sc.pl.tsne(adata2, color="sample", title="Combat tsne", ax=axs[0,1], show=False)
sc.pl.tsne(adata_combat, color="sample", title="Harmony umap", ax=axs[1,0], show=False)
sc.pl.umap(adata2, color="sample", title="Combat umap", ax=axs[1,1], show=False) sc.pl.umap(adata_combat, color
Let’s save the integrated data for further analysis.
#save to file
= './data/covid/results/scanpy_covid_qc_dr_combat.h5ad'
save_file adata_combat.write_h5ad(save_file)
6 Scanorama
Try out Scanorama for data integration as well. First we need to create individual AnnData objects from each of the datasets.
# split per batch into new objects.
= adata2.obs['sample'].cat.categories.tolist()
batches = {}
alldata for batch in batches:
= adata2[adata2.obs['sample'] == batch,]
alldata[batch]
alldata
{'covid_1': View of AnnData object with n_obs × n_vars = 876 × 19468
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap', 'leiden'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_umap_uncorr', 'X_tsne_uncorr', 'X_umap_bbknn', 'X_tsne_bbknn', 'X_pca_harmony', 'X_umap_harmony', 'X_tsne_harmony'
obsp: 'connectivities', 'distances',
'covid_15': View of AnnData object with n_obs × n_vars = 591 × 19468
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap', 'leiden'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_umap_uncorr', 'X_tsne_uncorr', 'X_umap_bbknn', 'X_tsne_bbknn', 'X_pca_harmony', 'X_umap_harmony', 'X_tsne_harmony'
obsp: 'connectivities', 'distances',
'covid_16': View of AnnData object with n_obs × n_vars = 370 × 19468
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap', 'leiden'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_umap_uncorr', 'X_tsne_uncorr', 'X_umap_bbknn', 'X_tsne_bbknn', 'X_pca_harmony', 'X_umap_harmony', 'X_tsne_harmony'
obsp: 'connectivities', 'distances',
'covid_17': View of AnnData object with n_obs × n_vars = 1084 × 19468
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap', 'leiden'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_umap_uncorr', 'X_tsne_uncorr', 'X_umap_bbknn', 'X_tsne_bbknn', 'X_pca_harmony', 'X_umap_harmony', 'X_tsne_harmony'
obsp: 'connectivities', 'distances',
'ctrl_5': View of AnnData object with n_obs × n_vars = 1017 × 19468
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap', 'leiden'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_umap_uncorr', 'X_tsne_uncorr', 'X_umap_bbknn', 'X_tsne_bbknn', 'X_pca_harmony', 'X_umap_harmony', 'X_tsne_harmony'
obsp: 'connectivities', 'distances',
'ctrl_13': View of AnnData object with n_obs × n_vars = 1121 × 19468
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap', 'leiden'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_umap_uncorr', 'X_tsne_uncorr', 'X_umap_bbknn', 'X_tsne_bbknn', 'X_pca_harmony', 'X_umap_harmony', 'X_tsne_harmony'
obsp: 'connectivities', 'distances',
'ctrl_14': View of AnnData object with n_obs × n_vars = 1030 × 19468
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap', 'leiden'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_umap_uncorr', 'X_tsne_uncorr', 'X_umap_bbknn', 'X_tsne_bbknn', 'X_pca_harmony', 'X_umap_harmony', 'X_tsne_harmony'
obsp: 'connectivities', 'distances',
'ctrl_19': View of AnnData object with n_obs × n_vars = 1133 × 19468
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap', 'leiden'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_umap_uncorr', 'X_tsne_uncorr', 'X_umap_bbknn', 'X_tsne_bbknn', 'X_pca_harmony', 'X_umap_harmony', 'X_tsne_harmony'
obsp: 'connectivities', 'distances'}
import scanorama
#subset the individual dataset to the variable genes we defined at the beginning
= dict()
alldata2 for ds in alldata.keys():
print(ds)
= alldata[ds][:,var_genes]
alldata2[ds]
#convert to list of AnnData objects
= list(alldata2.values())
adatas
# run scanorama.integrate
= 50) scanorama.integrate_scanpy(adatas, dimred
covid_1
covid_15
covid_16
covid_17
ctrl_5
ctrl_13
ctrl_14
ctrl_19
Found 4268 genes among all datasets
[[0. 0.50761421 0.52972973 0.26845018 0.59488692 0.48401826
0.36757991 0.09973522]
[0. 0. 0.81891892 0.33840948 0.43362832 0.23181049
0.29949239 0.17597293]
[0. 0. 0. 0.22702703 0.49459459 0.52972973
0.42702703 0.3 ]
[0. 0. 0. 0. 0.27138643 0.09132841
0.1300738 0.17387467]
[0. 0. 0. 0. 0. 0.8446411
0.73647984 0.25419241]
[0. 0. 0. 0. 0. 0.
0.82815534 0.44836717]
[0. 0. 0. 0. 0. 0.
0. 0.78022948]
[0. 0. 0. 0. 0. 0.
0. 0. ]]
Processing datasets (4, 5)
Processing datasets (5, 6)
Processing datasets (1, 2)
Processing datasets (6, 7)
Processing datasets (4, 6)
Processing datasets (0, 4)
Processing datasets (2, 5)
Processing datasets (0, 2)
Processing datasets (0, 1)
Processing datasets (2, 4)
Processing datasets (0, 5)
Processing datasets (5, 7)
Processing datasets (1, 4)
Processing datasets (2, 6)
Processing datasets (0, 6)
Processing datasets (1, 3)
Processing datasets (2, 7)
Processing datasets (1, 6)
Processing datasets (3, 4)
Processing datasets (0, 3)
Processing datasets (4, 7)
Processing datasets (1, 5)
Processing datasets (2, 3)
Processing datasets (1, 7)
Processing datasets (3, 7)
Processing datasets (3, 6)
#scanorama adds the corrected matrix to adata.obsm in each of the datasets in adatas.
0].obsm['X_scanorama'].shape adatas[
(876, 50)
# Get all the integrated matrices.
= [ad.obsm['X_scanorama'] for ad in adatas]
scanorama_int
# make into one matrix.
= np.concatenate(scanorama_int)
all_s print(all_s.shape)
# add to the AnnData object, create a new object first
"Scanorama"] = all_s adata2.obsm[
(7222, 50)
# tsne and umap
=30, use_rep = "Scanorama")
sc.pp.neighbors(adata2, n_pcs
sc.tl.umap(adata2)= 30, use_rep = "Scanorama") sc.tl.tsne(adata2, n_pcs
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm)
'umap', UMAP parameters (adata.uns) (0:00:10)
computing tSNE
using sklearn.manifold.TSNE
finished: added
'X_tsne', tSNE coordinates (adata.obsm)
'tsne', tSNE parameters (adata.uns) (0:00:14)
We can now plot the unintegrated and the integrated space reduced dimensions.
= plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)
fig, axs 'X_tsne_harmony', color="sample", title="Harmony tsne", ax=axs[0,0], show=False)
sc.pl.embedding(adata2, ="sample", title="Scanorama tsne", ax=axs[0,1], show=False)
sc.pl.tsne(adata2, color'X_umap_harmony', color="sample", title="Harmony umap", ax=axs[1,0], show=False)
sc.pl.embedding(adata2, ="sample", title="Scanorama umap", ax=axs[1,1], show=False) sc.pl.umap(adata2, color
Let’s save the integrated data for further analysis.
# Store this umap and tsne with a new name.
'X_umap_scanorama'] = adata2.obsm['X_umap']
adata2.obsm['X_tsne_scanorama'] = adata2.obsm['X_tsne']
adata2.obsm[
#save to file, now contains all integrations except the combat one.
= './data/covid/results/scanpy_covid_qc_dr_int.h5ad'
save_file adata2.write_h5ad(save_file)
7 Overview all methods
Now we will plot UMAPS with all three integration methods side by side.
= plt.subplots(2, 3, figsize=(10,8),constrained_layout=True)
fig, axs ="sample", title="Uncorrected", ax=axs[0,0], show=False)
sc.pl.umap(adata, color'X_umap_bbknn', color="sample", title="BBKNN", ax=axs[0,1], show=False)
sc.pl.embedding(adata2, ="sample", title="Combat", ax=axs[0,2], show=False)
sc.pl.umap(adata_combat, color'X_umap_harmony', color="sample", title="Harmony", ax=axs[1,0], show=False)
sc.pl.embedding(adata2, ="sample", title="Scanorama", ax=axs[1,1], show=False) sc.pl.umap(adata2, color
Look at the different integration results, which one do you think looks the best? How would you motivate selecting one method over the other? How do you think you could best evaluate if the integration worked well?
8 Extra task
Have a look at the documentation for BBKNN
Try changing some of the parameteres in BBKNN, such as distance metric, number of PCs and number of neighbors. How does the results change with different parameters? Can you explain why?
9 Session info
Click here
sc.logging.print_versions()
-----
anndata 0.10.8
scanpy 1.10.3
-----
CoreFoundation NA
Foundation NA
PIL 10.4.0
PyObjCTools NA
annoy NA
anyio NA
appnope 0.1.4
arrow 1.3.0
asciitree NA
asttokens NA
attr 24.2.0
attrs 24.2.0
babel 2.14.0
bbknn 1.6.0
brotli 1.1.0
certifi 2024.08.30
cffi 1.17.1
charset_normalizer 3.3.2
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
cycler 0.12.1
cython_runtime NA
cytoolz 0.12.3
dask 2024.9.0
dateutil 2.9.0
debugpy 1.8.5
decorator 5.1.1
defusedxml 0.7.1
exceptiongroup 1.2.2
executing 2.1.0
fastjsonschema NA
fbpca NA
fqdn NA
google NA
h5py 3.11.0
harmonypy 0.0.10
idna 3.10
igraph 0.11.6
intervaltree NA
ipykernel 6.29.5
isoduration NA
jedi 0.19.1
jinja2 3.1.4
joblib 1.4.2
json5 0.9.25
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema_specifications NA
jupyter_events 0.10.0
jupyter_server 2.14.2
jupyterlab_server 2.27.3
kiwisolver 1.4.7
legacy_api_wrap NA
leidenalg 0.10.2
llvmlite 0.43.0
markupsafe 2.1.5
matplotlib 3.9.2
matplotlib_inline 0.1.7
mpl_toolkits NA
msgpack 1.1.0
natsort 8.4.0
nbformat 5.10.4
numba 0.60.0
numcodecs 0.13.0
numpy 1.26.4
objc 10.3.1
overrides NA
packaging 24.1
pandas 1.5.3
parso 0.8.4
patsy 0.5.6
pickleshare 0.7.5
platformdirs 4.3.6
prometheus_client NA
prompt_toolkit 3.0.47
psutil 6.0.0
pure_eval 0.2.3
pycparser 2.22
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.18.0
pynndescent 0.5.13
pyparsing 3.1.4
pythonjsonlogger NA
pytz 2024.2
referencing NA
requests 2.32.3
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rpds NA
scanorama 1.7.4
scipy 1.14.1
send2trash NA
session_info 1.0.0
six 1.16.0
sklearn 1.5.2
sniffio 1.3.1
socks 1.7.1
sortedcontainers 2.4.0
sparse 0.15.4
sphinxcontrib NA
stack_data 0.6.2
texttable 1.7.0
threadpoolctl 3.5.0
tlz 0.12.3
toolz 0.12.1
torch 2.4.0.post101
torchgen NA
tornado 6.4.1
tqdm 4.66.5
traitlets 5.14.3
typing_extensions NA
umap 0.5.6
uri_template NA
urllib3 2.2.3
wcwidth 0.2.13
webcolors 24.8.0
websocket 1.8.0
yaml 6.0.2
zarr 2.18.3
zipp NA
zmq 26.2.0
zoneinfo NA
zstandard 0.23.0
-----
IPython 8.27.0
jupyter_client 8.6.3
jupyter_core 5.7.2
jupyterlab 4.2.5
notebook 7.2.2
-----
Python 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:17:34) [Clang 14.0.6 ]
macOS-14.6.1-x86_64-i386-64bit
-----
Session information updated at 2024-10-07 08:45