Spatial transcriptomics


Adapted from tutorials by Giovanni Palla (https://scanpy-tutorials.readthedocs.io/en/latest/spatial/integration-scanorama.html) and Carlos Talavera-López (https://docs.scvi-tools.org/en/latest/tutorials/notebooks/stereoscope_heart_LV_tutorial.html)

OBS! For this tutorial you must create and use a new conda environment python_spatial. The recipe can be found at: https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/master/labs/environment_python_spatial.yml

Spatial transcriptomic data with the Visium platform is in many ways similar to scRNAseq data. It contains UMI counts for 5-20 cells instead of single cells, but is still quite sparse in the same way as scRNAseq data is, but with the additional information about spatial location in the tissue.

Here we will first run quality control in a similar manner to scRNAseq data, then QC filtering, dimensionality reduction, integration and clustering. Then we will use scRNAseq data from mouse cortex to run LabelTransfer to predict celltypes in the Visium spots.

We will use two Visium spatial transcriptomics dataset of the mouse brain (Sagittal), which are publicly available from the 10x genomics website. Note, that these dataset have already been filtered for spots that does not overlap with the tissue.

Load packages

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import scanpy as sc
import anndata as an
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scanorama
In [2]:
#sc.logging.print_versions() # gives errror!!
sc.set_figure_params(facecolor="white", figsize=(8, 8))
sc.settings.verbosity = 3

Load ST data

The function datasets.visium_sge() downloads the dataset from 10x genomics and returns an AnnData object that contains counts, images and spatial coordinates. We will calculate standards QC metrics with pp.calculate_qc_metrics and visualize them.

When using your own Visium data, use Scanpy's read_visium() function to import it.

In [3]:
adata_anterior = sc.datasets.visium_sge(
    sample_id="V1_Mouse_Brain_Sagittal_Anterior"
)
adata_posterior = sc.datasets.visium_sge(
    sample_id="V1_Mouse_Brain_Sagittal_Posterior"
)
reading /Users/asabjor/courses/course_git/workshop-scRNAseq/labs/compiled/scanpy/data/V1_Mouse_Brain_Sagittal_Anterior/filtered_feature_bc_matrix.h5
 (0:00:00)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
reading /Users/asabjor/courses/course_git/workshop-scRNAseq/labs/compiled/scanpy/data/V1_Mouse_Brain_Sagittal_Posterior/filtered_feature_bc_matrix.h5
 (0:00:00)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
In [4]:
adata_anterior.var_names_make_unique()
adata_posterior.var_names_make_unique()

To make sure that both images are included in the merged object, use uns_merge="unique".

In [5]:
# merge into one dataset
library_names = ["V1_Mouse_Brain_Sagittal_Anterior", "V1_Mouse_Brain_Sagittal_Posterior"]

adata = adata_anterior.concatenate(
    adata_posterior,
    batch_key="library_id",
    uns_merge="unique",
    batch_categories=library_names
)
In [6]:
adata
Out[6]:
AnnData object with n_obs × n_vars = 6049 × 31053
    obs: 'in_tissue', 'array_row', 'array_col', 'library_id'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'

As you can see, we now have the slot spatial in obsm, which contains the spatial information from the Visium platform.

Quality control


Similar to scRNAseq we use statistics on number of counts, number of features and percent mitochondria for quality control.

In [7]:
# add info on mitochondrial and hemoglobin genes to the objects.
adata.var['mt'] = adata.var_names.str.startswith('mt-') 
adata.var['hb'] = adata.var_names.str.contains(("^Hb.*-"))

sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','hb'], percent_top=None, log1p=False, inplace=True)
In [8]:
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_hb'],
             jitter=0.4, groupby = 'library_id', rotation= 45)

We can also plot the same data onto the tissue section.

In scanpy, this is a bit tricky when you have multiple sections, as you would have to subset and plot them separately.

In [9]:
# need to plot the two sections separately and specify the library_id

for library in library_names:
    sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = ["total_counts", "n_genes_by_counts",'pct_counts_mt', 'pct_counts_hb'])

As you can see, the spots with low number of counts/features and high mitochondrial content is mainly towards the edges of the tissue. It is quite likely that these regions are damaged tissue. You may also see regions within a tissue with low quality if you have tears or folds in your section.

But remember, for some tissue types, the amount of genes expressed and proportion mitochondria may also be a biological features, so bear in mind what tissue you are working on and what these features mean.

Filter

Select all spots with less than 25% mitocondrial reads, less than 20% hb-reads and 1000 detected genes. You must judge for yourself based on your knowledge of the tissue what are appropriate filtering criteria for your dataset.

In [10]:
keep = (adata.obs['pct_counts_hb'] < 20) & (adata.obs['pct_counts_mt'] < 25) & (adata.obs['n_genes_by_counts'] > 1000)
print(sum(keep))

adata = adata[keep,:]
5725

And replot onto tissue sections.

In [11]:
for library in library_names:
    sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = ["total_counts", "n_genes_by_counts",'pct_counts_mt', 'pct_counts_hb'])

Top expressed genes

In [12]:
sc.pl.highest_expr_genes(adata, n_top=20)
normalizing counts per cell
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/preprocessing/_normalization.py:170: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
    finished (0:00:00)

As you can see, the mitochondrial genes are among the top expressed. Also the lncRNA gene Bc1 (brain cytoplasmic RNA 1). Also one hemoglobin gene.

Filter genes

We will remove the Bc1 gene, hemoglobin genes (blood contamination) and the mitochondrial genes.

In [13]:
mito_genes = adata.var_names.str.startswith('mt-')
hb_genes = adata.var_names.str.contains('^Hb.*-')

remove = np.add(mito_genes, hb_genes)
remove[adata.var_names == "Bc1"] = True
keep = np.invert(remove)
print(sum(remove))

adata = adata[:,keep]

print(adata.n_obs, adata.n_vars)
22
5725 31031

Analysis


As we have two sections, we will select variable genes with batch_key="library_id" and then take the union of variable genes for further analysis. The idea is to avoid including batch specific genes in the analysis.

In [14]:
# save the counts to a separate object for later, we need the normalized counts in raw for DEG dete
counts_adata = adata.copy()

sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
# take 1500 variable genes per batch and then use the union of them.
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=1500, inplace=True, batch_key="library_id")

# subset for variable genes
adata.raw = adata
adata = adata[:,adata.var.highly_variable_nbatches > 0]

# scale data
sc.pp.scale(adata)
normalizing counts per cell
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/preprocessing/_normalization.py:170: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
    finished (0:00:00)
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
    finished (0:00:01)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)

Now we can plot gene expression of individual genes, the gene Hpca is a strong hippocampal marker and Ttr is a marker of the choroid plexus.

In [15]:
for library in library_names:
    sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = ["Ttr", "Hpca"])

Dimensionality reduction and clustering

We can then now run dimensionality reduction and clustering using the same workflow as we use for scRNA-seq analysis.

In [16]:
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="clusters")
computing neighbors
WARNING: You’re trying to run this on 2404 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
computing PCA
    with n_comps=50
    finished (0:00:02)
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:06)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
running Leiden clustering
    finished: found 22 clusters and added
    'clusters', the cluster labels (adata.obs, categorical) (0:00:00)

Dimensionality reduction and clustering

We can then now run dimensionality reduction and clustering using the same workflow as we use for scRNA-seq analysis.

In [17]:
sc.pl.umap(
    adata, color=["clusters", "library_id"], palette=sc.pl.palettes.default_20
)
WARNING: Length of palette colors is smaller than the number of categories (palette length: 20, categories length: 22. Some categories will have the same color.
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

As we are plotting the two sections separately, we need to make sure that they get the same colors by fetching cluster colors from a dict.

In [18]:
clusters_colors = dict(
    zip([str(i) for i in range(len(adata.obs.clusters.cat.categories))], adata.uns["clusters_colors"])
)

fig, axs = plt.subplots(1, 2, figsize=(15, 10))

for i, library in enumerate(
    ["V1_Mouse_Brain_Sagittal_Anterior", "V1_Mouse_Brain_Sagittal_Posterior"]
):
    ad = adata[adata.obs.library_id == library, :].copy()
    sc.pl.spatial(
        ad,
        img_key="hires",
        library_id=library,
        color="clusters",
        size=1.5,
        palette=[
            v
            for k, v in clusters_colors.items()
            if k in ad.obs.clusters.unique().tolist()
        ],
        legend_loc=None,
        show=False,
        ax=axs[i],
    )

plt.tight_layout()

Integration

Quite often there are strong batch effects between different ST sections, so it may be a good idea to integrate the data across sections.

We will do a similar integration as in the Data Integration lab, here we will use Scanorama for integration.

In [19]:
adatas = {}
for batch in library_names:
    adatas[batch] = adata[adata.obs['library_id'] == batch,]

adatas 
Out[19]:
{'V1_Mouse_Brain_Sagittal_Anterior': View of AnnData object with n_obs × n_vars = 2590 × 2404
     obs: 'in_tissue', 'array_row', 'array_col', 'library_id', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'clusters'
     var: 'gene_ids', 'feature_types', 'genome', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
     uns: 'spatial', 'library_id_colors', 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'clusters_colors'
     obsm: 'spatial', 'X_pca', 'X_umap'
     obsp: 'distances', 'connectivities',
 'V1_Mouse_Brain_Sagittal_Posterior': View of AnnData object with n_obs × n_vars = 3135 × 2404
     obs: 'in_tissue', 'array_row', 'array_col', 'library_id', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'clusters'
     var: 'gene_ids', 'feature_types', 'genome', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
     uns: 'spatial', 'library_id_colors', 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'clusters_colors'
     obsm: 'spatial', 'X_pca', 'X_umap'
     obsp: 'distances', 'connectivities'}
In [20]:
import scanorama

#convert to list of AnnData objects
adatas = list(adatas.values())

# run scanorama.integrate
scanorama.integrate_scanpy(adatas, dimred = 50)


# Get all the integrated matrices.
scanorama_int = [ad.obsm['X_scanorama'] for ad in adatas]

# make into one matrix.
all_s = np.concatenate(scanorama_int)
print(all_s.shape)

# add to the AnnData object
adata.obsm["Scanorama"] = all_s
Found 2404 genes among all datasets
[[0.         0.46409266]
 [0.         0.        ]]
Processing datasets (0, 1)
(5725, 50)
In [21]:
adata
Out[21]:
AnnData object with n_obs × n_vars = 5725 × 2404
    obs: 'in_tissue', 'array_row', 'array_col', 'library_id', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'clusters'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
    uns: 'spatial', 'library_id_colors', 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'clusters_colors'
    obsm: 'spatial', 'X_pca', 'X_umap', 'Scanorama'
    obsp: 'distances', 'connectivities'

Then we run dimensionality reduction and clustering as before.

In [22]:
sc.pp.neighbors(adata, use_rep="Scanorama")
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="clusters")
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
running Leiden clustering
    finished: found 20 clusters and added
    'clusters', the cluster labels (adata.obs, categorical) (0:00:00)
In [23]:
sc.pl.umap(
    adata, color=["clusters", "library_id"], palette=sc.pl.palettes.default_20
)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

As we have new clusters, we again need to make a new dict for cluster colors

In [24]:
clusters_colors = dict(
    zip([str(i) for i in range(len(adata.obs.clusters.cat.categories))], adata.uns["clusters_colors"])
)


fig, axs = plt.subplots(1, 2, figsize=(15, 10))

for i, library in enumerate(
    ["V1_Mouse_Brain_Sagittal_Anterior", "V1_Mouse_Brain_Sagittal_Posterior"]
):
    ad = adata[adata.obs.library_id == library, :].copy()
    sc.pl.spatial(
        ad,
        img_key="hires",
        library_id=library,
        color="clusters",
        size=1.5,
        palette=[
            v
            for k, v in clusters_colors.items()
            if k in ad.obs.clusters.unique().tolist()
        ],
        legend_loc=None,
        show=False,
        ax=axs[i],
    )

plt.tight_layout()

Do you see any differences between the integrated and non-integrated clusering? Judge for yourself, which of the clusterings do you think looks best? As a reference, you can compare to brain regions in the Allen brain atlas.

Identification of Spatially Variable Features

There are two main workflows to identify molecular features that correlate with spatial location within a tissue. The first is to perform differential expression based on spatially distinct clusters, the other is to find features that are have spatial patterning without taking clusters or spatial annotation into account.

First, we will do differential expression between clusters just as we did for the scRNAseq data before.

In [25]:
# run t-test 
sc.tl.rank_genes_groups(adata, "clusters", method="wilcoxon")
# plot as heatmap for cluster5 genes
sc.pl.rank_genes_groups_heatmap(adata, groups="5", n_genes=10, groupby="clusters")
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:10)
WARNING: dendrogram data not found (using key=dendrogram_clusters). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_clusters']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: 5
In [26]:
# plot onto spatial location
top_genes = sc.get.rank_genes_groups_df(adata, group='5',log2fc_min=0)['names'][:3]


for library in ["V1_Mouse_Brain_Sagittal_Anterior", "V1_Mouse_Brain_Sagittal_Posterior"]:
    sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = top_genes)

We use SpatialDE Svensson et al., a Gaussian process-based statistical framework that aims to identify spatially variable genes.

OBS! Takes a long time to run, so skip this step for now!

In [27]:
# import SpatialDE

# #First, we convert normalized counts and coordinates to pandas dataframe, needed for inputs to spatialDE.
# counts = pd.DataFrame(adata.X.todense(), columns=adata.var_names, index=adata.obs_names)
# coord = pd.DataFrame(adata.obsm['spatial'], columns=['x_coord', 'y_coord'], index=adata.obs_names)
# results = SpatialDE.run(coord, counts)

# #We concatenate the results with the DataFrame of annotations of variables: `adata.var`.
# results.index = results["g"]
# adata.var = pd.concat([adata.var, results.loc[adata.var.index.values, :]], axis=1)

# #Then we can inspect significant genes that varies in space and visualize them with `sc.pl.spatial` function.
# results.sort_values("qval").head(10)

Single cell data

We can use a scRNA-seq dataset as a referenced to predict the proportion of different celltypes in the Visium spots.

Keep in mind that it is important to have a reference that contains all the celltypes you expect to find in your spots. Ideally it should be a scRNAseq reference from the exact same tissue.

We will use a reference scRNA-seq dataset of ~14,000 adult mouse cortical cell taxonomy from the Allen Institute, generated with the SMART-Seq2 protocol.

Conveniently, you can also download the pre-processed dataset in h5ad format from here. Here with bash code:

In [28]:
%%bash

FILE="./data/spatial/adata_processed_sc.h5ad"   

if [ -e $FILE ]
then
   echo "File $FILE is downloaded."
else
   echo "Downloading $FILE"
   mkdir -p data/spatial
   wget -O data/spatial/adata_processed_sc.h5ad https://hmgubox.helmholtz-muenchen.de/f/4ef254675e2a41f89835/?dl=1
fi
File ./data/spatial/adata_processed_sc.h5ad is downloaded.
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
In [29]:
adata_cortex=sc.read_h5ad("data/spatial/adata_processed_sc.h5ad")
In [30]:
sc.pl.umap(adata_cortex, color="cell_subclass", legend_loc = 'on data')
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
In [31]:
adata_cortex.obs.cell_subclass.value_counts()
Out[31]:
L5 IT         2946
Vip           2636
Sst           2524
L6 IT         2148
Pvalb         2035
Lamp5         1826
L4            1336
L6 CT         1220
L2/3 IT       1165
L5 PT          886
NP             729
Astro          542
No Class       489
L6b            448
Sncg           243
nan            232
Oligo          183
Endo           145
VLMC           130
Macrophage     122
SMC            106
Serpinf1        84
Meis2           54
Peri            26
CR              17
Name: cell_subclass, dtype: int64

For speed, and for a more fair comparison of the celltypes, we will subsample all celltypes to a maximum of 200 cells per class (subclass).

In [32]:
target_cells = 200

adatas2 = [adata_cortex[adata_cortex.obs.cell_subclass == clust] for clust in adata_cortex.obs.cell_subclass.cat.categories]

for dat in adatas2:
    if dat.n_obs > target_cells:
         sc.pp.subsample(dat, n_obs=target_cells)

adata_cortex = adatas2[0].concatenate(*adatas2[1:])

adata_cortex.obs.cell_subclass.value_counts()
Out[32]:
Astro         200
L6b           200
Vip           200
Sst           200
Sncg          200
Pvalb         200
No Class      200
NP            200
Lamp5         200
nan           200
L6 IT         200
L6 CT         200
L5 PT         200
L5 IT         200
L4            200
L2/3 IT       200
Oligo         183
Endo          145
VLMC          130
Macrophage    122
SMC           106
Serpinf1       84
Meis2          54
Peri           26
CR             17
Name: cell_subclass, dtype: int64
In [33]:
sc.pl.umap(
    adata_cortex, color=["cell_class", "cell_subclass","donor_genotype","dissected_region"], palette=sc.pl.palettes.default_20
)
WARNING: Length of palette colors is smaller than the number of categories (palette length: 20, categories length: 25. Some categories will have the same color.
WARNING: Length of palette colors is smaller than the number of categories (palette length: 20, categories length: 61. Some categories will have the same color.
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
In [34]:
sc.pl.umap(adata_cortex, color="cell_subclass", legend_loc = 'on data')
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Subset ST for cortex

Since the scRNAseq dataset was generated from the mouse cortex, we will subset the visium dataset in order to select mainly the spots part of the cortex. Note that the integration can also be performed on the whole brain slice, but it would give rise to false positive cell type assignments and and therefore it should be interpreted with more care.

OBS! For deconvolution we will need the counts data, so we will subset from the counts_adata object that we created earlier.

In [35]:
lib_a = "V1_Mouse_Brain_Sagittal_Anterior"

counts_adata.obs['clusters'] = adata.obs.clusters

adata_anterior_subset = counts_adata[
    (counts_adata.obs.library_id == lib_a) 
    & (counts_adata.obsm["spatial"][:, 1] < 6000), :
].copy()

# select also the cortex clusters
adata_anterior_subset = adata_anterior_subset[adata_anterior_subset.obs.clusters.isin(['2','3','5','6']),:]
In [36]:
# plot to check that we have the correct regions

sc.pl.spatial(
    adata_anterior_subset,
    img_key="hires",
    library_id = lib_a,
    color=['clusters'],
    size=1.5
)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/compat/_overloaded_dict.py:106: ImplicitModificationWarning: Trying to modify attribute `._uns` of view, initializing view as actual.
  self.data[key] = value

Deconvolution

Deconvolution is a method to estimate the abundance (or proportion) of different celltypes in a bulkRNAseq dataset using a single cell reference. As the Visium data can be seen as a small bulk, we can both use methods for traditional bulkRNAseq as well as methods especially developed for Visium data. Some methods for deconvolution are DWLS, cell2location, Tangram, Stereoscope, RCTD, SCDC and many more.

Here, we will use deconvolution with Stereoscope implemented in the SCVI-tools package. To read more about Stereoscope please check out this github page (https://github.com/almaan/stereoscope)

Select genes for deconvolution

Most deconvolution methods does a prior gene selection and there are different options that are used:

  • Use variable genes in the SC data.
  • Use variable genes in both SC and ST data
  • DE genes between clusters in the SC data.

In this case we will use top DEG genes per cluster, so first we have to run DEG detection on the scRNAseq data.

In [37]:
sc.tl.rank_genes_groups(adata_cortex, 'cell_subclass', method = "t-test", n_genes=100)
sc.pl.rank_genes_groups_dotplot(adata_cortex, n_genes=3)
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:14)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
WARNING: dendrogram data not found (using key=dendrogram_cell_subclass). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 30
Storing dendrogram info using `.uns['dendrogram_cell_subclass']`
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_dotplot.py:749: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored
  dot_ax.scatter(x, y, **kwds)
In [ ]:
 
In [38]:
sc.tl.filter_rank_genes_groups(adata_cortex, min_fold_change=1)

genes = sc.get.rank_genes_groups_df(adata_cortex, group = None)
genes
Filtering genes using: min_in_group_fraction: 0.25 min_fold_change: 1, max_out_group_fraction: 0.5
Out[38]:
group names scores logfoldchanges pvals pvals_adj
0 Astro Slc1a3 187.573410 12.699084 0.000000e+00 0.000000e+00
1 Astro Cldn10 185.580582 13.047205 5.563159e-318 3.719994e-316
2 Astro Ntsr2 183.421204 12.866629 0.000000e+00 0.000000e+00
3 Astro Gpr37l1 174.906357 13.525583 9.881313e-324 6.719293e-322
4 Astro Gjb6 165.159912 12.888643 0.000000e+00 0.000000e+00
... ... ... ... ... ... ...
2495 nan Syngr1 27.384808 1.977148 1.081371e-116 3.840127e-114
2496 nan Atcay 27.351534 2.616293 2.625003e-96 4.949213e-94
2497 nan Necap1 27.262054 2.301202 1.816355e-116 6.327315e-114
2498 nan App 27.235752 1.733814 3.233941e-126 1.442535e-123
2499 nan Chp1 27.221622 2.231402 2.501038e-107 6.878229e-105

2500 rows × 6 columns

In [39]:
deg = genes.names.unique().tolist()
print(len(deg))
# check that the genes are also present in the ST data


deg = np.intersect1d(deg,adata_anterior_subset.var.index).tolist()
print(len(deg))
1427
1343

Train the model

First, train the model using scRNAseq data.

Stereoscope requires the data to be in counts, earlier in this tutorial we saved the spatial counts in a separate object counts_adata.

However, the single cell dataset that we dowloaded only has the lognormalized data in the adata.X slot, hence we will have to recalculate the count matrix.

In [40]:
# first do exponent and subtract pseudocount
E = np.exp(adata_cortex.X)-1
n = np.sum(E,1)
print(np.min(n), np.max(n))
# all sums to 1.7M
factor = np.mean(n)
nC = np.array(adata_cortex.obs.total_counts) # true number of counts
scaleF = nC/factor
C = E * scaleF[:,None]
C = C.astype("int")
1728362.2 1728362.8
In [41]:
sc_adata = adata_cortex.copy()
sc_adata.X = C

Setup the anndata, the implementation requires the counts matrix to be in the "counts" layer as a copy.

In [42]:
import scvi
#from scvi.data import register_tensor_from_anndata
from scvi.external import RNAStereoscope, SpatialStereoscope

# add counts layer
sc_adata.layers["counts"] = sc_adata.X.copy()

# subset for the selected genes
sc_adata = sc_adata[:, deg].copy()

# create stereoscope object
RNAStereoscope.setup_anndata(sc_adata, layer = "counts", labels_key = "cell_subclass")
Global seed set to 0
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
  new_rank_zero_deprecation(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.rank_zero_experiment` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.rank_zero_experiment` instead.
  return new_rank_zero_deprecation(*args, **kwargs)
In [43]:
# the model is saved to a file, so if is slow to run, you can simply reload it from disk by setting train = False

train = True
if train:
    sc_model = RNAStereoscope(sc_adata)
    sc_model.train(max_epochs = 300)
    sc_model.history["elbo_train"][10:].plot()
    sc_model.save("./data/spatial/scmodel", overwrite=True)
else:
    sc_model = RNAStereoscope.load("./data/spatial/scmodel", sc_adata)
    print("Loaded RNA model from file!")
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 300/300: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [10:56<00:00,  2.72s/it, loss=1.39e+07, v_num=1]
`Trainer.fit` stopped: `max_epochs=300` reached.
Epoch 300/300: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [10:56<00:00,  2.19s/it, loss=1.39e+07, v_num=1]

Predict propritions on the spatial data

First create a new st object with the correct genes and counts as a layer.

In [44]:
st_adata = adata_anterior_subset.copy()

st_adata.layers["counts"] = st_adata.X.copy()
st_adata = st_adata[:, deg].copy()

SpatialStereoscope.setup_anndata(st_adata, layer="counts")
In [45]:
train=True
if train:
    spatial_model = SpatialStereoscope.from_rna_model(st_adata, sc_model)
    spatial_model.train(max_epochs = 3000)
    spatial_model.history["elbo_train"][10:].plot()
    spatial_model.save("./data/spatial/stmodel", overwrite = True)
else:
    spatial_model = SpatialStereoscope.load("./data/spatial/stmodel", st_adata)
    print("Loaded Spatial model from file!")
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1892: PossibleUserWarning: The number of training batches (6) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 3000/3000: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 3000/3000 [16:18<00:00,  7.61it/s, loss=1.49e+06, v_num=1]
`Trainer.fit` stopped: `max_epochs=3000` reached.
Epoch 3000/3000: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 3000/3000 [16:18<00:00,  3.07it/s, loss=1.49e+06, v_num=1]

Get the results from the model, also put them in the .obs slot.

In [46]:
st_adata.obsm["deconvolution"] = spatial_model.get_proportions()

# also copy to the obsm data frame
for ct in st_adata.obsm["deconvolution"].columns:
    st_adata.obs[ct] = st_adata.obsm["deconvolution"][ct]

We are then able to explore how cell types in the scRNA-seq dataset are predicted onto the visium dataset. Let's first visualize the neurons cortical layers.

In [47]:
sc.pl.spatial(
    st_adata,
    img_key="hires",
    color=["L2/3 IT", "L4", "L5 PT", "L6 CT"],
    library_id=lib_a,
    size=1.5,
    ncols=2
)

We can go ahead an visualize astrocytes and oligodendrocytes as well.

In [48]:
sc.pl.spatial(
    st_adata, img_key="hires", color=["Oligo", "Astro"], size=1.5, library_id = lib_a
)

We can also visualize the scores per cluster in ST data

In [49]:
sc.pl.violin(st_adata, ["L2/3 IT", "L6 CT","Oligo","Astro"],
            jitter=0.4, groupby = 'clusters', rotation= 45)

Keep in mind that the deconvolution results are just predictions, depending on how well your scRNAseq data covers the celltypes that are present in the ST data and on how parameters, gene selection etc. are tuned you may get different results.

**Your turn:** Subset for another region that does not contain cortex cells and check what you get from the label transfer. Suggested region is the right end of the posterial section that you can select like this:
In [50]:
lib_p = "V1_Mouse_Brain_Sagittal_Posterior"

adata_subregion = adata[
    (adata.obs.library_id == lib_p) 
    & (adata.obsm["spatial"][:, 0] > 6500),
    :,
].copy()
In [51]:
sc.pl.spatial(
    adata_subregion,
    img_key="hires",
    library_id = lib_p,
    color=['n_genes_by_counts'],
    size=1.5
)