Celltype prediction can either be performed on indiviudal cells where each cell gets a predicted celltype label, or on the level of clusters. All methods are based on similarity to other datasets, single cell or sorted bulk RNAseq, or uses know marker genes for each celltype.
We will select one sample from the Covid data, ctrl_13
and predict celltype by cell on that sample.
Some methods will predict a celltype to each cell based on what it is most similar to even if the celltype of that cell is not included in the reference. Other methods include an uncertainty so that cells with low similarity scores will be unclassified. There are multiple different methods to predict celltypes, here we will just cover a few of those.
Here we will use a reference PBMC dataset that we get from scanpy datasets and classify celltypes based on two methods:
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
sc.settings.verbosity = 2 # verbosity: errors (0), warnings (1), info (2), hints (3)
#sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80)
Read in the clustered data object.
adata = sc.read_h5ad('./data/results/scanpy_clustered_covid.h5ad')
adata.uns['log1p']['base']=None
print(adata.shape)
print(adata.raw.shape)
Subset for ctrl_13
sample.
adata = adata[adata.obs["sample"] == "ctrl_13",:]
print(adata.shape)
sc.pl.umap(
adata, color=["louvain_0.6"], palette=sc.pl.palettes.default_20
)
Load the reference data from scanpy.datasets
. It is the annotated and processed pbmc3k dataset from 10x.
adata_ref = sc.datasets.pbmc3k_processed()
adata_ref.obs['sample']='pbmc3k'
print(adata_ref.shape)
adata_ref.obs
sc.pl.umap(adata_ref, color='louvain')
Make sure we have the same genes in both datset by taking the intersection
print(adata_ref.shape[1])
print(adata.shape[1])
var_names = adata_ref.var_names.intersection(adata.var_names)
print(len(var_names))
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]
First we need to rerun pca and umap with the same gene set for both datasets.
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
sc.pl.umap(adata_ref, color='louvain')
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color='louvain_0.6')
import scanorama
#subset the individual dataset to the same variable genes as in MNN-correct.
alldata = dict()
alldata['ctrl']=adata
alldata['ref']=adata_ref
#convert to list of AnnData objects
adatas = list(alldata.values())
# run scanorama.integrate
scanorama.integrate_scanpy(adatas, dimred = 50)
# add in sample info
adata_ref.obs['sample']='pbmc3k'
# create a merged scanpy object and add in the scanorama
adata_merged = alldata['ctrl'].concatenate(alldata['ref'], batch_key='sample', batch_categories=['ctrl','pbmc3k'])
embedding = np.concatenate([ad.obsm['X_scanorama'] for ad in adatas], axis=0)
adata_merged.obsm['Scanorama'] = embedding
#run umap.
sc.pp.neighbors(adata_merged, n_pcs =50, use_rep = "Scanorama")
sc.tl.umap(adata_merged)
sc.pl.umap(adata_merged, color=["sample","louvain"])
from sklearn.metrics.pairwise import cosine_distances
distances = 1 - cosine_distances(
adata_merged[adata_merged.obs['sample'] == "pbmc3k"].obsm["Scanorama"],
adata_merged[adata_merged.obs['sample'] == "ctrl"].obsm["Scanorama"],
)
def label_transfer(dist, labels, index):
lab = pd.get_dummies(labels)
class_prob = lab.to_numpy().T @ dist
norm = np.linalg.norm(class_prob, 2, axis=0)
class_prob = class_prob / norm
class_prob = (class_prob.T - class_prob.min(1)) / class_prob.ptp(1)
# convert to df
cp_df = pd.DataFrame(
class_prob, columns=lab.columns
)
cp_df.index = index
# classify as max score
m = cp_df.idxmax(axis=1)
return m
class_def = label_transfer(distances, adata_ref.obs.louvain, adata.obs.index)
# add to obs section of the original object
adata.obs['predicted'] = class_def
sc.pl.umap(adata, color="predicted")
# add to merged object.
adata_merged.obs['predicted'] = class_def.append(adata_ref.obs['louvain']).tolist()
sc.pl.umap(adata_merged, color=["sample","louvain",'predicted'])
#plot only ctrl cells.
sc.pl.umap(adata_merged[adata_merged.obs['sample']=='ctrl'], color='predicted')
Another method for celltype prediction is Ingest, for more information, please look at https://scanpy-tutorials.readthedocs.io/en/latest/integrating-data-using-ingest.html
sc.tl.ingest(adata, adata_ref, obs='louvain')
sc.pl.umap(adata, color=['louvain','louvain_0.6'], wspace=0.5)
The predictions from ingest is stored in the column 'louvain' while we named the label transfer with scanorama as 'predicted'
sc.pl.umap(adata, color=['louvain','predicted'], wspace=0.5)
As you can see, the main celltypes are the same, but dendritic cells are mainly predicted to cluster 8 by ingest and the proportions of the different celltypes are different.
The only way to make sure which method you trust is to look at what genes the different celltypes express and use your biological knowledge to make decisions.
Another way of predicting celltypes is to use the differentially expressed genes per cluster and compare to lists of known cell marker genes. This requires a list of genes that you trust and that is relevant for the tissue you are working on.
You can either run it with a marker list from the ontology or a list of your choice as in the example below.
df = pd.read_table('./data/CellMarker_list/Human_cell_markers.txt')
df
# Filter for number of genes per celltype
print(df.shape)
df['nG'] = df.geneSymbol.str.split(",").str.len()
df = df[df['nG'] > 5]
df = df[df['nG'] < 100]
d = df[df['cancerType'] == "Normal"]
print(df.shape)
df.index = df.cellName
gene_dict = df.geneSymbol.str.split(",").to_dict()
# run differential expression per cluster
sc.tl.rank_genes_groups(adata, 'louvain_0.6', method='wilcoxon', key_added = "wilcoxon")
# do gene set overlap to the groups in the gene list and top 300 DEGs.
import gseapy
gsea_res = dict()
pred = dict()
for cl in adata.obs['louvain_0.6'].cat.categories.tolist():
print(cl)
glist = sc.get.rank_genes_groups_df(adata, group=cl, key='wilcoxon')['names'].squeeze().str.strip().tolist()
enr_res = gseapy.enrichr(gene_list=glist[:300],
organism='Human',
gene_sets=gene_dict,
background = adata.raw.shape[1],
cutoff = 1)
if enr_res.results.shape[0] == 0:
pred[cl] = "Unass"
else:
enr_res.results.sort_values(by="P-value",axis=0, ascending=True, inplace=True)
print(enr_res.results.head(2))
gsea_res[cl]=enr_res
pred[cl] = enr_res.results["Term"][0]
# prediction per cluster:
pred
prediction = [pred[x] for x in adata.obs['louvain_0.6']]
adata.obs["GS_overlap_pred"] = prediction
sc.pl.umap(adata, color='GS_overlap_pred')
As you can see, it agrees to some extent with the predictions from label transfer and ingest, but there are clear differences, which do you think looks better?