Celltype prediction


Celltype prediction can either be performed on indiviudal cells where each cell gets a predicted celltype label, or on the level of clusters. All methods are based on similarity to other datasets, single cell or sorted bulk RNAseq, or uses know marker genes for each celltype.

We will select one sample from the Covid data, ctrl_13 and predict celltype by cell on that sample.

Some methods will predict a celltype to each cell based on what it is most similar to even if the celltype of that cell is not included in the reference. Other methods include an uncertainty so that cells with low similarity scores will be unclassified. There are multiple different methods to predict celltypes, here we will just cover a few of those.

Here we will use a reference PBMC dataset that we get from scanpy datasets and classify celltypes based on two methods:

  • Using scanorama for integration just as in the integration lab, and then do label transfer based on closest neighbors.
  • Using ingest to project the data onto the reference data and transfer labels.
In [1]:
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt

sc.settings.verbosity = 2             # verbosity: errors (0), warnings (1), info (2), hints (3)
#sc.logging.print_versions()
In [2]:
sc.settings.set_figure_params(dpi=80)

Read in the clustered data object.

In [3]:
adata = sc.read_h5ad('./data/results/scanpy_clustered_covid.h5ad')
adata.uns['log1p']['base']=None
In [4]:
print(adata.shape)
print(adata.raw.shape)
(5646, 3090)
(5646, 18752)

Subset for ctrl_13 sample.

In [5]:
adata = adata[adata.obs["sample"] == "ctrl_13",:]
print(adata.shape)
(1132, 3090)
In [6]:
sc.pl.umap(
    adata, color=["louvain_0.6"], palette=sc.pl.palettes.default_20
)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/compat/_overloaded_dict.py:106: ImplicitModificationWarning: Trying to modify attribute `._uns` of view, initializing view as actual.
  self.data[key] = value
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Reference data

Load the reference data from scanpy.datasets. It is the annotated and processed pbmc3k dataset from 10x.

In [7]:
adata_ref = sc.datasets.pbmc3k_processed() 
In [8]:
adata_ref.obs['sample']='pbmc3k'

print(adata_ref.shape)
adata_ref.obs
(2638, 1838)
Out[8]:
n_genes percent_mito n_counts louvain sample
index
AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells pbmc3k
AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells pbmc3k
AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells pbmc3k
AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes pbmc3k
AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells pbmc3k
... ... ... ... ... ...
TTTCGAACTCTCAT-1 1155 0.021104 3459.0 CD14+ Monocytes pbmc3k
TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells pbmc3k
TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells pbmc3k
TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells pbmc3k
TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells pbmc3k

2638 rows × 5 columns

In [9]:
sc.pl.umap(adata_ref, color='louvain')
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Make sure we have the same genes in both datset by taking the intersection

In [10]:
print(adata_ref.shape[1])
print(adata.shape[1])
var_names = adata_ref.var_names.intersection(adata.var_names)
print(len(var_names))
1838
3090
476
In [11]:
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]

First we need to rerun pca and umap with the same gene set for both datasets.

In [12]:
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
sc.pl.umap(adata_ref, color='louvain')
computing PCA
    with n_comps=50
    finished (0:00:00)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished (0:00:03)
computing UMAP
    finished (0:00:04)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
In [13]:
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color='louvain_0.6')
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:00)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished (0:00:00)
computing UMAP
    finished (0:00:01)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Integrate with scanorama

In [14]:
import scanorama


#subset the individual dataset to the same variable genes as in MNN-correct.
alldata = dict()
alldata['ctrl']=adata
alldata['ref']=adata_ref

#convert to list of AnnData objects
adatas = list(alldata.values())

# run scanorama.integrate
scanorama.integrate_scanpy(adatas, dimred = 50)
Found 476 genes among all datasets
[[0.         0.96378092]
 [0.         0.        ]]
Processing datasets (0, 1)
In [15]:
# add in sample info
adata_ref.obs['sample']='pbmc3k'
In [16]:
# create a merged scanpy object and add in the scanorama 
adata_merged = alldata['ctrl'].concatenate(alldata['ref'], batch_key='sample', batch_categories=['ctrl','pbmc3k'])

embedding = np.concatenate([ad.obsm['X_scanorama'] for ad in adatas], axis=0)

adata_merged.obsm['Scanorama'] = embedding
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
In [17]:
#run  umap.
sc.pp.neighbors(adata_merged, n_pcs =50, use_rep = "Scanorama")
sc.tl.umap(adata_merged)
computing neighbors
    finished (0:00:00)
computing UMAP
    finished (0:00:06)
In [18]:
sc.pl.umap(adata_merged, color=["sample","louvain"])
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Label transfer

Using the function in the Spatial tutorial at the scanpy website we will calculate normalized cosine distances between the two datasets and tranfer labels to the celltype with the highest scores.

In [19]:
from sklearn.metrics.pairwise import cosine_distances

distances = 1 - cosine_distances(
    adata_merged[adata_merged.obs['sample'] == "pbmc3k"].obsm["Scanorama"],
    adata_merged[adata_merged.obs['sample'] == "ctrl"].obsm["Scanorama"],
)
In [20]:
def label_transfer(dist, labels, index):
    lab = pd.get_dummies(labels)
    class_prob = lab.to_numpy().T @ dist
    norm = np.linalg.norm(class_prob, 2, axis=0)
    class_prob = class_prob / norm
    class_prob = (class_prob.T - class_prob.min(1)) / class_prob.ptp(1)
    # convert to df
    cp_df = pd.DataFrame(
        class_prob, columns=lab.columns
    )
    cp_df.index = index
    # classify as max score
    m = cp_df.idxmax(axis=1)
    
    return m
In [21]:
class_def = label_transfer(distances, adata_ref.obs.louvain, adata.obs.index)
In [22]:
# add to obs section of the original object
adata.obs['predicted'] = class_def
In [23]:
sc.pl.umap(adata, color="predicted")
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
In [24]:
# add to merged object.
adata_merged.obs['predicted'] = class_def.append(adata_ref.obs['louvain']).tolist()

sc.pl.umap(adata_merged, color=["sample","louvain",'predicted'])
#plot only ctrl cells.
sc.pl.umap(adata_merged[adata_merged.obs['sample']=='ctrl'], color='predicted')
/var/folders/f_/vj_w4xx933z1rr95yf4rhphr0000gp/T/ipykernel_84734/3427748211.py:2: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  adata_merged.obs['predicted'] = class_def.append(adata_ref.obs['louvain']).tolist()
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Ingest

Another method for celltype prediction is Ingest, for more information, please look at https://scanpy-tutorials.readthedocs.io/en/latest/integrating-data-using-ingest.html

In [25]:
sc.tl.ingest(adata, adata_ref, obs='louvain')
running ingest
    finished (0:00:17)
In [26]:
sc.pl.umap(adata, color=['louvain','louvain_0.6'], wspace=0.5)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Compare results

The predictions from ingest is stored in the column 'louvain' while we named the label transfer with scanorama as 'predicted'

In [27]:
sc.pl.umap(adata, color=['louvain','predicted'], wspace=0.5)
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

As you can see, the main celltypes are the same, but dendritic cells are mainly predicted to cluster 8 by ingest and the proportions of the different celltypes are different.

The only way to make sure which method you trust is to look at what genes the different celltypes express and use your biological knowledge to make decisions.

Gene set analysis

Another way of predicting celltypes is to use the differentially expressed genes per cluster and compare to lists of known cell marker genes. This requires a list of genes that you trust and that is relevant for the tissue you are working on.

You can either run it with a marker list from the ontology or a list of your choice as in the example below.

In [28]:
df = pd.read_table('./data/CellMarker_list/Human_cell_markers.txt')
df
Out[28]:
speciesType tissueType UberonOntologyID cancerType cellType cellName CellOntologyID cellMarker geneSymbol geneID proteinName proteinID markerResource PMID Company
0 Human Kidney UBERON_0002113 Normal Normal cell Proximal tubular cell NaN Intestinal Alkaline Phosphatase ALPI 248 PPBI P09923 Experiment 9263997 NaN
1 Human Liver UBERON_0002107 Normal Normal cell Ito cell (hepatic stellate cell) CL_0000632 Synaptophysin SYP 6855 SYPH P08247 Experiment 10595912 NaN
2 Human Endometrium UBERON_0001295 Normal Normal cell Trophoblast cell CL_0000351 CEACAM1 CEACAM1 634 CEAM1 P13688 Experiment 10751340 NaN
3 Human Germ UBERON_0000923 Normal Normal cell Primordial germ cell CL_0000670 VASA DDX4 54514 DDX4 Q9NQI0 Experiment 10920202 NaN
4 Human Corneal epithelium UBERON_0001772 Normal Normal cell Epithelial cell CL_0000066 KLF6 KLF6 1316 KLF6 Q99612 Experiment 12407152 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2863 Human Embryo UBERON_0000922 Normal Normal cell 1-cell stage cell (Blastomere) CL_0000353 ACCSL, ACVR1B, ARHGEF16, ASF1B, BCL2L10, BLCAP... ACCSL, ACVR1B, ARHGEF16, ASF1B, BCL2L10, BLCAP... 390110, 91, 27237, 55723, 10017, 10904, 662, 7... 1A1L2, ACV1B, ARHGG, ASF1B, B2L10, BLCAP, SEC2... Q4AC99, P36896, Q5VV41, Q9NVP2, Q9HD36, P62952... Single-cell sequencing 23892778 NaN
2864 Human Embryo UBERON_0000922 Normal Normal cell 4-cell stage cell (Blastomere) CL_0000353 ADPGK, AIM1, AIMP2, ARG2, ARHGAP17, ARIH1, CDC... ADPGK, CRYBG1, AIMP2, ARG2, ARHGAP17, ARIH1, C... 83440, 202, 7965, 384, 55114, 25820, 55536, 24... ADPGK, CRBG1, AIMP2, ARGI2, RHG17, ARI1, CDA7L... Q9BRR6, Q9Y4K1, Q13155, P78540, Q68EM7, Q9Y4X5... Single-cell sequencing 23892778 NaN
2865 Human Embryo UBERON_0000922 Normal Normal cell 8-cell stage cell (Blastomere) CL_0000353 C11orf48, C19orf53, DHX9, DIABLO, EIF1AD, EIF4... LBHD1, C19orf53, DHX9, DIABLO, EIF1AD, EIF4G1,... 79081, 28974, 1660, 56616, 84285, 1981, 26017,... LBHD1, L10K, DHX9, DBLOH, EIF1A, IF4G1, FA32A,... Q9BQE6, Q9UNZ5, Q08211, Q9NR28, Q8N9N8, Q04637... Single-cell sequencing 23892778 NaN
2866 Human Embryo UBERON_0000922 Normal Normal cell Morula cell (Blastomere) CL_0000360 ADCK1, AGL, AIMP1, AKAP12, ARPC3, ATP1B3, ATP5... ADCK1, AGL, AIMP1, AKAP12, ARPC3, ATP1B3, NA, ... 57143, 178, 9255, 9590, 10094, 483, NA, 586, 9... ADCK1, GDE, AIMP1, AKA12, ARPC3, AT1B3, AT5F1,... Q86TW2, P35573, Q12904, Q02952, O15145, P54709... Single-cell sequencing 23892778 NaN
2867 Human Brain UBERON_0000955 oligodendroglioma Cancer cell Cancer stem cell NaN ASCL1, BOC, CCND2, CD24, CHD7, EGFR, NFIB, SOX... ASCL1, BOC, CCND2, CD24, CHD7, EGFR, NFIB, SOX... 429, 91653, 894, 100133941, 55636, 1956, 4781,... ASCL1, BOC, CCND2, CD24, CHD7, EGFR, NFIB, SOX... P50553, Q9BWV1, P30279, P25063, Q9P2D1, P00533... Single-cell sequencing 27806376 NaN

2868 rows × 15 columns

In [29]:
# Filter for number of genes per celltype
print(df.shape)

df['nG'] = df.geneSymbol.str.split(",").str.len()

df = df[df['nG'] > 5]
df = df[df['nG'] < 100]
d = df[df['cancerType'] == "Normal"]
print(df.shape)
(2868, 15)
(445, 16)
In [30]:
df.index = df.cellName
gene_dict = df.geneSymbol.str.split(",").to_dict()
In [31]:
# run differential expression per cluster
sc.tl.rank_genes_groups(adata, 'louvain_0.6', method='wilcoxon', key_added = "wilcoxon")
ranking genes
    finished (0:00:00)
In [32]:
# do gene set overlap to the groups in the gene list and top 300 DEGs.

import gseapy

gsea_res = dict()
pred = dict()

for cl in adata.obs['louvain_0.6'].cat.categories.tolist():
    print(cl)
    glist = sc.get.rank_genes_groups_df(adata, group=cl, key='wilcoxon')['names'].squeeze().str.strip().tolist()    
    enr_res = gseapy.enrichr(gene_list=glist[:300],
                     organism='Human',
                     gene_sets=gene_dict,
                     background = adata.raw.shape[1],    
                     cutoff = 1)
    if enr_res.results.shape[0] == 0:
        pred[cl] = "Unass"
    else:
        enr_res.results.sort_values(by="P-value",axis=0, ascending=True, inplace=True)
        print(enr_res.results.head(2))
        gsea_res[cl]=enr_res
        pred[cl] = enr_res.results["Term"][0]
    
0
   Gene_set                   Term Overlap   P-value  Adjusted P-value  \
0  gs_ind_0  Cancer stem-like cell     1/6  0.092243          0.234895   
6  gs_ind_0             Macrophage     1/6  0.092243          0.234895   

   Odds Ratio  Genes  
0   14.444459  ANPEP  
6   14.444459   AIF1  
1
   Gene_set                      Term Overlap   P-value  Adjusted P-value  \
5  gs_ind_0  Parietal progenitor cell     1/7  0.106771          0.264036   
0  gs_ind_0     CD8+ cytotoxic T cell    1/11  0.162599          0.264036   

   Odds Ratio  Genes  
5   12.517863  ANXA1  
0    8.162082  LAMP1  
2
   Gene_set                    Term Overlap   P-value  Adjusted P-value  \
2  gs_ind_0  Effector memory T cell     1/7  0.106771          0.213541   
5  gs_ind_0                Monocyte     1/7  0.106771          0.213541   

   Odds Ratio Genes  
2   12.517863  IL7R  
5   12.517863  CD52  
3
   Gene_set                    Term Overlap   P-value  Adjusted P-value  \
6  gs_ind_0  Effector memory T cell     1/7  0.106771          0.234106   
8  gs_ind_0            Naive T cell     1/7  0.106771          0.234106   

   Odds Ratio Genes  
6   12.517863  IL7R  
8   12.517863  CCR7  
4
   Gene_set              Term Overlap   P-value  Adjusted P-value  Odds Ratio  \
0  gs_ind_0  CD4-CD28+ T cell     1/8  0.121066          0.121066   11.044584   
1  gs_ind_0  CD4-CD28- T cell     1/8  0.121066          0.121066   11.044584   

  Genes  
0  BCL2  
1  BCL2  
5
   Gene_set                      Term Overlap   P-value  Adjusted P-value  \
2  gs_ind_0                Macrophage     1/6  0.092243          0.266926   
4  gs_ind_0  PROM1Low progenitor cell     1/7  0.106771          0.266926   

   Odds Ratio  Genes  
2   14.444459   AIF1  
4   12.517863  ALCAM  
6
   Gene_set              Term Overlap   P-value  Adjusted P-value  Odds Ratio  \
0  gs_ind_0            B cell     1/6  0.092243          0.151332   14.444459   
1  gs_ind_0  CD4-CD28+ T cell     1/8  0.121066          0.151332   11.044584   

  Genes  
0  CD19  
1  BCL2  
7
   Gene_set                             Term Overlap   P-value  \
2  gs_ind_0                       Macrophage     1/6  0.092243   
3  gs_ind_0  Monocyte derived dendritic cell     1/8  0.121066   

   Adjusted P-value  Odds Ratio  Genes  
2          0.242132   14.444459   AIF1  
3          0.242132   11.044584  ITGAX  
In [33]:
# prediction per cluster:

pred
Out[33]:
{'0': 'Cancer stem-like cell',
 '1': 'CD8+ cytotoxic T cell',
 '2': 'CD4+ T cell',
 '3': 'Activated T cell',
 '4': 'CD4-CD28+ T cell',
 '5': 'CD1C+_B dendritic cell',
 '6': 'B cell',
 '7': 'Circulating fetal cell'}
In [34]:
prediction = [pred[x] for x in adata.obs['louvain_0.6']]
adata.obs["GS_overlap_pred"] = prediction
In [35]:
sc.pl.umap(adata, color='GS_overlap_pred')
/Users/asabjor/miniconda3/envs/scRNAseq2023_python/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

As you can see, it agrees to some extent with the predictions from label transfer and ingest, but there are clear differences, which do you think looks better?