Celltype prediction

Celltype prediction can either be performed on indiviudal cells where each cell gets a predicted celltype label, or on the level of clusters. All methods are based on similarity to other datasets, single cell or sorted bulk RNAseq, or uses know marker genes for each celltype.

We will select one sample from the Covid data, ctrl_13 and predict celltype by cell on that sample.

Some methods will predict a celltype to each cell based on what it is most similar to even if the celltype of that cell is not included in the reference. Other methods include an uncertainty so that cells with low similarity scores will be unclassified. There are multiple different methods to predict celltypes, here we will just cover a few of those.

Here we will use a reference PBMC dataset from the scPred package which is provided as a Seurat object with counts. And we will test classification based on the scPred and scMap methods. Finally we will use gene set enrichment predict celltype based on the DEGs of each cluster.

Load and process data

Covid-19 data

First, lets load required libraries and the saved object from the clustering step. Subset for one patient.

# load the data and select 'ctrl_13` sample
alldata <- readRDS("data/results/covid_qc_dr_int_cl.rds")
ctrl.sce <- alldata[, alldata@colData$sample == "ctrl.13"]

# remove all old dimensionality reductions as they will mess up the analysis
# further down
reducedDims(ctrl.sce) <- NULL

Reference data

Then, load the reference dataset with annotated labels. Also, run all steps of the normal analysis pipeline with normalizaiton, variable gene selection, scaling and dimensionality reduction.

reference <- scPred::pbmc_1

## An object of class Seurat 
## 32838 features across 3500 samples within 1 assay 
## Active assay: RNA (32838 features, 0 variable features)

Convert to a SCE object.

ref.sce = Seurat::as.SingleCellExperiment(reference)

Rerun analysis pipeline

Run normalization, feature selection and dimensionality reduction

# Normalize
ref.sce <- computeSumFactors(ref.sce)
ref.sce <- logNormCounts(ref.sce)

# Variable genes
var.out <- modelGeneVar(ref.sce, method = "loess")
hvg.ref <- getTopHVGs(var.out, n = 1000)

# Dim reduction
ref.sce <- runPCA(ref.sce, exprs_values = "logcounts", scale = T, ncomponents = 30, 
    subset_row = hvg.ref)
ref.sce <- runUMAP(ref.sce, dimred = "PCA")
plotReducedDim(ref.sce, dimred = "UMAP", colour_by = "cell_type")