suppressPackageStartupMessages({
library(Seurat)
library(dplyr)
library(patchwork)
library(ggplot2)
library(pheatmap)
# remotes::install_github("powellgenomicslab/scPred")
library(scPred)
})
Code chunks run R commands unless otherwise specified.
Celltype prediction can either be performed on indiviudal cells where each cell gets a predicted celltype label, or on the level of clusters. All methods are based on similarity to other datasets, single cell or sorted bulk RNAseq, or uses known marker genes for each cell type.
We will select one sample from the Covid data, ctrl_13
and predict celltype by cell on that sample.
Some methods will predict a celltype to each cell based on what it is most similar to, even if that celltype is not included in the reference. Other methods include an uncertainty so that cells with low similarity scores will be unclassified.
There are multiple different methods to predict celltypes, here we will just cover a few of those.
We will use a reference PBMC dataset from the scPred
package which is provided as a Seurat object with counts. And we will test classification based on the scPred
and scMap
methods. Finally we will use gene set enrichment predict celltype based on the DEGs of each cluster.
1 Read data
First, lets load required libraries
Let’s read in the saved Covid-19 data object from the clustering step.
# download pre-computed data if missing or long compute
<- TRUE
fetch_data
# url for source and intermediate data
<- "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq"
path_data <- "data/covid/results/seurat_covid_qc_dr_int_cl.rds"
path_file if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)
if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results/seurat_covid_qc_dr_int_cl.rds"), destfile = path_file)
<- readRDS(path_file) alldata
Subset one patient.
<- alldata[, alldata$orig.ident == "ctrl_13"]
ctrl
# set active assay to RNA and remove the CCA assay
@active.assay <- "RNA"
ctrl"CCA"]] <- NULL
ctrl[[ ctrl
An object of class Seurat
18863 features across 1125 samples within 1 assay
Active assay: RNA (18863 features, 2000 variable features)
6 dimensional reductions calculated: umap, tsne, umap_raw, pca_harmony, harmony, umap_harmony
2 Reference data
Load the reference dataset with annotated labels.
<- scPred::pbmc_1
reference reference
An object of class Seurat
32838 features across 3500 samples within 1 assay
Active assay: RNA (32838 features, 0 variable features)
Rerun analysis pipeline. Run normalization, feature selection and dimensionality reduction
Here, we will run all the steps that we did in previous labs in one go using the magittr
package with the pipe-operator %>%
.
<- reference %>%
reference NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA(verbose = F) %>%
RunUMAP(dims = 1:30)
DimPlot(reference, group.by = "cell_type", label = TRUE, repel = TRUE) + NoAxes()
Run all steps of the analysis for the ctrl sample as well. Use the clustering from the integration lab with resolution 0.5.
# Set the identity as louvain with resolution 0.3
<- SetIdent(ctrl, value = "CCA_snn_res.0.5")
ctrl
<- ctrl %>%
ctrl NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA(verbose = F) %>%
RunUMAP(dims = 1:30)
DimPlot(ctrl, label = TRUE, repel = TRUE) + NoAxes()
3 Label transfer
First we will run label transfer using a similar method as in the integration exercise. But, instead of CCA, which is the default for the FindTransferAnchors()
function, we will use pcaproject
, ie; the query dataset is projected onto the PCA of the reference dataset. Then, the labels of the reference data are predicted.
<- FindTransferAnchors(
transfer.anchors reference = reference, query = ctrl,
dims = 1:30
)<- TransferData(
predictions anchorset = transfer.anchors, refdata = reference$cell_type,
dims = 1:30
)<- AddMetaData(object = ctrl, metadata = predictions) ctrl
DimPlot(ctrl, group.by = "predicted.id", label = T, repel = T) + NoAxes()
Now plot how many cells of each celltypes can be found in each cluster.
ggplot(ctrl@meta.data, aes(x = CCA_snn_res.0.5, fill = predicted.id)) +
geom_bar() +
theme_classic()
4 scPred
scPred will train a classifier based on all principal components. First, getFeatureSpace()
will create a scPred object stored in the @misc
slot where it extracts the PCs that best separates the different celltypes. Then trainModel()
will do the actual training for each celltype.
<- getFeatureSpace(reference, "cell_type") reference
● Extracting feature space for each cell type...
DONE!
<- trainModel(reference) reference
● Training models for each cell type...
maximum number of iterations reached 0.0001152056 -0.0001143117DONE!
We can then print how well the training worked for the different celltypes by printing the number of PCs used, the ROC value and Sensitivity/Specificity. Which celltypes do you think are harder to classify based on this dataset?
get_scpred(reference)
'scPred' object
✔ Prediction variable = cell_type
✔ Discriminant features per cell type
✔ Training model(s)
Summary
|Cell type | n| Features|Method | ROC| Sens| Spec|
|:-----------|----:|--------:|:---------|-----:|-----:|-----:|
|B cell | 280| 50|svmRadial | 1.000| 0.964| 1.000|
|CD4 T cell | 1620| 50|svmRadial | 0.997| 0.972| 0.975|
|CD8 T cell | 945| 50|svmRadial | 0.985| 0.899| 0.978|
|cDC | 26| 50|svmRadial | 0.995| 0.547| 1.000|
|cMono | 212| 50|svmRadial | 0.994| 0.958| 0.970|
|ncMono | 79| 50|svmRadial | 0.998| 0.570| 1.000|
|NK cell | 312| 50|svmRadial | 0.999| 0.933| 0.996|
|pDC | 20| 50|svmRadial | 1.000| 0.700| 1.000|
|Plasma cell | 6| 50|svmRadial | 1.000| 0.800| 1.000|
You can optimize parameters for each dataset by chaining parameters and testing different types of models, see more at: https://powellgenomicslab.github.io/scPred/articles/introduction.html. But for now, we will continue with this model. Now, let’s predict celltypes on our data, where scPred will align the two datasets with Harmony and then perform classification.
<- scPredict(ctrl, reference) ctrl
● Matching reference with new dataset...
─ 2000 features present in reference loadings
─ 1786 features shared between reference and new dataset
─ 89.3% of features in the reference are present in new dataset
● Aligning new data to reference...
● Classifying cells...
DONE!
DimPlot(ctrl, group.by = "scpred_prediction", label = T, repel = T) + NoAxes()
Now plot how many cells of each celltypes can be found in each cluster.
ggplot(ctrl@meta.data, aes(x = CCA_snn_res.0.5, fill = scpred_prediction)) +
geom_bar() +
theme_classic()
5 Azimuth
There are multiple online resources with large curated datasets with methods to integrate and do label transfer. One such resource is Azimuth another one is Disco.
Azimuth is also possible to install and run locally, but in this case we have used the online app and ran the predictions for you. So you just have to download the prediction file.
This is how to save the count matrix:
# No need to run now.
= ctrl@assays$RNA@counts
C saveRDS(C, file = "data/covid/results/ctrl13_count_matrix.rds")
Instead load the results and visualize:
<- "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq"
path_data <- "data/covid/results/azimuth_pred.tsv"
path_file if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)
if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results/azimuth_pred.tsv"), destfile = path_file)
<- read.table(path_file, sep = "\t", header = T)
azimuth_pred
# add predictions to the seurat object
$azimuth = azimuth_pred$predicted.celltype.l2[match(colnames(ctrl), azimuth_pred$cell)]
ctrlDimPlot(ctrl, group.by = "azimuth", label = T, repel = T) + NoAxes()
6 Compare results
Now we will compare the output of the two methods using the convenient function in scPred crossTab()
that prints the overlap between two metadata slots.
crossTab(ctrl, "predicted.id", "scpred_prediction")
We can also plot all the different predictions side by side
wrap_plots(
DimPlot(ctrl, label = T, group.by = "predicted.id") + NoAxes(),
DimPlot(ctrl, label = T, group.by = "scpred_prediction") + NoAxes(),
DimPlot(ctrl, label = T, group.by = "azimuth") + NoAxes(),
ncol = 3
)
7 GSEA with celltype markers
Another option, where celltype can be classified on cluster level is to use gene set enrichment among the DEGs with known markers for different celltypes. Similar to how we did functional enrichment for the DEGs in the differential expression exercise. There are some resources for celltype gene sets that can be used. Such as CellMarker, PanglaoDB or celltype gene sets at MSigDB. We can also look at overlap between DEGs in a reference dataset and the dataset you are analyzing.
7.1 DEG overlap
First, lets extract top DEGs for our Covid-19 dataset and the reference dataset. When we run differential expression for our dataset, we want to report as many genes as possible, hence we set the cutoffs quite lenient.
# run differential expression in our dataset, using clustering at resolution 0.5
<- SetIdent(alldata, value = "CCA_snn_res.0.5")
alldata <- FindAllMarkers(
DGE_table
alldata,logfc.threshold = 0,
test.use = "wilcox",
min.pct = 0.1,
min.diff.pct = 0,
only.pos = TRUE,
max.cells.per.ident = 20,
return.thresh = 1,
assay = "RNA"
)
# split into a list
<- split(DGE_table, DGE_table$cluster)
DGE_list
unlist(lapply(DGE_list, nrow))
0 1 2 3 4 5 6 7 8 9
3335 4183 3290 2491 2055 2481 2078 3494 2332 3551
# Compute differential gene expression in reference dataset (that has cell annotation)
<- SetIdent(reference, value = "cell_type")
reference <- FindAllMarkers(
reference_markers
reference,min.pct = .1,
min.diff.pct = .2,
only.pos = T,
max.cells.per.ident = 20,
return.thresh = 1
)
# Identify the top cell marker genes in reference dataset
# select top 50 with hihgest foldchange among top 100 signifcant genes.
<- reference_markers[order(reference_markers$avg_log2FC, decreasing = T), ]
reference_markers %>%
reference_markers group_by(cluster) %>%
top_n(-100, p_val) %>%
top_n(50, avg_log2FC) -> top50_cell_selection
# Transform the markers into a list
<- split(top50_cell_selection$gene, top50_cell_selection$cluster)
ref_list
unlist(lapply(ref_list, length))
CD8 T cell CD4 T cell cMono B cell NK cell pDC
30 15 50 50 50 50
ncMono cDC Plasma cell
50 50 50
Now we can run GSEA for the DEGs from our dataset and check for enrichment of top DEGs in the reference dataset.
suppressPackageStartupMessages(library(fgsea))
# run fgsea for each of the clusters in the list
<- lapply(DGE_list, function(x) {
res <- setNames(x$avg_log2FC, x$gene)
gene_rank <- fgsea(pathways = ref_list, stats = gene_rank, nperm = 10000)
fgseaRes return(fgseaRes)
})names(res) <- names(DGE_list)
# You can filter and resort the table based on ES, NES or pvalue
<- lapply(res, function(x) {
res $pval < 0.1, ]
x[x
})<- lapply(res, function(x) {
res $size > 2, ]
x[x
})<- lapply(res, function(x) {
res order(x$NES, decreasing = T), ]
x[
}) res
$`0`
pathway pval padj ES NES nMoreExtreme size
1: cMono 0.00009999 0.000299970 0.9597129 2.070031 0 48
2: cDC 0.00009999 0.000299970 0.8432665 1.804334 0 41
3: ncMono 0.00009999 0.000299970 0.8381623 1.796599 0 43
4: pDC 0.00180433 0.004059743 0.7448120 1.529058 17 21
5: NK cell 0.03075964 0.055367349 0.7514873 1.434362 295 10
6: B cell 0.07030144 0.105452155 0.6639697 1.324723 694 15
leadingEdge
1: S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
2: LYZ,GRN,TYMP,CST3,AIF1,SPI1,...
3: CTSS,TYMP,CST3,S100A11,AIF1,SERPINA1,...
4: GRN,MS4A6A,CST3,MPEG1,CTSB,TGFBI,...
5: TYROBP,FCER1G,SRGN,CCL3,MYO1F,ITGB2,...
6: NCF1,LY86,MARCH1,POU2F2,HLA-DMB,HLA-DRB5,...
$`1`
pathway pval padj ES NES nMoreExtreme size
1: NK cell 0.0001000100 0.0004014049 0.9456406 2.359393 0 46
2: CD8 T cell 0.0001003512 0.0004014049 0.9275244 2.204282 0 25
3: ncMono 0.0006864203 0.0018304542 0.9013803 1.743174 5 6
4: pDC 0.0057173107 0.0091476972 0.7814822 1.655478 53 10
5: Plasma cell 0.0044074927 0.0088149855 0.6417020 1.547448 43 29
leadingEdge
1: GNLY,GZMB,FGFBP2,PRF1,NKG7,SPON2,...
2: GNLY,GZMB,FGFBP2,PRF1,NKG7,CTSW,...
3: FCGR3A,IFITM2,RHOC
4: GZMB,C12orf75,HSP90B1,ALOX5AP,RRBP1,PLAC8,...
5: FKBP11,PRDM1,CD38,SDF2L1,PPIB,SLAMF7,...
$`2`
pathway pval padj ES NES nMoreExtreme size
1: CD8 T cell 0.0001000801 0.0003502802 0.9337235 2.143348 0 29
2: NK cell 0.0001000400 0.0003502802 0.8231342 1.901350 0 32
3: CD4 T cell 0.0012130569 0.0028304661 0.8732245 1.681665 10 7
leadingEdge
1: CD3D,CD8A,CD3G,CD8B,CCL5,GZMH,...
2: CCL5,GZMA,CCL4,GZMM,NKG7,CST7,...
3: CD3D,CD3G,CD3E,IL7R,PIK3IP1
$`3`
pathway pval padj ES NES nMoreExtreme size
1: B cell 0.0000999900 0.0004057618 0.9067163 1.985511 0 46
2: cDC 0.0001014405 0.0004057618 0.8904040 1.795956 0 14
3: pDC 0.0005024621 0.0013398988 0.8011896 1.654422 4 18
4: Plasma cell 0.0746062154 0.1193699447 0.7256265 1.364306 700 8
leadingEdge
1: CD79A,LINC00926,TCL1A,MS4A1,CD79B,TNFRSF13C,...
2: CD74,HLA-DQB1,HLA-DRA,HLA-DPB1,HLA-DRB1,HLA-DQA1,...
3: CD74,BCL11A,TCF4,IRF8,HERPUD1,TSPAN13,...
4: PLPP5,ISG20,HERPUD1,MZB1,ITM2C,DERL3,...
$`4`
pathway pval padj ES NES nMoreExtreme size
1: CD4 T cell 0.0001015950 0.0006095703 0.9122872 1.784567 0 14
2: CD8 T cell 0.0007486631 0.0022459893 0.8884597 1.623367 6 8
leadingEdge
1: IL7R,LTB,LDHB,MAL,RCAN3,CD3D,...
2: CD3D,IL32,CD3G,CD2,CD3E,CD8B
$`5`
pathway pval padj ES NES nMoreExtreme size
1: B cell 0.0000999900 0.0004077888 0.8990189 1.850227 0 44
2: cDC 0.0001019472 0.0004077888 0.9170109 1.746327 0 13
3: pDC 0.0004031852 0.0010751604 0.8299493 1.613559 3 17
4: Plasma cell 0.0177556818 0.0236742424 0.7551289 1.447306 174 14
leadingEdge
1: CD79A,MS4A1,BANK1,HLA-DQA1,CD74,TNFRSF13C,...
2: HLA-DQA1,CD74,HLA-DRA,HLA-DPB1,HLA-DQB1,HLA-DPA1,...
3: CD74,JCHAIN,SPIB,TCF4,HERPUD1,CCDC50,...
4: JCHAIN,HERPUD1,ISG20,PEBP1,MZB1,ITM2C
$`6`
Empty data.table (0 rows and 8 cols): pathway,pval,padj,ES,NES,nMoreExtreme...
$`7`
pathway pval padj ES NES nMoreExtreme size
1: ncMono 0.000099990 0.00026664 0.9650745 2.032425 0 49
2: cMono 0.000099990 0.00026664 0.8833483 1.832508 0 36
3: cDC 0.000099990 0.00026664 0.8289634 1.723590 0 38
4: NK cell 0.005980134 0.01196027 0.7695612 1.495116 58 14
5: pDC 0.018700091 0.02992015 0.7279419 1.423815 184 15
6: B cell 0.070912013 0.09454935 0.6627464 1.309038 705 17
leadingEdge
1: CDKN1C,LST1,FCGR3A,MS4A7,AIF1,COTL1,...
2: LST1,AIF1,COTL1,SERPINA1,FCER1G,CST3,...
3: LST1,AIF1,COTL1,FCER1G,CST3,SPI1,...
4: FCGR3A,FCER1G,RHOC,TYROBP,IFITM2,CCL3,...
5: CST3,NPC2,CTSB,PLD4,MPEG1,TGFBI,...
6: HLA-DPA1,POU2F2,HLA-DRB5,HLA-DRB1,HLA-DRA,HLA-DPB1,...
$`8`
pathway pval padj ES NES nMoreExtreme size
1: CD4 T cell 0.000101657 0.0006099421 0.9401157 2.010890 0 14
2: CD8 T cell 0.018327340 0.0549820206 0.8328932 1.499054 157 5
leadingEdge
1: IL7R,TCF7,PIK3IP1,LTB,TSHZ2,LEF1,...
2: CD3G,CD3D,CD3E,CD2
$`9`
pathway pval padj ES NES nMoreExtreme size
1: pDC 0.0002021019 0.0009094584 0.8218327 1.802959 1 16
2: Plasma cell 0.0001000100 0.0009000900 0.7382108 1.757711 0 43
3: B cell 0.0279883382 0.0736404834 0.8116259 1.476425 239 5
4: CD8 T cell 0.0327291037 0.0736404834 0.6302848 1.399877 324 18
leadingEdge
1: JCHAIN,MZB1,HSP90B1,TXN,ITM2C,C12orf75,...
2: JCHAIN,MZB1,HSP90B1,SUB1,MANF,SEC11C,...
3: JCHAIN,PDLIM1
4: GZMA,C12orf75,MT2A,GZMB,IL32,GNLY,...
Selecting top significant overlap per cluster, we can now rename the clusters according to the predicted labels. OBS! Be aware that if you have some clusters that have non-significant p-values for all the gene sets, the cluster label will not be very reliable. Also, the gene sets you are using may not cover all the celltypes you have in your dataset and hence predictions may just be the most similar celltype. Also, some of the clusters have very similar p-values to multiple celltypes, for instance the ncMono and cMono celltypes are equally good for some clusters.
<- unlist(lapply(res, function(x) {
new.cluster.ids as.data.frame(x)[1, 1]
}))
$ref_gsea <- new.cluster.ids[as.character(alldata@active.ident)]
alldata
wrap_plots(
DimPlot(alldata, label = T, group.by = "CCA_snn_res.0.5") + NoAxes(),
DimPlot(alldata, label = T, group.by = "ref_gsea") + NoAxes(),
ncol = 2
)
Compare the results with the other celltype prediction methods in the ctrl_13 sample.
$ref_gsea <- alldata$ref_gsea[alldata$orig.ident == "ctrl_13"]
ctrl
wrap_plots(
DimPlot(ctrl, label = T, group.by = "ref_gsea") + NoAxes() + ggtitle("GSEA"),
DimPlot(ctrl, label = T, group.by = "predicted.id") + NoAxes() + ggtitle("LabelTransfer"),
DimPlot(ctrl, label = T, group.by = "scpred_prediction") + NoAxes() + ggtitle("scPred"),
ncol = 3
)
7.2 With annotated gene sets
We have downloaded the celltype gene lists from http://bio-bigdata.hrbmu.edu.cn/CellMarker/CellMarker_download.html and converted the excel file to a csv for you. Read in the gene lists and do some filtering.
<- file.path("data/cell_marker_human.csv")
path_file if (!file.exists(path_file)) download.file(file.path(path_data, "cell_marker_human.csv"), destfile = path_file)
# Load the human marker table
<- read.delim("data/cell_marker_human.csv", sep = ";")
markers <- markers[markers$species == "Human", ]
markers <- markers[markers$cancer_type == "Normal", ]
markers
# Filter by tissue (to reduce computational time and have tissue-specific classification)
sort(unique(markers$tissue_type))
[1] "Abdomen" "Abdominal adipose tissue"
[3] "Abdominal fat pad" "Acinus"
[5] "Adipose tissue" "Adrenal gland"
[7] "Adventitia" "Airway"
[9] "Airway epithelium" "Allocortex"
[11] "Alveolus" "Amniotic fluid"
[13] "Amniotic membrane" "Ampullary"
[15] "Anogenital tract" "Antecubital vein"
[17] "Anterior cruciate ligament" "Anterior presomitic mesoderm"
[19] "Aorta" "Aortic valve"
[21] "Artery" "Arthrosis"
[23] "Articular Cartilage" "Ascites"
[25] "Atrium" "Auditory cortex"
[27] "Basilar membrane" "Beige Fat"
[29] "Bile duct" "Biliary tract"
[31] "Bladder" "Blood"
[33] "Blood vessel" "Bone"
[35] "Bone marrow" "Brain"
[37] "Breast" "Bronchial vessel"
[39] "Bronchiole" "Bronchoalveolar lavage"
[41] "Bronchoalveolar system" "Bronchus"
[43] "Brown adipose tissue" "Calvaria"
[45] "Capillary" "Cardiac atrium"
[47] "Cardiovascular system" "Carotid artery"
[49] "Carotid plaque" "Cartilage"
[51] "Caudal cortex" "Caudal forebrain"
[53] "Caudal ganglionic eminence" "Cavernosum"
[55] "Central amygdala" "Central nervous system"
[57] "Central Nervous System" "Cerebellum"
[59] "Cerebral organoid" "Cerebrospinal fluid"
[61] "Choriocapillaris" "Chorionic villi"
[63] "Chorionic villus" "Choroid"
[65] "Choroid plexus" "Colon"
[67] "Colon epithelium" "Colorectum"
[69] "Cornea" "Corneal endothelium"
[71] "Corneal epithelium" "Coronary artery"
[73] "Corpus callosum" "Corpus luteum"
[75] "Cortex" "Cortical layer"
[77] "Cortical thymus" "Decidua"
[79] "Deciduous tooth" "Dental pulp"
[81] "Dermis" "Diencephalon"
[83] "Distal airway" "Dorsal forebrain"
[85] "Dorsal root ganglion" "Dorsolateral prefrontal cortex"
[87] "Ductal tissue" "Duodenum"
[89] "Ectocervix" "Ectoderm"
[91] "Embryo" "Embryoid body"
[93] "Embryonic brain" "Embryonic heart"
[95] "Embryonic Kidney" "Embryonic prefrontal cortex"
[97] "Embryonic stem cell" "Endocardium"
[99] "Endocrine" "Endoderm"
[101] "Endometrium" "Endometrium stroma"
[103] "Entorhinal cortex" "Epidermis"
[105] "Epithelium" "Esophagus"
[107] "Eye" "Fat pad"
[109] "Fetal brain" "Fetal gonad"
[111] "Fetal heart" "Fetal ileums"
[113] "Fetal kidney" "Fetal Leydig"
[115] "Fetal liver" "Fetal lung"
[117] "Fetal pancreas" "Fetal thymus"
[119] "Fetal umbilical cord" "Fetus"
[121] "Foreskin" "Frontal cortex"
[123] "Fundic gland" "Gall bladder"
[125] "Gastric corpus" "Gastric epithelium"
[127] "Gastric gland" "Gastrointestinal tract"
[129] "Germ" "Gingiva"
[131] "Gonad" "Gut"
[133] "Hair follicle" "Heart"
[135] "Heart muscle" "Hippocampus"
[137] "Ileum" "Inferior colliculus"
[139] "Interfollicular epidermis" "Intervertebral disc"
[141] "Intestinal crypt" "Intestine"
[143] "Intrahepatic cholangio" "Jejunum"
[145] "Kidney" "Lacrimal gland"
[147] "Large intestine" "Laryngeal squamous epithelium"
[149] "Lateral ganglionic eminence" "Ligament"
[151] "Limb bud" "Limbal epithelium"
[153] "Liver" "Lumbar vertebra"
[155] "Lung" "Lymph"
[157] "Lymph node" "Lymphatic vessel"
[159] "Lymphoid tissue" "Malignant pleural effusion"
[161] "Mammary epithelium" "Mammary gland"
[163] "Medial ganglionic eminence" "Medullary thymus"
[165] "Meniscus" "Mesoblast"
[167] "Mesoderm" "Microvascular endothelium"
[169] "Microvessel" "Midbrain"
[171] "Middle temporal gyrus" "Milk"
[173] "Molar" "Muscle"
[175] "Myenteric plexus" "Myocardium"
[177] "Myometrium" "Nasal concha"
[179] "Nasal epithelium" "Nasal mucosa"
[181] "Nasal polyp" "Neocortex"
[183] "Nerve" "Nose"
[185] "Nucleus pulposus" "Olfactory neuroepithelium"
[187] "Optic nerve" "Oral cavity"
[189] "Oral mucosa" "Osteoarthritic cartilage"
[191] "Ovarian cortex" "Ovarian follicle"
[193] "Ovary" "Oviduct"
[195] "Pancreas" "Pancreatic acinar tissue"
[197] "Pancreatic duct" "Pancreatic islet"
[199] "Periodontal ligament" "Periodontium"
[201] "Periosteum" "Peripheral blood"
[203] "Peritoneal fluid" "Peritoneum"
[205] "Pituitary" "Placenta"
[207] "Plasma" "Pluripotent stem cell"
[209] "Polyp" "Posterior presomitic mesoderm"
[211] "Prefrontal cortex" "Premolar"
[213] "Presomitic mesoderm" "Primitive streak"
[215] "Prostate" "Pulmonary arteriy"
[217] "Pyloric gland" "Rectum"
[219] "Renal glomerulus" "Respiratory tract"
[221] "Retina" "Retinal organoid"
[223] "Retinal pigment epithelium" "Right ventricle"
[225] "Saliva" "Salivary gland"
[227] "Scalp" "Sclerocorneal tissue"
[229] "Seminal plasma" "Septum transversum"
[231] "Serum" "Sinonasal mucosa"
[233] "Sinus tissue" "Skeletal muscle"
[235] "Skin" "Small intestinal crypt"
[237] "Small intestine" "Soft tissue"
[239] "Sperm" "Spinal cord"
[241] "Spleen" "Splenic red pulp"
[243] "Sputum" "Stomach"
[245] "Subcutaneous adipose tissue" "Submandibular gland"
[247] "Subpallium" "Subplate"
[249] "Subventricular zone" "Superior frontal gyrus"
[251] "Sympathetic ganglion" "Synovial fluid"
[253] "Synovium" "Taste bud"
[255] "Tendon" "Testis"
[257] "Thalamus" "Thymus"
[259] "Thyroid" "Tonsil"
[261] "Tooth" "Trachea"
[263] "Tracheal airway epithelium" "Transformed artery"
[265] "Trophoblast" "Umbilical cord"
[267] "Umbilical cord blood" "Umbilical vein"
[269] "Undefined" "Urine"
[271] "Urothelium" "Uterine cervix"
[273] "Uterus" "Vagina"
[275] "Vein" "Venous blood"
[277] "Ventral thalamus" "Ventricle"
[279] "Ventricular and atrial" "Ventricular zone"
[281] "Visceral adipose tissue" "Vocal fold"
[283] "Whartons jelly" "White adipose tissue"
[285] "White matter" "Yolk sac"
grep("blood", unique(markers$tissue_type), value = T)
[1] "Peripheral blood" "Umbilical cord blood" "Venous blood"
<- markers[markers$tissue_type %in% c(
markers "Blood", "Venous blood",
"Serum", "Plasma",
"Spleen", "Bone marrow", "Lymph node"
), ]
# remove strange characters etc.
<- lapply(unique(markers$cell_name), function(x) {
celltype_list <- paste(markers$Symbol[markers$cell_name == x], sep = ",")
x <- gsub("[[]|[]]| |-", ",", x)
x <- unlist(strsplit(x, split = ","))
x <- unique(x[!x %in% c("", "NA", "family")])
x <- casefold(x, upper = T)
x
})names(celltype_list) <- unique(markers$cell_name)
<- celltype_list[unlist(lapply(celltype_list, length)) < 100]
celltype_list <- celltype_list[unlist(lapply(celltype_list, length)) > 5] celltype_list
# run fgsea for each of the clusters in the list
<- lapply(DGE_list, function(x) {
res <- setNames(x$avg_log2FC, x$gene)
gene_rank <- fgsea(pathways = celltype_list, stats = gene_rank, nperm = 10000, scoreType = "pos")
fgseaRes return(fgseaRes)
})names(res) <- names(DGE_list)
# You can filter and resort the table based on ES, NES or pvalue
<- lapply(res, function(x) {
res $pval < 0.01, ]
x[x
})<- lapply(res, function(x) {
res $size > 5, ]
x[x
})<- lapply(res, function(x) {
res order(x$NES, decreasing = T), ]
x[
})
# show top 3 for each cluster.
lapply(res, head, 3)
$`0`
pathway pval padj ES NES nMoreExtreme size
1: Neutrophil 0.00009999 0.003033030 0.8599992 1.770018 0 22
2: Monocyte 0.00009999 0.003033030 0.8184005 1.744466 0 40
3: Eosinophil 0.00029997 0.005459454 0.8674470 1.713232 2 13
leadingEdge
1: S100A8,S100A9,CD14,CSF3R,S100A6,PLAUR,...
2: S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
3: S100A8,S100A9,LYZ,RETN,CSF3R,ICAM1,...
$`1`
pathway pval
1: Natural killer cell 9.999e-05
2: Finally highly effector (TEMRA) memory T cell 9.999e-05
3: CD4+ recently activated effector memory or effector T cell (CTL) 9.999e-05
padj ES NES nMoreExtreme size
1: 0.00076356 0.9016016 2.221791 0 37
2: 0.00076356 0.9911398 2.097822 0 7
3: 0.00076356 0.9561467 2.095298 0 10
leadingEdge
1: GNLY,GZMB,FGFBP2,PRF1,NKG7,SPON2,...
2: GNLY,NKG7,CST7,GZMA,GZMH,CCL5,...
3: GNLY,PRF1,NKG7,CTSW,GZMH,S1PR5,...
$`2`
pathway pval padj ES NES nMoreExtreme size
1: CD8+ T cell 9.999e-05 0.000869913 0.9490543 2.036329 0 12
2: CD4+ T cell 9.999e-05 0.000869913 0.9463807 2.015727 0 11
3: T cell 9.999e-05 0.000869913 0.8646339 2.011671 0 33
leadingEdge
1: CD3D,CD8A,CD8B,TRGC2,TRAC,CD3E,...
2: CD3D,CD8A,TRAC,CD3E,IL32,IL7R,...
3: GZMK,CD3D,CD8A,CD3G,CD8B,GZMH,...
$`3`
pathway pval padj ES NES nMoreExtreme size
1: Naive B cell 9.999e-05 0.00149985 0.9498666 1.923660 0 13
2: B cell 9.999e-05 0.00149985 0.8995931 1.919820 0 29
3: Follicular B cell 9.999e-05 0.00149985 0.9532447 1.889467 0 10
leadingEdge
1: IGHM,IGHD,TCL1A,MS4A1,FCER2,YBX3,...
2: IGHM,IGHD,CD79A,IGKC,TCL1A,MS4A1,...
3: IGHD,CD79A,TCL1A,MS4A1,CD79B,CD74,...
$`4`
pathway pval padj ES NES nMoreExtreme size
1: Naive T(Th0) cell 0.00009999 0.0039996 0.9026044 1.740965 0 10
2: CD8+ T cell 0.00019998 0.0039996 0.8889920 1.724931 1 11
3: B cell 0.00019998 0.0039996 0.8762422 1.711372 1 12
leadingEdge
1: IL7R,CD3D,IL32,TCF7,CD3E,NPM1,...
2: IL7R,TRAC,CD3D,IL32,CD28,CD3E,...
3: IL7R,CD5,CD28,JUNB,CD27,BCL2,...
$`5`
pathway pval padj ES NES nMoreExtreme
1: B cell 9.999e-05 0.001519848 0.8786266 1.801902 0
2: Plasma cell 9.999e-05 0.001519848 0.9288892 1.796345 0
3: Marginal zone B cell 9.999e-05 0.001519848 0.9507071 1.763716 0
size leadingEdge
1: 37 IGKC,CD79A,MS4A1,BANK1,IGHM,CD74,...
2: 13 IGKC,CD79A,IGLC2,IGLC3,IGHM,IGHG1,...
3: 6 CD79A,MS4A1,CD79B,TNFRSF13B,CD19,CD27
$`6`
pathway pval padj ES NES
1: Megakaryocyte 0.00009999 0.00669933 0.9804198 1.828722
2: Red blood cell (erythrocyte) 0.00289971 0.06699330 0.9284733 1.682025
3: Platelet 0.00299970 0.06699330 0.7982966 1.520026
nMoreExtreme size leadingEdge
1: 0 12 PPBP,PF4,NRGN,MYL9,GNG11,GP9,...
2: 28 6 HBB,HBA2,HBA1,SPI1
3: 29 18 PPBP,PF4,GP9,ITGA2B,CD151,ICAM2,...
$`7`
pathway pval padj ES NES nMoreExtreme size
1: CD16+ monocyte 0.00029997 0.00899910 0.9439605 1.749760 2 6
2: Monocyte 0.00009999 0.00449955 0.8096557 1.649293 0 26
3: Macrophage 0.00009999 0.00449955 0.8139406 1.645671 0 23
leadingEdge
1: FCGR3A,TCF7L2,HES4,LYN,MTSS1
2: LST1,FCGR3A,MS4A7,CST3,PECAM1,CD68,...
3: FCGR3A,MS4A7,FCER1G,CD68,FTL,C1QA,...
$`8`
pathway pval padj ES NES
1: Central memory CD8+ T cell 0.00009999 0.001299870 0.9148557 1.950450
2: Naive CD8+ T cell 0.00009999 0.001299870 0.8708956 1.904995
3: Naive CD8 T cell 0.00019998 0.001949805 0.9037607 1.893987
nMoreExtreme size leadingEdge
1: 0 12 CCR7,IL7R,TCF7,TSHZ2,LEF1,RCAN3,...
2: 0 16 CCR7,TCF7,TSHZ2,LEF1,RCAN3,MAL,...
3: 1 10 CCR7,TCF7,TSHZ2,LEF1,RCAN3,MAL,...
$`9`
pathway pval padj ES NES nMoreExtreme size
1: Plasmablast 9.999e-05 0.00269973 0.9255634 2.028077 0 15
2: Plasma cell 9.999e-05 0.00269973 0.9217604 1.992427 0 13
3: B cell 9.999e-05 0.00269973 0.8739756 1.956983 0 19
leadingEdge
1: IGHA1,IGLC2,IGKC,IGHG1,IGLC3,IGHA2,...
2: IGHA1,IGLC2,IGKC,JCHAIN,IGHG1,IGLC3,...
3: IGHA1,IGKC,JCHAIN,IGHG1,IGHA2,IGHG2,...
Let’s plot the results.
<- unlist(lapply(res, function(x) {
new.cluster.ids as.data.frame(x)[1, 1]
}))$cellmarker_gsea <- new.cluster.ids[as.character(alldata@active.ident)]
alldata
wrap_plots(
DimPlot(alldata, label = T, group.by = "ref_gsea") + NoAxes(),
DimPlot(alldata, label = T, group.by = "cellmarker_gsea") + NoAxes(),
ncol = 2
)
Do you think that the methods overlap well? Where do you see the most inconsistencies?
In this case we do not have any ground truth, and we cannot say which method performs best. You should keep in mind, that any celltype classification method is just a prediction, and you still need to use your common sense and knowledge of the biological system to judge if the results make sense.
Finally, lets save the data with predictions.
saveRDS(ctrl, "data/covid/results/seurat_covid_qc_dr_int_cl_ct-ctrl13.rds")
8 Session info
Click here
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] fgsea_1.28.0 caret_6.0-94 lattice_0.21-8 scPred_1.9.2
[5] pheatmap_1.0.12 ggplot2_3.4.2 patchwork_1.1.2 dplyr_1.1.2
[9] SeuratObject_4.1.3 Seurat_4.3.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.14 jsonlite_1.8.5
[4] magrittr_2.0.3 ggbeeswarm_0.7.2 spatstat.utils_3.0-3
[7] farver_2.1.1 rmarkdown_2.22 vctrs_0.6.2
[10] ROCR_1.0-11 spatstat.explore_3.2-1 htmltools_0.5.5
[13] pROC_1.18.2 sctransform_0.3.5 parallelly_1.36.0
[16] KernSmooth_2.23-20 htmlwidgets_1.6.2 ica_1.0-3
[19] plyr_1.8.8 lubridate_1.9.2 plotly_4.10.2
[22] zoo_1.8-12 igraph_1.4.3 mime_0.12
[25] lifecycle_1.0.3 iterators_1.0.14 pkgconfig_2.0.3
[28] Matrix_1.5-4 R6_2.5.1 fastmap_1.1.1
[31] fitdistrplus_1.1-11 future_1.32.0 shiny_1.7.4
[34] digest_0.6.31 colorspace_2.1-0 tensor_1.5
[37] irlba_2.3.5.1 labeling_0.4.2 progressr_0.13.0
[40] timechange_0.2.0 fansi_1.0.4 spatstat.sparse_3.0-1
[43] httr_1.4.6 polyclip_1.10-4 abind_1.4-5
[46] compiler_4.3.0 withr_2.5.0 BiocParallel_1.36.0
[49] lava_1.7.2.1 MASS_7.3-58.4 ModelMetrics_1.2.2.2
[52] tools_4.3.0 vipor_0.4.5 lmtest_0.9-40
[55] beeswarm_0.4.0 httpuv_1.6.11 future.apply_1.11.0
[58] nnet_7.3-18 goftest_1.2-3 glue_1.6.2
[61] nlme_3.1-162 promises_1.2.0.1 grid_4.3.0
[64] Rtsne_0.16 cluster_2.1.4 reshape2_1.4.4
[67] generics_0.1.3 recipes_1.0.6 gtable_0.3.3
[70] spatstat.data_3.0-1 class_7.3-21 tidyr_1.3.0
[73] data.table_1.14.8 sp_1.6-1 utf8_1.2.3
[76] spatstat.geom_3.2-1 RcppAnnoy_0.0.20 ggrepel_0.9.3
[79] RANN_2.6.1 foreach_1.5.2 pillar_1.9.0
[82] stringr_1.5.0 limma_3.58.1 later_1.3.1
[85] splines_4.3.0 survival_3.5-5 deldir_1.0-9
[88] tidyselect_1.2.0 miniUI_0.1.1.1 pbapply_1.7-0
[91] knitr_1.43 gridExtra_2.3 scattermore_1.2
[94] RhpcBLASctl_0.23-42 stats4_4.3.0 xfun_0.39
[97] statmod_1.5.0 hardhat_1.3.0 timeDate_4022.108
[100] matrixStats_1.0.0 stringi_1.7.12 lazyeval_0.2.2
[103] yaml_2.3.7 evaluate_0.21 codetools_0.2-19
[106] kernlab_0.9-32 tibble_3.2.1 cli_3.6.1
[109] uwot_0.1.14 rpart_4.1.19 xtable_1.8-4
[112] reticulate_1.30 munsell_0.5.0 harmony_1.2.0
[115] Rcpp_1.0.10 globals_0.16.2 spatstat.random_3.1-5
[118] png_0.1-8 parallel_4.3.0 ellipsis_0.3.2
[121] gower_1.0.1 listenv_0.9.0 viridisLite_0.4.2
[124] ipred_0.9-14 prodlim_2023.03.31 scales_1.2.1
[127] ggridges_0.5.4 crayon_1.5.2 leiden_0.4.3
[130] purrr_1.0.1 rlang_1.1.1 fastmatch_1.1-3
[133] cowplot_1.1.1