Single Cell Data Analysis 2023
Quality control
removal problematic cells
doublet classification
…
Normalization
Dimensionality reduction
Data Integration
What is differential expression (DE) analysis?
Common methods and models
How to evaluate model performance?
Things to keep in mind
Comparison of cell types (often within a single sample)
Comparison of gene expression within a cell type between samples/conditions (multiple samples)
Gene | logFC | p-value |
---|---|---|
CD79A | 2.82 | 4.73*10^-20 |
CD79B | 2.23 | 4.07*10^-19 |
MS4A1 | -2.44 | 4.67*10^-19 |
… | … | … |
CD74 | 1.53 | 5.04*10^-17 |
⬆️
⬇️
❓
Data characteristics are different; scRNA-seq data is much more sparse, with high variability
“… most computational methods will stick with the old mentality of viewing differential expression as simple”up or down” phenomenon. We advocate that we should fully embrace the features of single cell data, which allows us to observe binary (from Off to On) as well as continuous (the amount of expression) regulations.”
Wu et al. (Bioinformatics, 2017)
Seurat
scran
scanpy
Data is collected from a sample from a much larger population
Summary statistics lets us make inferences (conclusions) about the population from which the sample was derived
As well as predict the outcome given a model fitted to the data
Is there a difference in height between students taking the scRNA-seq course in 2022 and 2023?
H0: null hypothesis: there is no difference in height
H1: alternative hypothesis: difference of the means does not equal 0
The data from both groups can be modeled with a normal distribution, so we have an idea of the average and variation in both groups. The difference between the means of both groups forms the observed test statistic. This test statistic enables us to carry out a hypothesis test in this case a t-tests.
The better the model fits the data, the better and more accurate the test statistic! This explains the many test options. Not every test will fit each dataset or comparison.
High noise levels (technical and biological factors)
low library sizes
low amount of available mRNAs; results in amplification bias and dropout events
partial coverage, uneven depth
stochastic nature of transcription
multimodality in gene expression; presence of multiple possible cell states within a cell population
Generic, non-parametric methods
Wilcoxon rank-sum test
Kruskal-Wallis
Kolmogorov-Smirnov test
These usually convert observed expression values into ranks
Than, test whether the distribution of the ranks for one group are significantly different from the distribution of the ranks in the other group
Can fail in presence of large number of tied values (zeros in scRNA-seq)
If the assumptions for a parametric test hold, these are generally more powerful than a non-parametric test
𝝻 = mean expression
𝞭^2 = dispersion, which is inversely related to the variance
NeBi fits bulk RNA-seq data very well and is used for most statistical methods designed for such data. In addition, it has been show to fit the distribution of molecule counts obtained from data tagged by unique molecule identifiers (UMI) quite well.
d = dropout rate
The dropout rate of a gene is strongly correlated with the men expression of the gene. different zero-inflated negative binomial models use different relationships between mean expression and dropout rate
Implemented in MAST, SCDE
Discrete probability distribution that expresses the probability of a given number of events occuring in a fixed interval of time or space - if these events occur with a known constant mean rate, lambda, and independently of the time since the last event
a: the rate of activation of transcription
b: the rate of inhibition of transcription
g: the rate of transcript production while transcription is active at the locus
Implemented in BPSC.
Bi(n, 𝜃)
Discrete probability distribution of the number of success in a sequence of n independent experiments; 𝜃 is probability of success
Used to compare proportion of zeros
Gene-wise null hypothesis:
Probability of being expressed is the same in group 1 and group 2
Available in scran
Uses generalized linear hurdle model
Designed to account for stochastic dropouts and bimodal expression in which expression is either strongly non-zero or non-detectable
The rate of expression Z, and the level of expression Y, are modeled for each gene g, indicating whether gene g is expressed in cell i
A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable Y
Model parameters fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Models can get quite complicated but in Seurat/Scran/Scanpy the default methods are set to t-test and Wilcoxon rank-sum test
Importance of understanding what you are trying to compare, e.g. mean expressions, or probability of being expressed
Understand your data
Assess and validate results!
No ground truth data available!
Known data
Compare
Investigate results
36 statistical approaches for DE analysis to compare the expression levels in two groups of cells
Based on 9 datasets and different preprocessing in some cases
Extensive evaluation of metrics (number of genes found, characteristics of the false positives, robustness and similarities between methods
Some highlights:
t-test and Wilcoxon work well, given at least a few dozen cells to compare
bulk RNA-seq analysis methods do not generally perform worse than those specifically developed for scRNA-seq (edgeR, DESeq2)
Filtering out lowly expressed genes is quite important for good performance of bulk method
Pairwise cluster comparisons
For a given cluster, find marker genes
DE compared to all cells outside of the cluster
DE compared to at least one other cluster
DE compared to each of the other clusters
DE compared to “most” of the other clusters
DE and up-regulated (upregulated markers are usually easier to interpret)
Cell-type comparisons
Results will depend on what we compare to!
if the data set consists of only T-cells, no generic T-cell markers will (or should) show up as differentially expressed between clusters
This is important to keep in mind when comparing marker genes found in different studies, with potentially different composition
Population specific changes in expression between conditions
Should be replicated
Structure of data should be taken into account!
mixed effect models, pseudo-bulk samples, …
Example workflow here: link
Balanced cluster sizes
Highly similar clusters
Balanced cluster sizes
Cluster 3 will dominate all 1-vs-rest comparisons
Probably good idea to subsample
Be aware the subsampling strategies; in Seurat only does it per test
Highly similar clusters
always go back to RNA assay (or similar) for doing differential expression analysis
Depending on the method you choose use: counts, normalized counts or lognormalized counts
Normalization strategy can have a big influence on the results in differential expression (e.g. when comparing celltypes with few expressed genes vs a celltype with many genes)
Batch effects
Always check if the DEGs you found are just differentially expressed in one of the batches
There are tests that can control for batch effects (Seurat parameter “latent.vars” - only for LR, negbinom, poisson or MAST)
How many cells do you need to get reliable detection of differential expression
Highly expressed genes: probably 10-20 cells is enough
Lowly expressed cells: probably > 20, preferably > 50
Depends on the sensitivity of the library preparation method and how distinct the cell types you are comparing are (e.g. macrophage vs T-cell or CD8 T-cell vs CD4 T-cell)
Many results will be unwanted noise! Sanity test: take two random sets of cells and run DE analysis. You will probably have a few significant genes with most of the commonly used tests.
Single Cell Data Analysis 2023