Differential Expression Analysis

Single Cell Data Analysis 2023

Context of differential expression analysis?

Quality control
- removal problematic cells
- doublet classification
- …

Normalization
Dimensionality reduction
Data Integration

Clustering

Differential Expression

Overview

What is differential expression (DE) analysis?
Common methods and models
How to evaluate model performance?
Things to keep in mind

What is Differential Expression Analysis?

Comparison of cell types (often within a single sample)

What is Differential Expression Analysis?

Comparison of cell types (often within a single sample)

What is Differential Expression Analysis?

Comparison of cell types (often within a single sample)
Comparison of gene expression within a cell type between samples/conditions (multiple samples)

Gene	logFC	p-value
CD79A	2.82	4.73*10^-20
CD79B	2.23	4.07*10^-19
MS4A1	-2.44	4.67*10^-19
…	…	…
CD74	1.53	5.04*10^-17

Aim of Differential Expression Analysis

⬆️

⬇️

❓

Aim of Differential Expression Analysis

Starting from read count data, perform statistical analysis to discover quantitative changes in gene expression levels between groups

Decide whether, for a given gene, an observed difference in read counts is significant; i.e if the difference is greater than would be expected just due to natural random variation

Can we use bulk methods?

Data characteristics are different; scRNA-seq data is much more sparse, with high variability

Dropouts add complexity to DE analysis

“… most computational methods will stick with the old mentality of viewing differential expression as simple”up or down” phenomenon. We advocate that we should fully embrace the features of single cell data, which allows us to observe binary (from Off to On) as well as continuous (the amount of expression) regulations.”

Wu et al. (Bioinformatics, 2017)

Which methods to use for DE analysis?

Seurat
- FindMarkers(), FindAllMarkers()
scran
- findMarkers()
scanpy
- scanpy.tl.rank_genes_groups()

Seurat Example

Which test to use?

Statistical Inference

Aim of DE: Is, for a given gene, the observed difference in read counts between groups or clusters significantly greater than would be expected just due to random natural variation

Data is collected from a sample from a much larger population
Summary statistics lets us make inferences (conclusions) about the population from which the sample was derived
As well as predict the outcome given a model fitted to the data

Example

Is there a difference in height between students taking the scRNA-seq course in 2022 and 2023?

H0: null hypothesis: there is no difference in height
H1: alternative hypothesis: difference of the means does not equal 0

The data from both groups can be modeled with a normal distribution, so we have an idea of the average and variation in both groups. The difference between the means of both groups forms the observed test statistic. This test statistic enables us to carry out a hypothesis test in this case a t-tests.

The better the model fits the data, the better and more accurate the test statistic! This explains the many test options. Not every test will fit each dataset or comparison.

Why special distributions?

High noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs; results in amplification bias and dropout events
- partial coverage, uneven depth
- stochastic nature of transcription
- multimodality in gene expression; presence of multiple possible cell states within a cell population

What if no model fits?

Generic, non-parametric methods
- Wilcoxon rank-sum test
- Kruskal-Wallis
- Kolmogorov-Smirnov test
These usually convert observed expression values into ranks
Than, test whether the distribution of the ranks for one group are significantly different from the distribution of the ranks in the other group
Can fail in presence of large number of tied values (zeros in scRNA-seq)
If the assumptions for a parametric test hold, these are generally more powerful than a non-parametric test

Which test to use?

Negative Binomial Distribution

𝝻 = mean expression

𝞭^2 = dispersion, which is inversely related to the variance

NeBi fits bulk RNA-seq data very well and is used for most statistical methods designed for such data. In addition, it has been show to fit the distribution of molecule counts obtained from data tagged by unique molecule identifiers (UMI) quite well.

Zero-inflated Negative Binomial Distribution

d = dropout rate

The dropout rate of a gene is strongly correlated with the men expression of the gene. different zero-inflated negative binomial models use different relationships between mean expression and dropout rate

Implemented in MAST, SCDE

Poisson Distribution

Discrete probability distribution that expresses the probability of a given number of events occuring in a fixed interval of time or space - if these events occur with a known constant mean rate, lambda, and independently of the time since the last event

Poisson-Beta Distribution

a: the rate of activation of transcription

b: the rate of inhibition of transcription

g: the rate of transcript production while transcription is active at the locus

Implemented in BPSC.

Binomial Distribution

Bi(n, 𝜃)

Discrete probability distribution of the number of success in a sequence of n independent experiments; 𝜃 is probability of success

Used to compare proportion of zeros

Gene-wise null hypothesis:

Probability of being expressed is the same in group 1 and group 2

Available in scran

Which test to use?

MAST model

Uses generalized linear hurdle model
Designed to account for stochastic dropouts and bimodal expression in which expression is either strongly non-zero or non-detectable
The rate of expression Z, and the level of expression Y, are modeled for each gene g, indicating whether gene g is expressed in cell i
A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable Y
Model parameters fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test

So what is really important?

Models can get quite complicated but in Seurat/Scran/Scanpy the default methods are set to t-test and Wilcoxon rank-sum test
Importance of understanding what you are trying to compare, e.g. mean expressions, or probability of being expressed
Understand your data
Assess and validate results!

Evaluate Results

Why can it be difficult to evaluate performance?

No ground truth data available!

Known data
- Using data we know something about to get “positive controls”
Compare
- compare between numbers and ranks of DEGs
Investigate results
- How does the expression and distributions of detected DEs look like?

Method Evaluation

36 statistical approaches for DE analysis to compare the expression levels in two groups of cells
Based on 9 datasets and different preprocessing in some cases
Extensive evaluation of metrics (number of genes found, characteristics of the false positives, robustness and similarities between methods

Method Evaluation

Some highlights:

t-test and Wilcoxon work well, given at least a few dozen cells to compare
bulk RNA-seq analysis methods do not generally perform worse than those specifically developed for scRNA-seq (edgeR, DESeq2)
Filtering out lowly expressed genes is quite important for good performance of bulk method

Similarity between methods

Differential expression is always comparative

Which comparisons to make?

Pairwise cluster comparisons
For a given cluster, find marker genes
- DE compared to all cells outside of the cluster
- DE compared to at least one other cluster
- DE compared to each of the other clusters
- DE compared to “most” of the other clusters
- DE and up-regulated (upregulated markers are usually easier to interpret)
Cell-type comparisons

Which comparisons to make?

Results will depend on what we compare to!
- if the data set consists of only T-cells, no generic T-cell markers will (or should) show up as differentially expressed between clusters
- This is important to keep in mind when comparing marker genes found in different studies, with potentially different composition

Within celltype comparisons

Population specific changes in expression between conditions
- e.g. Cell type specific changes after drug treatment
Should be replicated
Structure of data should be taken into account!
mixed effect models, pseudo-bulk samples, …
Example workflow here: link

Things to keep in mind!

Balanced cluster sizes
Highly similar clusters

Things to keep in mind!

Balanced cluster sizes
- Cluster 3 will dominate all 1-vs-rest comparisons
- Probably good idea to subsample
- Be aware the subsampling strategies; in Seurat only does it per test

Things to keep in mind!

Highly similar clusters
- Will have most of their DEGs overlapping

Things to keep in mind!

always go back to RNA assay (or similar) for doing differential expression analysis
Depending on the method you choose use: counts, normalized counts or lognormalized counts
Normalization strategy can have a big influence on the results in differential expression (e.g. when comparing celltypes with few expressed genes vs a celltype with many genes)
Batch effects
- Always check if the DEGs you found are just differentially expressed in one of the batches
- There are tests that can control for batch effects (Seurat parameter “latent.vars” - only for LR, negbinom, poisson or MAST)

Things to keep in mind!

How many cells do you need to get reliable detection of differential expression
- Highly expressed genes: probably 10-20 cells is enough
- Lowly expressed cells: probably > 20, preferably > 50
- Depends on the sensitivity of the library preparation method and how distinct the cell types you are comparing are (e.g. macrophage vs T-cell or CD8 T-cell vs CD4 T-cell)
Many results will be unwanted noise! Sanity test: take two random sets of cells and run DE analysis. You will probably have a few significant genes with most of the commonly used tests.