Differential Gene Expression

Single Cell RNA-Seq Analysis

Jennifer Fransson

02-Apr-2025

Overview

What is differential gene expression?
How is the analysis performed?
- Choosing your groups of interest
- Functions
- Simple design vs complex design
- 1-vs-1 and 1-vs-all
Other considerations

What is differential gene expression?

Count data -> statistical analysis -> Are differences significant (greater than expected randomly)

Statistical tests

Null hypothesis: Mean/median/distribution is equal between group A and group B
\(p\) < 0.05 : if null hypothesis is true, we can expect the measured result in < 5% of cases where group A and group B have been sampled with sample size \(n\)

Statistical significance: The result is likely to be from a true difference rather than random chance

What is differential gene expression?

	avg_log2FC	p_val_adj
CD7	5.535220	0.0000001
LCK	3.605886	0.0000046
HLA-DPB1	-5.291575	0.0000051
HLA-DRA	-4.128576	0.0000126
HLA-DRB1	-5.027130	0.0000172
GNLY	8.198735	0.0000191
GZMM	3.120563	0.0000767
CD3D	2.255304	0.0000805
GZMA	3.078594	0.0001174
HLA-DPA1	-3.661491	0.0002595

What is differential gene expression?

	avg_log2FC	p_val_adj	pct.1	pct.2
CD7	5.535220	0.0000001	0.714	0.058
LCK	3.605886	0.0000046	0.679	0.077
HLA-DPB1	-5.291575	0.0000051	0.107	0.769
HLA-DRA	-4.128576	0.0000126	0.357	0.865
HLA-DRB1	-5.027130	0.0000172	0.107	0.731
GNLY	8.198735	0.0000191	0.571	0.058
GZMM	3.120563	0.0000767	0.571	0.058
CD3D	2.255304	0.0000805	0.643	0.096
GZMA	3.078594	0.0001174	0.571	0.058
HLA-DPA1	-3.661491	0.0002595	0.179	0.712

What is differential gene expression?

(Modified from Tiberi et al., 2023)

Most methods focus on difference in mean
Many different distributions will show a difference in means
- But not all!

How is the analysis performed?

Defining groups of interest

A priori defined groups: Compare cells from different samples, e.g.:

Experimental groups (treatment, time points, clinical information etc)
Sorted cells

Data-driven definition of groups: Compare cells depending on analysis output, e.g.:

RNA-based clustering/identity
Identity based on other data from multi-omics

Warning: Performing DE on clusters defined with the same data (“double-dipping”) will inflate DE analysis. Be mindful of this when you interpret the results.

Functions

Toolkit	Function
Seurat	`FindMarkers()`, `FindAllMarkers()`
Scran	`findMarkers()`
Scanpy	`scanpy.tl.rank_genes_groups()`

FindAllMarkers

FindAllMarkers(
  object,
  assay = NULL,
  features = NULL,
  logfc.threshold = 0.1,
  test.use = "wilcox",
  slot = "data",
  min.pct = 0.01,
  min.diff.pct = -Inf,
  node = NULL,
  verbose = TRUE,
  only.pos = FALSE,
  max.cells.per.ident = Inf,
  random.seed = 1,
  latent.vars = NULL,
  min.cells.feature = 3,
  min.cells.group = 3,
  mean.fxn = NULL,
  fc.name = NULL,
  base = 2,
  return.thresh = 0.01,
  densify = FALSE,
  ...
)

Seurat 5.1.0

FindAllMarkers

FindAllMarkers(
  object,
  assay = NULL,
  features = NULL,
  logfc.threshold = 0.1,
  test.use = "wilcox",
  slot = "data",
  min.pct = 0.01,
  min.diff.pct = -Inf,
  node = NULL,
  verbose = TRUE,
  only.pos = FALSE,
  max.cells.per.ident = Inf,
  random.seed = 1,
  latent.vars = NULL,
  min.cells.feature = 3,
  min.cells.group = 3,
  mean.fxn = NULL,
  fc.name = NULL,
  base = 2,
  return.thresh = 0.01,
  densify = FALSE,
  ...
)

Seurat 5.1.0

FindAllMarkers

FindAllMarkers(
  object,
  assay = NULL,
  features = NULL,
  logfc.threshold = 0.1,
  test.use = "wilcox",
  slot = "data",
  min.pct = 0.01,
  min.diff.pct = -Inf,
  node = NULL,
  verbose = TRUE,
  only.pos = FALSE,
  max.cells.per.ident = Inf,
  random.seed = 1,
  latent.vars = NULL,
  min.cells.feature = 3,
  min.cells.group = 3,
  mean.fxn = NULL,
  fc.name = NULL,
  base = 2,
  return.thresh = 0.01,
  densify = FALSE,
  ...
)

Seurat 5.1.0

FindAllMarkers

FindAllMarkers(
  object,
  assay = NULL,
  features = NULL,
  logfc.threshold = 0.1,
  test.use = "wilcox",
  slot = "data",
  min.pct = 0.01,
  min.diff.pct = -Inf,
  node = NULL,
  verbose = TRUE,
  only.pos = FALSE,
  max.cells.per.ident = Inf,
  random.seed = 1,
  latent.vars = NULL,
  min.cells.feature = 3,
  min.cells.group = 3,
  mean.fxn = NULL,
  fc.name = NULL,
  base = 2,
  return.thresh = 0.01,
  densify = FALSE,
  ...
)

Seurat 5.1.0

Statistical tests

"wilcox" : Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test (default)

"bimod" : Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)

"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.

"t" : Identify differentially expressed genes between two groups of cells using the Student's t-test.

"negbinom" : Identifies differentially expressed genes between two groups of cells using a negative binomial generalized linear model. Use only for UMI-based datasets

"poisson" : Identifies differentially expressed genes between two groups of cells using a poisson generalized linear model. Use only for UMI-based datasets

"LR" : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.

"MAST" : Identifies differentially expressed genes between two groups of cells using a hurdle model tailored to scRNA-seq data. Utilizes the MAST package to run the DE testing.

"DESeq2" : Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution (Love et al, Genome Biology, 2014).This test does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups. To use this method, please install DESeq2, using the instructions at https://bioconductor.org/packages/release/bioc/html/DESeq2.html

Statistical tests

"wilcox" : Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test (default)

"bimod" : Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)

"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.

"t" : Identify differentially expressed genes between two groups of cells using the Student's t-test.

"negbinom" : Identifies differentially expressed genes between two groups of cells using a negative binomial generalized linear model. Use only for UMI-based datasets

"poisson" : Identifies differentially expressed genes between two groups of cells using a poisson generalized linear model. Use only for UMI-based datasets

"LR" : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.

"MAST" : Identifies differentially expressed genes between two groups of cells using a hurdle model tailored to scRNA-seq data. Utilizes the MAST package to run the DE testing.

"DESeq2" : Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution (Love et al, Genome Biology, 2014).This test does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups. To use this method, please install DESeq2, using the instructions at https://bioconductor.org/packages/release/bioc/html/DESeq2.html

Distributions

High noise (technical + biology)
Low library sizes
Low mRNA quantity
Amplification bias, drop-outs
3’ bias, partial coverage
Bursting
Mixed cell types

Statistical tests

"wilcox" : Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test (default)

"bimod" : Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)

"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.

"t" : Identify differentially expressed genes between two groups of cells using the Student's t-test.

"negbinom" : Identifies differentially expressed genes between two groups of cells using a negative binomial generalized linear model. Use only for UMI-based datasets

"poisson" : Identifies differentially expressed genes between two groups of cells using a poisson generalized linear model. Use only for UMI-based datasets

"LR" : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.

"MAST" : Identifies differentially expressed genes between two groups of cells using a hurdle model tailored to scRNA-seq data. Utilizes the MAST package to run the DE testing.

"DESeq2" : Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution (Love et al, Genome Biology, 2014).This test does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups. To use this method, please install DESeq2, using the instructions at https://bioconductor.org/packages/release/bioc/html/DESeq2.html

Hurdle models

…most computational methods still stick with the old mentality of viewing differential expression as a simple ‘up or down’ phenomenon. We advocate that we should fully embrace the features of single cell data, which allows us to observe binary (from Off to On) as well as continuous (the amount of expression) regulations. Wu et al. (2018)

MAST

Two part GLM (Hurdle model)
Models the continuous nature of gene expression and the discrete binary nature of gene detection
Detection hurdle
- Expression detected or not?
- Logistic regression
- If gene is not detected, stop, else move to next hurdle
Expression hurdle
- Genes with positive expression levels modelled using GLM
Hurdle model is able to handle drop-outs
Support complex modelling

Finak et al. (2015)

Complex designs

Comparing groups of samples (e.g. patients vs controls)
Including batch effects
Correcting for covariates (e.g. age)

Complex designs: Groups of samples

Example: 3 patients vs 3 controls

\(n\): Number of cells (1000s) or number of individuals (3)?

Many tests assume independence!

Complex designs: Covariates

Complex designs: Approaches

Approach	Speed	Can include covariates	Can account for multilevel design	Sensitivity	Specificity
Naive model	Fast	✖	✖	High	Low
GLM	Slow	✓	Not recommended	High	Low
Mixed models	Slow	✓	✓	Medium	Medium
Pseudobulk	Fast	✓	✓	Low	High

NB: This table broadly summarizes each approach, but each approach includes many methods with their own advantages and disadvantages.

Further reading: Soneson & Robinson (2018) Zimmerman et al. (2021) Juntilla et al. (2022) Das et al. (2022)

1-vs-1 and 1-vs-all

1-vs-1: C1 vs C2
1-vs-all: C1 vs C0 + C2 + C3

1-vs-all analysis

Larger clusters will be over-represented unless subsampled
Highly similar clusters
- Will have most of their DEGs overlapping
- Pairwise comparisons might help rather than 1 vs rest

Other considerations

Considerations - p-values

\(p\) depends on \(n\), variance and intergroup difference
- As \(n\) increases, variance can increase and difference can decrease without losing power

Are all statistically significant differences of interest for your research question?

Considerations: Composition vs expression

	avg_log2FC	p_val_adj	pct.1	pct.2
S100A8	5.144138	0.0507286	0.510	0.129
SAT1	1.210656	0.0858772	0.857	0.484
TALDO1	1.868835	0.1229653	0.673	0.226
GZMM	-2.129484	0.1467967	0.102	0.452
SPON2	-2.597993	0.2331345	0.082	0.387

Assessing results

Methods are hard to evaluate - we don’t know the ground truth
- Using known data (positive controls)
- Simulated data by modelling
Intersect of multiple methods
Visual inspection

Assessing results

Violin plots are good to visualize distribution

Dot plots give a quick overview of both expression and % of cells expressing a gene

Things to think about

How many cells/samples do I need for reliable DGE?
- How different do I expect my cells/samples to be?
- How high is the expression and how deep am I sequencing?
- Will also depend on library quality
Which test should I use?
- Which populations am I comparing?
- Are cells independent within my groups of interest?
- Do I need to correct for any batch effects?
Which data should I use? Raw? Normalized? Log Normalized?
- Depends on test/method
DE results are always relative to other cells
Don’t just rely on p-values
Always assess your results!
- Visualize the full distributions
- Check for potential confounders
  - Batch effects can be corrected using latent variables (assuming good experimental design)
  - Removing ambient RNA can also help

Conclusions

Single cell data is more complex than differences in mean expression
Different tests rely on different assumptions
Always consider what you are trying to compare
Important to assess and validate the results

References

Das, S., Rai, A., & Rai, S. N. (2022). Differential expression analysis of single-cell RNA-seq data: Current statistical approaches and outstanding challenges. Entropy. https://doi.org/10.3390/e24070995

Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A. K., Slichter, C. K., Miller, H. W., McElrath, M. J., Prlic, M., et al. (2015). MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology, 16(1), 1–13. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0844-5

Juntilla, S., Smolander, J., & Elo, L. L. (2022). Benchmarking methods for detecting differential states between conditions from multi-subject single-cell RNA-seq data. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbac286

Soneson, C., & Robinson, M. D. (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods, 15(4), 255–261. https://www.nature.com/articles/nmeth.4612

Tiberi, S., Crowell, H. L., Samartsidis, P., Weber, L. M., & Robinson, M. (2023). Distinct: A novel approach to differential distribution analyses. The Annals of Applied Statistics, 17(2), 1681–1700. https://doi.org/10.1214/22-AOAS1689

Wu, Z., Zhang, Y., Stitzel, M. L., & Wu, H. (2018). Two-phase differential expression analysis for single cell RNA-seq. Bioinformatics, 34(19), 3340–3348. https://academic.oup.com/bioinformatics/article/34/19/3340/4984507

Zimmerman, K. D., Espeland, M. A., & Langefeld, C. D. (2021). A practical solution to pseudoreplication bias in single-cell studies. Nature Communications. https://www.nature.com/articles/s41467-021-21038-1

Acknowledgements

Slides adapted from previous presentations by Olga Dethlefson, Åsa Björklund, Vincent van Hoef and Roy Francis.