Differential Gene Expression

Single Cell RNA-Seq Analysis

Roy Francis, Åsa Björklund, Olga Dethlefson

09-Feb-2024

Workflow

  • Quality control
  • Cell cycle phase classification
  • Normalization
  • Select highly variable genes
  • Data integration
  • Clustering
  • Cell typing
  • Differential gene expression
  • GSA/GSEA

What is differential gene expression?

avg_log2FC p_val_adj
CD7 5.094294 0.0000001
LCK 3.436469 0.0000046
HLA-DPB1 -5.142617 0.0000051
HLA-DRA -4.092407 0.0000126
HLA-DRB1 -4.868445 0.0000172
GNLY 7.405238 0.0000191
GZMM 2.979152 0.0000767
CD3D 2.229468 0.0000805
GZMA 3.018052 0.0001174
HLA-DPA1 -3.594347 0.0002595

Count data -> statistical analysis -> Are differences significant (greater than expected randomly)

What is differential gene expression?

What is differential gene expression?

Most methods do not take zeros into account.

…most computational methods still stick with the old mentality of viewing differential expression as a simple ‘up or down’ phenomenon. We advocate that we should fully embrace the features of single cell data, which allows us to observe binary (from Off to On) as well as continuous (the amount of expression) regulations. Wu et al. (2018)

Comparisons

  • Pairwise cluster comparisons
    • C1 vs C2, C2 vs C3 …
  • Marker genes
    • C1 vs all other cells …
    • Only positive markers
  • Conditions… Groups…

Functions

Toolkit Function
Seurat FindMarkers(), FindAllMarkers()
Scran findMarkers()
Scanpy scanpy.tl.rank_genes_groups()

FindAllMarkers

FindAllMarkers(
  object,
  assay = NULL,
  features = NULL,
  logfc.threshold = 0.25,
  test.use = "wilcox",
  slot = "data",
  min.pct = 0.1,
  min.diff.pct = -Inf,
  node = NULL,
  verbose = TRUE,
  only.pos = FALSE,
  max.cells.per.ident = Inf,
  random.seed = 1,
  latent.vars = NULL,
  min.cells.feature = 3,
  min.cells.group = 3,
  mean.fxn = NULL,
  fc.name = NULL,
  base = 2,
  return.thresh = 0.01,
  densify = FALSE,
  ...
)

Seurat 4.3.0

Tests

"wilcox" : Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test (default)

"bimod" : Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)

"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.

"t" : Identify differentially expressed genes between two groups of cells using the Student's t-test.

"negbinom" : Identifies differentially expressed genes between two groups of cells using a negative binomial generalized linear model. Use only for UMI-based datasets

"poisson" : Identifies differentially expressed genes between two groups of cells using a poisson generalized linear model. Use only for UMI-based datasets

"LR" : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.

"MAST" : Identifies differentially expressed genes between two groups of cells using a hurdle model tailored to scRNA-seq data. Utilizes the MAST package to run the DE testing.

"DESeq2" : Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution (Love et al, Genome Biology, 2014).This test does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups. To use this method, please install DESeq2, using the instructions at https://bioconductor.org/packages/release/bioc/html/DESeq2.html

Tests

"wilcox" : Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test (default)

"bimod" : Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)

"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.

"t" : Identify differentially expressed genes between two groups of cells using the Student's t-test.

"negbinom" : Identifies differentially expressed genes between two groups of cells using a negative binomial generalized linear model. Use only for UMI-based datasets

"poisson" : Identifies differentially expressed genes between two groups of cells using a poisson generalized linear model. Use only for UMI-based datasets

"LR" : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.

"MAST" : Identifies differentially expressed genes between two groups of cells using a hurdle model tailored to scRNA-seq data. Utilizes the MAST package to run the DE testing.

"DESeq2" : Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution (Love et al, Genome Biology, 2014).This test does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups. To use this method, please install DESeq2, using the instructions at https://bioconductor.org/packages/release/bioc/html/DESeq2.html

Distributions

  • High noise (technical + biology)
  • Low library sizes
  • Low mRNA quantity
  • Amplification bias, drop-outs
  • 3’ bias, partial coverage
  • Bursting
  • Mixed cell types

Tests

"wilcox" : Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test (default)

"bimod" : Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)

"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.

"t" : Identify differentially expressed genes between two groups of cells using the Student's t-test.

"negbinom" : Identifies differentially expressed genes between two groups of cells using a negative binomial generalized linear model. Use only for UMI-based datasets

"poisson" : Identifies differentially expressed genes between two groups of cells using a poisson generalized linear model. Use only for UMI-based datasets

"LR" : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.

"MAST" : Identifies differentially expressed genes between two groups of cells using a hurdle model tailored to scRNA-seq data. Utilizes the MAST package to run the DE testing.

"DESeq2" : Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution (Love et al, Genome Biology, 2014).This test does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups. To use this method, please install DESeq2, using the instructions at https://bioconductor.org/packages/release/bioc/html/DESeq2.html

MAST

  • Two part GLM (Hurdle model)
  • Models the continuous nature of gene expression and the discrete binary nature of gene detection
  • Detection hurdle
    • Expression detected or not?
    • Logistic regression
    • If gene is not detected, stop, else move to next hurdle
  • Expression hurdle
    • Genes with positive expression levels modelled using GLM
  • Hurdle model is able to handle drop-outs
  • Support complex modelling

Finak et al. (2015)

Comparing Methods

  • T-test and Wilcoxon work well enough given sufficient number of samples
  • Bulk methods are not worse than single-cell specific methods
  • Pre-filtering lowly expressed genes is important for bulk methods

Soneson & Robinson (2018)

Comparing Methods

Similarity between methods based on genes identified

Soneson & Robinson (2018)

Assessing results

  • We don’t know the ground truth
  • Using known data (positive controls)
  • Simulated data by modelling
  • Intersect of multiple methods
  • Visual inspection

Assessing results

Violin plots are good to visualize distribution

Cluster balance

  • Highly similar clusters
  • Will have most of their DEGs overlapping
  • Pairwise comparisons might help rather than 1 vs rest

Cluster balance

  • Balance cluster sizes
  • C1 will dominate all 1-vs-rest comparisons
  • Probably good idea to subsample
  • Be aware the subsampling strategies in Seurat only does it per test

Things to think about

  • Which data should I use? Raw? Normalized? Log Normalized?
    • Depends on test/method
  • Check that DEGs are not just a result of some batch effect
  • Batch effects can be corrected using covariates
  • How many cells do I need for reliable DGE?
    • Highly expressed genes: 10-20 cells?
    • Lowly expressed genes: 20-50 cells?
    • Also depends on quality of library prep
  • Distinctness of cell types
  • Differentiate between noise and signal
    • Any comparison will produce some DEGs

Conclusion

  • Important to understand what you are trying to compare: mean expression vs probability of being expressed
  • Important to understand the data
  • Take into account single-cell specific nuances
  • Models can get complicated
  • Important to assess and validate the results
  • Most tests give similar results for top genes but p values might differ.
  • Fold changes can’t be interpreted the same way as bulk rnaseq
  • Too many cells can give extremely low p values

References

Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A. K., Slichter, C. K., Miller, H. W., McElrath, M. J., Prlic, M., et al. (2015). MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology, 16(1), 1–13. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0844-5
Soneson, C., & Robinson, M. D. (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods, 15(4), 255–261. https://www.nature.com/articles/nmeth.4612
Wu, Z., Zhang, Y., Stitzel, M. L., & Wu, H. (2018). Two-phase differential expression analysis for single cell RNA-seq. Bioinformatics, 34(19), 3340–3348. https://academic.oup.com/bioinformatics/article/34/19/3340/4984507

Acknowledgements

Slides adapted from previous presentations by Olga Dethlefson and Åsa Björklund.