In this tutorial we will cover about Differetial gene expression, which comprises an extensive range of topics and methods. In single cell, differential expresison can have multiple functionalities such as of identifying marker genes for cell populations, as well as differentially regulated genes across conditions (healthy vs control). We will also exercise on how to account the batch information in your test.
We can first load the data from the clustering session. Moreover, we
can already decide which clustering resolution to use. First let’s
define using the louvain
clustering to identifying
differentially expressed genes.
suppressPackageStartupMessages({
library(scater)
library(scran)
# library(venn)
library(cowplot)
library(ggplot2)
# library(rafalib)
library(pheatmap)
library(igraph)
library(dplyr)
})
<- readRDS("data/results/covid_qc_dr_int_cl.rds") sce
Let us first compute a ranking for the highly differential genes in each cluster. There are many different tests and parameters to be chosen that can be used to refine your results. When looking for marker genes, we want genes that are positivelly expressed in a cell type and possibly not expressed in the others.
# Compute differentiall expression
<- scran::findMarkers(x = sce, groups = as.character(sce$louvain_SNNk15),
markers_genes lfc = 0.5, pval.type = "all", direction = "up")
# List of dataFrames with the results for each cluster
markers_genes
## List of length 8
## names(8): 1 2 3 4 5 6 7 8
# Visualizing the expression of one
"1"]] markers_genes[[
## DataFrame with 18121 rows and 10 columns
## p.value FDR summary.logFC logFC.2 logFC.3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## CD79A 3.12219e-307 5.65771e-303 3.33918 3.50769 3.49569
## MS4A1 2.70271e-242 2.44879e-238 2.34911 2.39361 2.37237
## IGHM 1.34591e-215 8.12973e-212 3.66201 3.82441 3.84624
## TNFRSF13C 6.93770e-201 3.14295e-197 2.05748 2.10440 2.03702
## LINC00926 3.06863e-161 1.11213e-157 1.87617 1.91202 1.91947
## ... ... ... ... ... ...
## AC011043.1 1 1 0.00431686 0.00431686 0.00431686
## AC007325.4 1 1 -0.00900931 -0.00900931 -0.00362024
## AL354822.1 1 1 0.01468456 0.01468456 0.01091020
## AC233755.1 1 1 0.00407473 0.00407473 0.00407473
## AC240274.1 1 1 0.00771093 0.01016628 0.00771093
## logFC.4 logFC.5 logFC.6 logFC.7 logFC.8
## <numeric> <numeric> <numeric> <numeric> <numeric>
## CD79A 3.33918 3.42574 3.48286 3.45316 3.48489
## MS4A1 2.37006 2.38049 2.34911 2.36439 2.38240
## IGHM 3.66201 3.81737 3.84129 3.81403 3.87463
## TNFRSF13C 2.05748 2.08148 2.06205 2.08998 2.08595
## LINC00926 1.87617 1.91044 1.89659 1.89392 1.91356
## ... ... ... ... ... ...
## AC011043.1 0.002419979 0.00431686 0.00197413 0.00320700 0.00431686
## AC007325.4 0.000964593 -0.01778832 -0.00330748 -0.00525116 -0.00195822
## AL354822.1 0.015844207 0.01290013 0.01070242 0.01376169 0.00073809
## AC233755.1 0.004074732 0.00407473 0.00407473 0.00407473 0.00407473
## AC240274.1 -0.001286558 0.00684239 0.00875890 0.00663751 0.01270911
We can now select the top 25 up regulated genes for plotting.
# Colect the top 25 genes for each cluster and put the into a single table
<- lapply(names(markers_genes), function(x) {
top25 <- markers_genes[[x]][1:25, 1:2]
temp $gene <- rownames(markers_genes[[x]])[1:25]
temp$cluster <- x
tempreturn(temp)
})<- as_tibble(do.call(rbind, top25))
top25 $p.value[top25$p.value == 0] <- 1e-300
top25 top25
We can now select the top 25 up regulated genes for plotting.
par(mfrow = c(1, 5), mar = c(4, 6, 3, 1))
for (i in unique(top25$cluster)) {
barplot(sort(setNames(-log10(top25$p.value), top25$gene)[top25$cluster == i],
horiz = T, las = 1, main = paste0(i, " vs. rest"), border = "white",
F), yaxs = "i", xlab = "-log10FC")
abline(v = c(0, -log10(0.05)), lty = c(1, 2))
}
We can visualize them as a heatmap. Here we are selecting the top 5.
as_tibble(top25) %>%
group_by(cluster) %>%
top_n(-5, p.value) -> top5
::plotHeatmap(sce[, order(sce$louvain_SNNk15)], features = unique(top5$gene),
scatercenter = T, zlim = c(-3, 3), colour_columns_by = "louvain_SNNk15", show_colnames = F,
cluster_cols = F, fontsize_row = 6, color = colorRampPalette(c("purple", "black",
"yellow"))(90))
We can also plot a violin plot for each gene.
::plotExpression(sce, features = unique(top5$gene), x = "louvain_SNNk15", ncol = 5,
scatercolour_by = "louvain_SNNk15", scales = "free")
The second way of computing differential expression is to answer which genes are differentially expressed within a cluster. For example, in our case we have libraries comming from patients and controls and we would like to know which genes are influenced the most in a particular cell type.
For this end, we will first subset our data for the desired cell cluster, then change the cell identities to the variable of comparison (which now in our case is the “type”, e.g. Covid/Ctrl).
# Filter cells from that cluster
<- sce[, sce$louvain_SNNk15 == 8]
cell_selection
# Compute differentiall expression
<- findMarkers(x = cell_selection, groups = cell_selection@colData$type,
DGE_cell_selection lfc = 0.25, pval.type = "all", direction = "any")
<- lapply(names(DGE_cell_selection), function(x) {
top5_cell_selection <- DGE_cell_selection[[x]][1:5, 1:2]
temp $gene <- rownames(DGE_cell_selection[[x]])[1:5]
temp$cluster <- x
tempreturn(temp)
})<- as_tibble(do.call(rbind, top5_cell_selection))
top5_cell_selection top5_cell_selection
We can now plot the expression across the “type”.
::plotExpression(cell_selection, features = unique(top5_cell_selection$gene),
scaterx = "type", ncol = 5, colour_by = "type")
#DGE_ALL6.2:
<- list()
plotlist for (i in unique(top5_cell_selection$gene)) {
<- plotReducedDim(sce, dimred = "UMAP_on_MNN", colour_by = i, by_exprs_values = "logcounts") +
plotlist[[i]] ggtitle(label = i) + theme(plot.title = element_text(size = 20))
}plot_grid(ncol = 3, plotlist = plotlist)
Hypergeometric enrichment test
Having a defined list of differentially expressed genes, you can now look for their combined function using hypergeometric test:
# Load additional packages
library(enrichR)
# Check available databases to perform enrichment (then choose one)
::listEnrichrDbs() enrichR
# Perform enrichment
<- DGE_cell_selection$Covid[(DGE_cell_selection$Covid$p.value < 0.01) & (abs(DGE_cell_selection$Covid[,
top_DGE grep("logFC.C", colnames(DGE_cell_selection$Covid))]) > 0.25), ]
<- enrichr(genes = rownames(top_DGE), databases = "GO_Biological_Process_2017b")[[1]] enrich_results
## Uploading data to Enrichr... Done.
## Querying GO_Biological_Process_2017b... Done.
## Parsing results... Done.
Some databases of interest:
GO_Biological_Process_2017b
KEGG_2019_Human
KEGG_2019_Mouse
WikiPathways_2019_Human
WikiPathways_2019_Mouse
You visualize your results using a simple barplot, for example:
par(mfrow = c(1, 1), mar = c(3, 25, 2, 1))
barplot(height = -log10(enrich_results$P.value)[10:1], names.arg = enrich_results$Term[10:1],
horiz = TRUE, las = 1, border = FALSE, cex.names = 0.6)
abline(v = c(-log10(0.05)), lty = 2)
abline(v = 0, lty = 1)
Besides the enrichment using hypergeometric test, we can also perform gene set enrichment analysis (GSEA), which scores ranked genes list (usually based on fold changes) and computes permutation test to check if a particular gene set is more present in the Up-regulated genes, amongthe DOWN_regulated genes or not differentially regulated.
# Create a gene rank based on the gene expression fold change
<- setNames(DGE_cell_selection$Covid[, grep("logFC.C", colnames(DGE_cell_selection$Covid))],
gene_rank casefold(rownames(DGE_cell_selection$Covid), upper = T))
Once our list of genes are sorted, we can proceed with the enrichment itself. We can use the package to get gene set from the Molecular Signature Database (MSigDB) and select KEGG pathways as an example.
library(msigdbr)
# Download gene sets
<- msigdbr::msigdbr("Homo sapiens")
msigdbgmt <- as.data.frame(msigdbgmt)
msigdbgmt
# List available gene sets
unique(msigdbgmt$gs_subcat)
## [1] "MIR:MIR_Legacy" "TFT:TFT_Legacy" "CGP" "TFT:GTRD"
## [5] "" "VAX" "CP:BIOCARTA" "CGN"
## [9] "GO:BP" "GO:CC" "IMMUNESIGDB" "GO:MF"
## [13] "HPO" "CP:KEGG" "MIR:MIRDB" "CM"
## [17] "CP" "CP:PID" "CP:REACTOME" "CP:WIKIPATHWAYS"
# Subset which gene set you want to use.
<- msigdbgmt[msigdbgmt$gs_subcat == "CP:WIKIPATHWAYS", ]
msigdbgmt_subset <- lapply(unique(msigdbgmt_subset$gs_name), function(x) {
gmt $gs_name == x, "gene_symbol"]
msigdbgmt_subset[msigdbgmt_subset
})names(gmt) <- unique(paste0(msigdbgmt_subset$gs_name, "_", msigdbgmt_subset$gs_exact_source))
Next, we will be using the GSEA. This will result in a table
containing information for several pathways. We can then sort and filter
those pathways to visualize only the top ones. You can select/filter
them by either p-value
or normalized enrichemnet score
(NES
).
library(fgsea)
# Perform enrichemnt analysis
<- fgsea(pathways = gmt, stats = gene_rank, minSize = 15, maxSize = 500,
fgseaRes nperm = 10000)
<- fgseaRes[order(fgseaRes$NES, decreasing = T), ]
fgseaRes
# Filter the results table to show only the top 10 UP or DOWN regulated
# processes (optional)
<- fgseaRes$pathway[1:10]
top10_UP
# Nice summary table (shown as a plot)
dev.off()
plotGseaTable(gmt[top10_UP], gene_rank, fgseaRes, gseaParam = 0.5)
Your turn
Which KEGG pathways are upregulated in this cluster? Which KEGG pathways are dowregulated in this cluster? Change the pathway source to another gene set (e.g. “CP:WIKIPATHWAYS” or “CP:REACTOME” or “CP:BIOCARTA” or “GO:BP”) and check the if you get simmilar results?
Finally, lets save the integrated data for further analysis.
saveRDS(sce, "data/results/covid_qc_dr_int_cl_dge.rds")
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS/LAPACK: /Users/asabjor/miniconda3/envs/scRNAseq2023/lib/libopenblasp-r0.3.21.dylib
##
## locale:
## [1] C/UTF-8/C/C/C/C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] fgsea_1.20.0 msigdbr_7.5.1
## [3] enrichR_3.1 dplyr_1.0.10
## [5] igraph_1.3.5 pheatmap_1.0.12
## [7] cowplot_1.1.1 scran_1.22.1
## [9] scater_1.22.0 ggplot2_3.4.0
## [11] scuttle_1.4.0 SingleCellExperiment_1.16.0
## [13] SummarizedExperiment_1.24.0 Biobase_2.54.0
## [15] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
## [17] IRanges_2.28.0 S4Vectors_0.32.4
## [19] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
## [21] matrixStats_0.63.0 RJSONIO_1.3-1.7
## [23] optparse_1.7.3
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 httr_1.4.4
## [3] RColorBrewer_1.1-3 tools_4.1.3
## [5] bslib_0.4.2 utf8_1.2.2
## [7] R6_2.5.1 irlba_2.3.5.1
## [9] vipor_0.4.5 DBI_1.1.3
## [11] colorspace_2.1-0 withr_2.5.0
## [13] tidyselect_1.2.0 gridExtra_2.3
## [15] curl_4.3.3 compiler_4.1.3
## [17] cli_3.6.0 BiocNeighbors_1.12.0
## [19] formatR_1.14 DelayedArray_0.20.0
## [21] labeling_0.4.2 sass_0.4.5
## [23] scales_1.2.1 stringr_1.5.0
## [25] digest_0.6.31 rmarkdown_2.20
## [27] XVector_0.34.0 pkgconfig_2.0.3
## [29] htmltools_0.5.4 sparseMatrixStats_1.6.0
## [31] highr_0.10 fastmap_1.1.0
## [33] limma_3.50.3 rlang_1.0.6
## [35] DelayedMatrixStats_1.16.0 farver_2.1.1
## [37] jquerylib_0.1.4 generics_0.1.3
## [39] jsonlite_1.8.4 BiocParallel_1.28.3
## [41] RCurl_1.98-1.9 magrittr_2.0.3
## [43] BiocSingular_1.10.0 GenomeInfoDbData_1.2.7
## [45] Matrix_1.5-3 Rcpp_1.0.10
## [47] ggbeeswarm_0.7.1 munsell_0.5.0
## [49] fansi_1.0.4 babelgene_22.9
## [51] viridis_0.6.2 lifecycle_1.0.3
## [53] stringi_1.7.12 yaml_2.3.7
## [55] edgeR_3.36.0 zlibbioc_1.40.0
## [57] grid_4.1.3 parallel_4.1.3
## [59] ggrepel_0.9.2 dqrng_0.3.0
## [61] lattice_0.20-45 beachmat_2.10.0
## [63] locfit_1.5-9.7 metapod_1.2.0
## [65] knitr_1.41 pillar_1.8.1
## [67] rjson_0.2.21 ScaledMatrix_1.2.0
## [69] fastmatch_1.1-3 glue_1.6.2
## [71] evaluate_0.20 data.table_1.14.6
## [73] vctrs_0.5.2 gtable_0.3.1
## [75] getopt_1.20.3 assertthat_0.2.1
## [77] cachem_1.0.6 xfun_0.36
## [79] rsvd_1.0.5 viridisLite_0.4.1
## [81] tibble_3.1.8 beeswarm_0.4.0
## [83] cluster_2.1.4 bluster_1.4.0
## [85] statmod_1.5.0