suppressPackageStartupMessages({
library(scater)
library(scran)
# library(venn)
library(patchwork)
library(ggplot2)
library(pheatmap)
library(igraph)
library(dplyr)
})
Code chunks run R commands unless otherwise specified.
In this tutorial we will cover differential gene expression, which comprises an extensive range of topics and methods. In single cell, differential expresison can have multiple functionalities such as identifying marker genes for cell populations, as well as identifying differentially regulated genes across conditions (healthy vs control). We will also cover controlling batch effect in your test.
We can first load the data from the clustering session. Moreover, we can already decide which clustering resolution to use. First let’s define using the louvain
clustering to identifying differentially expressed genes.
# download pre-computed data if missing or long compute
<- TRUE
fetch_data
# url for source and intermediate data
<- "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq"
path_data <- "data/covid/results/bioc_covid_qc_dr_int_cl.rds"
path_file if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)
if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results/bioc_covid_qc_dr_int_cl.rds"), destfile = path_file)
<- readRDS(path_file)
sce print(reducedDims(sce))
List of length 17
names(17): PCA UMAP tSNE_on_PCA UMAP_on_PCA ... UMAP_on_Scanorama SNN harmony2
1 Number of cells per cluster
When doing differential expression of one cluster vs all it is a good idea to have equal aomunt of cells from each cluster. Otherwise, the largest clusters will have the highest effect on the vs all comparison.
However, findMarker
in Scran is implemented so that the tests are run in a pariwise manner, e.g. each cluster is tested agains all the others individually. Then a combined p-value is calculated across all the tests using combineMarkers
. So for this method, one large cluster will not influence the results in the same way as FindMarkers
in Seurat or rank_genes_groups
in Scanpy.
2 Cell marker genes
Let us first compute a ranking for the highly differential genes in each cluster. There are many different tests and parameters to be chosen that can be used to refine your results. When looking for marker genes, we want genes that are positively expressed in a cell type and possibly not expressed in others.
In the scran function findMarkers
t-test, wilcoxon test and binomial test implemented.
# Compute differentiall expression
<- scran::findMarkers(
markers_genes x = sce,
groups = as.character(sce$leiden_k20),
test.type = "wilcox",
lfc = .5,
pval.type = "all",
direction = "up"
)
# List of dataFrames with the results for each cluster
markers_genes
List of length 13
names(13): 1 10 11 12 13 2 3 4 5 6 7 8 9
# Visualizing the expression of one
head(markers_genes[["1"]])
DataFrame with 6 rows and 15 columns
p.value FDR summary.AUC AUC.10 AUC.11 AUC.12 AUC.13
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
S100A8 1 1 0.282842 0.959148 0.942246 0.977685 0.980941
IFI6 1 1 0.262378 0.499180 0.364114 0.443247 0.467511
RETN 1 1 0.304136 0.411009 0.383941 0.415450 0.410955
ALOX5AP 1 1 0.303393 0.404740 0.453944 0.287193 0.335212
S100A12 1 1 0.245282 0.870458 0.865589 0.876760 0.871185
ISG15 1 1 0.289798 0.433757 0.299824 0.380235 0.353287
AUC.2 AUC.3 AUC.4 AUC.5 AUC.6 AUC.7 AUC.8
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
S100A8 0.982921 0.282842 0.985213 0.983036 0.985045 0.854209 0.984766
IFI6 0.507736 0.262378 0.504875 0.457423 0.462969 0.549265 0.403180
RETN 0.416034 0.262345 0.415249 0.415516 0.416258 0.400304 0.416099
ALOX5AP 0.387884 0.259607 0.362574 0.336117 0.303393 0.441705 0.250192
S100A12 0.875237 0.245282 0.878158 0.875534 0.877325 0.839671 0.874472
ISG15 0.398547 0.273513 0.429643 0.389656 0.370755 0.480924 0.339772
AUC.9
<numeric>
S100A8 0.414321
IFI6 0.316388
RETN 0.304136
ALOX5AP 0.312316
S100A12 0.381497
ISG15 0.289798
We can now select the top 25 overexpressed genes for plotting.
# Colect the top 25 genes for each cluster and put the into a single table
<- lapply(names(markers_genes), function(x) {
top25 <- markers_genes[[x]][1:25, 1:2]
temp $gene <- rownames(markers_genes[[x]])[1:25]
temp$cluster <- x
tempreturn(temp)
})<- as_tibble(do.call(rbind, top25))
top25 $p.value[top25$p.value == 0] <- 1e-300
top25 top25
We can plot them as barplots per cluster.
par(mfrow = c(1, 5), mar = c(4, 6, 3, 1))
for (i in unique(top25$cluster)) {
barplot(sort(setNames(-log10(top25$p.value), top25$gene)[top25$cluster == i], F),
horiz = T, las = 1, main = paste0(i, " vs. rest"), border = "white", yaxs = "i", xlab = "-log10FC"
)abline(v = c(0, -log10(0.05)), lty = c(1, 2))
}
We can visualize them as a heatmap. Here we are selecting the top 5.
%>%
top25 group_by(cluster) %>%
slice_min(p.value, n = 5, with_ties = FALSE) -> top5
::plotHeatmap(sce[, order(sce$leiden_k20)],
scaterfeatures = unique(top5$gene),
center = T, zlim = c(-3, 3),
colour_columns_by = "leiden_k20",
show_colnames = F, cluster_cols = F,
fontsize_row = 6,
color = colorRampPalette(c("purple", "black", "yellow"))(90)
)
We can also plot a violin plot for each gene.
::plotExpression(sce, features = unique(top5$gene), x = "leiden_k20", ncol = 5, colour_by = "leiden_k20", scales = "free") scater
Another way is by representing the overall group expression and detection rates in a dot-plot.
plotDots(sce, features = unique(top5$gene), group="leiden_k20")
3 DGE across conditions
The second way of computing differential expression is to answer which genes are differentially expressed within a cluster. For example, in our case we have libraries comming from patients and controls and we would like to know which genes are influenced the most in a particular cell type. For this end, we will first subset our data for the desired cell cluster, then change the cell identities to the variable of comparison (which now in our case is the type, e.g. Covid/Ctrl).
# Filter cells from that cluster
<- sce[, sce$leiden_k20 == 2]
cell_selection
# Compute differentiall expression
<- findMarkers(
DGE_cell_selection x = cell_selection,
groups = cell_selection@colData$type,
lfc = .25,
pval.type = "all",
direction = "any"
)<- lapply(names(DGE_cell_selection), function(x) {
top5_cell_selection <- DGE_cell_selection[[x]][1:5, 1:2]
temp $gene <- rownames(DGE_cell_selection[[x]])[1:5]
temp$cluster <- x
tempreturn(temp)
})<- as_tibble(do.call(rbind, top5_cell_selection))
top5_cell_selection top5_cell_selection
We can now plot the expression across the type.
::plotExpression(cell_selection, features = unique(top5_cell_selection$gene), x = "type", ncol = 5, colour_by = "type") scater
Or we can plot them as dotplots to see the expression in each individual sample.
plotDots(cell_selection, unique(top5_cell_selection$gene), group="sample")
Clearly many of the top Covid genes are only high in the covid_17 sample, and not a general feature of covid patients. In this case using a method that can control for pseudreplication or pseudobulk methods would be appropriate. We have a more thorough discussion on this issue with sample batch effects in differential expression across conditions in the Seurat tutorial.
4 Gene Set Analysis (GSA)
4.1 Hypergeometric enrichment test
Having a defined list of differentially expressed genes, you can now look for their combined function using hypergeometric test.
# Load additional packages
library(enrichR)
# Check available databases to perform enrichment (then choose one)
::listEnrichrDbs() enrichR
# Perform enrichment
<- DGE_cell_selection$Covid[(DGE_cell_selection$Covid$p.value < 0.01) & (abs(DGE_cell_selection$Covid[, grep("logFC.C", colnames(DGE_cell_selection$Covid))]) > 0.25), ]
top_DGE
<- enrichr(
enrich_results genes = rownames(top_DGE),
databases = "GO_Biological_Process_2017b"
1]] )[[
Uploading data to Enrichr... Done.
Querying GO_Biological_Process_2017b... Done.
Parsing results... Done.
Some databases of interest:
GO_Biological_Process_2017b
KEGG_2019_Human
KEGG_2019_Mouse
WikiPathways_2019_Human
WikiPathways_2019_Mouse
You visualize your results using a simple barplot, for example:
{par(mfrow = c(1, 1), mar = c(3, 25, 2, 1))
barplot(
height = -log10(enrich_results$P.value)[10:1],
names.arg = enrich_results$Term[10:1],
horiz = TRUE,
las = 1,
border = FALSE,
cex.names = .6
)abline(v = c(-log10(0.05)), lty = 2)
abline(v = 0, lty = 1)
}
5 Gene Set Enrichment Analysis (GSEA)
Besides the enrichment using hypergeometric test, we can also perform gene set enrichment analysis (GSEA), which scores ranked genes list (usually based on fold changes) and computes permutation test to check if a particular gene set is more present in the Up-regulated genes, among the DOWN_regulated genes or not differentially regulated.
# Create a gene rank based on the gene expression fold change
<- setNames(DGE_cell_selection$Covid[, grep("logFC.C", colnames(DGE_cell_selection$Covid))], casefold(rownames(DGE_cell_selection$Covid), upper = T)) gene_rank
Once our list of genes are sorted, we can proceed with the enrichment itself. We can use the package to get gene set from the Molecular Signature Database (MSigDB) and select KEGG pathways as an example.
library(msigdbr)
# Download gene sets
<- msigdbr::msigdbr("Homo sapiens")
msigdbgmt <- as.data.frame(msigdbgmt)
msigdbgmt
# List available gene sets
unique(msigdbgmt$gs_subcat)
[1] "MIR:MIR_Legacy" "TFT:TFT_Legacy" "CGP" "TFT:GTRD"
[5] "" "VAX" "CP:BIOCARTA" "CGN"
[9] "GO:BP" "GO:CC" "IMMUNESIGDB" "GO:MF"
[13] "HPO" "CP:KEGG" "MIR:MIRDB" "CM"
[17] "CP" "CP:PID" "CP:REACTOME" "CP:WIKIPATHWAYS"
# Subset which gene set you want to use.
<- msigdbgmt[msigdbgmt$gs_subcat == "CP:WIKIPATHWAYS", ]
msigdbgmt_subset <- lapply(unique(msigdbgmt_subset$gs_name), function(x) {
gmt $gs_name == x, "gene_symbol"]
msigdbgmt_subset[msigdbgmt_subset
})names(gmt) <- unique(paste0(msigdbgmt_subset$gs_name, "_", msigdbgmt_subset$gs_exact_source))
Next, we will run GSEA. This will result in a table containing information for several pathways. We can then sort and filter those pathways to visualize only the top ones. You can select/filter them by either p-value
or normalized enrichment score (NES
).
library(fgsea)
# Perform enrichemnt analysis
<- fgsea(pathways = gmt, stats = gene_rank, minSize = 15, maxSize = 500, nperm = 10000)
fgseaRes <- fgseaRes[order(fgseaRes$NES, decreasing = T), ]
fgseaRes
# Filter the results table to show only the top 10 UP or DOWN regulated processes (optional)
<- fgseaRes$pathway[1:10]
top10_UP
# Nice summary table (shown as a plot)
plotGseaTable(gmt[top10_UP], gene_rank, fgseaRes, gseaParam = 0.5)
Which KEGG pathways are upregulated in this cluster? Which KEGG pathways are dowregulated in this cluster? Change the pathway source to another gene set (e.g. CP:WIKIPATHWAYS or CP:REACTOME or CP:BIOCARTA or GO:BP) and check the if you get similar results?
Finally, let’s save the integrated data for further analysis.
saveRDS(sce, "data/covid/results/bioc_covid_qc_dr_int_cl_dge.rds")
6 Session info
Click here
sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur ... 10.16
Matrix products: default
BLAS/LAPACK: /Users/asabjor/miniconda3/envs/seurat5/lib/libopenblasp-r0.3.27.dylib; LAPACK version 3.12.0
locale:
[1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
time zone: Europe/Stockholm
tzcode source: system (macOS)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] fgsea_1.28.0 msigdbr_7.5.1
[3] enrichR_3.2 dplyr_1.1.4
[5] igraph_2.0.3 pheatmap_1.0.12
[7] patchwork_1.2.0 scran_1.30.0
[9] scater_1.30.1 ggplot2_3.5.1
[11] scuttle_1.12.0 SingleCellExperiment_1.24.0
[13] SummarizedExperiment_1.32.0 Biobase_2.62.0
[15] GenomicRanges_1.54.1 GenomeInfoDb_1.38.1
[17] IRanges_2.36.0 S4Vectors_0.40.2
[19] BiocGenerics_0.48.1 MatrixGenerics_1.14.0
[21] matrixStats_1.4.1
loaded via a namespace (and not attached):
[1] bitops_1.0-8 gridExtra_2.3
[3] rlang_1.1.4 magrittr_2.0.3
[5] compiler_4.3.3 DelayedMatrixStats_1.24.0
[7] vctrs_0.6.5 pkgconfig_2.0.3
[9] crayon_1.5.3 fastmap_1.2.0
[11] XVector_0.42.0 labeling_0.4.3
[13] utf8_1.2.4 rmarkdown_2.28
[15] ggbeeswarm_0.7.2 xfun_0.47
[17] bluster_1.12.0 WriteXLS_6.7.0
[19] zlibbioc_1.48.0 beachmat_2.18.0
[21] jsonlite_1.8.8 DelayedArray_0.28.0
[23] BiocParallel_1.36.0 irlba_2.3.5.1
[25] parallel_4.3.3 cluster_2.1.6
[27] R6_2.5.1 RColorBrewer_1.1-3
[29] limma_3.58.1 Rcpp_1.0.13
[31] knitr_1.48 Matrix_1.6-5
[33] tidyselect_1.2.1 abind_1.4-5
[35] yaml_2.3.10 viridis_0.6.5
[37] codetools_0.2-20 curl_5.2.1
[39] lattice_0.22-6 tibble_3.2.1
[41] withr_3.0.1 evaluate_0.24.0
[43] pillar_1.9.0 generics_0.1.3
[45] RCurl_1.98-1.16 sparseMatrixStats_1.14.0
[47] munsell_0.5.1 scales_1.3.0
[49] glue_1.7.0 metapod_1.10.0
[51] tools_4.3.3 data.table_1.15.4
[53] BiocNeighbors_1.20.0 ScaledMatrix_1.10.0
[55] locfit_1.5-9.9 babelgene_22.9
[57] fastmatch_1.1-4 cowplot_1.1.3
[59] grid_4.3.3 edgeR_4.0.16
[61] colorspace_2.1-1 GenomeInfoDbData_1.2.11
[63] beeswarm_0.4.0 BiocSingular_1.18.0
[65] vipor_0.4.7 cli_3.6.3
[67] rsvd_1.0.5 fansi_1.0.6
[69] S4Arrays_1.2.0 viridisLite_0.4.2
[71] gtable_0.3.5 digest_0.6.37
[73] SparseArray_1.2.2 ggrepel_0.9.6
[75] dqrng_0.3.2 rjson_0.2.21
[77] htmlwidgets_1.6.4 farver_2.1.2
[79] htmltools_0.5.8.1 lifecycle_1.0.4
[81] httr_1.4.7 statmod_1.5.0