Introduction

Gene set analysis (GSA), or alternatively gene set enrichment analysis (GSEA), is a commonly used approach for interpreting omics data, whereby genes are aggregated into groups (gene sets) on the basis of their shared biological or functional properties. The aim of GSA is to interpret biological phenomena (the “how”), and from that, generate a hypothesis for the underlying mechanism (the “why”). These hypotheses can then be validated in replication studies and other independent laboratory experiments.

Genome-scale metabolic models (GEMs) are a computational representation of the chemical reactions that comprise the metabolic network of an organism or biological system. A valuable feature of GEMs is their connectivity information, such as gene-reaction, reaction-metabolite, and reaction-pathway associations. In this exercise, we will demonstrate how this connectivity information can be used to interpret transcriptomic data in the context of metabolism. Specifically, gene-metabolite and gene-pathway connections derived from Human-GEM will be used to identify regions of the human metabolic network that are significantly associated with trancriptional changes in tumors.

For this exercise we will use publically available RNA-Seq data from The Cancer Genome Atlas (TCGA) database, specifically from the Breast Invasive Carcinoma (BRCA) project.

Image from US National Cancer Institute GDC Portal homepage (portal.gdc.cancer.gov)

Setup option

In r/rmarkdown run the following code

1. If your docker installation fails, navigate to the directory where “renv.lock” file is located and run following code chunk.

install.packages("renv")
renv::restore("renv.lock")

In R Studio

2. Load the R packages that will be used in this exercise

# downloading and processing TCGA data
library(TCGAbiolinks)
library(SummarizedExperiment)

# differential expression anaylsis
library(edgeR)

# gene set (enrichment) analysis
library(fgsea)
library(piano)

# plotting
library(viridisLite)
library(ggplot2)
library(gridExtra)
library(pheatmap)
library(RColorBrewer)

# misc
library(tibble)
library(org.Hs.eg.db)

3. Specify the location of the main tutorial directory

main_dir = './session_gsa'

4. Create a TCGA_data/ subdirectory within data/ (if it does not yet exist)

tcga_dir <- file.path(main_dir, 'data', 'TCGA_data')
if (!dir.exists(tcga_dir)) { dir.create(tcga_dir) }

## Warning in dir.create(tcga_dir): cannot create dir './session_gsa/data/
## TCGA_data', reason 'No such file or directory'

Retrieve RNA-Seq data

Note: This and the Differential expression analysis section below are optional! If the GDC or Ensembl database is slow/unavailable, or you are comfortable with differential expression analysis, you can skip ahead to the Gene set analysis section below. The processed count matrix and differential expression analysis result files are already available in the data/TCGA_data/ directory.

5. Query the TCGA database (hosted by GDC). Specifically, we are querying for RNA-Seq gene counts associated with the breast cancer project, and filtering to include only sample types Primary Tumor and Solid Tissue Normal.

query <- GDCquery(project='TCGA-BRCA',  # breast cancer
                  data.category='Transcriptome Profiling',
                  data.type='Gene Expression Quantification',
                  experimental.strategy='RNA-Seq',
                  workflow.type='HTSeq - Counts',
                  sample.type=c('Primary Tumor', 'Solid Tissue Normal'))

6. For simplicity, identify subject IDs that are associated with exactly 1 normal and 1 tumor sample

# Identify subjects with paired data
q <- getResults(query)
paired_subjects <- intersect(q$cases.submitter_id[q$sample_type == 'Primary Tumor'],
                             q$cases.submitter_id[q$sample_type == 'Solid Tissue Normal'])

# extract barcodes associated with those subjects
keep <- unlist(lapply(paired_subjects, function(x) sum(q$cases.submitter_id %in% x) == 2))
barcodes <- q$cases[q$cases.submitter_id %in% paired_subjects[keep]]

7. Update our query, now filtering samples to include only those in our list of barcodes

query <- GDCquery(project='TCGA-BRCA',
                  data.category='Transcriptome Profiling',
                  data.type='Gene Expression Quantification',
                  experimental.strategy='RNA-Seq',
                  workflow.type='HTSeq - Counts',
                  sample.type=c('Primary Tumor', 'Solid Tissue Normal'),
                  barcode=barcodes)  # specify sample barcodes

8. Download count data. The data will be stored in separate folders for every sample, with very non-human-friendly names (take a look into the directory to see for yourself…).

GDCdownload(query, directory=tcga_dir)

Note: Even though we specify the directory tcga_dir where we want the files to be downloaded, it will still place a MANIFEST.txt file in your current working directory. You can just delete that file.

9. Load the RNA-Seq gene counts from the downloaded files.

# loaded as a "SummarizedExperiment" object
TCGAdata <- GDCprepare(query, directory=tcga_dir)

# If that fails (i.e., Ensembl unavaiable)
#TCGAdata <- readRDS(file.path(tcga_dir, 'TCGAdata.rds'))

10. Further filter the data to only include subjects with the Luminal B subtype (LumB) of breast cancer

meta <- colData(TCGAdata)
keep_subjects <- meta$patient[meta$paper_BRCA_Subtype_PAM50 %in% 'LumB']
TCGAdata <- TCGAdata[, meta$patient %in% keep_subjects]

11. Extract the relevant fields from the TCGAdata SummarizedExperiment object

counts <- assay(TCGAdata)
gene_symbols <- rowData(TCGAdata)$external_gene_name
subject_id <- as.factor(TCGAdata$patient)

# We explicitly state the ordering of the levels for sample_type as healthy -> diseased so that
# it uses this order for the differential expression analysis below.
sample_type <- factor(TCGAdata$sample_type, levels=c('Solid Tissue Normal', 'Primary Tumor'))

12. Clear some intermediate variables to free some memory

rm(TCGAdata, query, q)
invisible(gc())

Differential expression analysis

Note: The differential expression (DE) analysis here will be performed using the edgeR package. Since the focus of this workshop is not on DE analysis, we don’t provide much detail surrounding the commands below. However, you can find a lot of examples and detail in the very helpful edgeR User’s Guide. Another excellent tool for DE analysis is DESeq2, which also has a fantastic User Guide.

13. Create DGEList object from the gene counts matrix

y <- DGEList(counts=counts)

14. Generate a design matrix such that samples are paired by subject ID

design <- model.matrix(~subject_id + sample_type)
rownames(design) <- colnames(y)

15. Filter low-count genes

keep <- filterByExpr(y, design=design, min.count=10)
counts <- counts[keep, ]
y <- y[keep, , keep.lib.sizes=F]
gene_symbols <- gene_symbols[keep]

16. Calculate normalization factors, and estimate dispersion

y <- calcNormFactors(y)
y <- estimateDisp(y, design)  # calc time is ~1 minute

17. Fit a generalized linear model (GLM) to the data and test for DE genes

fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit)
topTags(qlf)  # view top DE genes

18. Extract results table and add FDR-adjusted p-values

res.table <- qlf$table
res.table$FDR <- p.adjust(res.table$PValue, 'BH')  # BH = Benjamini-Hochberg method
res.table <- cbind(gene_symbols, res.table) %>% rownames_to_column(var='gene')

19. View the top DE genes again, this time with the gene symbols added to the table

res.table[order(res.table$FDR)[1:10], c('gene_symbols', 'logFC', 'logCPM', 'PValue', 'FDR')]

20. Export filtered count matrix and DE results to a plain text file

colnames(counts) <- paste(c('normal', 'tumor')[sample_type], as.numeric(subject_id), sep='.')
counts <- rownames_to_column(as.data.frame(counts), var='gene')
write.table(counts, file=file.path(tcga_dir, 'count_matrix.txt'), quote=F, sep='\t', row.names=F)
write.table(res.table, file=file.path(tcga_dir, 'DE_results.txt'), quote=F, sep='\t', row.names=F)

21. Also calculate median counts per million (CPM) for each sample type, and export to a plain text file

median_cpm <- data.frame(Normal=apply(cpm(y)[, sample_type == 'Solid Tissue Normal'], 1, median),
                         Tumor=apply(cpm(y)[, sample_type == 'Primary Tumor'], 1, median))
median_cpm <- median_cpm %>% rownames_to_column(var='gene')
write.table(median_cpm, file=file.path(tcga_dir, 'median_cpm.txt'), quote=F, sep='\t', row.names=F)

Gene set analysis

There are many different methods for performing gene set analysis (GSA), but Piano is convenient in that it implements many of these variants in a single package (see more info here). In this example, we will use one of the gene set collections (GSCs) we extracted from Human-GEM to explore which areas of metabolism are enriched in significant expression changes between normal and tumor samples.

22. Load DE results (if not already loaded)

de.res <- read.delim("./data/TCGA_data/DE_results.txt", row.names=1)

23. Load one of the GEM-derived gene set collections

#gsc <- loadGSC(file.path(main_dir, 'data', 'gene_set_collections', 'HumanGEM_subsystem_GSC.gmt'))
gsc <- piano::loadGSC("./data/gene_set_collections/HumanGEM_subsystem_GSC.gmt")
# gsc <- loadGSC(file.path(main_dir, 'data', 'gene_set_collections', 'HumanGEM_metabolite_GSC.gmt'))

24. View a summary of the GSC to make sure that it was properly loaded, and to view some basic stats about the collection

gsc

## First 50 (out of 134) gene set names:
##  [1] "Acyl-CoA hydrol..." "Acylglycerides ..." "Alanine, aspart..."
##  [4] "Amino sugar and..." "Aminoacyl-tRNA ..." "Androgen metabo..."
##  [7] "Arachidonic aci..." "Arginine and pr..." "Ascorbate and a..."
## [10] "Beta oxidation ..." "Beta oxidation ..." "Beta oxidation ..."
## [13] "Beta oxidation ..." "Beta oxidation ..." "Beta oxidation ..."
## [16] "Beta oxidation ..." "Beta oxidation ..." "Beta oxidation ..."
## [19] "Beta oxidation ..." "Beta oxidation ..." "Beta oxidation ..."
## [22] "Beta-alanine me..." "Bile acid biosy..." "Bile acid recyc..."
## [25] "Biopterin metab..." "Biotin metaboli..." "Blood group bio..."
## [28] "Butanoate metab..." "C5-branched dib..." "Carnitine shutt..."
## [31] "Carnitine shutt..." "Carnitine shutt..." "Carnitine shutt..."
## [34] "Cholesterol bio..." "Cholesterol bio..." "Cholesterol bio..."
## [37] "Cholesterol met..." "Chondroitin / h..." "Chondroitin sul..."
## [40] "CoA synthesis"      "Cysteine and me..." "Drug metabolism"   
## [43] "Eicosanoid meta..." "Estrogen metabo..." "Ether lipid met..."
## [46] "Fatty acid acti..." "Fatty acid acti..." "Fatty acid bios..."
## [49] "Fatty acid bios..." "Fatty acid bios..."
## 
## First 50 (out of 3628) gene names:
##  [1] "ENSG00000097021" "ENSG00000101473" "ENSG00000119673" "ENSG00000123130"
##  [5] "ENSG00000136881" "ENSG00000177465" "ENSG00000184227" "ENSG00000205669"
##  [9] "ENSG00000006757" "ENSG00000058866" "ENSG00000062282" "ENSG00000065357"
## [13] "ENSG00000074416" "ENSG00000077044" "ENSG00000079435" "ENSG00000100344"
## [17] "ENSG00000100997" "ENSG00000101670" "ENSG00000102780" "ENSG00000106384"
## [21] "ENSG00000114771" "ENSG00000115159" "ENSG00000115884" "ENSG00000116906"
## [25] "ENSG00000124003" "ENSG00000134780" "ENSG00000136267" "ENSG00000142798"
## [29] "ENSG00000145214" "ENSG00000149091" "ENSG00000153933" "ENSG00000157680"
## [33] "ENSG00000163686" "ENSG00000164535" "ENSG00000166035" "ENSG00000166391"
## [37] "ENSG00000170835" "ENSG00000175445" "ENSG00000175535" "ENSG00000177666"
## [41] "ENSG00000182333" "ENSG00000185000" "ENSG00000187021" "ENSG00000203837"
## [45] "ENSG00000266200" "ENSG00000274588" "ENSG00000277494" "ENSG00000006625"
## [49] "ENSG00000021826" "ENSG00000070669"
## 
## Gene set size summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   10.00   21.00   48.87   53.75  979.00 
## 
## First 10 gene sets with additional info:
##       Gene set                  Additional info
##  [1,] "Acyl-CoA hydrolysis"     "na"           
##  [2,] "Acylglycerides metab..." "na"           
##  [3,] "Alanine, aspartate a..." "na"           
##  [4,] "Amino sugar and nucl..." "na"           
##  [5,] "Aminoacyl-tRNA biosy..." "na"           
##  [6,] "Androgen metabolism"     "na"           
##  [7,] "Arachidonic acid met..." "na"           
##  [8,] "Arginine and proline..." "na"           
##  [9,] "Ascorbate and aldara..." "na"           
## [10,] "Beta oxidation of br..." "na"

Overrepresentation analysis

We will first use Piano to perform an overrepresentation analysis - i.e., given a list of “interesting” genes, which gene sets (if any) contain more of these genes than one would expect due to random chance? This overrepresentation or enrichment effect can be quantified using the hypergeometric test.

25. Generate a list of “interesting” genes. In this case, we (arbitrarily) select genes that are differentially expressed with an FDR < 0.01 and an absolute log2(fold-change) > 2.

selected_genes = rownames(de.res)[de.res$FDR < 0.01 & abs(de.res$logFC) > 2]
length(selected_genes)  # see how many genes are in our list

## [1] 3841

26. Define the background or “universe” of genes from which we selected our list of genes. Here, we define the universe as all genes in the DE results and the GSC.

all_genes <- union(rownames(de.res), unlist(gsc$gsc))
length(all_genes)

## [1] 27830

Note: It may be better if the “universe” includes all genes that were detected in the RNA-Seq dataset. Since we filtered out some of the low-count genes before performing our DE analysis, our universe may be somewhat incomplete. It is good practice to run additional analyses with variations in parameters such as the “gene universe” to ensure that the results are robust to such changes.

27. Run a hypergeometric test with our selected list of genes using the Piano runGSAhyper function

res <- runGSAhyper(genes=selected_genes, universe=all_genes, gsc=gsc, gsSizeLim=c(10, 500))

## Analyzing the overrepresentation of 3841 genes of interest in 100 gene sets, using a background of 23989 non-interesting genes.

Question 1: The gsSizeLim parameter sets the limits on the number of genes allowed in a gene set. Why would we want to filter out gene sets that contain very few or very many genes?

Answer: Very small gene sets (e.g., those with fewer than 5 or 10 genes) are generally uninformative and often lack the statistical power to effectively quantify the significance of their enrichment. Very large gene sets (containing several hundreds or thousands of genes) are usually so broad that it is difficult to interpret their results, and some algorithms are slowed when having to work with large gene sets.

28. View the top 10 most enriched gene sets and their stats

res_table <- res$resTab[order(res$pvalues), ]  # sort by p-value
knitr::kable(res_table[1:10, ])

	p-value	Adjusted p-value	Significant (in gene set)	Non-significant (in gene set)	Significant (not in gene set)	Non-significant (not in gene set)
Tyrosine metabolism	0.0000286	0.0028565	21	39	3820	23950
Miscellaneous	0.0003113	0.0155631	26	68	3815	23921
Serotonin and melatonin biosynthesis	0.0029446	0.0981525	16	40	3825	23949
Retinol metabolism	0.0083115	0.2077882	28	100	3813	23889
Fatty acid activation (cytosolic)	0.0218572	0.3850581	6	11	3835	23978
Metabolism of xenobiotics by cytochrome P450	0.0231035	0.3850581	22	81	3819	23908
Pool reactions	0.0376259	0.4560020	6	13	3835	23976
Phenylalanine, tyrosine and tryptophan biosynthesis	0.0384826	0.4560020	39	174	3802	23815
Bile acid biosynthesis	0.0410402	0.4560020	20	77	3821	23912
Acylglycerides metabolism	0.0800916	0.7235754	9	30	3832	23959

29. Visualize the results as an interactive network using the networkPlot2 function. Gene sets are represented as nodes that are sized based on the number of genes in the set, and colored according to the enrichment p-value (log-transformed). Nodes are connected if the gene sets share a specified fraction of genes (overlap), with thicker edges indicating more overlap.

networkPlot2(res,
             class="non",
             adjusted=F,  # too few points when using FDR-adjusted p-values
             significance=0.05,
             overlap=0.4,
             lay='2',
             labelSize=16,
             ncharLabel=50,
             scoreColors=rev(magma(50)))  # darker = more significant

Reporter metabolite/pathway analysis

For the overrepresentation analysis, we evaluated the enrichment of a list of genes among different gene sets. However, if we want to use the information from all of the genes in our DE results, we can perform a GSA using one of the methods available in Piano.

30. Extract p-values and log2fold-changes from DE results dataframe

pval <- de.res$PValue %>% setNames(rownames(de.res))
logfc <- de.res$logFC %>% setNames(rownames(de.res))

31. Run a gene set analysis (GSA) using the “Reporter” statistic. To understand the meaning of each parameter, view the help information for the runGSA function.

gsaRes <- runGSA(geneLevelStats=pval,
                 directions=logfc,
                 geneSetStat='reporter',
                 signifMethod='geneSampling',
                 gsc=gsc,
                 gsSizeLim=c(10, 500),
                 nPerm=1000)  # 10,000+ is better, but slower

## Final gene/gene-set association: 2221 genes and 99 gene-sets
##   Details:
##   Calculating gene set statistics from 2221 out of 27635 gene-level statistics
##   Using all 27635 gene-level statistics for significance estimation
##   Removed 195 genes from GSC due to lack of matching gene statistics
##   Removed 0 gene sets containing no genes after gene removal
##   Removed additionally 35 gene sets not matching the size limits
##   Loaded additional information for 99 gene sets
## 
## Gene statistic type: p-like
## Method: reporter
## Gene-set statistic name: Z 
## Significance: Gene sampling
## Adjustment: fdr
## Gene set size limit: (10,500)
## Permutations: 1000 
## Total run time: 0.33 min

32. View the results in a tabular format

gsaRes_table <- GSAsummaryTable(gsaRes)

View(gsaRes_table)

33. Visualize the top most significant gene sets using a heatmap.

Note: The GSAheatmap function often has a figure margins too large error when using the default colorkey=TRUE. If you get this error, reset the plot window using the dev.off() command. You can then try to resize your plot window to a larger size before re-running the GSAheatmap function, or turn the key off (colorkey=FALSE), or skip this command and continue with the custom heatmap generation in the next step.

# include top 3 gene sets of each directional category, and print the
# gene set rank (lower = more significant) in the heatmap cells
GSAheatmap(gsaRes, cutoff=3, adjusted=T, cellnote='rank', colorkey=F,
           cex=1, ncharLabel=50)

Note: There are many repeated numbers in the heatmap (e.g., rank 1 is repeated very often). This is because these gene sets have identical p-values, and therefore have equal ranks. We could re-run the GSA with more permutations (nPerm) to obtain a higher resolution on the p-values and potentially begin to separate some of these gene sets, but it will come at the cost of increased computation time.

Another option is to generate our own custom heatmap. Although the setup is a bit more involved, it allows for more flexibility and customization.

34. Extract the adjusted p-value columns from the GSA results structure, and filter out rows that do not have at least one FDR-adjusted p-value < 0.05

table_sel <- gsaRes_table[, c('p adj (dist.dir.dn)', 'p adj (mix.dir.dn)', 'p adj (non-dir.)',
                              'p adj (mix.dir.up)', 'p adj (dist.dir.up)')]
table_sel[is.na(table_sel)] <- 1  # set NA p-values equal to 1
rownames(table_sel) <- gsaRes_table$Name
table_sel <- table_sel[apply(table_sel, 1, min) < 0.05, ]

35. Generate the heatmap using the pheatmap function

pheatmap(-log10(table_sel),  # log-transform p-values
         scale='none',
         color=rev(magma(100)),
         cluster_cols=F,
         clustering_distance_rows='euclidean',
         clustering_method='ward.D2',
         breaks=seq(0, 2, len=100),
         border_color='black',
         angle_col=90)

36. We can also visualize the results as an interactive network

# define custom palette where blue = expression decrease and red = expression increase
pal <- colorRampPalette(brewer.pal(11, 'RdBu'))
networkPlot2(gsaRes,
             class='distinct',
             direction='both',
             significance=0.05,
             overlap=0.1,
             physics=F,  # physics=T can be slow/problematic for large networks
             lay='2',
             edgeWidth=c(5,20),
             ncharLabel=25,
             scoreColors=pal(20))

Yet another way to view the results of a GSA (and the gene set collection itself) is through the Piano exploreGSAresults function, which uses rShiny to enable interactive exploration of your GSA results. This function will open up an interactive session in your internet browser. Simply close the tab/window when you’re finished to continue working in your R session.

37. Explore the results interactively using a local rShiny app

exploreGSAres(gsaRes)

Gene set enrichment analysis

One of the gene set analysis algorithms included in the Piano package is named (perhaps confusingly) gene set enrichment analysis (GSEA), and is described in Subramanian et al. PNAS 2005.

GSEA is one of the more commonly used gene set analysis algorithms. Although the method can be called from Piano’s runGSA function (geneSetStat='gsea' or geneSetStat='fgsea'), we will use functions from the fgsea package directly because they offer some additional features.

Similar to the reporter analysis above, GSEA uses information from all genes in the dataset, not just a list of interesting or significant genes.

38. We will use the same DE results from text file loaded previously.

#de.res <- read.delim(file.path(tcga_dir, 'DE_results.txt'), row.names=1)

Unlike the get set analysis above using runGSA, we cannot input the differential expression fold-changes and p-values separately. We instead need to create a combined gene score that accounts for both. In this example, we will generate a signed Z-score such that significantly decreased genes have a very negative score, significantly increased genes have a very positive score, and genes exhibiting non-significant expression changes have a score near zero.

Note: This is just one approach for generating gene scores - there is no single “correct” approach, as it will depend on the question being asked and the type/behavior of data being used. Often it is good to try a few different variations to determine what impact the selected scoring method has on your results.

39. Generate Z-scores from the p-values

gene_scores <- qnorm(0.5*de.res$PValue, lower.tail=F)
gene_scores[de.res$logFC < 0] <- -gene_scores[de.res$logFC < 0]
names(gene_scores) <- rownames(de.res)

40. Load gene set collection file.

Note: The fgsea package uses a different format than Piano for the GSC object, so we need to re-load the file using the package’s gmtPathways function.

# Option 1: GSC based on GEM subsystems
gsc_fgsea <- fgsea::gmtPathways("./data/gene_set_collections/HumanGEM_subsystem_GSC.gmt")

# Option 2: GSC based on GEM metabolites
# gsc_fgsea <- gmtPathways("./data/gene_set_collections/HumanGEM_metabolite_GSC.gmt"))

41. Run gene set enrichment analysis (GSEA)

# fgseaRes <- fgsea(gsc_fgsea, gene_scores, nperm=10000, minSize=5, maxSize=500)  # significance estimated by gene shuffling
fgseaRes <- fgseaMultilevel(gsc_fgsea, gene_scores, minSize=5, maxSize=500, eps=0)  # significance estimated by MC approach
fgseaRes <- fgseaRes[order(pval), ]  # sort by gene set significance

42. Create an enrichment plot for the top 3 “positive” and top 3 “negative” gene sets. Gene sets with a positive Enrichment Score (ES) represent those that are enriched with genes that are increased in abundance in tumor samples relative to the healthy controls, whereas those with a negative ES are enriched with genes that have decreased in abundance.

# get top 3 positive and negative gene sets
gs_select <- c(head(fgseaRes$pathway[fgseaRes$NES > 0], 3),
               head(fgseaRes$pathway[fgseaRes$NES < 0], 3))

# iterate through gene sets and create an enrichment plot for each
p <- list()
for (gs in gs_select) {
  p[[gs]] <- plotEnrichment(gsc_fgsea[[gs]], gene_scores) + ggtitle(gs)
}
do.call(grid.arrange, c(p, list(ncol=3)))

43. For contrast, create an enrichment plot for one of the non-significant gene sets

ns_pathway <- tail(fgseaRes$pathway[fgseaRes$size > 50], 1)  # select one with >50 genes
plotEnrichment(gsc_fgsea[[ns_pathway]], gene_scores) + ggtitle(ns_pathway)

44. View the genes comprising the Leading Edge of the top few gene sets. The Leading Edge is the subset of genes in the gene set that are primarily responsible for it being identified as “significant”.

# select the leading edges for the same (top 3 positive and negative) gene sets
leadingEdge <- fgseaRes$leadingEdge %>% setNames(fgseaRes$pathway)
edge_select <- leadingEdge[gs_select]

# convert the Ensembl IDs to gene symbols
eid2sym <- AnnotationDbi::select(org.Hs.eg.db, unique(unlist(edge_select)), 'SYMBOL', 'ENSEMBL')
edge_select <- lapply(edge_select, function(x) sort(unique(eid2sym$SYMBOL[eid2sym$ENSEMBL %in% x])))

# print leading edge genes for top 3 positive gene sets
edge_select[1:3]

## $`Oxidative phosphorylation`
##  [1] "ATP5F1B" "ATP5F1C" "ATP5F1E" "ATP5MC1" "ATP5MC2" "ATP5MC3" "ATP5ME" 
##  [8] "ATP5MF"  "ATP5MJ"  "ATP5MK"  "ATP5PB"  "ATP5PD"  "COX5A"   "COX5B"  
## [15] "COX6A1"  "COX6A2"  "COX6B1"  "COX6C"   "COX7A2"  "COX7B"   "COX7B2" 
## [22] "COX7C"   "COX8A"   "CYC1"    "NDUFA1"  "NDUFA12" "NDUFA13" "NDUFA2" 
## [29] "NDUFA3"  "NDUFA4"  "NDUFA5"  "NDUFA6"  "NDUFA8"  "NDUFA9"  "NDUFAB1"
## [36] "NDUFB10" "NDUFB11" "NDUFB2"  "NDUFB3"  "NDUFB4"  "NDUFB5"  "NDUFB9" 
## [43] "NDUFC1"  "NDUFC2"  "NDUFS1"  "NDUFS2"  "NDUFS3"  "NDUFS6"  "NDUFS8" 
## [50] "NDUFV2"  "PRUNE1"  "SDHC"    "UQCR10"  "UQCR11"  "UQCRH"   "UQCRQ"  
## 
## $`Nucleotide metabolism`
##  [1] "ACP2"      "ADK"       "CANT1"     "CMPK2"     "DCK"       "DCTPP1"   
##  [7] "DGUOK"     "DNM1L"     "DNM2"      "DTYMK"     "DUT"       "EEF1A2"   
## [13] "EFL1"      "ENTPD6"    "ENTPD8"    "GART"      "GFM1"      "GFM2"     
## [19] "ITPA"      "MTPAP"     "NME1"      "NME1-NME2" "NME2"      "NME3"     
## [25] "NME4"      "NME6"      "NME7"      "NT5C3A"    "NT5M"      "NUDT5"    
## [31] "NUDT9"     "PAPOLA"    "PDE6D"     "PDE6G"     "PNP"       "POLA1"    
## [37] "POLA2"     "POLB"      "POLD1"     "POLD2"     "POLD3"     "POLD4"    
## [43] "POLE"      "POLE2"     "POLE3"     "POLG2"     "POLH"      "POLQ"     
## [49] "POLR1A"    "POLR1B"    "POLR1C"    "POLR1H"    "POLR2B"    "POLR2D"   
## [55] "POLR2G"    "POLR2H"    "POLR2I"    "POLR2J"    "POLR2K"    "POLR3A"   
## [61] "POLR3B"    "POLR3C"    "POLR3F"    "POLR3K"    "RAD1"      "RFC5"     
## [67] "TK1"       "TUFM"      "TXNRD1"   
## 
## $`Aminoacyl-tRNA biosynthesis`
##  [1] "AARS1" "CARS1" "EPRS1" "FARSA" "FARSB" "GARS1" "IARS1" "MARS1" "NARS1"
## [10] "RARS1" "TARS1" "VARS1" "YARS1"

# print leading edge genes for top 3 negative gene sets
edge_select[4:6]

## $`Serotonin and melatonin biosynthesis`
##  [1] "ADH1A"   "ADH1B"   "ADH1C"   "ADH4"    "ADH5"    "ADH6"    "ADHFE1" 
##  [8] "AOX1"    "CHST2"   "CHST4"   "CHST7"   "CHST9"   "INMT"    "MAOA"   
## [15] "NDST1"   "SULT1C4" "UST"    
## 
## $`Tyrosine metabolism`
##  [1] "ADH1A"      "ADH1B"      "ADH1C"      "ADH4"       "ADH5"      
##  [6] "ADH6"       "ADHFE1"     "ALDH1A3"    "ALDH3A1"    "AOC3"      
## [11] "AOX1"       "CSGALNACT1" "DUOX2"      "GSTM2"      "MAOA"      
## [16] "TPO"        "UGT2B7"    
## 
## $`Retinol metabolism`
##  [1] "ADH1A"    "ADH1B"    "ADH1C"    "ADH4"     "ADH5"     "ADH6"    
##  [7] "ADHFE1"   "AKR1B15"  "AKR1C1"   "ALDH1A1"  "ALDH1A2"  "BDH2"    
## [13] "CES1"     "CYP26B1"  "CYP2U1"   "CYP3A4"   "CYP3A5"   "CYP4F12" 
## [19] "DGAT2"    "DHRS3"    "HSD17B11" "LIPE"     "PNPLA2"   "RBP4"    
## [25] "RDH5"     "RPE65"    "UGT2B7"

45. Create a table plot of the top 10 positive and negative gene sets

topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))

plotGseaTable(gsc_fgsea[topPathways], gene_scores, fgseaRes, gseaParam=0.5)

46. For GSCs with gene sets that are subsets of each other or are largely overlapping (very common with gene-metabolite associations), create a table using only independent gene sets

# determine which pathways are largely a subset of another pathway
collapsedPathways <- collapsePathways(fgseaRes[order(pval)][padj < 0.05], gsc_fgsea, gene_scores)

# see if any pathways are significantly overlapping
head(collapsedPathways$parentPathways[!is.na(collapsedPathways$parentPathways)])

##                                Miscellaneous 
##                        "Tyrosine metabolism" 
## Metabolism of xenobiotics by cytochrome P450 
##                        "Tyrosine metabolism"

47. View a table plot of significant gene sets, excluding the larger “parent pathways”

mainPathways <- fgseaRes[pathway %in% collapsedPathways$mainPathways][order(-NES), pathway]
plotGseaTable(gsc_fgsea[mainPathways], gene_scores, fgseaRes, gseaParam=0.5)

Revisiting gene-level data

It is always good to go back to the original gene-level data when interpreting GSA results. Here we will use a heatmap to visualize gene-level data for selected gene-sets.

48. Load the RNA-Seq gene count matrix

counts <- read.table("./data/TCGA_data/count_matrix.txt", row.names=1, header=T)

49. Subset the count matrix to include only genes in a specific gene set. We will choose an example gene set that appeared as significant in many of the results above: Serotonin and melatonin biosynthesis.

cs <- counts[row.names(counts) %in% gsc$gsc$`Serotonin and melatonin biosynthesis`, ]

50. Convert row names from Ensembl IDs to gene symbols for easier interpretation

eid2sym <- AnnotationDbi::select(org.Hs.eg.db, row.names(cs), 'SYMBOL', 'ENSEMBL')

# if the symbol is missing (NA), just keep the Ensembl ID
eid2sym$SYMBOL[is.na(eid2sym$SYMBOL)] <- eid2sym$ENSEMBL[is.na(eid2sym$SYMBOL)]

row.names(cs) <- make.unique(eid2sym$SYMBOL[match(row.names(cs), eid2sym$ENSEMBL)])

51. Prepare gene (row) and sample (column) annotation information that will be included with the heatmap

# row annotation: highlight genes in the "leading edge" from the GSEA analysis above
LE_genes <- eid2sym$SYMBOL[match(leadingEdge$`Serotonin and melatonin biosynthesis`, eid2sym$ENSEMBL)]
rowAnn <- data.frame(LeadingEdge=ifelse(row.names(cs) %in% LE_genes, 'True', 'False'),
                     row.names=row.names(cs))

# also indicate which genes were significantly differentially expressed
DE_down_genes <- de.res$gene_symbols[de.res$FDR < 0.01 & de.res$logFC < 0]
DE_up_genes <- de.res$gene_symbols[de.res$FDR < 0.01 & de.res$logFC > 0]
rowAnn$Direction <- ifelse(row.names(cs) %in% DE_down_genes, 'Decrease', 'ns')
rowAnn$Direction[row.names(cs) %in% DE_up_genes] <- 'Increase'

# column annotation: color by sample type (Normal or Tumor)
colAnn <- data.frame(Sample=ifelse(grepl('normal', colnames(cs)), 'Normal', 'Tumor'))
row.names(colAnn) <- colnames(cs)

# specify annotation colors
ann_colors <- list(LeadingEdge = c(True='black', False='gray80'),
                   Direction = c(Decrease='#2166AC', ns='white', Increase='#B2182B'),
                   Sample = c(Normal='#defcc2', Tumor='#634901'))

Generate the heatmap, adding the annotation information as colorbars beside each axis

# generate heatmap
pheatmap(log2(cs+1),
         color=rev(magma(100)),
         breaks=seq(-2, 2, len=100),
         border_color='black',
         scale='row',
         annotation_row=rowAnn,
         annotation_col=colAnn,
         annotation_colors=ann_colors,
         labels_col=rep('',ncol(cs)))  # omit column labels

Note: We scaled the rows (genes) so the colors will reflect the difference for each gene individually, which means that the values are not comparable across genes.

Do the gene-level data shown in the heatmap seem reasonable given that this gene set was identified as significantly enriched in genes that decreased in expression? And regarding the subset of genes annotated as part of the “Leading Edge” identified by GSEA - does it make sense that these genes are more “responsible” for the gene set being identified as significant than some of the other genes?

Exploration using Metabolic Atlas

When analyzing data in the context of metabolism, it can be helpful to visualize the genes and reaction pathways and how they are connected/distributed. Metabolic Atlas is an online resource that faciliates the exploration of GEMs by providing metabolic network contents in tabular, map, or network format.

To demonstrate some of its functions, we will use Metabolic Atlas to explore some of the results that we found interesting/significant in the above analyses.

Data overlay on 2D maps

Metabolic Atlas contains manually drawn 2D maps of metabolic compartments and subsystems (pathways) corresponding to those in the Human-GEM model. The maps allow users to overlay pre-loaded mRNA expression data from the Human Protein Atlas, as well as user-provided data.

To demonstrate the data overlay function, we will use the median gene CPM values generated during the initial retrieval of the TCGA data (median_cpm.txt). Note that more detailed information on using the data overlay function can be found in the Metabolic Atlas Documentation.

52. Navigate to the Map Viewer for Human-GEM on Metabolic Atlas.

53. From the Subsystems menu on the left panel, choose one of subsystems that we found as significant in our analysis above - Serotonin and melatonin biosynthesis.

54. Click the DATA OVERLAY bar on the right side of the window to expand the overlay menu.

55. Near the top of the overlay menu, select Choose a file, and use the file browser to navigate to the session_gems/data/TCGA_data/ subdirectory and select the median_cpm.txt file that should be there.

You should now see the file name median_cpm.txt in green below the button, indicating that the file was successfully loaded and in the correct format.

56. In the Data 1 and Data 2 (for comparison) boxes, select Normal and Tumor, respectively, from the uploaded data drop-down menus (the second menu in each box). The sidebar should now look as follows:

The log fold-change in median CPM should now be overlaid onto the gene labels of the map as a color, ranging from blue (decrease) to red (increase). It is usually easier to view the colors if the subsystem highlighting is removed, by toggling the Show/Hide subsystem button:

To make the map view larger, you can collapse the DATA OVERLAY side panel by clicking its green bar, or use the full screen button:

The map should look something like this:

Overall, the pathway shows many blue gene labels, representing an expression decrease in tumor relative to normal samples, which is consistent with the GSA results. There are also a few genes with increased expression, which was also visible in the gene-level heatmap. The advantage of the pathway representation is that we can now easily see which parts of the pathway appear to be coordinated expression changes, and which are more “mixed” (increases and decreases).

Furthermore, selecting a gene, metabolite, or reaction on the map will highlight all appearances of that component on the map, as well as provide a brief summary on the left side bar. Following the GEM Browser link below the short summary will take the user to the more detailed page for the selected component.

Interaction partners

As discussed above, a key benefit of a GEM is the interaction information that it contains. The metabolite-gene interactions were, for example, used to extract a metabolite-based gene set collection from the Human-GEM. This same type of information can be viewed on Metabolic Atlas using the Interaction Partners tool.

To demonstrate, we will view for example one of the metabolites that is identified as significantly associated with gene expression increases when using the metabolite-gene GSC (HumanGEM_metabolite_GSC.gmt): 4-hydroxyproline (named trans-4-hydroxy-L-proline in the model/GSC)

57. Navigate to the Interaction Partners tool on Metabolic Atlas, and enter trans-4-hydroxy-L-proline in the search bar. Select the top entry in the drop-down menu that appears below the search bar (the Cytosol form of the metabolite)

Note: As in Human-GEM, Metabolic Atlas contains metabolites that are repeated in different compartments; e.g., H2O[c] and H2O[m] represents water in the cytoplasm and in the mitochondria, respectively. Although identical in composition, these metabolites are assigned different IDs in the model and different pages/entries in Metabolic Atlas to maintain the spatial organization of metabolism among different organelles.

The interaction partners graph for 4-hydroxyproline will open, showing the genes and other metabolites with which the metabolite is associated based on shared reactions. From here, one can right-click on a gene or other metabolite to expand or re-center on its interaction partners.

58. To view 4-hydroxyproline in a pathway context, navigate to the GEM Browser for the metabolite by selecting the trans-4-hydroxy-L-proline node at the center of the graph, and clicking the GEM Browser link to the right of the graph window.

59. Once in the trans-4-hydroxy-L-proline GEM Browser, select one of the 2D pathway map links shown on the Map Viewer right-side panel: Cytosol (part5) or Arginine and proline metabolism.

60. If our uploaded data (from medium_cpm.txt) is no longer loaded, simply re-load the file using the DATA OVERLAY side panel and again select Normal and Tumor for Data 1 and Data 2.

We can now view the trans-4-hydroxy-L-proline metabolite within its reaction pathway, where it is clear that many of the genes encoding the surrounding reactions exhibit increased expression in tumor relative to normal tissue.

If viewing the subsystem map for Arginine and Proline Metabolism, you may notice that many of the other genes involved in this subsystem exhibit a decreased expression in tumor samples. This highlights the difference in information that can be extracted from data when using metabolite-gene compared to subsystem-gene associations. Specifically, that although the Arginine and Proline Metabolism subsystem was not identified as significantly enriched in expression changes when using the subsystem GSC, applying the metabolite GSC revealed a significantly upregulated subnetwork of this pathway surrounding the 4-hydroxyproline metabolite.

Session info

This page was generated using the following R session:

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices datasets  utils    
## [8] methods   base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.13.0         AnnotationDbi_1.54.1       
##  [3] tibble_3.1.4                RColorBrewer_1.1-2         
##  [5] pheatmap_1.0.12             gridExtra_2.3              
##  [7] ggplot2_3.3.5               viridisLite_0.4.0          
##  [9] piano_2.8.0                 fgsea_1.18.0               
## [11] edgeR_3.34.1                limma_3.48.3               
## [13] SummarizedExperiment_1.22.0 Biobase_2.52.0             
## [15] GenomicRanges_1.44.0        GenomeInfoDb_1.28.4        
## [17] IRanges_2.26.0              S4Vectors_0.30.0           
## [19] BiocGenerics_0.38.0         MatrixGenerics_1.4.3       
## [21] matrixStats_0.60.1          TCGAbiolinks_2.20.0        
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-2            ellipsis_0.3.2             
##   [3] XVector_0.32.0              farver_2.1.0               
##   [5] DT_0.19                     bit64_4.0.5                
##   [7] fansi_0.5.0                 xml2_1.3.2                 
##   [9] R.methodsS3_1.8.1           cachem_1.0.6               
##  [11] knitr_1.33                  jsonlite_1.7.2             
##  [13] cluster_2.1.2               dbplyr_2.1.1               
##  [15] png_0.1-7                   R.oo_1.24.0                
##  [17] shinydashboard_0.7.1        shiny_1.6.0                
##  [19] readr_2.0.1                 compiler_4.1.0             
##  [21] httr_1.4.2                  assertthat_0.2.1           
##  [23] Matrix_1.3-4                fastmap_1.1.0              
##  [25] later_1.3.0                 visNetwork_2.0.9           
##  [27] htmltools_0.5.2             prettyunits_1.1.1          
##  [29] tools_4.1.0                 igraph_1.2.6               
##  [31] gtable_0.3.0                glue_1.4.2                 
##  [33] GenomeInfoDbData_1.2.6      dplyr_1.0.7                
##  [35] rappdirs_0.3.3              fastmatch_1.1-3            
##  [37] Rcpp_1.0.7                  slam_0.1-48                
##  [39] jquerylib_0.1.4             vctrs_0.3.8                
##  [41] Biostrings_2.60.2           xfun_0.25                  
##  [43] stringr_1.4.0               rvest_1.0.1                
##  [45] mime_0.11                   lifecycle_1.0.0            
##  [47] renv_0.14.0                 gtools_3.9.2               
##  [49] XML_3.99-0.7                zlibbioc_1.38.0            
##  [51] scales_1.1.1                hms_1.1.0                  
##  [53] promises_1.2.0.1            relations_0.6-9            
##  [55] sets_1.0-18                 yaml_2.2.1                 
##  [57] curl_4.3.2                  memoise_2.0.0              
##  [59] downloader_0.4              sass_0.4.0                 
##  [61] biomaRt_2.48.3              stringi_1.7.4              
##  [63] RSQLite_2.2.8               highr_0.9                  
##  [65] caTools_1.18.2              filelock_1.0.2             
##  [67] BiocParallel_1.26.2         rlang_0.4.11               
##  [69] pkgconfig_2.0.3             bitops_1.0-7               
##  [71] evaluate_0.14               TCGAbiolinksGUI.data_1.12.0
##  [73] lattice_0.20-44             purrr_0.3.4                
##  [75] labeling_0.4.2              htmlwidgets_1.5.4          
##  [77] bit_4.0.4                   tidyselect_1.1.1           
##  [79] plyr_1.8.6                  magrittr_2.0.1             
##  [81] R6_2.5.1                    gplots_3.1.1               
##  [83] generics_0.1.0              DelayedArray_0.18.0        
##  [85] DBI_1.1.1                   withr_2.4.2                
##  [87] pillar_1.6.2                KEGGREST_1.32.0            
##  [89] RCurl_1.98-1.4              crayon_1.4.1               
##  [91] KernSmooth_2.23-20          utf8_1.2.2                 
##  [93] BiocFileCache_2.0.0         tzdb_0.1.2                 
##  [95] rmarkdown_2.10              progress_1.2.2             
##  [97] locfit_1.5-9.4              grid_4.1.0                 
##  [99] data.table_1.14.0           marray_1.70.0              
## [101] blob_1.2.2                  digest_0.6.27              
## [103] xtable_1.8-4                tidyr_1.1.3                
## [105] httpuv_1.6.2                R.utils_2.10.1             
## [107] munsell_0.5.0               bslib_0.3.0                
## [109] shinyjs_2.0.0

GEM-based gene set analysis