In this exercise we will use publicly available gene expression profiles of a large number of prostate cancer samples. One of the most important prognostic factors in prostate cancer is the tumor grade, which is a measure of how dedifferenciated or transformed the cells are. Grading of prostate cancer is performed by a pathologist according to the Gleason grading system. A section of tumor tissue may have several different patterns correpsonding to different tumor grades. The Gleason grade is the sum of the most abunant grade and the second most abundant grade observed. Each grade is on a scale from 1 to 5 where 3 is borderline malignant, 4 is malignant and 5 is highly malignant. The Gleason grade of a slow growing tumor is often 6 (3+3), and these tumors rarely become life-threatening during the patient’s lifetime and require little treatment. Gleason 8 (4+4) or higher require surgery and sometimes additional hormone- or chemotherapy (and may still disseminate into deadly metastatic disease). Gleason 7 (3+4 or 4+3) can behave either as an indolent Gleason 6, as a malignant Gleason 8, or in between. The difficulty of using grading to clearly separate indolent from malignant tumors and the variability between how different pathologists perform grading has motivated efforts to use genomics to complement or even replace grading as a prognostic tool.
The convenient cgdsr
package provides a basic set of R functions for querying the Cancer Genomic Data Server (CGDS) hosted by the Computational Biology Center (cBio) at the Memorial Sloan-Kettering Cancer Center (MSKCC). This service is a part of the cBio Cancer Genomics Portal, http://www.cbioportal.org/. Install the package from CRAN and load it.
install.packages('cgdsr')
library(cgdsr)
Using a CGDS connection object, you can now use the following functions to query the cBioPortal:
getCancerStudies()
: What studies are hosted on the server? For example, TCGA glioblastoma or TCGA ovarian cancer.getGeneticProfiles()
: What genetic profile types are available for cancer study X? For example, mRNA expression or copy number alterations.getCaseLists()
: What case sets are available for cancer study X? For example, all samples or only samples corresponding to a given cancer subtype.getProfileData()
: Retrieve slices of genomic data. For example, a client can retrieve all mutation data for PTEN and EGFR in TCGA glioblastoma.getClinicalData()
: Retrieve clinical data (e.g. patient survival time and age) for a given cancer study and list of cases.
Start by initiating the connection object with CGDS("http://www.cbioportal.org/public-portal/")
mycgds = CGDS("http://www.cbioportal.org/public-portal/")
test(mycgds)
getCancerStudies()
to investigate which prostate cancer studies are available.studies=getCancerStudies(mycgds)
studies[grep(pattern = 'prostate',x = name,ignore.case = T)]
We will use publicly available gene expression data from The Cancer Genome Atlas (TCGA) in this exercise. The data represents the log2 of the measured mRNA abundance (using NGS of cDNA) per gene or transcript in the tumor sample.
prad_tcga
?mycaselist=getCaseLists(mycgds,'prad_tcga')
mygeneticprofile=getGeneticProfiles(mycgds,'prad_tcga')
myclinicaldata=getClinicalData(mycgds,'prad_tcga_rna_seq_v2_mrna')
myclinicaldata
You can use getProfileData()
to download gene expression data for your case list and up to a few hundred specific genes.
prad_tcga
, and for all genes.# Will need to explicitly name the genes
library("biomaRt") # will use bioMart to extract 'all' gene names
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
gene_names=getBM(attributes= "hgnc_symbol",mart=ensembl)[,1]
# Test the getProfileData function:
gene_expression=getProfileData(x=mycgds,
genes=c('BRAF','KRAS','notarealgene'),
geneticProfiles='prad_tcga_rna_seq_v2_mrna',
caseList = 'prad_tcga_rna_seq_v2_mrna')
# Unfortunately the number of genes you can query simultaneously is limited, therefore we'll download using mutiple queries of 100 genes:
length(gene_names)
# first the last few genes:
gene_expression=getProfileData(x=mycgds,
genes=gene_names[37401:37416],
geneticProfiles='prad_tcga_rna_seq_v2_mrna',
caseList='prad_tcga_rna_seq_v2_mrna')
for (i in 0:373) {
cat('\nDownloading', (100*i+1), 'to', (100*i+100))
temp=getProfileData(x=mycgds,
genes=gene_names[(100*i+1):(100*i+100)],
geneticProfiles='prad_tcga_rna_seq_v2_mrna',
caseList='prad_tcga_rna_seq_v2_mrna')
gene_expression=cbind(temp,gene_expression)
}
# For many genes there were no data. Remove them:
na=is.na(gene_expression[1,])
gene_expression=gene_expression[,!na]
gene_expression=as.matrix(gene_expression)
You should now have a 498 by 18361 matrix with gene expression data. Use your R skills to analyze the data:
pca=prcomp(gene_expression)
plot(pca$x[,1],pca$x[,2])
# check sample order:
all(rownames(gene_expression)==rownames(myclinicaldata))
# correct sample order:
order=match(rownames(myclinicaldata),rownames(gene_expression))
gene_expression=gene_expression[order,]
all(rownames(gene_expression)==rownames(myclinicaldata))
# high Gleason: >7
high_gleason=as.numeric(myclinicaldata$GLEASON_SCORE>7)
plot(pca$x[,1],pca$x[,2],col=high_gleason+1)
# build and cross validate a regression model
library(glmnet)
cv.lasso <- cv.glmnet(gene_expression, high_gleason, family='binomial', alpha=1, parallel=F, standardize=TRUE, type.measure='auc')
plot(cv.lasso)
# just to make sure this worked out, test what happens if a dummy "Gleason" is provided:
dummy.lasso <- cv.glmnet(gene_expression, sample(high_gleason), family='binomial', alpha=1, parallel=F, standardize=TRUE, type.measure='auc')
plot(dummy.lasso)
train=runif(n = length(high_gleason))>0.2
cv.lasso <- cv.glmnet(gene_expression[train,], high_gleason[train], family='binomial', alpha=1, parallel=F, standardize=TRUE, type.measure='auc')
plot(cv.lasso)
# make prediction on the test set
prediction.testset=predict.glmnet(cv.lasso$glmnet.fit,gene_expression[!train,],s = cv.lasso$lambda.min)
plot(myclinicaldata$GLEASON_SCORE[!train],prediction.testset)
# make prediction on the training set
prediction.trainset=predict.glmnet(cv.lasso$glmnet.fit,gene_expression[train,],s = cv.lasso$lambda.min)
plot(myclinicaldata$GLEASON_SCORE[train],prediction.trainset)
The most commonly mutated genes in prostate cancer include SPOP, TP53, ATM, PTEN, FOXA1, PIK3CA, MED12 and COL5A1. If you have time, download copy number and mutation data for them and see if that can improve the predictive model.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cgdsr_1.2.10 bsplus_0.1.1 forcats_0.3.0 stringr_1.3.1
## [5] dplyr_0.7.5 purrr_0.2.5 readr_1.1.1 tidyr_0.8.1
## [9] tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1 captioner_2.2.3
## [13] bookdown_0.7 knitr_1.20
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.4 xfun_0.1 reshape2_1.4.3
## [4] haven_1.1.1 lattice_0.20-35 colorspace_1.3-2
## [7] htmltools_0.3.6 yaml_2.1.19 rlang_0.2.0
## [10] R.oo_1.22.0 pillar_1.2.3 foreign_0.8-70
## [13] glue_1.2.0 modelr_0.1.2 readxl_1.1.0
## [16] bindrcpp_0.2.2 bindr_0.1.1 plyr_1.8.4
## [19] munsell_0.4.3 gtable_0.2.0 cellranger_1.1.0
## [22] R.methodsS3_1.7.1 rvest_0.3.2 psych_1.8.4
## [25] evaluate_0.10.1 parallel_3.5.0 broom_0.4.4
## [28] Rcpp_0.12.17 backports_1.1.2 scales_0.5.0
## [31] jsonlite_1.5 mnormt_1.5-5 hms_0.4.2
## [34] digest_0.6.15 stringi_1.2.2 grid_3.5.0
## [37] rprojroot_1.3-2 cli_1.0.0 tools_3.5.0
## [40] magrittr_1.5 lazyeval_0.2.1 crayon_1.3.4
## [43] pkgconfig_2.0.1 xml2_1.2.0 lubridate_1.7.4
## [46] rstudioapi_0.7 assertthat_0.2.0 rmarkdown_1.9
## [49] httr_1.3.1 R6_2.2.2 nlme_3.1-137
## [52] compiler_3.5.0
Page built on: 18-Jun-2018 at 12:23:37.
2018 | SciLifeLab > NBIS > RaukR