logo_raukr

In this exercise we will use publicly available gene expression profiles of a large number of prostate cancer samples. One of the most important prognostic factors in prostate cancer is the tumor grade, which is a measure of how dedifferenciated or transformed the cells are. Grading of prostate cancer is performed by a pathologist according to the Gleason grading system. A section of tumor tissue may have several different patterns correpsonding to different tumor grades. The Gleason grade is the sum of the most abunant grade and the second most abundant grade observed. Each grade is on a scale from 1 to 5 where 3 is borderline malignant, 4 is malignant and 5 is highly malignant. The Gleason grade of a slow growing tumor is often 6 (3+3), and these tumors rarely become life-threatening during the patient’s lifetime and require little treatment. Gleason 8 (4+4) or higher require surgery and sometimes additional hormone- or chemotherapy (and may still disseminate into deadly metastatic disease). Gleason 7 (3+4 or 4+3) can behave either as an indolent Gleason 6, as a malignant Gleason 8, or in between. The difficulty of using grading to clearly separate indolent from malignant tumors and the variability between how different pathologists perform grading has motivated efforts to use genomics to complement or even replace grading as a prognostic tool.

1 Setup

The convenient cgdsr package provides a basic set of R functions for querying the Cancer Genomic Data Server (CGDS) hosted by the Computational Biology Center (cBio) at the Memorial Sloan-Kettering Cancer Center (MSKCC). This service is a part of the cBio Cancer Genomics Portal, http://www.cbioportal.org/. Install the package from CRAN and load it.

install.packages('cgdsr')
library(cgdsr)

2 Download data

Using a CGDS connection object, you can now use the following functions to query the cBioPortal:

getCancerStudies(): What studies are hosted on the server? For example, TCGA glioblastoma or TCGA ovarian cancer.
getGeneticProfiles(): What genetic profile types are available for cancer study X? For example, mRNA expression or copy number alterations.
getCaseLists(): What case sets are available for cancer study X? For example, all samples or only samples corresponding to a given cancer subtype.
getProfileData(): Retrieve slices of genomic data. For example, a client can retrieve all mutation data for PTEN and EGFR in TCGA glioblastoma.
getClinicalData(): Retrieve clinical data (e.g. patient survival time and age) for a given cancer study and list of cases.
Start by initiating the connection object with CGDS("http://www.cbioportal.org/public-portal/")

mycgds = CGDS("http://www.cbioportal.org/public-portal/")
test(mycgds)

Use getCancerStudies() to investigate which prostate cancer studies are available.

studies=getCancerStudies(mycgds)
studies[grep(pattern = 'prostate',x = name,ignore.case = T)]

Which appears to be the largest study?

We will use publicly available gene expression data from The Cancer Genome Atlas (TCGA) in this exercise. The data represents the log2 of the measured mRNA abundance (using NGS of cDNA) per gene or transcript in the tumor sample.

Which case lists are available from prad_tcga?
Which genetic profiles are available?
Investigate which clinical data are available for this study.

mycaselist=getCaseLists(mycgds,'prad_tcga')
mygeneticprofile=getGeneticProfiles(mycgds,'prad_tcga')
myclinicaldata=getClinicalData(mycgds,'prad_tcga_rna_seq_v2_mrna')
myclinicaldata

You can use getProfileData() to download gene expression data for your case list and up to a few hundred specific genes.

Download gene expression data for all possible samples in prad_tcga, and for all genes.

# Will need to explicitly name the genes
library("biomaRt") # will use bioMart to extract 'all' gene names
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
gene_names=getBM(attributes= "hgnc_symbol",mart=ensembl)[,1]

# Test the getProfileData function:
gene_expression=getProfileData(x=mycgds,
                               genes=c('BRAF','KRAS','notarealgene'),
                               geneticProfiles='prad_tcga_rna_seq_v2_mrna',
                               caseList = 'prad_tcga_rna_seq_v2_mrna')

# Unfortunately the number of genes you can query simultaneously is limited, therefore we'll download using mutiple queries of 100 genes:
length(gene_names)
# first the last few genes:
gene_expression=getProfileData(x=mycgds,
                               genes=gene_names[37401:37416],
                               geneticProfiles='prad_tcga_rna_seq_v2_mrna',
                               caseList='prad_tcga_rna_seq_v2_mrna')

for (i in 0:373) {
  cat('\nDownloading', (100*i+1), 'to', (100*i+100))
  temp=getProfileData(x=mycgds,
                      genes=gene_names[(100*i+1):(100*i+100)],
                      geneticProfiles='prad_tcga_rna_seq_v2_mrna',
                      caseList='prad_tcga_rna_seq_v2_mrna')
  gene_expression=cbind(temp,gene_expression)
}

# For many genes there were no data. Remove them:
na=is.na(gene_expression[1,])
gene_expression=gene_expression[,!na]
gene_expression=as.matrix(gene_expression)

3 Analyse

You should now have a 498 by 18361 matrix with gene expression data. Use your R skills to analyze the data:

Visualize the expression data using a PCA or similar.

pca=prcomp(gene_expression)
plot(pca$x[,1],pca$x[,2])

Use your favorite machine learning tools to build a model for predicting a Gleason sum above 7 from gene expression data.

# check sample order:
all(rownames(gene_expression)==rownames(myclinicaldata))

# correct sample order:
order=match(rownames(myclinicaldata),rownames(gene_expression))
gene_expression=gene_expression[order,]
all(rownames(gene_expression)==rownames(myclinicaldata))

# high Gleason: >7
high_gleason=as.numeric(myclinicaldata$GLEASON_SCORE>7)
plot(pca$x[,1],pca$x[,2],col=high_gleason+1)

# build and cross validate a regression model
library(glmnet)
cv.lasso <- cv.glmnet(gene_expression, high_gleason, family='binomial', alpha=1, parallel=F, standardize=TRUE, type.measure='auc') 
plot(cv.lasso)

# just to make sure this worked out, test what happens if a dummy "Gleason" is provided:
dummy.lasso <- cv.glmnet(gene_expression, sample(high_gleason), family='binomial', alpha=1, parallel=F, standardize=TRUE, type.measure='auc') 
plot(dummy.lasso)

Set 20% of samples aside as a test set, retrain the model, and run the model on the test and training sets.
Do you see a similar performance?

train=runif(n = length(high_gleason))>0.2
cv.lasso <- cv.glmnet(gene_expression[train,], high_gleason[train], family='binomial', alpha=1, parallel=F, standardize=TRUE, type.measure='auc') 
plot(cv.lasso)

# make prediction on the test set
prediction.testset=predict.glmnet(cv.lasso$glmnet.fit,gene_expression[!train,],s = cv.lasso$lambda.min)
plot(myclinicaldata$GLEASON_SCORE[!train],prediction.testset)

# make prediction on the training set
prediction.trainset=predict.glmnet(cv.lasso$glmnet.fit,gene_expression[train,],s = cv.lasso$lambda.min)
plot(myclinicaldata$GLEASON_SCORE[train],prediction.trainset)

4 Bonus challenge

The most commonly mutated genes in prostate cancer include SPOP, TP53, ATM, PTEN, FOXA1, PIK3CA, MED12 and COL5A1. If you have time, download copy number and mutation data for them and see if that can improve the predictive model.

5 Session Info

This document has been created in RStudio using R Markdown and related packages.
For R Markdown, see http://rmarkdown.rstudio.com
For details about the OS, packages and versions, see detailed information below:

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] cgdsr_1.2.10    bsplus_0.1.1    forcats_0.3.0   stringr_1.3.1  
##  [5] dplyr_0.7.5     purrr_0.2.5     readr_1.1.1     tidyr_0.8.1    
##  [9] tibble_1.4.2    ggplot2_2.2.1   tidyverse_1.2.1 captioner_2.2.3
## [13] bookdown_0.7    knitr_1.20     
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4  xfun_0.1          reshape2_1.4.3   
##  [4] haven_1.1.1       lattice_0.20-35   colorspace_1.3-2 
##  [7] htmltools_0.3.6   yaml_2.1.19       rlang_0.2.0      
## [10] R.oo_1.22.0       pillar_1.2.3      foreign_0.8-70   
## [13] glue_1.2.0        modelr_0.1.2      readxl_1.1.0     
## [16] bindrcpp_0.2.2    bindr_0.1.1       plyr_1.8.4       
## [19] munsell_0.4.3     gtable_0.2.0      cellranger_1.1.0 
## [22] R.methodsS3_1.7.1 rvest_0.3.2       psych_1.8.4      
## [25] evaluate_0.10.1   parallel_3.5.0    broom_0.4.4      
## [28] Rcpp_0.12.17      backports_1.1.2   scales_0.5.0     
## [31] jsonlite_1.5      mnormt_1.5-5      hms_0.4.2        
## [34] digest_0.6.15     stringi_1.2.2     grid_3.5.0       
## [37] rprojroot_1.3-2   cli_1.0.0         tools_3.5.0      
## [40] magrittr_1.5      lazyeval_0.2.1    crayon_1.3.4     
## [43] pkgconfig_2.0.1   xml2_1.2.0        lubridate_1.7.4  
## [46] rstudioapi_0.7    assertthat_0.2.0  rmarkdown_1.9    
## [49] httr_1.3.1        R6_2.2.2          nlme_3.1-137     
## [52] compiler_3.5.0

Page built on: 18-Jun-2018 at 12:23:37.

2018 | SciLifeLab > NBIS > RaukR

Cancer genomics workshop

Predict tumor grade from gene expression data

Markus Mayrhofer

18-Jun-2018

1 Setup

2 Download data

3 Analyse

4 Bonus challenge

5 Session Info