## LIBRARIES AND VARIABLES
# load the packages you need for this document

library(bsplus)

# variables needed in this document
# raukr colours
#rv_col_dark <- "#125687"
#rv_col_light <- "#e7eef3"

logo_raukr

In this exercise we will parse and review somatic mutations based on a single whole-genome sequenced tumor-normal pair.

First, download the data (available with course material) and extract into a suitable working directory.

1 Point mutations

Somatic point mutations include single-nucleotide substitutions (SNVs) and small insertions/deletions (indels) that have occurred somatic cells and are therefore not present in other cells of the same individual. Cells of a clonal expansion such as a tumor will carry any mutations present in their last common ancestor including mutations that gave rise to the cancer. We will now review the result from the variant caller Mutect, which is based on comparing sequence variants in tumor DNA with those in a patient-matched normal DNA. Although this example contains only SNVs, recent versions of Mutect will produce indels too. The result has been annotated with the Variant Effect Predictor (VEP).

1.1 Parse a VCF (Variant Call Format) file

Parse mutect_vep.vcf and investigate its contents. Good options for parsing vcf files are VariantAnnotation and vcfR.

library(VariantAnnotation)
?readVcf
vcf=readVcf(file = 'data/mutect_vep.vcf',genome = 'hg19') # this is a hard coded file path
vcf
info(vcf)
rowRanges(vcf)
geno(vcf)


library(vcfR)
?read.vcfR
vcfr=read.vcfR(file = 'data/mutect_vep.vcf')
vcfr

In germline DNA, the allele ratio of variants are typically expeceted at either 0.5 or 1, depending on whether the variant is present on 1/2 or all copies of the locus. In a sample of tumor DNA, the allele ratio of somatic mutations depend on the tumor purity (fraction of cells in the sample that are part of the malignant clonal expansion) and the number of mutated and unmutated copies per cell.

Investigate the coverage and allele ratio of the mutations.
What reasons are there for low and high mutation allele ratio?
Can you say anything about tumor purity?

library(data.table)
mutations=data.table(name=names(vcf),
                coverage=geno(vcf)$DP,
                alleleratio=as.numeric(geno(vcf)$FA))
summary(mutations)

library(ggplot2)
ggplot()+theme_light()+ggtitle('Mutation allele ratio')+xlab('')+
  geom_point(mapping = aes(x=1:length(alleleratio),y=alleleratio,col=log10(coverage.HCC1143)),data = mutations)

Make a plot of mutation density or spacing throughout the genome.
Is it largely even?
Is is associated with coverage?

r <- rowRanges(vcf)
# let's add chromosome and position to the table
mutations$chr=as.character(seqnames(r))
mutations$pos=as.numeric(start(r))

# and the distance to the next mutation
mutations$spacing=c(diff(mutations$pos),0)

ggplot()+theme_light()+ggtitle('Mutation spacing')+scale_y_log10()+xlab('')+
  geom_point(mapping = aes(x=1:nrow(mutations),y=spacing,col=log10(coverage.HCC1143)),data=mutations)

1.2 Explore annotations

Tens of thousands of mutations is typical for many adult solid cancers. Most are so-called passenger mutations and have no impact on the development of disease. In order to identify driver mutations (which may influence diagnosis, prognosis and choice of treatment) we need to annotate all variants with their putative effect. There are many tools available for that, e.g. Annovar, Oncotator, SNP Effect and Variant Effect Predictor (VEP). We have used the latter (http://www.ensembl.org/info/docs/tools/vep/index.html) to annotate this vcf file and the result is available in the CSQ column of the INFO field of the vcf file. It is one or more pipe separated annotation chunks per variant. The annotation format is described in the VCF header.

Parse the annotations into a table. Merge it with your table of mutations.

info=as.data.table(info(vcf))

header(vcf)
info(header(vcf))
vep_header=strsplit(info(header(vcf))['CSQ',3],'Format: ')[[1]][2]
vep_header=strsplit(vep_header,'\\|')[[1]]

head(info$CSQ)

# start with an empty matrix that fits the annotations (and one number that identifies which mutation it annotates)
annotation_table=matrix(data = NA,nrow = length(unlist(info$CSQ)),ncol = length(vep_header)+1)
colnames(annotation_table)=c('N',vep_header)
row=1
for (i in 1:nrow(info)) { # for each mutation
  for (j in 1:length(info$CSQ[[i]])) { # for each VEP annotation of that mutation
    line=strsplit(info$CSQ[[i]][j],'\\|')[[1]]
    annotation_table[row,1]=i # to remember which mutation it belongs to
    annotation_table[row,1+(1:length(line))]=line
    row=row+1
  }
}

# merge with the mutation table
mutations$N=as.character(1:nrow(mutations))
mutations=merge(mutations,as.data.table(annotation_table),by='N')

A more detailed description of each annotation column is available at http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html (see the “Output fields” column of each table). The Consequence and IMPACT columns are described at: https://www.ensembl.org/info/genome/variation/predicted_data.html

1.3 Variant interpretation

If we had a large number of samples, we could use MutSig (http://archive.broadinstitute.org/cancer/cga/mutsig) or MuSiC (http://archive.broadinstitute.org/cancer/cga/mutsig) to identify genes or genomic regions with a higher mutation frequency than expected from the background distribution. With a single sample we will instead focus on genes already known to be involved in cancer. You can download a list of known cancer genes from https://civicdb.org/releases, http://oncokb.org/#/cancerGenes or https://cancer.sanger.ac.uk/cosmic/download.

How many mutations do you find in known cancer genes?

cancergenes=fread('https://civicdb.org/downloads/nightly/nightly-GeneSummaries.tsv')
mutations_knowngene=mutations[SYMBOL %in% cancergenes$name]
table(mutations_knowngene$Consequence)

Take a look at the “Consequence” column. How many do you think may be driver mutations?
Which consequences do you find if you limit the impact to “HIGH” and “MODERATE”?

mutations[SYMBOL %in% cancergenes$name & IMPACT %in% c('HIGH','MODERATE')]

This is a more reasonable number of mutations, largely in line with the number of driver mutations typically observed. Since all of them are missense however, we should not take for granted that each has any oncogenic effect. Note that as part of the VEP annotation we get SIFT and Polyphen estimates of pathogeneicity, and some of them imply that the effect may be small. To better separate drivers from passengers you can use information on known hotspots, i.e. specific residues recurrently mutated in cancer which therefore likely has effect. Memorial Sloan-Kettering has released a hotspot table based on some 25000 cancer samples. You can access it from http://cancerhotspots.org/#/download.

Use the table of known hotspots to check which mutations are hotspots.

library(gdata) # to be able to parse an xls file
hotspots=read.xls('http://cancerhotspots.org/files/hotspots_v2.xls')
hotspots=as.data.table(hotspots)

# will use a combination of gene and residue (protein position) as key
mutations_hotspot=mutations[
  paste(
    SYMBOL,sub(pattern = '/.*',replacement = '',x = Protein_position)) %in% paste(hotspots$Hugo_Symbol,hotspots$Amino_Acid_Position
    )
  ]
mutations_hotspot

You can also browse individual genes manually at e.g. http://tumorportal.org/ to see whether your mutation is typical for how the gene is commonly mutated. As you can see here: http://tumorportal.org/view?geneSymbol=DCC, DCC (Deleted in Colorectal Cancer) is often deleted but does not seem to have mutational hotspots.

Review the allele ratio of the putative driver mutations you have identified. Is there any indication of concomitant deletion (of the unmutated copy) of the gene?

2 Copy number alterations

Cancer genomes are frequently aneuploid and rearranged, leading to each cancer cell containing an aberrant number of copies of part (or all) of the genome. The copy number of whole or large segments of chromosomes will be at least 1 - the cell is unlikely to survive the complete loss of many genes - and rarely amplfied beyond a copy number of 10 or so. Small segments however can be homozygously deleted (zero remaining copies) or amplified to hundreds of copies.

Copy number analysis is relatively straightforward to do from sequencing data - the higher the coverage, the higher the copy number per cell of that part if the reference genome. To reduce the impact of systematic noise and germline CNVs, coverage of the normal sample is subtracted from that of the tumor sample. The log-transformed tumor/normal coverage ratio is called the log ratio. Each data point in the log ratio represents the log2 of tumor/normal sequence coverge of a small segment or a single position of the reference genome.

Copy-neutral loss of heterozygosity (LOH) is the result of deletion of either parental “homologous” copy with concominant duplication of the other, and is not evident from sequence coverage alone. To be able to identify copy neutral events, and to support copy number estimates in general, SNP allele ratio is commonly used alongside logratio in copy number analysis. The allele ratio of SNPs that are heterozygous in germline DNA will change dramatically when the copy number of either homologous copy is different from the other.

2.1 Parse and explore copy number data

Load ascat_log_ratio and ascat_allele_ratio into memory and investigate their content.

logratio=fread('data/ascat_log_ratio.txt')
alleleratio=fread('data/ascat_allele_ratio.txt')

summary(logratio)
summary(alleleratio)

g=ggplot()+theme_light()

g+geom_point(mapping = aes(x=Position,y=Value),data = logratio[Chr==1],shape='.')+ggtitle('Log ratio')
g+geom_point(mapping = aes(x=Position,y=Value),data = alleleratio[Chr==1],shape='.')+ggtitle('Allele ratio')

The log ratio is median centered to mitigate any difference in overall sequence coverage between the tumor and normal sample, and often normalized for systematic noise such as the effect of local GC content on sequence coverage.

What is the log ratio measuring?
Is the log transformation necessary?
If the tumor purity is 100% and the copy number changes from 2 to 1, what should be the relative change in sequence coverage?
How would log ratio change?
If a tumor genome is triploid on average, what is the expected coverage (relative to average coverage) where the copy number is 3?
What is the expected log ratio?

The SNP allele ratio is a useful complement to the log ratio and can be used to infer absolute copy number and copy-neutral loss of heterozygosity (LOH). Only heterozygous SNPs (based on germline DNA) are informative, homozygous SNPs are removed.

If there are two homolgous (non-identical) copies of a genomic segment, what is the expected (ALT) allele ratio of SNPs there?
If the tumor purity is 50% and there is a copy-neutral LOH in the tumor genome, what is the expected allele ratio there?
What about 50% purity and a homozygous deletion?

2.2 Plot a genome-wide copy number profile

To better visualize copy number along the reference genome, we need to convert the positions from chromsome-based to genome-based. Use the length of each chromosome to calculate the genome-wide positions for log ratio and allele ratio. The reference genome is hg19 (if it matters).

logratio$g_Position=NA
alleleratio$g_Position=NA
chrom_lengths=NULL # estimate the lengths from the copy number data.
chrom_cum_start=NULL # also store the "offset" for use with other data
for (chr in unique(logratio$Chr)) {
  ix=logratio$Chr==chr
  chrom_lengths[[chr]]=as.numeric(max(logratio[ix,Position],na.rm=T))
  chrom_cum_start[[chr]]=sum(chrom_lengths)-chrom_lengths[chr]+5e6
  logratio$g_Position[ix]=logratio$Position[ix]+chrom_cum_start[[chr]]
  alleleratio$g_Position[ix]=alleleratio$Position[ix]+chrom_cum_start[[chr]]
}

Visualize log ratio (or just “ratio” if you prefer) and allele ratio along the reference genome. Label the chromosomes and add a smoothed average to the log ratio.

logratio$Ratio=2^logratio$Value
logratio$Smoothed=runmed(x = logratio$Ratio,k = 99)

copyplot1=ggplot()+theme_light()+ggtitle('Genome-wide copy number profile')+
  xlab('Genomic position')+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())+
  ylab('B allele ratio and coverage ratio')+ylim(c(-1,3))+
  # tumor logR
  geom_point(aes(x=g_Position,y=Ratio),data=logratio,alpha=1/8,shape='.')+
  # smoothed logR
  geom_line(aes(x=g_Position,y=Smoothed),data=logratio,color='green',size=0.05)+
  # SNP allele ratio
  geom_point(data = alleleratio,aes(x=g_Position,y=Value-1),alpha=1/12,shape='.')+
  # add chromosome lines
  geom_segment(aes(x = chrom_cum_start[-1],xend=chrom_cum_start[-1],y= -1,yend=3),alpha=1/5)+
  # add chromosome labels
  geom_text(aes(x=chrom_cum_start+0.5*chrom_lengths,y=0.1,label=names(chrom_lengths)))
copyplot1

Add your putative driver mutations to the SNP allele ratio track. Do their alt-allele ratios make sense?
Would the allele ratio of somatic point mutations remain similar to that of heterozugous SNPs if the purity was much lower?

# adjust positions to genome
mutations_hotspot$g_pos=mutations_hotspot$pos + chrom_cum_start[mutations_hotspot$chr]

# add to previous plot:
copyplot2=copyplot1+
  geom_point(data = mutations_hotspot,mapping = aes(x=g_pos,y=alleleratio-1),col='red')
copyplot2

When interpreting the result, keep in mind that the X chromosome may have been compared to a non-diploid normal which would skew the result compared to the autosomes. The average log ratio of autosomal segments should largely line up at coverages (log ratio or ratio) corresponding to whole numbers of copies per cell.

2.3 Estimate tumor ploidy and purity

What is the absolute difference in coverage ratio that seems to correspond to one more or less copy?
Assuming the average copy number of the sample is found at a log ratio of 0 (ratio of 1), what is it likely to be here?

# the density of coverage ratio should peak at whole copy numbers
plot(d <- density(logratio[Chr!='X',Smoothed]),xlim=c(0,2), main='Genome-wide smoothed coverage ratio')

# the slope changes sign at peaks and valleys
points(d$x[-1],s <- sign(diff(d$y)),type = 'l',col='lightblue') 

# peaks are where the slope changes sign from positive to negative
peak_index=which(diff(s)==-2)+1 # `diff` causes loss of a data point
points(x <- d$x[peak_index],d$y[peak_index],col='red')

# at what spacing do we see the peaks
data.table(x,difference=diff(c(0,x)))[1:6] # near 0.23. We consider only the first few peaks which are well represented throughout the genome.

Hint: As the first peak includes a large genomic segment, it must represent at least 1 copy. Whatever coverage ratio represents zero copies must be positive or very near zero, but not necessarily present in the sample.

With the data you now have, estimate the purity of this sample.

# 1. Based on the copy number data:

expected=1/4 # expected change in coverage ratio per copy given 100% purity and avergae of four copies
observed=0.23 # from the peaks
print(purity <- observed/expected)

# 2. Based on mutation allele ratio

# We can infer a copy number of 2 with loss-of-heterozygosity throughout chromosome 13.
mutations_13=unique(mutations[chr==13,.(alleleratio)])
plot(d <- density(mutations_13$alleleratio), main='Mutation allele ratio at two copies and LOH')

# the clonal mutations with multiplicity 2 (present on both copies) comprise the right-most distribution
peak=d$x[d$x>0.8][which.max(d$y[d$x>0.8])]
text(x=peak,y=d$y[d$x==peak],col='red',labels = 'Clonal mutations\non both copies')

# Conveniently, with equal amount of DNA per cell in tumor and normal (2 copies of chr13) and the selected mutations present in all tumor DNA, purity=allelratio.
print(purity <- peak)

3 Bonus challenges

If you have time, choose one or more of the following bonus exercises.

3.1 Mutational signatures

The relative abundance of certain mutations in a given sequence context have been shown to correlate with mutational phenotypes such as DNA repair deficiency, UV light and aging (https://en.wikipedia.org/wiki/Mutational_signatures).

Use the MutationalSignatures or MutationalPatterns packages to identify the predominant mutational sigantures of this cancer genome.

3.2 Structural variants

Putative somatic structual variants are available for this sample in the manta_snpEff.vcf file. They have been generated with Manta, a structural variant caller. Annotations have been generated with SnpEff and are available in the ANN column of the INFO field.

Parse and explore the somatic structural variants. Investigate if there is any fusion involving known cancer genes.

3.3 Circos plots

Visualize the cancer genome using a Circos plot: https://cran.r-project.org/web/packages/RCircos/vignettes/Using_RCircos.pdf

4 Session Info

This document has been created in RStudio using R Markdown and related packages.
For R Markdown, see http://rmarkdown.rstudio.com
For details about the OS, packages and versions, see detailed information below:

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] gdata_2.18.0                data.table_1.11.4          
##  [3] vcfR_1.8.0                  VariantAnnotation_1.26.0   
##  [5] Rsamtools_1.32.0            Biostrings_2.48.0          
##  [7] XVector_0.20.0              SummarizedExperiment_1.10.1
##  [9] DelayedArray_0.6.0          BiocParallel_1.14.1        
## [11] matrixStats_0.53.1          Biobase_2.40.0             
## [13] GenomicRanges_1.32.3        GenomeInfoDb_1.16.0        
## [15] IRanges_2.14.10             S4Vectors_0.18.2           
## [17] BiocGenerics_0.26.0         bsplus_0.1.1               
## [19] forcats_0.3.0               stringr_1.3.1              
## [21] dplyr_0.7.5                 purrr_0.2.5                
## [23] readr_1.1.1                 tidyr_0.8.1                
## [25] tibble_1.4.2                ggplot2_2.2.1              
## [27] tidyverse_1.2.1             captioner_2.2.3            
## [29] bookdown_0.7                knitr_1.20                 
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137             bitops_1.0-6            
##  [3] lubridate_1.7.4          bit64_0.9-7             
##  [5] progress_1.1.2           httr_1.3.1              
##  [7] rprojroot_1.3-2          tools_3.5.0             
##  [9] backports_1.1.2          vegan_2.5-2             
## [11] R6_2.2.2                 mgcv_1.8-23             
## [13] DBI_1.0.0                lazyeval_0.2.1          
## [15] colorspace_1.3-2         permute_0.9-4           
## [17] prettyunits_1.0.2        tidyselect_0.2.4        
## [19] mnormt_1.5-5             curl_3.2                
## [21] bit_1.1-14               compiler_3.5.0          
## [23] cli_1.0.0                rvest_0.3.2             
## [25] Cairo_1.5-9              xml2_1.2.0              
## [27] labeling_0.3             rtracklayer_1.40.2      
## [29] scales_0.5.0             psych_1.8.4             
## [31] digest_0.6.15            foreign_0.8-70          
## [33] rmarkdown_1.9            pkgconfig_2.0.1         
## [35] htmltools_0.3.6          BSgenome_1.48.0         
## [37] rlang_0.2.0              readxl_1.1.0            
## [39] rstudioapi_0.7           RSQLite_2.1.1           
## [41] bindr_0.1.1              jsonlite_1.5            
## [43] gtools_3.5.0             RCurl_1.95-4.10         
## [45] magrittr_1.5             GenomeInfoDbData_1.1.0  
## [47] Matrix_1.2-14            Rcpp_0.12.17            
## [49] munsell_0.4.3            ape_5.1                 
## [51] stringi_1.2.2            yaml_2.1.19             
## [53] MASS_7.3-49              zlibbioc_1.26.0         
## [55] plyr_1.8.4               pinfsc50_1.1.0          
## [57] grid_3.5.0               blob_1.1.1              
## [59] crayon_1.3.4             lattice_0.20-35         
## [61] haven_1.1.1              GenomicFeatures_1.32.0  
## [63] hms_0.4.2                pillar_1.2.3            
## [65] reshape2_1.4.3           biomaRt_2.36.1          
## [67] XML_3.98-1.11            glue_1.2.0              
## [69] evaluate_0.10.1          memuse_4.0-0            
## [71] modelr_0.1.2             cellranger_1.1.0        
## [73] gtable_0.2.0             assertthat_0.2.0        
## [75] xfun_0.1                 broom_0.4.4             
## [77] viridisLite_0.3.0        GenomicAlignments_1.16.0
## [79] AnnotationDbi_1.42.1     memoise_1.1.0           
## [81] cluster_2.0.7-1          bindrcpp_0.2.2

Page built on: 18-Jun-2018 at 12:06:50.

2018 | SciLifeLab > NBIS > RaukR

Cancer genomics workshop

Review and interpret somatic mutations

Markus Mayrhofer

18-Jun-2018