Synopsis

This document documents an example toy analyses of a VCF file with genotyping data using the gwasim package. Package can be installed by typing: devtools::install_github("NBISweden/a_johansson_2020"). For further details visit gwasim GitHub pages.

Setup

Before starting our simulation, we need to load the necessary packages (install them earlier if you do not already have them in your library). The easiest way is to:

Next, load the packages:

library('vcfR')
library('stringr')
library('gwasim')

Reading data

For the sake of learning how to use gwasim we will read first 500 rows (markers) from the ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz file. This file contains 1000 genomes phase 3 data for human chromosome 22. The file can be downloaded from the 1000 Genomes data repository via FTP link.

input_vcf <- '~/Projects/johansson/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'
vcf_data <- vcfR::read.vcfR(file = input_vcf, nrows = 500)
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 252
##   header_line: 253
##   variant count: 500
##   column count: 2513
## 
Meta line 252 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 500
##   Character matrix gt cols: 2513
##   skip: 0
##   nrows: 500
##   row_num: 0
## 
Processed variant: 500
## All variants processed
markers <- vcf_data@fix[,'ID'] # Extract marker names
coords <- vcf_data@fix[, 'POS'] # Extract genomic marker coordinates

We discovered that we need some simple data cleaning because we had few problematic entries in the VCF file. You should always examine your particular dataset for any oddities, try to understand their source and either filter them out or, better, fix them in some other way.

to_remove <- which(str_detect(markers, ";")) # detect weird markers (multi-marker)
markers <- markers[-to_remove]
coords <- coords[-to_remove]

Get genotypes

Now, let us build a genotyping matrix \(G\). We will narrow down our scope to the first 10 markers. First, we use gwasim::get_genotypes() to retrieve the genotypes, next we pipe our data stream to the gwasim::fix_allele_encoding function which recodes the \(G\) matrix. By doing so, we make sure the major homozygote will be encoded as 0, minor homozygote as 2 and heterozygotes as 1. In another words, our genotyping matrix now contains a per locus count of minor alleles. Finally, we impute the missing genotypes by inserting the most common genotype at all NA loci.

Note that this step may potentially change which allele is minor but it will not be reflected in the encoding.

G <- gwasim::get_genotypes(x = vcf_data, marker_names = markers[1:10]) %>% 
  gwasim::fix_allele_encoding() %>%
  gwasim::impute_G()

Compute maf

Once the data have been loaded, we can start looking at the data. Let’s plot e.g. the minor allele frequency and check that it is the same as the reference (minor) allele frequency, i.e. that the gwasim::fix_allele_encoding function worked as it is supposed to and to see if imputations had any effect on maf.

raf <- colSums(G) / (2 * dim(G)[1])
maf <- pmin(raf, 1 - raf)
plot(maf, cex=.5, col='slateblue', type='l', las=1, cex.axis=.5)

plot(maf-raf, cex=.5, col='tomato', type='l', las=1, cex.axis=.5)

Simulate effect of rare alleles for genotype

Now, it is time to do some basic simulations, i.e. to simulate phenotypic values based on the genotypes in our G. To this end, we will use gwasim::get_effects. In this simple example (we will show more complex scenarios later in this document), we would like to simulate the following genetic architecture affecting phenotype:

effects <- gwasim::get_effects(maf=maf, thr=0.01, N=2, get_betas_fun = dbeta, get_betas_args = list(shape1 = .1, shape2 = .1), rare=T, frac_negative=0, seed = 2)
effects
## $marker_idx
## rs587769434 rs587638893 
##           5           6 
## 
## $effects
## rs587769434 rs587638893 
##   108.38947    58.09492

Here, we can see that, indeed, two markers have been selected: rs587720402 and rs587638893 and the effects at these loci are: \(\beta_{rs587720402} = 13.66754\) and \(\beta_{rs587638893} = 31.14350\) respectively. We are than ready to simulate our phenotype.

Compute simulated phenotypes

First, we want some noise in our phenotype to simulate the effect of environment, other loci or simply the measurement error. Let us add some very tiny noise here sampled from \(\mathcal{N}(0, 1)\).

error_term <- rnorm(n = dim(G)[1], mean = 0, sd = 1)
# Check the error is as you want it
plot(error_term, pch=19, cex=.5, las=1)

hist(error_term)

Luckily, our error seems to be what we want – it looks normally distributed around 0 and also it is pretty homogenous. Having this last bit ready, we can simulate our phenotype according to the \(\hat{y} = G \times \beta + \epsilon\):

# Simulate phenotypes (here we do not want negative phenotype values so, we set them to 0)
y <- pmax(0, G[, effects$marker_idx] %*% effects$effects + error_term)
plot(y, 1:length(y), pch = 1 + rowSums(G[, effects$marker_idx]), xlab='individual', ylab='phenotype', las=1, cex.axis=.7, main = "Simulated phenotype per individual")

Indeed, we see there are a number of rs587720402 heterozygotes with phenotype close to the effect of \(13.67 + \epsilon\) and one heterozygote at rs587638893 with phenotype close to the corresponding effect of \(58.09 + \epsilon\). We can check it matches the values in the \(G\) matrix:

G[rowSums(G[, effects$marker_idx]) > 0, effects$marker_idx]
##         rs587769434 rs587638893
## HG01624           1           0
## HG01855           0           1
## HG02389           0           1

Testing GWAS models.

Now, it’d be really interesting to see whether a tool for genome-wide association studies, e.g. GenABEL can single out our markers that were used for simulation. Let’s try. Now things are becoming exciting and one can start asking different research questions.

library('GenABEL')
test_header <- tibble(
       chr = c(22),
       pos = as.numeric(coords[1:10]),
       snp = colnames(G),
       strand = c('u')
)
test_G <- as_tibble(t(G))
test_data <- cbind(test_header, test_G)
data <- gwasim::tibble_to_gwaa(test_data, sex=rep(2, times=dim(G)[1]), trait=y, progress = F)
qt <- GenABEL::qtscore(y~1, data)
## Warning in GenABEL::qtscore(y ~ 1, data): Number of observations < 100, Lambda
## estimate is unreliable
plot(qt@annotation$Position,-log10(qt@results$P1df), pch=19, cex=.5)

best <- which.min(qt@results$P1df)
qt@results[best,]
##                N     effB  se_effB chi2.1df P1df    effAB effBB chi2.2df P2df
## rs587670191 2504 57.12083 1.405634  1651.37    0 57.12083    NA  1651.37    0

What happened with our rs587720402 marker, it did not pop up in the Manhattan plot. Apparently its effect got masked… Let us do a conditional analysis this time and see whether it does help:

top_gt <- as.double(data@gtdata[,'rs587670191'])
## Loading required package: GenABEL
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: GenABEL.data
## 
## Attaching package: 'GenABEL'
## The following object is masked from 'package:purrr':
## 
##     map
qt_cond <- GenABEL::qtscore(y~top_gt, data)
## Warning in GenABEL::qtscore(y ~ top_gt, data): Number of observations < 100,
## Lambda estimate is unreliable
plot(qt_cond@annotation$Position,-log10(qt_cond@results$P1df))
points(qt@annotation$Position,-log10(qt@results$P1df), pch=19, cex=.5)

best <- which.min(qt_cond@results$P1df)
qt_cond@results[best,]
##                N     effB   se_effB chi2.1df         P1df    effAB effBB
## rs587720402 2504 12.33029 0.3671112 1128.109 2.56775e-247 12.33029    NA
##             chi2.2df         P2df
## rs587720402 1128.109 2.56775e-247

Hurray! Our conditional analysis worked! On the plot above, you can see old, non-conditional results as black filled points while the results of the conditional analysis are in shown as hollow circles.