Variant calling and genotyping

Per Unneberg

Variant and genotype calling

Nielsen et al. (2011)

SNP calling: Identification of polymorphic sites (>1% allele frequency)
Variant calling: Identification of variant sites (sufficient that any allele differs); single nucleotide variant (SNV)
Genotype calling: Determine the allele combination for each individual (aa, aA, or AA for bi-allelic variants)

Knowing variant sites informs us of possible genotypes and improves genotyping.

Example: knowing a site has A or C limits possible genotype calls to AA, AC, or CC

Figure caption from (Nielsen et al., 2011):

Pre-processing steps (shown in yellow) transform the raw data from next-generation sequencing technology into a set of aligned reads that have a measure of confidence, or quality score, associated with the bases of each read. The per-base quality scores produced by base-calling algorithms may need to be recalibrated to accurately reflect the true error rates. Depending on the number of samples and the depth of coverage, either a multi-sample calling procedure (green) or a single-sample calling procedure (orange) may then be applied to obtain SNP or genotype calls and associated quality scores. Note that the multi-sample procedure may include a linkage-based analysis, which can substantially improve the accuracy of SNP or genotype calls. Finally, post-processing (purple) uses both known data and simple heuristics to filter the set of SNPs and/or improve the associated quality scores. Optional, although recommended, steps are shown in dashed lines.

Variants show up in pileup alignments

Sample PUN-Y-INJ

LG4:30430

 30431     30441     30451     30461     30471              
CATTGGCAATGGCATCAGTTGAGCATCTTAGTACGAACTAAAAGCTGCGAAAAAATATTT
...............M............................................
...............A...                               ,,,,,,,,,,
...............A.................                 ,,,,,,,,,,
..............................................      ,,,,,,,,
..............................................              
..............................................              
...............A............................................
...............A............................................
,,,,,,,,,,,,a,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,a,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,a,,,,,aa,a,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
  ,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Sample PUN-R-ELF

LG4:30430

 30431     30441     30451     30461     30471              
CATTGGCAATGGCATCAGTTGAGCATCTTAGTACGAACTAAAAGCTGCGAAAAAATATTT
...............A............................................
,,,,,,,                           ..........................
...............A.....                  .....................
...............A...A....               .....................
...............A.........                                   
...............A...........C.............                   
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...............A............................................
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...............A............................................
,,,g,,,,,,,a,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
  .............A............................................

Potential variants show up as multiple mismatches in a column. Two questions arise:

how do we detect variant sites?
how do we distinguish variants from sequencing error?

Simple approach: filter bases on quality (e.g., Q20), call heterozygous if 20-80% bases non-reference.

Issues: undercalls heterozygotes, no measure of uncertainty

Solution: probabilistic methods!

(Nielsen et al., 2011)

We can calculate likelihoods of observed data

Example (excluding sequencing error!)

   
TGC
.K.
...
,,,
.T.
.T.
...
,,,
,t,

Goal: calculate likelihood of observing X=G,g,T,T,G,g,t from a genotype G. Denote by X_i the observation at each position i of X. We further assume each observation X_i can be treated independently. We restrict possible genotypes to the observed alleles (i.e., G, T). Some observations:

Prob(X_1=G assuming genotype G=T,T) = P(X_1=G|T,T) = 0

Prob(X_1=G assuming genotype G,G) = P(X_1=G|G,G) = 1

Prob(X_1=G assuming genotype G,T) = P(X_1=G|G,T) = 0.5

To get total likelihood P(X|G) assuming a genotype G (here G,T), we can multiply over all observations (reads):

\begin{align} P(X|\mathsf{G,T}) & = P(X_1=\mathsf{G}|\mathsf{G,T})%% P(X_2=\mathsf{g}|\mathsf{G,T})%% P(X_3=\mathsf{T}|\mathsf{G,T})%% P(X_4=\mathsf{T}|\mathsf{G,T}) \\ & P(X_5=\mathsf{G}|\mathsf{G,T})%% P(X_6=\mathsf{g}|\mathsf{G,T})%% P(X_7=\mathsf{t}|\mathsf{G,T}) = 0.5^7 \end{align}

We can use Bayes’ theorem to genotype

Example

   
TGC
.K.
...
,,,
.T.
.T.
...
,,,
,t,

For a given site, we have a number of observations X. We have shown we can calculate the likelihood of observing X given a genotype G, P(X|G).

However; what we really want to know is the most likely genotype G given the data X, or P(G|X).

Apply Bayes’ theorem:

P(G|X) \sim P(X|G)\cdot P(G)

\text{posterior} \sim \text{likelihood} \cdot \text{prior}

Consequently we need to set a prior on G. If allele frequencies are known, we can constrain the frequencies; for example, if A is known to be low (\sim1%) AA genotype is very unlikely. Otherwise, could set all equal (flat prior).

cf https://gatk.broadinstitute.org/hc/en-us/articles/360035890511

R. Li et al. (2009), Table 1, shows a nice numerical example of one way of setting priors. The authors assume a specific allele, G, in the reference sequence:

–G–

They start by calculating the frequency of haploid genotypes. They first determine p_G by assuming a heterozygous SNP rate f=0.001, which means 1 in a 1000 sites has G/G genotype mutated to G/X, where X is one of {A,C,T}. They assume a transition to transversion (ts/tv) ratio of 4, meaning X=A four times as often as C or T (there is an error in the text where C is taken to be the transition; the numbers in the table are correct however). This gives the following haploid genotype frequencies:

\begin{align} p_G & = 1-f = 0.999 \\ p_A & = 4f/6 = 6.67\times10^{-4} \\ p_C & = f/6 = 1.67\times10^{-4}\\ p_T & = f/6 =1.67\times10^{-4} \end{align}

To get the diploid genotypes, we simply multiply the corresponding entries, e.g., p_{AC} = p_Ap_C. For homozygote ALT, we need to account for the homozygous SNP rate r = 0.0005, where G/G mutates to X/X, for X one of {A,C,T}:

\begin{align} p_{AA} & = p_Ap_A + 4r/6 = 3.33\times10^{-4} \\ p_{AC} & = p_Ap_C = 1.11\times10^{-7}\\ p_{AT} & = p_Ap_T = 1.11\times10^{-7}\\ p_{CC} & = p_Cp_C + r/6 = 8.34\times10^{-5} \\ p_{CG} & = p_Cp_G = 1.67\times10^{-4}\\ p_{CT} & = p_Cp_T = 2.78\times10^{-8}\\ p_{GG} & = 1 - f - r = 0.9985 \\ p_{GT} & = p_Gp_T = 1.67\times10^{-4}\\ p_{TT} & = p_Tp_T + r = 8.34\times10^{-5}\\ \end{align}

Genotype likelihoods

We have outlined a probabilistic approach to variant calling where we obtain a posterior probability of observing a genotype G given data X:

P(G|X) \sim P(X|G)P(G)

Assuming a bi-allelic site, and letting H_1, H_2 denote the two alleles, we have three possible genotype likelihoods P(H_1H_1|X), P(H_1H_2|X), and P(H_2H_2|X).

The highest posterior probability is typically chosen as the genotype call, with a measure confidence represented by the genotype probability or ratio between the two most probable calls.

Genotype likelihoods are often represented as Phred-scaled likelihoods (again!):

\text{QUAL} = -10 \log_{10} P(G|X)

Variant Call Format (VCF) - header

bcftools view --header-only vcf/allsites.vcf.gz | head --lines 1
bcftools view --header-only vcf/allsites.vcf.gz | grep "##FILTER"
bcftools view -h vcf/allsites.vcf.gz | grep "##INFO" | head -n 4
bcftools view -h vcf/allsites.vcf.gz | grep "##FORMAT" | head -n 8

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=LowQual,Description="Low quality">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

FILTER defines applied filters , INFO fields provide additional information to genotypes, FORMAT specification fields define genotype entries, and more. NB: PL format definition.

https://samtools.github.io/hts-specs/VCFv4.4.pdf

Variant Call Format (VCF) - data

bcftools view --header-only --samples PUN-R-ELF,PUN-Y-INJ vcf/allsites.vcf.gz |\
 tail --lines 1
bcftools view --no-header  --samples PUN-R-ELF,PUN-Y-INJ vcf/allsites.vcf.gz LG4:6886

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  PUN-R-ELF   PUN-Y-INJ
LG4 6886    .   C   T   804.48  .   AC=2;AF=0.222;AN=4;BaseQRankSum=0;DP=82;ExcessHet=1.8123;FS=1.309;MLEAC=4;MLEAF=0.222;MQ=60;MQRankSum=0;QD=20.63;ReadPosRankSum=0.577;SOR=0.44  GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:3,8:11:89:.:.:293,0,89:.    0/1:2,2:4:35:.:.:35,0,35:.

QUAL: Phred-scaled quality score for Prob(ALT is wrong): 722.43 (p=10^{-Q/10}=5.7e-73)

INFO field summarizes data for all samples. For instance:

allele count 2 (AC=2)
allele frequency minor allele 0.222 (AF=0.222)

https://samtools.github.io/hts-specs/VCFv4.4.pdf

Variant Call Format (VCF) - data

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  PUN-R-ELF   PUN-Y-INJ
LG4 6886    .   C   T   804.48  .   AC=2;AF=0.222;AN=4;BaseQRankSum=0;DP=82;ExcessHet=1.8123;FS=1.309;MLEAC=4;MLEAF=0.222;MQ=60;MQRankSum=0;QD=20.63;ReadPosRankSum=0.577;SOR=0.44  GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:3,8:11:89:.:.:293,0,89:.    0/1:2,2:4:35:.:.:35,0,35:.

Genotypes (`GT:AD:DP:GQ:PGT:PID:PL:PS`)

PUN-R-ELF: 0/1:3,8:11:50:.:.:189,0,50:.

GT=0/1, AD=3,8 => 3 REF, 8 ALT, DP=11 => sequence depth = 11, PL=189,0,50

PUN-Y-INJ: 0/1:2,2:4:45:.:.:45,0,45:.

GT=0/1, AD=2,2 => 2 REF, 2 ALT, DP=4 => sequence depth = 4, PL=45,0,45

Relative genotype probabilities

Can convert Phred-scaled quality scores to probabilities as

p = 10^{-Q/10}

For PUN-R-ELF the relative probabilities are 10^{-189/10}\approx1.26e-9, 10^{0}=1, 10^{50}=10^{-5}.

Interpretation: 0/1 10,000 times more likely than 1/1 (1/10^{-5})

ELF

   
TCT
.Y.
.T.
...
.T.
,,,
.T.
,,,
.T.
...
...
...
,t,
,t,
,t,
,t,

INJ

   
TCT
.Y.
.T.
...
,,,
,,,
,,,
.T.

GATK best practice

https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-

Pros

Best practices
Large documentation
Variant quality score recalibration

Cons

Human-centric - very slow runtime on genomes with many sequences
Complicated setup

Alternative variant callers

freebayes

Bayesian genetic variant detector. Simpler setup.

May struggle in high-coverage regions.

(Garrison & Marth, 2012)

bcftools

Utilities for variant calling and manipulating VCFs and BCFs.

(Danecek et al., 2021)

ANGSD

For low-coverage sequencing. Doesn’t do explicit genotyping; most methods take genotype uncertainty into account.

(Korneliussen et al., 2014)

Bibliography

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell, T. J., Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler, D., & Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. https://doi.org/10.1038/ng.806

Garrison, E., & Marth, G. (2012). Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907 [q-Bio]. http://arxiv.org/abs/1207.3907

Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15(1), 356. https://doi.org/10.1186/s12859-014-0356-4

Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, 30(20), 2843–2851. https://doi.org/10.1093/bioinformatics/btu356

Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., & Wang, J. (2009). SNP detection for massively parallel whole-genome resequencing. Genome Research, 19(6), 1124–1132. https://doi.org/10.1101/gr.088013.108

Maruki, T., & Lynch, M. (2017). Genotype Calling from Population-Genomic Sequencing Data. G3 Genes|Genomes|Genetics, 7(5), 1393–1404. https://doi.org/10.1534/g3.117.039008

Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443–451. https://doi.org/10.1038/nrg2986

Variant calling and genotyping

Variant and genotype calling

Variants show up in pileup alignments

Sample PUN-Y-INJ

Sample PUN-R-ELF

We can calculate likelihoods of observed data

Example (excluding sequencing error!)

We can use Bayes’ theorem to genotype

Example

Genotype likelihoods

Variant Call Format (VCF) - header

Variant Call Format (VCF) - data

Variant Call Format (VCF) - data

Genotypes (GT:AD:DP:GQ:PGT:PID:PL:PS)

Relative genotype probabilities

GATK best practice

Pros

Cons

Alternative variant callers

freebayes

bcftools

ANGSD

Bibliography

Genotypes (`GT:AD:DP:GQ:PGT:PID:PL:PS`)