Variant calling and genotyping

Per Unneberg

Variant and genotype calling

Nielsen et al. (2011)

SNP calling
Identification of polymorphic sites (>1% allele frequency)
Variant calling
Identification of variant sites (sufficient that any allele differs); single nucleotide variant (SNV)
Genotype calling
Determine the allele combination for each individual (aa, aA, or AA for bi-allelic variants)

Knowing variant sites informs us of possible genotypes and improves genotyping.

Example: knowing a site has A or C limits possible genotype calls to AA, AC, or CC

Variants show up in pileup alignments

Sample PUN-Y-INJ
LG4:30430
LG4:30430
 30431     30441     30451     30461     30471              
CATTGGCAATGGCATCAGTTGAGCATCTTAGTACGAACTAAAAGCTGCGAAAAAATATTT
...............M............................................
...............A... ,,,,,,,,,,
...............A................. ,,,,,,,,,,
.............................................. ,,,,,,,,
..............................................
..............................................
...............A............................................
...............A............................................
,,,,,,,,,,,,a,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,a,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,a,,,,,aa,a,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Sample PUN-R-ELF
LG4:30430
LG4:30430
 30431     30441     30451     30461     30471              
CATTGGCAATGGCATCAGTTGAGCATCTTAGTACGAACTAAAAGCTGCGAAAAAATATTT
...............A............................................
,,,,,,, ..........................
...............A..... .....................
...............A...A.... .....................
...............A.........
...............A...........C.............
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...............A............................................
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...............A............................................
,,,g,,,,,,,a,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.............A............................................

Potential variants show up as multiple mismatches in a column. Two questions arise:

  • how do we detect variant sites?
  • how do we distinguish variants from sequencing error?

Simple approach: filter bases on quality (e.g., Q20), call heterozygous if 20-80% bases non-reference.

Issues: undercalls heterozygotes, no measure of uncertainty

Solution: probabilistic methods!

We can calculate likelihoods of observed data

Example (excluding sequencing error!)

   
TGC
.K.
...
,,,
.T.
.T.
...
,,,
,t,

Goal: calculate likelihood of observing X=G,g,T,T,G,g,t from a genotype G. Denote by X_i the observation at each position i of X. We further assume each observation X_i can be treated independently. We restrict possible genotypes to the observed alleles (i.e., G, T). Some observations:

Prob(X_1=G assuming genotype G=T,T) = P(X_1=G|T,T) = 0

Prob(X_1=G assuming genotype G,G) = P(X_1=G|G,G) = 1

Prob(X_1=G assuming genotype G,T) = P(X_1=G|G,T) = 0.5

To get total likelihood P(X|G) assuming a genotype G (here G,T), we can multiply over all observations (reads):

\begin{align} P(X|\mathsf{G,T}) & = P(X_1=\mathsf{G}|\mathsf{G,T})%% P(X_2=\mathsf{g}|\mathsf{G,T})%% P(X_3=\mathsf{T}|\mathsf{G,T})%% P(X_4=\mathsf{T}|\mathsf{G,T}) \\ & P(X_5=\mathsf{G}|\mathsf{G,T})%% P(X_6=\mathsf{g}|\mathsf{G,T})%% P(X_7=\mathsf{t}|\mathsf{G,T}) = 0.5^7 \end{align}

We can use Bayes’ theorem to genotype

Example

   
TGC
.K.
...
,,,
.T.
.T.
...
,,,
,t,

For a given site, we have a number of observations X. We have shown we can calculate the likelihood of observing X given a genotype G, P(X|G).

However; what we really want to know is the most likely genotype G given the data X, or P(G|X).

Apply Bayes’ theorem:

P(G|X) \sim P(X|G)\cdot P(G)

\text{posterior} \sim \text{likelihood} \cdot \text{prior}

Consequently we need to set a prior on G. If allele frequencies are known, we can constrain the frequencies; for example, if A is known to be low (\sim1%) AA genotype is very unlikely. Otherwise, could set all equal (flat prior).

Genotype likelihoods

We have outlined a probabilistic approach to variant calling where we obtain a posterior probability of observing a genotype G given data X:

P(G|X) \sim P(X|G)P(G)

Assuming a bi-allelic site, and letting H_1, H_2 denote the two alleles, we have three possible genotype likelihoods P(H_1H_1|X), P(H_1H_2|X), and P(H_2H_2|X).

The highest posterior probability is typically chosen as the genotype call, with a measure confidence represented by the genotype probability or ratio between the two most probable calls.

Genotype likelihoods are often represented as Phred-scaled likelihoods (again!):

\text{QUAL} = -10 \log_{10} P(G|X)

Variant Call Format (VCF) - header

bcftools view --header-only vcf/allsites.vcf.gz | head --lines 1
bcftools view --header-only vcf/allsites.vcf.gz | grep "##FILTER"
bcftools view -h vcf/allsites.vcf.gz | grep "##INFO" | head -n 4
bcftools view -h vcf/allsites.vcf.gz | grep "##FORMAT" | head -n 8
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=LowQual,Description="Low quality">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

FILTER defines applied filters , INFO fields provide additional information to genotypes, FORMAT specification fields define genotype entries, and more. NB: PL format definition.

Variant Call Format (VCF) - data

bcftools view --header-only --samples PUN-R-ELF,PUN-Y-INJ vcf/allsites.vcf.gz |\
 tail --lines 1
bcftools view --no-header  --samples PUN-R-ELF,PUN-Y-INJ vcf/allsites.vcf.gz LG4:6886
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  PUN-R-ELF   PUN-Y-INJ
LG4 6886    .   C   T   804.48  .   AC=2;AF=0.222;AN=4;BaseQRankSum=0;DP=82;ExcessHet=1.8123;FS=1.309;MLEAC=4;MLEAF=0.222;MQ=60;MQRankSum=0;QD=20.63;ReadPosRankSum=0.577;SOR=0.44  GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:3,8:11:89:.:.:293,0,89:.    0/1:2,2:4:35:.:.:35,0,35:.

QUAL: Phred-scaled quality score for Prob(ALT is wrong): 722.43 (p=10^{-Q/10}=5.7e-73)

INFO field summarizes data for all samples. For instance:

  • allele count 2 (AC=2)
  • allele frequency minor allele 0.222 (AF=0.222)

Variant Call Format (VCF) - data

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  PUN-R-ELF   PUN-Y-INJ
LG4 6886    .   C   T   804.48  .   AC=2;AF=0.222;AN=4;BaseQRankSum=0;DP=82;ExcessHet=1.8123;FS=1.309;MLEAC=4;MLEAF=0.222;MQ=60;MQRankSum=0;QD=20.63;ReadPosRankSum=0.577;SOR=0.44  GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:3,8:11:89:.:.:293,0,89:.    0/1:2,2:4:35:.:.:35,0,35:.
Genotypes (GT:AD:DP:GQ:PGT:PID:PL:PS)

PUN-R-ELF: 0/1:3,8:11:50:.:.:189,0,50:.

GT=0/1, AD=3,8 => 3 REF, 8 ALT, DP=11 => sequence depth = 11, PL=189,0,50

PUN-Y-INJ: 0/1:2,2:4:45:.:.:45,0,45:.

GT=0/1, AD=2,2 => 2 REF, 2 ALT, DP=4 => sequence depth = 4, PL=45,0,45

Relative genotype probabilities

Can convert Phred-scaled quality scores to probabilities as

p = 10^{-Q/10}

For PUN-R-ELF the relative probabilities are 10^{-189/10}\approx1.26e-9, 10^{0}=1, 10^{50}=10^{-5}.

Interpretation: 0/1 10,000 times more likely than 1/1 (1/10^{-5})

ELF
ELF
   
TCT
.Y.
.T.
...
.T.
,,,
.T.
,,,
.T.
...
...
...
,t,
,t,
,t,
,t,
INJ
INJ
   
TCT
.Y.
.T.
...
,,,
,,,
,,,
.T.

GATK best practice

Pros
  • Best practices
  • Large documentation
  • Variant quality score recalibration
Cons
  • Human-centric - very slow runtime on genomes with many sequences
  • Complicated setup

Alternative variant callers

freebayes

Bayesian genetic variant detector. Simpler setup.

May struggle in high-coverage regions.

bcftools

Utilities for variant calling and manipulating VCFs and BCFs.

ANGSD

For low-coverage sequencing. Doesn’t do explicit genotyping; most methods take genotype uncertainty into account.

Bibliography

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008
DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell, T. J., Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler, D., & Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. https://doi.org/10.1038/ng.806
Garrison, E., & Marth, G. (2012). Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907 [q-Bio]. http://arxiv.org/abs/1207.3907
Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15(1), 356. https://doi.org/10.1186/s12859-014-0356-4
Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, 30(20), 2843–2851. https://doi.org/10.1093/bioinformatics/btu356
Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., & Wang, J. (2009). SNP detection for massively parallel whole-genome resequencing. Genome Research, 19(6), 1124–1132. https://doi.org/10.1101/gr.088013.108
Maruki, T., & Lynch, M. (2017). Genotype Calling from Population-Genomic Sequencing Data. G3 Genes|Genomes|Genetics, 7(5), 1393–1404. https://doi.org/10.1534/g3.117.039008
Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443–451. https://doi.org/10.1038/nrg2986